The code for this post is located here. To run the code, you will need a CUDA-enabled GPU and an installation of the CUDA toolkit.
Python is easy. GPUs are fast. Combining Python and CUDA is a fantastic way to leverage the performance and parallelism of GPUs with all of the syntactic niceties of Python. However, there is a potential snag. If a package is written purely in Python, C, or C++, distribution is easily accomplished using setuptools, and users can install your package and its dependencies either with pip install or through the setup.py script. Unfortunately, setuptools is not currently compatible with NVIDIA’s compiler, nvcc, which is necessary for compiling CUDA code. So how can we make a Python package that involves CUDA? One solution is to compile the CUDA code into a shared library which the Python package then links against just like any other C/C++ library. In this tutorial I will walk through all of the steps necessary to write a C++/CUDA library and then create a Python extension package that can be uploaded to PyPI and subsequently installed by users with pip. I’m a big fan of CMake, and will use it to build the shared library.
The functions implemented in this example are kept simple to illustrate the methodology, but if you are interested in a “real-world” example, I used this technique when developing PyPrismatic, a GPU-accelerated package for image simulation in electron microscopy, as part of my Ph.D. Full details can be found at www.prism-em.com, which includes a walkthrough of the source code (you can also read a preprint version of the associated academic paper here).
The CUDA/C++ Code
Our library is going to be a simple calculator that can add, subtract, multiply, and divide two integers. Be aware that this is actually a bad application of GPUs because nothing happens in parallel, but it is intended as a simple example for learning purposes. If you were instead adding two arrays of integers element-wise, with one thread per element, GPUs would be great.
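To make that contrast concrete, here is a minimal sketch of what a data-parallel kernel might look like (a hypothetical vectorAdd_d, not part of the library we build below); each launched thread handles exactly one element of the arrays.
// Hypothetical sketch of a data-parallel kernel (not part of this tutorial's library):
// each launched thread adds a single pair of elements.
__global__ void vectorAdd_d(const int *a_d, const int *b_d, int *c_d, int N){
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    if (idx < N){
        c_d[idx] = a_d[idx] + b_d[idx];
    }
}
// launched with enough 256-thread blocks to cover all N elements, e.g.
// vectorAdd_d<<<(N + 255) / 256, 256>>>(a_d, b_d, c_d, N);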
What we will do is expose some C++ functions that our Python package will be able to invoke, and the implementation of those functions will invoke CUDA code. The key is that Python never directly sees CUDA. It sees a C++ function signature (which it knows how to work with), but the actual implementation of that function is delayed until the linking process, and by that point the CUDA code will have been compiled.
First, let’s prototype our C++/CUDA functions:
// "myCuda.cuh"
#ifndef MYCUDA_H
#define MYCUDA_H
namespace pipcudemo{
int add(int a, int b);
int subtract(int a, int b);
int multiply(int a, int b);
int divide(int a, int b);
}
#endif //MYCUDA_H
Super simple. I wrapped it in a namespace pipcudemo
to prevent any name conflicts with the standard library and because I think it is clearer.
Error checking
CUDA API calls return error codes, and it is the responsibility of the programmer to monitor them. A very helpful macro is discussed in this StackOverflow thread. You just wrap CUDA API calls in this macro and if a failure occurs at runtime it will gracefully exit and print a useful error message.
// define a helper function for checking CUDA errors.
#define cudaErrchk(ans) { GPUAssert((ans), __FILE__, __LINE__); }
inline void GPUAssert(cudaError_t code, const char *file, int line, bool abort=true){
    if (code != cudaSuccess){
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}
Next, let’s focus on implementing one of our functions, add. In the __host__ function, we must allocate memory on the GPU, copy our inputs over, launch a CUDA kernel to perform the calculation, copy the result back, and finally free the device memory. Our function takes in two arguments, and we must also allocate a third value on the device for the result. The kernel, which is indicated by the __global__ keyword, is trivial: we just make thread 0 perform the arithmetic. In a proper GPU application, you would normally specify many thousands or more threads in the launch configuration to do work in parallel, but here we just create one thread.
// within myCuda.cu
__global__ void add_d(int *a_d, int *b_d, int *c_d){
    if (threadIdx.x == 0){
        *c_d = *a_d + *b_d;
    }
}
__host__ int add(int a, int b){
    int *a_d, *b_d, *c_d, result;
    // allocate memory on device
    cudaErrchk(cudaMalloc(&a_d, sizeof(int)));
    cudaErrchk(cudaMalloc(&b_d, sizeof(int)));
    cudaErrchk(cudaMalloc(&c_d, sizeof(int)));
    // copy memory to device
    cudaErrchk(cudaMemcpy(a_d, &a, sizeof(int), cudaMemcpyHostToDevice));
    cudaErrchk(cudaMemcpy(b_d, &b, sizeof(int), cudaMemcpyHostToDevice));
    // do the calculation
    add_d<<<1,1>>>(a_d, b_d, c_d);
    // copy result back
    cudaErrchk(cudaMemcpy(&result, c_d, sizeof(int), cudaMemcpyDeviceToHost));
    // free device memory now that the result is on the host
    cudaErrchk(cudaFree(a_d));
    cudaErrchk(cudaFree(b_d));
    cudaErrchk(cudaFree(c_d));
    return result;
}
The “_d” suffix is commonly used to indicate names of objects associated with the device. The rest of the implementations for our calculator are virtually identical, and here is the full implementation.
//myCuda.cu
#include "myCuda.cuh"
#include <cstdio>
// define a helper function for checking CUDA errors. See this thread: https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
#define cudaErrchk(ans) { GPUAssert((ans), __FILE__, __LINE__); }
inline void GPUAssert(cudaError_t code, const char *file, int line, bool abort=true){
    if (code != cudaSuccess){
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

namespace pipcudemo{
    // implement the CUDA kernels
    __global__ void add_d(int *a_d, int *b_d, int *c_d){
        if (threadIdx.x == 0){
            *c_d = *a_d + *b_d;
        }
    }
    __global__ void subtract_d(int *a_d, int *b_d, int *c_d){
        if (threadIdx.x == 0){
            *c_d = *a_d - *b_d;
        }
    }
    __global__ void multiply_d(int *a_d, int *b_d, int *c_d){
        if (threadIdx.x == 0){
            *c_d = *a_d * *b_d;
        }
    }
    __global__ void divide_d(int *a_d, int *b_d, int *c_d){
        if (threadIdx.x == 0){
            *c_d = *a_d / *b_d;
        }
    }

    // implement the wrappers that copy memory and invoke the kernels
    __host__ int add(int a, int b){
        int *a_d, *b_d, *c_d, result;
        // allocate memory on device
        cudaErrchk(cudaMalloc(&a_d, sizeof(int)));
        cudaErrchk(cudaMalloc(&b_d, sizeof(int)));
        cudaErrchk(cudaMalloc(&c_d, sizeof(int)));
        // copy memory to device
        cudaErrchk(cudaMemcpy(a_d, &a, sizeof(int), cudaMemcpyHostToDevice));
        cudaErrchk(cudaMemcpy(b_d, &b, sizeof(int), cudaMemcpyHostToDevice));
        // do the calculation
        add_d<<<1,1>>>(a_d, b_d, c_d);
        // copy result back
        cudaErrchk(cudaMemcpy(&result, c_d, sizeof(int), cudaMemcpyDeviceToHost));
        // free device memory
        cudaErrchk(cudaFree(a_d));
        cudaErrchk(cudaFree(b_d));
        cudaErrchk(cudaFree(c_d));
        return result;
    }
    __host__ int subtract(int a, int b){
        int *a_d, *b_d, *c_d, result;
        // allocate memory on device
        cudaErrchk(cudaMalloc(&a_d, sizeof(int)));
        cudaErrchk(cudaMalloc(&b_d, sizeof(int)));
        cudaErrchk(cudaMalloc(&c_d, sizeof(int)));
        // copy memory to device
        cudaErrchk(cudaMemcpy(a_d, &a, sizeof(int), cudaMemcpyHostToDevice));
        cudaErrchk(cudaMemcpy(b_d, &b, sizeof(int), cudaMemcpyHostToDevice));
        // do the calculation
        subtract_d<<<1,1>>>(a_d, b_d, c_d);
        // copy result back
        cudaErrchk(cudaMemcpy(&result, c_d, sizeof(int), cudaMemcpyDeviceToHost));
        // free device memory
        cudaErrchk(cudaFree(a_d));
        cudaErrchk(cudaFree(b_d));
        cudaErrchk(cudaFree(c_d));
        return result;
    }
    __host__ int multiply(int a, int b){
        int *a_d, *b_d, *c_d, result;
        // allocate memory on device
        cudaErrchk(cudaMalloc(&a_d, sizeof(int)));
        cudaErrchk(cudaMalloc(&b_d, sizeof(int)));
        cudaErrchk(cudaMalloc(&c_d, sizeof(int)));
        // copy memory to device
        cudaErrchk(cudaMemcpy(a_d, &a, sizeof(int), cudaMemcpyHostToDevice));
        cudaErrchk(cudaMemcpy(b_d, &b, sizeof(int), cudaMemcpyHostToDevice));
        // do the calculation
        multiply_d<<<1,1>>>(a_d, b_d, c_d);
        // copy result back
        cudaErrchk(cudaMemcpy(&result, c_d, sizeof(int), cudaMemcpyDeviceToHost));
        // free device memory
        cudaErrchk(cudaFree(a_d));
        cudaErrchk(cudaFree(b_d));
        cudaErrchk(cudaFree(c_d));
        return result;
    }
    __host__ int divide(int a, int b){
        int *a_d, *b_d, *c_d, result;
        // allocate memory on device
        cudaErrchk(cudaMalloc(&a_d, sizeof(int)));
        cudaErrchk(cudaMalloc(&b_d, sizeof(int)));
        cudaErrchk(cudaMalloc(&c_d, sizeof(int)));
        // copy memory to device
        cudaErrchk(cudaMemcpy(a_d, &a, sizeof(int), cudaMemcpyHostToDevice));
        cudaErrchk(cudaMemcpy(b_d, &b, sizeof(int), cudaMemcpyHostToDevice));
        // do the calculation
        divide_d<<<1,1>>>(a_d, b_d, c_d);
        // copy result back
        cudaErrchk(cudaMemcpy(&result, c_d, sizeof(int), cudaMemcpyDeviceToHost));
        // free device memory
        cudaErrchk(cudaFree(a_d));
        cudaErrchk(cudaFree(b_d));
        cudaErrchk(cudaFree(c_d));
        return result;
    }
}
Creating the shared library with CMake
We will next create a shared library called mylib that exposes these four functions: add, subtract, multiply, and divide. Good practice is to have a single header file for the library. In our case it is just a single #include statement:
// mylib.h
#ifndef MYLIB_H
#define MYLIB_H
#include "myCuda.cuh"
#endif //MYLIB_H
For the CMake infrastructure, we make use of FindCUDA to create a minimalistic CMakeLists.txt:
# CMakeLists.txt
cmake_minimum_required(VERSION 3.0)
project(pipcudemo)
find_package(CUDA REQUIRED)
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -arch=sm_30)
cuda_add_library(mylib SHARED myCuda.cu)
The line find_package(CUDA REQUIRED) attempts to locate a CUDA installation on the user’s system and populates several variables containing include/library paths. I add -arch=sm_30 to CUDA_NVCC_FLAGS to avoid a current warning about future deprecation of older architectures. Last, we create the shared library with the call to cuda_add_library.
At this point, it would be smart to make a test program in C++ to make sure everything works. Here is a small driver program that will test our calculator functionality with a few assert
statements.
// testDriver.cpp
#include <cassert>
#include <iostream>
#include "mylib.h"
int main(){
    assert(pipcudemo::add(3, 3) == 6);
    std::cout << "3 + 3 = 6" << std::endl;
    assert(pipcudemo::subtract(3, 3) == 0);
    std::cout << "3 - 3 = 0" << std::endl;
    assert(pipcudemo::multiply(3, 3) == 9);
    std::cout << "3 * 3 = 9" << std::endl;
    assert(pipcudemo::divide(3, 3) == 1);
    std::cout << "3 / 3 = 1" << std::endl;
    std::cout << "Test passed." << std::endl;
    return 0;
}
We can also compile and link this executable by adding two more lines to our CMakeLists.txt
# CMakeLists.txt
cmake_minimum_required(VERSION 3.0)
project(pipcudemo)
find_package(CUDA REQUIRED)
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -arch=sm_30)
cuda_add_library(mylib SHARED myCuda.cu)
cuda_add_executable(test_mylib testDriver.cpp)
target_link_libraries(test_mylib mylib)
The call to cuda_add_executable
will create an executable called test_mylib
using the driver program we just wrote. We must remember to link with our shared library using target_link_libraries
, otherwise the driver won’t have access to the calculator implementations.
A little more detail on what is going on during compilation might be helpful. The #include statements only pull in the function prototypes, which tells our program “these are functions that I promise will be implemented”. The linker then demands to find implementations of the function prototypes it was promised; otherwise, a linker error is thrown. These implementations can either be found within source files that are compiled into the executable or can exist in libraries against which the executable is linked. If the library is linked statically, then the function definitions are “hard-coded” into the executable just as if you had provided them as source files. Linking instead against a shared library (called “dynamic” linking) tells our program “find me at runtime, and I promise I will have these function definitions ready for you”, and that is enough to satisfy the linker. Static linking results in fewer files, but the executables are larger and relinking of the entire program is required if any part of it changes. Dynamic linking is more modular, but requires that the shared libraries be found at runtime. There are advantages and disadvantages to both. I’ve always run into issues when attempting to statically link CUDA libraries, so I chose the dynamic approach here.
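To make the dynamic case concrete, here is roughly what the link step CMake generates would look like if you ran it by hand on Linux with g++ (just a sketch; exact flags vary by platform, and it assumes libmylib.so was built into the current directory):
# hypothetical manual equivalent of the CMake-generated link step
g++ testDriver.cpp -I. -L. -lmylib -o test_mylib
# -lmylib records a dependency on libmylib.so; the dynamic loader must still be able to locate it at runtime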
To build everything, we follow normal CMake practice and make a build directory.
mkdir build
cd build
Then invoke cmake
cmake ../
make
If your CUDA installation is in a standard place like /usr/local/ on *nix systems then this may be all that is necessary. Otherwise, you may need to manually specify the location of CUDA and reconfigure.
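For example, FindCUDA lets you point CMake at a non-standard toolkit location explicitly (the path below is just a placeholder):
cmake ../ -DCUDA_TOOLKIT_ROOT_DIR=/path/to/cuda
make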
If everything went smoothly, you can run the test program with
./test_mylib
and you should see some text printed, including “Test passed.” Shared libraries have different names depending on your operating system, but will be something like “libmylib.so”, “libmylib.dylib”, “mylib.dll”, etc. The executable needs to find the shared library, and thus it may be necessary to add the library to the search path. The variable containing paths to shared libraries is LD_LIBRARY_PATH
on Linux, DYLD_LIBRARY_PATH
on Mac, and PATH
on Windows.
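For example, on Linux, if you are still inside the build directory where libmylib.so was just created, something like this should let the test program locate it:
# add the current (build) directory to the shared-library search path, then run the test
export LD_LIBRARY_PATH=$(pwd):$LD_LIBRARY_PATH
./test_mylib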
Now that we have a working CUDA library, we can move on to creating a Python wrapper for it.
Creating the Python package
There are a few necessary ingredients for making a Python C++ extension:
- C/C++ function(s) that parse arguments passed from Python, invoke the desired C/C++ code, and construct Python return values.
- A PyInit_modulename function that initializes the module named "modulename"
- A method definition table that maps the desired name of the function in Python to the actual C/C++ methods and defines the calling signature
- A module definition structure that connects the method definition table to the module initialization function
and a few ingredients for making a Python package:
- A setup.py script
- A folder of the same name as the package
- An __init__.py file within the package folder
- A manifest file that indicates what files to ship with the package
The way I like to organize such a project is to have the package, in this case called pipcudemo, contain a module called core, which is the C++ extension module. Then the __init__.py file just contains a single line, from pipcudemo.core import *, which will allow us to ultimately call our calculator functions like so:
import pipcudemo as pcd
pcd.add(1,2)
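For completeness, the __init__.py file itself is just that single import:
# pipcudemo/__init__.py
from pipcudemo.core import *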
The C++ extension
Although the above may sound like a lot of steps, it’s actually quite simple using some Python API calls defined in Python.h:
// pipcudemo/core.cpp
#include "Python.h"
#include "mylib.h"

static PyObject* pipcudemo_core_add(PyObject *self, PyObject *args){
    int a, b;
    if (!PyArg_ParseTuple(args, "ii", &a, &b)) return NULL;
    return Py_BuildValue("i", pipcudemo::add(a, b));
}

static PyObject* pipcudemo_core_subtract(PyObject *self, PyObject *args){
    int a, b;
    if (!PyArg_ParseTuple(args, "ii", &a, &b)) return NULL;
    return Py_BuildValue("i", pipcudemo::subtract(a, b));
}

static PyObject* pipcudemo_core_multiply(PyObject *self, PyObject *args){
    int a, b;
    if (!PyArg_ParseTuple(args, "ii", &a, &b)) return NULL;
    return Py_BuildValue("i", pipcudemo::multiply(a, b));
}

static PyObject* pipcudemo_core_divide(PyObject *self, PyObject *args){
    int a, b;
    if (!PyArg_ParseTuple(args, "ii", &a, &b)) return NULL;
    return Py_BuildValue("i", pipcudemo::divide(a, b));
}

static PyMethodDef pipcudemo_core_methods[] = {
    {"add",      (PyCFunction)pipcudemo_core_add,      METH_VARARGS, "Add two integers"},
    {"subtract", (PyCFunction)pipcudemo_core_subtract, METH_VARARGS, "Subtract two integers"},
    {"multiply", (PyCFunction)pipcudemo_core_multiply, METH_VARARGS, "Multiply two integers"},
    {"divide",   (PyCFunction)pipcudemo_core_divide,   METH_VARARGS, "Divide two integers"},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef module_def = {
    PyModuleDef_HEAD_INIT,
    "pipcudemo.core",
    "An example project showing how to build a pip-installable Python package that invokes custom CUDA/C++ code.",
    -1,
    pipcudemo_core_methods
};

PyMODINIT_FUNC PyInit_core(){
    return PyModule_Create(&module_def);
}
All our functions that are called from Python need to have the signature static PyObject* function_name(PyObject *self, PyObject *args). This function signature is given its own name in Python, PyCFunction. Each of these functions uses PyArg_ParseTuple to copy the two integer arguments into the variables a and b. The helper function Py_BuildValue then returns an integer whose value is determined by invoking one of the arithmetic functions from our CUDA library. For organizational purposes, I prefixed the names of all of these functions with pipcudemo_core_, but you could name them anything. The PyMethodDef table contains one entry per function plus a null entry for termination. Each entry contains a string to be used as the function name within Python, the function to call, a flag describing the calling convention within Python (see METH_VARARGS), and the docstring, which is visible when using Python’s help(). The PyModuleDef object is then constructed. This always starts with PyModuleDef_HEAD_INIT, then the string name of the module, the docstring, -1 (which indicates the module keeps its state in global variables; don’t worry about it), and lastly the method table. Using this PyModuleDef object, the module initialization method is just one line and returns a value created with PyModule_Create.
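As a quick aside, the names and docstrings in the method table are exactly what a user will eventually see from Python’s built-in help(); for example, once the package is installed:
>>> import pipcudemo as pcd
>>> help(pcd.add)   # shows the "Add two integers" docstring defined in the method table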
And that’s all the glue we need to connect Python with our CUDA/C++ code. Now to package this up for distribution.
The setup.py script
The special setup.py
script is used by setuptools
to distribute Python packages and their dependencies. Here’s a simple one.
# setup.py
from setuptools import setup, Extension

pipcudemo_core = Extension('pipcudemo.core',
                           sources=['pipcudemo/core.cpp'],
                           include_dirs=['.'],
                           libraries=['mylib'])

setup(name='pipcudemo',
      author='Alan (AJ) Pryor, Jr.',
      author_email='apryor6@gmail.com',
      version='1.0.0',
      description="An example project showing how to build a pip-installable Python package that invokes custom CUDA/C++ code.",
      ext_modules=[pipcudemo_core],
      packages=['pipcudemo'])
We define our extension module by including the source file and, critically, indicating that it should be linked against mylib
. Then the package is assembled with the setup
method. There are many other arguments to these functions that can be explored with Python’s help()
or, better yet, through Google searches.
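As an aside, if mylib is not somewhere your linker searches by default, an alternative to the environment-variable approach I describe below is to point the Extension at it directly. A hypothetical variant, assuming libmylib lives in a lib/ folder next to setup.py:
# hypothetical variant of the Extension definition with explicit library paths
pipcudemo_core = Extension('pipcudemo.core',
                           sources=['pipcudemo/core.cpp'],
                           include_dirs=['.'],
                           library_dirs=['lib'],          # where libmylib can be found at link time
                           runtime_library_dirs=['lib'],  # search path baked in for the loader (ignored on Windows)
                           libraries=['mylib'])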
The MANIFEST.in
MANIFEST.in indicates what extra files need to be included in the package that ships to PyPI. We might have files in our local working copy that we don’t want to distribute. The include keyword is used in the manifest file like so:
# MANIFEST.in
include myCuda.cu
include myCuda.cuh
include mylib.h
include testDriver.cpp
include pipcudemo/core.cpp
Registering the package on PyPI
To upload the package so that it may be pip installed, you need to create an account on PyPI. For ease of use, you can create a ~/.pypirc file on your local machine to automatically authenticate your uploads (process described here). The upload command is then performed in one line:
python3 setup.py sdist upload -r pypi
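For reference, a minimal ~/.pypirc of the kind mentioned above looks something like this (placeholder values; see the linked guide for the full details):
# ~/.pypirc
[distutils]
index-servers =
    pypi

[pypi]
username: your_username
password: your_password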
Installing the package
We can then discover the package
pip search pipcudemo
pipcudemo (1.0.3) - An example project showing how to build a pip-installable Python package that invokes custom CUDA/C++ code.
And it can be installed with pip install, provided that mylib has been built and can be found. The way I prefer to make the library locatable is to modify environment variables. On Linux that is accomplished using LD_LIBRARY_PATH and LIBRARY_PATH as follows:
export LD_LIBRARY_PATH=/path/to/mylib:$LD_LIBRARY_PATH
export LIBRARY_PATH=/path/to/mylib:$LIBRARY_PATH
LIBRARY_PATH is used for finding libraries during the linking phase at compilation, and LD_LIBRARY_PATH is used for finding shared libraries when a dynamically-linked executable is launched (the “LD” refers to the dynamic linker/loader). On Mac it is almost the same:
export DYLD_LIBRARY_PATH=/path/to/mylib:$DYLD_LIBRARY_PATH
export LIBRARY_PATH=/path/to/mylib:$LIBRARY_PATH
and on Windows you set LIB for linking and PATH for runtime. This is best done by graphically editing the environment variables.
With the environment variables set, you can install the package
pip install pipcudemo
and test it as follows
>>> import pipcudemo as pcd
>>> pcd.add(1,2)
3
>>> pcd.subtract(2,4)
-2
>>> pcd.multiply(4,5)
20
>>> pcd.divide(999,9)
111
Troubleshooting
If you receive any errors about being unable to find a library file, unresolved references, etc., the problem is almost certainly an issue with how the paths are set up. If you don’t want to set up environment variables, you can also pass build_ext options through pip with --global-option, using the -L flag to specify library paths, as follows
pip install pipcudemo --global-option=build_ext --global-option="-L/path/to/mylib"
and you could also download the package and use the setup.py
script manually
python3 setup.py build_ext --library-dirs="/path/to/mylib" install
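Once the package does import, you can also double-check exactly which copy of libmylib (and which CUDA runtime) the extension module resolved to. On Linux, ldd lists its shared-library dependencies, with any unresolved ones flagged as “not found” (on Mac, otool -L plays the same role):
# list the shared-library dependencies of the installed extension module
ldd $(python3 -c "import pipcudemo.core as c; print(c.__file__)")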
Conclusion
Python and CUDA are an extremely powerful combination. Although the code in this example was kept simple, one can imagine the extension to much more complicated CUDA/C++ code. I hope you found this interesting or helpful, and as always I welcome feedback/comments/discussion.