Hermes defines some macros to help you distinguish device code from host code.
#define HERMES_HOST_FUNCTION __host__
#define HERMES_DEVICE_CALLABLE __device__ __host__
#define HERMES_DEVICE_FUNCTION __device__
#define HERMES_DEVICE_ENABLED
#define HERMES_CUDA_CODE(CODE) {CODE}
When you build hermes without CUDA support, the macros listed in this section expand to nothing or are simply not defined.
An important tip: to make sure a piece of code runs only on the device, use the following preprocessor condition:
#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ > 0))
// code here is compiled only for the device
#else
// code here is compiled only for the host
#endif
Here is how you can declare a CUDA kernel:

HERMES_CUDA_KERNEL(add)(float *a, float *b, float *c, size_t n) {
  // kernel body goes here
}

HERMES_CUDA_KERNEL(NAME): defines a CUDA kernel function. (defs.h:53)

Note that the macro appends _k to your kernel's name, so the kernel declared above is actually named add_k.
Usually your kernel will use thread indices, which can be 1-, 2- or 3-dimensional depending on your launch configuration. Here are some macros that create those indices for you:
For example, the line

HERMES_CUDA_THREAD_INDEX_IJK;

at the top of a kernel creates a 3-dimensional thread index. The available index macros are:

HERMES_CUDA_THREAD_INDEX_I: creates a 1-dimensional index based on the current CUDA thread index. (cuda_utils.h:282)
HERMES_CUDA_THREAD_INDEX_IJ: creates a 2-dimensional index based on the current CUDA thread index. (cuda_utils.h:286)
HERMES_CUDA_THREAD_INDEX_LT(I, BOUNDS): creates a 1-dimensional index and tests it against bounds. (cuda_utils.h:306)
HERMES_CUDA_THREAD_INDEX2_LT(IJ, BOUNDS): creates a 2-dimensional index and tests it against bounds. (cuda_utils.h:313)
HERMES_CUDA_THREAD_INDEX3_LT(IJK, BOUNDS): creates a 3-dimensional index and tests it against bounds. (cuda_utils.h:321)
HERMES_CUDA_THREAD_INDEX_I_LT(BOUNDS): creates a 1-dimensional index variable i and tests it against bounds. (cuda_utils.h:329)
HERMES_CUDA_THREAD_INDEX_IJ_LT(BOUNDS): creates a 2-dimensional index variable ij and tests it against bounds. (cuda_utils.h:332)
HERMES_CUDA_THREAD_INDEX_IJK_LT(BOUNDS): creates a 3-dimensional index variable ijk and tests it against bounds. (cuda_utils.h:335)
For debugging purposes, you may want to quickly make the first thread the only one to execute; for that you can use:

HERMES_CUDA_RETURN_IF_NOT_THREAD_0: ensures that only the thread of index 0 runs. (cuda_utils.h:297)
The most important thing you will want to do is check for errors; for that, hermes lets you use:

HERMES_CHECK_CUDA_CALL(err): checks (and logs) a CUDA method's return code for errors. (cuda_utils.h:341)
HERMES_CHECK_LAST_CUDA_CALL: checks (and logs) the last CUDA call for errors. (cuda_utils.h:351)
When launching kernels, hermes::cuda_utils::LaunchInfo holds launch information such as the number of threads, blocks, shared memory size and stream. It also redistributes threads for you, trying to optimize occupancy. The following macro can be used to launch kernels:

HERMES_CUDA_LAUNCH_AND_SYNC(LAUNCH_INFO, NAME, ...): launches a CUDA kernel given its parameters and synchronizes with the host. (cuda_utils.h:273)
In this case, LAUNCH_INFO is the list of constructor parameters of hermes::cuda_utils::LaunchInfo, surrounded by parentheses. Here is a complete example:
#include <hermes/common/cuda_utils.h>
#include <vector>

HERMES_CUDA_KERNEL(add)(float *a, float *b, float *c, size_t n) {
  HERMES_CUDA_THREAD_INDEX_I_LT(n); // creates i and returns if i >= n
  c[i] = a[i] + b[i];
}

int main() {
  size_t n = 1000;
  std::vector<float> host_a(n), host_b(n), host_c(n, 0);
  float *device_a, *device_b, *device_c;
  HERMES_CHECK_CUDA_CALL(cudaMalloc(&device_a, n * sizeof(float)));
  HERMES_CHECK_CUDA_CALL(cudaMalloc(&device_b, n * sizeof(float)));
  HERMES_CHECK_CUDA_CALL(cudaMalloc(&device_c, n * sizeof(float)));
  HERMES_CHECK_CUDA_CALL(cudaMemcpy(device_a, host_a.data(), n * sizeof(float), cudaMemcpyHostToDevice));
  HERMES_CHECK_CUDA_CALL(cudaMemcpy(device_b, host_b.data(), n * sizeof(float), cudaMemcpyHostToDevice));
  // (n) holds the LaunchInfo constructor parameters; add_k is the kernel
  // name produced by HERMES_CUDA_KERNEL(add):
  HERMES_CUDA_LAUNCH_AND_SYNC((n), add_k, device_a, device_b, device_c, n);
  HERMES_CHECK_LAST_CUDA_CALL;
  HERMES_CHECK_CUDA_CALL(cudaMemcpy(host_c.data(), device_c, n * sizeof(float), cudaMemcpyDeviceToHost));
  HERMES_CHECK_CUDA_CALL(cudaFree(device_a));
  HERMES_CHECK_CUDA_CALL(cudaFree(device_b));
  HERMES_CHECK_CUDA_CALL(cudaFree(device_c));
  return 0;
}
The macros above are defined in cuda_utils.h, which gathers auxiliary classes and macros for CUDA calls. Please also check the storage classes for device arrays.