OpenCL Basic Example - Vector Addition Explanation
This document provides a detailed explanation of the OpenCL vector addition example in 15-opencl-vector-addition.c.
This example demonstrates the OpenCL equivalent of the CUDA vector addition, showcasing the differences and similarities between CUDA and OpenCL programming models.
Prerequisites
To run this example, you need:
- OpenCL-compatible device (GPU, CPU, or other accelerator)
- OpenCL runtime and headers installed
- A C compiler (gcc, clang, etc.)
- GNU Make (for building with the provided Makefile)
Installing OpenCL
Ubuntu/Debian:
# For NVIDIA GPUs
sudo apt-get install nvidia-opencl-dev
# For AMD GPUs
sudo apt-get install amdgpu-pro-opencl-dev
# For Intel GPUs/CPUs
sudo apt-get install intel-opencl-icd
# Generic OpenCL headers
sudo apt-get install opencl-headers ocl-icd-opencl-dev
CentOS/RHEL:
# Install OpenCL headers and loader
sudo yum install opencl-headers ocl-icd-devel
# For NVIDIA GPUs, install CUDA toolkit
# For AMD GPUs, install ROCm
macOS: OpenCL is included with the system (no additional installation needed).
Building and Running
- Build the example:
make 15-opencl-vector-addition
- Run the program:
./15-opencl-vector-addition
Code Structure and Explanation
1. Header Files and Platform Detection
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
OpenCL headers are located differently on macOS vs. other platforms:
- macOS: <OpenCL/opencl.h>
- Linux/Windows: <CL/cl.h>
2. OpenCL Kernel Source
const char* kernelSource =
"__kernel void vectorAdd(__global const float* A,\n"
"                        __global const float* B,\n"
"                        __global float* C,\n"
"                        const int numElements) {\n"
"    int i = get_global_id(0);\n"
"    if (i < numElements) {\n"
"        C[i] = A[i] + B[i];\n"
"    }\n"
"}\n";
Key differences from CUDA:
- __kernel instead of __global__
- __global memory space qualifier for pointers
- get_global_id(0) instead of manual thread index calculation
- OpenCL kernels are compiled at runtime from source strings
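To make the role of get_global_id(0) concrete, here is a minimal host-only sketch that emulates a 1-D launch on the CPU. It does not use the OpenCL API; `vector_add_item` and `emulate_ndrange` are hypothetical names introduced for illustration. The global ID is derived as group index times local size plus local index, which is the same arithmetic CUDA code writes by hand as blockIdx.x * blockDim.x + threadIdx.x.

```c
#include <stddef.h>

/* Body of the kernel for one work-item. `gid` plays the role of
 * get_global_id(0); on a real device the runtime supplies it. */
void vector_add_item(size_t gid, const float *A, const float *B,
                     float *C, int numElements)
{
    if (gid < (size_t)numElements) {   /* same bounds check as the kernel */
        C[gid] = A[gid] + B[gid];
    }
}

/* Hypothetical CPU emulation of a 1-D NDRange. globalSize is assumed to be
 * a multiple of localSize, as in the host code of the example. */
void emulate_ndrange(size_t globalSize, size_t localSize,
                     const float *A, const float *B, float *C,
                     int numElements)
{
    for (size_t group = 0; group < globalSize / localSize; ++group) {
        for (size_t local = 0; local < localSize; ++local) {
            size_t gid = group * localSize + local;  /* global work-item ID */
            vector_add_item(gid, A, B, C, numElements);
        }
    }
}
```

Calling `emulate_ndrange(8, 4, A, B, C, 5)` runs eight work-items over five elements; the last three fail the bounds check and write nothing, which is why the kernel needs the `if (i < numElements)` guard.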
3. Error Handling
OpenCL requires extensive error checking. The example includes:
void checkError(cl_int error, const char* operation) {
    if (error != CL_SUCCESS) {
        printf("Error during %s: %d\n", operation, error);
        exit(1);
    }
}
And a comprehensive error string function for debugging.
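The error string function itself is not reproduced in this document. As a rough sketch of what such a helper can look like, the function below maps a handful of common status codes to their names. The name `clErrorName` is hypothetical, and the numeric values are written as literals (matching the definitions in CL/cl.h) only so that the sketch compiles without the OpenCL headers; real code would use the CL_* macros in the case labels.

```c
/* Map a few common OpenCL status codes to their symbolic names.
 * Assumption: this is an illustrative subset, not the example's own table. */
const char *clErrorName(int error)
{
    switch (error) {
    case 0:   return "CL_SUCCESS";
    case -1:  return "CL_DEVICE_NOT_FOUND";
    case -2:  return "CL_DEVICE_NOT_AVAILABLE";
    case -3:  return "CL_COMPILER_NOT_AVAILABLE";
    case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -6:  return "CL_OUT_OF_HOST_MEMORY";
    case -11: return "CL_BUILD_PROGRAM_FAILURE";
    case -30: return "CL_INVALID_VALUE";
    case -54: return "CL_INVALID_WORK_GROUP_SIZE";
    default:  return "unknown OpenCL error";
    }
}
```

A helper like this could be called from checkError() so that the message shows a name such as CL_BUILD_PROGRAM_FAILURE next to the raw number.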
4. Platform and Device Discovery
Unlike CUDA, which targets NVIDIA GPUs without any platform selection step, OpenCL requires explicit platform and device discovery:
// Get platform
ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
checkError(ret, "getting platform IDs");
// Get device (prefer GPU, fallback to any device)
ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, &ret_num_devices);
if (ret != CL_SUCCESS) {
    printf("No GPU found, trying any device type...\n");
    ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, 1, &device_id, &ret_num_devices);
    checkError(ret, "getting device IDs");
}
This code:
- Finds the first available OpenCL platform
- Tries to get a GPU device
- Falls back to any available device if no GPU is found
5. Context and Command Queue Creation
// Create OpenCL context
cl_context context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &ret);
checkError(ret, "creating context");
// Create command queue
cl_command_queue command_queue = clCreateCommandQueue(context, device_id, 0, &ret);
checkError(ret, "creating command queue");
OpenCL uses:
- Context: Manages devices and memory objects
- Command Queue: Queues operations for execution on a device
6. Memory Management
// Create memory buffers on device
cl_mem d_A = clCreateBuffer(context, CL_MEM_READ_ONLY, dataSize, NULL, &ret);
cl_mem d_B = clCreateBuffer(context, CL_MEM_READ_ONLY, dataSize, NULL, &ret);
cl_mem d_C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, dataSize, NULL, &ret);
// Copy data to device buffers
ret = clEnqueueWriteBuffer(command_queue, d_A, CL_TRUE, 0, dataSize, h_A, 0, NULL, NULL);
ret = clEnqueueWriteBuffer(command_queue, d_B, CL_TRUE, 0, dataSize, h_B, 0, NULL, NULL);
Key differences from CUDA:
- clCreateBuffer() instead of cudaMalloc()
- Memory access patterns specified at creation (CL_MEM_READ_ONLY, CL_MEM_WRITE_ONLY)
- clEnqueueWriteBuffer() instead of cudaMemcpy()
- All operations are queued on command queues
7. Runtime Compilation
// Create program from source
cl_program program = clCreateProgramWithSource(context, 1, &kernelSource, NULL, &ret);
checkError(ret, "creating program");
// Build program
ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
if (ret != CL_SUCCESS) {
    // Get build log for debugging
    size_t log_size;
    clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    char *log = (char*)malloc(log_size);
    clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
    printf("Build log:\n%s\n", log);
    free(log);
    exit(1);
}
OpenCL compiles kernels at runtime, allowing for:
- Platform-specific optimizations
- Runtime kernel generation
- Better portability across vendors
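Runtime kernel generation can be as simple as formatting the source string before handing it to clCreateProgramWithSource(). The sketch below is an illustration rather than part of the example: `buildKernelSource` and the kernel name `vectorOp` are hypothetical, and the helper only builds the text; it does not call any OpenCL function.

```c
#include <stdio.h>

/* Hypothetical helper: writes an element-wise kernel for the given scalar
 * type ("float", "int", ...) and binary operator ('+', '*', ...) into buf.
 * Returns the number of characters written, or -1 if buf is too small. */
int buildKernelSource(char *buf, size_t bufSize, const char *type, char op)
{
    int n = snprintf(buf, bufSize,
        "__kernel void vectorOp(__global const %s* A,\n"
        "                       __global const %s* B,\n"
        "                       __global %s* C,\n"
        "                       const int numElements) {\n"
        "    int i = get_global_id(0);\n"
        "    if (i < numElements) {\n"
        "        C[i] = A[i] %c B[i];\n"
        "    }\n"
        "}\n",
        type, type, type, op);
    /* snprintf reports the length it wanted; treat truncation as failure. */
    return (n < 0 || (size_t)n >= bufSize) ? -1 : n;
}
```

With a 512-byte buffer, `buildKernelSource(src, sizeof src, "float", '+')` produces source equivalent to the vectorAdd kernel above under a different name, and swapping the arguments yields an integer or multiplicative variant without touching the host build.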
8. Kernel Execution
// Create kernel
cl_kernel kernel = clCreateKernel(program, "vectorAdd", &ret);
// Set kernel arguments
ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void*)&d_A);
ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void*)&d_B);
ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void*)&d_C);
ret = clSetKernelArg(kernel, 3, sizeof(int), (void*)&numElements);
// Execute kernel
size_t globalWorkSize = numElements;
size_t localWorkSize = 256; // Work group size
// Adjust global work size to be multiple of local work size
if (globalWorkSize % localWorkSize != 0) {
    globalWorkSize = ((globalWorkSize / localWorkSize) + 1) * localWorkSize;
}
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &globalWorkSize, &localWorkSize, 0, NULL, NULL);
Key concepts:
- Global Work Size: Total number of work items (similar to total threads in CUDA)
- Local Work Size: Work group size (similar to block size in CUDA)
- Global work size must be a multiple of the local work size (required in OpenCL 1.x whenever a local size is passed explicitly)
- Arguments are set individually with type and size information
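The padding arithmetic can be isolated into a small helper. The sketch below uses hypothetical function names and assumes localSize is non-zero; it computes the same result as the if-block in the example using the common round-up idiom.

```c
#include <stddef.h>

/* Smallest multiple of localSize that is >= numElements.
 * Assumes localSize > 0. */
size_t roundUpGlobalSize(size_t numElements, size_t localSize)
{
    return ((numElements + localSize - 1) / localSize) * localSize;
}

/* Number of work-groups launched for that padded global size. */
size_t workGroupCount(size_t numElements, size_t localSize)
{
    return roundUpGlobalSize(numElements, localSize) / localSize;
}
```

For 50000 elements and a local size of 256 this gives a global size of 50176 (196 work-groups), so 176 work-items are padding and rely on the bounds check in the kernel.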
9. Synchronization and Results
// Wait for kernel to complete
ret = clFinish(command_queue);
checkError(ret, "waiting for kernel to finish");
// Read result back to host
ret = clEnqueueReadBuffer(command_queue, d_C, CL_TRUE, 0, dataSize, h_C, 0, NULL, NULL);
- clFinish() waits for all queued operations to complete
- clEnqueueReadBuffer() with CL_TRUE performs a blocking read
CUDA vs OpenCL Comparison
| Aspect | CUDA | OpenCL |
|---|---|---|
| Vendor | NVIDIA only | Cross-platform (NVIDIA, AMD, Intel, etc.) |
| Language | C++ with extensions | C99 with extensions |
| Compilation | Compile-time (nvcc) | Runtime compilation |
| Memory Model | Implicit global memory | Explicit memory spaces (__global, __local, etc.) |
| Thread Indexing | Manual calculation | Built-in functions (get_global_id()) |
| Error Handling | Return codes + cudaGetLastError() | Return codes for all functions |
| Kernel Launch | <<<blocks, threads>>> syntax | clEnqueueNDRangeKernel() |
| Memory Management | cudaMalloc, cudaMemcpy | clCreateBuffer, clEnqueueWriteBuffer |
Performance Considerations
- Work Group Size
  - Similar to CUDA block size
  - Should be a multiple of 32 (warp size) on NVIDIA GPUs
  - Should be a multiple of 64 (wavefront size) on AMD GPUs
- Memory Access Patterns
  - Coalesced access is still important
  - OpenCL provides more explicit control over memory spaces
- Kernel Compilation
  - Runtime compilation adds overhead
  - Compiled binaries can be cached for production use
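Rather than hard-coding 32 or 64, host code can query the device's preferred multiple (clGetKernelWorkGroupInfo() with CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE) and its limit (CL_KERNEL_WORK_GROUP_SIZE), then derive a local size from them. The helper below is a hypothetical sketch of that last step only; it takes the queried values as plain parameters and makes no OpenCL calls.

```c
#include <stddef.h>

/* Hypothetical helper: largest multiple of preferredMultiple that exceeds
 * neither the desired size nor the device/kernel maximum. If even one full
 * warp/wavefront does not fit, the clamped size is returned unchanged. */
size_t chooseLocalSize(size_t desired, size_t preferredMultiple,
                       size_t maxWorkGroupSize)
{
    size_t limit = desired < maxWorkGroupSize ? desired : maxWorkGroupSize;
    if (preferredMultiple == 0 || limit < preferredMultiple) {
        return limit;
    }
    return (limit / preferredMultiple) * preferredMultiple;
}
```

For instance, a desired size of 300 with a preferred multiple of 64 and a maximum of 1024 comes out as 256, while the example's fixed 256 is already a multiple of both 32 and 64.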
Common Issues and Debugging
- No OpenCL Platforms Found
  Solution: Install the OpenCL runtime for your hardware
  - NVIDIA: Install CUDA toolkit
  - AMD: Install ROCm or Adrenalin drivers
  - Intel: Install Intel OpenCL runtime
- Kernel Compilation Failures
  Solution: Check the build log output
  - The example prints detailed compilation errors
  - Common issues: syntax errors, unsupported features
- Work Size Errors
  Solution: Ensure the global work size is a multiple of the local work size
  - The example automatically adjusts work sizes
- Memory Errors
  Solution: Check buffer creation and data transfer
  - Verify sufficient device memory
  - Check buffer access patterns in the kernel
Expected Output
When running successfully, you should see:
OpenCL Vector addition of 50000 elements
Using OpenCL platform: NVIDIA CUDA
Using device: Tesla P40
Device type: GPU
Global memory: 22906 MB
Compute units: 60
Max work group size: 1024
OpenCL kernel launch with global work size 50176 and local work size 256
Verifying results...
Test PASSED
Done
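The verification step behind "Test PASSED" is not shown in this document. A minimal sketch of such a check, assuming an element-wise comparison against a host-computed sum with a small tolerance, could look like the following; `verifyVectorAdd` and the tolerance parameter are hypothetical and not taken from the example.

```c
#include <stddef.h>

/* Returns the index of the first mismatching element, or -1 if every
 * C[i] equals A[i] + B[i] within the given tolerance. */
long verifyVectorAdd(const float *A, const float *B, const float *C,
                     size_t numElements, float tolerance)
{
    for (size_t i = 0; i < numElements; ++i) {
        float diff = (A[i] + B[i]) - C[i];
        if (diff < 0.0f) {
            diff = -diff;              /* absolute difference */
        }
        if (diff > tolerance) {
            return (long)i;
        }
    }
    return -1;
}
```

The host would call this on h_A, h_B, and h_C after the blocking read completes and print "Test PASSED" when the result is -1.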
Advanced Features
This basic example can be extended to explore:
- Multiple Devices: Run on multiple GPUs/CPUs simultaneously
- Asynchronous Execution: Use events for fine-grained synchronization
- Image Processing: Use OpenCL image objects and samplers
- Local Memory: Utilize __local memory for shared data
- Profiling: Enable command queue profiling for performance analysis
Building for Different Platforms
The example includes conditional compilation for different platforms and can be adapted for:
- NVIDIA GPUs (via CUDA OpenCL implementation)
- AMD GPUs (via ROCm or proprietary drivers)
- Intel CPUs/GPUs (via Intel OpenCL runtime)
- ARM Mali GPUs (via the OpenCL implementation in Arm's Mali GPU driver)
This makes OpenCL an excellent choice for cross-platform GPU computing applications.
- Last updated: May 28, 2025
- First published: May 28, 2025
- Contributors: github-actions[bot]