OpenCL Basic Example - Vector Addition Explanation

This document provides a detailed explanation of the OpenCL vector addition example in 15-opencl-vector-addition.c.

This example demonstrates the OpenCL equivalent of the CUDA vector addition, showcasing the differences and similarities between CUDA and OpenCL programming models.

Prerequisites

To run this example, you need:

  • OpenCL-compatible device (GPU, CPU, or other accelerator)
  • OpenCL runtime and headers installed
  • A C compiler (gcc, clang, etc.)
  • GNU Make (for building with the provided Makefile)

Installing OpenCL

Ubuntu/Debian:

# For NVIDIA GPUs
sudo apt-get install nvidia-opencl-dev
 
# For AMD GPUs
sudo apt-get install amdgpu-pro-opencl-dev
 
# For Intel GPUs/CPUs
sudo apt-get install intel-opencl-icd
 
# Generic OpenCL headers
sudo apt-get install opencl-headers ocl-icd-opencl-dev

CentOS/RHEL:

# Install OpenCL headers and loader
sudo yum install opencl-headers ocl-icd-devel
 
# For NVIDIA GPUs, install CUDA toolkit
# For AMD GPUs, install ROCm

macOS: OpenCL is included with the system (no additional installation needed).
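
On Linux, the clinfo utility (packaged as clinfo on most distributions) is a quick way to confirm that at least one OpenCL platform and device are visible before building the example.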

Building and Running

  1. Build the example:
make 15-opencl-vector-addition
  2. Run the program:
./15-opencl-vector-addition
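
The provided Makefile is assumed to handle linking; on Linux this usually means linking against the OpenCL ICD loader with -lOpenCL, while on macOS the program is linked against the system framework with -framework OpenCL.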

Code Structure and Explanation

1. Header Files and Platform Detection

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

OpenCL headers are located differently on macOS vs. other platforms:

  • macOS: <OpenCL/opencl.h>
  • Linux/Windows: <CL/cl.h>

2. OpenCL Kernel Source

const char* kernelSource = 
"__kernel void vectorAdd(__global const float* A,\n"
"                       __global const float* B,\n"
"                       __global float* C,\n"
"                       const int numElements) {\n"
"    int i = get_global_id(0);\n"
"    if (i < numElements) {\n"
"        C[i] = A[i] + B[i];\n"
"    }\n"
"}\n";

Key differences from CUDA:

  • __kernel instead of __global__
  • __global memory space qualifier for pointers
  • get_global_id(0) instead of manual thread index calculation
  • OpenCL kernels are compiled at runtime from source strings
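
For comparison, in a one-dimensional launch get_global_id(0) returns the same value a CUDA kernel would compute manually as blockIdx.x * blockDim.x + threadIdx.x.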

3. Error Handling

OpenCL requires extensive error checking. The example includes:

void checkError(cl_int error, const char* operation) {
    if (error != CL_SUCCESS) {
        printf("Error during %s: %d\n", operation, error);
        exit(1);
    }
}

The example also includes a more comprehensive error-string function for debugging.
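
That helper is not reproduced here; a minimal sketch of the idea, covering only a handful of the status codes the real function likely handles (the name clErrorString is illustrative, not necessarily the one used in the file), could look like:

const char* clErrorString(cl_int error) {
    // Map a few common OpenCL status codes to readable names
    switch (error) {
        case CL_SUCCESS:                       return "CL_SUCCESS";
        case CL_DEVICE_NOT_FOUND:              return "CL_DEVICE_NOT_FOUND";
        case CL_MEM_OBJECT_ALLOCATION_FAILURE: return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
        case CL_OUT_OF_RESOURCES:              return "CL_OUT_OF_RESOURCES";
        case CL_OUT_OF_HOST_MEMORY:            return "CL_OUT_OF_HOST_MEMORY";
        case CL_INVALID_KERNEL_ARGS:           return "CL_INVALID_KERNEL_ARGS";
        case CL_INVALID_WORK_GROUP_SIZE:       return "CL_INVALID_WORK_GROUP_SIZE";
        default:                               return "Unknown OpenCL error";
    }
}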

4. Platform and Device Discovery

Unlike CUDA, which implicitly targets NVIDIA GPUs, OpenCL requires explicit platform and device discovery:

// Get platform
ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
checkError(ret, "getting platform IDs");
 
// Get device (prefer GPU, fallback to any device)
ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, &ret_num_devices);
if (ret != CL_SUCCESS) {
    printf("No GPU found, trying any device type...\n");
    ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, 1, &device_id, &ret_num_devices);
    checkError(ret, "getting device IDs");
}

This code:

  1. Finds the first available OpenCL platform
  2. Tries to get a GPU device
  3. Falls back to any available device if no GPU is found
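
The device properties printed in the expected output below come from clGetDeviceInfo. A minimal sketch of those queries, assuming the same device_id variable (the example's actual code may differ in detail):

char deviceName[256];
cl_ulong globalMem;
cl_uint computeUnits;
size_t maxWorkGroup;

clGetDeviceInfo(device_id, CL_DEVICE_NAME, sizeof(deviceName), deviceName, NULL);
clGetDeviceInfo(device_id, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(globalMem), &globalMem, NULL);
clGetDeviceInfo(device_id, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(computeUnits), &computeUnits, NULL);
clGetDeviceInfo(device_id, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(maxWorkGroup), &maxWorkGroup, NULL);

printf("Using device: %s\n", deviceName);
printf("Global memory: %llu MB\n", (unsigned long long)(globalMem / (1024 * 1024)));
printf("Compute units: %u\n", computeUnits);
printf("Max work group size: %zu\n", maxWorkGroup);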

5. Context and Command Queue Creation

// Create OpenCL context
cl_context context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &ret);
checkError(ret, "creating context");
 
// Create command queue
cl_command_queue command_queue = clCreateCommandQueue(context, device_id, 0, &ret);
checkError(ret, "creating command queue");

OpenCL uses:

  • Context: Manages devices and memory objects
  • Command Queue: Queues operations for execution on a device
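
Note that clCreateCommandQueue is deprecated as of OpenCL 2.0 in favor of clCreateCommandQueueWithProperties. A sketch of the newer call (not what the example uses):

// OpenCL 2.0+ replacement for the deprecated clCreateCommandQueue call above
cl_queue_properties props[] = { 0 };   // empty, zero-terminated property list
cl_command_queue queue = clCreateCommandQueueWithProperties(context, device_id, props, &ret);
checkError(ret, "creating command queue");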

6. Memory Management

// Create memory buffers on device
cl_mem d_A = clCreateBuffer(context, CL_MEM_READ_ONLY, dataSize, NULL, &ret);
cl_mem d_B = clCreateBuffer(context, CL_MEM_READ_ONLY, dataSize, NULL, &ret);
cl_mem d_C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, dataSize, NULL, &ret);
 
// Copy data to device buffers
ret = clEnqueueWriteBuffer(command_queue, d_A, CL_TRUE, 0, dataSize, h_A, 0, NULL, NULL);
ret = clEnqueueWriteBuffer(command_queue, d_B, CL_TRUE, 0, dataSize, h_B, 0, NULL, NULL);

Key differences from CUDA:

  • clCreateBuffer() instead of cudaMalloc()
  • Memory access patterns specified at creation (CL_MEM_READ_ONLY, CL_MEM_WRITE_ONLY)
  • clEnqueueWriteBuffer() instead of cudaMemcpy()
  • All operations are queued on command queues
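
As an alternative (not used by this example), the initial host-to-device copies can be folded into buffer creation with the CL_MEM_COPY_HOST_PTR flag:

// Copy host data at buffer-creation time instead of issuing separate write commands
cl_mem d_A = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, dataSize, h_A, &ret);
cl_mem d_B = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, dataSize, h_B, &ret);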

7. Runtime Compilation

// Create program from source
cl_program program = clCreateProgramWithSource(context, 1, &kernelSource, NULL, &ret);
checkError(ret, "creating program");
 
// Build program
ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
if (ret != CL_SUCCESS) {
    // Get build log for debugging
    size_t log_size;
    clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    char *log = (char*)malloc(log_size);
    clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
    printf("Build log:\n%s\n", log);
    free(log);
    exit(1);
}

OpenCL compiles kernels at runtime, allowing for:

  • Platform-specific optimizations
  • Runtime kernel generation
  • Better portability across vendors

8. Kernel Execution

// Create kernel
cl_kernel kernel = clCreateKernel(program, "vectorAdd", &ret);
 
// Set kernel arguments
ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void*)&d_A);
ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void*)&d_B);
ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void*)&d_C);
ret = clSetKernelArg(kernel, 3, sizeof(int), (void*)&numElements);
 
// Execute kernel
size_t globalWorkSize = numElements;
size_t localWorkSize = 256; // Work group size
 
// Adjust global work size to be multiple of local work size
if (globalWorkSize % localWorkSize != 0) {
    globalWorkSize = ((globalWorkSize / localWorkSize) + 1) * localWorkSize;
}
 
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &globalWorkSize, &localWorkSize, 0, NULL, NULL);

Key concepts:

  • Global Work Size: Total number of work items (similar to total threads in CUDA)
  • Local Work Size: Work group size (similar to block size in CUDA)
  • The global work size must be a multiple of the local work size
  • Arguments are set individually with type and size information
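
For the sizes used in this example, 50000 work items rounded up to a multiple of 256 gives ((50000 / 256) + 1) * 256 = 196 * 256 = 50176, which is exactly the global work size reported in the expected output below; the 176 extra work items are filtered out by the if (i < numElements) guard inside the kernel.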

9. Synchronization and Results

// Wait for kernel to complete
ret = clFinish(command_queue);
checkError(ret, "waiting for kernel to finish");
 
// Read result back to host
ret = clEnqueueReadBuffer(command_queue, d_C, CL_TRUE, 0, dataSize, h_C, 0, NULL, NULL);

  • clFinish() waits for all queued operations to complete
  • clEnqueueReadBuffer() with CL_TRUE performs a blocking read
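
After verification, the program should also release its OpenCL objects and host buffers. A typical teardown sequence, assuming the variable names used above (the file may order these differently), looks like:

// Release OpenCL objects and host memory once results are verified
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseMemObject(d_A);
clReleaseMemObject(d_B);
clReleaseMemObject(d_C);
clReleaseCommandQueue(command_queue);
clReleaseContext(context);
free(h_A);
free(h_B);
free(h_C);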

CUDA vs OpenCL Comparison

Aspect             | CUDA                               | OpenCL
-------------------|------------------------------------|--------------------------------------------------
Vendor             | NVIDIA only                        | Cross-platform (NVIDIA, AMD, Intel, etc.)
Language           | C++ with extensions                | C99 with extensions
Compilation        | Compile-time (nvcc)                | Runtime compilation
Memory Model       | Implicit global memory             | Explicit memory spaces (__global, __local, etc.)
Thread Indexing    | Manual calculation                 | Built-in functions (get_global_id())
Error Handling     | Return codes + cudaGetLastError()  | Return codes for all functions
Kernel Launch      | <<<blocks, threads>>> syntax       | clEnqueueNDRangeKernel()
Memory Management  | cudaMalloc, cudaMemcpy             | clCreateBuffer, clEnqueueWriteBuffer

Performance Considerations

  1. Work Group Size

    • Similar to the CUDA block size
    • Should be a multiple of 32 (the warp size) on NVIDIA GPUs
    • Should be a multiple of 64 (the wavefront size) on AMD GPUs
  2. Memory Access Patterns

    • Coalesced access is still important
    • OpenCL provides more explicit control over memory spaces
  3. Kernel Compilation

    • Runtime compilation adds overhead
    • Compiled binaries can be cached for production use (see the sketch after this list)
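
A minimal sketch of the binary-caching idea for a single device, with error checks omitted and assuming the program, context, and device_id variables from above (the example itself does not do this):

// Retrieve the compiled binary so it can be written to a cache file
size_t binarySize;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(binarySize), &binarySize, NULL);

unsigned char *binary = (unsigned char*)malloc(binarySize);
clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(binary), &binary, NULL);

// ...store 'binary' on disk; on a later run, rebuild the program from it:
cl_int binaryStatus;
cl_program cached = clCreateProgramWithBinary(context, 1, &device_id, &binarySize,
                                              (const unsigned char**)&binary,
                                              &binaryStatus, &ret);
ret = clBuildProgram(cached, 1, &device_id, NULL, NULL, NULL);
free(binary);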

Common Issues and Debugging

  1. No OpenCL Platforms Found

    Solution: Install OpenCL runtime for your hardware
    - NVIDIA: Install CUDA toolkit
    - AMD: Install ROCm or Adrenalin drivers
    - Intel: Install Intel OpenCL runtime
    
  2. Kernel Compilation Failures

    Solution: Check build log output
    - The example prints detailed compilation errors
    - Common issues: syntax errors, unsupported features
    
  3. Work Size Errors

    Solution: Ensure global work size is multiple of local work size
    - The example automatically adjusts work sizes
    
  4. Memory Errors

    Solution: Check buffer creation and data transfer
    - Verify sufficient device memory
    - Check buffer access patterns in kernel
    

Expected Output

When running successfully, you should see:

OpenCL Vector addition of 50000 elements
Using OpenCL platform: NVIDIA CUDA
Using device: Tesla P40
Device type: GPU
Global memory: 22906 MB
Compute units: 60
Max work group size: 1024
OpenCL kernel launch with global work size 50176 and local work size 256
Verifying results...
Test PASSED
Done

Advanced Features

This basic example can be extended to explore:

  1. Multiple Devices: Run on multiple GPUs/CPUs simultaneously
  2. Asynchronous Execution: Use events for fine-grained synchronization
  3. Image Processing: Use OpenCL image objects and samplers
  4. Local Memory: Utilize __local memory for shared data
  5. Profiling: Enable command queue profiling for performance analysis (see the sketch below)
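
As a starting point for item 5, a minimal profiling sketch using event timestamps, assuming the same kernel and work-size variables as above:

// Create the queue with profiling enabled (OpenCL 1.x style)
cl_command_queue queue = clCreateCommandQueue(context, device_id, CL_QUEUE_PROFILING_ENABLE, &ret);

// Attach an event to the kernel launch and wait for it to finish
cl_event event;
ret = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalWorkSize, &localWorkSize, 0, NULL, &event);
clWaitForEvents(1, &event);

// Read start/end timestamps (in nanoseconds) and report the elapsed time
cl_ulong start, end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
printf("Kernel time: %.3f ms\n", (end - start) * 1e-6);
clReleaseEvent(event);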

Building for Different Platforms

The example includes conditional compilation for different platforms and can be adapted for:

  • NVIDIA GPUs (via CUDA OpenCL implementation)
  • AMD GPUs (via ROCm or proprietary drivers)
  • Intel CPUs/GPUs (via Intel OpenCL runtime)
  • ARM Mali GPUs (via the Arm Mali OpenCL driver)

This makes OpenCL an excellent choice for cross-platform GPU computing applications.