OpenCL Basic Example - Vector Addition Explanation

This document provides a detailed explanation of the OpenCL vector addition example in 15-opencl-vector-addition.c.

This example demonstrates the OpenCL equivalent of the CUDA vector addition, showcasing the differences and similarities between CUDA and OpenCL programming models.

Prerequisites

To run this example, you need:

  • OpenCL-compatible device (GPU, CPU, or other accelerator)
  • OpenCL runtime and headers installed
  • A C compiler (gcc, clang, etc.)
  • GNU Make (for building with the provided Makefile)

Installing OpenCL

Ubuntu/Debian:

# For NVIDIA GPUs
sudo apt-get install nvidia-opencl-dev
 
# For AMD GPUs
sudo apt-get install amdgpu-pro-opencl-dev
 
# For Intel GPUs/CPUs
sudo apt-get install intel-opencl-icd
 
# Generic OpenCL headers
sudo apt-get install opencl-headers ocl-icd-opencl-dev

CentOS/RHEL:

# Install OpenCL headers and loader
sudo yum install opencl-headers ocl-icd-devel
 
# For NVIDIA GPUs, install CUDA toolkit
# For AMD GPUs, install ROCm

macOS: OpenCL is included with the system (no additional installation needed).
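
On Linux, the clinfo utility (packaged as clinfo on most distributions) is a quick way to confirm that at least one OpenCL platform and device are visible before building the example.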

Building and Running

  1. Build the example:
make 15-opencl-vector-addition
  2. Run the program:
./15-opencl-vector-addition
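
The provided Makefile is assumed to handle linking; on Linux this usually means linking against the OpenCL ICD loader with -lOpenCL, while on macOS the program is linked against the system framework with -framework OpenCL.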

Code Structure and Explanation

1. Header Files and Platform Detection

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

OpenCL headers are located differently on macOS vs. other platforms:

  • macOS: <OpenCL/opencl.h>
  • Linux/Windows: <CL/cl.h>

2. OpenCL Kernel Source

const char* kernelSource = 
"__kernel void vectorAdd(__global const float* A,\n"
"                       __global const float* B,\n"
"                       __global float* C,\n"
"                       const int numElements) {\n"
"    int i = get_global_id(0);\n"
"    if (i < numElements) {\n"
"        C[i] = A[i] + B[i];\n"
"    }\n"
"}\n";

Key differences from CUDA:

  • __kernel instead of __global__
  • __global memory space qualifier for pointers
  • get_global_id(0) instead of manual thread index calculation
  • OpenCL kernels are compiled at runtime from source strings
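
For comparison, in a one-dimensional launch get_global_id(0) returns the same value a CUDA kernel would compute manually as blockIdx.x * blockDim.x + threadIdx.x.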

3. Error Handling

OpenCL requires extensive error checking. The example includes:

void checkError(cl_int error, const char* operation) {
    if (error != CL_SUCCESS) {
        printf("Error during %s: %d\n", operation, error);
        exit(1);
    }
}

The example also includes a more comprehensive error-string function for debugging.
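
That helper is not reproduced here; a minimal sketch of the idea, covering only a handful of the status codes the real function likely handles (the name clErrorString is illustrative, not necessarily the one used in the file), could look like:

const char* clErrorString(cl_int error) {
    // Map a few common OpenCL status codes to readable names
    switch (error) {
        case CL_SUCCESS:                       return "CL_SUCCESS";
        case CL_DEVICE_NOT_FOUND:              return "CL_DEVICE_NOT_FOUND";
        case CL_MEM_OBJECT_ALLOCATION_FAILURE: return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
        case CL_OUT_OF_RESOURCES:              return "CL_OUT_OF_RESOURCES";
        case CL_OUT_OF_HOST_MEMORY:            return "CL_OUT_OF_HOST_MEMORY";
        case CL_INVALID_KERNEL_ARGS:           return "CL_INVALID_KERNEL_ARGS";
        case CL_INVALID_WORK_GROUP_SIZE:       return "CL_INVALID_WORK_GROUP_SIZE";
        default:                               return "Unknown OpenCL error";
    }
}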

4. Platform and Device Discovery

Unlike CUDA, which implicitly targets NVIDIA GPUs, OpenCL requires explicit platform and device discovery:

// Get platform
ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
checkError(ret, "getting platform IDs");
 
// Get device (prefer GPU, fallback to any device)
ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, &ret_num_devices);
if (ret != CL_SUCCESS) {
    printf("No GPU found, trying any device type...\n");
    ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, 1, &device_id, &ret_num_devices);
    checkError(ret, "getting device IDs");
}

This code:

  1. Finds the first available OpenCL platform
  2. Tries to get a GPU device
  3. Falls back to any available device if no GPU is found
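
The device properties printed in the expected output below come from clGetDeviceInfo. A minimal sketch of those queries, assuming the same device_id variable (the example's actual code may differ in detail):

char deviceName[256];
cl_ulong globalMem;
cl_uint computeUnits;
size_t maxWorkGroup;

clGetDeviceInfo(device_id, CL_DEVICE_NAME, sizeof(deviceName), deviceName, NULL);
clGetDeviceInfo(device_id, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(globalMem), &globalMem, NULL);
clGetDeviceInfo(device_id, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(computeUnits), &computeUnits, NULL);
clGetDeviceInfo(device_id, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(maxWorkGroup), &maxWorkGroup, NULL);

printf("Using device: %s\n", deviceName);
printf("Global memory: %llu MB\n", (unsigned long long)(globalMem / (1024 * 1024)));
printf("Compute units: %u\n", computeUnits);
printf("Max work group size: %zu\n", maxWorkGroup);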

5. Context and Command Queue Creation

// Create OpenCL context
cl_context context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &ret);
checkError(ret, "creating context");
 
// Create command queue
cl_command_queue command_queue = clCreateCommandQueue(context, device_id, 0, &ret);
checkError(ret, "creating command queue");

OpenCL uses:

  • Context: Manages devices and memory objects
  • Command Queue: Queues operations for execution on a device
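
Note that clCreateCommandQueue is deprecated as of OpenCL 2.0 in favor of clCreateCommandQueueWithProperties. A sketch of the newer call (not what the example uses):

// OpenCL 2.0+ replacement for the deprecated clCreateCommandQueue call above
cl_queue_properties props[] = { 0 };   // empty, zero-terminated property list
cl_command_queue queue = clCreateCommandQueueWithProperties(context, device_id, props, &ret);
checkError(ret, "creating command queue");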

6. Memory Management

// Create memory buffers on device
cl_mem d_A = clCreateBuffer(context, CL_MEM_READ_ONLY, dataSize, NULL, &ret);
cl_mem d_B = clCreateBuffer(context, CL_MEM_READ_ONLY, dataSize, NULL, &ret);
cl_mem d_C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, dataSize, NULL, &ret);
 
// Copy data to device buffers
ret = clEnqueueWriteBuffer(command_queue, d_A, CL_TRUE, 0, dataSize, h_A, 0, NULL, NULL);
ret = clEnqueueWriteBuffer(command_queue, d_B, CL_TRUE, 0, dataSize, h_B, 0, NULL, NULL);

Key differences from CUDA:

  • clCreateBuffer() instead of cudaMalloc()
  • Memory access patterns specified at creation (CL_MEM_READ_ONLY, CL_MEM_WRITE_ONLY)
  • clEnqueueWriteBuffer() instead of cudaMemcpy()
  • All operations are queued on command queues
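
As an alternative (not used by this example), the initial host-to-device copies can be folded into buffer creation with the CL_MEM_COPY_HOST_PTR flag:

// Copy host data at buffer-creation time instead of issuing separate write commands
cl_mem d_A = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, dataSize, h_A, &ret);
cl_mem d_B = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, dataSize, h_B, &ret);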

7. Runtime Compilation

// Create program from source
cl_program program = clCreateProgramWithSource(context, 1, &kernelSource, NULL, &ret);
checkError(ret, "creating program");
 
// Build program
ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
if (ret != CL_SUCCESS) {
    // Get build log for debugging
    size_t log_size;
    clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    char *log = (char*)malloc(log_size);
    clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
    printf("Build log:\n%s\n", log);
    free(log);
    exit(1);
}

OpenCL compiles kernels at runtime, allowing for:

  • Platform-specific optimizations
  • Runtime kernel generation
  • Better portability across vendors

8. Kernel Execution

// Create kernel
cl_kernel kernel = clCreateKernel(program, "vectorAdd", &ret);
 
// Set kernel arguments
ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void*)&d_A);
ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void*)&d_B);
ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void*)&d_C);
ret = clSetKernelArg(kernel, 3, sizeof(int), (void*)&numElements);
 
// Execute kernel
size_t globalWorkSize = numElements;
size_t localWorkSize = 256; // Work group size
 
// Adjust global work size to be multiple of local work size
if (globalWorkSize % localWorkSize != 0) {
    globalWorkSize = ((globalWorkSize / localWorkSize) + 1) * localWorkSize;
}
 
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &globalWorkSize, &localWorkSize, 0, NULL, NULL);

Key concepts:

  • Global Work Size: Total number of work items (similar to total threads in CUDA)
  • Local Work Size: Work group size (similar to block size in CUDA)
  • The global work size must be a multiple of the local work size
  • Arguments are set individually with type and size information
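
For the sizes used in this example, 50000 work items rounded up to a multiple of 256 gives ((50000 / 256) + 1) * 256 = 196 * 256 = 50176, which is exactly the global work size reported in the expected output below; the 176 extra work items are filtered out by the if (i < numElements) guard inside the kernel.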

9. Synchronization and Results

// Wait for kernel to complete
ret = clFinish(command_queue);
checkError(ret, "waiting for kernel to finish");
 
// Read result back to host
ret = clEnqueueReadBuffer(command_queue, d_C, CL_TRUE, 0, dataSize, h_C, 0, NULL, NULL);

  • clFinish() waits for all queued operations to complete
  • clEnqueueReadBuffer() with CL_TRUE performs a blocking read
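
After verification, the program should also release its OpenCL objects and host buffers. A typical teardown sequence, assuming the variable names used above (the file may order these differently), looks like:

// Release OpenCL objects and host memory once results are verified
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseMemObject(d_A);
clReleaseMemObject(d_B);
clReleaseMemObject(d_C);
clReleaseCommandQueue(command_queue);
clReleaseContext(context);
free(h_A);
free(h_B);
free(h_C);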

CUDA vs OpenCL Comparison

Aspect             | CUDA                               | OpenCL
-------------------|------------------------------------|--------------------------------------------------
Vendor             | NVIDIA only                        | Cross-platform (NVIDIA, AMD, Intel, etc.)
Language           | C++ with extensions                | C99 with extensions
Compilation        | Compile-time (nvcc)                | Runtime compilation
Memory Model       | Implicit global memory             | Explicit memory spaces (__global, __local, etc.)
Thread Indexing    | Manual calculation                 | Built-in functions (get_global_id())
Error Handling     | Return codes + cudaGetLastError()  | Return codes for all functions
Kernel Launch      | <<<blocks, threads>>> syntax       | clEnqueueNDRangeKernel()
Memory Management  | cudaMalloc, cudaMemcpy             | clCreateBuffer, clEnqueueWriteBuffer

Performance Considerations

  1. Work Group Size

    • Similar to the CUDA block size
    • Should be a multiple of 32 (the warp size) on NVIDIA GPUs
    • Should be a multiple of 64 (the wavefront size) on AMD GPUs
  2. Memory Access Patterns

    • Coalesced access is still important
    • OpenCL provides more explicit control over memory spaces
  3. Kernel Compilation

    • Runtime compilation adds overhead
    • Compiled binaries can be cached for production use (see the sketch after this list)
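
A minimal sketch of the binary-caching idea for a single device, with error checks omitted and assuming the program, context, and device_id variables from above (the example itself does not do this):

// Retrieve the compiled binary so it can be written to a cache file
size_t binarySize;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(binarySize), &binarySize, NULL);

unsigned char *binary = (unsigned char*)malloc(binarySize);
clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(binary), &binary, NULL);

// ...store 'binary' on disk; on a later run, rebuild the program from it:
cl_int binaryStatus;
cl_program cached = clCreateProgramWithBinary(context, 1, &device_id, &binarySize,
                                              (const unsigned char**)&binary,
                                              &binaryStatus, &ret);
ret = clBuildProgram(cached, 1, &device_id, NULL, NULL, NULL);
free(binary);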

Common Issues and Debugging

  1. No OpenCL Platforms Found

    Solution: Install OpenCL runtime for your hardware
    - NVIDIA: Install CUDA toolkit
    - AMD: Install ROCm or Adrenalin drivers
    - Intel: Install Intel OpenCL runtime
    
  2. Kernel Compilation Failures

    Solution: Check build log output
    - The example prints detailed compilation errors
    - Common issues: syntax errors, unsupported features
    
  3. Work Size Errors

    Solution: Ensure global work size is multiple of local work size
    - The example automatically adjusts work sizes
    
  4. Memory Errors

    Solution: Check buffer creation and data transfer
    - Verify sufficient device memory
    - Check buffer access patterns in kernel
    

Expected Output

When running successfully, you should see:

OpenCL Vector addition of 50000 elements
Using OpenCL platform: NVIDIA CUDA
Using device: Tesla P40
Device type: GPU
Global memory: 22906 MB
Compute units: 60
Max work group size: 1024
OpenCL kernel launch with global work size 50176 and local work size 256
Verifying results...
Test PASSED
Done

Advanced Features

This basic example can be extended to explore:

  1. Multiple Devices: Run on multiple GPUs/CPUs simultaneously
  2. Asynchronous Execution: Use events for fine-grained synchronization
  3. Image Processing: Use OpenCL image objects and samplers
  4. Local Memory: Utilize __local memory for shared data
  5. Profiling: Enable command queue profiling for performance analysis (see the sketch below)
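
As a starting point for item 5, a minimal profiling sketch using event timestamps, assuming the same kernel and work-size variables as above:

// Create the queue with profiling enabled (OpenCL 1.x style)
cl_command_queue queue = clCreateCommandQueue(context, device_id, CL_QUEUE_PROFILING_ENABLE, &ret);

// Attach an event to the kernel launch and wait for it to finish
cl_event event;
ret = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalWorkSize, &localWorkSize, 0, NULL, &event);
clWaitForEvents(1, &event);

// Read start/end timestamps (in nanoseconds) and report the elapsed time
cl_ulong start, end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
printf("Kernel time: %.3f ms\n", (end - start) * 1e-6);
clReleaseEvent(event);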

Building for Different Platforms

The example includes conditional compilation for different platforms and can be adapted for:

  • NVIDIA GPUs (via CUDA OpenCL implementation)
  • AMD GPUs (via ROCm or proprietary drivers)
  • Intel CPUs/GPUs (via Intel OpenCL runtime)
  • ARM Mali GPUs (via the Arm Mali OpenCL driver)

This makes OpenCL an excellent choice for cross-platform GPU computing applications.