CUDA Programming Methods Comparison: Matrix Multiplication
This example demonstrates and compares different approaches to GPU programming using CUDA, focusing on a matrix multiplication problem. By implementing the same algorithm using various techniques, we can understand the trade-offs between programming complexity, performance, and code maintainability.
You can find the code at https://github.com/eunomia-bpf/basic-cuda-tutorial.
Overview
Matrix multiplication is a classic problem that benefits greatly from parallelization. This example implements a simple matrix multiplication C = A × B using seven different approaches:
- Standard CUDA C/C++
- CUDA with Inline PTX Assembly
- CUDA Unified Memory
- CUDA Shared Memory
- Thrust (high-level C++ abstraction)
- CUDA Streams
- CUDA Dynamic Parallelism
For each implementation, we measure and compare the execution time, verifying the correctness of results against a CPU implementation.
Programming Methods Explained
1. Standard CUDA C/C++
The standard CUDA approach uses explicit memory management and a kernel function:
__global__ void matrix_multiply_cuda(float *A, float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++) {
            sum += A[row * n + k] * B[k * n + col];
        }
        C[row * n + col] = sum;
    }
}
This implementation requires manual:
- Memory allocation on both host and device
- Data transfer between host and device
- Kernel launch configuration
- Synchronization
- Memory deallocation
Advantages:
- Direct control over memory management
- Good performance
- Familiar programming model
Disadvantages:
- Requires explicit memory management
- More verbose code
- Memory transfers can be a bottleneck
Implementation Details:
- Each thread computes one element of the output matrix
- Threads are organized in a 2D grid that maps directly to the output matrix dimensions
- The kernel has O(n) computational complexity per thread
- Global memory is used for all matrix elements
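To make the manual steps listed above concrete, a host-side driver for this kernel might look like the following sketch (the function name `run_standard_cuda` and the 16×16 block configuration are illustrative, and error checking is omitted):

```cuda
// Sketch of a host-side driver for matrix_multiply_cuda (illustrative, no error checks)
void run_standard_cuda(const float *h_A, const float *h_B, float *h_C, int n) {
    size_t bytes = (size_t)n * n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // 1. Allocate device memory
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    // 2. Copy the inputs from host to device
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // 3. Configure and launch the kernel: one thread per output element
    dim3 threads(16, 16);
    dim3 grid((n + threads.x - 1) / threads.x, (n + threads.y - 1) / threads.y);
    matrix_multiply_cuda<<<grid, threads>>>(d_A, d_B, d_C, n);

    // 4. Wait for completion, then copy the result back
    cudaDeviceSynchronize();
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    // 5. Release device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
```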
2. CUDA with Inline PTX Assembly
PTX (Parallel Thread Execution) is NVIDIA's low-level assembly language. We can embed PTX directly in CUDA C++ code:
__device__ float multiply_add_ptx(float a, float b, float c) {
    float result;
    asm("fma.rn.f32 %0, %1, %2, %3;" : "=f"(result) : "f"(a), "f"(b), "f"(c));
    return result;
}
This implementation uses a fused multiply-add (FMA) instruction, which computes a * b + c in a single operation, potentially improving both performance and numerical accuracy.
Advantages:
- Fine-grained control over specific operations
- Can leverage architecture-specific instructions
- Potentially better performance for critical sections
- Direct access to hardware features not exposed in CUDA C/C++
Disadvantages:
- Least portable across different architectures
- Most complex to write and maintain
- Requires deep knowledge of GPU architecture
- Architecture-specific optimizations may become obsolete with newer hardware
Implementation Details:
- Uses the `fma.rn.f32` PTX instruction for fused multiply-add
- The `.rn` suffix specifies round-to-nearest-even rounding mode
- The fused operation issues as a single instruction on compatible hardware, rather than a separate multiply and add
- PTX assembly allows precise control over instruction selection and scheduling
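As an illustration of how the helper could be used, the inner loop of the naive kernel might call it in place of the compiler-generated multiply-add (the kernel name below is illustrative, not necessarily the one used in the example):

```cuda
// Sketch: naive matrix multiply whose inner loop goes through the PTX FMA helper
__global__ void matrix_multiply_ptx(float *A, float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++) {
            // sum = A[row*n+k] * B[k*n+col] + sum, issued as a single fma.rn.f32
            sum = multiply_add_ptx(A[row * n + k], B[k * n + col], sum);
        }
        C[row * n + col] = sum;
    }
}
```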
3. CUDA Unified Memory
Unified Memory provides a single memory space accessible by both CPU and GPU:
cudaMallocManaged(&u_A, matrix_size);
cudaMallocManaged(&u_B, matrix_size);
cudaMallocManaged(&u_C, matrix_size);
The kernel code remains the same as the standard CUDA version, but memory management is simplified.
Advantages:
- Simplified memory management
- No explicit data transfers
- Easier programming model
- Automatic page migration between CPU and GPU
- Enables larger-than-GPU-memory datasets
Disadvantages:
- Potentially lower performance due to automatic page migration
- Less control over data movement
- Performance depends heavily on the access pattern
- First-touch overhead and potential page faults
Implementation Details:
- Uses `cudaMallocManaged()` instead of separate `malloc()` and `cudaMalloc()`
- The CUDA runtime handles data transfers automatically
- Memory pages migrate on-demand between CPU and GPU
- The same pointer can be used on both host and device
- Prefetching hints can be used to optimize data movement (not shown in this example)
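A minimal unified-memory driver might look like the sketch below; the prefetch calls are the optional hints mentioned above, and the function name and initialization values are illustrative:

```cuda
// Sketch: unified memory version (illustrative, no error checks)
void run_unified_memory(int n) {
    size_t bytes = (size_t)n * n * sizeof(float);
    float *u_A, *u_B, *u_C;
    cudaMallocManaged(&u_A, bytes);
    cudaMallocManaged(&u_B, bytes);
    cudaMallocManaged(&u_C, bytes);

    // Initialize directly on the host -- no explicit cudaMemcpy is needed
    for (int i = 0; i < n * n; i++) {
        u_A[i] = 1.0f;
        u_B[i] = 2.0f;
    }

    // Optional hint: prefetch the input pages to the GPU before the kernel runs
    int device = 0;
    cudaGetDevice(&device);
    cudaMemPrefetchAsync(u_A, bytes, device);
    cudaMemPrefetchAsync(u_B, bytes, device);

    dim3 threads(16, 16);
    dim3 grid((n + threads.x - 1) / threads.x, (n + threads.y - 1) / threads.y);
    matrix_multiply_cuda<<<grid, threads>>>(u_A, u_B, u_C, n);
    cudaDeviceSynchronize();   // required before the host reads u_C

    cudaFree(u_A);
    cudaFree(u_B);
    cudaFree(u_C);
}
```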
4. CUDA Shared Memory
Shared memory is a fast on-chip memory accessible by all threads in a block:
__global__ void matrix_multiply_shared(float *A, float *B, float *C, int n) {
    __shared__ float shared_A[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float shared_B[BLOCK_SIZE][BLOCK_SIZE];
    // Load tiles of matrices into shared memory
    // ...
    // Compute using the faster shared memory
    // ...
}
This implementation divides matrices into tiles that fit in shared memory, reducing global memory accesses.
Advantages:
- Much faster memory access (typically 100x faster than global memory)
- Reduced global memory bandwidth requirements
- Better performance for memory-bound applications
- Enables data reuse within thread blocks
Disadvantages:
- Limited size of shared memory (typically 48KB-64KB per SM)
- More complex programming model
- Requires careful management of tile sizes
- Potential for bank conflicts if not carefully designed
Implementation Details:
- Matrices are divided into tiles of size BLOCK_SIZE × BLOCK_SIZE
- Each thread block loads one tile from each input matrix into shared memory
- Threads within a block synchronize using `__syncthreads()`
- Each thread computes one element of the output tile
- The algorithm reduces global memory accesses by a factor of BLOCK_SIZE
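The elided tiling logic could be filled in roughly as follows; the BLOCK_SIZE value of 16 is an assumption, and the bounds checks let the kernel handle sizes that are not multiples of the tile width:

```cuda
#define BLOCK_SIZE 16   // assumed tile width

__global__ void matrix_multiply_shared(float *A, float *B, float *C, int n) {
    __shared__ float shared_A[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float shared_B[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Walk over all tiles that contribute to this output element
    for (int t = 0; t < (n + BLOCK_SIZE - 1) / BLOCK_SIZE; t++) {
        int a_col = t * BLOCK_SIZE + threadIdx.x;
        int b_row = t * BLOCK_SIZE + threadIdx.y;

        // Each thread loads one element of each tile (zero-padded at the edges)
        shared_A[threadIdx.y][threadIdx.x] = (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        shared_B[threadIdx.y][threadIdx.x] = (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();                      // wait until both tiles are loaded

        for (int k = 0; k < BLOCK_SIZE; k++)  // multiply the tiles out of shared memory
            sum += shared_A[threadIdx.y][k] * shared_B[k][threadIdx.x];
        __syncthreads();                      // wait before overwriting the tiles
    }

    if (row < n && col < n)
        C[row * n + col] = sum;
}
```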
5. Thrust High-Level Implementation
Thrust is a C++ template library for CUDA that provides high-level abstractions:
void run_thrust_implementation(float *h_A, float *h_B, float *h_C, int n) {
    thrust::device_vector<float> d_A(h_A, h_A + n * n);
    thrust::device_vector<float> d_B(h_B, h_B + n * n);
    thrust::device_vector<float> d_C(n * n);
    // Create a 2D index space and transform
    // ...
}
Advantages:
- Most concise code
- High-level abstractions similar to STL
- Automatic memory management
- Highly reusable components
- Reduced development time and fewer bugs
Disadvantages:
- Less control over implementation details
- May be less performant for specific use cases
- Can be harder to debug
- Potential overhead from abstractions
Implementation Details:
- Uses `thrust::device_vector` for automatic GPU memory management
- Leverages the `thrust::transform` algorithm with a custom functor
- Creates a 2D index space using `thrust::make_zip_iterator` and `thrust::counting_iterator`
- The functor implements the matrix multiplication for each element
- Memory transfers are handled automatically by the Thrust containers
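The elided functor and transform call could look roughly like the sketch below. It uses a plain `counting_iterator` over flattened element indices rather than the zip-iterator construction described above, and the functor name `matmul_element` is illustrative:

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>

// Illustrative functor: computes one element of C = A x B from a flat index
struct matmul_element {
    const float *A, *B;
    int n;
    matmul_element(const float *A_, const float *B_, int n_) : A(A_), B(B_), n(n_) {}
    __host__ __device__ float operator()(int idx) const {
        int row = idx / n, col = idx % n;
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += A[row * n + k] * B[k * n + col];
        return sum;
    }
};

void thrust_matmul_sketch(const float *h_A, const float *h_B, float *h_C, int n) {
    // device_vector constructors copy the host data to the GPU automatically
    thrust::device_vector<float> d_A(h_A, h_A + n * n);
    thrust::device_vector<float> d_B(h_B, h_B + n * n);
    thrust::device_vector<float> d_C(n * n);

    // One transform call: flat element index -> value of C[idx]
    thrust::transform(thrust::counting_iterator<int>(0),
                      thrust::counting_iterator<int>(n * n),
                      d_C.begin(),
                      matmul_element(thrust::raw_pointer_cast(d_A.data()),
                                     thrust::raw_pointer_cast(d_B.data()), n));

    // Copy the result back to the host
    thrust::copy(d_C.begin(), d_C.end(), h_C);
}
```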
6. CUDA Streams
CUDA streams enable concurrent operations on the GPU:
void run_cuda_streams_implementation(float *h_A, float *h_B, float *h_C, int n) {
    // Create multiple CUDA streams
    const int numStreams = 4;
    cudaStream_t streams[numStreams];
    // Divide work among streams
    for (int i = 0; i < numStreams; i++) {
        // Asynchronous memory transfers and kernel launches
        cudaMemcpyAsync(..., streams[i]);
        matrix_multiply_cuda<<<grid, threads, 0, streams[i]>>>(...);
        cudaMemcpyAsync(..., streams[i]);
    }
    // Synchronize all streams
    for (int i = 0; i < numStreams; i++) {
        cudaStreamSynchronize(streams[i]);
    }
}
Advantages:
- Overlaps computation with data transfers
- Enables concurrent kernel execution
- Better utilization of GPU resources
- Can significantly improve overall throughput
- Good for processing independent data chunks
Disadvantages:
- More complex synchronization
- Harder to debug and reason about
- Potential for race conditions if not carefully designed
- Benefits depend on hardware capabilities
Implementation Details:
- Divides the matrix into horizontal strips, one per stream
- Each stream processes its strip independently
- Uses `cudaMemcpyAsync()` for non-blocking memory transfers
- Launches kernels in different streams with the stream parameter
- Each stream has its own command queue that executes independently
- Final synchronization ensures all operations complete
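Filled in, the strip-wise division might look like the sketch below. The strip kernel `matrix_multiply_strip` is a hypothetical variant of the naive kernel, and pinned host memory (which `cudaMemcpyAsync` needs in order to actually overlap with computation) is assumed but not shown:

```cuda
// Hypothetical kernel: computes only rows [rowStart, rowStart + rows) of C
__global__ void matrix_multiply_strip(const float *A, const float *B, float *C,
                                      int n, int rowStart, int rows) {
    int row = rowStart + blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rowStart + rows && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

void run_streams_sketch(const float *h_A, const float *h_B, float *h_C, int n) {
    const int numStreams = 4;
    int rowsPerStream = (n + numStreams - 1) / numStreams;
    size_t bytes = (size_t)n * n * sizeof(float);

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);   // every strip needs all of B

    cudaStream_t streams[numStreams];
    for (int i = 0; i < numStreams; i++) cudaStreamCreate(&streams[i]);

    dim3 threads(16, 16);
    for (int i = 0; i < numStreams; i++) {
        int rowStart = i * rowsPerStream;
        if (rowStart >= n) break;
        int rows = (rowStart + rowsPerStream <= n) ? rowsPerStream : n - rowStart;
        size_t stripBytes = (size_t)rows * n * sizeof(float);

        // Queue copy-in, compute, and copy-out on the same stream;
        // different strips can then overlap with each other
        cudaMemcpyAsync(d_A + (size_t)rowStart * n, h_A + (size_t)rowStart * n,
                        stripBytes, cudaMemcpyHostToDevice, streams[i]);
        dim3 grid((n + threads.x - 1) / threads.x, (rows + threads.y - 1) / threads.y);
        matrix_multiply_strip<<<grid, threads, 0, streams[i]>>>(d_A, d_B, d_C, n, rowStart, rows);
        cudaMemcpyAsync(h_C + (size_t)rowStart * n, d_C + (size_t)rowStart * n,
                        stripBytes, cudaMemcpyDeviceToHost, streams[i]);
    }

    for (int i = 0; i < numStreams; i++) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
```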
7. Dynamic Parallelism
Dynamic parallelism allows CUDA kernels to launch nested kernels:
__global__ void matrix_multiply_dynamic_parent(float *A, float *B, float *C, int n) {
    // Calculate submatrix position based on thread index
    // ...
    // Launch a child kernel for this submatrix
    multiply_submatrix<<<dimGrid, dimBlock>>>(A, B, C, n, row_start, col_start, subsize);
}
Advantages:
- Enables adaptive algorithms that refine work dynamically
- Allows recursive problem decomposition
- Can handle irregular workloads efficiently
- Reduces CPU-GPU coordination overhead
- More natural expression of divide-and-conquer algorithms
Disadvantages:
- Additional overhead of kernel launch from device
- More complex resource management
- Requires newer GPU architectures (Compute Capability 3.5+)
- Potentially deeper call stacks and more register usage
Implementation Details:
- Parent kernel divides the matrix into large submatrices
- Each thread in the parent grid launches a child grid to process one submatrix
- Child kernels work on smaller, more manageable chunks
- Requires the `-rdc=true` compilation flag for relocatable device code
- The parent grid synchronizes automatically with all child grids at kernel end
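A simplified parent/child pair might look like the sketch below; the submatrix size of 64 and the launch configurations are illustrative choices, not necessarily those used in the example code:

```cuda
// Illustrative child kernel: computes one subsize x subsize tile of C
__global__ void multiply_submatrix(float *A, float *B, float *C, int n,
                                   int row_start, int col_start, int subsize) {
    int row = row_start + blockIdx.y * blockDim.y + threadIdx.y;
    int col = col_start + blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n && row < row_start + subsize && col < col_start + subsize) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

// Illustrative parent kernel: each thread owns one submatrix and launches a child grid for it
__global__ void matrix_multiply_dynamic_parent(float *A, float *B, float *C, int n) {
    const int subsize = 64;    // assumed submatrix edge length
    int row_start = (blockIdx.y * blockDim.y + threadIdx.y) * subsize;
    int col_start = (blockIdx.x * blockDim.x + threadIdx.x) * subsize;
    if (row_start < n && col_start < n) {
        dim3 dimBlock(16, 16);
        dim3 dimGrid((subsize + dimBlock.x - 1) / dimBlock.x,
                     (subsize + dimBlock.y - 1) / dimBlock.y);
        // Device-side launch: requires -rdc=true and Compute Capability 3.5+
        multiply_submatrix<<<dimGrid, dimBlock>>>(A, B, C, n, row_start, col_start, subsize);
    }
}
```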
Memory Hierarchy and Access Patterns
Different memory types in CUDA have vastly different performance characteristics:
| Memory Type | Access Speed | Scope | Lifetime | Caching |
|---|---|---|---|---|
| Global Memory | Slowest | Host & All threads | Application | L2 only |
| Shared Memory | Fast | Thread block | Kernel execution | On-chip |
| Registers | Fastest | Single thread | Thread lifetime | On-chip |
| Constant Memory | Medium | All threads (read) | Application | Special cache |
| Texture Memory | Medium | All threads (read) | Application | Special cache |
| Local Memory | Slow | Single thread | Thread lifetime | L2 only |
| Unified Memory | Variable | Host & All threads | Application | System managed |
Our implementations demonstrate different ways to leverage this memory hierarchy:
- Standard CUDA: Uses global memory for all data
- PTX Assembly: Same memory pattern as standard CUDA but with optimized instructions
- Unified Memory: Uses automatically managed memory
- Shared Memory: Explicitly caches data in on-chip shared memory
- Thrust: Abstracts memory management through containers
- CUDA Streams: Uses global memory with overlapped transfers
- Dynamic Parallelism: Uses global memory with hierarchical access patterns
Advanced Optimization Techniques
Beyond the methods shown in the examples, other optimization techniques include:
- Warp Shuffle Instructions: Exchange data between threads in a warp without using shared memory
- Tensor Cores: Specialized hardware for matrix operations (on newer GPUs)
- Persistent Threads: Keep threads resident for multiple work items
- Register Blocking: Store partial results in registers to reduce memory traffic
- Memory Coalescing: Ensure aligned, contiguous memory access patterns
- Instruction-Level Parallelism: Schedule independent operations to hide latency
- Loop Unrolling: Reduce loop overhead and increase instruction-level parallelism
- Function Inlining: Eliminate function call overhead
- Occupancy Optimization: Balance resource usage to maximize active warps
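As a concrete illustration of the first item, a warp-level sum built on shuffle instructions avoids shared memory entirely (a generic sketch, not taken from the example code):

```cuda
// Sketch: sum a value across the 32 threads of a warp without touching shared memory
__inline__ __device__ float warp_reduce_sum(float val) {
    // Each step pulls a value from a lane 'offset' positions higher in the warp
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // lane 0 ends up holding the sum of all 32 lanes
}
```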
Choosing the Right Approach
Use this decision framework to select the appropriate implementation:
- Standard CUDA when you need a balance of control and readability
- PTX Assembly only for performance-critical sections that can benefit from specific instructions
- Unified Memory for easier development with modest performance requirements
- Shared Memory when memory access is a bottleneck and memory locality can be exploited
- Thrust for rapid development, especially when implementing standard algorithms
- CUDA Streams when you need to overlap computation and data transfers
- Dynamic Parallelism for problems that benefit from recursive decomposition or adaptive refinement
Compiler Flags and Optimization Levels
Different compiler flags can significantly impact performance:
# Basic compilation
nvcc -o basic_version file.cu
# Optimized compilation
nvcc -O3 -arch=sm_70 -o optimized_version file.cu
# With dynamic parallelism
nvcc -rdc=true -O3 -arch=sm_70 -o dynamic_parallel_version file.cu
# With fast math (may reduce precision)
nvcc -O3 -use_fast_math -arch=sm_70 -o fast_math_version file.cu
Our example uses -O3 for high optimization and -rdc=true for dynamic parallelism.
Building and Running
To compile the example:
make basic03
To run:
./basic03
The program will:
- Generate random matrices
- Compute the result on CPU for reference
- Run each GPU implementation
- Verify correctness of each implementation
- Print timing information and speedup compared to CPU
Profiling and Performance Analysis
To analyze the performance of these implementations more deeply, use NVIDIA's profiling tools:
# Basic profiling
nvprof ./basic03
# Detailed timeline
nvprof --export-profile timeline.nvvp ./basic03
# For newer versions, use Nsight Systems
nsys profile ./basic03
Key metrics to monitor:
- Global memory load/store throughput
- Shared memory load/store throughput
- Achieved occupancy
- SM efficiency
- Warp execution efficiency
- Memory bandwidth utilization
Future Directions and Advanced Topics
The field of GPU computing continues to evolve. Some advanced topics to explore:
- Multi-GPU Programming: Distributing work across multiple GPUs
- Heterogeneous Computing: Combining CPU and GPU computation optimally
- Mixed Precision: Using lower precision where appropriate for better performance
- Tensor Core Programming: Leveraging specialized hardware for matrix operations
- Graph-based Execution: Using CUDA Graphs for optimized workflow execution
- Cooperative Groups: More flexible thread synchronization capabilities
- CUDA-aware MPI: Direct GPU-to-GPU communication in distributed systems