CUDA GPU Profiling and Tracing
This document provides a comprehensive guide to profiling and tracing CUDA applications to identify performance bottlenecks and optimize GPU code execution.
You can find the code at https://github.com/eunomia-bpf/basic-cuda-tutorial
Table of Contents
- Introduction to GPU Profiling
- Profiling Tools
- Key Performance Metrics
- Profiling Methodology
- Common Performance Bottlenecks
- Tracing Techniques
- Example Application
- Best Practices
- Further Reading
Introduction to GPU Profiling
GPU profiling is the process of measuring and analyzing the performance characteristics of GPU applications. It helps developers:
- Identify performance bottlenecks
- Optimize resource utilization
- Understand execution patterns
- Validate optimization decisions
- Ensure scalability across different hardware
Effective profiling is essential for high-performance CUDA applications: GPU architectures are complex enough that intuition alone is an unreliable guide to optimization.
Profiling Tools
NVIDIA Nsight Systems
Nsight Systems is a system-wide performance analysis tool that provides insights into CPU and GPU execution:
- System-level tracing: CPU, GPU, memory, and I/O activities
- Timeline visualization: Shows kernel execution, memory transfers, and CPU activity
- API trace: Captures CUDA API calls and their durations
- Low overhead: Suitable for production code profiling
NVIDIA Nsight Compute
Nsight Compute is an interactive kernel profiler for CUDA applications:
- Detailed kernel metrics: SM utilization, memory throughput, instruction mix
- Guided analysis: Provides optimization recommendations
- Roofline analysis: Shows performance relative to hardware limits
- Kernel comparison: Compare kernels across runs or hardware platforms
NVIDIA Visual Profiler and nvprof
Legacy tools (deprecated but still useful for older CUDA versions):
- nvprof: Command-line profiler with low overhead
- Visual Profiler: GUI-based analysis tool
- CUDA profiling APIs: Allow programmatic access to profiling data
Other Tools
- Compute Sanitizer: Memory access checking and race detection
- CUPTI: CUDA Profiling Tools Interface for custom profilers
- PyTorch/TensorFlow Profilers: Framework-specific profiling for deep learning
Key Performance Metrics
Execution Metrics
- SM Occupancy: Ratio of active warps to maximum possible warps
  - Higher values generally enable better latency hiding
  - Target: >50% for most applications
- Warp Execution Efficiency: Percentage of threads active during execution
  - Lower values indicate branch divergence
  - Target: >80% for compute-bound kernels
- Instruction Throughput:
  - Instructions per clock (IPC)
  - Arithmetic intensity (operations per byte)
  - Mix of instruction types
Memory Metrics
- Memory Throughput:
  - Global memory read/write bandwidth
  - Shared memory bandwidth
  - L1/L2 cache hit rates
  - Target: As close to peak hardware bandwidth as possible
- Memory Access Patterns:
  - Load/store efficiency
  - Global memory coalescing rate
  - Shared memory bank conflicts
- Data Transfer:
  - Host-device transfer bandwidth
  - PCIe utilization
  - NVLink utilization (if available)
Compute Metrics
- Compute Utilization:
  - SM activity
  - Tensor/RT core utilization (if used)
  - Instruction mix (FP32, FP64, INT, etc.)
- Compute Efficiency:
  - Achieved vs. theoretical FLOPS
  - Resource limitations (compute- vs. memory-bound)
  - Roofline model position
Profiling Methodology
A structured approach to profiling CUDA applications:
1. Initial Assessment
- Start with high-level system profiling (Nsight Systems)
- Identify time distribution between CPU, GPU, and data transfers
- Look for obvious bottlenecks like excessive synchronization or transfers
2. Kernel Analysis
- Profile individual kernels (Nsight Compute)
- Identify the most time-consuming kernels
- Collect key metrics for these kernels
3. Bottleneck Identification
- Determine if kernels are compute-bound or memory-bound
- Use the roofline model to understand performance limiters
- Check for specific inefficiencies (divergence, non-coalesced access)
4. Guided Optimization
- Address the most significant bottlenecks first
- Make one change at a time and measure impact
- Compare before/after profiles to validate improvements
5. Iterative Refinement
- Repeat the process for the next bottleneck
- Re-profile the entire application periodically
- Continue until performance goals are met
Common Performance Bottlenecks
Memory-Related Issues
- Non-coalesced Memory Access:
  - Symptoms: Low global memory load/store efficiency
  - Solution: Reorganize data layout or access patterns
- Shared Memory Bank Conflicts:
  - Symptoms: Low shared memory bandwidth
  - Solution: Adjust padding or access patterns
- Excessive Global Memory Access:
  - Symptoms: High memory dependency
  - Solution: Increase data reuse through shared memory or registers
Execution-Related Issues
- Warp Divergence:
  - Symptoms: Low warp execution efficiency
  - Solution: Reorganize algorithms to minimize divergent paths
- Low Occupancy:
  - Symptoms: SM occupancy below 50%
  - Solution: Reduce register/shared memory usage or adjust block size
- Kernel Launch Overhead:
  - Symptoms: Many small, short-duration kernels
  - Solution: Kernel fusion or persistent kernels
System-Level Issues
- Excessive Host-Device Transfers:
  - Symptoms: High PCIe utilization, many transfer operations
  - Solution: Batch transfers, use pinned memory, or unified memory
- CPU-GPU Synchronization:
  - Symptoms: GPU idle periods between kernels
  - Solution: Use CUDA streams, asynchronous operations
- Underutilized GPU Resources:
  - Symptoms: Low overall GPU utilization
  - Solution: Concurrent kernels, streams, or increase problem size
Tracing Techniques
Tracing provides a timeline view of application execution:
CUDA Events
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
myKernel<<<grid, block>>>(data);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Kernel execution time: %f ms\n", milliseconds);

cudaEventDestroy(start);
cudaEventDestroy(stop);
NVTX Markers and Ranges
NVIDIA Tools Extension (NVTX) allows custom annotations:
#include <nvtx3/nvToolsExt.h>

// Mark an instantaneous event
nvtxMark("Interesting point");

// Begin a range
nvtxRangePushA("Data preparation");
// ... code ...
nvtxRangePop(); // End the range

// Colored range for better visibility
nvtxEventAttributes_t eventAttrib = {0};
eventAttrib.version = NVTX_VERSION;
eventAttrib.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
eventAttrib.colorType = NVTX_COLOR_ARGB;
eventAttrib.color = 0xFF00FF00; // Green
eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII;
eventAttrib.message.ascii = "Kernel Execution";
nvtxRangePushEx(&eventAttrib);
myKernel<<<grid, block>>>(data);
nvtxRangePop();
Programmatic Profiling with CUPTI
CUDA Profiling Tools Interface (CUPTI) enables programmatic access to profiling data:
// Simplified CUPTI usage example
#include <cupti.h>

void CUPTIAPI callbackHandler(void *userdata, CUpti_CallbackDomain domain,
                              CUpti_CallbackId cbid, const void *cbInfo) {
    // Handle callback
}

// Initialize CUPTI and register callbacks
CUpti_SubscriberHandle subscriber;
cuptiSubscribe(&subscriber, callbackHandler, NULL);
cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API,
                    CUPTI_RUNTIME_TRACE_CBID_cudaLaunch_v3020);
Example Application
The accompanying basic08.cu demonstrates:
- Basic kernel timing: Using CUDA events
- NVTX annotations: Adding markers and ranges
- Memory transfer profiling: Analyzing host-device transfers
- Kernel optimization: Comparing different implementation strategies
- Interpreting profiling data: Making optimization decisions
Key Code Sections
Basic kernel timing:
#include <cstdio>
#include <cuda_runtime.h>

#define SIZE (1 << 20)  // number of elements (assumed for this example)

__global__ void computeKernel(float *input, float *output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float x = input[idx];
        // Perform computation
        float result = x * x + x + 1.0f;
        output[idx] = result;
    }
}

void timeKernel() {
    // Allocate device memory
    float *d_input, *d_output;
    cudaMalloc(&d_input, SIZE * sizeof(float));
    cudaMalloc(&d_output, SIZE * sizeof(float));

    // Initialize host data and copy it to the device
    float *h_input = new float[SIZE];
    for (int i = 0; i < SIZE; i++) h_input[i] = i;
    cudaMemcpy(d_input, h_input, SIZE * sizeof(float), cudaMemcpyHostToDevice);

    // Timing with events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up run (excludes one-time startup costs from the measurement)
    computeKernel<<<(SIZE + 255) / 256, 256>>>(d_input, d_output, SIZE);

    // Timed run
    cudaEventRecord(start);
    computeKernel<<<(SIZE + 255) / 256, 256>>>(d_input, d_output, SIZE);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);
    printf("Kernel execution time: %f ms\n", milliseconds);

    // Cleanup
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    delete[] h_input;
    cudaFree(d_input);
    cudaFree(d_output);
}
Best Practices
Profiling Workflow
- Start with high-level profiling before diving into details
- Establish baselines for important kernels
- Profile regularly during development, not just at the end
- Automate profiling where possible for regression testing
- Compare across hardware to ensure portability
Tool Selection
- Nsight Systems for system-level analysis and timeline
- Nsight Compute for detailed kernel metrics
- NVTX markers for custom annotations
- CUDA events for lightweight timing measurements
Optimization Approach
- Focus on hotspots: Address the most time-consuming operations first
- Use roofline analysis: Understand theoretical limits
- Balance efforts: Don't over-optimize less critical sections
- Consider trade-offs: Sometimes readability > minor performance gains
- Document insights: Record profiling discoveries for future reference
Further Reading
Last updated: May 25, 2025
First published: May 25, 2025
Contributors: officeyutong