CUDA Shared Memory & Performance Optimization Study Guide
Project Overview
This project walks you through four progressively more advanced CUDA examples to master shared memory, kernel performance analysis, and matrix multiplication optimization techniques.
Learning Objectives
- Understand the CUDA memory hierarchy
- Master efficient shared memory usage
- Learn to analyze and optimize CUDA kernel performance
- Implement high-performance matrix multiplication
- Gain proficiency in profiling tools
Recommended Prerequisites
Project Structure
cuda-learning/
├── src/
│   ├── 01_shared_memory_basics.cu
│   ├── 02_matrix_multiply_evolution.cu
│   ├── 03_performance_analysis.cu
│   └── 04_advanced_shared_memory.cu
├── scripts/
│   └── compile_and_run.sh
├── build/
├── docs/
├── Makefile
└── README.md
Environment Setup
System Requirements
- OS: Ubuntu 18.04+ / CentOS 7+ / Windows 10+
- GPU: NVIDIA GPU (Compute Capability 3.0+)
- CUDA Toolkit: 10.0+
- Compiler: GCC 7+ / MSVC 2017+
Installing CUDA Toolkit
Ubuntu/Debian:
# Add NVIDIA package repo
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
# Install CUDA
sudo apt update
sudo apt install cuda-toolkit-11-8
Or via default package manager:
sudo apt install nvidia-cuda-toolkit
Verify Installation
nvidia-smi # Check NVIDIA driver
nvcc --version # Check CUDA compiler
nvidia-smi -L # Check GPU info
Build and Run
Method 1: Using Makefile
make # Build all
make run-example1 # Run example 1
make examples # Run all
make profile # Run performance profiling
Method 2: Using script
chmod +x scripts/compile_and_run.sh
./scripts/compile_and_run.sh 1 # Run example 1
./scripts/compile_and_run.sh -a # Run all
./scripts/compile_and_run.sh -p 3 # Profile example 3
Method 3: Manual
mkdir -p build
nvcc -O3 -arch=sm_50 src/01_shared_memory_basics.cu -o build/example1
./build/example1
Shared Memory Basics
Theoretical Background
CUDA Memory Hierarchy
┌─────────────────────────────┐
│ Global Memory               │  ← Large, high latency
│  • Size: GB                 │
│  • Latency: 400-800 cycles  │
└─────────────────────────────┘
              ↓
┌─────────────────────────────┐
│ L2 Cache                    │
│  • Size: MB                 │
│  • Latency: 200-300 cycles  │
└─────────────────────────────┘
              ↓
┌─────────────────────────────┐
│ L1 Cache / Shared Memory    │  ← Fast & small
│  • Size: 48–164 KB          │
│  • Latency: 20-30 cycles    │
└─────────────────────────────┘
              ↓
┌─────────────────────────────┐
│ Registers                   │  ← Fastest
│  • Latency: 1 cycle         │
└─────────────────────────────┘
Shared Memory Features
- Fast access: 10–100x faster than global memory
- Intra-block sharing: threads in the same block can share data
- Manually managed: the programmer explicitly loads and reuses data (see the declaration sketch below)
- Limited size: tens of KB per SM, so tiles must fit the budget
- Banked architecture: avoid bank conflicts for best performance
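Both allocation styles appear in the examples that follow. A minimal sketch of the two ways to declare shared memory (the kernel names are illustrative and not part of the example files; bounds checks omitted):
// Static allocation: size fixed at compile time
__global__ void static_smem_kernel(const float* in, float* out) {
    __shared__ float tile[256];                      // 256 floats per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = tile[threadIdx.x];
}

// Dynamic allocation: size supplied as the third launch parameter
__global__ void dynamic_smem_kernel(const float* in, float* out) {
    extern __shared__ float sdata[];                 // sized at launch time
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = sdata[threadIdx.x];
}

// Launch with the dynamic shared memory size in bytes:
// dynamic_smem_kernel<<<grid, 256, 256 * sizeof(float)>>>(d_in, d_out);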
Example 1: Shared Memory Basics (01_shared_memory_basics.cu)
Covers:
- Shared vs. global memory speed comparison
- Parallel reduction using shared memory
- Demonstrating and resolving bank conflicts
// Shared memory reduction kernel
__global__ void vector_reduce_shared(float* input, float* output, int n) {
    extern __shared__ float sdata[];                 // sized at launch: blockDim.x floats
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    // Each thread stages one element from global into shared memory
    sdata[tid] = (idx < n) ? input[idx] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory (assumes blockDim.x is a power of two)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Thread 0 adds this block's partial sum to the global result
    if (tid == 0) atomicAdd(output, sdata[0]);
}
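Because sdata is declared extern __shared__, the per-block shared memory size must be passed as the third launch parameter. A minimal host-side launch sketch (block size and variable names are illustrative):
int threads = 256;
int blocks = (n + threads - 1) / threads;
size_t smem_bytes = threads * sizeof(float);   // one float per thread
vector_reduce_shared<<<blocks, threads, smem_bytes>>>(d_input, d_output, n);
cudaDeviceSynchronize();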
Performance wins:
- Reduces global memory reads
- Fast accumulation in shared memory
- Coalesced memory access
Matrix Multiplication Optimization Evolution
Goal
Start from a naive implementation and evolve it into high-performance matrix multiplication using shared memory and bank-conflict avoidance.
Example 2: Matrix Multiply Evolution (02_matrix_multiply_evolution.cu)
Version 1: Naive
__global__ void matrix_mul_naive(float* A, float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        // Every iteration reads A and B directly from global memory
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}
Problem: excessive redundant global memory traffic; every element of A and B is re-read N times, once for each output element that needs it.
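Both the naive and the tiled versions are launched over a 2D grid that covers the N×N output. A sketch (the 16×16 block is an illustrative choice; for the tiled version the block dimensions must equal TILE_SIZE):
dim3 block(16, 16);
dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
matrix_mul_naive<<<grid, block>>>(d_A, d_B, d_C, N);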
Version 2: Tiled (Shared Memory)
__global__ void matrix_mul_shared_basic(float* A, float* B, float* C, int N) {
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_SIZE + ty;
    int col = blockIdx.x * TILE_SIZE + tx;
    float sum = 0.0f;

    // Walk over the tiles of A and B needed for this output tile
    for (int t = 0; t < (N + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Cooperatively load one tile of A and one tile of B, zero-padding past the edge
        if (row < N && t * TILE_SIZE + tx < N)
            As[ty][tx] = A[row * N + t * TILE_SIZE + tx];
        else
            As[ty][tx] = 0.0f;
        if (col < N && t * TILE_SIZE + ty < N)
            Bs[ty][tx] = B[(t * TILE_SIZE + ty) * N + col];
        else
            Bs[ty][tx] = 0.0f;
        __syncthreads();

        // Multiply the two tiles out of shared memory
        for (int k = 0; k < TILE_SIZE; k++)
            sum += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = sum;
}
Version 3: Optimized with Conflict Avoidance
__shared__ float As[TILE_SIZE][TILE_SIZE + 1]; // padded to avoid conflicts
Performance Comparison:

| Version       | Time (ms) | Speedup vs CPU | Key Optimization      |
|---------------|-----------|----------------|-----------------------|
| CPU           | 8000      | 1.0x           | Serial computation    |
| Naive GPU     | 400       | 20x            | Parallel compute      |
| Shared Memory | 80        | 100x           | Tiling & data reuse   |
| Optimized     | 60        | 133x           | Conflict avoidance    |
| cuBLAS        | 15        | 533x           | Highly tuned library  |
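The cuBLAS row corresponds to calling the library's SGEMM routine. A minimal sketch (error checking omitted; cuBLAS expects column-major storage, so passing B first and A second is the standard trick that yields row-major C = A·B):
#include <cublas_v2.h>

cublasHandle_t handle;
cublasCreate(&handle);
const float alpha = 1.0f, beta = 0.0f;
// Column-major trick: compute B^T * A^T, which is (A*B)^T, i.e. A*B in row-major layout
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            N, N, N,
            &alpha, d_B, N, d_A, N,
            &beta, d_C, N);
cublasDestroy(handle);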
Performance Analysis Techniques
Example 3: Performance Analysis (03_performance_analysis.cu)
Key Analysis Aspects
1. Memory access patterns
   - Coalesced vs. strided vs. random access
   - Cache hit rate analysis
2. Compute-bound vs. memory-bound
   - Arithmetic intensity
   - Compare compute throughput to memory bandwidth
3. Branch divergence
   - Consistency of execution within warps
   - Branch efficiency
4. Occupancy analysis
   - SM utilization
   - Registers/shared memory per thread block
Key Metrics
// Memory bandwidth measurement
// time_ms: elapsed time of the measured kernel/transfer, taken with CUDA events (see the toolchain below)
float measure_bandwidth(float* d_input, float* d_output, int n, int iterations, float time_ms) {
    float bytes = 2.0f * n * sizeof(float) * iterations;   // n floats read + n floats written per iteration
    float bandwidth = bytes / (time_ms / 1000.0f) / (1024.0f * 1024.0f * 1024.0f);   // GB/s
    return bandwidth;
}

// Occupancy estimation
void analyze_occupancy(const void* kernel, int block_size) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int max_active_blocks;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_active_blocks, kernel, block_size, 0);
    float occupancy = (max_active_blocks * block_size / (float)prop.maxThreadsPerMultiProcessor) * 100.0f;
    printf("Theoretical occupancy: %.1f%%\n", occupancy);
}
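A usage sketch for these helpers (variable names are illustrative; the elapsed time would come from the CUDA-event timing shown in the next subsection):
float time_ms = 1.8f;                        // illustrative: measured with cudaEventElapsedTime
float gbps = measure_bandwidth(d_in, d_out, n, 100, time_ms);
printf("Effective bandwidth: %.1f GB/s\n", gbps);

// Theoretical occupancy of the Example 1 reduction kernel at 256 threads per block
analyze_occupancy((const void*)vector_reduce_shared, 256);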
Toolchain Overview
1. CUDA Events
cudaEvent_t start, stop;
cudaEventCreate(&start); cudaEventCreate(&stop);
cudaEventRecord(start);
kernel<<<...>>>();
cudaEventRecord(stop); cudaEventSynchronize(stop);
float time_ms;
cudaEventElapsedTime(&time_ms, start, stop);
2. nvprof (deprecated, still usable)
nvprof ./your_program
nvprof --metrics achieved_occupancy,gld_efficiency ./your_program
3. Nsight Compute (recommended)
ncu --set full ./your_program
ncu --metrics sm__cycles_elapsed.avg,dram__bytes_read.sum ./your_program
4. Nsight Systems
nsys profile -t cuda ./your_program
Optimization Workflow
Benchmark → Identify bottlenecks → Apply targeted optimizations → Measure impact → Iterate.
Advanced Shared Memory Techniques
Example 4: Advanced Shared Memory (04_advanced_shared_memory.cu)
Bank Conflicts Explained
Shared memory is divided into 32 banks, each serving one 32-bit word per cycle. When multiple threads in a warp access different addresses that map to the same bank, the accesses are serialized and performance degrades.
Bank 0: words 0, 32, 64, ...
Bank 1: words 1, 33, 65, ...
...
Bank 31: words 31, 63, 95, ...
Techniques to Avoid Bank Conflicts
- Padding
__shared__ float data[32][32]; // column accesses data[tid][i] all hit the same bank: conflicts
__shared__ float data[32][33]; // extra column shifts each row by one bank: no conflicts
- Access Pattern Tuning
// Bad: stride-2 access, threads i and i+16 share a bank (2-way conflict)
data[threadIdx.x * 2];
// Good: consecutive words map to distinct banks
data[threadIdx.x];
Double Buffering
// Double buffering: load the next tile into one buffer while computing on the other
__global__ void double_buffering_example(float* input, float* output, int n) {
    extern __shared__ float sdata[];           // 2 * blockDim.x floats
    int tid = threadIdx.x;
    int tile_size = blockDim.x;                // one element per thread per tile
    int num_tiles = n / tile_size;             // assumes n is a multiple of tile_size
    float* buffer_a = sdata;                   // tile currently being computed
    float* buffer_b = sdata + tile_size;       // tile currently being loaded

    buffer_a[tid] = input[tid];                // preload the first tile
    __syncthreads();

    for (int tile = 0; tile < num_tiles - 1; tile++) {
        // Compute on the current tile (placeholder computation)
        float result = buffer_a[tid] * 2.0f;
        // Meanwhile, fetch the next tile into the spare buffer
        buffer_b[tid] = input[(tile + 1) * tile_size + tid];
        __syncthreads();
        output[tile * tile_size + tid] = result;
        // Swap the roles of the two buffers
        float* tmp = buffer_a; buffer_a = buffer_b; buffer_b = tmp;
        __syncthreads();
    }
    // Last tile: nothing left to prefetch
    output[(num_tiles - 1) * tile_size + tid] = buffer_a[tid] * 2.0f;
}
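Since the two buffers sit back-to-back in dynamic shared memory, the launch must reserve room for both. A sketch, assuming a single block and one element per thread per tile:
int threads = 256;                                  // tile_size = blockDim.x
size_t smem_bytes = 2 * threads * sizeof(float);    // buffer_a + buffer_b
double_buffering_example<<<1, threads, smem_bytes>>>(d_input, d_output, n);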
Warp-Level Primitives
__global__ void warp_reduce(float* input, float* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (idx < n) ? input[idx] : 0.0f;

    // Shuffle-down reduction: each step adds the value held by the lane `offset` positions higher
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);

    // Lane 0 of each warp now holds the warp's partial sum
    if (threadIdx.x % 32 == 0)
        atomicAdd(output, val);
}
Optimized Matrix Transpose
__global__ void matrix_transpose_optimized(float* input, float* output, int width, int height) {
    __shared__ float tile[TILE_SIZE][TILE_SIZE + 1];   // +1 padding avoids bank conflicts on the transposed read

    int x_in  = blockIdx.x * TILE_SIZE + threadIdx.x;
    int y_in  = blockIdx.y * TILE_SIZE + threadIdx.y;
    int x_out = blockIdx.y * TILE_SIZE + threadIdx.x;
    int y_out = blockIdx.x * TILE_SIZE + threadIdx.y;

    // Coalesced read from the input, staged into shared memory
    if (x_in < width && y_in < height)
        tile[threadIdx.y][threadIdx.x] = input[y_in * width + x_in];
    __syncthreads();

    // Coalesced write to the output, reading the tile transposed
    if (x_out < height && y_out < width)
        output[y_out * height + x_out] = tile[threadIdx.x][threadIdx.y];
}
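A sketch of the corresponding launch; the grid's x dimension covers the input width, y covers the height, and the block must be TILE_SIZE × TILE_SIZE:
dim3 block(TILE_SIZE, TILE_SIZE);
dim3 grid((width + TILE_SIZE - 1) / TILE_SIZE, (height + TILE_SIZE - 1) / TILE_SIZE);
matrix_transpose_optimized<<<grid, block>>>(d_input, d_output, width, height);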
Practical Exercises
Exercise 1: Shared Memory Reduction
Task: implement a high-performance array summation kernel.
Requirements:
- Use shared memory
- Handle arbitrary array sizes
- Avoid bank conflicts
- Compare against a naive version
Exercise 2: Shared Memory Convolution
Task: optimize a 2D convolution using shared memory.
Requirements:
- Support arbitrary image and kernel sizes
- Use shared memory tiling
- Handle boundaries
- Measure memory bandwidth
Exercise 3: Kernel Profiling & Optimization
Task:
- Choose a real CUDA kernel
- Profile it with Nsight Compute
- Identify bottlenecks
- Apply optimizations
- Write a performance report
FAQ
Q1: When should I use shared memory?
A: When:
- Threads reuse data
- The access pattern is predictable
- Global memory bandwidth is the bottleneck
- The algorithm benefits from locality
Q2: How do I choose the best tile size?
A: Balance between:
- Shared memory budget (e.g., a pair of 32×32 float tiles uses 8 KB per block)
- Register usage
- Data reuse
- Occupancy
Typical choices: 16×16 or 32×32.
Q3: How bad are bank conflicts?
A: An n-way conflict serializes the access into n transactions:
- 2-way → roughly half the shared memory throughput
- 4-way → roughly a quarter of the throughput
- 32-way → roughly 1/32 of the throughput (~97% lost)
Q4: Why is my kernel's occupancy low?
Possible causes:
- Too many registers per thread (see the resource-usage sketch after this list)
- Excessive shared memory per block
- Oversized thread blocks
- Branch divergence
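One way to check the first two causes is to query a kernel's compiled resource usage at runtime; a sketch using the CUDA runtime API (the kernel name is just the tiled example from earlier):
cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, (const void*)matrix_mul_shared_basic);
printf("Registers per thread: %d\n", attr.numRegs);
printf("Static shared memory per block: %zu bytes\n", attr.sharedSizeBytes);
printf("Max threads per block for this kernel: %d\n", attr.maxThreadsPerBlock);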
Q5: How do I handle irregular memory access?
Strategies:
- Preprocessing: reshape or pad the data
- Index indirection: use index arrays
- Tiling: break the problem into regular blocks
- Texture memory: for spatial locality
Advanced Learning Resources
Recommended Books
- Professional CUDA C Programming by John Cheng
- CUDA by Example by Jason Sanders
- Programming Massively Parallel Processors by David Kirk
Online Resources
Useful Tools
- Nsight Compute: kernel-level profiling
- Nsight Systems: timeline and system-level analysis
- CUDA-MEMCHECK: memory error checking
Advanced Topics
- Multi-GPU programming
- CUDA Streams
- Dynamic Parallelism
- Tensor Core programming
- cuBLAS/cuDNN/Thrust integration
Summary
By completing this guide, you have learned:
✓ The CUDA memory hierarchy and shared memory usage
✓ The matrix multiplication optimization pipeline
✓ Performance analysis tools and workflows
✓ Shared memory conflict resolution and double buffering
✓ Warp-level efficiency techniques
Next Steps
- Explore CUDA libraries like cuBLAS
- Learn about new GPU features (Tensor Cores, MIG)
- Apply these techniques to real-world compute-intensive projects
- Keep profiling and refining
Performance tuning is a mindset: mix theory with hands-on practice!