CUDA Shared Memory & Performance Optimization Study Guide

πŸ“š Table of Contents

  1. Project Overview

  2. Environment Setup

  3. Shared Memory Basics

  4. Matrix Multiplication Optimization Evolution

  5. Performance Analysis Techniques

  6. Advanced Shared Memory Techniques

  7. Practical Exercises

  8. FAQ

  9. Advanced Learning Resources


Project Overview

This project walks you through four progressively more advanced CUDA examples to master shared memory, kernel performance analysis, and matrix multiplication optimization techniques.

🎯 Learning Objectives

  • Understand CUDA memory hierarchy

  • Master efficient shared memory usage

  • Learn to analyze and optimize CUDA kernel performance

  • Implement high-performance matrix multiplication

  • Gain proficiency in profiling tools

πŸ“‚ Project Structure

cuda-learning/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ 01_shared_memory_basics.cu
β”‚   β”œβ”€β”€ 02_matrix_multiply_evolution.cu
β”‚   β”œβ”€β”€ 03_performance_analysis.cu
β”‚   └── 04_advanced_shared_memory.cu
β”œβ”€β”€ scripts/
β”‚   └── compile_and_run.sh
β”œβ”€β”€ build/
β”œβ”€β”€ docs/
β”œβ”€β”€ Makefile
└── README.md

Environment Setup

πŸ“‹ System Requirements

  • OS: Ubuntu 18.04+ / CentOS 7+ / Windows 10+

  • GPU: NVIDIA GPU (Compute Capability 3.0+)

  • CUDA Toolkit: 10.0+

  • Compiler: GCC 7+ / MSVC 2017+

πŸ”§ Installing CUDA Toolkit

Ubuntu/Debian:

# Add NVIDIA package repo
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"

# Install CUDA
sudo apt update
sudo apt install cuda-toolkit-11-8

Or via the default package manager:

sudo apt install nvidia-cuda-toolkit

βœ… Verify Installation

nvidia-smi           # Check NVIDIA driver
nvcc --version       # Check CUDA compiler
nvidia-smi -L        # Check GPU info

πŸš€ Build and Run

Method 1: Using Makefile

make                 # Build all
make run-example1    # Run example 1
make examples        # Run all
make profile         # Run performance profiling

Method 2: Using script

chmod +x scripts/compile_and_run.sh
./scripts/compile_and_run.sh 1       # Run example 1
./scripts/compile_and_run.sh -a      # Run all
./scripts/compile_and_run.sh -p 3    # Profile example 3

Method 3: Manual

mkdir -p build
nvcc -O3 -arch=sm_50 src/01_shared_memory_basics.cu -o build/example1
./build/example1
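
The -arch=sm_50 flag above is a conservative baseline; ideally it should match your GPU's compute capability. A quick way to check it (the compute_cap query field is available on recent drivers; on older drivers the deviceQuery CUDA sample works too):

nvidia-smi --query-gpu=compute_cap --format=csv,noheader   # e.g. prints 7.5
nvcc -O3 -arch=sm_75 src/01_shared_memory_basics.cu -o build/example1   # then compile for sm_75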

Shared Memory Basics

🧠 Theoretical Background

CUDA Memory Hierarchy

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Global Memory               β”‚ ← Large, High Latency
β”‚ β€’ Size: GB                  β”‚
β”‚ β€’ Latency: 400-800 cycles   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          ↑
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ L2 Cache                    β”‚
β”‚ β€’ Size: MB                  β”‚
β”‚ β€’ Latency: 200-300 cycles   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          ↑
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ L1 Cache / Shared Memory    β”‚ ← Fast & Small
β”‚ β€’ Size: 48–164KB            β”‚
β”‚ β€’ Latency: 20-30 cycles     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          ↑
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Registers                   β”‚ ← Fastest
β”‚ β€’ Latency: 1 cycle          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Shared Memory Features

  • Fast access: 10–100x faster than global memory

  • Intra-block sharing: Threads in the same block can share data

  • Manually managed

  • Limited size

  • Banked architecture: Avoid bank conflicts for best performance

πŸ’‘ Example 1: Shared Memory Basics (01_shared_memory_basics.cu)

Covers:

  • Shared vs global memory speed comparison

  • Parallel reduction using shared memory

  • Demonstrating and resolving bank conflicts

// Shared memory reduction kernel
__global__ void vector_reduce_shared(float* input, float* output, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (idx < n) ? input[idx] : 0.0f;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) atomicAdd(output, sdata[0]);
}
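
A minimal host-side launch sketch (not part of the example file; d_input, d_output, and n are placeholder names) showing how the size of the dynamically allocated sdata[] array is passed as the third launch parameter:

// Host-side launch sketch: the third <<<>>> argument sizes extern __shared__ sdata[]
int threads = 256;
int blocks  = (n + threads - 1) / threads;
size_t smem_bytes = threads * sizeof(float);   // one float per thread
vector_reduce_shared<<<blocks, threads, smem_bytes>>>(d_input, d_output, n);
cudaDeviceSynchronize();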

Performance wins:

  • Reduces global memory reads

  • Fast computation in shared memory

  • Coalesced memory access


Matrix Multiplication Optimization Evolution

🎯 Goal

Start from a naive implementation and evolve into a high-performance matrix multiplication using shared memory and conflict avoidance.

πŸ“Š Example 2: Matrix Multiply Evolution (02_matrix_multiply_evolution.cu)

Version 1: Naive

__global__ void matrix_mul_naive(float* A, float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

Problem: excessive redundant global memory traffic. Each element of A and B is loaded from global memory N times, so the kernel is memory bound.

Version 2: Tiled (Shared Memory)

__global__ void matrix_mul_shared_basic(float* A, float* B, float* C, int N) {
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_SIZE + ty;
    int col = blockIdx.x * TILE_SIZE + tx;

    float sum = 0.0f;

    for (int t = 0; t < (N + TILE_SIZE - 1) / TILE_SIZE; t++) {
        if (row < N && t * TILE_SIZE + tx < N)
            As[ty][tx] = A[row * N + t * TILE_SIZE + tx];
        else
            As[ty][tx] = 0.0f;

        if (col < N && t * TILE_SIZE + ty < N)
            Bs[ty][tx] = B[(t * TILE_SIZE + ty) * N + col];
        else
            Bs[ty][tx] = 0.0f;

        __syncthreads();

        for (int k = 0; k < TILE_SIZE; k++)
            sum += As[ty][k] * Bs[k][tx];

        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = sum;
}
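
A host-side launch sketch for the tiled kernel (TILE_SIZE is assumed here to be a compile-time constant such as 16; d_A, d_B, d_C are placeholder device pointers):

// One thread per output element; one block per TILE_SIZE x TILE_SIZE output tile
dim3 block(TILE_SIZE, TILE_SIZE);
dim3 grid((N + TILE_SIZE - 1) / TILE_SIZE, (N + TILE_SIZE - 1) / TILE_SIZE);
matrix_mul_shared_basic<<<grid, block>>>(d_A, d_B, d_C, N);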

Version 3: Optimized with Conflict Avoidance

__shared__ float As[TILE_SIZE][TILE_SIZE + 1];  // padded to avoid conflicts

Performance Comparison:

| Version | Time (ms) | Speedup vs CPU | Key Optimization |
|---|---|---|---|
| CPU | 8000 | 1.0x | Serial computation |
| Naive GPU | 400 | 20x | Parallel compute |
| Shared Memory | 80 | 100x | Tiling & data reuse |
| Optimized | 60 | 133x | Conflict avoidance |
| cuBLAS | 15 | 533x | Highly tuned library |



Performance Analysis Techniques

πŸ” Example 3: Performance Analysis (

03_performance_analysis.cu

)

Key Analysis Aspects

  1. Memory Access Patterns

    • Coalesced vs Strided vs Random

    • Cache hit rate analysis

  2. Compute vs Memory Bound

    • Arithmetic intensity

    • Compare compute throughput to memory bandwidth

  3. Branch Divergence

    • Consistency of execution within warps

    • Branch efficiency

  4. Occupancy Analysis

    • SM utilization

    • Registers/shared memory per thread block

Key Metrics

// Memory bandwidth calculation
// time_ms: elapsed time of the kernel under test (the one reading d_input and
// writing d_output), measured with CUDA events as shown in the toolchain section below
float measure_bandwidth(float* d_input, float* d_output, int n, int iterations, float time_ms) {
    float bytes = 2.0f * n * sizeof(float) * iterations;                             // one read + one write per element
    float bandwidth = bytes / (time_ms / 1000.0f) / (1024.0f * 1024.0f * 1024.0f);   // GB/s
    return bandwidth;
}

// Occupancy estimation
void analyze_occupancy(const void* kernel, int block_size) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    int max_active_blocks;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_active_blocks, kernel, block_size, 0);
    float occupancy = (max_active_blocks * block_size / (float)prop.maxThreadsPerMultiProcessor) * 100;
    printf("Theoretical occupancy: %.1f%%\n", occupancy);
}
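
For example, to check a 256-thread configuration of the reduction kernel from Example 1 (a sketch; the __global__ function is cast to const void* to match the helper's signature):

analyze_occupancy((const void*)vector_reduce_shared, 256);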

πŸ› οΈ Toolchain Overview

1. CUDA Events

cudaEvent_t start, stop;
cudaEventCreate(&start); cudaEventCreate(&stop);
cudaEventRecord(start);
kernel<<<...>>>();
cudaEventRecord(stop); cudaEventSynchronize(stop);

float time_ms;
cudaEventElapsedTime(&time_ms, start, stop);

2. nvprof (deprecated, still usable on older GPUs)

nvprof ./your_program
nvprof --metrics achieved_occupancy,gld_efficiency ./your_program

3. Nsight Compute (ncu)

ncu --set full ./your_program
ncu --metrics sm__cycles_elapsed.avg,dram__bytes_read.sum ./your_program

4. Nsight Systems

nsys profile -t cuda ./your_program

πŸ“ˆ Optimization Workflow

Benchmark β†’ Identify bottlenecks β†’ Apply targeted optimizations β†’ Measure impact β†’ Iterate.


Advanced Shared Memory Techniques

πŸš€ Example 4: Advanced Shared Memory (04_advanced_shared_memory.cu)

Bank Conflicts Explained

Shared memory is divided into 32 banks, each 4 bytes (one word) wide, and consecutive 32-bit words map to consecutive banks. If multiple threads in a warp access different addresses that fall in the same bank, the accesses are serialized and performance degrades; accesses to the same address are broadcast and do not conflict.

Bank 0:  words 0, 32, 64, ...
Bank 1:  words 1, 33, 65, ...
...
Bank 31: words 31, 63, 95, ...

Techniques to Avoid Bank Conflicts

  1. Padding (see the sketch after this list)
__shared__ float data[32][32];     // rows are 32 words: reading a column hits one bank 32 times
__shared__ float data[32][33];     // extra column shifts each row, spreading a column read across banks
  2. Access Pattern Tuning
// Bad: stride-2 indexing, threads 0 and 16 (etc.) share a bank -> 2-way conflict
data[threadIdx.x * 2];

// Good: consecutive threads hit consecutive banks
data[threadIdx.x];

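Below is a minimal sketch (not taken from the example file; the kernel name and the PADDED switch are illustrative) that makes the padding effect concrete: a 32Γ—32 block loads a tile, then one warp sums the rows, so each loop step reads one column of the tile. Without padding every such read lands in a single bank; with the extra column the reads spread across all 32 banks.

// Illustrative sketch: toggle PADDED and time the kernel to see the bank-conflict cost
#define PADDED 1

__global__ void row_sum_demo(const float* in, float* out) {
#if PADDED
    __shared__ float tile[32][33];      // 33-word rows: tile[tx][r] varies bank with tx
#else
    __shared__ float tile[32][32];      // 32-word rows: tile[tx][r] always hits bank r
#endif
    int tx = threadIdx.x, ty = threadIdx.y;   // launch with a 32x32 block
    tile[ty][tx] = in[ty * 32 + tx];          // coalesced, conflict-free load
    __syncthreads();

    if (ty == 0) {                            // one warp sums the 32 rows
        float sum = 0.0f;
        for (int r = 0; r < 32; r++)
            sum += tile[tx][r];               // the warp reads column r each step
        out[tx] = sum;
    }
}

// Usage sketch: row_sum_demo<<<1, dim3(32, 32)>>>(d_in, d_out);
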
Double Buffering

// Single-block sketch: process block-sized tiles of input, prefetching the next
// tile into one buffer while computing on the other.
__global__ void double_buffering_example(float* input, float* output, int n) {
    // Launch with 2 * blockDim.x * sizeof(float) bytes of dynamic shared memory
    extern __shared__ float sdata[];
    float* buffer_a = sdata;                  // tile being computed on
    float* buffer_b = &sdata[blockDim.x];     // tile being prefetched

    int tid = threadIdx.x;
    int tile_size = blockDim.x;
    int num_tiles = n / tile_size;            // assumes n is a multiple of blockDim.x

    buffer_a[tid] = input[tid];               // preload the first tile
    __syncthreads();

    for (int tile = 0; tile < num_tiles; tile++) {
        if (tile + 1 < num_tiles)             // prefetch the next tile while computing
            buffer_b[tid] = input[(tile + 1) * tile_size + tid];

        float result = buffer_a[tid] * 2.0f;  // placeholder computation on the current tile
        output[tile * tile_size + tid] = result;

        __syncthreads();                      // make the prefetched tile visible ...
        float* tmp = buffer_a; buffer_a = buffer_b; buffer_b = tmp;   // ... then swap buffers
    }
}

Warp-Level Primitives

__global__ void warp_reduce(float* input, float* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (idx < n) ? input[idx] : 0.0f;

    // Reduce within each warp using register shuffles (no shared memory needed)
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);

    // Lane 0 of each warp holds the warp's partial sum
    if (threadIdx.x % 32 == 0)
        atomicAdd(output, val);
}

Optimized Matrix Transpose

__global__ void matrix_transpose_optimized(float* input, float* output, int width, int height) {
    __shared__ float tile[TILE_SIZE][TILE_SIZE + 1];  // +1 padding

    int x_in = blockIdx.x * TILE_SIZE + threadIdx.x;
    int y_in = blockIdx.y * TILE_SIZE + threadIdx.y;
    int x_out = blockIdx.y * TILE_SIZE + threadIdx.x;
    int y_out = blockIdx.x * TILE_SIZE + threadIdx.y;

    if (x_in < width && y_in < height)
        tile[threadIdx.y][threadIdx.x] = input[y_in * width + x_in];

    __syncthreads();

    if (x_out < height && y_out < width)
        output[y_out * height + x_out] = tile[threadIdx.x][threadIdx.y];
}

Practical Exercises

🎯 Exercise 1: Shared Memory Reduction

Task:

  • Implement a high-performance array summation kernel

Requirements:

  • Use shared memory

  • Handle arbitrary array sizes

  • Avoid bank conflicts

  • Compare against naive version

🎯 Exercise 2: Shared Memory Convolution

Task:

  • Optimize 2D convolution using shared memory

Requirements:

  • Support arbitrary image and kernel sizes

  • Shared memory tiling

  • Handle boundaries

  • Measure memory bandwidth

🎯 Exercise 3: Kernel Profiling & Optimization

Task:

  • Choose a real CUDA kernel

  • Profile with Nsight Compute

  • Identify bottlenecks

  • Apply optimization

  • Write a performance report


FAQ

❓ Q1: When should I use shared memory?

A: When:

  • Threads reuse data

  • Access pattern is predictable

  • Global memory bottlenecks exist

  • The algorithm benefits from locality

❓ Q2: How to choose the best tile size?

A: Balance between:

  • Shared memory budget

  • Register usage

  • Data reuse

  • Occupancy

Typical choices: 16x16 or 32x32
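
As a quick sanity check of the shared memory budget (assuming float data and the two tiles used by the tiled kernel above):

// Shared memory per block for two float tiles:
//   16x16: 2 * 16 * 16 * 4 B = 2 KB per block
//   32x32: 2 * 32 * 32 * 4 B = 8 KB per block
// With 48 KB of shared memory per SM, that alone allows ~24 or ~6 resident blocks,
// though register usage and per-SM block limits may bind first.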

❓ Q3: How bad are bank conflicts?

A: Conflicting accesses within a warp are serialized, so relative to conflict-free shared memory access:

  β€’ 2-way β†’ ~50% slower (two serialized transactions)

  β€’ 4-way β†’ ~75% slower (four serialized transactions)

  β€’ 32-way β†’ ~97% slower (fully serialized)

❓ Q4: Why is my kernel occupancy low?

Possible causes (the compiler report shown below helps diagnose the first two):

  • Too many registers

  • Excessive shared memory

  • Oversized blocks

  • Branch divergence

❓ Q5: How to handle irregular memory access?

Strategies:

  • Preprocessing: reshape or pad data

  • Index indirection: use index arrays

  • Tiling: break into regular blocks

  • Texture memory: for spatial locality


Advanced Learning Resources

  1. Professional CUDA C Programming – John Cheng

  2. CUDA by Example – Jason Sanders

  3. Programming Massively Parallel Processors – David Kirk

🌐 Online Resources

πŸ› οΈ Useful Tools

  • Nsight Compute: kernel-level profiling

  • Nsight Systems: timeline and system analysis

  • CUDA-MEMCHECK: memory error checker

πŸ“– Advanced Topics

  • Multi-GPU programming

  • CUDA Streams

  • Dynamic Parallelism

  • Tensor Core Programming

  • cuBLAS/cuDNN/Thrust integration


πŸ’‘ Summary

By completing this guide, you’ve learned:

βœ… CUDA memory hierarchy & shared memory usage

βœ… Matrix multiplication optimization pipeline

βœ… Performance analysis tools and workflows

βœ… Shared memory conflict resolution and double buffering

βœ… Warp-level efficiency techniques

πŸš€ Next Steps

  1. Explore CUDA libraries like cuBLAS

  2. Learn about new GPU features (Tensor Cores, MIG)

  3. Apply to real-world compute-intensive projects

  4. Keep profiling and refining

🧠 Performance tuning is a mindset β€” mix theory with hands-on practice!