CUDA Shared Memory & Performance Optimization Study Guide

πŸ“š Table of Contents

  1. Project Overview

  2. Environment Setup

  3. Shared Memory Basics

  4. Matrix Multiplication Optimization Evolution

  5. Performance Analysis Techniques

  6. Advanced Shared Memory Techniques

  7. Practical Exercises

  8. FAQ

  9. Advanced Learning Resources


Project Overview

This project walks you through four progressively more advanced CUDA examples to master shared memory, kernel performance analysis, and matrix multiplication optimization techniques.

🎯 Learning Objectives

  • Understand CUDA memory hierarchy

  • Master efficient shared memory usage

  • Learn to analyze and optimize CUDA kernel performance

  • Implement high-performance matrix multiplication

  • Gain proficiency in profiling tools

πŸ“‚ Project Structure

cuda-learning/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ 01_shared_memory_basics.cu
β”‚   β”œβ”€β”€ 02_matrix_multiply_evolution.cu
β”‚   β”œβ”€β”€ 03_performance_analysis.cu
β”‚   └── 04_advanced_shared_memory.cu
β”œβ”€β”€ scripts/
β”‚   └── compile_and_run.sh
β”œβ”€β”€ build/
β”œβ”€β”€ docs/
β”œβ”€β”€ Makefile
└── README.md

Environment Setup

πŸ“‹ System Requirements

  • OS: Ubuntu 18.04+ / CentOS 7+ / Windows 10+

  • GPU: NVIDIA GPU (Compute Capability 3.0+)

  • CUDA Toolkit: 10.0+

  • Compiler: GCC 7+ / MSVC 2017+

πŸ”§ Installing CUDA Toolkit

Ubuntu/Debian:

# Add NVIDIA package repo
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"

# Install CUDA
sudo apt update
sudo apt install cuda-toolkit-11-8

Or via the default package manager:

sudo apt install nvidia-cuda-toolkit

βœ… Verify Installation

nvidia-smi           # Check NVIDIA driver
nvcc --version       # Check CUDA compiler
nvidia-smi -L        # Check GPU info

πŸš€ Build and Run

Method 1: Using Makefile

make                 # Build all
make run-example1    # Run example 1
make examples        # Run all
make profile         # Run performance profiling

Method 2: Using script

chmod +x scripts/compile_and_run.sh
./scripts/compile_and_run.sh 1       # Run example 1
./scripts/compile_and_run.sh -a      # Run all
./scripts/compile_and_run.sh -p 3    # Profile example 3

Method 3: Manual

mkdir -p build
nvcc -O3 -arch=sm_50 src/01_shared_memory_basics.cu -o build/example1
./build/example1
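
The -arch=sm_50 flag above is a conservative baseline; ideally it should match your GPU's compute capability. A quick way to check it (the compute_cap query field is available on recent drivers; on older drivers the deviceQuery CUDA sample works too):

nvidia-smi --query-gpu=compute_cap --format=csv,noheader   # e.g. prints 7.5
nvcc -O3 -arch=sm_75 src/01_shared_memory_basics.cu -o build/example1   # then compile for sm_75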

Shared Memory Basics

🧠 Theoretical Background

CUDA Memory Hierarchy

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Global Memory               β”‚ ← Large, High Latency
β”‚ β€’ Size: GB                  β”‚
β”‚ β€’ Latency: 400-800 cycles   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          ↑
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ L2 Cache                    β”‚
β”‚ β€’ Size: MB                  β”‚
β”‚ β€’ Latency: 200-300 cycles   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          ↑
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ L1 Cache / Shared Memory    β”‚ ← Fast & Small
β”‚ β€’ Size: 48–164KB            β”‚
β”‚ β€’ Latency: 20-30 cycles     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          ↑
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Registers                   β”‚ ← Fastest
β”‚ β€’ Latency: 1 cycle          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Shared Memory Features

  • Fast access: 10–100x faster than global memory

  • Intra-block sharing: Threads in the same block can share data

  • Manually managed

  • Limited size

  • Banked architecture: Avoid bank conflicts for best performance

πŸ’‘ Example 1: Shared Memory Basics (01_shared_memory_basics.cu)

Covers:

  • Shared vs global memory speed comparison

  • Parallel reduction using shared memory

  • Demonstrating and resolving bank conflicts

// Shared memory reduction kernel
__global__ void vector_reduce_shared(float* input, float* output, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (idx < n) ? input[idx] : 0.0f;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) atomicAdd(output, sdata[0]);
}
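
A minimal host-side launch sketch (not part of the example file; d_input, d_output, and n are placeholder names) showing how the size of the dynamically allocated sdata[] array is passed as the third launch parameter:

// Host-side launch sketch: the third <<<>>> argument sizes extern __shared__ sdata[]
int threads = 256;
int blocks  = (n + threads - 1) / threads;
size_t smem_bytes = threads * sizeof(float);   // one float per thread
vector_reduce_shared<<<blocks, threads, smem_bytes>>>(d_input, d_output, n);
cudaDeviceSynchronize();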

Performance wins:

  • Reduces global memory reads

  • Fast computation in shared memory

  • Coalesced memory access


Matrix Multiplication Optimization Evolution

🎯 Goal

Start from a naive implementation and evolve into a high-performance matrix multiplication using shared memory and conflict avoidance.

πŸ“Š Example 2: Matrix Multiply Evolution (02_matrix_multiply_evolution.cu)

Version 1: Naive

__global__ void matrix_mul_naive(float* A, float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

Problem: excessive redundant global memory traffic. Each element of A and B is loaded from global memory N times, so the kernel is memory bound.

Version 2: Tiled (Shared Memory)

__global__ void matrix_mul_shared_basic(float* A, float* B, float* C, int N) {
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_SIZE + ty;
    int col = blockIdx.x * TILE_SIZE + tx;

    float sum = 0.0f;

    for (int t = 0; t < (N + TILE_SIZE - 1) / TILE_SIZE; t++) {
        if (row < N && t * TILE_SIZE + tx < N)
            As[ty][tx] = A[row * N + t * TILE_SIZE + tx];
        else
            As[ty][tx] = 0.0f;

        if (col < N && t * TILE_SIZE + ty < N)
            Bs[ty][tx] = B[(t * TILE_SIZE + ty) * N + col];
        else
            Bs[ty][tx] = 0.0f;

        __syncthreads();

        for (int k = 0; k < TILE_SIZE; k++)
            sum += As[ty][k] * Bs[k][tx];

        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = sum;
}
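
A host-side launch sketch for the tiled kernel (TILE_SIZE is assumed here to be a compile-time constant such as 16; d_A, d_B, d_C are placeholder device pointers):

// One thread per output element; one block per TILE_SIZE x TILE_SIZE output tile
dim3 block(TILE_SIZE, TILE_SIZE);
dim3 grid((N + TILE_SIZE - 1) / TILE_SIZE, (N + TILE_SIZE - 1) / TILE_SIZE);
matrix_mul_shared_basic<<<grid, block>>>(d_A, d_B, d_C, N);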

Version 3: Optimized with Conflict Avoidance

__shared__ float As[TILE_SIZE][TILE_SIZE + 1];  // padded to avoid conflicts

Performance Comparison:

| Version | Time (ms) | Speedup vs CPU | Key Optimization |
|---|---|---|---|
| CPU | 8000 | 1.0x | Serial computation |
| Naive GPU | 400 | 20x | Parallel compute |
| Shared Memory | 80 | 100x | Tiling & data reuse |
| Optimized | 60 | 133x | Conflict avoidance |
| cuBLAS | 15 | 533x | Highly tuned library |



Performance Analysis Techniques

πŸ” Example 3: Performance Analysis (

03_performance_analysis.cu

)

Key Analysis Aspects

  1. Memory Access Patterns

    • Coalesced vs Strided vs Random

    • Cache hit rate analysis

  2. Compute vs Memory Bound

    • Arithmetic intensity

    • Compare compute throughput to memory bandwidth

  3. Branch Divergence

    • Consistency of execution within warps

    • Branch efficiency

  4. Occupancy Analysis

    • SM utilization

    • Registers/shared memory per thread block

Key Metrics

// Memory bandwidth calculation
// time_ms: elapsed time of the kernel under test (the one reading d_input and
// writing d_output), measured with CUDA events as shown in the toolchain section below
float measure_bandwidth(float* d_input, float* d_output, int n, int iterations, float time_ms) {
    float bytes = 2.0f * n * sizeof(float) * iterations;                             // one read + one write per element
    float bandwidth = bytes / (time_ms / 1000.0f) / (1024.0f * 1024.0f * 1024.0f);   // GB/s
    return bandwidth;
}

// Occupancy estimation
void analyze_occupancy(const void* kernel, int block_size) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    int max_active_blocks;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_active_blocks, kernel, block_size, 0);
    float occupancy = (max_active_blocks * block_size / (float)prop.maxThreadsPerMultiProcessor) * 100;
    printf("Theoretical occupancy: %.1f%%\n", occupancy);
}
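
For example, to check a 256-thread configuration of the reduction kernel from Example 1 (a sketch; the __global__ function is cast to const void* to match the helper's signature):

analyze_occupancy((const void*)vector_reduce_shared, 256);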

πŸ› οΈ Toolchain Overview

1. CUDA Events

cudaEvent_t start, stop;
cudaEventCreate(&start); cudaEventCreate(&stop);
cudaEventRecord(start);
kernel<<<...>>>();
cudaEventRecord(stop); cudaEventSynchronize(stop);

float time_ms;
cudaEventElapsedTime(&time_ms, start, stop);

2. nvprof (deprecated, still usable on older GPUs)

nvprof ./your_program
nvprof --metrics achieved_occupancy,gld_efficiency ./your_program

3. Nsight Compute (ncu)

ncu --set full ./your_program
ncu --metrics sm__cycles_elapsed.avg,dram__bytes_read.sum ./your_program

4. Nsight Systems

nsys profile -t cuda ./your_program

πŸ“ˆ Optimization Workflow

Benchmark β†’ Identify bottlenecks β†’ Apply targeted optimizations β†’ Measure impact β†’ Iterate.


Advanced Shared Memory Techniques

πŸš€ Example 4: Advanced Shared Memory (04_advanced_shared_memory.cu)

Bank Conflicts Explained

Shared memory is divided into 32 banks, each 4 bytes (one word) wide, and consecutive 32-bit words map to consecutive banks. If multiple threads in a warp access different addresses that fall in the same bank, the accesses are serialized and performance degrades; accesses to the same address are broadcast and do not conflict.

Bank 0:  words 0, 32, 64, ...
Bank 1:  words 1, 33, 65, ...
...
Bank 31: words 31, 63, 95, ...

Techniques to Avoid Bank Conflicts

  1. Padding (see the sketch after this list)
__shared__ float data[32][32];     // rows are 32 words: reading a column hits one bank 32 times
__shared__ float data[32][33];     // extra column shifts each row, spreading a column read across banks
  2. Access Pattern Tuning
// Bad: stride-2 indexing, threads 0 and 16 (etc.) share a bank -> 2-way conflict
data[threadIdx.x * 2];

// Good: consecutive threads hit consecutive banks
data[threadIdx.x];

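Below is a minimal sketch (not taken from the example file; the kernel name and the PADDED switch are illustrative) that makes the padding effect concrete: a 32Γ—32 block loads a tile, then one warp sums the rows, so each loop step reads one column of the tile. Without padding every such read lands in a single bank; with the extra column the reads spread across all 32 banks.

// Illustrative sketch: toggle PADDED and time the kernel to see the bank-conflict cost
#define PADDED 1

__global__ void row_sum_demo(const float* in, float* out) {
#if PADDED
    __shared__ float tile[32][33];      // 33-word rows: tile[tx][r] varies bank with tx
#else
    __shared__ float tile[32][32];      // 32-word rows: tile[tx][r] always hits bank r
#endif
    int tx = threadIdx.x, ty = threadIdx.y;   // launch with a 32x32 block
    tile[ty][tx] = in[ty * 32 + tx];          // coalesced, conflict-free load
    __syncthreads();

    if (ty == 0) {                            // one warp sums the 32 rows
        float sum = 0.0f;
        for (int r = 0; r < 32; r++)
            sum += tile[tx][r];               // the warp reads column r each step
        out[tx] = sum;
    }
}

// Usage sketch: row_sum_demo<<<1, dim3(32, 32)>>>(d_in, d_out);
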
Double Buffering

// Single-block sketch: process block-sized tiles of input, prefetching the next
// tile into one buffer while computing on the other.
__global__ void double_buffering_example(float* input, float* output, int n) {
    // Launch with 2 * blockDim.x * sizeof(float) bytes of dynamic shared memory
    extern __shared__ float sdata[];
    float* buffer_a = sdata;                  // tile being computed on
    float* buffer_b = &sdata[blockDim.x];     // tile being prefetched

    int tid = threadIdx.x;
    int tile_size = blockDim.x;
    int num_tiles = n / tile_size;            // assumes n is a multiple of blockDim.x

    buffer_a[tid] = input[tid];               // preload the first tile
    __syncthreads();

    for (int tile = 0; tile < num_tiles; tile++) {
        if (tile + 1 < num_tiles)             // prefetch the next tile while computing
            buffer_b[tid] = input[(tile + 1) * tile_size + tid];

        float result = buffer_a[tid] * 2.0f;  // placeholder computation on the current tile
        output[tile * tile_size + tid] = result;

        __syncthreads();                      // make the prefetched tile visible ...
        float* tmp = buffer_a; buffer_a = buffer_b; buffer_b = tmp;   // ... then swap buffers
    }
}

Warp-Level Primitives

__global__ void warp_reduce(float* input, float* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (idx < n) ? input[idx] : 0.0f;

    // Reduce within each warp using register shuffles (no shared memory needed)
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);

    // Lane 0 of each warp holds the warp's partial sum
    if (threadIdx.x % 32 == 0)
        atomicAdd(output, val);
}

Optimized Matrix Transpose

__global__ void matrix_transpose_optimized(float* input, float* output, int width, int height) {
    __shared__ float tile[TILE_SIZE][TILE_SIZE + 1];  // +1 padding

    int x_in = blockIdx.x * TILE_SIZE + threadIdx.x;
    int y_in = blockIdx.y * TILE_SIZE + threadIdx.y;
    int x_out = blockIdx.y * TILE_SIZE + threadIdx.x;
    int y_out = blockIdx.x * TILE_SIZE + threadIdx.y;

    if (x_in < width && y_in < height)
        tile[threadIdx.y][threadIdx.x] = input[y_in * width + x_in];

    __syncthreads();

    if (x_out < height && y_out < width)
        output[y_out * height + x_out] = tile[threadIdx.x][threadIdx.y];
}

Practical Exercises

🎯 Exercise 1: Shared Memory Reduction

Task:

  • Implement a high-performance array summation kernel

Requirements:

  • Use shared memory

  • Handle arbitrary array sizes

  • Avoid bank conflicts

  • Compare against naive version

🎯 Exercise 2: Shared Memory Convolution

Task:

  • Optimize 2D convolution using shared memory

Requirements:

  • Support arbitrary image and kernel sizes

  • Shared memory tiling

  • Handle boundaries

  • Measure memory bandwidth

🎯 Exercise 3: Kernel Profiling & Optimization

Task:

  • Choose a real CUDA kernel

  • Profile with Nsight Compute

  • Identify bottlenecks

  • Apply optimization

  • Write a performance report


FAQ

❓ Q1: When should I use shared memory?

A: When:

  • Threads reuse data

  • Access pattern is predictable

  • Global memory bottlenecks exist

  • The algorithm benefits from locality

❓ Q2: How to choose the best tile size?

A: Balance between:

  • Shared memory budget

  • Register usage

  • Data reuse

  • Occupancy

Typical choices: 16x16 or 32x32
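
As a quick sanity check of the shared memory budget (assuming float data and the two tiles used by the tiled kernel above):

// Shared memory per block for two float tiles:
//   16x16: 2 * 16 * 16 * 4 B = 2 KB per block
//   32x32: 2 * 32 * 32 * 4 B = 8 KB per block
// With 48 KB of shared memory per SM, that alone allows ~24 or ~6 resident blocks,
// though register usage and per-SM block limits may bind first.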

❓ Q3: How bad are bank conflicts?

A: Conflicting accesses within a warp are serialized, so relative to conflict-free shared memory access:

  β€’ 2-way β†’ ~50% slower (two serialized transactions)

  β€’ 4-way β†’ ~75% slower (four serialized transactions)

  β€’ 32-way β†’ ~97% slower (fully serialized)

❓ Q4: Why is my kernel occupancy low?

Possible causes (the compiler report shown below helps diagnose the first two):

  • Too many registers

  • Excessive shared memory

  • Oversized blocks

  • Branch divergence

❓ Q5: How to handle irregular memory access?

Strategies:

  • Preprocessing: reshape or pad data

  • Index indirection: use index arrays

  • Tiling: break into regular blocks

  • Texture memory: for spatial locality


Advanced Learning Resources

  1. Professional CUDA C Programming – John Cheng

  2. CUDA by Example – Jason Sanders

  3. Programming Massively Parallel Processors – David Kirk

🌐 Online Resources

πŸ› οΈ Useful Tools

  • Nsight Compute: kernel-level profiling

  • Nsight Systems: timeline and system analysis

  • CUDA-MEMCHECK: memory error checker

πŸ“– Advanced Topics

  • Multi-GPU programming

  • CUDA Streams

  • Dynamic Parallelism

  • Tensor Core Programming

  • cuBLAS/cuDNN/Thrust integration


πŸ’‘ Summary

By completing this guide, you’ve learned:

βœ… CUDA memory hierarchy & shared memory usage

βœ… Matrix multiplication optimization pipeline

βœ… Performance analysis tools and workflows

βœ… Shared memory conflict resolution and double buffering

βœ… Warp-level efficiency techniques

πŸš€ Next Steps

  1. Explore CUDA libraries like cuBLAS

  2. Learn about new GPU features (Tensor Cores, MIG)

  3. Apply to real-world compute-intensive projects

  4. Keep profiling and refining

🧠 Performance tuning is a mindset β€” mix theory with hands-on practice!