Triton & CUDA Study Group Plan (Restructured)

12-Week Structured Learning Program with Explicit Materials & Exercises


📋 Overview

This structured study plan is designed for software engineers with GPU knowledge but no CUDA/Triton experience. Each week includes specific readings, videos, learning objectives with discussion points, and targeted exercises.

Target Audience

  • Software engineers with GPU conceptual knowledge
  • No prior CUDA or Triton experience required
  • Comfortable with C/C++ and Python
  • Goal: Practical GPU programming competency

Week 1: GPU Architecture & CUDA Setup

📚 Required Readings & Videos (4-5 hours)

  1. NVIDIA CUDA Programming Guide - Chapter 1 (1 hour)
  2. An Even Easier Introduction to CUDA (30 min)
  3. GPU Architecture Basics - Video (45 min)
  4. FreeCodeCamp CUDA Course - First 2 hours (2 hours)

🎯 What You Should Learn

  • GPU vs CPU Architecture: SIMT vs SIMD execution model, memory hierarchy differences
  • CUDA Toolkit Components: nvcc compiler, runtime API, driver API
  • Basic Terminology: Host/device, kernel, thread hierarchy
  • Development Environment: Setting up CUDA development workflow

💬 Discussion Points for Group Meeting

  1. “Why do GPUs excel at parallel tasks that CPUs struggle with?”
  2. “What are the trade-offs between GPU and CPU for different algorithms?”
  3. “How does the SIMT execution model impact how we write algorithms?”
  4. “What development challenges do you anticipate with GPU programming?”

🛠️ Exercises

Exercise 1.1: Environment Setup (30 min)

# Verify CUDA installation
nvcc --version
nvidia-smi
 
# Compile and run the deviceQuery sample
# Note: since CUDA 11.6 the samples ship separately from the toolkit via
# https://github.com/NVIDIA/cuda-samples (build per its README); older
# toolkits still install them under /usr/local/cuda/samples:
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
make
./deviceQuery

Goal: Confirm working CUDA environment and understand GPU specifications

Exercise 1.2: Hello CUDA (45 min)

File: hello_cuda.cu

#include <cuda_runtime.h>
#include <stdio.h>
 
__global__ void helloFromGPU() {
    printf("Hello from GPU thread %d in block %d\n", 
           threadIdx.x, blockIdx.x);
}
 
int main() {
    printf("Hello from CPU\n");
    helloFromGPU<<<2, 4>>>();
    cudaDeviceSynchronize();
    return 0;
}

Task: Modify the program to use different grid/block configurations and observe the output patterns
Goal: Understand basic kernel launch syntax and thread indexing

Exercise 1.3: GPU Memory Investigation (30 min)

Task: Use deviceQuery output to answer:

  • What is your GPU’s compute capability?
  • How much global memory does it have?
  • How many multiprocessors?
  • What is the maximum threads per block?

Goal: Connect hardware specifications to programming limitations


Week 2: CUDA Fundamentals - Memory & Kernels

📚 Required Readings & Videos (4-5 hours)

  1. CUDA Programming Guide - Chapter 2 (1.5 hours)
  2. CUDA Memory Management (45 min)
  3. FreeCodeCamp CUDA Course - Hours 2-4 (2 hours)
  4. Understanding CUDA Grid and Block Dimensions (30 min)

🎯 What You Should Learn

  • Thread Hierarchy: Grids, blocks, threads, and their indexing
  • Memory Management: cudaMalloc, cudaMemcpy, cudaFree
  • Kernel Syntax: __global__, __device__, __host__ qualifiers
  • Error Handling: cudaError_t and proper error checking
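
For the error-handling objective, a minimal checking pattern looks like the sketch below (one common convention, not the only one); Week 11 builds a fuller version of the same idea.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Minimal error check: wrap every CUDA runtime call.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Example usage with the memory-management calls listed above:
//   CUDA_CHECK(cudaMalloc(&d_A, size));
//   CUDA_CHECK(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));
//   CUDA_CHECK(cudaFree(d_A));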

💬 Discussion Points for Group Meeting

  1. “Why is the thread hierarchy organized as grid→block→thread instead of a flat structure?”
  2. “What are the advantages and disadvantages of explicit memory management?”
  3. “How do we determine optimal grid and block dimensions for a given problem?”
  4. “What patterns do you see between problem decomposition and CUDA threading model?”

🛠️ Exercises

Exercise 2.1: Vector Addition (ECE408 MP0 Style) (90 min)

File: vector_add.cu

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
 
__global__ void vectorAdd(float *A, float *B, float *C, int N) {
    // TODO: Implement parallel vector addition
    // Each thread should compute one element
}
 
int main() {
    int N = 10000;
    size_t size = N * sizeof(float);
    
    // TODO: Allocate host memory
    // TODO: Initialize input arrays
    // TODO: Allocate device memory
    // TODO: Copy input data to device
    // TODO: Launch kernel
    // TODO: Copy result back to host
    // TODO: Verify result and cleanup
    
    return 0;
}

Tasks:

  1. Complete the implementation
  2. Add proper error checking
  3. Test with different vector sizes (1K, 10K, 1M elements)
  4. Experiment with different block sizes (32, 128, 256, 512)

Goal: Master basic CUDA workflow and understand performance implications

Exercise 2.2: Matrix Addition (60 min)

Task: Extend vector addition to 2D matrices using 2D grid/block configuration

__global__ void matrixAdd(float *A, float *B, float *C, int rows, int cols) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // TODO: Complete implementation with bounds checking
}

Goal: Understand 2D indexing and bounds checking
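
A possible launch configuration for the 2D kernel (a sketch; the 16×16 tile is arbitrary, and d_A, d_B, d_C are assumed to be device pointers allocated earlier):

dim3 block(16, 16);
dim3 grid((cols + block.x - 1) / block.x,   // ceil-divide so edge tiles are covered
          (rows + block.y - 1) / block.y);
matrixAdd<<<grid, block>>>(d_A, d_B, d_C, rows, cols);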

Exercise 2.3: Memory Transfer Timing (45 min)

Task: Measure and compare:

  • CPU computation time
  • GPU computation time (kernel only)
  • GPU total time (including memory transfers)
  • Memory transfer time (host↔device)

Goal: Understand when GPU acceleration is beneficial
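
A simple host-side harness is enough for these measurements. One sketch using std::chrono, assuming the vectorAdd kernel and device buffers from Exercise 2.1 (note the explicit synchronization, without which the kernel time is not captured):

#include <chrono>

auto t0 = std::chrono::high_resolution_clock::now();
vectorAdd<<<(N + 255) / 256, 256>>>(d_A, d_B, d_C, N);
cudaDeviceSynchronize();   // wait for the kernel before reading the clock
auto t1 = std::chrono::high_resolution_clock::now();
double kernel_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();

// Time the cudaMemcpy calls the same way to separate transfer cost from compute cost.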


Week 3: Introduction to Triton

📚 Required Readings & Videos (4-5 hours)

  1. OpenAI Triton Introduction (45 min)
  2. Triton Vector Addition Tutorial (1 hour)
  3. Understanding Triton Tutorials Part 1 (45 min)
  4. Getting Started with Triton Tutorial (1.5 hours)
  5. Triton vs CUDA Comparison (30 min)

🎯 What You Should Learn

  • Triton’s Philosophy: Block-level programming vs thread-level
  • Python Integration: @triton.jit decorator and kernel compilation
  • Memory Model: Block pointers and automatic memory optimization
  • Auto-tuning: Basic concepts of performance optimization
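
To make the block-level model concrete before the exercises, here is a minimal element-wise kernel sketch (a scalar multiply, so it does not give away Exercise 3.1; all names are illustrative):

import torch
import triton
import triton.language as tl

@triton.jit
def scale_kernel(x_ptr, out_ptr, scale, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # which block of elements this program handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard the final partial block
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * scale, mask=mask)

def scale(x: torch.Tensor, s: float) -> torch.Tensor:
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    scale_kernel[grid](x, out, s, x.numel(), BLOCK_SIZE=1024)
    return out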

💬 Discussion Points for Group Meeting

  1. “How does Triton’s block-level approach simplify GPU programming compared to CUDA?”
  2. “What are the trade-offs between Triton’s abstraction and CUDA’s explicit control?”
  3. “When would you choose Triton over CUDA and vice versa?”
  4. “How does Triton’s auto-tuning compare to manual optimization in CUDA?”

🛠️ Exercises

Exercise 3.1: Triton Vector Addition (60 min)

File: triton_vector_add.py

import torch
import triton
import triton.language as tl
 
@triton.jit
def vector_add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # TODO: Implement vector addition in Triton
    # Use block-level operations
    pass
 
def vector_add(x: torch.Tensor, y: torch.Tensor):
    # TODO: Set up kernel launch parameters
    # TODO: Call the kernel
    pass
 
# Test and benchmark
if __name__ == "__main__":
    # TODO: Create test tensors and verify correctness
    pass

Tasks:

  1. Complete the Triton implementation
  2. Compare performance with PyTorch’s native add
  3. Experiment with different BLOCK_SIZE values
  4. Verify numerical correctness

Goal: Understand Triton’s programming model and syntax

Exercise 3.2: Element-wise Operations (45 min)

Task: Implement in Triton:

  • ReLU activation: output = max(0, x)
  • Squared operation: output = x^2
  • GELU approximation: output = x * sigmoid(1.702 * x)

Goal: Practice Triton’s mathematical operations and function composition

Exercise 3.3: CUDA vs Triton Comparison (60 min)

Task: Implement the same vector addition in both CUDA and Triton, then compare:

  • Lines of code
  • Development time
  • Performance
  • Ease of debugging

Goal: Understand practical differences between the approaches


Week 4: CUDA Memory Optimization

📚 Required Readings & Videos (5-6 hours)

  1. CUDA Memory Model (2 hours)
  2. Memory Coalescing Tutorial (1 hour)
  3. Shared Memory Programming (1.5 hours)
  4. FreeCodeCamp CUDA Course - Hours 4-6 (2 hours)

🎯 What You Should Learn

  • Memory Hierarchy: Global, shared, constant, texture memory characteristics
  • Coalescing Rules: How to achieve efficient memory access patterns
  • Shared Memory: Bank conflicts, padding, synchronization
  • Memory Profiling: Using nvidia-smi and basic profiling techniques

💬 Discussion Points for Group Meeting

  1. “Why is memory bandwidth often the bottleneck in GPU applications?”
  2. “How do coalescing rules relate to the underlying hardware architecture?”
  3. “When is the complexity of shared memory optimization worth the effort?”
  4. “What patterns can we identify for memory-bound vs compute-bound kernels?”

🛠️ Exercises

Exercise 4.1: Coalescing Analysis (ECE408 MP1 Style) (90 min)

File: coalescing_test.cu

// Test different memory access patterns
// NOTE: allocate `data` with at least N * stride elements so this kernel stays in bounds
__global__ void strided_access(float *data, int stride, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) {
        data[tid * stride] = tid;  // Strided access
    }
}
 
__global__ void coalesced_access(float *data, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) {
        data[tid] = tid;  // Coalesced access
    }
}

Tasks:

  1. Implement timing for both kernels
  2. Test with strides of 1, 2, 4, 8, 16, 32
  3. Graph the performance vs stride
  4. Explain the performance pattern

Goal: Understand coalescing impact on performance

Exercise 4.2: Matrix Transpose with Shared Memory (120 min)

File: matrix_transpose.cu

#define TILE_SIZE 32
 
__global__ void transpose_naive(float *input, float *output, int N) {
    // TODO: Naive implementation (non-coalesced writes)
}
 
__global__ void transpose_shared(float *input, float *output, int N) {
    __shared__ float tile[TILE_SIZE][TILE_SIZE + 1];  // +1 to avoid bank conflicts
    // TODO: Shared memory implementation
}

Tasks:

  1. Implement both versions
  2. Measure and compare performance
  3. Verify correctness
  4. Experiment with different tile sizes

Goal: Master shared memory programming and bank conflict avoidance

Exercise 4.3: Memory Bandwidth Analysis (45 min)

Task: Calculate theoretical vs achieved memory bandwidth for your kernels

// Bandwidth = (bytes_read + bytes_written) / time_in_seconds
// Compare against GPU's peak memory bandwidth from deviceQuery

Goal: Understand memory efficiency metrics
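
A worked example of the formula, with hypothetical numbers: transposing a 4096 × 4096 float matrix moves 4096 · 4096 · 4 bytes in each direction, about 134 MB of total traffic. If the kernel takes 0.40 ms, achieved bandwidth ≈ 134,217,728 bytes / 0.0004 s ≈ 336 GB/s, which you would then compare against the peak figure reported by deviceQuery.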


Week 5: Advanced CUDA - Reductions

📚 Required Readings & Videos (5-6 hours)

  1. NVIDIA Reduction Examples (2 hours)
  2. Parallel Reduction Patterns (1.5 hours)
  3. Warp Shuffle Operations (1 hour)
  4. FreeCodeCamp CUDA Course - Hours 6-8 (2 hours)

🎯 What You Should Learn

  • Reduction Algorithms: Tree-based reductions, warp-level optimizations
  • Synchronization: __syncthreads(), warp synchronous programming
  • Warp Primitives: __shfl_down_sync(), __shfl_xor_sync()
  • Multiple Blocks: Handling reductions across grid dimensions
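
For the warp-primitive objective, the standard warp-level sum looks like the sketch below (assumes all 32 lanes participate, hence the full 0xffffffff mask):

// After the loop, lane 0 of the warp holds the sum of all 32 lanes' values.
__device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;
}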

💬 Discussion Points for Group Meeting

  1. “Why are reductions fundamentally challenging to parallelize efficiently?”
  2. “How do warp-level primitives change the performance characteristics?”
  3. “What are the trade-offs between different reduction strategies?”
  4. “When does it make sense to use multiple kernel launches vs single kernel?”

🛠️ Exercises

Exercise 5.1: Basic Reduction (ECE408 MP2 Style) (120 min)

File: reduction.cu

__global__ void reduce_sum_naive(float *input, float *output, int N) {
    // TODO: Implement tree-based reduction in shared memory
}
 
__global__ void reduce_sum_optimized(float *input, float *output, int N) {
    // TODO: Add warp-level optimizations
}

Tasks:

  1. Implement tree-based reduction
  2. Add warp shuffle optimizations
  3. Handle arbitrary input sizes
  4. Compare with CPU reduction

Goal: Master reduction algorithms and warp-level programming

Exercise 5.2: Multiple Reductions (60 min)

Task: Implement kernels to find:

  • Sum, max, min of array
  • Mean and variance in single pass
  • Histogram with atomic operations

Goal: Apply reduction concepts to different operations
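
For the histogram item, the core of the kernel is usually one atomic update per input element; a sketch (the binning rule is trivial and assumes non-negative values, and numBins is a host-chosen parameter):

__global__ void histogram(const int *data, unsigned int *bins, int N, int numBins) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) {
        int bin = data[tid] % numBins;   // illustrative binning only
        atomicAdd(&bins[bin], 1u);
    }
}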

Exercise 5.3: Performance Analysis (45 min)

Task: Profile your reduction implementations and compare against:

  • cuBLAS cublasSasum
  • Thrust thrust::reduce
  • CPU single-threaded version

Goal: Understand production-quality implementation performance


Week 6: Triton Matrix Operations

📚 Required Readings & Videos (5-6 hours)

  1. Triton Matrix Multiplication Tutorial (2 hours)
  2. Matrix Multiplication Optimization Guide (1.5 hours)
  3. Triton Auto-tuning Documentation (1 hour)
  4. Triton Kernel Compilation Stages (1 hour)

🎯 What You Should Learn

  • Block-level Matrix Multiplication: Tiling strategies in Triton
  • Auto-tuning: @triton.autotune decorator usage
  • Memory Efficiency: Triton’s automatic memory optimizations
  • Integration: Calling Triton kernels from PyTorch

💬 Discussion Points for Group Meeting

  1. “How does Triton’s auto-tuning compare to manual CUDA optimization?”
  2. “What are the advantages of block-level thinking for matrix operations?”
  3. “How does Triton handle memory coalescing automatically?”
  4. “When might Triton’s optimizations be suboptimal compared to hand-tuned CUDA?”

🛠️ Exercises

Exercise 6.1: Triton Matrix Multiplication (150 min)

File: triton_matmul.py

@triton.autotune(
    configs=[
        triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 64}, num_stages=3, num_warps=8),
        # TODO: Add more configurations
    ],
    key=['M', 'N', 'K'],
)
@triton.jit
def matmul_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr, BLOCK_SIZE_K: tl.constexpr,
):
    # TODO: Implement blocked matrix multiplication
    pass

Tasks:

  1. Complete the matrix multiplication implementation
  2. Add multiple auto-tune configurations
  3. Benchmark against PyTorch’s torch.mm
  4. Test with different matrix sizes

Goal: Master Triton’s auto-tuning and matrix operations
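
The host-side wrapper is not part of the skeleton above; one common pattern, following the structure of the upstream Triton matmul tutorial (names here are assumptions), launches one program per output tile:

def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    K2, N = b.shape
    assert K == K2, "inner dimensions must match"
    c = torch.empty((M, N), device=a.device, dtype=a.dtype)
    # One program instance per BLOCK_SIZE_M x BLOCK_SIZE_N output tile;
    # the autotuner supplies the BLOCK_SIZE_* constexpr values.
    grid = lambda META: (triton.cdiv(M, META['BLOCK_SIZE_M']) *
                         triton.cdiv(N, META['BLOCK_SIZE_N']),)
    matmul_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
    )
    return c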

Exercise 6.2: Advanced Matrix Operations (90 min)

Task: Implement in Triton:

  • Matrix transpose
  • Element-wise matrix operations (add, multiply)
  • Softmax across matrix rows

Goal: Build library of common matrix operations

Exercise 6.3: Performance Comparison (60 min)

Task: Compare your Triton implementations against:

  • PyTorch operations
  • cuBLAS (if available)
  • Your CUDA implementations from previous weeks

Goal: Understand Triton’s performance characteristics


Week 7: CUDA Streams and Concurrency

📚 Required Readings & Videos (5-6 hours)

  1. CUDA Streams Tutorial (1.5 hours)
  2. Overlapping Computation and Data Transfer (2 hours)
  3. Asynchronous Memory Transfers (1.5 hours)
  4. FreeCodeCamp CUDA Course - Hours 8-10 (2 hours)

🎯 What You Should Learn

  • CUDA Streams: Creating and managing multiple execution streams
  • Asynchronous Operations: Non-blocking memory transfers and kernel launches
  • Event Timing: Precise performance measurement with CUDA events
  • Memory Pinning: Page-locked memory for faster transfers
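
The core overlap pattern the readings describe looks roughly like this (a sketch: myKernel, chunkSize, and the streams array are placeholders, and h_data is assumed to be pinned with cudaMallocHost so the async copies can actually overlap):

for (int i = 0; i < nStreams; ++i) {
    int offset = i * chunkSize;
    cudaMemcpyAsync(d_data + offset, h_data + offset, chunkSize * sizeof(float),
                    cudaMemcpyHostToDevice, streams[i]);
    myKernel<<<blocks, threads, 0, streams[i]>>>(d_data + offset, chunkSize);
    cudaMemcpyAsync(h_data + offset, d_data + offset, chunkSize * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[i]);
}
cudaDeviceSynchronize();   // wait for every stream before using the results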

💬 Discussion Points for Group Meeting

  1. “When does stream-based concurrency provide significant benefits?”
  2. “How do we balance memory bandwidth with computational throughput?”
  3. “What are the challenges of debugging asynchronous GPU code?”
  4. “How do streams relate to real-world data processing pipelines?”

🛠️ Exercises

Exercise 7.1: Stream Overlap (ECE408 MP3 Style) (120 min)

File: stream_overlap.cu

void processDataWithStreams(float *h_data, int N, int nStreams) {
    // TODO: Create multiple streams
    // TODO: Process data in chunks with overlapped H2D, kernel, D2H
    // TODO: Synchronize and measure total time
}
 
void processDataSynchronous(float *h_data, int N) {
    // TODO: Process all data synchronously for comparison
}

Tasks:

  1. Implement streaming data processing
  2. Compare with synchronous version
  3. Experiment with different numbers of streams
  4. Measure memory transfer vs computation overlap

Goal: Master asynchronous GPU programming

Exercise 7.2: Event-based Timing (60 min)

Task: Create precise timing framework using CUDA events

class CudaTimer {
    cudaEvent_t start, stop;
public:
    void startTimer();
    float stopTimer();  // Returns elapsed time in milliseconds
};

Goal: Implement accurate GPU performance measurement
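
One possible implementation of the interface above (a sketch; error checking omitted for brevity):

class CudaTimer {
    cudaEvent_t start_, stop_;
public:
    CudaTimer()  { cudaEventCreate(&start_); cudaEventCreate(&stop_); }
    ~CudaTimer() { cudaEventDestroy(start_); cudaEventDestroy(stop_); }
    void startTimer() { cudaEventRecord(start_); }
    float stopTimer() {                       // elapsed milliseconds
        cudaEventRecord(stop_);
        cudaEventSynchronize(stop_);          // block until the stop event completes
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start_, stop_);
        return ms;
    }
};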

Exercise 7.3: Pipeline Processing (90 min)

Task: Implement a 3-stage pipeline:

  1. CPU preprocessing
  2. GPU computation
  3. CPU postprocessing

Use streams to overlap all three stages.

Goal: Build practical concurrent processing system


Week 8: Deep Learning Kernels

📚 Required Readings & Videos (6-7 hours)

  1. Flash Attention Paper - Sections 1-3 (2 hours)
  2. Triton Fused Attention Tutorial (2 hours)
  3. Understanding Flash Attention Implementation (1.5 hours)
  4. Custom PyTorch CUDA Extensions (1 hour)

🎯 What You Should Learn

  • Attention Mechanisms: Mathematical foundation and GPU challenges
  • Kernel Fusion: Combining multiple operations for efficiency
  • Memory-IO Awareness: Algorithm design for GPU memory hierarchy
  • PyTorch Integration: Creating custom operators
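
For reference, the two formulas the exercises revolve around:

  • Scaled dot-product attention: Attention(Q, K, V) = softmax(Q·Kᵀ / sqrt(d_k)) · V
  • Numerically stable softmax: softmax(x)_i = exp(x_i − max(x)) / Σ_j exp(x_j − max(x)), where subtracting the row maximum is exactly the "numerical stability" asked for in Exercise 8.1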

💬 Discussion Points for Group Meeting

  1. “Why is attention computation particularly challenging for GPUs?”
  2. “How does Flash Attention’s approach differ from naive implementations?”
  3. “What principles from Flash Attention apply to other algorithms?”
  4. “How do we balance mathematical precision with computational efficiency?”

🛠️ Exercises

Exercise 8.1: Attention Mechanism Components (150 min)

Task: Implement individual components in Triton:

  1. Softmax kernel with numerical stability
  2. Scaled dot-product attention
  3. Multi-head attention reshaping

File: attention_components.py

@triton.jit
def softmax_kernel(input_ptr, output_ptr, input_row_stride, output_row_stride, n_cols, BLOCK_SIZE: tl.constexpr):
    # TODO: Implement numerically stable softmax
    pass
 
@triton.jit
def attention_kernel(q_ptr, k_ptr, v_ptr, output_ptr, ...):
    # TODO: Implement basic attention computation
    pass

Goal: Build foundation for complex attention implementations

Exercise 8.2: Flash Attention Implementation (180 min)

Task: Implement simplified Flash Attention algorithm

  • Focus on forward pass only
  • Use block-wise computation
  • Compare memory usage vs standard attention

Goal: Understand memory-efficient algorithm design

Exercise 8.3: PyTorch Integration (90 min)

Task: Create PyTorch-compatible operators

class TritonAttention(torch.autograd.Function):
    @staticmethod
    def forward(ctx, q, k, v):
        # TODO: Call Triton kernel
        pass
    
    @staticmethod
    def backward(ctx, grad_output):
        # TODO: Implement backward pass
        pass

Goal: Create production-ready custom operators


Week 9: Profiling and Optimization

📚 Required Readings & Videos (5-6 hours)

  1. Nsight Systems User Guide - Chapters 1-3 (2 hours)
  2. Nsight Compute User Guide - Chapters 1-2 (1.5 hours)
  3. CUDA Profiling Best Practices (1 hour)
  4. Performance Optimization Guide - Chapters 5-7 (2 hours)

🎯 What You Should Learn

  • Profiling Tools: Nsight Systems, Nsight Compute, nvidia-smi
  • Performance Metrics: Occupancy, bandwidth utilization, instruction throughput
  • Bottleneck Identification: Memory-bound vs compute-bound analysis
  • Optimization Strategies: Systematic performance improvement
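
Typical command lines for the two Nsight tools (assuming both are on your PATH; output file names are arbitrary):

# Timeline / system-level view: streams, transfers, kernel launches
nsys profile --stats=true -o timeline ./my_app

# Per-kernel hardware counters: occupancy, memory throughput, etc.
ncu --set full -o kernel_report ./my_app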

💬 Discussion Points for Group Meeting

  1. “How do we systematically identify performance bottlenecks?”
  2. “What metrics are most important for different types of kernels?”
  3. “How do we balance development time with optimization effort?”
  4. “What are common performance pitfalls and how to avoid them?”

🛠️ Exercises

Exercise 9.1: Profiling Workshop (120 min)

Provided: Intentionally suboptimal kernels with various issues:

  • Non-coalesced memory access
  • Low occupancy
  • Bank conflicts
  • Unnecessary synchronization

Tasks:

  1. Profile each kernel with Nsight Compute
  2. Identify the primary bottleneck
  3. Propose optimization strategy
  4. Implement and measure improvement

Goal: Develop systematic optimization methodology

Exercise 9.2: Occupancy Analysis (90 min)

Task: Analyze occupancy for your previous kernels

// Use the CUDA occupancy API: for a given block size it reports how many
// blocks of this kernel can be resident on one SM at the same time.
int numBlocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSM, kernel, blockSize, 0);

// Theoretical occupancy = (numBlocksPerSM * blockSize) / maxThreadsPerMultiProcessor
// Compare with the achieved occupancy reported by the profiler

Goal: Understand occupancy optimization

Exercise 9.3: Memory Bandwidth Optimization (90 min)

Task: Optimize memory-bound kernels to achieve >80% peak bandwidth

  • Use profiler to measure achieved bandwidth
  • Apply coalescing optimizations
  • Consider cache behavior

Goal: Master memory optimization techniques


Week 10: Advanced Optimization Techniques

📚 Required Readings & Videos (5-6 hours)

  1. Tensor Core Programming (1.5 hours)
  2. Advanced CUDA Optimization (1 hour)
  3. Loop Unrolling and ILP (1 hour)
  4. CUTLASS Library Examples - Browse examples (2 hours)

🎯 What You Should Learn

  • Tensor Cores: Specialized matrix computation units
  • Instruction-Level Parallelism: Loop unrolling, register optimization
  • Cache Optimization: L1/L2 cache utilization strategies
  • Architecture-Specific: Optimizing for different GPU generations

💬 Discussion Points for Group Meeting

  1. “When is the complexity of Tensor Core programming worthwhile?”
  2. “How do we balance code readability with optimization level?”
  3. “What optimization techniques translate across GPU architectures?”
  4. “How do we future-proof our optimization strategies?”

🛠️ Exercises

Exercise 10.1: Tensor Core GEMM (180 min)

Task: Implement matrix multiplication using Tensor Cores

#include <mma.h>
using namespace nvcuda;
 
__global__ void tensor_core_gemm(half *A, half *B, float *C, int M, int N, int K) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    
    // TODO: Implement Tensor Core computation
}

Goal: Utilize specialized hardware for maximum performance

Exercise 10.2: Register Optimization (90 min)

Task: Optimize kernel for register usage

  • Minimize register spilling
  • Use __launch_bounds__ directive
  • Measure occupancy impact

Goal: Understand register pressure optimization
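
A minimal example of the __launch_bounds__ directive mentioned above (the numbers are placeholders to experiment with, not recommendations):

// Ask the compiler to cap register use so that blocks of up to 256 threads
// can run with at least 2 resident blocks per SM.
__global__ void __launch_bounds__(256, 2)
scaleKernel(float *data, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) data[tid] *= 2.0f;
}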

Exercise 10.3: Multi-GPU Scaling (120 min)

Task: Scale your best kernel across multiple GPUs

  • Implement data distribution
  • Handle inter-GPU communication
  • Measure scaling efficiency

Goal: Understand distributed GPU computing


Week 11: Production Integration

📚 Required Readings & Videos (4-5 hours)

  1. PyTorch C++ Extensions Guide (2 hours)
  2. CUDA Error Handling Best Practices (1 hour)
  3. Docker GPU Support (1 hour)
  4. Testing GPU Code (1 hour)

🎯 What You Should Learn

  • Integration Patterns: Making kernels production-ready
  • Error Handling: Robust error checking and recovery
  • Testing Strategies: Unit testing GPU code
  • Deployment: Containerization and version management

💬 Discussion Points for Group Meeting

  1. “What makes GPU code ‘production-ready’ vs research code?”
  2. “How do we handle GPU errors gracefully in production systems?”
  3. “What testing strategies work best for GPU kernels?”
  4. “How do we manage GPU code dependencies and versions?”

🛠️ Exercises

Exercise 11.1: PyTorch Extension (120 min)

Task: Package your best kernels as PyTorch extensions

# setup.py
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="my_cuda_ops",
    ext_modules=[
        CUDAExtension(
            "my_cuda_ops",
            ["src/cuda_ops.cpp", "src/kernels.cu"],
            # TODO: Add extra compiler/linker flags as needed
        ),
    ],
    cmdclass={"build_ext": BuildExtension},
)

Goal: Create installable GPU operation packages

Exercise 11.2: Error Handling Framework (90 min)

Task: Implement comprehensive error handling

#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            /* TODO: Implement proper error handling */ \
        } \
    } while(0)

Goal: Build robust GPU applications

Exercise 11.3: Performance Testing Suite (90 min)

Task: Create automated performance testing

  • Benchmark against baselines
  • Test across different input sizes
  • Validate numerical accuracy
  • Generate performance reports

Goal: Ensure production performance standards
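
A sketch of what one correctness test could look like (pytest style; my_cuda_ops.vector_add is a hypothetical name for an op packaged in Exercise 11.1):

import pytest
import torch
# import my_cuda_ops   # hypothetical extension from Exercise 11.1

@pytest.mark.parametrize("n", [1_000, 100_000, 1_000_000])
def test_vector_add_matches_torch(n):
    x = torch.randn(n, device="cuda")
    y = torch.randn(n, device="cuda")
    expected = x + y
    # actual = my_cuda_ops.vector_add(x, y)   # swap in the real op once built
    actual = expected.clone()                 # placeholder so the test runs as-is
    assert torch.allclose(actual, expected, rtol=1e-5, atol=1e-6)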


Week 12: Capstone Projects

📚 Required Readings & Videos (4-5 hours)

  1. Recent GPU Computing Papers - Choose 2 recent papers in your interest area (3 hours)
  2. Future of GPU Computing (1 hour)
  3. Review all previous week materials for project planning (1 hour)

🎯 What You Should Learn

  • Project Planning: Scoping GPU computing projects
  • Research Application: Applying techniques to novel problems
  • Performance Analysis: Comprehensive evaluation methodology
  • Knowledge Transfer: Teaching others your implementations

💬 Discussion Points for Group Meeting

  1. “What GPU computing trends do you see emerging?”
  2. “How can we apply what we’ve learned to our company’s problems?”
  3. “What areas need further study for production deployment?”
  4. “How do we stay current with rapidly evolving GPU technology?”

🛠️ Final Projects (Choose One)

Project Option A: Custom Deep Learning Operator (16+ hours)

Goal: Implement a complete custom operator for a specific neural network layer
Requirements:

  • Forward and backward passes
  • PyTorch integration with autograd
  • Performance benchmarking vs existing implementations
  • Comprehensive testing

Example operators:

  • Grouped convolution
  • Attention variant (sparse, local, etc.)
  • Custom activation functions
  • Layer normalization variants

Project Option B: Scientific Computing Kernel (16+ hours)

Goal: Solve a real scientific computing problem with GPU acceleration
Requirements:

  • Problem analysis and algorithm design
  • CUDA implementation with optimization
  • Comparison with CPU implementation
  • Scaling analysis

Example problems:

  • N-body simulation
  • Finite difference PDE solver
  • Monte Carlo simulation
  • Image processing pipeline

Project Option C: Performance Library (16+ hours)

Goal: Create a library of optimized GPU operations
Requirements:

  • Multiple related operations (e.g., all BLAS Level 1 operations)
  • Consistent API design
  • Comprehensive benchmarking
  • Documentation and examples

Project Option D: GPU Computing Education Tool (16+ hours)

Goal: Create a tool to help others learn GPU programming
Requirements:

  • Interactive visualization of GPU concepts
  • Code examples with explanations
  • Performance demonstration
  • User-friendly interface

📊 Final Presentations (Week 12, Day 2)

Format: 15-minute presentations + 5 minutes Q&A
Content:

  1. Problem statement and approach (3 min)
  2. Implementation details and challenges (5 min)
  3. Performance analysis and results (4 min)
  4. Lessons learned and future work (3 min)

🏁 Completion Checklist

By End of Week 12, You Should Be Able To:

  • Write efficient CUDA kernels from scratch
  • Implement custom operations in Triton
  • Profile and optimize GPU code systematically
  • Integrate GPU kernels with PyTorch
  • Debug GPU applications effectively
  • Design memory-efficient algorithms
  • Apply parallel computing principles to new problems
  • Evaluate GPU vs CPU trade-offs for different tasks

Portfolio Projects:

  • Working CUDA vector addition with optimization
  • Triton matrix multiplication with auto-tuning
  • Memory-optimized matrix operations
  • Parallel reduction implementations
  • Streaming data processing pipeline
  • Custom attention mechanism
  • Production-ready PyTorch extension
  • Comprehensive final project

📈 Success Metrics

Weekly Assessment:

  • Exercises Completion (60%): All exercises working and optimized
  • Discussion Participation (20%): Active engagement in meetings
  • Code Quality (20%): Clean, well-documented, efficient code

Progress Milestones:

  • Week 4: Basic CUDA proficiency
  • Week 6: Triton competency
  • Week 8: Advanced optimization skills
  • Week 12: Production-ready implementations

Learning Outcomes Verification:

  • Can implement complex algorithms in both CUDA and Triton
  • Demonstrates understanding of GPU architecture implications
  • Shows ability to optimize code systematically
  • Creates production-quality, testable GPU applications

📚 Resource Library

Essential References:

  • NVIDIA CUDA Programming Guide
  • Triton Language Documentation
  • “Programming Massively Parallel Processors” by Kirk & Hwu

Tools Required:

  • CUDA Toolkit 12.0+
  • Python 3.8+ with PyTorch
  • Nsight Systems & Compute
  • Git for version control

Community Support:

  • NVIDIA Developer Forums
  • GPU MODE Discord
  • PyTorch Discussion Forums
  • Stack Overflow cuda/triton tags

This detailed plan provides explicit materials, clear learning objectives, and targeted exercises for each week. The progression from basic concepts to production-ready implementations ensures practical competency in GPU programming with both CUDA and Triton.