Triton & CUDA Study Group Plan (Restructured)
12-Week Structured Learning Program with Explicit Materials & Exercises
📋 Overview
This structured study plan is designed for software engineers with GPU knowledge but no CUDA/Triton experience. Each week includes specific readings, videos, learning objectives with discussion points, and targeted exercises.
Target Audience
- Software engineers with GPU conceptual knowledge
- No prior CUDA or Triton experience required
- Comfortable with C/C++ and Python
- Goal: Practical GPU programming competency
Week 1: GPU Architecture & CUDA Setup
📚 Required Readings & Videos (4-5 hours)
- NVIDIA CUDA Programming Guide - Chapter 1 (1 hour)
- An Even Easier Introduction to CUDA (30 min)
- GPU Architecture Basics - Video (45 min)
- FreeCodeCamp CUDA Course - First 2 hours (2 hours)
🎯 What You Should Learn
- GPU vs CPU Architecture: SIMT vs SIMD execution model, memory hierarchy differences
- CUDA Toolkit Components: nvcc compiler, runtime API, driver API
- Basic Terminology: Host/device, kernel, thread hierarchy
- Development Environment: Setting up CUDA development workflow
💬 Discussion Points for Group Meeting
- “Why do GPUs excel at parallel tasks that CPUs struggle with?”
- “What are the trade-offs between GPU and CPU for different algorithms?”
- “How does the SIMT execution model impact how we write algorithms?”
- “What development challenges do you anticipate with GPU programming?”
🛠️ Exercises
Exercise 1.1: Environment Setup (30 min)
# Verify CUDA installation
nvcc --version
nvidia-smi
# Compile and run the deviceQuery sample
# (CUDA 11.6+ no longer bundles the samples; clone
#  https://github.com/NVIDIA/cuda-samples if this path is missing)
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
make
./deviceQuery
Goal: Confirm working CUDA environment and understand GPU specifications
Exercise 1.2: Hello CUDA (45 min)
File: hello_cuda.cu
#include <cuda_runtime.h>
#include <stdio.h>
__global__ void helloFromGPU() {
printf("Hello from GPU thread %d in block %d\n",
threadIdx.x, blockIdx.x);
}
int main() {
printf("Hello from CPU\n");
helloFromGPU<<<2, 4>>>();
cudaDeviceSynchronize();
return 0;
}
Task: Modify to use different grid/block configurations and observe output patterns
Goal: Understand basic kernel launch syntax and thread indexing
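As a reference for those experiments, here is a minimal sketch of how a flat global thread index is typically derived from the block and thread indices; the helloWithGlobalId kernel name is illustrative, not part of the exercise files.
#include <cstdio>

__global__ void helloWithGlobalId() {
    // Global index = block offset + thread offset within the block
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Global thread %d (block %d, thread %d)\n",
           globalId, blockIdx.x, threadIdx.x);
}

// Example launches with the same total thread count but different shapes:
//   helloWithGlobalId<<<4, 2>>>();  // 4 blocks of 2 threads
//   helloWithGlobalId<<<1, 8>>>();  // 1 block of 8 threads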
Exercise 1.3: GPU Memory Investigation (30 min)
Task: Use the deviceQuery output to answer:
- What is your GPU’s compute capability?
- How much global memory does it have?
- How many multiprocessors?
- What is the maximum threads per block?
Goal: Connect hardware specifications to programming limitations
Week 2: CUDA Fundamentals - Memory & Kernels
📚 Required Readings & Videos (4-5 hours)
- CUDA Programming Guide - Chapter 2 (1.5 hours)
- CUDA Memory Management (45 min)
- FreeCodeCamp CUDA Course - Hours 2-4 (2 hours)
- Understanding CUDA Grid and Block Dimensions (30 min)
🎯 What You Should Learn
- Thread Hierarchy: Grids, blocks, threads, and their indexing
- Memory Management: cudaMalloc, cudaMemcpy, cudaFree
- Kernel Syntax: __global__, __device__, __host__ qualifiers
- Error Handling: cudaError_t and proper error checking (see the sketch after this list)
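The following is a minimal sketch of the allocate → copy → launch → copy → free workflow with error checking. The checkCuda helper and the addOne kernel are illustrative names, not part of the exercises.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Illustrative helper: abort with a readable message if a CUDA call fails
static void checkCuda(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);

    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < N; ++i) h_data[i] = (float)i;

    float *d_data = nullptr;
    checkCuda(cudaMalloc(&d_data, bytes), "cudaMalloc");
    checkCuda(cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice), "cudaMemcpy H2D");

    int block = 256;
    int grid = (N + block - 1) / block;
    addOne<<<grid, block>>>(d_data, N);
    checkCuda(cudaGetLastError(), "kernel launch");

    checkCuda(cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost), "cudaMemcpy D2H");
    checkCuda(cudaFree(d_data), "cudaFree");
    free(h_data);
    return 0;
}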
💬 Discussion Points for Group Meeting
- “Why is the thread hierarchy organized as grid→block→thread instead of a flat structure?”
- “What are the advantages and disadvantages of explicit memory management?”
- “How do we determine optimal grid and block dimensions for a given problem?”
- “What patterns do you see between problem decomposition and CUDA threading model?”
🛠️ Exercises
Exercise 2.1: Vector Addition (ECE408 MP0 Style) (90 min)
File: vector_add.cu
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
__global__ void vectorAdd(float *A, float *B, float *C, int N) {
// TODO: Implement parallel vector addition
// Each thread should compute one element
}
int main() {
int N = 10000;
size_t size = N * sizeof(float);
// TODO: Allocate host memory
// TODO: Initialize input arrays
// TODO: Allocate device memory
// TODO: Copy input data to device
// TODO: Launch kernel
// TODO: Copy result back to host
// TODO: Verify result and cleanup
return 0;
}
Tasks:
- Complete the implementation
- Add proper error checking
- Test with different vector sizes (1K, 10K, 1M elements)
- Experiment with different block sizes (32, 128, 256, 512)
Goal: Master basic CUDA workflow and understand performance implications
Exercise 2.2: Matrix Addition (60 min)
Task: Extend vector addition to 2D matrices using 2D grid/block configuration
__global__ void matrixAdd(float *A, float *B, float *C, int rows, int cols) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
// TODO: Complete implementation with bounds checking
}
Goal: Understand 2D indexing and bounds checking
Exercise 2.3: Memory Transfer Timing (45 min)
Task: Measure and compare:
- CPU computation time
- GPU computation time (kernel only)
- GPU total time (including memory transfers)
- Memory transfer time (host↔device)
Goal: Understand when GPU acceleration is beneficial
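One simple way to take the GPU-side measurements is with CUDA events (covered in more depth in Week 7). A sketch of timing the kernel alone, assuming the vectorAdd setup from Exercise 2.1:
// Fragment; assumes d_A, d_B, d_C, N, grid and block are set up as in Exercise 2.1
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
vectorAdd<<<grid, block>>>(d_A, d_B, d_C, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);                    // wait until the kernel has finished

float kernelMs = 0.0f;
cudaEventElapsedTime(&kernelMs, start, stop);  // elapsed time in milliseconds

// For "GPU total time", record start before the H2D copies and stop
// after the D2H copy instead of around the kernel alone.
cudaEventDestroy(start);
cudaEventDestroy(stop);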
Week 3: Introduction to Triton
📚 Required Readings & Videos (4-5 hours)
- OpenAI Triton Introduction (45 min)
- Triton Vector Addition Tutorial (1 hour)
- Understanding Triton Tutorials Part 1 (45 min)
- Getting Started with Triton Tutorial (1.5 hours)
- Triton vs CUDA Comparison (30 min)
🎯 What You Should Learn
- Triton’s Philosophy: Block-level programming vs thread-level
- Python Integration: @triton.jit decorator and kernel compilation
- Memory Model: Block pointers and automatic memory optimization
- Auto-tuning: Basic concepts of performance optimization
💬 Discussion Points for Group Meeting
- “How does Triton’s block-level approach simplify GPU programming compared to CUDA?”
- “What are the trade-offs between Triton’s abstraction and CUDA’s explicit control?”
- “When would you choose Triton over CUDA and vice versa?”
- “How does Triton’s auto-tuning compare to manual optimization in CUDA?”
🛠️ Exercises
Exercise 3.1: Triton Vector Addition (60 min)
File: triton_vector_add.py
import torch
import triton
import triton.language as tl
@triton.jit
def vector_add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
# TODO: Implement vector addition in Triton
# Use block-level operations
pass
def vector_add(x: torch.Tensor, y: torch.Tensor):
# TODO: Set up kernel launch parameters
# TODO: Call the kernel
pass
# Test and benchmark
if __name__ == "__main__":
# TODO: Create test tensors and verify correctness
pass
Tasks:
- Complete the Triton implementation
- Compare performance with PyTorch’s native add
- Experiment with different BLOCK_SIZE values
- Verify numerical correctness
Goal: Understand Triton’s programming model and syntax
Exercise 3.2: Element-wise Operations (45 min)
Task: Implement in Triton:
- ReLU activation: output = max(0, x)
- Squared operation: output = x^2
- GELU approximation: output = x * sigmoid(1.702 * x)
Goal: Practice Triton’s mathematical operations and function composition
Exercise 3.3: CUDA vs Triton Comparison (60 min)
Task: Implement the same vector addition in both CUDA and Triton, then compare:
- Lines of code
- Development time
- Performance
- Ease of debugging
Goal: Understand practical differences between the approaches
Week 4: CUDA Memory Optimization
📚 Required Readings & Videos (5-6 hours)
- CUDA Memory Model (2 hours)
- Memory Coalescing Tutorial (1 hour)
- Shared Memory Programming (1.5 hours)
- FreeCodeCamp CUDA Course - Hours 4-6 (2 hours)
🎯 What You Should Learn
- Memory Hierarchy: Global, shared, constant, texture memory characteristics
- Coalescing Rules: How to achieve efficient memory access patterns
- Shared Memory: Bank conflicts, padding, synchronization
- Memory Profiling: Using nvidia-smi and basic profiling techniques
💬 Discussion Points for Group Meeting
- “Why is memory bandwidth often the bottleneck in GPU applications?”
- “How do coalescing rules relate to the underlying hardware architecture?”
- “When is the complexity of shared memory optimization worth the effort?”
- “What patterns can we identify for memory-bound vs compute-bound kernels?”
🛠️ Exercises
Exercise 4.1: Coalescing Analysis (ECE408 MP1 Style) (90 min)
File: coalescing_test.cu
// Test different memory access patterns
__global__ void strided_access(float *data, int stride, int N) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < N) {
data[tid * stride] = tid; // Strided access (data must hold at least N * stride elements)
}
}
__global__ void coalesced_access(float *data, int N) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < N) {
data[tid] = tid; // Coalesced access
}
}
Tasks:
- Implement timing for both kernels
- Test with strides of 1, 2, 4, 8, 16, 32
- Graph the performance vs stride
- Explain the performance pattern
Goal: Understand coalescing impact on performance
Exercise 4.2: Matrix Transpose with Shared Memory (120 min)
File: matrix_transpose.cu
#define TILE_SIZE 32
__global__ void transpose_naive(float *input, float *output, int N) {
// TODO: Naive implementation (non-coalesced writes)
}
__global__ void transpose_shared(float *input, float *output, int N) {
__shared__ float tile[TILE_SIZE][TILE_SIZE + 1]; // +1 to avoid bank conflicts
// TODO: Shared memory implementation
}
Tasks:
- Implement both versions
- Measure and compare performance
- Verify correctness
- Experiment with different tile sizes
Goal: Master shared memory programming and bank conflict avoidance
Exercise 4.3: Memory Bandwidth Analysis (45 min)
Task: Calculate theoretical vs achieved memory bandwidth for your kernels
// Bandwidth = (bytes_read + bytes_written) / time_in_seconds
// Compare against GPU's peak memory bandwidth from deviceQuery
Goal: Understand memory efficiency metrics
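A small host-side sketch of that calculation, assuming kernelMs from CUDA-event timing and a kernel that reads and writes N floats:
// Effective bandwidth for a kernel that reads N floats and writes N floats
double bytesMoved = 2.0 * N * sizeof(float);   // bytes read + bytes written
double seconds    = kernelMs / 1000.0;
double gbPerSec   = bytesMoved / seconds / 1e9;
printf("Effective bandwidth: %.1f GB/s\n", gbPerSec);
// Compare gbPerSec against the peak memory bandwidth reported by deviceQuery.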
Week 5: Advanced CUDA - Reductions
📚 Required Readings & Videos (5-6 hours)
- NVIDIA Reduction Examples (2 hours)
- Parallel Reduction Patterns (1.5 hours)
- Warp Shuffle Operations (1 hour)
- FreeCodeCamp CUDA Course - Hours 6-8 (2 hours)
🎯 What You Should Learn
- Reduction Algorithms: Tree-based reductions, warp-level optimizations
- Synchronization: __syncthreads(), warp-synchronous programming
- Warp Primitives: __shfl_down_sync(), __shfl_xor_sync() (see the sketch after this list)
- Multiple Blocks: Handling reductions across grid dimensions
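To make the warp-primitive item concrete, here is a sketch of a warp-level sum using __shfl_down_sync; the warpReduceSum name is illustrative.
// Sum the values held by the 32 threads of a warp; after the loop,
// lane 0 holds the warp's total.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;
}
A block-level reduction typically combines this with a shared-memory step that gathers one partial sum per warp.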
💬 Discussion Points for Group Meeting
- “Why are reductions fundamentally challenging to parallelize efficiently?”
- “How do warp-level primitives change the performance characteristics?”
- “What are the trade-offs between different reduction strategies?”
- “When does it make sense to use multiple kernel launches vs single kernel?”
🛠️ Exercises
Exercise 5.1: Basic Reduction (ECE408 MP2 Style) (120 min)
File: reduction.cu
__global__ void reduce_sum_naive(float *input, float *output, int N) {
// TODO: Implement tree-based reduction in shared memory
}
__global__ void reduce_sum_optimized(float *input, float *output, int N) {
// TODO: Add warp-level optimizations
}
Tasks:
- Implement tree-based reduction
- Add warp shuffle optimizations
- Handle arbitrary input sizes
- Compare with CPU reduction
Goal: Master reduction algorithms and warp-level programming
Exercise 5.2: Multiple Reductions (60 min)
Task: Implement kernels to find:
- Sum, max, min of array
- Mean and variance in single pass
- Histogram with atomic operations
Goal: Apply reduction concepts to different operations
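For the histogram item above, a minimal sketch of the atomic-update pattern; the bin count and kernel name are placeholders.
#define NUM_BINS 256

// Each thread classifies one element and atomically bumps its bin;
// bins holds NUM_BINS counters and must be zeroed (e.g. cudaMemset) before launch.
__global__ void histogramKernel(const unsigned char *input, unsigned int *bins, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&bins[input[i]], 1u);
    }
}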
Exercise 5.3: Performance Analysis (45 min)
Task: Profile your reduction implementations and compare against:
- cuBLAS cublasSasum
- Thrust thrust::reduce
- CPU single-threaded version
Goal: Understand production-quality implementation performance
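A sketch of how the two library baselines might be invoked (host-code fragment, error handling omitted; d_data is assumed to be a device array of N floats from the earlier exercises):
#include <cublas_v2.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

// cuBLAS: sum of absolute values (equivalent to a sum reduction for non-negative data)
cublasHandle_t handle;
cublasCreate(&handle);
float asumResult = 0.0f;
cublasSasum(handle, N, d_data, 1, &asumResult);
cublasDestroy(handle);

// Thrust: plain sum reduction over the device array
float thrustResult = thrust::reduce(thrust::device, d_data, d_data + N, 0.0f);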
Week 6: Triton Matrix Operations
📚 Required Readings & Videos (5-6 hours)
- Triton Matrix Multiplication Tutorial (2 hours)
- Matrix Multiplication Optimization Guide (1.5 hours)
- Triton Auto-tuning Documentation (1 hour)
- Triton Kernel Compilation Stages (1 hour)
🎯 What You Should Learn
- Block-level Matrix Multiplication: Tiling strategies in Triton
- Auto-tuning: @triton.autotune decorator usage
- Memory Efficiency: Triton’s automatic memory optimizations
- Integration: Calling Triton kernels from PyTorch
💬 Discussion Points for Group Meeting
- “How does Triton’s auto-tuning compare to manual CUDA optimization?”
- “What are the advantages of block-level thinking for matrix operations?”
- “How does Triton handle memory coalescing automatically?”
- “When might Triton’s optimizations be suboptimal compared to hand-tuned CUDA?”
🛠️ Exercises
Exercise 6.1: Triton Matrix Multiplication (150 min)
File: triton_matmul.py
@triton.autotune(
configs=[
triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 64}, num_stages=3, num_warps=8),
# TODO: Add more configurations
],
key=['M', 'N', 'K'],
)
@triton.jit
def matmul_kernel(
a_ptr, b_ptr, c_ptr,
M, N, K,
stride_am, stride_ak,
stride_bk, stride_bn,
stride_cm, stride_cn,
BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr, BLOCK_SIZE_K: tl.constexpr,
):
# TODO: Implement blocked matrix multiplication
pass
Tasks:
- Complete the matrix multiplication implementation
- Add multiple auto-tune configurations
- Benchmark against PyTorch’s torch.mm
- Test with different matrix sizes
Goal: Master Triton’s auto-tuning and matrix operations
Exercise 6.2: Advanced Matrix Operations (90 min)
Task: Implement in Triton:
- Matrix transpose
- Element-wise matrix operations (add, multiply)
- Softmax across matrix rows
Goal: Build library of common matrix operations
Exercise 6.3: Performance Comparison (60 min)
Task: Compare your Triton implementations against:
- PyTorch operations
- cuBLAS (if available)
- Your CUDA implementations from previous weeks
Goal: Understand Triton’s performance characteristics
Week 7: CUDA Streams and Concurrency
📚 Required Readings & Videos (5-6 hours)
- CUDA Streams Tutorial (1.5 hours)
- Overlapping Computation and Data Transfer (2 hours)
- Asynchronous Memory Transfers (1.5 hours)
- FreeCodeCamp CUDA Course - Hours 8-10 (2 hours)
🎯 What You Should Learn
- CUDA Streams: Creating and managing multiple execution streams
- Asynchronous Operations: Non-blocking memory transfers and kernel launches
- Event Timing: Precise performance measurement with CUDA events
- Memory Pinning: Page-locked memory for faster transfers
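A compact sketch of the chunked, overlapped pattern the readings describe: pinned host memory plus per-chunk asynchronous copies and launches on separate streams. Here myKernel is a stand-in for your kernel and d_data is a matching device buffer.
// Pinned (page-locked) host buffer: required for truly asynchronous copies
float *h_data;
cudaMallocHost((void **)&h_data, N * sizeof(float));

cudaStream_t streams[4];
for (int s = 0; s < 4; ++s) cudaStreamCreate(&streams[s]);

int chunk = N / 4;
for (int s = 0; s < 4; ++s) {
    int offset = s * chunk;
    cudaMemcpyAsync(d_data + offset, h_data + offset, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[s]);
    myKernel<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, chunk);
    cudaMemcpyAsync(h_data + offset, d_data + offset, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[s]);
}
cudaDeviceSynchronize();   // or cudaStreamSynchronize per stream
for (int s = 0; s < 4; ++s) cudaStreamDestroy(streams[s]);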
💬 Discussion Points for Group Meeting
- “When does stream-based concurrency provide significant benefits?”
- “How do we balance memory bandwidth with computational throughput?”
- “What are the challenges of debugging asynchronous GPU code?”
- “How do streams relate to real-world data processing pipelines?”
🛠️ Exercises
Exercise 7.1: Stream Overlap (ECE408 MP3 Style) (120 min)
File: stream_overlap.cu
void processDataWithStreams(float *h_data, int N, int nStreams) {
// TODO: Create multiple streams
// TODO: Process data in chunks with overlapped H2D, kernel, D2H
// TODO: Synchronize and measure total time
}
void processDataSynchronous(float *h_data, int N) {
// TODO: Process all data synchronously for comparison
}
Tasks:
- Implement streaming data processing
- Compare with synchronous version
- Experiment with different numbers of streams
- Measure memory transfer vs computation overlap
Goal: Master asynchronous GPU programming
Exercise 7.2: Event-based Timing (60 min)
Task: Create precise timing framework using CUDA events
class CudaTimer {
cudaEvent_t start, stop;
public:
void startTimer();
float stopTimer(); // Returns elapsed time in milliseconds
};
Goal: Implement accurate GPU performance measurement
Exercise 7.3: Pipeline Processing (90 min)
Task: Implement a 3-stage pipeline:
- CPU preprocessing
- GPU computation
- CPU postprocessing
Use streams to overlap all three stages.
Goal: Build practical concurrent processing system
Week 8: Deep Learning Kernels
📚 Required Readings & Videos (6-7 hours)
- Flash Attention Paper - Sections 1-3 (2 hours)
- Triton Fused Attention Tutorial (2 hours)
- Understanding Flash Attention Implementation (1.5 hours)
- Custom PyTorch CUDA Extensions (1 hour)
🎯 What You Should Learn
- Attention Mechanisms: Mathematical foundation and GPU challenges
- Kernel Fusion: Combining multiple operations for efficiency
- Memory-IO Awareness: Algorithm design for GPU memory hierarchy
- PyTorch Integration: Creating custom operators
💬 Discussion Points for Group Meeting
- “Why is attention computation particularly challenging for GPUs?”
- “How does Flash Attention’s approach differ from naive implementations?”
- “What principles from Flash Attention apply to other algorithms?”
- “How do we balance mathematical precision with computational efficiency?”
🛠️ Exercises
Exercise 8.1: Attention Mechanism Components (150 min)
Task: Implement individual components in Triton:
- Softmax kernel with numerical stability
- Scaled dot-product attention
- Multi-head attention reshaping
File: attention_components.py
@triton.jit
def softmax_kernel(input_ptr, output_ptr, input_row_stride, output_row_stride, n_cols, BLOCK_SIZE: tl.constexpr):
# TODO: Implement numerically stable softmax
pass
@triton.jit
def attention_kernel(q_ptr, k_ptr, v_ptr, output_ptr, ...):
# TODO: Implement basic attention computation
pass
Goal: Build foundation for complex attention implementations
Exercise 8.2: Flash Attention Implementation (180 min)
Task: Implement simplified Flash Attention algorithm
- Focus on forward pass only
- Use block-wise computation
- Compare memory usage vs standard attention
Goal: Understand memory-efficient algorithm design
Exercise 8.3: PyTorch Integration (90 min)
Task: Create PyTorch-compatible operators
class TritonAttention(torch.autograd.Function):
@staticmethod
def forward(ctx, q, k, v):
# TODO: Call Triton kernel
pass
@staticmethod
def backward(ctx, grad_output):
# TODO: Implement backward pass
pass
Goal: Create production-ready custom operators
Week 9: Profiling and Optimization
📚 Required Readings & Videos (5-6 hours)
- Nsight Systems User Guide - Chapters 1-3 (2 hours)
- Nsight Compute User Guide - Chapters 1-2 (1.5 hours)
- CUDA Profiling Best Practices (1 hour)
- Performance Optimization Guide - Chapters 5-7 (2 hours)
🎯 What You Should Learn
- Profiling Tools: Nsight Systems, Nsight Compute, nvidia-smi
- Performance Metrics: Occupancy, bandwidth utilization, instruction throughput
- Bottleneck Identification: Memory-bound vs compute-bound analysis
- Optimization Strategies: Systematic performance improvement
💬 Discussion Points for Group Meeting
- “How do we systematically identify performance bottlenecks?”
- “What metrics are most important for different types of kernels?”
- “How do we balance development time with optimization effort?”
- “What are common performance pitfalls and how to avoid them?”
🛠️ Exercises
Exercise 9.1: Profiling Workshop (120 min)
Provided: Intentionally suboptimal kernels with various issues:
- Non-coalesced memory access
- Low occupancy
- Bank conflicts
- Unnecessary synchronization
Tasks:
- Profile each kernel with Nsight Compute
- Identify the primary bottleneck
- Propose optimization strategy
- Implement and measure improvement
Goal: Develop systematic optimization methodology
Exercise 9.2: Occupancy Analysis (90 min)
Task: Analyze occupancy for your previous kernels
// Use CUDA occupancy API
int numBlocksPerSM;   // max resident blocks per SM for this kernel and block size
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSM, kernel, blockSize, 0);
// Calculate theoretical occupancy
// Compare with achieved occupancy from profiler
Goal: Understand occupancy optimization
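One possible sketch of the theoretical-occupancy calculation, assuming a kernel symbol named myKernel and a chosen block size:
int device = 0;
cudaGetDevice(&device);

int maxThreadsPerSM = 0;
cudaDeviceGetAttribute(&maxThreadsPerSM,
                       cudaDevAttrMaxThreadsPerMultiProcessor, device);

int blockSize = 256;
int numBlocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSM, myKernel, blockSize, 0);

float occupancy = (float)(numBlocksPerSM * blockSize) / maxThreadsPerSM;
printf("Theoretical occupancy: %.0f%%\n", occupancy * 100.0f);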
Exercise 9.3: Memory Bandwidth Optimization (90 min)
Task: Optimize memory-bound kernels to achieve >80% peak bandwidth
- Use profiler to measure achieved bandwidth
- Apply coalescing optimizations
- Consider cache behavior
Goal: Master memory optimization techniques
Week 10: Advanced Optimization Techniques
📚 Required Readings & Videos (5-6 hours)
- Tensor Core Programming (1.5 hours)
- Advanced CUDA Optimization (1 hour)
- Loop Unrolling and ILP (1 hour)
- CUTLASS Library Examples - Browse examples (2 hours)
🎯 What You Should Learn
- Tensor Cores: Specialized matrix computation units
- Instruction-Level Parallelism: Loop unrolling, register optimization
- Cache Optimization: L1/L2 cache utilization strategies
- Architecture-Specific: Optimizing for different GPU generations
💬 Discussion Points for Group Meeting
- “When is the complexity of Tensor Core programming worthwhile?”
- “How do we balance code readability with optimization level?”
- “What optimization techniques translate across GPU architectures?”
- “How do we future-proof our optimization strategies?”
🛠️ Exercises
Exercise 10.1: Tensor Core GEMM (180 min)
Task: Implement matrix multiplication using Tensor Cores
#include <mma.h>
using namespace nvcuda;
__global__ void tensor_core_gemm(half *A, half *B, float *C, int M, int N, int K) {
wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
// TODO: Implement Tensor Core computation
}
Goal: Utilize specialized hardware for maximum performance
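As a hint for the TODO, the core per-tile WMMA call sequence usually looks like the sketch below; aTile, bTile, cTile and the leading dimensions lda, ldb, ldc are placeholders whose offsets depend on your tiling scheme and data layouts.
// Accumulator starts at zero, then one mma per 16-wide K slice
wmma::fill_fragment(c_frag, 0.0f);
for (int k = 0; k < K; k += 16) {
    wmma::load_matrix_sync(a_frag, aTile, lda);   // current 16x16 tile of A
    wmma::load_matrix_sync(b_frag, bTile, ldb);   // current 16x16 tile of B
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
}
wmma::store_matrix_sync(cTile, c_frag, ldc, wmma::mem_row_major);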
Exercise 10.2: Register Optimization (90 min)
Task: Optimize kernel for register usage
- Minimize register spilling
- Use the __launch_bounds__ directive
- Measure occupancy impact
Goal: Understand register pressure optimization
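The __launch_bounds__ qualifier is attached to the kernel definition; a minimal usage sketch (myBoundedKernel is an illustrative name):
// Promise the compiler at most 256 threads per block and ask for at least
// 2 resident blocks per SM; this caps register usage per thread accordingly.
__global__ void __launch_bounds__(256, 2) myBoundedKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}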
Exercise 10.3: Multi-GPU Scaling (120 min)
Task: Scale your best kernel across multiple GPUs
- Implement data distribution
- Handle inter-GPU communication
- Measure scaling efficiency
Goal: Understand distributed GPU computing
Week 11: Production Integration
📚 Required Readings & Videos (4-5 hours)
- PyTorch C++ Extensions Guide (2 hours)
- CUDA Error Handling Best Practices (1 hour)
- Docker GPU Support (1 hour)
- Testing GPU Code (1 hour)
🎯 What You Should Learn
- Integration Patterns: Making kernels production-ready
- Error Handling: Robust error checking and recovery
- Testing Strategies: Unit testing GPU code
- Deployment: Containerization and version management
💬 Discussion Points for Group Meeting
- “What makes GPU code ‘production-ready’ vs research code?”
- “How do we handle GPU errors gracefully in production systems?”
- “What testing strategies work best for GPU kernels?”
- “How do we manage GPU code dependencies and versions?”
🛠️ Exercises
Exercise 11.1: PyTorch Extension (120 min)
Task: Package your best kernels as PyTorch extensions
# setup.py
from pybind11.setup_helpers import Pybind11Extension, build_ext
ext_modules = [
    Pybind11Extension(
        "my_cuda_ops",
        ["src/cuda_ops.cpp", "src/kernels.cu"],
        # TODO: Add proper build configuration
        # (hint: torch.utils.cpp_extension.CUDAExtension compiles .cu files with
        #  nvcc automatically, which a plain Pybind11Extension does not)
),
]
Goal: Create installable GPU operation packages
Exercise 11.2: Error Handling Framework (90 min)
Task: Implement comprehensive error handling
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            /* TODO: Implement proper error handling */ \
        } \
    } while (0)
Goal: Build robust GPU applications
Exercise 11.3: Performance Testing Suite (90 min)
Task: Create automated performance testing
- Benchmark against baselines
- Test across different input sizes
- Validate numerical accuracy
- Generate performance reports
Goal: Ensure production performance standards
Week 12: Capstone Projects
📚 Required Readings & Videos (4-5 hours)
- Recent GPU Computing Papers - Choose 2 recent papers in your interest area (3 hours)
- Future of GPU Computing (1 hour)
- Review all previous week materials for project planning (1 hour)
🎯 What You Should Learn
- Project Planning: Scoping GPU computing projects
- Research Application: Applying techniques to novel problems
- Performance Analysis: Comprehensive evaluation methodology
- Knowledge Transfer: Teaching others your implementations
💬 Discussion Points for Group Meeting
- “What GPU computing trends do you see emerging?”
- “How can we apply what we’ve learned to our company’s problems?”
- “What areas need further study for production deployment?”
- “How do we stay current with rapidly evolving GPU technology?”
🛠️ Final Projects (Choose One)
Project Option A: Custom Deep Learning Operator (16+ hours)
Goal: Implement a complete custom operator for a specific neural network layer
Requirements:
- Forward and backward passes
- PyTorch integration with autograd
- Performance benchmarking vs existing implementations
- Comprehensive testing
Example operators:
- Grouped convolution
- Attention variant (sparse, local, etc.)
- Custom activation functions
- Layer normalization variants
Project Option B: Scientific Computing Kernel (16+ hours)
Goal: Solve a real scientific computing problem with GPU acceleration
Requirements:
- Problem analysis and algorithm design
- CUDA implementation with optimization
- Comparison with CPU implementation
- Scaling analysis
Example problems:
- N-body simulation
- Finite difference PDE solver
- Monte Carlo simulation
- Image processing pipeline
Project Option C: Performance Library (16+ hours)
Goal: Create a library of optimized GPU operations
Requirements:
- Multiple related operations (e.g., all BLAS Level 1 operations)
- Consistent API design
- Comprehensive benchmarking
- Documentation and examples
Project Option D: GPU Computing Education Tool (16+ hours)
Goal: Create a tool to help others learn GPU programming
Requirements:
- Interactive visualization of GPU concepts
- Code examples with explanations
- Performance demonstration
- User-friendly interface
📊 Final Presentations (Week 12, Day 2)
Format: 15-minute presentations + 5 minutes Q&A
Content:
- Problem statement and approach (3 min)
- Implementation details and challenges (5 min)
- Performance analysis and results (4 min)
- Lessons learned and future work (3 min)
🏁 Completion Checklist
By End of Week 12, You Should Be Able To:
- Write efficient CUDA kernels from scratch
- Implement custom operations in Triton
- Profile and optimize GPU code systematically
- Integrate GPU kernels with PyTorch
- Debug GPU applications effectively
- Design memory-efficient algorithms
- Apply parallel computing principles to new problems
- Evaluate GPU vs CPU trade-offs for different tasks
Portfolio Projects:
- Working CUDA vector addition with optimization
- Triton matrix multiplication with auto-tuning
- Memory-optimized matrix operations
- Parallel reduction implementations
- Streaming data processing pipeline
- Custom attention mechanism
- Production-ready PyTorch extension
- Comprehensive final project
📈 Success Metrics
Weekly Assessment:
- Exercise Completion (60%): All exercises working and optimized
- Discussion Participation (20%): Active engagement in meetings
- Code Quality (20%): Clean, well-documented, efficient code
Progress Milestones:
- Week 4: Basic CUDA proficiency
- Week 6: Triton competency
- Week 8: Advanced optimization skills
- Week 12: Production-ready implementations
Learning Outcomes Verification:
- Can implement complex algorithms in both CUDA and Triton
- Demonstrates understanding of GPU architecture implications
- Shows ability to optimize code systematically
- Creates production-quality, testable GPU applications
📚 Resource Library
Essential References:
- NVIDIA CUDA Programming Guide
- Triton Language Documentation
- “Programming Massively Parallel Processors” by Kirk & Hwu
Tools Required:
- CUDA Toolkit 12.0+
- Python 3.8+ with PyTorch
- Nsight Systems & Compute
- Git for version control
Community Support:
- NVIDIA Developer Forums
- GPU MODE Discord
- PyTorch Discussion Forums
- Stack Overflow cuda/triton tags
This detailed plan provides explicit materials, clear learning objectives, and targeted exercises for each week. The progression from basic concepts to production-ready implementations ensures practical competency in GPU programming with both CUDA and Triton.