Triton & CUDA Study Group Plan (Restructured)
12-Week Structured Learning Program with Explicit Materials & Exercises
📋 Overview
This structured study plan is designed for software engineers with GPU knowledge but no CUDA/Triton experience. Each week includes specific readings, videos, learning objectives with discussion points, and targeted exercises.
Target Audience
- Software engineers with GPU conceptual knowledge
- No prior CUDA or Triton experience required
- Comfortable with C/C++ and Python
- Goal: Practical GPU programming competency
Week 1: GPU Architecture & CUDA Setup
📚 Required Readings & Videos (4-5 hours)
- NVIDIA CUDA Programming Guide - Chapter 1 (1 hour)
- An Even Easier Introduction to CUDA (30 min)
- GPU Architecture Basics - Video (45 min)
- FreeCodeCamp CUDA Course - First 2 hours (2 hours)
🎯 What You Should Learn
- GPU vs CPU Architecture: SIMT vs SIMD execution model, memory hierarchy differences
- CUDA Toolkit Components: nvcc compiler, runtime API, driver API
- Basic Terminology: Host/device, kernel, thread hierarchy
- Development Environment: Setting up CUDA development workflow
💬 Discussion Points for Group Meeting
- “Why do GPUs excel at parallel tasks that CPUs struggle with?”
- “What are the trade-offs between GPU and CPU for different algorithms?”
- “How does the SIMT execution model impact how we write algorithms?”
- “What development challenges do you anticipate with GPU programming?”
🛠️ Exercises
Exercise 1.1: Environment Setup (30 min)
# Verify CUDA installation
nvcc --version
nvidia-smi
# Compile and run the deviceQuery sample
# (CUDA 11.6+ no longer bundles the samples; clone
#  https://github.com/NVIDIA/cuda-samples if this path is missing)
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
make
./deviceQuery
Goal: Confirm working CUDA environment and understand GPU specifications
Exercise 1.2: Hello CUDA (45 min)
File: hello_cuda.cu
#include <cuda_runtime.h>
#include <stdio.h>
__global__ void helloFromGPU() {
printf("Hello from GPU thread %d in block %d\n",
threadIdx.x, blockIdx.x);
}
int main() {
printf("Hello from CPU\n");
helloFromGPU<<<2, 4>>>();
cudaDeviceSynchronize();
return 0;
}
Task: Modify to use different grid/block configurations and observe output patterns
Goal: Understand basic kernel launch syntax and thread indexing
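As a reference for those experiments, here is a minimal sketch of how a flat global thread index is typically derived from the block and thread indices; the helloWithGlobalId kernel name is illustrative, not part of the exercise files.
#include <cstdio>

__global__ void helloWithGlobalId() {
    // Global index = block offset + thread offset within the block
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Global thread %d (block %d, thread %d)\n",
           globalId, blockIdx.x, threadIdx.x);
}

// Example launches with the same total thread count but different shapes:
//   helloWithGlobalId<<<4, 2>>>();  // 4 blocks of 2 threads
//   helloWithGlobalId<<<1, 8>>>();  // 1 block of 8 threads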
Exercise 1.3: GPU Memory Investigation (30 min)
Task: Use the deviceQuery output to answer:
- What is your GPU’s compute capability?
- How much global memory does it have?
- How many multiprocessors?
- What is the maximum threads per block?
Goal: Connect hardware specifications to programming limitations
Week 2: CUDA Fundamentals - Memory & Kernels
📚 Required Readings & Videos (4-5 hours)
- CUDA Programming Guide - Chapter 2 (1.5 hours)
- CUDA Memory Management (45 min)
- FreeCodeCamp CUDA Course - Hours 2-4 (2 hours)
- Understanding CUDA Grid and Block Dimensions (30 min)
🎯 What You Should Learn
- Thread Hierarchy: Grids, blocks, threads, and their indexing
- Memory Management: cudaMalloc, cudaMemcpy, cudaFree
- Kernel Syntax: __global__, __device__, __host__ qualifiers
- Error Handling: cudaError_t and proper error checking (see the sketch after this list)
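The following is a minimal sketch of the allocate → copy → launch → copy → free workflow with error checking. The checkCuda helper and the addOne kernel are illustrative names, not part of the exercises.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Illustrative helper: abort with a readable message if a CUDA call fails
static void checkCuda(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);

    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < N; ++i) h_data[i] = (float)i;

    float *d_data = nullptr;
    checkCuda(cudaMalloc(&d_data, bytes), "cudaMalloc");
    checkCuda(cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice), "cudaMemcpy H2D");

    int block = 256;
    int grid = (N + block - 1) / block;
    addOne<<<grid, block>>>(d_data, N);
    checkCuda(cudaGetLastError(), "kernel launch");

    checkCuda(cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost), "cudaMemcpy D2H");
    checkCuda(cudaFree(d_data), "cudaFree");
    free(h_data);
    return 0;
}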
💬 Discussion Points for Group Meeting
- “Why is the thread hierarchy organized as grid→block→thread instead of a flat structure?”
- “What are the advantages and disadvantages of explicit memory management?”
- “How do we determine optimal grid and block dimensions for a given problem?”
- “What patterns do you see between problem decomposition and CUDA threading model?”
🛠️ Exercises
Exercise 2.1: Vector Addition (ECE408 MP0 Style) (90 min)
File: vector_add.cu
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
__global__ void vectorAdd(float *A, float *B, float *C, int N) {
// TODO: Implement parallel vector addition
// Each thread should compute one element
}
int main() {
int N = 10000;
size_t size = N * sizeof(float);
// TODO: Allocate host memory
// TODO: Initialize input arrays
// TODO: Allocate device memory
// TODO: Copy input data to device
// TODO: Launch kernel
// TODO: Copy result back to host
// TODO: Verify result and cleanup
return 0;
}
Tasks:
- Complete the implementation
- Add proper error checking
- Test with different vector sizes (1K, 10K, 1M elements)
- Experiment with different block sizes (32, 128, 256, 512)
Goal: Master basic CUDA workflow and understand performance implications
Exercise 2.2: Matrix Addition (60 min)
Task: Extend vector addition to 2D matrices using 2D grid/block configuration
__global__ void matrixAdd(float *A, float *B, float *C, int rows, int cols) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
// TODO: Complete implementation with bounds checking
}
Goal: Understand 2D indexing and bounds checking
Exercise 2.3: Memory Transfer Timing (45 min)
Task: Measure and compare:
- CPU computation time
- GPU computation time (kernel only)
- GPU total time (including memory transfers)
- Memory transfer time (host↔device)
Goal: Understand when GPU acceleration is beneficial
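One simple way to take the GPU-side measurements is with CUDA events (covered in more depth in Week 7). A sketch of timing the kernel alone, assuming the vectorAdd setup from Exercise 2.1:
// Fragment; assumes d_A, d_B, d_C, N, grid and block are set up as in Exercise 2.1
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
vectorAdd<<<grid, block>>>(d_A, d_B, d_C, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);                    // wait until the kernel has finished

float kernelMs = 0.0f;
cudaEventElapsedTime(&kernelMs, start, stop);  // elapsed time in milliseconds

// For "GPU total time", record start before the H2D copies and stop
// after the D2H copy instead of around the kernel alone.
cudaEventDestroy(start);
cudaEventDestroy(stop);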
Week 3: Introduction to Triton
📚 Required Readings & Videos (4-5 hours)
- OpenAI Triton Introduction (45 min)
- Triton Vector Addition Tutorial (1 hour)
- Understanding Triton Tutorials Part 1 (45 min)
- Getting Started with Triton Tutorial (1.5 hours)
- Triton vs CUDA Comparison (30 min)
🎯 What You Should Learn
- Triton’s Philosophy: Block-level programming vs thread-level
- Python Integration: @triton.jit decorator and kernel compilation
- Memory Model: Block pointers and automatic memory optimization
- Auto-tuning: Basic concepts of performance optimization
💬 Discussion Points for Group Meeting
- “How does Triton’s block-level approach simplify GPU programming compared to CUDA?”
- “What are the trade-offs between Triton’s abstraction and CUDA’s explicit control?”
- “When would you choose Triton over CUDA and vice versa?”
- “How does Triton’s auto-tuning compare to manual optimization in CUDA?”
🛠️ Exercises
Exercise 3.1: Triton Vector Addition (60 min)
File: triton_vector_add.py
import torch
import triton
import triton.language as tl
@triton.jit
def vector_add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
# TODO: Implement vector addition in Triton
# Use block-level operations
pass
def vector_add(x: torch.Tensor, y: torch.Tensor):
# TODO: Set up kernel launch parameters
# TODO: Call the kernel
pass
# Test and benchmark
if __name__ == "__main__":
# TODO: Create test tensors and verify correctness
pass
Tasks:
- Complete the Triton implementation
- Compare performance with PyTorch’s native add
- Experiment with different BLOCK_SIZE values
- Verify numerical correctness
Goal: Understand Triton’s programming model and syntax
Exercise 3.2: Element-wise Operations (45 min)
Task: Implement in Triton:
- ReLU activation: output = max(0, x)
- Squared operation: output = x^2
- GELU approximation: output = x * sigmoid(1.702 * x)
Goal: Practice Triton’s mathematical operations and function composition
Exercise 3.3: CUDA vs Triton Comparison (60 min)
Task: Implement the same vector addition in both CUDA and Triton, then compare:
- Lines of code
- Development time
- Performance
- Ease of debugging
Goal: Understand practical differences between the approaches
Week 4: CUDA Memory Optimization
📚 Required Readings & Videos (5-6 hours)
- CUDA Memory Model (2 hours)
- Memory Coalescing Tutorial (1 hour)
- Shared Memory Programming (1.5 hours)
- FreeCodeCamp CUDA Course - Hours 4-6 (2 hours)
🎯 What You Should Learn
- Memory Hierarchy: Global, shared, constant, texture memory characteristics
- Coalescing Rules: How to achieve efficient memory access patterns
- Shared Memory: Bank conflicts, padding, synchronization
- Memory Profiling: Using nvidia-smi and basic profiling techniques
💬 Discussion Points for Group Meeting
- “Why is memory bandwidth often the bottleneck in GPU applications?”
- “How do coalescing rules relate to the underlying hardware architecture?”
- “When is the complexity of shared memory optimization worth the effort?”
- “What patterns can we identify for memory-bound vs compute-bound kernels?”
🛠️ Exercises
Exercise 4.1: Coalescing Analysis (ECE408 MP1 Style) (90 min)
File: coalescing_test.cu
// Test different memory access patterns
__global__ void strided_access(float *data, int stride, int N) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < N) {
data[tid * stride] = tid; // Strided access (data must hold at least N * stride elements)
}
}
__global__ void coalesced_access(float *data, int N) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < N) {
data[tid] = tid; // Coalesced access
}
}
Tasks:
- Implement timing for both kernels
- Test with strides of 1, 2, 4, 8, 16, 32
- Graph the performance vs stride
- Explain the performance pattern
Goal: Understand coalescing impact on performance
Exercise 4.2: Matrix Transpose with Shared Memory (120 min)
File: matrix_transpose.cu
#define TILE_SIZE 32
__global__ void transpose_naive(float *input, float *output, int N) {
// TODO: Naive implementation (non-coalesced writes)
}
__global__ void transpose_shared(float *input, float *output, int N) {
__shared__ float tile[TILE_SIZE][TILE_SIZE + 1]; // +1 to avoid bank conflicts
// TODO: Shared memory implementation
}
Tasks:
- Implement both versions
- Measure and compare performance
- Verify correctness
- Experiment with different tile sizes
Goal: Master shared memory programming and bank conflict avoidance
Exercise 4.3: Memory Bandwidth Analysis (45 min)
Task: Calculate theoretical vs achieved memory bandwidth for your kernels
// Bandwidth = (bytes_read + bytes_written) / time_in_seconds
// Compare against GPU's peak memory bandwidth from deviceQuery
Goal: Understand memory efficiency metrics
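A small host-side sketch of that calculation, assuming kernelMs from CUDA-event timing and a kernel that reads and writes N floats:
// Effective bandwidth for a kernel that reads N floats and writes N floats
double bytesMoved = 2.0 * N * sizeof(float);   // bytes read + bytes written
double seconds    = kernelMs / 1000.0;
double gbPerSec   = bytesMoved / seconds / 1e9;
printf("Effective bandwidth: %.1f GB/s\n", gbPerSec);
// Compare gbPerSec against the peak memory bandwidth reported by deviceQuery.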
Week 5: Advanced CUDA - Reductions
📚 Required Readings & Videos (5-6 hours)
- NVIDIA Reduction Examples (2 hours)
- Parallel Reduction Patterns (1.5 hours)
- Warp Shuffle Operations (1 hour)
- FreeCodeCamp CUDA Course - Hours 6-8 (2 hours)
🎯 What You Should Learn
- Reduction Algorithms: Tree-based reductions, warp-level optimizations
- Synchronization: __syncthreads(), warp-synchronous programming
- Warp Primitives: __shfl_down_sync(), __shfl_xor_sync() (see the sketch after this list)
- Multiple Blocks: Handling reductions across grid dimensions
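To make the warp-primitive item concrete, here is a sketch of a warp-level sum using __shfl_down_sync; the warpReduceSum name is illustrative.
// Sum the values held by the 32 threads of a warp; after the loop,
// lane 0 holds the warp's total.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;
}
A block-level reduction typically combines this with a shared-memory step that gathers one partial sum per warp.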
💬 Discussion Points for Group Meeting
- “Why are reductions fundamentally challenging to parallelize efficiently?”
- “How do warp-level primitives change the performance characteristics?”
- “What are the trade-offs between different reduction strategies?”
- “When does it make sense to use multiple kernel launches vs single kernel?”
🛠️ Exercises
Exercise 5.1: Basic Reduction (ECE408 MP2 Style) (120 min)
File: reduction.cu
__global__ void reduce_sum_naive(float *input, float *output, int N) {
// TODO: Implement tree-based reduction in shared memory
}
__global__ void reduce_sum_optimized(float *input, float *output, int N) {
// TODO: Add warp-level optimizations
}
Tasks:
- Implement tree-based reduction
- Add warp shuffle optimizations
- Handle arbitrary input sizes
- Compare with CPU reduction
Goal: Master reduction algorithms and warp-level programming
Exercise 5.2: Multiple Reductions (60 min)
Task: Implement kernels to find:
- Sum, max, min of array
- Mean and variance in single pass
- Histogram with atomic operations
Goal: Apply reduction concepts to different operations
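For the histogram item above, a minimal sketch of the atomic-update pattern; the bin count and kernel name are placeholders.
#define NUM_BINS 256

// Each thread classifies one element and atomically bumps its bin;
// bins holds NUM_BINS counters and must be zeroed (e.g. cudaMemset) before launch.
__global__ void histogramKernel(const unsigned char *input, unsigned int *bins, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&bins[input[i]], 1u);
    }
}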
Exercise 5.3: Performance Analysis (45 min)
Task: Profile your reduction implementations and compare against:
- cuBLAS cublasSasum
- Thrust thrust::reduce
- CPU single-threaded version
Goal: Understand production-quality implementation performance
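A sketch of how the two library baselines might be invoked (host-code fragment, error handling omitted; d_data is assumed to be a device array of N floats from the earlier exercises):
#include <cublas_v2.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

// cuBLAS: sum of absolute values (equivalent to a sum reduction for non-negative data)
cublasHandle_t handle;
cublasCreate(&handle);
float asumResult = 0.0f;
cublasSasum(handle, N, d_data, 1, &asumResult);
cublasDestroy(handle);

// Thrust: plain sum reduction over the device array
float thrustResult = thrust::reduce(thrust::device, d_data, d_data + N, 0.0f);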
Week 6: Triton Matrix Operations
📚 Required Readings & Videos (5-6 hours)
- Triton Matrix Multiplication Tutorial (2 hours)
- Matrix Multiplication Optimization Guide (1.5 hours)
- Triton Auto-tuning Documentation (1 hour)
- Triton Kernel Compilation Stages (1 hour)
🎯 What You Should Learn
- Block-level Matrix Multiplication: Tiling strategies in Triton
- Auto-tuning: @triton.autotune decorator usage
- Memory Efficiency: Triton’s automatic memory optimizations
- Integration: Calling Triton kernels from PyTorch
💬 Discussion Points for Group Meeting
- “How does Triton’s auto-tuning compare to manual CUDA optimization?”
- “What are the advantages of block-level thinking for matrix operations?”
- “How does Triton handle memory coalescing automatically?”
- “When might Triton’s optimizations be suboptimal compared to hand-tuned CUDA?”
🛠️ Exercises
Exercise 6.1: Triton Matrix Multiplication (150 min)
File: triton_matmul.py
@triton.autotune(
configs=[
triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 64}, num_stages=3, num_warps=8),
# TODO: Add more configurations
],
key=['M', 'N', 'K'],
)
@triton.jit
def matmul_kernel(
a_ptr, b_ptr, c_ptr,
M, N, K,
stride_am, stride_ak,
stride_bk, stride_bn,
stride_cm, stride_cn,
BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr, BLOCK_SIZE_K: tl.constexpr,
):
# TODO: Implement blocked matrix multiplication
pass
Tasks:
- Complete the matrix multiplication implementation
- Add multiple auto-tune configurations
- Benchmark against PyTorch’s torch.mm
- Test with different matrix sizes
Goal: Master Triton’s auto-tuning and matrix operations
Exercise 6.2: Advanced Matrix Operations (90 min)
Task: Implement in Triton:
- Matrix transpose
- Element-wise matrix operations (add, multiply)
- Softmax across matrix rows
Goal: Build library of common matrix operations
Exercise 6.3: Performance Comparison (60 min)
Task: Compare your Triton implementations against:
- PyTorch operations
- cuBLAS (if available)
- Your CUDA implementations from previous weeks
Goal: Understand Triton’s performance characteristics
Week 7: CUDA Streams and Concurrency
📚 Required Readings & Videos (5-6 hours)
- CUDA Streams Tutorial (1.5 hours)
- Overlapping Computation and Data Transfer (2 hours)
- Asynchronous Memory Transfers (1.5 hours)
- FreeCodeCamp CUDA Course - Hours 8-10 (2 hours)
🎯 What You Should Learn
- CUDA Streams: Creating and managing multiple execution streams
- Asynchronous Operations: Non-blocking memory transfers and kernel launches
- Event Timing: Precise performance measurement with CUDA events
- Memory Pinning: Page-locked memory for faster transfers
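A compact sketch of the chunked, overlapped pattern the readings describe: pinned host memory plus per-chunk asynchronous copies and launches on separate streams. Here myKernel is a stand-in for your kernel and d_data is a matching device buffer.
// Pinned (page-locked) host buffer: required for truly asynchronous copies
float *h_data;
cudaMallocHost((void **)&h_data, N * sizeof(float));

cudaStream_t streams[4];
for (int s = 0; s < 4; ++s) cudaStreamCreate(&streams[s]);

int chunk = N / 4;
for (int s = 0; s < 4; ++s) {
    int offset = s * chunk;
    cudaMemcpyAsync(d_data + offset, h_data + offset, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[s]);
    myKernel<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, chunk);
    cudaMemcpyAsync(h_data + offset, d_data + offset, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[s]);
}
cudaDeviceSynchronize();   // or cudaStreamSynchronize per stream
for (int s = 0; s < 4; ++s) cudaStreamDestroy(streams[s]);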
💬 Discussion Points for Group Meeting
- “When does stream-based concurrency provide significant benefits?”
- “How do we balance memory bandwidth with computational throughput?”
- “What are the challenges of debugging asynchronous GPU code?”
- “How do streams relate to real-world data processing pipelines?”
🛠️ Exercises
Exercise 7.1: Stream Overlap (ECE408 MP3 Style) (120 min)
File: stream_overlap.cu
void processDataWithStreams(float *h_data, int N, int nStreams) {
// TODO: Create multiple streams
// TODO: Process data in chunks with overlapped H2D, kernel, D2H
// TODO: Synchronize and measure total time
}
void processDataSynchronous(float *h_data, int N) {
// TODO: Process all data synchronously for comparison
}
Tasks:
- Implement streaming data processing
- Compare with synchronous version
- Experiment with different numbers of streams
- Measure memory transfer vs computation overlap
Goal: Master asynchronous GPU programming
Exercise 7.2: Event-based Timing (60 min)
Task: Create precise timing framework using CUDA events
class CudaTimer {
cudaEvent_t start, stop;
public:
void startTimer();
float stopTimer(); // Returns elapsed time in milliseconds
};
Goal: Implement accurate GPU performance measurement
Exercise 7.3: Pipeline Processing (90 min)
Task: Implement a 3-stage pipeline:
- CPU preprocessing
- GPU computation
- CPU postprocessing
Use streams to overlap all three stages.
Goal: Build practical concurrent processing system
Week 8: Deep Learning Kernels
📚 Required Readings & Videos (6-7 hours)
- Flash Attention Paper - Sections 1-3 (2 hours)
- Triton Fused Attention Tutorial (2 hours)
- Understanding Flash Attention Implementation (1.5 hours)
- Custom PyTorch CUDA Extensions (1 hour)
🎯 What You Should Learn
- Attention Mechanisms: Mathematical foundation and GPU challenges
- Kernel Fusion: Combining multiple operations for efficiency
- Memory-IO Awareness: Algorithm design for GPU memory hierarchy
- PyTorch Integration: Creating custom operators
💬 Discussion Points for Group Meeting
- “Why is attention computation particularly challenging for GPUs?”
- “How does Flash Attention’s approach differ from naive implementations?”
- “What principles from Flash Attention apply to other algorithms?”
- “How do we balance mathematical precision with computational efficiency?”
🛠️ Exercises
Exercise 8.1: Attention Mechanism Components (150 min)
Task: Implement individual components in Triton:
- Softmax kernel with numerical stability
- Scaled dot-product attention
- Multi-head attention reshaping
File: attention_components.py
@triton.jit
def softmax_kernel(input_ptr, output_ptr, input_row_stride, output_row_stride, n_cols, BLOCK_SIZE: tl.constexpr):
# TODO: Implement numerically stable softmax
pass
@triton.jit
def attention_kernel(q_ptr, k_ptr, v_ptr, output_ptr, ...):
# TODO: Implement basic attention computation
pass
Goal: Build foundation for complex attention implementations
Exercise 8.2: Flash Attention Implementation (180 min)
Task: Implement simplified Flash Attention algorithm
- Focus on forward pass only
- Use block-wise computation
- Compare memory usage vs standard attention
Goal: Understand memory-efficient algorithm design
Exercise 8.3: PyTorch Integration (90 min)
Task: Create PyTorch-compatible operators
class TritonAttention(torch.autograd.Function):
@staticmethod
def forward(ctx, q, k, v):
# TODO: Call Triton kernel
pass
@staticmethod
def backward(ctx, grad_output):
# TODO: Implement backward pass
pass
Goal: Create production-ready custom operators
Week 9: Profiling and Optimization
📚 Required Readings & Videos (5-6 hours)
- Nsight Systems User Guide - Chapters 1-3 (2 hours)
- Nsight Compute User Guide - Chapters 1-2 (1.5 hours)
- CUDA Profiling Best Practices (1 hour)
- Performance Optimization Guide - Chapters 5-7 (2 hours)
🎯 What You Should Learn
- Profiling Tools: Nsight Systems, Nsight Compute, nvidia-smi
- Performance Metrics: Occupancy, bandwidth utilization, instruction throughput
- Bottleneck Identification: Memory-bound vs compute-bound analysis
- Optimization Strategies: Systematic performance improvement
💬 Discussion Points for Group Meeting
- “How do we systematically identify performance bottlenecks?”
- “What metrics are most important for different types of kernels?”
- “How do we balance development time with optimization effort?”
- “What are common performance pitfalls and how to avoid them?”
🛠️ Exercises
Exercise 9.1: Profiling Workshop (120 min)
Provided: Intentionally suboptimal kernels with various issues:
- Non-coalesced memory access
- Low occupancy
- Bank conflicts
- Unnecessary synchronization
Tasks:
- Profile each kernel with Nsight Compute
- Identify the primary bottleneck
- Propose optimization strategy
- Implement and measure improvement
Goal: Develop systematic optimization methodology
Exercise 9.2: Occupancy Analysis (90 min)
Task: Analyze occupancy for your previous kernels
// Use CUDA occupancy API
int numBlocksPerSM;   // max resident blocks per SM for this kernel and block size
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSM, kernel, blockSize, 0);
// Calculate theoretical occupancy
// Compare with achieved occupancy from profiler
Goal: Understand occupancy optimization
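One possible sketch of the theoretical-occupancy calculation, assuming a kernel symbol named myKernel and a chosen block size:
int device = 0;
cudaGetDevice(&device);

int maxThreadsPerSM = 0;
cudaDeviceGetAttribute(&maxThreadsPerSM,
                       cudaDevAttrMaxThreadsPerMultiProcessor, device);

int blockSize = 256;
int numBlocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSM, myKernel, blockSize, 0);

float occupancy = (float)(numBlocksPerSM * blockSize) / maxThreadsPerSM;
printf("Theoretical occupancy: %.0f%%\n", occupancy * 100.0f);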
Exercise 9.3: Memory Bandwidth Optimization (90 min)
Task: Optimize memory-bound kernels to achieve >80% peak bandwidth
- Use profiler to measure achieved bandwidth
- Apply coalescing optimizations
- Consider cache behavior
Goal: Master memory optimization techniques
Week 10: Advanced Optimization Techniques
📚 Required Readings & Videos (5-6 hours)
- Tensor Core Programming (1.5 hours)
- Advanced CUDA Optimization (1 hour)
- Loop Unrolling and ILP (1 hour)
- CUTLASS Library Examples - Browse examples (2 hours)
🎯 What You Should Learn
- Tensor Cores: Specialized matrix computation units
- Instruction-Level Parallelism: Loop unrolling, register optimization
- Cache Optimization: L1/L2 cache utilization strategies
- Architecture-Specific: Optimizing for different GPU generations
💬 Discussion Points for Group Meeting
- “When is the complexity of Tensor Core programming worthwhile?”
- “How do we balance code readability with optimization level?”
- “What optimization techniques translate across GPU architectures?”
- “How do we future-proof our optimization strategies?”
🛠️ Exercises
Exercise 10.1: Tensor Core GEMM (180 min)
Task: Implement matrix multiplication using Tensor Cores
#include <mma.h>
using namespace nvcuda;
__global__ void tensor_core_gemm(half *A, half *B, float *C, int M, int N, int K) {
wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
// TODO: Implement Tensor Core computation
}
Goal: Utilize specialized hardware for maximum performance
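As a hint for the TODO, the core per-tile WMMA call sequence usually looks like the sketch below; aTile, bTile, cTile and the leading dimensions lda, ldb, ldc are placeholders whose offsets depend on your tiling scheme and data layouts.
// Accumulator starts at zero, then one mma per 16-wide K slice
wmma::fill_fragment(c_frag, 0.0f);
for (int k = 0; k < K; k += 16) {
    wmma::load_matrix_sync(a_frag, aTile, lda);   // current 16x16 tile of A
    wmma::load_matrix_sync(b_frag, bTile, ldb);   // current 16x16 tile of B
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
}
wmma::store_matrix_sync(cTile, c_frag, ldc, wmma::mem_row_major);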
Exercise 10.2: Register Optimization (90 min)
Task: Optimize kernel for register usage
- Minimize register spilling
- Use the __launch_bounds__ directive
- Measure occupancy impact
Goal: Understand register pressure optimization
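The __launch_bounds__ qualifier is attached to the kernel definition; a minimal usage sketch (myBoundedKernel is an illustrative name):
// Promise the compiler at most 256 threads per block and ask for at least
// 2 resident blocks per SM; this caps register usage per thread accordingly.
__global__ void __launch_bounds__(256, 2) myBoundedKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}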
Exercise 10.3: Multi-GPU Scaling (120 min)
Task: Scale your best kernel across multiple GPUs
- Implement data distribution
- Handle inter-GPU communication
- Measure scaling efficiency
Goal: Understand distributed GPU computing
Week 11: Production Integration
📚 Required Readings & Videos (4-5 hours)
- PyTorch C++ Extensions Guide (2 hours)
- CUDA Error Handling Best Practices (1 hour)
- Docker GPU Support (1 hour)
- Testing GPU Code (1 hour)
🎯 What You Should Learn
- Integration Patterns: Making kernels production-ready
- Error Handling: Robust error checking and recovery
- Testing Strategies: Unit testing GPU code
- Deployment: Containerization and version management
💬 Discussion Points for Group Meeting
- “What makes GPU code ‘production-ready’ vs research code?”
- “How do we handle GPU errors gracefully in production systems?”
- “What testing strategies work best for GPU kernels?”
- “How do we manage GPU code dependencies and versions?”
🛠️ Exercises
Exercise 11.1: PyTorch Extension (120 min)
Task: Package your best kernels as PyTorch extensions
# setup.py
from pybind11.setup_helpers import Pybind11Extension, build_ext
ext_modules = [
    Pybind11Extension(
        "my_cuda_ops",
        ["src/cuda_ops.cpp", "src/kernels.cu"],
        # TODO: Add proper build configuration
        # (hint: torch.utils.cpp_extension.CUDAExtension compiles .cu files with
        #  nvcc automatically, which a plain Pybind11Extension does not)
),
]
Goal: Create installable GPU operation packages
Exercise 11.2: Error Handling Framework (90 min)
Task: Implement comprehensive error handling
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            /* TODO: Implement proper error handling */ \
        } \
    } while (0)
Goal: Build robust GPU applications
Exercise 11.3: Performance Testing Suite (90 min)
Task: Create automated performance testing
- Benchmark against baselines
- Test across different input sizes
- Validate numerical accuracy
- Generate performance reports
Goal: Ensure production performance standards
Week 12: Capstone Projects
📚 Required Readings & Videos (4-5 hours)
- Recent GPU Computing Papers - Choose 2 recent papers in your interest area (3 hours)
- Future of GPU Computing (1 hour)
- Review all previous week materials for project planning (1 hour)
🎯 What You Should Learn
- Project Planning: Scoping GPU computing projects
- Research Application: Applying techniques to novel problems
- Performance Analysis: Comprehensive evaluation methodology
- Knowledge Transfer: Teaching others your implementations
💬 Discussion Points for Group Meeting
- “What GPU computing trends do you see emerging?”
- “How can we apply what we’ve learned to our company’s problems?”
- “What areas need further study for production deployment?”
- “How do we stay current with rapidly evolving GPU technology?”
🛠️ Final Projects (Choose One)
Project Option A: Custom Deep Learning Operator (16+ hours)
Goal: Implement a complete custom operator for a specific neural network layer
Requirements:
- Forward and backward passes
- PyTorch integration with autograd
- Performance benchmarking vs existing implementations
- Comprehensive testing
Example operators:
- Grouped convolution
- Attention variant (sparse, local, etc.)
- Custom activation functions
- Layer normalization variants
Project Option B: Scientific Computing Kernel (16+ hours)
Goal: Solve a real scientific computing problem with GPU acceleration
Requirements:
- Problem analysis and algorithm design
- CUDA implementation with optimization
- Comparison with CPU implementation
- Scaling analysis
Example problems:
- N-body simulation
- Finite difference PDE solver
- Monte Carlo simulation
- Image processing pipeline
Project Option C: Performance Library (16+ hours)
Goal: Create a library of optimized GPU operations
Requirements:
- Multiple related operations (e.g., all BLAS Level 1 operations)
- Consistent API design
- Comprehensive benchmarking
- Documentation and examples
Project Option D: GPU Computing Education Tool (16+ hours)
Goal: Create a tool to help others learn GPU programming
Requirements:
- Interactive visualization of GPU concepts
- Code examples with explanations
- Performance demonstration
- User-friendly interface
📊 Final Presentations (Week 12, Day 2)
Format: 15-minute presentations + 5 minutes Q&A
Content:
- Problem statement and approach (3 min)
- Implementation details and challenges (5 min)
- Performance analysis and results (4 min)
- Lessons learned and future work (3 min)
🏁 Completion Checklist
By End of Week 12, You Should Be Able To:
- Write efficient CUDA kernels from scratch
- Implement custom operations in Triton
- Profile and optimize GPU code systematically
- Integrate GPU kernels with PyTorch
- Debug GPU applications effectively
- Design memory-efficient algorithms
- Apply parallel computing principles to new problems
- Evaluate GPU vs CPU trade-offs for different tasks
Portfolio Projects:
- Working CUDA vector addition with optimization
- Triton matrix multiplication with auto-tuning
- Memory-optimized matrix operations
- Parallel reduction implementations
- Streaming data processing pipeline
- Custom attention mechanism
- Production-ready PyTorch extension
- Comprehensive final project
📈 Success Metrics
Weekly Assessment:
- Exercise Completion (60%): All exercises working and optimized
- Discussion Participation (20%): Active engagement in meetings
- Code Quality (20%): Clean, well-documented, efficient code
Progress Milestones:
- Week 4: Basic CUDA proficiency
- Week 6: Triton competency
- Week 8: Advanced optimization skills
- Week 12: Production-ready implementations
Learning Outcomes Verification:
- Can implement complex algorithms in both CUDA and Triton
- Demonstrates understanding of GPU architecture implications
- Shows ability to optimize code systematically
- Creates production-quality, testable GPU applications
📚 Resource Library
Essential References:
- NVIDIA CUDA Programming Guide
- Triton Language Documentation
- “Programming Massively Parallel Processors” by Kirk & Hwu
Tools Required:
- CUDA Toolkit 12.0+
- Python 3.8+ with PyTorch
- Nsight Systems & Compute
- Git for version control
Community Support:
- NVIDIA Developer Forums
- GPU MODE Discord
- PyTorch Discussion Forums
- Stack Overflow cuda/triton tags
This detailed plan provides explicit materials, clear learning objectives, and targeted exercises for each week. The progression from basic concepts to production-ready implementations ensures practical competency in GPU programming with both CUDA and Triton.