🚀 GPU Programming Reading Group
🎯 Goal
Demystify GPU programming through hands-on learning, group discussion, and concrete implementations—building up from basic concepts to real CUDA kernels that power modern AI and scientific computing.
✅ Prerequisites
To get the most out of this group, you should be comfortable with:
- 💻 Python & C++: Core syntax and basic memory operations
- 📐 Linear Algebra: Especially vectors and matrix multiplication (dot product, row/column transformations)
🧠 Core Concepts We’ll Explore
We’ll break down the GPU execution model and understand how it maps to real workloads:
- Kernels – your code that runs on the GPU
- Threads – the smallest unit of execution
- Blocks – groups of threads
- Grids – groups of blocks
- Warps – groups of 32 threads scheduled together; the foundation of performance tuning
We’ll also look at memory types (global, shared, local) and data transfer patterns between host (CPU) and device (GPU).
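To make the hierarchy concrete, here is a minimal sketch of a kernel that simply reports its own position in the grid. The kernel name `whoAmI` and the 2-block × 4-thread launch are illustrative choices, not something prescribed by the course material:

```cuda
#include <cstdio>

// Each thread prints where it sits in the block/grid hierarchy.
__global__ void whoAmI() {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;  // unique index across the whole grid
    printf("block %d, thread %d -> global id %d\n", blockIdx.x, threadIdx.x, globalId);
}

int main() {
    whoAmI<<<2, 4>>>();       // grid of 2 blocks, 4 threads each (8 threads total)
    cudaDeviceSynchronize();  // wait for the kernel so its printf output is flushed
    return 0;
}
```

Each of those 4-thread blocks fits inside a single 32-thread warp; once block sizes grow past 32, the hardware splits them into multiple warps, which is where warp-aware performance tuning starts to matter.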
📚 Learning Materials
CUDA Course (Chapters 1–5)
🛠️ Hands-On Projects
During the reading group, we’ll read, build, and debug CUDA code together. Planned exercises:
- Hello CUDA World
  - Write and launch your first CUDA kernel
  - Understand thread hierarchy via simple debug output
- Vector Addition (see the first sketch after this list)
  - Launch many threads to compute in parallel
  - Compare GPU vs CPU performance
- Matrix Multiplication (see the second sketch after this list)
  - Implement dense matmul using global memory
  - Optimize it step-by-step using tiling and shared memory
  - Optional: explore warp-level primitives for further speed-up
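For reference, here is a minimal sketch of the vector-addition exercise, assuming single-precision inputs, 256-thread blocks, and an illustrative size of one million elements; none of these choices are mandated by the course, and real code would also check CUDA error codes:

```cuda
#include <cstdio>
#include <vector>

// One thread computes one element of c = a + b.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {              // guard: the last block may have extra threads
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;    // 1M elements (illustrative size)
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

    // Allocate device buffers and copy the inputs host -> device.
    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, n * sizeof(float));
    cudaMalloc((void**)&dB, n * sizeof(float));
    cudaMalloc((void**)&dC, n * sizeof(float));
    cudaMemcpy(dA, a.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    // Copy the result back and spot-check one value.
    cudaMemcpy(c.data(), dC, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("c[0] = %f (expected 3.0)\n", c[0]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

For the GPU-vs-CPU comparison, the same loop can be timed on the host with std::chrono and on the device with cudaEvent_t timers placed around the kernel launch.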
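And here is a sketch of the shared-memory tiling step of the matrix-multiplication exercise, assuming square row-major matrices whose size n is a multiple of the tile width (16 here); handling ragged edges and error checking are left to the exercise itself:

```cuda
#define TILE 16

// C = A * B for n x n row-major matrices, with n assumed to be a multiple of TILE.
__global__ void matmulTiled(const float* A, const float* B, float* C, int n) {
    __shared__ float tileA[TILE][TILE];   // per-block staging buffers in shared memory
    __shared__ float tileB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Walk across the shared dimension one tile at a time.
    for (int t = 0; t < n / TILE; ++t) {
        // Each thread loads one element of A and one of B into shared memory.
        tileA[threadIdx.y][threadIdx.x] = A[row * n + (t * TILE + threadIdx.x)];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                  // wait until the whole tile is loaded

        for (int k = 0; k < TILE; ++k) {
            acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }
        __syncthreads();                  // don't overwrite tiles still being read
    }
    C[row * n + col] = acc;
}
```

It can be launched as `matmulTiled<<<dim3(n / TILE, n / TILE), dim3(TILE, TILE)>>>(dA, dB, dC, n)` after the same cudaMalloc/cudaMemcpy setup as in the vector-addition sketch; the naive global-memory version is essentially the same kernel with the shared-memory tiles removed.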
✨ Outcomes
By the end of this group, you’ll:
- Understand the mental model of GPU programming
- Be comfortable writing simple CUDA kernels from scratch
- Know how to optimize memory access and thread usage
- Be ready to explore advanced libraries like Triton and CUTLASS, or to write custom ops for ML frameworks