🚀 GPU Programming Reading Group
🎯 Goal
Demystify GPU programming through hands-on learning, group discussion, and concrete implementations—building up from basic concepts to real CUDA kernels that power modern AI and scientific computing.
✅ Prerequisites
To get the most out of this group, you should be comfortable with:
- 💻 Python & C++: Core syntax and basic memory operations
- 📐 Linear Algebra: Especially vectors and matrix multiplication (dot product, row/column transformations)
🧠 Core Concepts We’ll Explore
We’ll break down the GPU execution model and understand how it maps to real workloads:
- Kernels – your code that runs on the GPU
- Threads – the smallest unit of execution
- Blocks – groups of threads
- Grids – groups of blocks
- Warps – groups of 32 threads scheduled together; the foundation of performance tuning
We’ll also look at memory types (global, shared, local) and data transfer patterns between host (CPU) and device (GPU).
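To make the hierarchy concrete, here is a minimal sketch of a kernel that simply reports its own position in the grid. The kernel name `whoAmI` and the 2-block × 4-thread launch are illustrative choices, not something prescribed by the course material:

```cuda
#include <cstdio>

// Each thread prints where it sits in the block/grid hierarchy.
__global__ void whoAmI() {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;  // unique index across the whole grid
    printf("block %d, thread %d -> global id %d\n", blockIdx.x, threadIdx.x, globalId);
}

int main() {
    whoAmI<<<2, 4>>>();       // grid of 2 blocks, 4 threads each (8 threads total)
    cudaDeviceSynchronize();  // wait for the kernel so its printf output is flushed
    return 0;
}
```

Each of those 4-thread blocks fits inside a single 32-thread warp; once block sizes grow past 32, the hardware splits them into multiple warps, which is where warp-aware performance tuning starts to matter.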
📚 Learning Materials
CUDA Course (Chapters 1–5)
🛠️ Hands-On Projects
During the reading group, we’ll read, build, and debug CUDA code together. Planned exercises:
- Hello CUDA World
  - Write and launch your first CUDA kernel
  - Understand thread hierarchy via simple debug output
- Vector Addition (see the first sketch after this list)
  - Launch many threads to compute in parallel
  - Compare GPU vs CPU performance
- Matrix Multiplication (see the second sketch after this list)
  - Implement dense matmul using global memory
  - Optimize it step-by-step using tiling and shared memory
  - Optional: explore warp-level primitives for further speed-up
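For reference, here is a minimal sketch of the vector-addition exercise, assuming single-precision inputs, 256-thread blocks, and an illustrative size of one million elements; none of these choices are mandated by the course, and real code would also check CUDA error codes:

```cuda
#include <cstdio>
#include <vector>

// One thread computes one element of c = a + b.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {              // guard: the last block may have extra threads
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;    // 1M elements (illustrative size)
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

    // Allocate device buffers and copy the inputs host -> device.
    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, n * sizeof(float));
    cudaMalloc((void**)&dB, n * sizeof(float));
    cudaMalloc((void**)&dC, n * sizeof(float));
    cudaMemcpy(dA, a.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    // Copy the result back and spot-check one value.
    cudaMemcpy(c.data(), dC, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("c[0] = %f (expected 3.0)\n", c[0]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

For the GPU-vs-CPU comparison, the same loop can be timed on the host with std::chrono and on the device with cudaEvent_t timers placed around the kernel launch.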
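And here is a sketch of the shared-memory tiling step of the matrix-multiplication exercise, assuming square row-major matrices whose size n is a multiple of the tile width (16 here); handling ragged edges and error checking are left to the exercise itself:

```cuda
#define TILE 16

// C = A * B for n x n row-major matrices, with n assumed to be a multiple of TILE.
__global__ void matmulTiled(const float* A, const float* B, float* C, int n) {
    __shared__ float tileA[TILE][TILE];   // per-block staging buffers in shared memory
    __shared__ float tileB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Walk across the shared dimension one tile at a time.
    for (int t = 0; t < n / TILE; ++t) {
        // Each thread loads one element of A and one of B into shared memory.
        tileA[threadIdx.y][threadIdx.x] = A[row * n + (t * TILE + threadIdx.x)];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                  // wait until the whole tile is loaded

        for (int k = 0; k < TILE; ++k) {
            acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }
        __syncthreads();                  // don't overwrite tiles still being read
    }
    C[row * n + col] = acc;
}
```

It can be launched as `matmulTiled<<<dim3(n / TILE, n / TILE), dim3(TILE, TILE)>>>(dA, dB, dC, n)` after the same cudaMalloc/cudaMemcpy setup as in the vector-addition sketch; the naive global-memory version is essentially the same kernel with the shared-memory tiles removed.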
✨ Outcomes
By the end of this group, you’ll:
- Understand the mental model of GPU programming
- Be comfortable writing simple CUDA kernels from scratch
- Know how to optimize memory access and thread usage
- Be ready to explore advanced libraries like Triton and CUTLASS, or to write custom ops for ML frameworks