GPU Study Group - Session 1: Detailed Host Plan

Week 1: GPU Architecture & CUDA Setup

📋 Pre-Session Preparation Checklist (Host)

Technical Setup (Do 1-2 days before)

  • Test your own CUDA environment - Run all three exercises yourself
  • Prepare backup options for attendees without CUDA:
    • Google Colab notebook with CUDA runtime
    • Online CUDA compiler link (godbolt.org)
    • Virtual machine with CUDA pre-installed
  • Screen sharing setup - Test ability to share terminal and IDE
  • Prepare demo materials - Have deviceQuery output ready to show

Content Preparation

  • Review all assigned materials - Take notes on key concepts
  • Prepare visual aids (optional):
    • GPU vs CPU architecture diagram
    • CUDA memory hierarchy diagram
    • Thread hierarchy visualization
  • Create shared document for group notes/questions

Logistics

  • Send reminder 24 hours before with:
    • Meeting link and time
    • Reminder to complete readings
    • CUDA installation instructions (if not done)
  • Prepare attendance tracking and progress monitoring

🕐 Detailed Session Agenda (90 minutes)

Opening (10 minutes)

[0:00-0:10]

Host Script:

“Welcome to our first GPU programming study group! Over the next 12 weeks, we’ll go from CUDA beginners to writing efficient GPU kernels. Today we’re building the foundation - understanding WHY GPUs work the way they do.”

Activities:

  • Quick introductions (name, background, why interested in GPU programming)
  • Outline today’s agenda
  • Set expectations for participation

Knowledge Check & Reading Review (15 minutes)

[0:10-0:25]

Host facilitation: Start with quick polls to gauge preparation level:

  1. “Show of hands - who completed all the readings?”
  2. “Who successfully installed CUDA and ran deviceQuery?”
  3. “What was the most confusing concept from the readings?”

Quick concept review (about 3 minutes each, to fit the 15-minute slot):

  • GPU vs CPU architecture - Have someone explain in their own words
  • CUDA terminology - Quick definitions round-robin
  • Development workflow - Walk through compile and run process

Core Discussion: Architecture Deep Dive (25 minutes)

[0:25-0:50]

Discussion Point 1: “Why do GPUs excel at parallel tasks that CPUs struggle with?” (8 minutes)

Facilitation approach:

  • Start with open responses, then guide toward key concepts
  • Look for these concepts to emerge:
    • Thousands of simple cores vs. a handful of complex CPU cores
    • SIMT execution model
    • Memory bandwidth vs latency
    • Different design philosophies

Guiding questions if discussion stalls:

  • “What happens when a CPU encounters a cache miss?”
  • “How does GPU handle the same situation differently?”
  • “When would you NOT want to use GPU acceleration?”

Discussion Point 2: “How does the SIMT execution model impact algorithm design?” (8 minutes)

Key concepts to draw out:

  • All threads in warp execute same instruction
  • Branch divergence performance impact
  • Data parallelism vs task parallelism
  • Algorithm restructuring for GPU efficiency
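To make branch divergence concrete during the discussion, a minimal sketch (illustrative, not from the assigned readings) can be shown: when threads in the same warp take different branches, the hardware serializes the two paths with some lanes masked off.

```cuda
// Sketch: branch divergence inside a warp (names and values are illustrative).
__global__ void divergent(float *out) {
    int i = threadIdx.x;
    if (i % 2 == 0) {
        out[i] = i * 2.0f;   // even lanes take this path...
    } else {
        out[i] = i + 1.0f;   // ...odd lanes take this one, executed serially after it
    }
}

// A divergence-free alternative: every lane executes the same instruction stream.
__global__ void uniform(float *out) {
    int i = threadIdx.x;
    out[i] = i * 2.0f;
}
```

Walking through both kernels side by side helps the group see why restructuring an algorithm so that neighboring threads follow the same path matters for performance.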

Real-world examples to mention:

  • Matrix multiplication (perfect fit)
  • Sorting algorithms (requires redesign)
  • Tree traversal (challenging)
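If the group wants to see why matrix multiplication is such a natural fit, a naive kernel sketch can be shown: one thread computes one output element, with no dependence on other threads. This assumes square N x N row-major matrices; the names are illustrative.

```cuda
// Naive matrix multiply: one thread per output element, a natural SIMT workload.
__global__ void matmul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {          // guard against threads outside the matrix
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}
```

Contrast this with sorting or tree traversal, where threads must coordinate and follow data-dependent paths, which is exactly why those algorithms need redesign for the GPU.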

Discussion Point 3: “What development challenges do you anticipate?” (9 minutes)

Expected responses:

  • Debugging parallel code
  • Memory management complexity
  • Performance optimization difficulty
  • Different programming model

Host role: Acknowledge challenges but emphasize they’re solvable


Hands-On Exercise Session (30 minutes)

[0:50-1:20]

Exercise 1.1: Environment Verification (8 minutes)

[0:50-0:58]

Host demonstration:

  1. Show nvcc --version output on your system
  2. Demonstrate nvidia-smi and explain key information
  3. Walk through deviceQuery compilation and key output metrics

Group activity:

  • Everyone runs commands simultaneously
  • Troubleshoot issues collectively
  • Share interesting deviceQuery findings

Exercise 1.2: Hello CUDA Analysis (12 minutes)

[0:58-1:10]

Before coding - concept check (3 minutes):

  • “What does <<<2, 4>>> mean?”
  • “How many total threads will be created?”
  • “What output do we expect?”

Live coding session (6 minutes):

  • Host shares screen and codes exercise
  • Explain each line as you type
  • Compile and run, show output
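The live-coded exercise might look like the following sketch (one reasonable version, assuming nvcc is on the PATH):

```cuda
// hello.cu - compile and run with: nvcc hello.cu -o hello && ./hello
#include <cstdio>

__global__ void hello() {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    hello<<<2, 4>>>();          // 2 blocks x 4 threads each = 8 threads total
    cudaDeviceSynchronize();    // wait for the kernel so device-side printf flushes
    return 0;
}
```

Pointing out the `cudaDeviceSynchronize()` call is worthwhile: without it, the program can exit before the kernel's output appears.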

Experimentation phase (3 minutes):

  • Challenge: “Try <<<3, 2>>> and <<<1, 8>>>”
  • “What patterns do you notice in the output?”
  • “What happens with <<<1, 1024>>>?”

Exercise 1.3: Hardware Exploration (10 minutes)

[1:10-1:20]

Structured sharing:

  • Each person shares one interesting spec from their GPU
  • Create group comparison chart on shared doc
  • Discuss implications for programming:
    • “Why does compute capability matter?”
    • “How does memory size affect problem size?”
    • “What does max threads per block tell us?”
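If deviceQuery is unavailable on someone's machine, a lightweight alternative using the CUDA runtime API can surface the same specs the group is comparing (a minimal sketch; it queries device 0 only and skips error checking):

```cuda
// query.cu - compile with: nvcc query.cu -o query && ./query
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("Name: %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Global memory: %.1f GiB\n",
           prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Warp size: %d\n", prop.warpSize);
    return 0;
}
```

These fields map directly onto the three discussion questions: compute capability, memory size, and max threads per block.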

Wrap-up & Next Week Preview (10 minutes)

[1:20-1:30]

Key Takeaways Review (5 minutes)

Ask group to complete these statements:

  • “GPUs are good at _________ because _________”
  • “The biggest difference from CPU programming is _________”
  • “CUDA’s thread hierarchy consists of _________”

Week 2 Preview (3 minutes)

  • Topic: Memory Management & Basic Kernels
  • Key focus: Understanding GPU memory types and writing first computational kernels
  • Preparation reminder: Send materials list via email

Action Items (2 minutes)

  • Complete any exercises not finished
  • Set up development environment if needed
  • Optional: Explore CUDA samples directory

🎯 Discussion Facilitation Guide

Effective Facilitation Techniques

For Quiet Groups:

  • Direct questions: “Sarah, what did you think about the memory hierarchy?”
  • Think-pair-share: 30 seconds to think, discuss with neighbor, then share
  • Written responses: Use shared doc for anonymous input

For Dominant Speakers:

  • Redirect: “That’s a great point, John. Let’s hear other perspectives.”
  • Time limits: “Let’s get 2-3 more opinions on this”
  • Round-robin: Structure equal speaking time

When Discussion Goes Off-Track:

  • Acknowledge: “Interesting point - let’s revisit that during break”
  • Redirect: “How does this connect to our GPU architecture focus?”
  • Table it: “Let’s add that to our ‘questions for later’ list”

Expected Learning Outcomes

By end of session, participants should be able to:

  • Explain fundamental differences between GPU and CPU architecture
  • Define basic CUDA terminology (kernel, thread, block, grid)
  • Successfully compile and run a simple CUDA program
  • Interpret deviceQuery output for their GPU
  • Identify when GPU acceleration might be beneficial

🛠️ Materials & Setup Needed

Required Materials

  • Laptop/Desktop with CUDA capability OR access to Google Colab
  • Text editor/IDE (VS Code, CLion, or simple text editor)
  • CUDA Toolkit installed (version 11.0+)
  • Access to readings - confirm everyone has links

Backup Options for Technical Issues

  1. Google Colab notebook with pre-written exercises
  2. Compiler Explorer (godbolt.org) for quick CUDA compilation
  3. Shared screen coding if individual setups fail

Optional Enhancements

  • Whiteboard/Digital board for drawing architecture diagrams
  • Shared document for collaborative notes
  • Recording setup if sessions will be recorded

📧 Post-Session Follow-up

Immediate Actions (within 24 hours)

  • Send session summary with key takeaways
  • Share exercise solutions and explanations
  • Distribute Week 2 materials and reading list
  • Address any unresolved technical issues individually

Week 2 Preparation

  • Update study plan based on group’s pace and interests
  • Prepare more challenging exercises if group is advanced
  • Create reference materials for common issues encountered

Continuous Improvement

  • Collect feedback via simple survey
  • Note time management - which sections need more/less time?
  • Track progress - are learning objectives being met?

🚨 Troubleshooting Common Issues

CUDA Installation Problems

  • Windows: Direct to CUDA installer, check Visual Studio compatibility
  • Linux: Package manager installation, driver conflicts
  • macOS: No native CUDA support - recommend Google Colab or a remote Linux machine

Exercise Compilation Errors

  • Path issues: Help locate nvcc
  • Library linking: Show basic compilation flags
  • Code errors: Common typos and fixes

Conceptual Confusion

  • Memory hierarchy: Use diagrams and analogies
  • Thread indexing: Work through examples step-by-step
  • Execution model: Compare to familiar parallel concepts

Session Success Criteria: Everyone leaves with a working CUDA setup and a clear understanding of GPU architecture fundamentals