E0358 Advanced Techniques in Compilation and Programming for Parallel Architectures
Instructor: UKRB
Tue, Thu: 10:00--11:30 am
CSA 117
Course Notes
Slides (constantly updated as the course progresses)
Instructor Lectures (first one month)
- Introduction
- Compilers in the 21st century
- Parallel architectures (history, evolution, taxonomy)
- Landscape of programming and compilation for the ML/AI era
- Compiling for the ML/AI era
- Compiler Intermediate Representations
- MLIR
- Polyhedral compiler framework
- Polyhedral model - representation
- Dependence analysis
- Transformations and scheduling
- Affine transformations
- Optimizations and Parallelization
- Code generation, tools, and libraries
- OpenMP
- MPI
Paper discussions / Seminars
- Performance characterization [Dhairya]
- Roofline Model: an insightful visual performance model for multicore architectures
Williams, Samuel; Waterman, Andrew; Patterson, David
https://dl.acm.org/doi/10.1145/1498765.1498785
- Execution-Cache-Memory Performance Model: Introduction and Validation
Johannes Hofmann, Jan Eitzinger, Dietmar Fey
https://arxiv.org/pdf/1509.03118
- Optimizing Matrix-Matrix Multiply for peak performance [Pushpendra]
- Analytical Modeling Is Enough for High-Performance BLIS
https://dl.acm.org/citation.cfm?id=2925987
Low et al, ACM TOMS 2016
- Anatomy of high-performance matrix multiplication
https://dl.acm.org/citation.cfm?id=1356053
ACM TOMS 2008
- The LLVM IR and toolchain [Dhruvin]
- The LLVM Intermediate Representation
- LLVM infrastructure and tools: opt, llc
- Target code generation in LLVM (pattern matching, rewriting)
- NVIDIA GPU Architecture and the CUDA programming model [Rajeshwaran]
- The Triton programming framework [Sasidhar]
https://openai.com/index/triton/
- PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation [Abhishek]
ASPLOS 2024.
https://docs.pytorch.org/assets/pytorch2-2.pdf
https://pytorch.org/blog/pytorch-pytorch-2-paper-tutorial/
- Task-Based Tensor Computations on Modern GPUs [Gayatri]
PLDI 2025.
https://dl.acm.org/doi/10.1145/3729262
- Optimizing Deep Learning Inference Efficiency through Block Dependency Analysis [Ayush]
ASPLOS 2025.
https://dl.acm.org/doi/10.1145/3676641.3716264
- The MLIR Transform Dialect - Your compiler is more powerful than you think [guest/auditing student]
CGO 2025.
- AI Accelerators for Large Language Model Inference: Architecture Analysis and Scaling Strategies [Pushpendra]
https://arxiv.org/abs/2506.00008
- Understanding the computational characteristics of the Transformer block of modern generative AI models [Abhishek, Dhairya, Rajeshwaran]
- LLM visualization
https://bbycroft.net/llm
- Attention is All You Need
https://arxiv.org/abs/1706.03762
- What is an attention mechanism?
https://www.ibm.com/think/topics/attention-mechanism
- KV Caching in Attention
https://huggingface.co/blog/not-lain/kv-caching
- Attention layer fusion and optimization [Sasidhar, Gayatri, Ayush]
- Fast and Memory-Efficient Exact Attention with IO-Awareness
https://arxiv.org/abs/2205.14135
- FlashAttention-2: Faster Attention with Better Parallelism ...
https://arxiv.org/abs/2307.08691
Assignments
Seminars
- Pick a topic
- Presentation with discussion/questions interspersed (typically, two classes of 1.5 hrs each)
Evaluation
Evaluation: 25% assignment-1, 25% assignment-2, 25% seminar, 25% class participation
Additional reading material
- Polyhedral model - mathematical background and introduction
- Analytical Modeling Is Enough for High-Performance BLIS
https://dl.acm.org/citation.cfm?id=2925987
Low et al, ACM TOMS 2016.
- Chapter 11 - Compilers - Aho, Lam, Sethi, and Ullman
- Theory of Integer and Linear Programming, A. Schrijver
- Introduction to Linear Algebra, Gilbert Strang
- OpenMP tutorial
- MPI tutorial
- MPI Intro by Bill Gropp
- Introduction to OpenMP by Tim Mattson
- MPI Standard (2.2)
- Parallel architectures - Overview
- Parallel Computer Architecture (book by David Culler et al.)