E0358 Advanced Techniques in Compilation and Programming for Parallel Architectures

Instructor: UKRB
Tue, Thu: 10:00–11:30 am
CSA 117

Course Notes

Slides (constantly updated as the course progresses)

Instructor lectures (first month)

Paper discussions / Seminars

  1. Performance characterization [Dhairya]
    • Roofline: An Insightful Visual Performance Model for Multicore Architectures
      Samuel Williams, Andrew Waterman, David Patterson
      https://dl.acm.org/doi/10.1145/1498765.1498785
    • Execution-Cache-Memory Performance Model: Introduction and Validation
      Johannes Hofmann, Jan Eitzinger, Dietmar Fey
      https://arxiv.org/pdf/1509.03118
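
    A minimal sketch of the roofline bound from the Williams et al. paper; the peak-compute and peak-bandwidth numbers below are hypothetical placeholders, not measurements of any real machine:

      # Attainable GFLOP/s = min(peak compute, peak bandwidth * arithmetic intensity)
      PEAK_GFLOPS = 500.0   # hypothetical peak compute, GFLOP/s
      PEAK_BW = 50.0        # hypothetical peak DRAM bandwidth, GB/s

      def roofline(intensity):
          # intensity: arithmetic intensity in FLOPs per byte moved
          return min(PEAK_GFLOPS, PEAK_BW * intensity)

      ridge = PEAK_GFLOPS / PEAK_BW   # intensity where the two ceilings meet
      for ai in (0.25, ridge, 100.0):
          print(f"AI={ai:g} FLOPs/byte -> {roofline(ai):g} GFLOP/s")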

  2. Optimizing Matrix-Matrix Multiply for peak performance [Pushpendra]
    • Analytical Modeling Is Enough for High-Performance BLIS
      Low et al., ACM TOMS 2016
      https://dl.acm.org/citation.cfm?id=2925987
    • Anatomy of High-Performance Matrix Multiplication
      Goto and van de Geijn, ACM TOMS 2008
      https://dl.acm.org/citation.cfm?id=1356053
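
    As a point of departure for the discussion, a cache-blocking sketch in plain NumPy; the block size and loop order here are illustrative only, whereas the papers derive them analytically for the BLIS microkernel hierarchy:

      import numpy as np

      def blocked_matmul(A, B, bs=64):
          # Tile the iteration space so each bs x bs block of A, B, and C
          # is reused while it is still resident in cache.
          n, m, p = A.shape[0], A.shape[1], B.shape[1]
          C = np.zeros((n, p), dtype=A.dtype)
          for i in range(0, n, bs):
              for k in range(0, m, bs):
                  for j in range(0, p, bs):
                      C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
          return C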

  3. The LLVM IR and toolchain [Dhruvin]
    • The LLVM Intermediate Representation
    • LLVM infrastructure and tools: opt, llc
    • Target code generation in LLVM (pattern matching, rewriting)
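
    A toy taste of building IR programmatically, using the llvmlite Python binding (chosen here only to keep the example in Python; the seminar itself covers the C++ infrastructure and the opt/llc tools):

      from llvmlite import ir

      # Build a module containing: define i32 @add(i32 %a, i32 %b)
      module = ir.Module(name="demo")
      i32 = ir.IntType(32)
      fn = ir.Function(module, ir.FunctionType(i32, (i32, i32)), name="add")
      builder = ir.IRBuilder(fn.append_basic_block(name="entry"))
      a, b = fn.args
      builder.ret(builder.add(a, b, name="sum"))
      print(module)   # textual LLVM IR; feed it to opt/llc to optimize and codegen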

  4. NVIDIA GPU Architecture and the CUDA programming model [Rajeshwaran]
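
    A SAXPY kernel in Numba's CUDA dialect as a minimal sketch of the model's grid/block/thread decomposition (Numba is used only so the example stays in Python; the seminar covers CUDA C++ proper):

      import numpy as np
      from numba import cuda

      @cuda.jit
      def saxpy(a, x, y, out):
          i = cuda.grid(1)            # global thread index across the 1D grid
          if i < out.size:            # guard the ragged final block
              out[i] = a * x[i] + y[i]

      n = 1 << 20
      x = np.random.rand(n).astype(np.float32)
      y = np.random.rand(n).astype(np.float32)
      out = np.zeros_like(x)
      threads = 256
      blocks = (n + threads - 1) // threads
      saxpy[blocks, threads](np.float32(2.0), x, y, out)  # implicit host<->device copies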

  5. The Triton programming framework [Sasidhar]
    https://openai.com/index/triton/
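
    The canonical vector-add kernel from the Triton tutorials, lightly abridged: each program instance handles one BLOCK_SIZE-wide tile, with masking for the ragged edge:

      import torch
      import triton
      import triton.language as tl

      @triton.jit
      def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
          pid = tl.program_id(axis=0)
          offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
          mask = offsets < n_elements
          x = tl.load(x_ptr + offsets, mask=mask)
          y = tl.load(y_ptr + offsets, mask=mask)
          tl.store(out_ptr + offsets, x + y, mask=mask)

      x = torch.rand(98432, device="cuda")
      y = torch.rand(98432, device="cuda")
      out = torch.empty_like(x)
      grid = (triton.cdiv(x.numel(), 1024),)
      add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)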

  6. PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation [Abhishek]
    ASPLOS 2024.
    https://docs.pytorch.org/assets/pytorch2-2.pdf
    https://pytorch.org/blog/pytorch-pytorch-2-paper-tutorial/
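
    The user-facing surface of the paper is a single call: TorchDynamo captures Python bytecode into an FX graph and TorchInductor generates fused kernels behind it:

      import torch

      def f(x, w):
          return torch.nn.functional.gelu(x @ w)

      compiled_f = torch.compile(f)   # Dynamo bytecode capture + Inductor codegen
      x = torch.randn(128, 256)
      w = torch.randn(256, 256)
      out = compiled_f(x, w)          # first call compiles; later calls reuse the kernels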

  7. Task-Based Tensor Computations on Modern GPUs [Gayatri]
    PLDI 2025.
    https://dl.acm.org/doi/10.1145/3729262

  8. Optimizing Deep Learning Inference Efficiency through Block Dependency Analysis [Ayush]
    ASPLOS 2025.
    https://dl.acm.org/doi/10.1145/3676641.3716264

  9. The MLIR Transform Dialect - Your compiler is more powerful than you think [guest/auditing student]
    CGO 2025.

  10. AI Accelerators for Large Language Model Inference: Architecture Analysis and Scaling Strategies [Pushpendra]
    https://arxiv.org/abs/2506.00008

  11. Understanding the computational characteristics of the Transformer block of modern generative AI models [Abhishek, Dhairya, Rajeshwaran]
    • LLM visualization
      https://bbycroft.net/llm
    • Attention is All You Need
      https://arxiv.org/abs/1706.03762
    • What is an attention mechanism?
      https://www.ibm.com/think/topics/attention-mechanism
    • KV Caching in Attention
      https://huggingface.co/blog/not-lain/kv-caching
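
    A NumPy sketch of scaled dot-product attention as defined in Vaswani et al., plus the KV-cache append step described in the Hugging Face post (all shapes below are illustrative):

      import numpy as np

      def attention(Q, K, V):
          # softmax(Q K^T / sqrt(d)) V
          d = Q.shape[-1]
          S = Q @ K.T / np.sqrt(d)
          P = np.exp(S - S.max(axis=-1, keepdims=True))   # numerically stable softmax
          return (P / P.sum(axis=-1, keepdims=True)) @ V

      # Autoregressive decoding with a KV cache: append one key/value row per
      # step instead of recomputing K and V for all past tokens.
      d = 16
      K_cache = np.random.randn(10, d)   # keys for 10 past tokens
      V_cache = np.random.randn(10, d)   # values for 10 past tokens
      q, k, v = (np.random.randn(1, d) for _ in range(3))
      K_cache = np.vstack([K_cache, k])
      V_cache = np.vstack([V_cache, v])
      out = attention(q, K_cache, V_cache)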

  12. Attention layer fusion and optimization [Sasidhar, Gayatri, Ayush]
    • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
      https://arxiv.org/abs/2205.14135
    • FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
      https://arxiv.org/abs/2307.08691
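
    A NumPy sketch of the online-softmax tiling at the heart of FlashAttention: the full N x N score matrix is never materialized, only a running max, normalizer, and accumulator per query row. The block size is illustrative, and the real kernels additionally tile queries into SRAM and recompute in the backward pass:

      import numpy as np

      def streaming_attention(Q, K, V, block=128):
          n, d = K.shape
          m = np.full(Q.shape[0], -np.inf)    # running row-wise max of scores
          l = np.zeros(Q.shape[0])            # running softmax normalizer
          acc = np.zeros((Q.shape[0], d))     # unnormalized output accumulator
          for j in range(0, n, block):
              S = Q @ K[j:j+block].T / np.sqrt(d)
              m_new = np.maximum(m, S.max(axis=1))
              scale = np.exp(m - m_new)       # rescale old state to the new max
              P = np.exp(S - m_new[:, None])
              l = scale * l + P.sum(axis=1)
              acc = scale[:, None] * acc + P @ V[j:j+block]
              m = m_new
          return acc / l[:, None]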

Assignments

Seminars

Evaluation

Assignment 1: 25%; Assignment 2: 25%; seminar: 25%; class participation: 25%

Additional reading material

  1. Polyhedral model - mathematical background and introduction
  2. Analytical Modeling Is Enough for High-Performance BLIS
    Low et al., ACM TOMS 2016
    https://dl.acm.org/citation.cfm?id=2925987
  3. Chapter 11, Compilers: Principles, Techniques, and Tools (Aho, Lam, Sethi, and Ullman)
  4. Theory of Linear and Integer Programming, A. Schrijver
  5. Introduction to Linear Algebra, Gilbert Strang
  6. OpenMP tutorial
  7. MPI tutorial
  8. MPI Intro by Bill Gropp
  9. Introduction to OpenMP by Tim Mattson
  10. MPI Standard (2.2)
  11. Parallel architectures - Overview
  12. Parallel Computer Architecture (book by David Culler et al.)