E0358 Advanced Techniques in Compilation and Programming for Parallel Architectures

Instructor: UKRB
Tue, Thu: 11:00 am--12:30 pm
CSA 117

Course Notes

Slides

Instructor Lectures (first month)

Paper discussions / Seminars

  1. Optimizing Matrix-Matrix Multiply for peak performance (a loop-tiling sketch follows this list)
    • Analytical Modeling Is Enough for High-Performance BLIS
      https://dl.acm.org/citation.cfm?id=2925987
      Low et al., ACM TOMS 2016
    • Anatomy of high-performance matrix multiplication
      https://dl.acm.org/citation.cfm?id=1356053
      Goto and van de Geijn, ACM TOMS 2008
  2. The LLVM IR and toolchain [Gokulnath]
    • The LLVM Intermediate Representation
    • LLVM infrastructure and tools: opt, llc
    • Target code generation in LLVM (pattern matching, rewriting)
  3. NVIDIA GPU Architecture and the CUDA programming model [Keshav]
  4. DNNFusion: accelerating deep neural networks execution with advanced operator fusion [Himanshu]
    PLDI 2021.
    https://dl.acm.org/doi/10.1145/3453483.3454083
  5. AKG: automatic kernel generation for neural processing units using polyhedral transformations [Pritam]
    PLDI 2021.
    https://dl.acm.org/doi/10.1145/3453483.3454106
  6. DISTAL: The Distributed Tensor Algebra Compiler [Kripa Shanker]
    ACM SIGPLAN PLDI 2022
    https://dl.acm.org/doi/10.1145/3519939.3523437
  7. Autoscheduling for Sparse Tensor Algebra with an Asymptotic Cost Model [Nishit]
    ACM SIGPLAN PLDI 2022
    https://dl.acm.org/doi/abs/10.1145/3519939.3523442
  8. All you need is Superword-Level Parallelism: Systematic Control-Flow Vectorization with SLP [Ajay]
    ACM SIGPLAN PLDI 2022
    https://dl.acm.org/doi/abs/10.1145/3519939.3523701
  9. TVM and Tensor Expressions [Dhruv]
    • TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
      OSDI 2018
      https://www.usenix.org/conference/osdi18/presentation/chen
    • Working with Operators Using Tensor Expression
      https://tvm.apache.org/docs/tutorial/tensor_expr_get_started.html
  10. Ansor: generating high-performance tensor programs for deep learning [Abhishek]
    OSDI'20: Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation
    https://dl.acm.org/doi/abs/10.5555/3488766.3488815
  11. Rammer: enabling holistic deep learning compiler optimizations with rTasks [Abhishek]
    OSDI 2020
    https://dl.acm.org/doi/abs/10.5555/3488766.3488816
  12. DeepCuts: a deep learning optimization framework for versatile GPU workloads [Gokulnath]
    PLDI 2021
    https://dl.acm.org/doi/10.1145/3453483.3454038
  13. AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures [Ajay]
    ASPLOS 2022.
    https://dl.acm.org/doi/10.1145/3503222.3507723
  14. Fast Algorithms for Convolutional Neural Networks [Keshav]
    Andrew Lavin, Scott Gray
    https://arxiv.org/abs/1509.09308
  15. Code generation for GPU tensor cores [Dhruv]
    • Triton: an intermediate language and compiler for tiled neural network computations
      MAPL 2019
      https://dl.acm.org/doi/abs/10.1145/3315508.3329973
    • Fireiron: A Data-Movement-Aware Scheduling Language for GPUs
      PACT 2020.
      https://dl.acm.org/doi/10.1145/3410463.3414632
  16. ROLLER: Fast and Efficient Tensor Compilation for Deep Learning [Pritam]
    OSDI 2022
    https://www.usenix.org/conference/osdi22/presentation/zhu
  17. Apollo: Automatic Partition-based Operator Fusion through Layer by Layer Optimization [Himanshu]
    MLSys 2022
    https://proceedings.mlsys.org/paper/2022/hash/069059b7ef840f0c74a814ec9237b6ec-Abstract.html
  18. UNIT: Unifying Tensorized Instruction Compilation [Nishit]
    https://arxiv.org/abs/2101.08458
    2021
  19. Cortex: A Compiler for Recursive Deep Learning Models [Kripa Shanker]
    https://arxiv.org/abs/2011.01383
    2021
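
The cache-blocking (loop-tiling) structure behind the matrix-multiply papers in item 1 can be illustrated with a short sketch. This is a minimal illustration and not the BLIS algorithm itself: the block sizes MC/KC/NC below are placeholder values rather than the analytically derived ones, and a tuned implementation would also pack panels of A and B into contiguous buffers and replace the innermost loops with a vectorized, register-blocked microkernel.

    /* Cache-blocked matrix multiply sketch: C += A * B, row-major.
       A is MxK, B is KxN, C is MxN. Block sizes MC/KC/NC are illustrative
       placeholders; the papers in item 1 derive them from cache parameters. */
    #include <stddef.h>

    enum { MC = 256, KC = 128, NC = 4096 };

    void matmul_blocked(size_t M, size_t N, size_t K,
                        const double *A, const double *B, double *C)
    {
        for (size_t jc = 0; jc < N; jc += NC)           /* column panels of B and C */
            for (size_t pc = 0; pc < K; pc += KC)       /* blocks along the K dimension */
                for (size_t ic = 0; ic < M; ic += MC) { /* row panels of A and C */
                    size_t nb = (jc + NC < N) ? NC : N - jc;
                    size_t kb = (pc + KC < K) ? KC : K - pc;
                    size_t mb = (ic + MC < M) ? MC : M - ic;
                    /* Macrokernel on one block; a tuned version would call a
                       vectorized microkernel on packed sub-panels here. */
                    for (size_t i = 0; i < mb; ++i)
                        for (size_t p = 0; p < kb; ++p)
                            for (size_t j = 0; j < nb; ++j)
                                C[(ic + i) * N + (jc + j)] +=
                                    A[(ic + i) * K + (pc + p)] * B[(pc + p) * N + (jc + j)];
                }
    }

The three outer loops mirror the outermost loops of the GotoBLAS/BLIS loop structure discussed in both papers; the point of the sketch is only the blocking order, not peak performance.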

Assignments

Seminars

Evaluation

Assignment 1: 25%, Assignment 2: 25%, Seminar: 25%, Class participation: 25%

Additional reading material

  1. Clan documentation
  2. MLIR
  3. Polyhedral model - mathematical background and introduction
  4. Chapter 11 of Compilers: Principles, Techniques, and Tools (2nd ed.), Aho, Lam, Sethi, and Ullman
  5. Theory of Linear and Integer Programming, A. Schrijver
  6. Introduction to Linear Algebra, Gilbert Strang
  7. OpenMP tutorial
  8. MPI tutorial
  9. MPI Intro by Bill Gropp
  10. Introduction to OpenMP by Tim Mattson
  11. MPI Standard (2.2)
  12. Parallel architectures - Overview
  13. Parallel Computer Architecture (book by David Culler et al.)