E0358 Advanced Techniques in Compilation and Programming for Parallel Architectures
Instructor: UKRB
Tue, Thu: 11:00 am--12:30 pm
CSA 117
Course Notes
Slides
Instructor Lectures (first month)
- Introduction
- Compilers in the 21st century
- Parallel architectures (history, evolution, taxonomy)
- Landscape of programming and compilation for the ML/AI era
- Compiling for the ML/AI era
- Compiler Intermediate Representations
- MLIR
- Polyhedral compiler framework
- Polyhedral model - representation
- Dependence analysis
- Transformations and scheduling
- Affine transformations
- Optimizations and Parallelization
- Code generation, tools, and libraries
- De-Facto Parallel Programming Models
- OpenMP
- MPI
- Building compilers for ML/AI hardware and ML/AI programming models
Paper discussions / Seminars
- Optimizing Matrix-Matrix Multiply for peak performance
-
Analytical Modeling Is Enough for High-Performance BLIS
https://dl.acm.org/citation.cfm?id=2925987
Low et al., ACM TOMS 2016
-
Anatomy of high-performance matrix multiplication
https://dl.acm.org/citation.cfm?id=1356053
Goto and van de Geijn, ACM TOMS 2008
-
The LLVM IR and toolchain [Gokulnath]
- The LLVM Intermediate Representation
- LLVM infrastructure and tools: opt, llc
- Target code generation in LLVM (pattern matching, rewriting)
-
NVIDIA GPU Architecture and the CUDA programming model [Keshav]
-
DNNFusion: accelerating deep neural networks execution with advanced operator fusion [Himanshu]
PLDI 2021
https://dl.acm.org/doi/10.1145/3453483.3454083
-
AKG: automatic kernel generation for neural processing units using polyhedral transformations [Pritam]
PLDI 2021
https://dl.acm.org/doi/10.1145/3453483.3454106
-
DISTAL: The Distributed Tensor Algebra Compiler [Kripa Shanker]
ACM SIGPLAN PLDI 2022
https://dl.acm.org/doi/10.1145/3519939.3523437
-
Autoscheduling for Sparse Tensor Algebra with an Asymptotic Cost Model [Nishit]
ACM SIGPLAN PLDI 2022
https://dl.acm.org/doi/abs/10.1145/3519939.3523442
-
All you need is Superword-Level Parallelism: Systematic Control-Flow Vectorization with SLP [Ajay]
ACM SIGPLAN PLDI 2022
https://dl.acm.org/doi/abs/10.1145/3519939.3523701
-
TVM, Tensor Expressions [Dhruv]
-
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
OSDI 2018
https://www.usenix.org/conference/osdi18/presentation/chen
-
Working with Operators Using Tensor Expression
https://tvm.apache.org/docs/tutorial/tensor_expr_get_started.html
-
Ansor: generating high-performance tensor programs for deep learning [Abhishek]
OSDI 2020
https://dl.acm.org/doi/abs/10.5555/3488766.3488815
-
RAMMER: enabling holistic deep learning compiler optimizations with rtasks [Abhishek]
OSDI 2020
https://dl.acm.org/doi/abs/10.5555/3488766.3488816
-
DeepCuts: a deep learning optimization framework for versatile GPU workloads [Gokulnath]
PLDI 2021
https://dl.acm.org/doi/10.1145/3453483.3454038
-
AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures [Ajay]
ASPLOS 2022
https://dl.acm.org/doi/10.1145/3503222.3507723
-
Fast Algorithms for Convolutional Neural Networks [Keshav]
Andrew Lavin, Scott Gray
https://arxiv.org/abs/1509.09308
-
Code generation for GPU tensor cores [Dhruv]
-
Triton: an intermediate language and compiler for tiled neural network computations
MAPL 2019
https://dl.acm.org/doi/abs/10.1145/3315508.3329973
-
Fireiron: A Data-Movement-Aware Scheduling Language for GPUs
PACT 2020
https://dl.acm.org/doi/10.1145/3410463.3414632
-
ROLLER: Fast and Efficient Tensor Compilation for Deep Learning [Pritam]
OSDI 2022
https://www.usenix.org/conference/osdi22/presentation/zhu
-
Apollo: Automatic Partition-based Operator Fusion through Layer by Layer Optimization [Himanshu]
MLSys 2022
https://proceedings.mlsys.org/paper/2022/hash/069059b7ef840f0c74a814ec9237b6ec-Abstract.html
-
UNIT: Unifying Tensorized Instruction Compilation [Nishit]
https://arxiv.org/abs/2101.08458
2021
-
Cortex: A Compiler for Recursive Deep Learning Models [Kripa Shanker]
https://arxiv.org/abs/2011.01383
MLSys 2021
Assignments
Seminars
- Pick a topic
- Presentation with discussion/questions interspersed (typically, two classes of 1.5 hrs each)
Evaluation
Assignment 1: 25%, Assignment 2: 25%, seminar: 25%, class participation: 25%
Additional reading material
- Clan documentation
- MLIR
- Polyhedral model - mathematical background and introduction
- Chapter 11, Compilers: Principles, Techniques, and Tools - Aho, Lam, Sethi, and Ullman
- Theory of Integer and Linear Programming, A. Schrijver
- Introduction to Linear Algebra, Gilbert Strang
- OpenMP tutorial
- MPI tutorial
- MPI Intro by Bill Gropp
- Introduction to OpenMP by Tim Mattson
- MPI Standard (2.2)
- Parallel architectures - Overview
- Parallel Computer Architecture (book by David Culler et al.)