E0358 Advanced Techniques in Compilation and Programming for Parallel Architectures

Instructor: UKRB
Tue, Thu: 11:00 am--12:30 pm
CSA 117

Course Notes

Slides

Instructor Lectures (first month)

Paper discussions / Seminars

  1. Optimizing Matrix-Matrix Multiply for peak performance (a loop-tiling sketch follows this list)
    • Analytical Modeling Is Enough for High-Performance BLIS
      https://dl.acm.org/citation.cfm?id=2925987
      Low et al., ACM TOMS 2016
    • Anatomy of high-performance matrix multiplication
      https://dl.acm.org/citation.cfm?id=1356053
      Goto and van de Geijn, ACM TOMS 2008
  2. The LLVM IR and toolchain [Gokulnath]
    • The LLVM Intermediate Representation
    • LLVM infrastructure and tools: opt, llc
    • Target code generation in LLVM (pattern matching, rewriting)
  3. NVIDIA GPU Architecture and the CUDA programming model [Keshav]
  4. DNNFusion: accelerating deep neural networks execution with advanced operator fusion [Himanshu]
    PLDI 2021.
    https://dl.acm.org/doi/10.1145/3453483.3454083
  5. AKG: automatic kernel generation for neural processing units using polyhedral transformations [Pritam]
    PLDI 2021.
    https://dl.acm.org/doi/10.1145/3453483.3454106
  6. DISTAL: The Distributed Tensor Algebra Compiler [Kripa Shanker]
    ACM SIGPLAN PLDI 2022
    https://dl.acm.org/doi/10.1145/3519939.3523437
  7. Autoscheduling for Sparse Tensor Algebra with an Asymptotic Cost Model [Nishit]
    ACM SIGPLAN PLDI 2022
    https://dl.acm.org/doi/abs/10.1145/3519939.3523442
  8. All you need is Superword-Level Parallelism: Systematic Control-Flow Vectorization with SLP [Ajay]
    ACM SIGPLAN PLDI 2022
    https://dl.acm.org/doi/abs/10.1145/3519939.3523701
  9. TVM and Tensor Expressions [Dhruv]
    • TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
      OSDI 2018
      https://www.usenix.org/conference/osdi18/presentation/chen
    • Working with Operators Using Tensor Expression
      https://tvm.apache.org/docs/tutorial/tensor_expr_get_started.html
  10. Ansor: generating high-performance tensor programs for deep learning [Abhishek]
    OSDI'20: Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation
    https://dl.acm.org/doi/abs/10.5555/3488766.3488815
  11. Rammer: enabling holistic deep learning compiler optimizations with rTasks [Abhishek]
    OSDI 2020
    https://dl.acm.org/doi/abs/10.5555/3488766.3488816
  12. DeepCuts: a deep learning optimization framework for versatile GPU workloads [Gokulnath]
    PLDI 2021
    https://dl.acm.org/doi/10.1145/3453483.3454038
  13. AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures [Ajay]
    ASPLOS 2022.
    https://dl.acm.org/doi/10.1145/3503222.3507723
  14. Fast Algorithms for Convolutional Neural Networks [Keshav]
    Andrew Lavin, Scott Gray
    https://arxiv.org/abs/1509.09308
  15. Code generation for GPU tensor cores [Dhruv]
    • Triton: an intermediate language and compiler for tiled neural network computations
      MAPL 2019
      https://dl.acm.org/doi/abs/10.1145/3315508.3329973
    • Fireiron: A Data-Movement-Aware Scheduling Language for GPUs
      PACT 2020.
      https://dl.acm.org/doi/10.1145/3410463.3414632
  16. ROLLER: Fast and Efficient Tensor Compilation for Deep Learning [Pritam]
    OSDI 2022
    https://www.usenix.org/conference/osdi22/presentation/zhu
  17. Apollo: Automatic Partition-based Operator Fusion through Layer by Layer Optimization [Himanshu]
    MLSys 2022
    https://proceedings.mlsys.org/paper/2022/hash/069059b7ef840f0c74a814ec9237b6ec-Abstract.html
  18. UNIT: Unifying Tensorized Instruction Compilation [Nishit]
    https://arxiv.org/abs/2101.08458
    2021
  19. Cortex: A Compiler for Recursive Deep Learning Models [Kripa Shanker]
    https://arxiv.org/abs/2011.01383
    2021
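
The cache-blocking (loop-tiling) structure behind the matrix-multiply papers in item 1 can be illustrated with a short sketch. This is a minimal illustration and not the BLIS algorithm itself: the block sizes MC/KC/NC below are placeholder values rather than the analytically derived ones, and a tuned implementation would also pack panels of A and B into contiguous buffers and replace the innermost loops with a vectorized, register-blocked microkernel.

    /* Cache-blocked matrix multiply sketch: C += A * B, row-major.
       A is MxK, B is KxN, C is MxN. Block sizes MC/KC/NC are illustrative
       placeholders; the papers in item 1 derive them from cache parameters. */
    #include <stddef.h>

    enum { MC = 256, KC = 128, NC = 4096 };

    void matmul_blocked(size_t M, size_t N, size_t K,
                        const double *A, const double *B, double *C)
    {
        for (size_t jc = 0; jc < N; jc += NC)           /* column panels of B and C */
            for (size_t pc = 0; pc < K; pc += KC)       /* blocks along the K dimension */
                for (size_t ic = 0; ic < M; ic += MC) { /* row panels of A and C */
                    size_t nb = (jc + NC < N) ? NC : N - jc;
                    size_t kb = (pc + KC < K) ? KC : K - pc;
                    size_t mb = (ic + MC < M) ? MC : M - ic;
                    /* Macrokernel on one block; a tuned version would call a
                       vectorized microkernel on packed sub-panels here. */
                    for (size_t i = 0; i < mb; ++i)
                        for (size_t p = 0; p < kb; ++p)
                            for (size_t j = 0; j < nb; ++j)
                                C[(ic + i) * N + (jc + j)] +=
                                    A[(ic + i) * K + (pc + p)] * B[(pc + p) * N + (jc + j)];
                }
    }

The three outer loops mirror the outermost loops of the GotoBLAS/BLIS loop structure discussed in both papers; the point of the sketch is only the blocking order, not peak performance.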

Assignments

Seminars

Evaluation

Assignment 1: 25%, Assignment 2: 25%, Seminar: 25%, Class participation: 25%

Additional reading material

  1. Clan documentation
  2. MLIR
  3. Polyhedral model - mathematical background and introduction
  4. Chapter 11 of Compilers: Principles, Techniques, and Tools (2nd ed.), Aho, Lam, Sethi, and Ullman
  5. Theory of Linear and Integer Programming, A. Schrijver
  6. Introduction to Linear Algebra, Gilbert Strang
  7. OpenMP tutorial
  8. MPI tutorial
  9. MPI Intro by Bill Gropp
  10. Introduction to OpenMP by Tim Mattson
  11. MPI Standard (2.2)
  12. Parallel architectures - Overview
  13. Parallel Computer Architecture (book by David Culler et al.)