Seminars
High-Performance GPU Tensor Core Code Generation for Matmul using MLIR
Series: M.Tech (Research) Thesis Defence - ONLINE
Speaker: Mr. Navdeep Kumar Katel, M.Tech (Research) student, Dept. of CSA
Date/Time: Nov 29 15:30:00
Location: Microsoft Teams - ONLINE
Faculty Advisor: Prof. Uday Kumar Reddy B.
Abstract:
State of the art in high-performance deep learning is primarily driven by highly
tuned libraries. These libraries are often hand-optimized and tuned by expert
programmers using low-level abstractions with significant effort. Much of this
effort may have to be repeated for similar hardware and for future generations.
The process is thus not modular or reusable to the same extent that compiler
infrastructures like LLVM are. Manual optimization does not typically use a
standard intermediate representation (IR) or transformations and passes on such
IRs, although the optimizations performed can be encoded as a sequence of
transformation steps and customized passes on an IR.
<br>
We believe that until the recent introduction of MLIR (Multi-Level Intermediate
Representation), IR infrastructure was not geared to tackle the problem of
automatic library generation effectively. In particular, it was
hard to represent and transform compute abstractions at high, middle, and low
levels using a single IR. Multiple levels of abstraction in a single IR permit
the user to apply transformations and optimizations at the most suitable level
and even to reuse them across different targets and front-ends.
<br>
Some previous works have optimized matrix-matrix multiplication (matmul) for
different GPU microarchitectures. All of these works exploit very low-level
details of the hardware. Some are written directly in assembly, while others
combine CUDA C++ with inline assembly. While the set of high-level
optimizations is the same, this dependence on low-level hardware details keeps
them from being reusable. Going against this trend, we show
that, by using a set of simple optimizations, suitable abstractions, and
lowering passes on such abstractions in MLIR, we can get competitive performance
with hand-written libraries.
<br>
To achieve this, we put together a lowering pipeline that can automatically
generate code for matmul on NVIDIA GPUs (without hand-writing any code) while
utilizing their tensor cores. We have used and extended some existing utilities
in MLIR, such as tiling, loop unrolling, loop permutation, and generation of
fast memory buffers for input operands. Additional utilities, types, and
operations necessary for optimal code generation were implemented from scratch.
These include adding WMMA operations and types to provide fundamental support
for programming tensor cores, adding loop normalization support, adding
multi-level tiling support in the affine dialect, creating WMMA operations to
load, compute, and store matrix products in a given matmul nest, detecting and
hoisting invariant WMMA load-store pairs, hiding the latency of global-to-shared
data movement, and adding support for mapping and converting parallel loops to
warps.
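As a rough illustration of these warp-level abstractions, and not code emitted
by our pipeline, the CUDA sketch below shows what the WMMA load, compute, and
store operations correspond to when written by hand with NVIDIA's nvcuda::wmma
API. The kernel name, launch configuration, and the choice of row-major FP16
operands with FP32 accumulation are assumptions made for the example.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Hypothetical sketch: each warp computes one 16x16 tile of C = A * B, with
// A and B in FP16 (row-major) and accumulation in FP32. M, N, and K are
// assumed to be multiples of 16. The generated code described above
// additionally tiles for thread blocks, stages operands in shared memory,
// and overlaps global-to-shared copies with compute.
__global__ void wmma_matmul_sketch(const half *A, const half *B, float *C,
                                   int M, int N, int K) {
  // Identify which 16x16 output tile this warp owns.
  int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize; // tile row
  int warpN = blockIdx.y * blockDim.y + threadIdx.y;              // tile column
  if (warpM * 16 >= M || warpN * 16 >= N) return;

  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;
  wmma::fill_fragment(cFrag, 0.0f);

  // Reduction over K in steps of the 16-wide MMA shape: each iteration loads
  // one 16x16 tile of A and B into register fragments and issues one tensor
  // core matrix-multiply-accumulate for the whole warp.
  for (int k = 0; k < K; k += 16) {
    const half *aTile = A + warpM * 16 * K + k;  // A is M x K, row-major
    const half *bTile = B + k * N + warpN * 16;  // B is K x N, row-major
    wmma::load_matrix_sync(aFrag, aTile, K);
    wmma::load_matrix_sync(bFrag, bTile, N);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);  // cFrag += aFrag * bFrag
  }

  // Store the accumulated 16x16 tile back to C (M x N, row-major).
  wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, cFrag, N,
                          wmma::mem_row_major);
}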
<br>
On the set of problem sizes we evaluated, performance results show that we can
attain performance that is 95-119% and 80-160% of cuBLAS for FP32 and FP16
accumulate, respectively, on NVIDIA's Ampere microarchitecture based GeForce
RTX 3090. A similar evaluation on NVIDIA's Turing-based RTX 2080 Ti revealed
that we achieve 86-111% and 72-89% of cuBLAS for FP32 and FP16 accumulate,
respectively.
<br>
We take our approach further by fusing common pointwise operations with
matrix-matrix multiplication. This is the first work to demonstrate fusion of
operations for tensor core matmul using a systematic IR-based approach. Fusion
is done with the support of additional WMMA operations, which perform warp-level
matrix operations such as ReLU and constant addition. We see significant gains
on small to medium problem sizes when evaluating our fused kernels against a
combination of library kernels and custom kernels. On Ampere, consumer fusion
performance ranges from 95% to 167% of the respective unfused implementations;
the corresponding range on Turing is 84% to 150%. We also present preliminary
results, which serve as a proof of concept, for producer fusion, i.e., fusion of
pointwise operations on the inputs with matmul. The performance of ReLU on the C
input fused with matmul, compared against a custom ReLU kernel followed by
cuBLAS matmul, ranges from 98% to 138% on Ampere and from 91% to 133% on Turing.
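As an illustration of what warp-level consumer fusion looks like when written
by hand with the WMMA API (a hedged sketch, not the code our pipeline emits),
the following device function applies constant addition and ReLU directly to
the accumulator fragment before the store; the function name, the bias
parameter, and the row-major layout are assumptions made for the example.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Hypothetical warp-level epilogue sketch: a pointwise consumer (constant
// addition followed by ReLU) is applied to the WMMA accumulator fragment
// while it is still in registers, and only then is the 16x16 tile written to
// global memory. Fusing at this level avoids a separate pointwise kernel and
// an extra round trip through global memory.
__device__ void fused_relu_epilogue(
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> &cFrag,
    float *cTile, int ldc, float bias) {
  // Each thread in the warp holds a subset of the tile's elements; apply the
  // pointwise operations element by element on the fragment's registers.
  for (int i = 0; i < cFrag.num_elements; ++i) {
    cFrag.x[i] = fmaxf(cFrag.x[i] + bias, 0.0f);  // constant addition + ReLU
  }
  // Single store of the already-fused result.
  wmma::store_matrix_sync(cTile, cFrag, ldc, wmma::mem_row_major);
}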
<br>
We believe that these results could be used as a foundation and motivation for
further research and development on automatic code and library generation using
IR infrastructure for similar specialized accelerators.
<br>
Microsoft Teams online link: