
Automatic Code Generation for GPU Tensor Cores using MLIR

Series: M.Tech (Research) Colloquium - ON-LINE

Speaker: Mr. Navdeep Kumar Katel, M.Tech (Research) student, Dept. of CSA

Date/Time: Jul 12 16:00:00

Location: Microsoft Teams - ON-LINE

Faculty Advisor: Prof. Uday Kumar Reddy B.

The state of the art in high-performance deep learning is primarily driven by highly tuned libraries. These libraries are hand-optimized by expert programmers using low-level abstractions, at significant effort, and much of that effort may have to be repeated for similar and future hardware. Such a process is thus not modular or reusable to the same extent as compiler infrastructure like LLVM is. Manual optimization typically does not use a standard intermediate representation, or transformations and passes on such a representation, even though the optimizations performed can be encoded as a sequence of transformation steps and customized passes. Hand tuning may also miss design points that are only reachable by automatic code generation. We believe that until the recent introduction of MLIR (Multi-Level Intermediate Representation), intermediate representation infrastructure had not reached a stage where it could tackle the problem of automatic library generation in a scalable and convenient manner. In particular, it was hard to represent and transform compute abstractions at high, middle, and low levels using a single IR.
MLIR is an intermediate representation that aims to provide reusable, extensible compiler infrastructure and to reduce the cost of building domain-specific compilers and code generators. In this work, we tackle the problem of generating code targeting tensor cores on GPUs using the MLIR compiler infrastructure. Tensor cores are programmable units that perform matrix-multiply-and-accumulate (MMA) operations on small matrices. First, we introduce the low-level operations necessary to compute on tensor cores, which were absent from MLIR. Then, building on these operations, we put together a lowering pipeline that fully automatically generates code for matrix-matrix multiplication (matmul) on tensor cores. Matmul is an excellent candidate to demonstrate our work because: (1) it is at the heart of many deep-learning models such as BERT, and (2) it lends itself to demonstrating various individual optimizations. We evaluate our pipeline on two devices, (1) an NVIDIA Turing-based RTX 2080 Ti and (2) an NVIDIA Ampere-based GeForce RTX 3090, and with two different accumulation precisions, namely 32-bit and 16-bit wide floats. On the set of problem sizes we evaluate, we achieve between 93% and 117% of cuBLAS performance with F32 accumulation, and between 79% and 158% with F16 accumulation, on NVIDIA Turing and Ampere respectively. We take this approach further by demonstrating the fusion of matmul with operations that commonly follow it in deep-learning models.
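To give a flavor of the kind of low-level tensor core abstraction involved, the sketch below shows a single-tile f16 matmul with f32 accumulation written with the upstream MLIR GPU dialect's warp-level MMA operations (`gpu.subgroup_mma_load_matrix`, `gpu.subgroup_mma_compute`, `gpu.subgroup_mma_store_matrix`). This is an illustrative fragment only: it is not taken from the talk, and exact op syntax and attributes may differ across MLIR versions.

```mlir
// Illustrative sketch: one 16x16x16 warp-level MMA on tensor cores,
// f16 operands accumulating into f32, via MLIR GPU dialect ops.
gpu.module @kernels {
  gpu.func @single_tile_matmul(%a : memref<16x16xf16>,
                               %b : memref<16x16xf16>,
                               %c : memref<16x16xf32>) kernel {
    %c0 = arith.constant 0 : index
    // Load the A, B, and C tiles into per-warp MMA matrix registers.
    %A = gpu.subgroup_mma_load_matrix %a[%c0, %c0] {leadDimension = 16 : index}
           : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
    %B = gpu.subgroup_mma_load_matrix %b[%c0, %c0] {leadDimension = 16 : index}
           : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">
    %C = gpu.subgroup_mma_load_matrix %c[%c0, %c0] {leadDimension = 16 : index}
           : memref<16x16xf32> -> !gpu.mma_matrix<16x16xf32, "COp">
    // D = A * B + C, executed on the tensor cores.
    %D = gpu.subgroup_mma_compute %A, %B, %C
           : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">
           -> !gpu.mma_matrix<16x16xf32, "COp">
    // Write the accumulated tile back to memory.
    gpu.subgroup_mma_store_matrix %D, %c[%c0, %c0] {leadDimension = 16 : index}
           : !gpu.mma_matrix<16x16xf32, "COp">, memref<16x16xf32>
    gpu.return
  }
}
```

A full pipeline would tile a large matmul into such per-warp fragments and lower these ops to NVVM/PTX intrinsics; the fragment above only shows the innermost unit of work.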