E0358 (2025) ASSIGNMENT 2
Deadline: 21 Nov 2025, 5 pm.

This assignment is available at
https://www.csa.iisc.ac.in/~udayb/e0358/2025/asst-2.txt
Instead of storing a copy of this file, please check this URL in case there
are updates or clarifications. Any updates will also be notified by email.

DESCRIPTION

Consider the following specification of tensor computations on 2-d tensors.
The inputs are in f16.

X = matmul(A, B)
Y = matmul(A, C)
Z = X + Y
W = Z * Z

Implement this specification in PyTorch, Triton, and CUDA (with CUTLASS
abstractions/optimizations), benchmark the performance of each, compare the
performances, and analyze the strengths and limitations of the three
approaches. All execution and performance comparison must be done on GPUs.

All tensors involved above are square, of size NxN, and you are expected to
benchmark varying N from 64 to 4096 in powers of two. The matmul inputs are
in f16 to enable the use of tensor cores. All performance benchmarking is to
be done by measuring GPU kernel execution times alone using nsys. Use the
CSA mcastle server with a GeForce Turing GPU for the assignment. You will be
provided access by the CSA admin.

A PyTorch specification for CPUs is provided for reference. It can be
modified for GPU execution and to conform to the input/output specification
described in the testing section below. For PyTorch, benchmark both eager
execution and torch.compile with Inductor.
https://www.csa.iisc.ac.in/~udayb/e0358/2025/dual_gemm_pointwise_torch_cpu.py

Triton implementation: starting from the optimized matmul example in the
official Triton documentation as the reference, you are expected to write a
single (fused) Triton kernel and employ the autotuning available with Triton
to improve performance. Use the latest stable version of Triton, which is
3.5.0.
https://github.com/triton-lang/triton

The CUDA implementation should employ CUTLASS for better matmul performance.
Compile with nvcc -O3.
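As a reference for the semantics of the specification above, here is a
minimal NumPy sketch (my own, not one of the required implementations; the
function name is hypothetical). It accumulates in f32, mirroring what an
f16 tensor-core matmul with f32 accumulation produces, so agreement with the
GPU versions will only be approximate:

```python
import numpy as np

def dual_gemm_pointwise(A, B, C):
    # X = matmul(A, B); Y = matmul(A, C); Z = X + Y; W = Z * Z
    # Inputs are f16; accumulate in f32 for accuracy, as tensor cores do.
    Af = A.astype(np.float32)
    X = Af @ B.astype(np.float32)
    Y = Af @ C.astype(np.float32)
    Z = X + Y
    W = Z * Z
    return W
```

One observation that may help when writing the single fused Triton kernel:
since both GEMMs share the operand A, Z = A @ (B + C) up to floating-point
rounding, so the whole computation can be expressed with one GEMM followed
by an elementwise square.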
We will thus have four implementations to compare, all on GPU:

1) PyTorch (eager)
2) PyTorch (Inductor)
3) Triton
4) CUDA + CUTLASS

HOW TO TEST

All implementations should read tensors from a file and dump results as
described below, e.g., for the PyTorch script:

$ python dual_gemm_pointwise_torch.py -N 512 < input.npz > output.npy

input.npz holds multiple numpy arrays in a single file (a zip archive). It
can be loaded with np.load and accessed with the keys A, B, and C.
output.npy is the output in the numpy binary format.

For the CUDA implementation,

$ ./dual_gemm_pointwise -N 512 < input.dat > output.dat

where the `.dat` file is a text dump of the numpy tensors concatenated (for
the input) using the format

# First matrix
...
# Second matrix
...

Support the -N option to specify the problem size, and ignore the comment
lines starting with #. A validate.py script is provided to compare two
output files with a specified tolerance. For the CUDA program's output,
write a simple script of your own to convert from the text format to the
numpy binary format.
https://www.csa.iisc.ac.in/~udayb/e0358/2025/validate.py

Example:

$ validate.py output1.dat output2.dat -atol 0.8

WHAT TO SUBMIT

Please submit your assignment as a single zip or tar-bzipped file with four
files in it:

1) dual_gemm_pointwise_torch.py
2) dual_gemm_pointwise_triton.py
3) dual_gemm_pointwise.cu
4) README.txt

The README.txt should include the performance evaluation and comparison,
with analysis and insights into the relative strengths and limitations of
the approaches. The Python files submitted can be checked and formatted
automatically with pyflakes or black. The zipped file should be emailed to
udayb@iisc.ac.in by the deadline mentioned at the top.
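One practical wrinkle with the I/O convention above: an .npz file is a zip
archive, and np.load needs a seekable file object, so stdin must be buffered
into memory first. A minimal sketch of the Python-side I/O (helper names
are my own, not prescribed by the assignment):

```python
import io
import numpy as np

def load_inputs(stream):
    # .npz is a zip archive; np.load requires a seekable file object,
    # so read the raw stream (e.g. sys.stdin.buffer) fully into a BytesIO.
    with np.load(io.BytesIO(stream.read())) as data:
        return data["A"], data["B"], data["C"]

def save_output(stream, W):
    # Dump the result in the .npy binary format (e.g. to sys.stdout.buffer).
    np.save(stream, W)
```

A script would then call load_inputs(sys.stdin.buffer), compute W with its
GPU implementation, and finish with save_output(sys.stdout.buffer, W) so
that the output redirect produces a valid output.npy.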