E0255 (2026) ASSIGNMENT 2b

Deadline: 17 Apr 2026, 5 pm.

This assignment is available at
https://www.csa.iisc.ac.in/~udayb/e0255/2026/asst-2b.txt
Instead of storing a copy of this file, please check this URL in case there
are updates or clarifications. Any updates will also be notified by email.

DESCRIPTION

Consider the following specification of tensor computations on 2-d tensors.
The inputs are in f32.

transpose_T = T.t()
T + transpose_T

Implement this specification in PyTorch, Triton, and CUDA, employing all the
optimizations for locality/parallelism taught in the course for the Triton
and CUDA implementations. Benchmark the performance of each implementation
individually, compare the performances, and analyze the strengths and
limitations of the three approaches. All execution and performance
comparison has to be done on an NVIDIA GPU on the server you will be
provided access to. While measuring performance, report the memory
bandwidth your implementation sustains for the output tensor (in addition
to the nsys kernel execution times).

The reference PyTorch implementation for CPUs and GPUs, with a test bench,
is already provided at
https://www.csa.iisc.ac.in/~udayb/e0255/2026/asst-2/transpose_add_torch.py
Use this as a starting point.

All tensors involved above are square, of size NxN, and you are expected to
benchmark varying N from 256 to 8192 in powers of two. The input data types
are f32. All performance benchmarking is to be done by measuring GPU kernel
execution times alone using nsys (which avoids all host-side and transfer
overheads). Use the CSA server, which has an NVIDIA GeForce RTX 4090, for
the assignment. You will be provided access by the CSA admin. For PyTorch,
benchmark both eager execution and torch.compile with Inductor.
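For reference, the specification above amounts to the following minimal eager PyTorch sketch (names hypothetical; the provided transpose_add_torch.py is the authoritative starting point). It also notes the bandwidth figure the assignment asks for:

```python
import torch

def transpose_add(T: torch.Tensor) -> torch.Tensor:
    # Specification: transpose_T = T.t(); result = T + transpose_T.
    # .t() returns a view, so eager PyTorch materializes only the output.
    return T + T.t()

# Small CPU example; on the server the tensor would be created with
# device="cuda" and the kernel timed with nsys. The sustained bandwidth
# for the output tensor is then (N * N * 4 bytes) / kernel_time.
N = 4
T = torch.arange(N * N, dtype=torch.float32).reshape(N, N)
out = transpose_add(T)
# The result is symmetric by construction: out[i, j] == T[i, j] + T[j, i].
assert torch.equal(out, out.t())
```

For the Inductor variant, the same function can be wrapped with torch.compile(transpose_add) and benchmarked identically.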
Triton implementation: starting with the optimized matmul example from the
official Triton documentation as the reference, you are expected to write a
single (fused) Triton kernel and employ the autotuning available with
Triton to improve performance.

For PyTorch, the recommended version is torch 2.11.0. Use the latest stable
version of Triton, which is 3.6.0:
https://github.com/triton-lang/triton

The optimized CUDA implementation can employ any abstractions available
with CUDA (other than pre-written libraries). Compile with nvcc -O3.

There should thus be five executions/implementations to compare:

1) PyTorch (eager) CPU
2) PyTorch (eager) GPU
3) PyTorch Inductor GPU
4) Triton GPU
5) CUDA GPU

HOW TO TEST

Modify the reference so that all implementations read tensors from a file
and dump results as described below.

$ python transpose_add_torch.py -N 512 < input.npz > output.npy

input.npz holds multiple numpy arrays in a single file (a zip archive). It
can be loaded with np.load, and its arrays are accessed with the keywords
A, B, C. output.npy is the output numpy file.

For the CUDA implementation,

$ ./transpose-add -N 512 < input.dat > output.dat

where the `.dat` file is a text dump of the numpy tensors concatenated (for
the input) using the format:

# First matrix
...
# Second matrix
...

Support the -N option to specify the problem size, and ignore the comment
lines starting with #.

A validate.py script is provided to compare two output files with a
specified tolerance. For the CUDA program's output, convert from the text
format to the numpy binary format with a simple script of your own.
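The text-to-binary conversion requested above can be sketched as follows (a minimal example with hypothetical file and function names; np.loadtxt already skips the `#` comment lines by default):

```python
import sys
import numpy as np

def dat_to_npy(dat_path: str, npy_path: str, n: int) -> None:
    # np.loadtxt ignores lines starting with '#' by default, so the
    # "# First matrix" / "# Second matrix" headers are skipped.
    values = np.loadtxt(dat_path, dtype=np.float32).reshape(-1, n)
    # A CUDA output .dat holds a single NxN matrix; an input .dat would
    # yield 2N rows that can be split into the two matrices if needed.
    np.save(npy_path, values)

if __name__ == "__main__":
    # Hypothetical usage: python dat2npy.py output.dat output.npy 512
    dat_to_npy(sys.argv[1], sys.argv[2], int(sys.argv[3]))
```

The resulting .npy file can then be compared against the PyTorch/Triton outputs with the provided validate.py.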
https://www.csa.iisc.ac.in/~udayb/e0255/2026/asst-2/validate.py

Example:

$ validate.py output1.dat output2.dat -atol 0.8

WHAT TO SUBMIT

Please submit your assignment as a single zip or tar-bzipped file with four
files in it:

1) transpose_add_torch.py
2) transpose_add_triton.py
3) transpose_add.cu
4) README.txt

The README.txt should include the performance evaluation and comparison,
with analysis and insights into the relative strengths and limitations of
the approaches. Automatic formatting can be performed using pyflakes or
black on the Python files submitted.

The zipped file should be emailed with the subject "E0255 Asst-2b
submission" to udayb@iisc.ac.in, with a CC to the TA, by the deadline.
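The kind of tolerance check validate.py performs can be illustrated with np.allclose (a sketch only, on numpy-format files with hypothetical names; use the provided validate.py for the actual comparison):

```python
import numpy as np

def outputs_match(path1: str, path2: str, atol: float = 0.8) -> bool:
    # Floating-point results from the different implementations (PyTorch,
    # Triton, CUDA) can differ slightly due to rounding, hence an absolute
    # tolerance rather than exact equality.
    a = np.load(path1)
    b = np.load(path2)
    return a.shape == b.shape and bool(np.allclose(a, b, atol=atol))
```

A looser atol hides genuine bugs and a tighter one flags harmless rounding differences, so pick the tolerance to match the expected floating-point drift.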