E0255 (2026) ASSIGNMENT 2b

Deadline: 17 Apr 2026, 5 pm.

This assignment is available at
https://www.csa.iisc.ac.in/~udayb/e0255/2026/asst-2b.txt
Instead of storing a copy of this file, please check this URL in case there
are updates or clarifications. Any updates will also be notified by email.

DESCRIPTION

Consider the following specification of tensor computations on 2-d tensors.
The inputs are in f32.

transpose_T = T.t()
T + transpose_T

Implement this specification in PyTorch, Triton, and CUDA, employing all the
optimizations for locality/parallelism taught in the course for the Triton
and CUDA implementations. Benchmark the performance of each implementation
individually, compare the performances, and analyze the strengths and
limitations of the three approaches. All execution and performance
comparison has to be done on an NVIDIA GPU on the server you will be
provided access to. While measuring performance, report the memory
bandwidth your implementation sustains for the output tensor (in addition
to the nsys kernel execution times).

The reference PyTorch implementation for CPUs and GPUs, with a test bench,
is already provided at
https://www.csa.iisc.ac.in/~udayb/e0255/2026/asst-2/transpose_add_torch.py
Use this as a starting point.

All tensors involved above are square, of size NxN, and you are expected to
benchmark varying N from 256 to 8192 in powers of two. The input data types
are f32. All performance benchmarking is to be done by measuring GPU kernel
execution times alone using nsys (which avoids all host-side and transfer
overheads). Use the CSA server, which has an NVIDIA GeForce RTX 4090, for
the assignment. You will be provided access by the CSA admin. For PyTorch,
benchmark both eager execution and torch.compile with Inductor.
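For reference, the specification above amounts to the following minimal eager PyTorch sketch (names hypothetical; the provided transpose_add_torch.py is the authoritative starting point). It also notes the bandwidth figure the assignment asks for:

```python
import torch

def transpose_add(T: torch.Tensor) -> torch.Tensor:
    # Specification: transpose_T = T.t(); result = T + transpose_T.
    # .t() returns a view, so eager PyTorch materializes only the output.
    return T + T.t()

# Small CPU example; on the server the tensor would be created with
# device="cuda" and the kernel timed with nsys. The sustained bandwidth
# for the output tensor is then (N * N * 4 bytes) / kernel_time.
N = 4
T = torch.arange(N * N, dtype=torch.float32).reshape(N, N)
out = transpose_add(T)
# The result is symmetric by construction: out[i, j] == T[i, j] + T[j, i].
assert torch.equal(out, out.t())
```

For the Inductor variant, the same function can be wrapped with torch.compile(transpose_add) and benchmarked identically.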
Triton implementation: starting with the optimized matmul example from the
official Triton documentation as the reference, you are expected to write a
single (fused) Triton kernel and employ the autotuning available with
Triton to improve performance.

For PyTorch, the recommended version is torch 2.11.0. Use the latest stable
version of Triton, which is 3.6.0:
https://github.com/triton-lang/triton

The optimized CUDA implementation can employ any abstractions available
with CUDA (other than pre-written libraries). Compile with nvcc -O3.

There should thus be five executions/implementations to compare:

1) PyTorch (eager) CPU
2) PyTorch (eager) GPU
3) PyTorch Inductor GPU
4) Triton GPU
5) CUDA GPU

HOW TO TEST

Modify the reference so that all implementations read tensors from a file
and dump results as described below.

$ python transpose_add_torch.py -N 512 < input.npz > output.npy

input.npz holds multiple numpy arrays in a single file (a zip archive). It
can be loaded with np.load, and its arrays are accessed with the keywords
A, B, C. output.npy is the output numpy file.

For the CUDA implementation,

$ ./transpose-add -N 512 < input.dat > output.dat

where the `.dat` file is a text dump of the numpy tensors concatenated (for
the input) using the format:

# First matrix
...
# Second matrix
...

Support the -N option to specify the problem size, and ignore the comment
lines starting with #.

A validate.py script is provided to compare two output files with a
specified tolerance. For the CUDA program's output, convert from the text
format to the numpy binary format with a simple script of your own.
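The text-to-binary conversion requested above can be sketched as follows (a minimal example with hypothetical file and function names; np.loadtxt already skips the `#` comment lines by default):

```python
import sys
import numpy as np

def dat_to_npy(dat_path: str, npy_path: str, n: int) -> None:
    # np.loadtxt ignores lines starting with '#' by default, so the
    # "# First matrix" / "# Second matrix" headers are skipped.
    values = np.loadtxt(dat_path, dtype=np.float32).reshape(-1, n)
    # A CUDA output .dat holds a single NxN matrix; an input .dat would
    # yield 2N rows that can be split into the two matrices if needed.
    np.save(npy_path, values)

if __name__ == "__main__":
    # Hypothetical usage: python dat2npy.py output.dat output.npy 512
    dat_to_npy(sys.argv[1], sys.argv[2], int(sys.argv[3]))
```

The resulting .npy file can then be compared against the PyTorch/Triton outputs with the provided validate.py.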
https://www.csa.iisc.ac.in/~udayb/e0255/2026/asst-2/validate.py

Example:

$ validate.py output1.dat output2.dat -atol 0.8

WHAT TO SUBMIT

Please submit your assignment as a single zip or tar-bzipped file with four
files in it:

1) transpose_add_torch.py
2) transpose_add_triton.py
3) transpose_add.cu
4) README.txt

The README.txt should include the performance evaluation and comparison,
with analysis and insights into the relative strengths and limitations of
the approaches. Automatic formatting can be performed using pyflakes or
black on the Python files submitted.

The zipped file should be emailed with the subject "E0255 Asst-2b
submission" to udayb@iisc.ac.in, with a CC to the TA, by the deadline.
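The kind of tolerance check validate.py performs can be illustrated with np.allclose (a sketch only, on numpy-format files with hypothetical names; use the provided validate.py for the actual comparison):

```python
import numpy as np

def outputs_match(path1: str, path2: str, atol: float = 0.8) -> bool:
    # Floating-point results from the different implementations (PyTorch,
    # Triton, CUDA) can differ slightly due to rounding, hence an absolute
    # tolerance rather than exact equality.
    a = np.load(path1)
    b = np.load(path2)
    return a.shape == b.shape and bool(np.allclose(a, b, atol=atol))
```

A looser atol hides genuine bugs and a tighter one flags harmless rounding differences, so pick the tolerance to match the expected floating-point drift.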