E0358 (2025) ASSIGNMENT 2
Deadline: 21 Nov 2025, 5 pm.

This assignment is available at
https://www.csa.iisc.ac.in/~udayb/e0358/2025/asst-2.txt
Instead of storing a copy of this file, please check this URL in case there
are updates or clarifications. Any updates will also be notified by email.

DESCRIPTION

Consider the following specification of tensor computations on 2-d tensors.
The inputs are in f16.

X = matmul(A, B)
Y = matmul(A, C)
Z = X + Y
W = Z * Z

Implement this specification in PyTorch, Triton, and CUDA (with CUTLASS
abstractions/optimizations), benchmark the performance of each, compare the
performances, and analyze the strengths and limitations of the three
approaches. All execution and performance comparison must be done on GPUs.

All tensors involved above are square, of size NxN, and you are expected to
benchmark varying N from 64 to 4096 in powers of two. The matmul inputs are
in f16 to enable the use of tensor cores. All performance benchmarking is to
be done by measuring GPU kernel execution times alone using nsys. Use the
CSA mcastle server with a GeForce Turing GPU for the assignment. You will be
provided access by the CSA admin.

A PyTorch specification for CPUs is provided for reference. It can be
modified for GPU execution and to conform to the input/output specification
described in the testing section below. For PyTorch, benchmark both eager
execution and torch.compile with Inductor.
https://www.csa.iisc.ac.in/~udayb/e0358/2025/dual_gemm_pointwise_torch_cpu.py

Triton implementation: starting from the optimized matmul example in the
official Triton documentation as the reference, you are expected to write a
single (fused) Triton kernel and employ the autotuning available with Triton
to improve performance. Use the latest stable version of Triton, which is
3.5.0.
https://github.com/triton-lang/triton

The CUDA implementation should employ CUTLASS for better matmul performance.
Compile with nvcc -O3.
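As a reference for the semantics of the specification above, here is a
minimal NumPy sketch (my own, not one of the required implementations; the
function name is hypothetical). It accumulates in f32, mirroring what an
f16 tensor-core matmul with f32 accumulation produces, so agreement with the
GPU versions will only be approximate:

```python
import numpy as np

def dual_gemm_pointwise(A, B, C):
    # X = matmul(A, B); Y = matmul(A, C); Z = X + Y; W = Z * Z
    # Inputs are f16; accumulate in f32 for accuracy, as tensor cores do.
    Af = A.astype(np.float32)
    X = Af @ B.astype(np.float32)
    Y = Af @ C.astype(np.float32)
    Z = X + Y
    W = Z * Z
    return W
```

One observation that may help when writing the single fused Triton kernel:
since both GEMMs share the operand A, Z = A @ (B + C) up to floating-point
rounding, so the whole computation can be expressed with one GEMM followed
by an elementwise square.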
We will thus have four implementations to compare, all on GPU:

1) PyTorch (eager)
2) PyTorch (Inductor)
3) Triton
4) CUDA + CUTLASS

HOW TO TEST

All implementations should read tensors from a file and dump results as
described below, e.g., for the PyTorch script:

$ python dual_gemm_pointwise_torch.py -N 512 < input.npz > output.npy

input.npz holds multiple numpy arrays in a single file (a zip archive). It
can be loaded with np.load and accessed with the keys A, B, and C.
output.npy is the output in the numpy binary format.

For the CUDA implementation,

$ ./dual_gemm_pointwise -N 512 < input.dat > output.dat

where the `.dat` file is a text dump of the numpy tensors concatenated (for
the input) using the format

# First matrix
...
# Second matrix
...

Support the -N option to specify the problem size, and ignore the comment
lines starting with #. A validate.py script is provided to compare two
output files with a specified tolerance. For the CUDA program's output,
write a simple script of your own to convert from the text format to the
numpy binary format.
https://www.csa.iisc.ac.in/~udayb/e0358/2025/validate.py

Example:

$ validate.py output1.dat output2.dat -atol 0.8

WHAT TO SUBMIT

Please submit your assignment as a single zip or tar-bzipped file with four
files in it:

1) dual_gemm_pointwise_torch.py
2) dual_gemm_pointwise_triton.py
3) dual_gemm_pointwise.cu
4) README.txt

The README.txt should include the performance evaluation and comparison,
with analysis and insights into the relative strengths and limitations of
the approaches. The Python files submitted can be checked and formatted
automatically with pyflakes or black. The zipped file should be emailed to
udayb@iisc.ac.in by the deadline mentioned at the top.
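One practical wrinkle with the I/O convention above: an .npz file is a zip
archive, and np.load needs a seekable file object, so stdin must be buffered
into memory first. A minimal sketch of the Python-side I/O (helper names
are my own, not prescribed by the assignment):

```python
import io
import numpy as np

def load_inputs(stream):
    # .npz is a zip archive; np.load requires a seekable file object,
    # so read the raw stream (e.g. sys.stdin.buffer) fully into a BytesIO.
    with np.load(io.BytesIO(stream.read())) as data:
        return data["A"], data["B"], data["C"]

def save_output(stream, W):
    # Dump the result in the .npy binary format (e.g. to sys.stdout.buffer).
    np.save(stream, W)
```

A script would then call load_inputs(sys.stdin.buffer), compute W with its
GPU implementation, and finish with save_output(sys.stdout.buffer, W) so
that the output redirect produces a valid output.npy.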