Seminars

SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts

Series: Department Seminar

Speaker: Dr. Raghu Prabhakar, SambaNova Systems, USA

Date/Time: Aug 19 14:00:00

Location: CSA #104

Abstract:
Monolithic large language models (LLMs) like GPT-4 have paved the way for modern generative AI applications. Training, serving, and maintaining monolithic LLMs at scale, however, remains prohibitively expensive and challenging. The disproportionate increase in the compute-to-memory ratio of modern AI accelerators has created a memory wall, necessitating new methods to deploy AI. Agentic AI systems like Composition of Experts (CoE) offer an alternative modular approach that combines many small expert models to match or exceed the capabilities of monolithic LLMs, lowering the cost and complexity of training and serving. However, this approach presents two key challenges on conventional hardware: (1) without fused operations, smaller models have lower operational intensity, which makes high utilization harder to achieve; and (2) hosting and dynamically switching between a large number of models can be prohibitively expensive and slow.

In this talk, I describe SambaNova's approach to addressing these challenges with the SambaNova SN40L Reconfigurable Dataflow Unit (RDU). The SN40L RDU is a 2.5D CoWoS chiplet-based design containing two RDU dies on a silicon interposer. The chip introduces a new three-tier memory system with on-chip distributed SRAM, on-package HBM, and off-package DDR DRAM. Model parameters reside in DDR, and actively used experts are cached and served from high-bandwidth memory. Models are loaded from DDR to high-bandwidth memory at over 1 TB/s in a single SN40L node containing 8 RDUs. On-chip streaming dataflow enables an unprecedented level of operator fusion: entire decoder blocks can be automatically fused into a single kernel call. I will demonstrate speedups ranging from 2x to 13x on various benchmarks running on eight SN40L RDU sockets over an unfused baseline. For CoE inference deployments, the 8-socket RDU node reduces machine footprint by up to 19x, speeds up model switching time by 15x to 31x, and achieves an overall speedup of 3.7x over a DGX H100 and 6.6x over a DGX A100.
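The expert-caching idea in the abstract, where all expert weights live in large DDR and only the actively used experts occupy the faster HBM tier, can be illustrated with a toy LRU cache. This is a minimal sketch of the general caching pattern, not SambaNova's actual implementation; all names and capacities are made up for illustration:

```python
from collections import OrderedDict

class ExpertCache:
    """Toy model of DDR -> HBM expert caching: every expert's weights
    reside in DDR, and a fixed-capacity HBM tier holds the most
    recently used experts, evicting in LRU order on a miss."""

    def __init__(self, hbm_capacity: int):
        self.hbm_capacity = hbm_capacity   # number of experts HBM can hold
        self.hbm = OrderedDict()           # expert name -> weights (LRU order)
        self.ddr_loads = 0                 # count of slow DDR -> HBM transfers

    def get(self, expert: str) -> str:
        if expert in self.hbm:                 # HBM hit: fast path
            self.hbm.move_to_end(expert)
        else:                                  # HBM miss: stream from DDR
            self.ddr_loads += 1
            if len(self.hbm) >= self.hbm_capacity:
                self.hbm.popitem(last=False)   # evict least recently used
            self.hbm[expert] = f"weights[{expert}]"
        return self.hbm[expert]

cache = ExpertCache(hbm_capacity=2)
for e in ["law", "code", "law", "math", "law"]:
    cache.get(e)
print(cache.ddr_loads)  # 3: "law" and "code" miss; "math" misses and evicts "code"
```

The point of the scheme is that switching between models costs a DDR-to-HBM transfer only on a cache miss; a working set of hot experts that fits in HBM is served entirely at HBM bandwidth.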

Speaker Bio:
Raghu Prabhakar is a founding engineer and architect at SambaNova Systems, specializing in hardware architecture and compilers for dataflow architectures. At SambaNova, he has co-architected four generations of SambaNova's dataflow processor, the Reconfigurable Dataflow Unit (RDU). He has published several papers in this area in venues such as ASPLOS, ISCA, MICRO, HPCA, PLDI, ISSCC, and HotChips. His research was recognized by IEEE Micro as one of twelve Top Picks.

Host Faculty: R Govindarajan