Seminars
Efficient Large Scale Model Training (part 2)
Series: Department Seminar
Speaker: Dr. Jayashree Mohan, Senior Researcher, Microsoft Research India
Date/Time: Apr 11 14:00:00
Location: CSA Lecture Hall (Room No. 112, Ground Floor)
Abstract:
This is a two-part lecture series on systems and techniques that enable efficient and reliable model training at scale. Training a deep neural network at scale requires holistic use of all datacenter resources, including CPU, memory, storage, network, and accelerators (GPUs). Large model training requires us not only to distribute the training process across several GPUs and possibly nodes, but also to do so efficiently while avoiding data and communication stalls. The first part of the lecture will focus on techniques that enable multi-GPU and multi-node training of large models: data, model, pipeline, and tensor parallelism, and the scenarios where each of these techniques is useful. We will also touch upon the communication primitives involved in training, and how these distributed training techniques alleviate communication stalls.
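For concreteness, below is a minimal sketch of the simplest of these techniques, data parallelism, written with PyTorch's DistributedDataParallel. The model, dataset, and launch configuration are illustrative placeholders chosen for this announcement, not material from the lecture itself.

# Minimal data-parallel training sketch (PyTorch DistributedDataParallel).
# Illustrative only: the model, dataset, and hyperparameters are placeholders.
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend="nccl")          # one process per GPU
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(1024, 10).cuda(rank)     # placeholder model
    model = DDP(model, device_ids=[rank])            # gradients all-reduced across ranks
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    data = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(data)               # each rank sees a distinct shard
    loader = DataLoader(data, batch_size=32, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.cuda(rank), y.cuda(rank)
            loss = torch.nn.functional.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()                          # NCCL all-reduce overlaps with backward
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()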
In the second part, we will discuss how to automatically parallelize DNN training by interleaving a subset of the techniques we have discussed, based on hardware and model characteristics. Finally, we will touch upon another important aspect of training efficiency: providing reliability via model checkpointing. Failures in hardware and software are inevitable during large model training, necessitating low-cost checkpointing and training recovery. We will talk about why and how such reliable training can be achieved.
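As a rough illustration of the checkpointing idea, the sketch below periodically saves model and optimizer state and resumes from the last checkpoint after a restart. It is plain PyTorch with placeholder names and a hypothetical checkpoint path, not the CheckFreq or Check-N-Run implementation.

# Minimal periodic-checkpointing sketch (plain PyTorch, not CheckFreq itself).
# The model, optimizer, and checkpoint path are placeholders for illustration.
import os
import torch

CKPT = "checkpoint.pt"                                # hypothetical path

model = torch.nn.Linear(1024, 10)                     # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.01)
start_step = 0

# Recovery: if a checkpoint exists, resume from the last saved step.
if os.path.exists(CKPT):
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    opt.load_state_dict(state["opt"])
    start_step = state["step"] + 1

for step in range(start_step, 1000):
    x = torch.randn(32, 1024)
    y = torch.randint(0, 10, (32,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

    # Checkpoint every 100 steps; write to a temp file and rename so a crash
    # mid-write cannot corrupt the last good checkpoint.
    if step % 100 == 0:
        torch.save({"model": model.state_dict(),
                    "opt": opt.state_dict(),
                    "step": step}, CKPT + ".tmp")
        os.replace(CKPT + ".tmp", CKPT)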
Relevant reading: DistBelief [NeurIPS '12], PipeDream [SOSP '19], GPipe [NeurIPS '19], Alpa [OSDI '22], CheckFreq [FAST '21], Check-N-Run [NSDI '22]
Speaker Bio:
Jayashree Mohan is a Senior Researcher in the Systems group at Microsoft
Research Lab, India. Her research interests are broadly in computer systems,
filesystems, storage, and systems infrastructure for machine learning
applications. She graduated with a Ph.D. in Computer Science from the University
of Texas at Austin in 2021, where she was advised by Prof. Vijay Chidambaram.
Her dissertation focuses on accelerating machine learning training from a
storage perspective, by analyzing and mitigating data stalls.
Host Faculty: Arkaprava Basu