Data Science at Scale: Scaling Up by Scaling Down and Out (to Disk)
Series: Department Seminar
Speaker: Dr. Prashant Pandey, Lawrence Berkeley National Lab. & University of California, Berkeley, USA
Date/Time: Jan 11 08:00:00
Location: CSA Seminar Hall (Room No. 254, First Floor)
The standard solution to scaling applications to massive data is scale-out, i.e., use more computers or RAM. This talk presents my work on complementary techniques: scaling down, i.e., shrinking data to fit in RAM, and scaling to disk, i.e., organizing data on disk so that the application can still run fast. I will describe new compact and I/O-efficient data structures and their applications in stream processing, computational biology, and storage.
Concretely, I show how to bridge the gap between the worlds of external memory and stream processing to perform scalable and precise real-time event-detection on massive streams. I show how to shrink genomic and transcriptomic indexes by a factor of two while accelerating queries by an order of magnitude compared to the state-of-the-art tools. I show how to improve file-system random-write performance by an order of magnitude without sacrificing sequential read/write performance.
Dr. Prashant Pandey is a Postdoctoral Research Fellow at Lawrence Berkeley Lab and University of California Berkeley working with Prof. Kathy Yelick and Prof. Aydin Buluc. Prior to that, he spent one year as a postdoc at Carnegie Mellon University (CMU) working with Prof. Carl Kingsford. He obtained his Ph.D. in 2018 in Computer Science at Stony Brook University and was co-advised by Prof. Michael Bender and Prof. Rob Johnson.
His research interests lie at the intersection of systems and algorithms. He designs and builds tools backed by theoretically well-founded data structures for large-scale data management problems across computational biology, stream processing, and storage. He is also the main contributor and maintainer of multiple open-source software tools that are used by hundreds of users across academia and industry.
During his Ph.D. he interned at Intel Labs and Google. While interning at Intel Labs, he worked on an encrypted FUSE file system using Intel SGX. At Google, he designed and implemented an extension to the ext4 file system for cryptographically ensuring file integrity. While at Google, he also worked on the core data structures of Spanner, Google's geo-distributed big database.
Host Faculty: R. Govindarajan