Seminars


Data Science at Scale: Scaling Up by Scaling Down and Out (to Disk)

Series: Department Seminar

Speaker: Dr. Prashant Pandey, Lawrence Berkeley National Lab. & University of California, Berkeley, USA

Date/Time: Jan 11 08:00:00

Location: CSA Seminar Hall (Room No. 254, First Floor)

Faculty Advisor:

Abstract:
The standard solution to scaling applications to massive data is scale-out, i.e., use more computers or RAM. This talk presents my work on complementary techniques: scaling down, i.e., shrinking data to fit in RAM, and scaling to disk, i.e., organizing data on disk so that the application can still run fast. I will describe new compact and I/O-efficient data structures and their applications in stream processing, computational biology, and storage.
Concretely, I show how to bridge the gap between the worlds of external memory and stream processing to perform scalable and precise real-time event detection on massive streams. I show how to shrink genomic and transcriptomic indexes by a factor of two while accelerating queries by an order of magnitude compared to state-of-the-art tools. I show how to improve file-system random-write performance by an order of magnitude without sacrificing sequential read/write performance.
Teams Meeting Link:
https://teams.microsoft.com/l/meetup-join/19%3ameeting_YmExNmNmZjMtODM1Zi00MDUxLWFkNmEtNjdmYThkZWIxNjkx%40thread.v2/0?context=%7b%22Tid%22%3a%226f15cd97-f6a7-41e3-b2c5-ad4193976476%22%2c%22Oid%22%3a%224bcd3d56-e405-4b06-99fb-27742262f261%22%7d

Speaker Bio:
Dr. Prashant Pandey is a Postdoctoral Research Fellow at Lawrence Berkeley National Lab and the University of California, Berkeley, working with Prof. Kathy Yelick and Prof. Aydin Buluc. Prior to that, he spent one year as a postdoc at Carnegie Mellon University (CMU) working with Prof. Carl Kingsford. He obtained his Ph.D. in Computer Science from Stony Brook University in 2018, where he was co-advised by Prof. Michael Bender and Prof. Rob Johnson.
His research interests lie at the intersection of systems and algorithms. He designs and builds tools backed by theoretically well-founded data structures for large-scale data management problems across computational biology, stream processing, and storage. He is also the main contributor and maintainer of multiple open-source software tools that are used by hundreds of users across academia and industry.
During his Ph.D., he interned at Intel Labs and Google. While interning at Intel Labs, he worked on an encrypted FUSE file system using Intel SGX. At Google, he designed and implemented an extension to the ext4 file system for cryptographically ensuring file integrity. While at Google, he also worked on the core data structures of Spanner, Google’s geo-distributed database.

Host Faculty: R. Govindarajan