Seminars

Heterogeneous Systems Resilience: From Research to Industry Standards

Series: Department Seminar

Speaker: Dr. Sudhanva Gurumurthi, AMD, Austin

Date/Time: Sep 12 19:00:00

Location: https://tinyurl.com/yumys6wt

Abstract:
Reliability is a fundamental abstraction of computing. This abstraction is increasingly challenging to achieve at high node-level component densities and for large compute infrastructures. Industry standards have played a key role in enabling such scaling by facilitating greater heterogeneity and tighter integration of compute and memory, and by paving the way for new node and system architectures. Therefore, Reliability, Availability, and Serviceability (RAS) techniques that enhance resilience and intercept major industry standards benefit the overall ecosystem that uses these standards.
We first explain why RAS is important for large-scale systems and outline some key RAS best practices for servers. We then present insights from studying reliability field data from production systems in data centers and an overview of tools and techniques developed to enhance resiliency and reliability. Finally, we show how the research influenced the RAS architecture and capabilities of two recent industry standards and their potential resilience benefits at scale.


Speaker Bio:
Sudhanva Gurumurthi is a Fellow at AMD, where he leads advanced development in RAS. Prior to joining industry, Sudhanva was an Associate Professor with tenure in the Computer Science Department at the University of Virginia. He is a recipient of an NSF CAREER Award, a Google Focused Research Award, an IEEE Computer Society Distinguished Contributor recognition, and several other awards and recognitions. Sudhanva regularly serves on the program and organizing committees of major computer architecture conferences and has served as an editor for IEEE Micro Top Picks, IEEE Transactions on Computers, and IEEE Computer Architecture Letters. He also serves on the Advisory Council of the College of Science and Engineering at Texas State University. Sudhanva received his Ph.D. in Computer Science and Engineering from Penn State in 2005.

Host Faculty: Arkaprava Basu