BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//project/author//NONSGML v1.0//EN
CALSCALE:GREGORIAN
BEGIN:VEVENT
DTEND:20220912T120000Z
UID:c5e9fe6e01a5449b8c9ff3d0dbc80f93-327
DTSTAMP:19700101T120019Z
DESCRIPTION:Heterogeneous Systems Resilience: From Research to Industry Standards
URL;VALUE=URI:https://www.csa.iisc.ac.in/newweb/event/327/heterogeneous-systems-resilience-from-research-to-industry-standards/
SUMMARY:Reliability is a fundamental abstraction of computing. This abstraction is increasingly challenging to achieve at high node-level component densities and for large compute infrastructures. Industry standards have played a key role in enabling such scaling\, by facilitating greater heterogeneity and tighter integration of compute and memory\, and by paving the way for new node and system architectures. Therefore\, Reliability\, Availability\, and Serviceability (RAS) techniques that enhance resilience and intercept major industry standards are beneficial to the overall ecosystem using these standards.\n\nWe first explain why RAS is important for large-scale systems and outline some key RAS best practices for servers. We then present insights from studying reliability field data from production systems in data centers and an overview of tools and techniques developed to enhance resiliency and reliability. Finally\, we show how the research influenced the RAS architecture and capabilities of two recent industry standards\, and discuss their potential resilience benefits at scale.\n\nMS Teams link: https://tinyurl.com/yumys6wt
DTSTART:20220912T120000Z
END:VEVENT
END:VCALENDAR