BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//project/author//NONSGML v1.0//EN
CALSCALE:GREGORIAN
BEGIN:VEVENT
DTEND:20230210T120000Z
UID:43528966f5417c7f4bef358437e85ca7-401
DTSTAMP:19700101T120011Z
DESCRIPTION:IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services
URL;VALUE=URI:https://www.csa.iisc.ac.in/newweb/event/401/iaso-a-fail-slow-detection-and-mitigation-framework-for-distributed-storage-services/
SUMMARY:Distributed systems, whether large or small, have to handle two kinds of failures. The first is a "fail-stop" failure, in which a component, system, or process stops operating completely. These are easy to detect and have fairly standard ways of being handled. The second is a "fail-slow" failure, in which some part of the system experiences degraded performance but remains partially functional. Such failures are much harder to detect and can cause a series of cascading failures. The talk focuses on the detection, mitigation, and resolution of fail-slow failures through IASO, a framework conceived and built at Nutanix. IASO is a peer-based, non-intrusive fail-slow detection framework that Nutanix has deployed at customer sites, where it has helped mitigate a large number of incidents before they cascaded into complete cluster outages. The talk covers the design of IASO, highlighting how the various design choices are essential in a real-world setting. It is based on a paper presented at USENIX ATC 2019.
DTSTART:20230210T120000Z
END:VEVENT
END:VCALENDAR