IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services

Series: Department Seminar

Speaker: Yasaswi Kishore and Shyam Sankaran

Date/Time: Feb 10 11:30:00

Location: CSA Seminar Hall (Room No. 254, First Floor)

Faculty Advisor:

Distributed systems, whether large or small, have to handle two kinds of failures. The first is the "fail-stop" failure, where a component, system, or process stops operating completely. These failures are easy to detect and have fairly standard ways of being handled. The second is the "fail-slow" failure, in which some part of the system experiences degraded performance but is not completely non-functional. Such failures are much harder to detect and can cause a series of cascading failures. The talk focuses on the detection, mitigation and resolution of such fail-slow failures through a framework conceived and built at Nutanix, called IASO. IASO is a peer-based, non-intrusive fail-slow detection framework that Nutanix has deployed at customer sites and that has helped mitigate a large number of incidents before they cascaded into complete cluster outages. The talk covers the design of IASO, highlighting how the various choices made are essential in a real-world setting. This talk is based on a paper presented at USENIX ATC 2019.

Speaker Bio:
Yasaswi Kishore and Shyam Sankaran have been working as software developers at Nutanix for more than 5 years. They are part of the metadata storage team. They have worked on building and optimising metadata subsystems for distributed block/file and object storage products, and on developing a cloud-based distributed database built using RocksDB at Nutanix. Shyam holds a master's degree from IISc, Bangalore, and Yasaswi holds a bachelor's degree from PESIT, Bangalore.

Host Faculty: K V Raghavan