Seminars

Improving Reliability and Performance of Datacenter Systems via Coherence

Series: Department Seminar

Speaker: Adarsh Patil, Univ. of Edinburgh

Date/Time: Feb 11 11:00:00

Location: Online Meeting

Abstract:
Reliability and performance are key metrics for modern datacenter machines.
Co-designing for these introduces delicate trade-off decisions for system
architects. In this talk I present two works in which we aim to improve both
reliability and performance of modern shared memory hardware in the datacenter
by designing tailored coherence protocols.

In the first work, we aim to combat increased memory system failure rates. We
propose Dvé, a hardware-driven replication mechanism where data blocks are
replicated in two different sockets across a cache-coherent NUMA system. Such an
organization has the advantage of offering two independent points of access
to data, which enables: (a) strong error correction that can recover from a range
of faults affecting any of the components in the memory, up to and including the
memory controller, and (b) higher performance by providing another nearer point
of memory access. Dvé realizes both of these benefits via Coherent Replication,
a technique that builds on top of existing cache coherence protocols.
Coherent Replication keeps the replicas in sync for reliability and provides coherent
access to the replicas during fault-free operation for performance.
Dvé introduces a unique design point that offers higher reliability and
performance flexibly on demand.

In the second work, we propose to improve reliability and performance of
function-as-a-service (FaaS) deployments. The FaaS model allows applications to
be decomposed into a workflow of stand-alone functions which are instantiated
and executed on demand in the cloud. The stateless nature of this model forces
functions to store data in, and retrieve it from, a remote object store, thereby adding
latency. Our work, Bolt, uses all-hardware memory disaggregation to
build an object store for FaaS applications. Bolt builds on top of the latest
cache-coherent attachment technologies for off-chip memory peripherals, such as Gen-Z,
CXL, or NVLink2, to enable an all-hardware solution. It adds an object-granularity
caching mechanism to cache objects in hardware caches at compute nodes, hence
improving performance of FaaS functions. Bolt then adds an inter-node cache
coherence mechanism that ensures the data in the compute node caches is consistent.
Bolt's coherence ensures reliable operation in such a loosely coupled system by
providing an asynchronous, non-blocking protocol which ensures forward progress
during partial system failures.

Teams Meeting Link:

Speaker Bio:
Adarsh Patil is a third-year PhD student at the University of Edinburgh. His research focuses broadly on memory system design. Notably, his work has targeted optimizing DRAM for heterogeneous architectures, TLB organization for virtualization, and specifying coherence protocols and memory consistency for reliability. He received his master's degree from CSA, IISc.

Host Faculty: R. Govindarajan