
Operating System and Hypervisor Support for Mitigating the Address Translation Wall

Series: Ph.D. Colloquium

Speaker: Ashish Panwar

Date/Time: Jan 03, 17:00

Location: CSA Seminar Hall (Room No. 254, First Floor)

Faculty Advisors: K. Gopinath and Arkaprava Basu

Abstract:
Virtual memory has proven to be an extremely powerful abstraction because of its programmability benefits. Unfortunately, virtual memory is becoming a performance bottleneck due to the address translation wall. Modern applications with large memory footprints necessitate frequent page table walks to translate virtual addresses to physical addresses; consequently, the hardware spends 30-50% of total CPU cycles servicing TLB misses alone. Virtualization and non-uniform memory access (NUMA) architectures further exacerbate this overhead. For example, virtualized systems involve two-dimensional page table walks that require up to 24 memory accesses for each TLB miss with current 4-level page tables. Address translation performance drops further on NUMA systems, depending on the distance between the CPU and the page tables. These overheads will only grow as deeper page tables and multi-tiered memory systems enable even larger applications. Virtual memory, therefore, is showing its age in the era of data-centric computing.
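To make the 24-access figure concrete, the following back-of-the-envelope sketch (an illustration added here, not part of the talk) counts the worst-case memory accesses of a native versus a two-dimensional page table walk, assuming 4-level guest and host page tables:

    /* Worst-case memory accesses per TLB miss.
     * Native walk: one access per page table level. Nested (two-dimensional)
     * walk: each of the g guest page-table entries is reached via a
     * guest-physical address that must first be translated by an h-level host
     * walk, and the final guest-physical data address needs one more host
     * walk: g*(h+1) + h = g*h + g + h accesses. For g = h = 4, that is 24. */
    #include <stdio.h>

    int main(void) {
        int g = 4, h = 4;   /* 4-level guest and host page tables */
        printf("native walk: %d accesses\n", g);
        printf("nested walk: %d accesses\n", g * h + g + h);
        return 0;
    }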
This thesis investigates the role of the operating system (OS) and the hypervisor in improving address translation performance. First, we focus on huge pages, which can significantly reduce the frequency and cost of TLB misses. Huge pages are widely available in modern systems; for example, the x86 architecture supports 2MB and 1GB huge pages in addition to regular 4KB pages. While huge pages are great in theory, real-world OSes have often delivered disappointing performance when using them, because memory management of huge pages is fraught with challenges. We propose several enhancements to OS-level policies and mechanisms that make huge pages beneficial even under multi-dimensional constraints such as latency, capacity, and fairness.
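For context (this is standard Linux behavior rather than a contribution of the thesis), an application can already hint that a region should be backed by 2MB huge pages via madvise(2) with MADV_HUGEPAGE; whether the kernel honors the hint depends on the transparent huge page settings and on available memory contiguity. A minimal sketch:

    /* Minimal sketch: map 1 GiB anonymously and ask Linux's transparent huge
     * page (THP) machinery to back it with 2MB pages where possible. The hint
     * is best-effort; fragmentation can force the kernel to fall back to 4KB
     * pages, which is exactly the behavior this thesis examines. */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 1UL << 30;                     /* 1 GiB */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }
        if (madvise(buf, len, MADV_HUGEPAGE) != 0)  /* hint, not a guarantee */
            perror("madvise(MADV_HUGEPAGE)");
        for (size_t i = 0; i < len; i += 4096)      /* touch pages to fault them in */
            buf[i] = 1;
        munmap(buf, len);
        return 0;
    }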
Second, we investigate the effect of NUMA on address translation performance. NUMA architectures mandate careful data placement to hide the effect of variable memory access latency from applications. Several decades of research on NUMA systems have optimized access to user-level application data. However, prior research has ignored the access performance of kernel data, including page tables, due to their small memory footprint. We argue that it is time to revisit page table management for NUMA-like systems.
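As an illustration of this asymmetry (a sketch assuming libnuma is installed; compile with -lnuma): an application can pin its own data to a chosen NUMA node, but it has no comparable control over where the kernel places the page tables that map that data.

    /* Sketch: place application data on a specific NUMA node with libnuma.
     * The page tables mapping 'buf' are still allocated wherever the kernel
     * happens to fault them in -- there is no user-level knob for that, which
     * is the gap this thesis addresses inside the OS and hypervisor. */
    #include <stdio.h>
    #include <numa.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not supported on this system\n");
            return 1;
        }
        size_t len = 1UL << 30;                  /* 1 GiB of application data */
        void *buf = numa_alloc_onnode(len, 0);   /* bind the data to node 0 */
        if (buf == NULL) {
            fprintf(stderr, "numa_alloc_onnode failed\n");
            return 1;
        }
        numa_free(buf, len);
        return 0;
    }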

The core contributions of this thesis include four systems: Illuminator, HawkEye, Trident, and vMitosis, as summarized below:
Illuminator: We first expose some subtle implications of external memory fragmentation for huge pages. We show that, despite proactive measures employed in the memory management subsystem of Linux, unmovable kernel objects (e.g., inodes and page tables) can deny huge pages to user applications. In a long-running system, unmovable objects fragment physical memory, often permanently, and cause high defragmentation overheads. Over time, their effects manifest as performance regressions, OS jitter, and latency spikes. Illuminator clusters kernel objects in a subset of physical memory regions, making huge page allocations feasible even under heavily fragmented scenarios.
HawkEye: In this work, we deal with OS-based huge page management policies that need to balance complex trade-offs between TLB coverage, memory bloat, latency, and the number of page faults. In addition, we consider performance and fairness issues that appear under fragmentation when memory contiguity is limited. In HawkEye, we propose asynchronous page pre-zeroing to simultaneously optimize for low latency and few page faults. We propose automatic bloat recovery to effectively deal with the trade-offs between TLB coverage and memory bloat at runtime. HawkEye addresses the performance and fairness challenges by allocating huge pages based on their estimated profitability.

Trident: Illuminator and HawkEye try to extract the maximum benefit from 2MB huge pages. However, recent findings have shown that even after employing 2MB pages, more than 20% of total CPU cycles are wasted in handling TLB misses for data center applications. We address this problem using 1GB huge pages, which provide up to 1TB of per-core TLB coverage on modern systems. Leveraging insights from our earlier work, we propose Trident, a multi-level huge page framework that judiciously allocates 1GB, 2MB, and 4KB pages as deemed suitable at runtime. (A sketch of the explicit interface Linux otherwise provides for such pages follows these summaries.)
vMitosis: In this work, we focus on the effect of NUMA on address translation in virtualized servers. We show that page table walks often involve remote memory accesses on NUMA systems, which can slow down large-memory applications by more than 3x. Interestingly, the slowdown due to remote page table accesses can even outweigh that of accessing remote data, even though page tables consume less than 1% of the overall application memory footprint. vMitosis mitigates the effect of NUMA on page table walks by enabling each core to service TLB misses from its local socket; we achieve this by judiciously migrating and replicating page tables across NUMA sockets.
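For contrast with Trident's transparent approach, the sketch below shows the explicit route Linux otherwise offers for such pages: requesting a single 1GB page from the hugetlbfs pool via mmap(2). It assumes the administrator has reserved 1GB pages beforehand (e.g., with the hugepagesz=1G hugepages=N boot parameters); without that reservation the mapping fails, and the application must be modified either way.

    /* Sketch: explicitly back a mapping with one 1GB huge page via hugetlbfs.
     * The page size is encoded in the mmap flags as log2(size) << MAP_HUGE_SHIFT;
     * the guards below cover C libraries whose headers predate these constants. */
    #include <stdio.h>
    #include <sys/mman.h>

    #ifndef MAP_HUGE_SHIFT
    #define MAP_HUGE_SHIFT 26
    #endif
    #ifndef MAP_HUGE_1GB
    #define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)  /* log2(1GB) = 30 */
    #endif

    int main(void) {
        size_t len = 1UL << 30;                  /* exactly one 1GB page */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                         -1, 0);
        if (buf == MAP_FAILED) { perror("mmap(MAP_HUGETLB | MAP_HUGE_1GB)"); return 1; }
        ((char *)buf)[0] = 1;                    /* fault the page in */
        munmap(buf, len);
        return 0;
    }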

Overall, with this thesis, we show that adequate OS and hypervisor support can help virtual memory thrive even in the era of data-centric computing. We have implemented the proposed systems in the Linux kernel and the KVM hypervisor. Our optimizations are transparent to users, and using them requires no hardware or application modifications.
Teams link:
https://teams.microsoft.com/l/meetup-join/19%3ameeting_YjA2OGFmMmYtNjBjMi00NDJlLTk2NTItZjI5YTBlOTY3Yjgw%40thread.v2/0?context=%7b%22Tid%22%3a%226f15cd97-f6a7-41e3-b2c5-ad4193976476%22%2c%22Oid%22%3a%2282f39501-c5b2-4bfb-87c3-f17ca74c00b6%22%7d

Speaker Bio:
Ashish Panwar is a Ph.D. candidate at the Indian Institute of Science and is expected to graduate by February 2022. During his Ph.D., he has worked on improving memory access performance and mitigating NUMA overheads for large-memory CPU-based applications. He is currently involved in optimizing GPU software runtimes to accelerate machine learning workloads, especially deep neural networks. Prior to his Ph.D., he worked at Intel and NetApp in Bangalore, India, for a total of three years.

Host Faculty: Arkaprava Basu