Seminars

View all Seminars  |  Download ICal for this event

Optimizations for Static and Dynamic Processing of Out-of-GPU-Memory Graphs

Series: Ph.D. Colloquium

Speaker: Ullas Aparanji

Date/Time: Jul 13 11:00:00

Location: CSA Auditorium, (Room No. 104, Ground Floor)

Faculty Advisor: R Govindarajan

Abstract:
Graphs are a foundational abstraction in computer science, providing a powerful computational model for problems arising in various domains. The massive scale of contemporary graphs, often containing billions of edges, poses significant challenges for efficient processing. Graphics Processing Units (GPUs), owing to their high degree of parallelism and large memory bandwidth, have become an attractive platform for accelerating graph algorithms. However, GPU device memory is severely limited relative to CPU RAM. While real-world graphs of practical interest can have edges numbering in the hundreds of billions or even trillions, modern GPUs offer at most a few tens (and upto 100GB) of gigabytes of on-device memory. Further, many real-world graphs continuously evolve over time, involving new vertices/edges to be inserted and/or some vertices/edges to be deleted. Such evolving graphs are referred to as dynamic graphs.

This thesis addresses the problem of efficiently processing such out-of-GPU-memory graphs, particularly targeting propagation-based graph algorithms such as Connected Components (CC), Breadth-First Search (BFS), Single Source Shortest Paths (SSSP), and Page Rank (PR). Our work focuses on improving performance of both large and dynamic graphs.

The central enabling data structure throughout this thesis is the SubCSR, introduced by the Subway framework. The SubCSR is a compacted variant of the Compressed Sparse Row (CSR) graph representation, restricted to only the active vertices in a given iteration of a graph algorithm. Since most iterative graph algorithms operate on a frontier that is substantially smaller than the full graph, constructing and loading the SubCSR for each iteration makes out-of-core GPU computation tractable. We note that the construction of the SubCSR consumes a significant 30??35% of the execution time. Building atop this foundation, this thesis identifies three distinct categories of optimizations.

The first contribution addresses how the graph algorithm should be executed once the SubCSR is loaded onto the GPU. Traditional vertex-centric GPU graph algorithms follow either a push-based or a pull-based propagation style. We introduce a hybrid traversal strategy that combines both paradigms within a single iteration: each vertex thread first pulls the value from its neighbours, performs certain operation (corresponding to the specific graph application that is executed), updates itself, and immediately pushes this updated value back to those neighbours. This double propagation within a single superstep significantly reduces the total number of iterations to convergence, with an associated reduction in the cost of SubCSR construction. The Hybrid strategy achieves a geomean improvement of 13% over the push baseline (equivalent to Subway) for Connected Components and 62% for PageRank, with speedups exceeding 2.3? on average over the Gunrock framework. Complementing this, we propose a Reuse optimization that amortises SubCSR construction cost by reusing the SubCSR from a prior iteration for a few successive iterations. This approximation of the active vertex set does not affect algorithmic correctness and yields improvements of up to 20.2% for out-of-GPU-memory graphs. An additional family of Adaptive Heuristics automatically selects the best traversal strategy at each iteration based on graph structure, providing a further reduction in execution time of 21.5% over the plain Hybrid baseline. Combining all three optimizations yields geomean reduction of upto 40.1% across all graphs over the plain Hybrid baseline.

The second contribution addresses dynamic graphs. Recomputing graph properties from scratch after each update is prohibitively expensive; incremental and decremental algorithms instead leverage previously computed values to reprocess only the vertices affected by the change. However, all prior work on dynamic GPU graph algorithms assumed the graph fits entirely in GPU memory. We propose gSubCSR, a framework that synthesises the Graphvine dynamic graph data structure with the SubCSR framework for out-of-core computation. Experimental evaluation shows that incremental computation costs only a small fraction of the corresponding static algorithm??s time: under 8.5% for CC, 17.3% for BFS, 11.3% for SSSP, and 48.3% for PR on out-of-GPU-memory graphs. Compared to a Unified Virtual Memory (UVM) baseline under the same memory oversubscription, gSubCSR achieves speedups of 2.99??3.42? for BFS and SSSP, and 2.47??2.9? for CC. We additionally propose a novel single-pass decremental algorithm for path-based algorithms (BFS and SSSP), which processes impacted vertices in a single traversal rather than the two-pass approach of prior work, achieving speedups of 1.71??2.4? over RisGraph and IncBoost while completing in under 19.5% of the time required for full static recomputation.

When the SubCSR for a single iteration does not fit in GPU memory, existing approaches partition the active set such that edges of partitions can be held in the GPU memory and process them sequentially. We identify two independent opportunities within this partitioned model. First, cross-partition early propagation: results computed for an earlier partition are immediately visible to later partitions within the same superstep, accelerating convergence and reducing total iteration count. Second, intra-partition pipelining: by subdividing each partition into tasks placed on separate CUDA streams, SubCSR construction for one task is overlapped with kernel execution for the previous task, hiding a significant fraction of construction overhead. These two dimensions are integrated into a unified Adaptive Execution Strategy that dynamically tunes the number of partitions and streams based on the active vertex count at each iteration. The combined approach achieves geomean speedups of 1.34? for BFS, 1.20? for CC, 1.36? for PR and 1.11? for SSSP over the single-partition, single-stream Subway baseline.

Host Faculty: R Govindarajan