SeminarsView all Seminars | Download ICal for this event
Revisiting Statistical Techniques for Cardinality Estimation in RDBMS
Series: M.Tech (Research) Thesis Colloquium- ON-LINE
Speaker: Mr. Dhrumilkumar ShahM. Tech (Research) studentDepartment of Computer Science and Automation
Date/Time: Mar 22 14:00:00
Location: Microsoft Teams - ON-LINE
Faculty Advisor: Prof. Jayant R. Haritsa
Relational Database Management Systems (RDBMS) constitute the backbone of today's information-rich society, providing a congenial environment for handling enterprise data during its entire life cycle of generation, storage, maintenance, and processing. The Structured Query Language (SQL) is the standard interface to query the information present in RDBMS-based storage. Because of the declarative nature of SQL, the query optimizer inside the database engine needs to come up with an efficient execution plan for a given query. To do so, database query optimizers are critically dependent on accurate row-cardinality estimates of the intermediate results generated on the edges of the execution plan tree.
Unfortunately, the histogram and sampling-based techniques commonly used in industrial database engines for cardinality estimation are often woefully inaccurate in practice. As a result, query optimizers produce sub-optimal query execution plans, leading to inflated query response times. This lacuna has motivated a recent slew of papers advocating the use of machine-learning techniques for cardinality estimation. However, these new approaches have their own limitations regarding training overheads, output explainability, incorporating dynamic updates, handling of workload drift, and generalization to unseen queries.
In this work, we take a relook at the traditional techniques and investigate whether they can be made to work satisfactorily when augmented with light-weight data structures. Specifically, we present GridSam, which essentially combines histograms and sampling in a potent partnership incorporating both algorithmic and platform innovations.
From the algorithmic perspective, GridSam first creates a multi-dimensional grid overlay structure by partitioning the data-space on "critical" attributes (Histogram), and then performs dynamic sampling from query-specific regions of the grid to capture correlations (Sampling). A heuristic-based methodology is used to determine the critical grid dimensions. Further, insights from Index-based Join Sampling (IBJS) technique are leveraged to direct the sampling in multi-table queries. Finally, learned-indexes are incorporated to reduce the index-probing cost for join sampling during the estimation process.
From the platform perspective, GridSam leverages the massive parallelism offered by current GPU architectures to provide fast grid setup times. This parallelism is also extended to the run-time estimation process.
A detailed performance study on benchmark environments indicates that GridSam computes cardinality estimates with accuracies competitive to contemporary learning-based techniques. Moreover, it does so while achieving orders-of-magnitude reduction in setup time. Further, the estimation time is in the same ballpark as both traditional and learning-based techniques. Finally, a collateral benefit of GridSamâ€™s simple design is that, unlike learned estimators, it is natively amenable to dynamic data environments.
Microsoft Teams Link: