Enhancing Coverage and Robustness of Database Generators
Series: M.Tech (Research) Colloquium- ONLINE
Speaker: Mr. Rajkumar Santhanam, M.Tech (Research) student, Dept. of Computer Science and Automation
Date/Time: Jul 23 11:00:00
Location: Microsoft Teams - ONLINE
Faculty Advisor: Prof. Jayant R. Haritsa
Generating synthetic databases that capture the essential data characteristics of client databases is a common requirement for enterprise database vendors. This need stems from a variety of use cases, such as application testing and assessing the performance impact of planned engine upgrades. A rich body of literature exists in this area, ranging from early techniques that simply generated data ab initio to contemporary ones that use a predefined client query workload to guide the data generation. In the latter category, the specific aim is to ensure volumetric similarity: assuming a common choice of query execution plans at the client and vendor sites, the output row cardinalities of individual operators in these plans are similar in the original and synthetic databases.
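As a rough illustration of volumetric similarity, the sketch below compares per-operator output row counts from the same plan run on the original and the synthetic database. The operator labels, the relative-error measure, and the `tolerance` parameter are assumptions made for this example; the abstract does not specify the exact metric used in the work.

```python
from typing import Dict


def relative_errors(original: Dict[str, int],
                    synthetic: Dict[str, int]) -> Dict[str, float]:
    """Per-operator relative error in output row cardinality.

    Keys are hypothetical operator labels taken from a shared query plan.
    """
    errors = {}
    for op, orig_rows in original.items():
        # An operator missing from the synthetic run counts as zero rows.
        synth_rows = synthetic.get(op, 0)
        # Guard against division by zero for operators that emit no rows.
        denom = max(orig_rows, 1)
        errors[op] = abs(orig_rows - synth_rows) / denom
    return errors


def is_volumetrically_similar(original: Dict[str, int],
                              synthetic: Dict[str, int],
                              tolerance: float = 0.1) -> bool:
    """True if every operator's relative error is within the tolerance."""
    return all(e <= tolerance
               for e in relative_errors(original, synthetic).values())


# Example: cardinalities observed at a filter node and a join node.
client = {"filter_1": 1000, "join_1": 500}
vendor = {"filter_1": 950, "join_1": 500}
errors = relative_errors(client, vendor)
```

With these inputs, the filter node is off by 50 rows out of 1000 and the join node matches exactly, so the pair passes a 10% tolerance but would fail a 1% one.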
Hydra is a recently proposed data regeneration framework that provides volumetric similarity. It also provides a mechanism to generate data dynamically during query execution, using a minuscule database summary. Notwithstanding its desirable characteristics, Hydra has the following critical limitations: (a) limited scope of SQL operators in the input query workload, (b) poor scalability with respect to the number of queries in the input workload, and (c) poor volumetric similarity on unseen queries. The data generation algorithm internally uses a linear programming (LP) solver that throttles workload scalability. This not only caps the size of the training (seen) workload but also reduces the accuracy for test (unseen) queries. Robustness towards test queries is further adversely affected by design choices such as the lack of a preference among candidate synthetic databases, and artificial skew in the generated data.
In this work, we present an enhanced version of Hydra, called High-Fidelity Hydra (HF-Hydra), which attempts to address the above limitations. To start with, we expand the SQL operator coverage to also include the LIKE operator, and, in certain restricted settings, projection-based operators such as GROUP BY and DISTINCT. To sidestep the challenge of workload scalability, HF-Hydra outputs not one, but a suite of database summaries such that they collectively cover the entire input workload. The division of the workload into the associated sub-workloads is governed by heuristics that aim to balance robustness with LP solvability.
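The idea of splitting a workload into sub-workloads that each stay solvable can be sketched as follows. This is only a minimal illustration: the per-query `cost` (standing in for how much a query enlarges the LP), the `budget`, and the sequential greedy packing are all assumptions for the example, not the heuristics actually used by HF-Hydra.

```python
from typing import List, Tuple


def partition_workload(queries: List[Tuple[str, int]],
                       budget: int) -> List[List[str]]:
    """Greedily pack queries into sub-workloads under a cost budget.

    queries: (query name, hypothetical LP cost) pairs, in workload order.
    budget:  hypothetical maximum total cost one LP instance can handle.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name, cost in queries:
        # Close the current group if adding this query would exceed the budget.
        if current and used + cost > budget:
            groups.append(current)
            current, used = [], 0
        # A query costlier than the budget still gets a group of its own.
        current.append(name)
        used += cost
    if current:
        groups.append(current)
    return groups


workload = [("q1", 4), ("q2", 5), ("q3", 3), ("q4", 9), ("q5", 1)]
sub_workloads = partition_workload(workload, budget=10)
```

Each resulting group would then get its own database summary, so that the groups together cover the whole input workload. Larger groups share more constraints within one summary, while smaller groups keep each LP easier to solve.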
For generating richer database summaries, HF-Hydra additionally exploits metadata statistics maintained by the database engine. Further, the database query optimizer is leveraged to make the choice among the various candidate databases. The data generation is also augmented to provide greater diversity in the represented values. Finally, when a test query is fired, HF-Hydra directs it to the database summary that is expected to provide the highest volumetric similarity.
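To make the final routing step concrete, the sketch below picks a summary for an incoming test query. The scoring rule (Jaccard overlap between the columns a query touches and the columns seen while building each summary) and all names are assumptions for illustration; the abstract does not describe how HF-Hydra estimates the expected volumetric similarity.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two column sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def route_query(query_columns: Set[str],
                summaries: Dict[str, Set[str]]) -> str:
    """Return the name of the summary whose training columns best match.

    summaries maps a hypothetical summary name to the set of columns
    referenced by the sub-workload that summary was built from.
    Ties go to the summary listed first.
    """
    return max(summaries, key=lambda name: jaccard(query_columns,
                                                   summaries[name]))


summaries = {
    "summary_a": {"item.i_current_price", "store.s_state"},
    "summary_b": {"customer.c_birth_year"},
}
chosen = route_query({"item.i_current_price"}, summaries)
```

Here the test query shares a column only with the first summary, so it is routed there.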
We have experimentally evaluated HF-Hydra on a customized set of queries based on the TPC-DS decision-support benchmark framework. We first evaluated the specialized case where each training query has its own summary, and here HF-Hydra achieves perfect volumetric similarity. Further, each summary construction took just under a second, and the summary sizes were only on the order of a few tens of kilobytes. Also, our dynamic generation technique produced gigabytes of data in just a few seconds.
For the general setting, in which a limited set of summaries represents the training query workload, the data generated by HF-Hydra was compared with that from Hydra. We observed that HF-Hydra delivers more than forty percent better accuracy for outputs from filter nodes in the plans, while also achieving an improvement of about twenty percent with regard to join nodes. Further, the degradation in volumetric similarity is minor compared to the perfectly accurate one-summary-per-query scenario, while summary production is significantly more efficient due to reduced overheads on the LP solver.
Microsoft Teams link: