Seminars


HYDRA: A Dynamic Approach to Database Regeneration

Series: Ph.D Research Thesis Colloquium

Speaker: Anupam Sanghi, Ph.D (Engg) student, Dept of Computer Science and Automation

Date/Time: Jun 21 10:00:00

Location: CSA Seminar Hall (Room No. 254, First Floor)

Faculty Advisor: Prof. Jayant R. Haritsa

Abstract:
Database software vendors often need to generate synthetic databases for a variety of applications, including (a) testing database engines and applications, (b) data masking, (c) benchmarking, (d) creating what-if scenarios, and (e) assessing the performance impact of planned engine upgrades. The synthetic databases are intended to capture the desired schematic properties (e.g., keys, referential constraints, functional dependencies, domain constraints), as well as the statistical profiles (e.g., value distributions, column correlations, data skew, output volumes) of the data hosted on these schemas.
Several data generation frameworks have been proposed over the last two decades. The earliest were ab-initio generation tools that use standard mathematical distributions and do not depend on the client databases or query workloads. Subsequently, tools that generate data using column distributions became prominent. However, none of these mechanisms could mimic customer query-processing environments satisfactorily. The more contemporary school of thought is workload-aware data generation, which takes query execution plans from the customer workloads as input and guarantees volumetric similarity. That is, the intermediate row cardinalities obtained at the client and vendor sites are very similar when matching query plans are executed. This similarity helps to preserve the multi-dimensional layout and flow of the data, a prerequisite for achieving similar performance on the client's workload. However, even in this category, the existing frameworks suffer from one or more of the following limitations: (a) inability to provide a comprehensive algorithm to handle queries based on the core relational algebra operators, namely, select, project, and join; (b) inability to scale to big data volumes; (c) inability to scale to large input workloads; and (d) poor accuracy on unseen queries.
In this work, motivated by the above lacunae, we present Hydra, a data regeneration tool that materially addresses these challenges by adding functionality, dynamism, scale, and robustness. Firstly, extended workload coverage is obtained by providing a comprehensive solution that supports queries based on the select-project-join relational algebra operators. Specifically, the constraints are modeled as a linear feasibility problem, in which each variable represents the volume of a region of the data space. These regions are computed using a set of partitioning strategies. For example, to encode the filter constraints, our region-partitioning approach divides the data space into the provably minimum number of regions, thereby reducing the complexity of existing solutions by many orders of magnitude. Our projection subspace division and projection isolation strategies address the critical challenges in incorporating projection-inclusive constraints. By modeling referential constraints over denormalized equivalents of the tables, Hydra delivers a comprehensive solution that also handles join constraints.
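To make the filter-constraint encoding above concrete, here is a minimal Python sketch. It is not Hydra's actual algorithm: the toy table, the two constraints, and the helper names (build_regions, is_feasible) are all assumptions chosen for illustration. Grid cells induced by the predicate boundaries are merged whenever they satisfy exactly the same set of constraints, and each merged region then gets one variable holding its row count.

```python
from itertools import product

# Hypothetical toy table with integer columns A and B, half-open domains.
DOMAIN = {"A": (0, 100), "B": (0, 100)}
TOTAL_ROWS = 1000
# Each constraint: (range predicate per column, required output row count).
CONSTRAINTS = [
    ({"A": (20, 60)}, 400),
    ({"A": (40, 100), "B": (0, 50)}, 300),
]

def split_points(column):
    """Domain bounds plus every predicate boundary on this column."""
    points = set(DOMAIN[column])
    for predicate, _ in CONSTRAINTS:
        if column in predicate:
            points.update(predicate[column])
    return sorted(points)

def satisfies(cell, predicate):
    """A grid cell never straddles a boundary, so containment suffices."""
    return all(predicate[c][0] <= cell[c][0] and cell[c][1] <= predicate[c][1]
               for c in predicate)

def build_regions():
    """Group grid cells by the set of constraints they satisfy."""
    columns = sorted(DOMAIN)
    intervals = []
    for c in columns:
        pts = split_points(c)
        intervals.append(list(zip(pts, pts[1:])))
    regions = {}
    for combo in product(*intervals):
        cell = dict(zip(columns, combo))
        signature = tuple(satisfies(cell, p) for p, _ in CONSTRAINTS)
        regions.setdefault(signature, []).append(cell)
    return regions

def is_feasible(assignment):
    """Check a candidate row count per region against the linear constraints."""
    if any(v < 0 for v in assignment.values()):
        return False
    if sum(assignment.values()) != TOTAL_ROWS:
        return False
    return all(
        sum(v for sig, v in assignment.items() if sig[i]) == card
        for i, (_, card) in enumerate(CONSTRAINTS)
    )

regions = build_regions()
candidate = {(True, True): 100, (True, False): 300,
             (False, True): 200, (False, False): 400}
print(len(regions), is_feasible(candidate))
```

In this toy case the eight grid cells collapse into four regions, so the feasibility problem has four variables instead of eight. A real system would hand the resulting constraints to an LP solver rather than checking a hand-written candidate.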
Secondly, a unique feature of our data regeneration approach is that it delivers a database summary as the output rather than the static data itself. This summary is of negligible size and depends only on the query workload, not on the database scale. It can be used to generate data dynamically during query execution. Therefore, the enormous time and space overheads incurred by prior techniques in generating and storing the data before initiating analysis are eliminated. Specifically, the summaries for complex Big Data client scenarios comprising over a hundred queries are constructed within just a few minutes and require only a few hundred KB of storage. We have evaluated the proposed ideas using both synthetic benchmarks such as TPC-DS and real-world benchmarks based on the Census and IMDB databases.
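The idea of generating data on demand from a summary can be sketched as follows. The summary layout used here (one representative row plus a row count per region) and the scan function are assumptions made for illustration, not the tool's actual format or interface. The point is that the summary's size depends on the number of regions, while the rows themselves are produced lazily and never stored.

```python
# Hypothetical summary: one representative row and a row count per region.
SUMMARY = [
    ({"A": 45, "B": 10}, 100),
    ({"A": 25, "B": 70}, 300),
    ({"A": 70, "B": 10}, 200),
    ({"A": 5, "B": 80}, 400),
]

def scan(summary):
    """Yield rows lazily, as a table scan would, without materializing them."""
    for row, count in summary:
        for _ in range(count):
            yield dict(row)

def count_where(summary, predicate):
    """Evaluate a filter over the generated stream and return its cardinality."""
    return sum(1 for row in scan(summary) if predicate(row))

rows_a = count_where(SUMMARY, lambda r: 20 <= r["A"] < 60)
rows_ab = count_where(SUMMARY, lambda r: 40 <= r["A"] < 100 and 0 <= r["B"] < 50)
print(rows_a, rows_ab)
```

Scaling the table up only changes the counts in the summary, not the number of entries, which is why such a summary can stay small regardless of database size.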
Thirdly, to improve accuracy on unseen queries, Hydra additionally exploits the metadata statistics maintained by the database engine. Specifically, it adds an objective function to the linear program to pick a solution with an improved inter-region tuple distribution. Further, tuples within each region are distributed uniformly to obtain a spread of values. In a nutshell, these techniques facilitate careful selection of a desirable database from among the candidate synthetic databases and also provide metadata compliance.
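The intra-region step mentioned above can be illustrated with a small sketch. It assumes a region is a single hyper-rectangle with half-open integer bounds, and the function name fill_region is hypothetical; the objective-function part of the linear program is not shown, since it would require an LP solver.

```python
import random

def fill_region(bounds, count, seed=0):
    """Draw `count` rows uniformly at random inside one rectangular region.

    `bounds` maps each column to a half-open integer interval (lo, hi).
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same rows
    columns = sorted(bounds)
    return [{c: rng.randrange(*bounds[c]) for c in columns}
            for _ in range(count)]

rows = fill_region({"A": (40, 60), "B": (0, 50)}, count=100)
```

Compared with repeating one representative row, spreading values this way leaves every constraint on the region's total count intact while giving range predicates that cut through the region a more plausible selectivity.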
Lastly, as a proof of concept, the Hydra framework has been prototyped in a Java-based tool that provides a visual and interactive demonstration of the data regeneration pipeline. The tool has been warmly received by both the academic and industrial communities.

Speaker Bio:

Host Faculty: