Applying OLAP Pre-Aggregation Techniques to Speed Up Aggregate Query Processing in Array Databases
Large multidimensional arrays of data are common in a variety of scientific applications. In the past, arrays were typically stored in files and manipulated by customized programs operating on those files. Nowadays, with science moving toward computational databases, the trend is toward a new class of database, the array database. In the broadest sense, an array database supports various types of multidimensional array data, including remote-sensor data, satellite imagery, and data resulting from scientific simulations. As with traditional databases for business applications, analytics in array databases often involves extracting general characteristics from large repositories. This requires efficient methods for computing queries that involve data summarization, such as aggregate queries. A typical solution is to pre-compute all or part of each query, and then to save the results of queries that are frequently submitted against the database, as well as those that can be used to compute the results of similar future queries. This process is known as pre-aggregation. Unfortunately, pre-aggregation support for array databases is currently limited to one specific operation, scaling (zooming), and to two-dimensional datasets (images). In this respect, database technology for business applications is much more mature. Technologies such as On-Line Analytical Processing (OLAP) provide the means to analyze business data from one or multiple sources, and thus facilitate the decision-making process. In OLAP, information is viewed as data cubes. These cubes are typically stored in relational tables, in multidimensional arrays, or in a hybrid model. To enable fast interactive multidimensional data analysis, database systems frequently pre-compute and store the results of aggregate queries.
While there are valuable research results on OLAP pre-aggregation techniques, with varying degrees of power and refinement, not enough work has been done and reported for array databases. The purpose of this thesis is to investigate the application of OLAP pre-aggregation techniques with the objective of speeding up aggregate operations in array databases. In particular, we consider enhancing aggregate computation in Geographic Information Systems (GIS) and remote-sensing imaging applications. To this end, we describe a set of fundamental operations in GIS based on a sound algebraic framework. This allows us to identify those operations that require data summarization and that may therefore benefit from pre-aggregation. We introduce a conceptual framework and cost model for rewriting basic aggregate queries in terms of pre-aggregated data, and conduct experiments to assess the performance of our algorithms. Results show that query response times can be substantially reduced by strategically selecting the pre-aggregate with the least cost in terms of execution time. We also investigate the problem of selecting a set of queries for pre-aggregation, but did not find an analytical solution for all possible types of aggregate queries. Nevertheless, we present a framework and algorithms for the selection of scaling operations for pre-aggregation considering 2D, 3D, and 4D datasets. In our experiments with 2D datasets, this approach outperforms image pyramids, the current technique used to speed up scaling operations on 2D datasets. Furthermore, our experiments on 3D and 4D datasets show that query response times can also be substantially reduced by intelligently selecting a set of scaling operations for pre-aggregation. The work presented in this thesis is the first of its kind for array databases in scientific applications.
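
The idea of answering a scaling query from the cheapest stored pre-aggregate can be illustrated with a minimal Python sketch. This is not the algorithm or cost model of the thesis: the function names (scale_down, pick_preaggregate, answer_scaling_query), the block-averaging definition of scaling, and the use of "number of cells read" as the cost are all assumptions made here for illustration only.

```python
def scale_down(grid, factor):
    """Average non-overlapping factor x factor blocks of a square 2D grid."""
    n = len(grid) // factor
    area = factor * factor
    return [
        [
            sum(grid[r * factor + i][c * factor + j]
                for i in range(factor) for j in range(factor)) / area
            for c in range(n)
        ]
        for r in range(n)
    ]


def pick_preaggregate(stored, factor):
    """Pick the stored scale factor that can answer `factor` at least cost.

    Assumed cost: cells read from the stored array. A larger stored factor
    means a smaller array, so the largest usable factor is the cheapest.
    """
    usable = [p for p in stored if factor % p == 0]
    return max(usable)


def answer_scaling_query(stored, factor):
    """Rewrite scale-by-`factor` as a scaling of a stored pre-aggregate."""
    p = pick_preaggregate(stored, factor)
    # Averaging equal-size blocks of block averages equals the direct average.
    return p, scale_down(stored[p], factor // p)


# Hypothetical 8 x 8 base array; factor 1 (the base data) is always stored.
base = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
stored = {1: base, 2: scale_down(base, 2), 4: scale_down(base, 4)}

used, result = answer_scaling_query(stored, 8)
print(used, result == scale_down(base, 8))
```

In this sketch the query for factor 8 is computed from the stored factor-4 array (4 cells read) rather than from the base array (64 cells read). The rewrite is exact only because averaging over equal-size blocks composes; other aggregate functions or non-integer scale ratios would need a different rewriting rule.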