Raw Array Distributed Caching

Given a set of raw files that contain portions of an array and an online dynamic query workload, we have to determine which cells to cache in the distributed memory of an array database system. Rather than having each node manage its memory buffer to cache local cells, we aim for a global caching infrastructure that automatically identifies both the cells to cache and the instance where to cache them. Our ultimate goal is to provide both instant access to distributed multi-dimensional raw arrays and optimal query performance through caching. Distributed caching for arrays poses two main challenges. The first challenge is to determine the cells kept in the cache at query time. This is a problem because cache misses over raw arrays are very expensive. Even if we have only one cache miss for a cell, it requires us to scan all the raw files whose bounding box contains the cell. However, at most one file contains the required cell. The second challenge comes from the distributed nature of array database. The conventional approach caches the requested data at the origin instance, where it is stored. This approach works well in content delivery networks (CDN), where data co-locality is not required for query execution. However, most of the array queries specify shape-based relationships between cells in the form of stencil operators and their generalization to similarity join. Given the current placement of raw and cached data, the challenge is to coordinate and organize the caches across nodes in order to preserve data co-locality and provide efficient data access.

We design a distributed caching framework that computes an effective caching plan in two stages. First, the plan identifies the cells to be cached locally from each of the input files by continuously refining an evolving R-tree index. Each query range generates a finer granularity bounding box that allows advanced pruning of the raw files that require inspection. This guarantees that - after a sufficiently large number of queries - only relevant files are scanned. In the second stage, an optimal assignment of cells to nodes that collocates dependent cells in order to minimize the overall data transfer is determined. We model cache eviction and placement as cost-based heuristics that generate an effective cache eviction plan and reorganize the cached data based on a window of historical queries. We design efficient algorithms for each of these stages. In the long run, the reorganization improves cache data co-locality by grouping relevant portions of the array and by balancing the computation across nodes.

The array distributed caching framework works as follows. First, the execution engine sends the subarray and the shape in the array similarity join query to the cache coordinator which has to determine the chunk pairs that have to be joined. While these can be immediately inferred from the catalog for loaded data, there is no chunking - or the chunking is not known - for raw data. The naive solution to handle this problem is to treat each file as a chunk and perform the similarity join between pairs of files. Since a file covers a much larger range and is not organized based on cell locality, this solution accesses many unnecessary cells and uses the cache budget inefficiently. Our approach is to build in-memory chunks incrementally based on the query, i.e., workload-driven chunking. These chunks are built by the cache nodes by accessing the files that overlap with the query - the only data the cache coordinator forwards to the nodes. Instead of extracting chunks only for the queried data, the cache nodes create higher granularity chunks for the entire file. Since chunk building requires raw file access - which is expensive - caching is very important for efficient processing. In the best case, the entire file can be cached. Otherwise, a cache replacement policy is required to determine the chunks to keep in memory. Once evicted from memory, the chunks are lost since they have to be recreated from the raw data. While it is possible to materialize the evicted chunks in some representation, e.g., R-tree, this replicates data and adds another dimension to the problem. Instead of allowing each node to execute a local cache replacement algorithm, e.g., LRU, we develop a global cost-based caching mechanism executed at the coordinator. This algorithm takes the metadata of all the created chunks across nodes and determines the chunks to be evicted from the cache in a unified way. This may incur chunk transfer across nodes. Since query evaluation also requires chunk transfer and replication, we go one step further and combine query processing with cache replacement. Specifically, cache replacement piggybacks on the query execution plan and eliminates extra chunk transfer. Moreover, the location of a cached chunk is computed by considering the relationship with other chunks in the historical query workload. This is implemented as follows. Upon receiving the chunk metadata from the nodes, the coordinator query optimizer computes the optimal query execution plan which specifies what node joins each of the chunk pairs. This is passed to the cache coordinator which computes the cache replacement plan by combining the execution plan with chunk usage statistics and chunk co-locality information. The cache replacement plan informs every node on what chunk replica to cache and which chunks to evict. Finally, these two plans are forwarded to the nodes for execution. The engine executes the query and the cache executes the replacement plan, in this sequence.

This project is supported by a DOE Early Career Award. The results are published in the following papers:
  1. Distributed Caching for Processing Raw Arrays by W. Zhao, F. Rusu, B. Dong, K. Wu, A. Y. Q. Ho, and P. Nugent
  2. Distributed Caching for Complex Querying of Raw Arrays by W. Zhao, F. Rusu, B. Dong, K. Wu, A. Y. Q. Ho, and P. Nugent

UC Merced | EECS | Home | Array Databases

Last updated: Wednesday, July 24, 2019