The Data Calculator: Data Structure Design and Cost Synthesis from First Principles and Learned Cost Models
A fundamental challenge in computer science is designing efficient data structures. Given how important and time-consuming this is, the authors of The Data Calculator (SIGMOD 2018) ask if we can automate parts of the process. While I do not find their system to be applicable to the general problem of data structure design, it has some neat ideas, many of which may be useful for automatically optimizing and tuning implementations.
The General Idea
The authors ask how we can automate the data structure design process. Even though the paper considers only read-only dictionaries, this is still an extremely difficult task. The authors’ approach is to frame data structure design as an optimization problem, which requires addressing two major design challenges:
- How do we enumerate the design space of possible data structures?
- How do we reason about their performance without needing to actually implement and run the design?
The key idea that the authors propose is to use design primitives – elementary designs that combine to form more complex designs. To enumerate the design space, the authors define a set of data layout primitives which, when combined in different ways, describe all possible data structures that their system can generate. To reason about performance, the authors also define a set of data access primitives. These are effectively micro-benchmarks of simple data access operations that can be combined to predict the performance of more complex operations. The costs of these simple operations can be benchmarked on the target hardware platform, and then the results from these benchmarks can be used to predict the performance of a given data structure design.
With this innovation in mind, the authors design and implement the Data Calculator, a system for semi-automated data structure design. The Data Calculator supports two distinct design modes: (i) comparing the effectiveness of different design choices by predicting their efficiency, and (ii) auto-completing partial designs. The Data Calculator operates on an abstract description of the data structure that allows it to answer these questions without needing to implement and run the design first.
Data Layout Primitives
Suppose that we are indexing some sequence of key-value pairs. We can think of the index construction process as recursively applying a transformation to the input data. For example, if we are building a B-tree, then the transformation extracts the root node’s keys and partitions the remaining data into the root’s children. We then recursively apply this transformation to each child until we reach the target depth of the tree. In this sense, the entire structure of the B-tree can be described by this single transformation. Other data structures such as B$^+$-trees may be described by two transformations: one for the internal nodes and another for the leaves.
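To make the "layout as a recursive transformation" idea concrete, here is a minimal sketch (my own, not code from the paper) of a single transformation that extracts separator keys and partitions the remaining data into child regions, applied recursively until regions fit in a leaf. The names `FANOUT`, `LEAF_CAPACITY`, and `Node` are illustrative assumptions.

```python
# A toy illustration of one layout transformation applied recursively to sorted data.
FANOUT = 4          # branching factor of internal nodes (illustrative)
LEAF_CAPACITY = 8   # stop recursing once a region fits in one leaf

class Node:
    def __init__(self, keys, children=None, records=None):
        self.keys = keys          # separator keys (internal) or record keys (leaf)
        self.children = children  # child nodes, None for leaves
        self.records = records    # key-value pairs, None for internal nodes

def transform(records):
    """Apply the layout transformation to one region of sorted (key, value) pairs."""
    if len(records) <= LEAF_CAPACITY:
        return Node(keys=[k for k, _ in records], records=records)
    # Partition the region into roughly FANOUT equal child regions.
    step = (len(records) + FANOUT - 1) // FANOUT
    regions = [records[i:i + step] for i in range(0, len(records), step)]
    # Extract one separator key per region boundary, then recurse into each region.
    separators = [region[0][0] for region in regions[1:]]
    return Node(keys=separators, children=[transform(r) for r in regions])

root = transform(sorted((i, f"val{i}") for i in range(100)))
```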
This is how data structure designs are specified in the Data Calculator – one or two transformations are given, which together describe the entire layout of the data structure. The Data Calculator’s restriction to at most two transformations is somewhat arbitrary, but the cost of automated design grows exponentially with the number of transformations, and it turns out that a surprising number of data structures can be described with only two transformations. If we can describe a data structure in this way, then the question becomes how we can enumerate all possible transformations. This would give us a search space that we could use to automatically design data structures.
There are many ways to transform the input data, so it is not immediately obvious how we can obtain this search space. But as the authors point out, there is often a lot of overlap among common transformations. For example, the transformations that create internal nodes for B-trees and B$^+$-trees both partition the data into child regions even though they differ in how they extract the keys. Following this observation, the authors aim to describe all possible operations that a transformation may consist of. These are the data layout primitives mentioned earlier. Some examples of data layout primitives are partitioning the input data, recursively applying transformations to partitioned data, extracting keys from the data in various ways, and many more. In fact, there are 22 data layout primitives defined in the paper, including some fairly esoteric ones like skip list links and embedded Bloom filters. A transformation is then described as an assignment of values to each of these 22 primitives. For some primitives, this assignment is an integer (e.g., the fanout of the tree), and for others, it is a boolean (e.g., whether to use a Bloom filter or not) or one of several predefined options (e.g., the way to extract keys and/or values).
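As a rough sketch of what such an assignment might look like, consider describing a node layout as a mapping from primitive names to values. The primitive names and options below are illustrative stand-ins of my own, not the paper's exact vocabulary of 22 primitives.

```python
# One node layout described as an assignment of values to layout primitives
# (names and options are illustrative, not the paper's exact primitive set).
btree_internal_node = {
    "key_retention":       "separators",  # keep only separator keys (vs. full keys, none)
    "value_retention":     "none",        # internal nodes store no values
    "fanout":              16,            # integer-valued primitive
    "key_order":           "sorted",
    "intra_node_search":   "binary",      # vs. "scan" or "interpolation"
    "recurse_on_children": True,          # apply the next transformation to each partition
    "bloom_filter":        False,         # boolean-valued primitive
    "zone_map":            False,
    "skip_list_links":     False,
}

# A different structure reuses most of the same primitives with different values.
hash_bucket_node = dict(btree_internal_node,
                        key_order="none",
                        intra_node_search="hash_probe",
                        fanout=1024)
```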
Data Access Primitives
It is now possible to enumerate the search space of possible designs by evaluating every possible combination of data layout primitives. However, to find the best design in this search space, we still need a way to predict query cost for each potential design. Just as we decomposed designs into primitive components, one way to estimate cost is to break an entire query operation into primitive access patterns. If we had a reasonable way to estimate the cost of these primitive access patterns, then we could derive the aggregate cost of the whole operation.
As with data layout primitives, different data structures are searched using many of the same access patterns, such as random access, binary search, etc. The authors define a set of 24 such access patterns, which they call data access primitives. They then provide a complex, manually curated flow chart describing how to obtain the sequence of primitive operations for any given design (i.e., combination of data layout primitives). To estimate the cost of a design, all we then need are cost estimates for the individual primitive operations.
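The following sketch shows the flavor of this decomposition for a point lookup on a tree-shaped design: a binary search over separators at each internal level followed by a leaf scan, with the total cost obtained by summing the per-primitive costs. The decomposition rule, cost functions, and constants here are all illustrative assumptions, not the paper's flow chart.

```python
import math

# Placeholder per-primitive cost models; the constants are stand-ins for values
# that would be fitted from microbenchmarks (see the next section).
def binary_search_cost(n, a=2e-9, b=4e-8, c=1e-7):
    return a * n + b * math.log2(n) + c

def sequential_scan_cost(n, a=1e-9, c=5e-8):
    return a * n + c

def point_lookup_cost(total_records, fanout, leaf_capacity):
    """Estimate: one binary search over separators per internal level, then a leaf scan."""
    levels = math.ceil(math.log(max(total_records / leaf_capacity, 1), fanout))
    return levels * binary_search_cost(fanout) + sequential_scan_cost(leaf_capacity)

print(point_lookup_cost(total_records=10_000_000, fanout=16, leaf_capacity=256))
```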
The authors provide analytical equations describing the approximate costs of each primitive operation as a function of its input size, where the input size is denoted by $n$. For example, the equation used for binary search is $a n + b \lg n + c$. Clearly, the cost will depend on both the hardware and the input data, so the constants $a$, $b$, and $c$ act as parameters capturing the hardware and workload characteristics. These constants then need to be estimated for the target hardware deployment and workload. This is done by running a set of microbenchmarks, each corresponding to a data access primitive, written with the sole purpose of estimating these parameters. Each microbenchmark is run on samples of various input sizes drawn from a representative workload, and regression is then used to fit the corresponding analytical equation to the measurements. Once we have found these parameters, we can estimate the cost of arbitrary designs deployed on the target platform.
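Because the model $a n + b \lg n + c$ is linear in its parameters, the fit reduces to ordinary least squares. Here is a minimal sketch of that step; the timing numbers are made up purely for illustration, and in the real system each access primitive would have its own microbenchmark run on the target hardware and workload samples.

```python
import numpy as np

# Microbenchmark results: input sizes and measured seconds per lookup (made-up values).
sizes = np.array([1_000, 10_000, 100_000, 1_000_000, 10_000_000], dtype=float)
times = np.array([1.2e-7, 1.9e-7, 2.7e-7, 4.1e-7, 7.5e-7])

# Fit cost(n) = a*n + b*lg(n) + c; linear in (a, b, c), so least squares suffices.
X = np.column_stack([sizes, np.log2(sizes), np.ones_like(sizes)])
(a, b, c), *_ = np.linalg.lstsq(X, times, rcond=None)

def binary_search_cost(n):
    return a * n + b * np.log2(n) + c

print(f"a={a:.3e}, b={b:.3e}, c={c:.3e}, "
      f"predicted cost at n=5e6: {binary_search_cost(5e6):.3e}s")
```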
System Design and Evaluation
The Data Calculator provides several features to aid in data structure design. The first is simply predicting the performance of a given design. Given a design specification, a cost profile, and a sample workload, the Data Calculator performs the steps described above to estimate the cost. This is useful for easily comparing different designs. The second feature is auto-completing designs. Given a design specification, the user can specify a few parameters that they would like the Data Calculator to automatically fill in. For example, I could use the Data Calculator to determine what the branching factor for my B-tree should be. A more complex example given in the paper is finding the best combination of hash tables and B-trees for a specific workload. The Data Calculator then evaluates different combinations of choices and reports the most efficient design according to its cost predictions. Since the search space is restricted to just a few parameters, it is feasible to solve this as an optimization problem via a greedy search.
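A toy sketch of the auto-completion idea: leave one parameter (here, the fanout) open and pick the value that minimizes predicted cost. This simply enumerates a handful of candidates rather than performing the paper's greedy search, and the cost model reuses the simplified point-lookup estimate from earlier; all names and constants are illustrative.

```python
import math

def predicted_lookup_cost(total_records, fanout, leaf_capacity=256,
                          a=2e-9, b=4e-8, c=1e-7):
    """Simplified predicted cost of a point lookup (placeholder fitted constants)."""
    levels = math.ceil(math.log(max(total_records / leaf_capacity, 1), fanout))
    per_level = a * fanout + b * math.log2(fanout) + c  # binary search over separators
    return levels * per_level

# Auto-complete the fanout: try each candidate and keep the cheapest prediction.
candidates = [4, 8, 16, 32, 64, 128, 256]
workload_size = 50_000_000
best_fanout = min(candidates, key=lambda f: predicted_lookup_cost(workload_size, f))
print("best predicted fanout:", best_fanout)
```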
The authors perform a handful of experiments to evaluate the Data Calculator. They show that the Data Calculator’s cost prediction is fairly accurate, although they only evaluated it for inputs of roughly 800 KB, and they only evaluated standard existing designs, not new designs. They also gave examples of using the Data Calculator to make design decisions, but the effectiveness of these automated designs was not analyzed in depth.
Discussion
This paper introduces some very interesting ideas, but also has many limitations that are not discussed. The work revolves around the premise that there is some finite set of fundamental design primitives from which any possible data structure can be expressed. In fact, the authors describe their contribution as working towards “build[ing] the periodic table of data structures”. Initially, I was doubtful that such a set of primitives exists, and after reading the paper, that doubt has only grown.
The issue is that data structure design is rooted in the mathematical structure of the underlying problem. Just as the organization of valence electrons follows from an underlying physical model, a set of fundamental primitives should fall out of some theory of data structure design. Deriving such a theory seemed like a challenging problem, but I was nevertheless intrigued and eager to see how the authors had developed a model of this form. To my dismay, the authors do not derive their primitives from any underlying theory. What they actually do is closer to deconstructing all the existing designs they could think of. This does allow them to combine the best features of those existing designs, but new data structures are almost always based on novel insights that have not yet been explored, and would therefore require new primitives to be expressed. This severely limits the applicability of the method, since truly new designs cannot be discovered without first adding new primitives. Moreover, there is also a scalability issue, especially if ADTs more complex than read-only dictionaries are considered. Not only will finding a usable set of primitives become more challenging, but performance estimation will become nearly impossible: since performance estimates depend on a hand-curated flow chart, designing that flow chart will quickly become intractable.
But even with these limitations, if you are interested in a read-only dictionary, the Data Calculator could still be useful for finding the best combination of existing designs. Unfortunately, even in this case, I find the paper’s evaluation to be incomplete, which makes it hard to conclude much. As I mentioned, they only evaluate the Data Calculator’s predictive abilities for inputs of roughly 800 KB. This is smaller than the L2 cache on some embedded devices! Given that their target use case is large-scale database systems, this is not a large enough input to infer anything meaningful. The more complex behaviors only emerge with large data sets, and it is not clear how well their cost estimation works at those scales. Moreover, these tests only evaluate standard data structures (linked lists, B-trees, etc.) that have been explicitly included in their set of design primitives. I would expect accurate predictions to be harder to obtain for more exotic designs. They are also (presumably) only running single-threaded workloads, while virtually all applications I can imagine would be multi-threaded, which further complicates performance prediction. Lastly, since the authors only evaluate the Data Calculator’s auto-complete capabilities on artificial examples, and do so without measuring the performance of the resulting designs, it is hard to say how capable it is of actually improving a real-world system.
With all that said, I do think this general idea could be useful for specific domains where the scope is sufficiently limited. For example, maybe there are some real-world systems that use specific ADTs where some type of fine-tuning based on this method would work very well. Even though there are some pieces missing, it is an interesting approach and may find some applications in the future.