「Designing Data-Intensive Applications」Chapter 3

Storage and Retrieval

Data Structures That Power Your Database

An index is an additional structure that is derived from the primary data.

An important trade-off in storage systems: well-chosen indexes speed up read queries, but every index slows down writes.

Hash Indexes

Keep an in-memory hash map where every key is mapped to a byte offset in the data file. Whenever you append a new key-value pair to the file, you also update the hash map to reflect the offset of the data you just wrote. When you want to look up a value, use the hash map to find the offset in the data file, seek to that location and read the value.

A better solution is to break the log into segments of a certain size. We can then perform compaction (throw away duplicate keys in the log, keeping only the most recent update for each key) on several segments together, writing the merged result to a new file. The merging and compaction can be done in a background thread. After the merging process is complete, we switch read requests over to the new merged segment, and the old segment files can simply be deleted.

In order to find the value for a key, we first check the most recent segment’s hash map; if the key is not present we check the second-most-recent segment, and so on.
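Treating each segment as an in-memory key-to-value map for illustration (newest segment last — a simplification of the on-disk layout), the newest-first lookup and the compaction step might look like:

```python
def lookup(key, segments):
    """Check the most recent segment first, then successively older ones."""
    for segment in reversed(segments):   # segments are ordered oldest -> newest
        if key in segment:
            return segment[key]
    return None

def compact_and_merge(segments):
    """Merge several segments into one, keeping only the most recent value
    per key. Later (newer) segments overwrite earlier ones."""
    merged = {}
    for segment in segments:             # oldest first, so newer values win
        merged.update(segment)
    return merged
```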


This append-only design has several advantages:

  • Appending and segment merging are sequential write operations, which are generally much faster than random writes.
  • Concurrency and crash recovery are much simpler if segment files are append-only or immutable.
  • Merging old segments avoids the problem of data files getting fragmented over time.


However, hash indexes also have limitations:

  • The hash table must fit in memory.
  • Range queries are not efficient.

SSTables and LSM-Trees

Now we make two changes to the format of our previous segment files:

  1. The sequence of key-value pairs is sorted by key (the sorted order can be maintained in memory with a balanced tree such as a red-black tree or AVL tree), and this format is called Sorted String Table, or SSTable for short.
  2. Each key only appears once within each merged segment file (the compaction process already ensures that).

Merging segments is like the merge sort algorithm. When multiple segments contain the same key, we can keep the value from the most recent segment and discard the values in older segments.
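Because each input segment is already sorted, the merge can be done in a single heap-based pass, as in merge sort. A sketch, with segments represented as in-memory lists of (key, value) pairs and passed newest first:

```python
import heapq

def merge_segments(*segments):
    """Merge sorted segments (lists of (key, value) pairs, newest segment
    first), keeping only the most recent value for each key."""
    # Tag each entry with its segment index so that, for equal keys, the
    # entry from the newest segment (lowest index) sorts first.
    tagged = [[(key, idx, value) for key, value in seg]
              for idx, seg in enumerate(segments)]
    merged, last_key = [], object()
    for key, _, value in heapq.merge(*tagged):
        if key != last_key:              # first occurrence is the newest value
            merged.append((key, value))
            last_key = key
    return merged
```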

In order to find a particular key in the file, you only need a sparse in-memory index that tells you the offsets for some of the keys, rather than an index that keeps every key in memory.
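A sketch of such a sparse index: keep (key, offset) pairs for only some keys, binary-search for the closest indexed key at or before the target, and scan from that offset. The `read_range` callback standing in for the file scan is a hypothetical helper, not a real API.

```python
import bisect

def find_in_sstable(key, sparse_index, read_range):
    """Look up `key` using a sparse index: a sorted list of (key, offset)
    pairs covering only some of the keys in the segment. `read_range(start,
    end)` returns the records stored between the two offsets as a dict."""
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, key) - 1       # last indexed key <= target
    if i < 0:
        return None                              # key sorts before the segment
    start = sparse_index[i][1]
    end = sparse_index[i + 1][1] if i + 1 < len(sparse_index) else None
    return read_range(start, end).get(key)       # scan just that slice
```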

Now our storage engine works as follows:

  • When a write comes in, add it to an in-memory balanced tree data structure (a memtable).
  • When the memtable gets bigger than some threshold, write it out to disk as an SSTable file. The new SSTable file becomes the most recent segment of the database.
  • In order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc.
  • From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.
  • If the database crashes, the most recent writes (which are in the memtable but not yet written out to disk) are lost. In order to avoid that problem, we can keep a separate log on disk to which every write is immediately appended; its only purpose is to restore the memtable after a crash. Every time the memtable is written out to an SSTable, the corresponding log can be discarded.
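Putting the steps above together, a toy version of the engine might look like this (segments are held in memory rather than on disk, and there is no write-ahead log):

```python
class TinyLSM:
    """Minimal sketch of the read/write path: an in-memory memtable that is
    flushed as an immutable sorted segment once it exceeds a threshold."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}        # stand-in for a balanced tree; sorted on flush
        self.segments = []        # newest segment last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) > self.memtable_limit:
            # Flush: write out the memtable as a sorted, immutable segment.
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:                   # 1. memtable first
            return self.memtable[key]
        for segment in reversed(self.segments):    # 2. newest segment to oldest
            for k, v in segment:
                if k == key:
                    return v
        return None
```

A real engine would binary-search each segment (they are sorted) and periodically merge segments in the background, as described above.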

Storage engines that are based on this principle of merging and compacting sorted files are often called LSM storage engines.

The LSM-tree algorithm can be slow when looking up keys that do not exist in the database. In order to optimize this kind of access, storage engines often use additional Bloom filters.
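A Bloom filter approximates the set of keys in a segment: a "no" answer is definitive, so the engine can skip reading the segment entirely, while a "yes" means only "maybe present". A minimal sketch — the bit-array size and the SHA-256-based hashing are arbitrary choices for illustration:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: a bit array plus k hash functions."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = 0                      # a Python int used as a bit array

    def _positions(self, key):
        # Derive k positions by hashing the key with k different prefixes.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False -> definitely absent; True -> possibly present.
        return all(self.bits & (1 << pos) for pos in self._positions(key))
```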

Lucene, an indexing engine for full-text search used by Elasticsearch and Solr, uses a similar method for storing its term dictionary. This is implemented with a key-value structure where the key is a word and the value is the list of IDs of all the documents that contain the word. In Lucene, this mapping from term to postings list is kept in SSTable-like sorted files, which are merged in the background as needed.

There are also different strategies to determine the order and timing of how SSTables are compacted and merged:

  • Size-Tiered Compaction

The idea here is to trigger compaction when we have enough similarly sized SSTables. These are merged together to form one larger SSTable. Later, when several large SSTables have accumulated, they will be merged to form one even larger SSTable, and so on. Size-tiered compaction is great if rows are written once and never modified (or written a few times and then not modified again). In that case, each row will eventually end up being written as a whole to one compacted SSTable, and reads are efficient. But if our use case involves continuously modifying existing rows, then with size-tiered compaction each row will always be split across several SSTables, making reads slow.
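The trigger condition can be sketched as follows; the threshold and ratio parameters are illustrative, not any particular engine's actual configuration:

```python
def pick_compaction_bucket(sstable_sizes, min_threshold=4, bucket_ratio=2.0):
    """Group SSTables into buckets of similar size; if any bucket holds at
    least `min_threshold` tables, return it as the next compaction job."""
    buckets = []
    for size in sorted(sstable_sizes):
        for bucket in buckets:
            # "Similar" = within bucket_ratio of the bucket's smallest member.
            if size <= bucket[0] * bucket_ratio:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    for bucket in buckets:
        if len(bucket) >= min_threshold:
            return bucket          # merge these into one larger SSTable
    return None                    # nothing to compact yet
```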

  • Date-Tiered Compaction

This strategy was developed to improve performance for time-series use cases. For more details, see:

  1. http://www.scylladb.com/2016/08/30/date-tiered-compaction-strategy/
  2. https://labs.spotify.com/2014/12/18/date-tiered-compaction/
  3. https://www.datastax.com/dev/blog/datetieredcompactionstrategy
  4. https://www.datastax.com/dev/blog/dtcs-notes-from-the-field


B-Trees

B-trees break the database down into fixed-size pages. This design corresponds more closely to the underlying hardware, as disks are also arranged in fixed-size blocks.

The number of references to child pages in one page of the B-tree is called the branching factor.
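A large branching factor keeps the tree very shallow, since depth grows only logarithmically in the number of pages. For example, with a branching factor of 500, four levels can reach about 500^4 pages, which at 4 KB per page is roughly 256 TB:

```python
def btree_depth(num_pages, branching_factor):
    """Number of levels a B-tree needs to reach `num_pages` leaf pages:
    each level multiplies the number of reachable pages by the branching
    factor, so depth grows as log_b(n). Computed iteratively with integers
    to avoid floating-point rounding."""
    depth, capacity = 0, 1
    while capacity < num_pages:
        capacity *= branching_factor
        depth += 1
    return depth
```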

In order to make the database resilient to crashes, it is common for B-tree implementations to include an additional data structure on disk: a write-ahead log (WAL, also known as a redo log).

An additional complication of updating pages in place is that careful concurrency control is required if multiple threads are going to access the B-tree at the same time. This is typically done by protecting the tree’s data structures with latches (lightweight locks).

B-tree optimizations

  • Instead of overwriting pages and maintaining a WAL for crash recovery, some databases use a copy-on-write scheme.
  • We can save space in pages by not storing the entire key, but abbreviating it.
  • Many B-tree implementations try to lay out the tree so that leaf pages appear in sequential order on disk.
  • Additional pointers have been added to the tree.
  • B-tree variants such as fractal trees borrow some log-structured ideas to reduce disk seeks.

Comparing B-Trees and LSM-Trees

Advantages of LSM-trees:

  • LSM-trees are typically able to sustain higher write throughput than B-trees, partly because they sometimes have lower write amplification, and partly because they sequentially write compact SSTable files rather than having to overwrite several pages in the tree.
  • LSM-trees can be compressed better, and thus often produce smaller files on disk than B-trees.
  • Lower write amplification and reduced fragmentation are still advantageous on SSDs.

Downsides of LSM-trees:

  • Compaction process can sometimes interfere with the performance of ongoing reads and writes.
  • The disk’s write bandwidth needs to be shared between the initial write and the compaction threads running in the background.
  • If write throughput is high and compaction is not configured carefully, it can happen that compaction cannot keep up with the rate of incoming writes.
  • An advantage of B-trees is that each key exists in exactly one place in the index, whereas a log-structured storage engine may have multiple copies of the same key in different segments. This aspect makes B-trees attractive in databases that want to offer strong transactional semantics.

Other Indexing Structures

  • A secondary index can easily be constructed from a key-value index. The main difference is that keys are not unique; this can be solved in two ways: either by making each value in the index a list of matching row identifiers (like a postings list in a full-text index) or by making each key unique by appending a row identifier to it.
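Both options can be sketched over an in-memory table (row ids mapped to row dicts — an illustrative stand-in for a heap file):

```python
from collections import defaultdict

def build_secondary_index_postings(rows, column):
    """Option 1: each index value is a list of matching row ids (a postings list)."""
    index = defaultdict(list)
    for row_id, row in rows.items():
        index[row[column]].append(row_id)
    return dict(index)

def build_secondary_index_unique_keys(rows, column):
    """Option 2: make each key unique by appending the row id to it."""
    return {(row[column], row_id): row_id for row_id, row in rows.items()}
```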
  • The key in an index is the thing that queries search for, but the value can be one of two things: it could be the actual row in question, or it could be a reference to the heap file, which stores data in no particular order.
  • In some situations, the extra hop from the index to the heap file is too much of a performance penalty for reads, so it can be desirable to store the indexed row directly within an index. This is known as a clustered index.  In MySQL’s InnoDB storage engine, the primary key of a table is always a clustered index, and secondary indexes refer to the primary key (rather than a heap file location).
  • A compromise between a clustered index (storing all row data within the index) and a non-clustered index (storing only references to the data within the index) is known as a covering index, which stores some of a table’s columns within the index.
  • A concatenated index simply combines several fields into one key by appending one column to another.
  • Multi-dimensional indexes are a more general way of querying several columns at once, which is particularly important for geospatial data(R-tree).
  • The in-memory index in Lucune is a finite state automaton over the characters in the keys, similar to a trie. This automaton can be transformed into a Levenshtein automaton, which supports efficient search for words within a given edit distance.

Transaction Processing or Analytics?

Data Warehousing

A data warehouse, by contrast, is a separate database that analysts can query to their hearts’ content, without affecting OLTP operations. Data is extracted from OLTP databases (using either a periodic data dump or a continuous stream of updates), transformed into an analysis-friendly schema, cleaned up, and then loaded into the data warehouse.

Stars and Snowflakes: Schemas for Analytics

Column-Oriented Storage

The idea behind column-oriented storage is simple: don’t store all the values from one row together, but store all the values from each column together instead. The column-oriented storage layout relies on each column file containing the rows in the same order.
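A sketch of the layout, using in-memory lists as stand-ins for per-column files:

```python
def rows_to_columns(rows, columns):
    """Column-oriented layout: one list per column, all in the same row order."""
    return {col: [row[col] for row in rows] for col in columns}

def reassemble_row(column_store, columns, k):
    """Because every column file preserves row order, row k is simply the
    k-th entry taken from each column."""
    return {col: column_store[col][k] for col in columns}
```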

Column Compression

Besides only loading those columns from disk that are required for a query, we can further reduce the demands on disk throughput by compressing data. One technique that is particularly effective in data warehouses is bitmap encoding.
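For a column with few distinct values, bitmap encoding stores one bitmap per value, and queries like `WHERE col IN (...)` become bitwise ORs over those bitmaps. A sketch using plain Python lists of bits (a real implementation would pack the bits and typically run-length encode them):

```python
def bitmap_encode(column):
    """One bitmap per distinct value: bit i is 1 iff row i holds that value."""
    return {value: [1 if v == value else 0 for v in column]
            for value in set(column)}

def rows_matching(bitmaps, *values):
    """WHERE col IN (values): OR the bitmaps, then read off the row numbers."""
    combined = [any(bits) for bits in zip(*(bitmaps[v] for v in values))]
    return [i for i, bit in enumerate(combined) if bit]
```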

Besides reducing the volume of data that needs to be loaded from disk, column-oriented storage layouts are also good for making efficient use of CPU cycles.

Sort Order in Column Storage

Note that it wouldn’t make sense to sort each column independently, because then we would no longer know which items in the columns belong to the same row. The data needs to be sorted an entire row at a time, even though it is stored by column.

Another advantage of sorted order is that it can help with compression of columns. That compression effect is strongest on the first sort key. The second and third sort keys will be more jumbled up, and thus not have such long runs of repeated values.

Having multiple sort orders in a column-oriented store is a bit similar to having multiple secondary indexes in a row-oriented store. But the big difference is that the row-oriented store keeps every row in one place (in the heap file or a clustered index), and secondary indexes just contain pointers to the matching rows. In a column store, there normally aren’t any pointers to data elsewhere, only columns containing values.

Writing to Column-Oriented Storage

An update-in-place approach is not possible with compressed columns. Instead, most column stores take an approach like LSM-trees: writes first go to an in-memory store, and are later merged with the column files on disk in bulk.

Aggregation: Data Cubes and Materialized Views

Data warehouse queries often involve an aggregate function, such as COUNT , SUM , AVG , MIN , or MAX in SQL. If the same aggregates are used by many different queries, it can be wasteful to crunch through the raw data every time. One way of creating such a cache is a materialized view.

When the underlying data changes, a materialized view needs to be updated, because it is a denormalized copy of the data. The database can do that automatically, but such updates make writes more expensive, which is why materialized views are not often used in OLTP databases. In read-heavy data warehouses they can make more sense.

The advantage of a materialized data cube is that certain queries become very fast because they have effectively been precomputed. The disadvantage is that a data cube doesn’t have the same flexibility as querying the raw data.
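A two-dimensional data cube can be sketched by precomputing SUM for every combination of dimension values, using "ALL" as a grand-total marker (an illustrative convention, not a standard API):

```python
from collections import defaultdict
from itertools import product

def build_data_cube(facts):
    """Precompute SUM of amounts for every (date, product) cell, plus the
    totals along each dimension and the grand total."""
    cube = defaultdict(float)
    for date, product_id, amount in facts:
        # Each fact contributes to its own cell and to the "ALL" rollups.
        for d, p in product((date, "ALL"), (product_id, "ALL")):
            cube[(d, p)] += amount
    return dict(cube)
```

Queries that hit a precomputed cell (e.g. total sales per product) become a single lookup, but anything not captured by the cube's dimensions must fall back to the raw data.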
