「Designing Data-Intensive Applications」Chapter 2

Data Models and Query Languages

The limits of my language mean the limits of my world.

                 — Ludwig Wittgenstein, Tractatus Logico-Philosophicus (1922)

Data models are perhaps the most important part of developing software, because they have such a profound effect: not only on how the software is written, but also on how we think about the problem that we are solving.

Relational Model Versus Document Model

The goal of the relational model was to hide that implementation detail behind a cleaner interface.

If data is stored in relational tables, an awkward translation layer is required between the objects in the application code and the database model of tables, rows, and columns. The disconnect
between the models is sometimes called an impedance mismatch.

There are advantages to having standardized lists and letting users choose from a drop-down list or autocompleter:

  • Consistent style and spelling
  • Avoiding ambiguity
  • Ease of updating
  • Localization support
  • Better search

The main arguments in favor of the document data model are schema flexibility, better performance due to locality, and that for some applications it is closer to the data structures used by the application. The relational model counters by providing better support for joins, and many-to-one and many-to-many relationships.

The document model has limitations: for example, you cannot refer directly to a nested item within a document.

Document databases are sometimes called schemaless, but that’s misleading. A more accurate term is schema-on-read (the structure of the data is implicit, and only interpreted when the data is read), in contrast with schema-on-write (the traditional approach of relational databases, where the schema is explicit and the database ensures all written data conforms to it).

Schema changes have a bad reputation of being slow and requiring downtime. This reputation is not entirely deserved: most relational database systems execute the ALTER TABLE statement in a few milliseconds. MySQL is a notable exception—it copies the entire table on ALTER TABLE, which can mean minutes or even hours of downtime when altering a large table—although various tools exist to work around this limitation.

The schema-on-read approach is advantageous if the items in the collection don’t all have the same structure for some reason (i.e., the data is heterogeneous).

If your application often needs to access the entire document (for example, to render it on a web page), there is a performance advantage to this storage locality. If data is split across multiple tables, multiple index lookups are required to retrieve it all, which may require more disk seeks and take more time.

The locality advantage only applies if you need large parts of the document at the same time. The database typically needs to load the entire document, even if you access only a small portion of it, which can be wasteful on large documents. On updates to a document, the entire document usually needs to be rewritten—only modifications that don’t change the encoded size of a document can easily be performed in place. For these reasons, it is generally recommended that you keep documents fairly small and avoid writes that increase the size of a document. These performance limitations significantly reduce the set of situations in which document databases are useful.

Query Languages for Data

An imperative language tells the computer to perform certain operations in a certain order. In a declarative query language, you just specify the pattern of the data you want—what conditions the results must meet, and how you want the data to be transformed (e.g., sorted, grouped, and aggregated)—but not how to achieve that goal. It is up to the database system’s query optimizer to decide which indexes and which join methods to use, and in which order to execute various parts of the query.

  • A declarative query language is attractive because it is typically more concise and easier to work with.
  • It also hides implementation details of the database engine, which makes it possible for the database system to introduce performance improvements without requiring any changes to queries.
  • Declarative languages often lend themselves to parallel execution.

CSS selector and XPath selector are both declarative languages for specifying the styling of a document.

MapReduce is neither a declarative query language nor a fully imperative query API, but somewhere in between: the logic of the query is expressed with snippets of code, which are called repeatedly by the processing framework. It is based on the map (also known as collect) and reduce (also known as fold or inject) functions that exist in many functional programming languages.

Graph-Like Data Models

Property Graphs

A property graph has these elements:

  • a set of vertices:
    • each vertex has a unique identifier.
    • each vertex has a set of outgoing edges.
    • each vertex has a set of incoming edges.
    • each vertex has a collection of properties defined by a map from key to value.
  • a set of edges:
    • each edge has a unique identifier.
    • each edge has an outgoing tail vertex.
    • each edge has an incoming head vertex.
    • each edge has a label that denotes the type of relationship between its two vertices.
    • each edge has a collection of properties defined by a map from key to value.

Some important aspects of property graph model are:

  • Any vertex can have an edge connecting it with any other vertex. There is no schema that restricts which kinds of things can or cannot be associated.
  • Given any vertex, you can efficiently find both its incoming and its outgoing edges, and thus traverse the graph—i.e., follow a path through a chain of vertices —both forward and backward.
  • By using different labels for different kinds of relationships, you can store several different kinds of information in a single graph, while still maintaining a clean data model.

Cypher is Neo4j’s open graph query language:

Triplestore(RDF) and SPARQL

In a triple-store, all information is stored in the form of very simple three-part statements: (subject, predicate, object). The subject of a triple is equivalent to a vertex in a graph. The object is one of two things:

  1. A value in a primitive datatype, such as a string or a number. In that case, the predicate and object of the triple are equivalent to the key and value of a property on the subject vertex. For example, (lucy, age, 33) is like a vertex lucy with properties {“age”:33}.
  2. Another vertex in the graph. In that case, the predicate is an edge in the graph, the subject is the tail vertex, and the object is the head vertex. For example, in (lucy, marriedTo, alain) the subject and object lucy and alain are both vertices, and the predicate marriedTo is the label of the edge that connects them.

SPARQL is a query language for triple-stores using the RDF data model:

The Foundation: Datalog

Datalog’s data model is similar to the triple-store model, generalized a bit. Instead of writing a triple as (subject, predicate, object), we write it as predicate(subject, object):

By repeated application of rules 1 and 2, the within_recursive predicate can tell us all the locations in North America (or any other location name) contained in our database:

Now rule 3 can find people who were born in some location BornIn and live in some location LivingIn.

Other data models

  • Researchers working with genome data often need to perform sequencesimilarity searches, which means taking one very long string (representing a DNA molecule) and matching it against a large database of strings that are similar, but not identical. None of the databases described here can handle this kind of usage, which is why researchers have written specialized genome database software like GenBank.
  • Particle physicists have been doing Big Data–style large-scale data analysis for decades, and projects like the Large Hadron Collider now work with hundreds of petabytes. At such a scale custom solutions are required to stop the hardware cost from spiraling out of control.
  • Full-text search is arguably a kind of data model that is frequently used alongside databases.

Leave a Reply

Your email address will not be published. Required fields are marked *