"Designing Data-Intensive Applications", Chapter 4

Encoding and Evolution

 

In order for the system to continue running smoothly, we need to maintain compatibility in both directions:

  • Backward compatibility: newer code can read data that was written by older code.
  • Forward compatibility: older code can read data that was written by newer code.

Formats for Encoding Data

Programs usually work with data in two different representations:

  1. In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, and so on. These data structures are optimized for efficient access and manipulation by the CPU (typically using pointers).
  2. When you want to write data to a file or send it over the network, you have to encode it as some kind of self-contained sequence of bytes.

The translation from the in-memory representation to a byte sequence is called encoding (also known as serialization or marshalling), and the reverse is called decoding (parsing, deserialization, unmarshalling).
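
A minimal sketch of these two representations, using Python's standard json module: an in-memory dict (a hash table full of pointers) is encoded into a self-contained byte sequence, and then decoded back again.

import json

# In-memory representation: a dict, optimized for access and manipulation by the CPU.
record = {"city": "Berlin", "population": 3645000}

# Encoding (serialization): turn the object into a self-contained byte sequence.
encoded = json.dumps(record).encode("utf-8")

# Decoding (deserialization): parse the bytes back into an in-memory object.
decoded = json.loads(encoded)

assert decoded == record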

Language-Specific Formats

Many programming languages come with built-in support for encoding in-memory objects into byte sequences. However, they also have a number of deep problems:

  • The encoding is often tied to a particular programming language, and reading the data in another language is very difficult.
  • In order to restore data in the same object types, the decoding process needs to be able to instantiate arbitrary classes, which is frequently a source of security problems (see the pickle sketch after this list).
  • Versioning data is often an afterthought in these libraries.
  • Efficiency (CPU time taken to encode or decode, and the size of the encoded structure) is also often an afterthought.
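
As a concrete illustration, here is a minimal sketch using Python's built-in pickle module (other languages' built-in serializers behave analogously): the resulting bytes are tied to Python, restoring the object requires the class to be importable, and unpickling untrusted data can execute arbitrary code.

import pickle

class User:
    def __init__(self, user_name, favorite_number):
        self.user_name = user_name
        self.favorite_number = favorite_number

# Encoding is convenient, but the byte sequence is Python-specific.
encoded = pickle.dumps(User("Martin", 1337))

# Decoding must be able to locate and instantiate the User class; feeding
# untrusted bytes to pickle.loads() can run arbitrary code.
restored = pickle.loads(encoded)
print(restored.user_name, restored.favorite_number)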

JSON, XML, and Binary Variants

JSON, XML, and CSV are textual formats and thus somewhat human-readable, but they also have some subtle problems:

  • There is a lot of ambiguity around the encoding of numbers. In XML and CSV, you cannot distinguish between a number and a string that happens to consist of digits. JSON distinguishes strings and numbers, but it doesn’t distinguish integers from floating-point numbers, and it doesn’t specify a precision.
  • JSON and XML have good support for Unicode character strings (i.e., human-readable text), but they don’t support binary strings (sequences of bytes without a character encoding).
  • There is optional schema support for both XML and JSON, but the schema languages are quite complicated to learn and implement, and many JSON-based tools don’t bother using them.
  • CSV does not have any schema, so it is up to the application to define the meaning of each row and column.

Let’s look at an example of MessagePack, a binary encoding for JSON:
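
As a sketch of the idea, assuming the msgpack Python library is installed (a stand-in here, not the book's byte-by-byte walkthrough): the record is encoded both as textual JSON and as MessagePack, and the two sizes are compared. The binary encoding is only slightly smaller, because the field names still have to appear in the encoded bytes.

import json
import msgpack

record = {"userName": "Martin", "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}

as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
as_msgpack = msgpack.packb(record)   # binary encoding of the same data

print(len(as_json), "bytes of JSON")
print(len(as_msgpack), "bytes of MessagePack")
assert msgpack.unpackb(as_msgpack) == record   # round-trips back to the same dict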

Thrift and Protocol Buffers

To maintain backward compatibility, every field you add after the initial deployment of the schema must be optional or have a default value. To maintain forward compatibility, you can only remove a field that is optional (a required field can never be removed), and you can never use the same tag number again.
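
A simplified sketch of why these rules work (an illustration of the tag-number idea, not the real Thrift or Protocol Buffers implementation): every field in the byte stream is prefixed with its tag number and a wire type, so a decoder that encounters a tag it doesn't recognize can still determine how many bytes to skip. That skipping is what gives the format its forward compatibility.

import struct

def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next position)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def decode_record(buf, known_fields):
    """Keep fields whose tag is in known_fields; skip over any unknown tags."""
    record, pos = {}, 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0x07
        if wire_type == 0:                           # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:                         # length-delimited (string, bytes)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:                         # fixed 32-bit
            value, pos = struct.unpack_from("<I", buf, pos)[0], pos + 4
        else:
            raise ValueError("wire type not handled in this sketch")
        if tag in known_fields:                      # old code keeps the tags it knows
            record[known_fields[tag]] = value
        # unknown tags were parsed just far enough to skip them, then dropped
    return record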

Changing the datatype of a field is possible, but it carries the risk that values will lose precision or get truncated.

Thrift has a dedicated list datatype, which is parameterized with the datatype of the list elements. This has the advantage of supporting nested lists.

Protocol Buffers does not have a list or array datatype, but instead it has a repeated marker for fields: the same field tag simply appears multiple times in the record. This has the nice effect that it’s okay to change an optional (single-valued) field into a repeated (multi-valued) field. New code reading old data sees a list with zero or one elements (depending on whether the field was present); old code reading new data sees only the last element of the list.

Avro

Avro also uses a schema to specify the structure of the data being encoded. It has two schema languages: one (Avro IDL) intended for human editing, and one (based on JSON) that is more easily machine-readable.

There are no tag numbers in the schema, and there is nothing in the encoded data to identify fields or their datatypes: a string is just a length prefix followed by UTF-8 bytes, and an integer is encoded using a variable-length encoding.

The key idea with Avro is that the writer’s schema and the reader’s schema don’t have to be the same; they only need to be compatible. The Avro library resolves the differences by looking at the writer’s schema and the reader’s schema side by side and translating the data from the writer’s schema into the reader’s schema.

With Avro, forward compatibility means that you can have a new version of the schema as writer and an old version of the schema as reader; backward compatibility means that you can have a new version of the schema as reader and an old version as writer. To maintain compatibility, you may only add or remove a field that has a default value.
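
A short sketch of schema resolution, assuming the Python fastavro library is available (any Avro binding behaves the same way): a record written with an old schema is read with a newer reader’s schema that adds a field, and the new field must carry a default value so the reader can fill it in.

import io
from fastavro import schemaless_writer, schemaless_reader

writer_schema = {                     # old version of the schema (the writer)
    "type": "record", "name": "User",
    "fields": [{"name": "userName", "type": "string"}],
}

reader_schema = {                     # new version of the schema (the reader)
    "type": "record", "name": "User",
    "fields": [
        {"name": "userName", "type": "string"},
        # added field: needs a default so old data can still be decoded
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
    ],
}

buf = io.BytesIO()
schemaless_writer(buf, writer_schema, {"userName": "Martin"})
buf.seek(0)
record = schemaless_reader(buf, writer_schema, reader_schema)
print(record)   # {'userName': 'Martin', 'favoriteNumber': None}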

How does the reader know the writer’s schema?

  • Large file with lots of records: in this case, the writer of that file can just include the writer’s schema once at the beginning of the file.
  • Database with individually written records: the simplest solution is to include a version number at the beginning of every encoded record, and to keep a list of schema versions in your database (a sketch of this approach follows this list).
  • Sending records over a network connection: the reader and the writer negotiate the schema version on connection setup and then use that schema for the lifetime of the connection.
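
A hedged sketch of that second option (the names here are invented for illustration): each encoded record is prefixed with a fixed-width schema version number, and a registry maps version numbers to the writer’s schema that was current when the record was written.

import struct

# Hypothetical registry: version number -> the schema in force when it was written.
schema_registry = {
    1: {"fields": ["userName"]},
    2: {"fields": ["userName", "favoriteNumber"]},
}

def encode_with_version(version, payload):
    """Prefix the encoded record with a 4-byte big-endian schema version."""
    return struct.pack(">I", version) + payload

def decode_with_version(data):
    """Look up the writer's schema from the prefix, then decode the rest with it."""
    version = struct.unpack_from(">I", data)[0]
    writer_schema = schema_registry[version]
    return writer_schema, data[4:]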

The Merits of Schemas

Binary encodings have a number of nice properties:

  • They can be much more compact than the various “binary JSON” variants, since they can omit field names from the encoded data.
  • The schema is a valuable form of documentation, and because the schema is required for decoding, you can be sure that it is up to date.
  • Keeping a database of schemas allows you to check forward and backward compatibility of schema changes.
  • For users of statically typed programming languages, the ability to generate code from the schema is useful, since it enables type checking at compile time.

Modes of Dataflow

Dataflow Through Databases

In a database, the process that writes to the database encodes the data, and the process that reads from the database decodes it. Backward compatibility and forward compatibility are both required for databases.

Dataflow Through Services: REST and RPC

It is reasonable to assume that all the servers will be updated first, and all the clients second. Thus, you only need backward compatibility on requests, and forward compatibility on responses. The backward and forward compatibility properties of an RPC scheme are inherited from whatever encoding it uses:

  • Thrift, gRPC and Avro RPC can be evolved according to the compatibility rules of the respective encoding format.
  • In SOAP, requests and responses are specified with XML schemas. These can be evolved, but there are some subtle pitfalls.
  • RESTful APIs most commonly use JSON for responses, and JSON or URI-encoded/form-encoded request parameters for requests. Adding optional request parameters and adding new fields to response objects are usually considered changes that maintain compatibility.

Message-Passing Dataflow

Using a message broker has several advantages compared to direct RPC:

  • It can act as a buffer if the recipient is unavailable or overloaded, and thus improve system reliability.
  • It can automatically redeliver messages to a process that has crashed, and thus prevent messages from being lost.
  • It avoids the sender needing to know the IP address and port number of the recipient.
  • It allows one message to be sent to several recipients.
  • It logically decouples the sender from the recipient.

A difference compared to RPC is that message-passing communication is usually one-way: a sender normally doesn’t expect to receive a reply to its messages.

Message brokers are used as follows: one process sends a message to a named queue or topic, and the broker ensures that the message is delivered to one or more consumers of or subscribers to that queue or topic. There can be many producers and many consumers on the same topic.

The actor model is a programming model for concurrency in a single process. Each actor typically represents one client or entity; it may have some local state (which is not shared with any other actor), and it communicates with other actors by sending and receiving asynchronous messages. Message delivery is not guaranteed: in certain error scenarios, messages will be lost. Since each actor processes only one message at a time, it doesn’t need to worry about threads, and each actor can be scheduled independently by the framework.
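
A minimal single-process actor sketch (an illustration, not any real actor framework): the actor owns its state, takes one message at a time from its mailbox, and can only be reached by sending it asynchronous messages.

import queue
import threading

class CounterActor:
    def __init__(self):
        self.mailbox = queue.Queue()      # asynchronous message delivery
        self.count = 0                    # local state, never shared directly
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, message):
        self.mailbox.put(message)         # senders never touch the actor's state

    def _run(self):
        while True:
            message = self.mailbox.get()  # one message at a time: no locks needed
            if message == "increment":
                self.count += 1

actor = CounterActor()
actor.send("increment")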

A distributed actor framework essentially integrates a message broker and the actor programming model into a single framework. In distributed actor frameworks, this programming model is used to scale an application across multiple nodes. The same message-passing mechanism is used, no matter whether the sender and recipient are on the same node or different nodes.
