Summary

The book begins with a quick look at the characteristics of systems (reliability, maintainability…) to establish a common vocabulary. This part assumes a single machine.

Then it jumps into the different data models, from relational to graph. Although it’s brief, it’s an interesting look at their essence and the goals each one pursues. For example, it looks at how convenient each model is for representing many-to-many relationships, a need that has driven the emergence of new models.

Data models without persistence might be interesting from a philosophical point of view, but not from a practical one, so storage must be studied. The third chapter explains OLTP vs. OLAP systems. Within OLTP systems there’s a great analysis of log-structured vs. B-tree-based storage engines.
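To make the log-structured idea concrete, here’s a minimal sketch in the spirit of the book’s simplest-possible database, with names and details invented for illustration: every write appends a line to a file, and an in-memory hash index remembers the byte offset of the latest value for each key.

```python
import os

class LogStore:
    """Toy log-structured key-value store (illustration only)."""
    def __init__(self, path="data.log"):
        self.path = path
        self.index = {}                      # key -> byte offset of its latest entry
        open(self.path, "ab").close()        # make sure the log file exists

    def set(self, key, value):
        line = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            offset = f.tell()                # append-only: the new entry goes at the end
            f.write(line)
        self.index[key] = offset             # the index always points at the newest write

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)
            _key, _, value = f.readline().decode("utf-8").rstrip("\n").partition(",")
            return value

store = LogStore()
store.set("user:42", "Alice")
store.set("user:42", "Alicia")   # the old entry stays in the log; only the index moves on
print(store.get("user:42"))      # -> Alicia
```

Real engines build on this basic idea with segment compaction, crash recovery of the index, and either SSTables/LSM-trees or B-trees.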

This part continues one level of detail lower: encoding. It compares several approaches, from language-dependent serialization to textual formats (CSV, JSON…) and binary, schema-based formats (Protocol Buffers, Avro…). It ends with a summary of data flows and the communication mechanisms involved (databases, RPC, REST, the actor model…).
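As a toy illustration of why binary, schema-based formats beat textual ones on size (this is not actual Protocol Buffers or Avro encoding, just a hand-rolled struct with an assumed schema):

```python
import json
import struct

record = {"user_id": 1337, "is_admin": False, "name": "Martin"}

# Textual encoding: self-describing, so field names are repeated in every record.
json_bytes = json.dumps(record).encode("utf-8")

# Assumed binary schema: a 64-bit user_id, a 1-byte boolean, then a
# length-prefixed UTF-8 name. Field names never appear on the wire.
name = record["name"].encode("utf-8")
binary_bytes = struct.pack(f">qBH{len(name)}s",
                           record["user_id"], record["is_admin"], len(name), name)

print(len(json_bytes), len(binary_bytes))  # the binary record is roughly 3x smaller here
```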

Part II: Distributed Data

The first step is replication, driven by the need for availability, lower latency, scalability, or disconnected operation. The main approaches are single-leader, multi-leader, and leaderless replication, and it can be synchronous or asynchronous. Replication lag is a challenge for guarantees such as read-after-write consistency, monotonic reads, and consistent prefix reads.
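Here’s a minimal sketch of one common read-after-write technique, with toy in-memory replicas standing in for real database nodes: reads normally go to a follower, but a user who wrote recently is served from the leader so they cannot miss their own write because of lag.

```python
import random
import time

class Replica:
    """Toy in-memory replica; in reality this would be a separate database node."""
    def __init__(self):
        self.data = {}
    def write(self, key, value):
        self.data[key] = value
    def read(self, key):
        return self.data.get(key)

REPLICATION_LAG_GRACE = 5.0   # assumed upper bound on replication lag, in seconds
last_write_at = {}            # user_id -> time of that user's last write

def write(leader, user_id, value):
    leader.write(user_id, value)
    last_write_at[user_id] = time.monotonic()

def read(leader, followers, user_id):
    # Read-your-writes: if this user wrote within the grace window, a follower
    # might not have caught up yet, so serve the read from the leader instead.
    recently_wrote = time.monotonic() - last_write_at.get(user_id, float("-inf")) < REPLICATION_LAG_GRACE
    replica = leader if recently_wrote else random.choice(followers)
    return replica.read(user_id)

leader, followers = Replica(), [Replica(), Replica()]
write(leader, "user:1", {"name": "Alice"})
print(read(leader, followers, "user:1"))   # served by the leader: sees the fresh write
```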

For scalability, the main tool might be partitioning. The two main approaches are key-range partitioning and hash partitioning. For secondary indexes, there are document-partitioned indexes and term-partitioned indexes. And you need a way to route requests: either the clients know the partition assignment, a routing tier does, or the nodes themselves do.
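A toy illustration of the two partitioning schemes, with made-up boundaries and partition counts:

```python
import bisect
import hashlib

# Key-range partitioning: sorted boundary keys; related keys stay together,
# which keeps range scans cheap but can create hot spots.
BOUNDARIES = ["g", "n", "t"]   # partition 0: keys < "g", 1: < "n", 2: < "t", 3: the rest

def range_partition(key: str) -> int:
    return bisect.bisect_right(BOUNDARIES, key)

# Hash partitioning: a hash spreads keys (and load) evenly across partitions,
# at the cost of losing efficient range queries on the original key.
NUM_PARTITIONS = 4

def hash_partition(key: str) -> int:
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

print(range_partition("alice"), range_partition("bob"))  # neighbouring keys land together
print(hash_partition("alice"), hash_partition("bob"))    # spread pseudo-randomly
```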

Transactions try to make concurrency easier. The transaction chapter describes the whole set of issues that can occur (dirty reads, dirty writes, read skew, lost updates, write skew, phantom reads) and the different isolation levels with the guarantees they provide (read committed, snapshot isolation, serializable). It ends with three different approaches to serializability (actual serial execution, two-phase locking, and serializable snapshot isolation).
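For instance, a lost update is just an unprotected read-modify-write. The sketch below uses an in-memory counter and a lock to stand in for what a database gives you through atomic updates, explicit locking, or compare-and-set:

```python
import threading

counter = {"likes": 0}
lock = threading.Lock()

def unsafe_increment():
    # Unprotected read-modify-write: two concurrent calls can read the same
    # value, and the later write silently clobbers the earlier one.
    current = counter["likes"]
    counter["likes"] = current + 1

def safe_increment():
    # The same operation made atomic. A database offers the equivalent via
    # atomic updates (e.g. UPDATE ... SET x = x + 1), locks, or compare-and-set.
    with lock:
        counter["likes"] += 1

threads = [threading.Thread(target=safe_increment) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter["likes"])   # always 100 with the safe version
```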

In distributed environments, a whole range of problems may occur, from network failures to pauses. Systems must be prepared to detect and tolerate them for reliability.

The consistency and consensus chapter starts with linearizability, which is appealing but slow. A looser guarantee is causality, which requires less coordination and tolerates network issues better. But not everything can be implemented with causal guarantees alone, and consensus algorithms lie in that middle ground. Consensus, i.e., getting several nodes to agree on something, is very useful for many tasks and guarantees, but it becomes really hard when you need fault tolerance, and it always comes at a cost.

Part III: Derived Data

UNIX tools are a great example of batch processing tools because their simplicity and composability make them really powerful; they’re limited to single machines, though. MapReduce can be seen as the UNIX tools of a cluster: small jobs over a distributed filesystem, composed into workflows (but with materialized intermediate state). Distributed batch processing must solve partitioning and fault tolerance, and implement join mechanisms.
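The classic word count shows the MapReduce shape (map, shuffle by key, reduce) in plain single-process Python; a real framework runs the map and reduce tasks across a cluster over a distributed filesystem and materializes the intermediate state between jobs:

```python
from collections import defaultdict

def map_phase(lines):
    for line in lines:                       # each mapper sees a chunk of the input
        for word in line.split():
            yield (word.lower(), 1)          # emit key-value pairs

def shuffle(pairs):
    grouped = defaultdict(list)              # the framework's sort/shuffle step:
    for key, value in pairs:                 # bring all values for a key together
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(lines))))  # {'the': 3, 'fox': 2, 'quick': 1, ...}
```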

Stream processing is similar to batch processing, but over unbounded input. There are two types of message brokers: AMQP/JMS-style (with producers/consumers and explicit acknowledgment of each processed message) and log-based (a consumer reads all the messages in a partition, keeping track of an offset). The log-based approach has clear similarities with the database replication log. You could even plug a change data capture system on top of it to turn a database into an event stream.
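A minimal sketch of the log-based idea: the broker appends every message to a partition log and never deletes it on consumption; each consumer simply tracks the offset it has read up to, so it can resume or replay at will (names are invented for illustration):

```python
class PartitionLog:
    """One partition of a log-based broker: an append-only list of messages."""
    def __init__(self):
        self.messages = []

    def append(self, message):
        self.messages.append(message)
        return len(self.messages) - 1        # offset of the newly appended message

class Consumer:
    """A consumer owns nothing but its offset into the partition log."""
    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset

    def poll(self):
        batch = self.log.messages[self.offset:]
        self.offset += len(batch)            # "acknowledging" is just advancing the offset
        return batch

log = PartitionLog()
for event in ["user_created", "cart_updated", "order_placed"]:
    log.append(event)

consumer = Consumer(log)
print(consumer.poll())   # ['user_created', 'cart_updated', 'order_placed']
log.append("payment_received")
print(consumer.poll())   # ['payment_received'] -- nothing was deleted, only the offset moved
```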

The final chapter presents the author’s vision of the future of data systems. As there’s no chance for a one-size-fits-all solution, we come to the realization that some data integration is mandatory, and batch processing and event streams help with that. Thinking of those dataflows as transformations, the author proposes unbundling the components of a database and building an application by composing those loosely coupled components. Still, we must keep an eye on the balance between fault tolerance and performance. And, finally, there’s a section about data ethics.

Depending on your knowledge and expertise, you’ll find more or less new stuff in this book. For example, with a little database background you’ll probably know most of Part I, but even in these foundational chapters there are some gems, such as the in-depth explanation of serialization formats, including Protocol Buffers, Thrift, and Avro. I find these chapters useful even if they don’t provide a lot of new information, because they help in remembering details and trade-offs. Comparing overlapping technologies and designs is always useful.

Despite the title, it’s also a must-read if you’re interested in distributed systems, since Internet-scale data needs distributed databases. The book spends many pages on distributed systems failures, consensus, and so on.

While it’s not a book about specific technologies, it mentions many state-of-the-art ones, with context and an explanation of the differences. You can learn why ZooKeeper seems omnipresent, or the different guarantees provided by different distributed databases.

Although it’s mostly abstract and generic, it also covers real scenarios and tooling best practices, such as “how do people load the output of a MapReduce job into a database?”.

References! I mean… REFERENCES! There are chapters with more than 100 references. You might be interested in reading all of them if, let’s say, you’re preparing for a position at Google, but otherwise it’s overwhelming. I will certainly go back to the book from time to time to recall some topics, and that will be a great moment to go through the references :) In fact, the heavy use of internal references throughout the book makes me think the author is aware of this usage pattern ;)

It does a remarkable job of capturing the essence of the underlying common characteristics. The climax of this might be the parallel between creating an index in a database and creating a new follower replica, but there are many other examples. The last chapter about the future leverages that into a broader vision of data systems.

As a guide for future work and application design, the idea of unbundling the database is really powerful, and it feels like the natural consequence of the author’s effort to distill the essence of things. He digs into the underlying characteristics and the problems that features solve, and proposes a loosely coupled design for applications. Something worth exploring!

To close, he writes down his thoughts about data ethics. There are many relevant ideas and memorable sentences; I’ll just pick one:

A blind belief in the supremacy of data for making decisions is not only delusional, it is positively dangerous

IMHO this book is a modern classic, a must-read for every software engineer and developer. I’m certain I will reread it from time to time.

Notes

Below is my collection of notes. It’s not a summary at all, but a mix of highlights, definitions, important stuff, and things I didn’t know.

Part I: Foundations of Data Systems

1. Reliable, Scalable, and Maintainable Applications

(…) there are datastores that are also used as message queues (Redis), and there are message queues with database-like durability guarantees (Apache Kafka). The boundaries between the categories are becoming blurred.

Reliability: the system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).

Scalability: as the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.

Maintainability: Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.

Reliability

Note that a fault is not the same as a failure. A fault is (…) one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.

there is a move toward systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in preference or in addition to hardware redundancy.

Scalability

Describing load

we need to succinctly describe the current load on the system; only then can we discuss growth questions. Load can be described with a few numbers which we call load parameters. The best choice of parameters depends on the architecture of your system: it may be requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, the hit rate on a cache, or something else.

Describing performance

In a batch processing system such as Hadoop, we usually care about throughput —the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size. In online systems, what’s usually more important is the service’s response time—that is, the time between a client sending a request and receiving a response.

Latency and response time are often used synonymously, but they are not the same. The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays. Latency is the duration that a request is waiting to be handled—during which it is latent, awaiting service.
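A toy single-server queue makes the distinction concrete: each request’s response time is its waiting (latent) time plus its service time, so one slow request inflates the response times of the requests queued behind it (numbers are made up, and network delays are ignored):

```python
service_times = [0.10, 0.10, 0.50, 0.10]    # seconds of work each request needs

server_free_at = 0.0
for i, service_time in enumerate(service_times):
    arrival = i * 0.05                       # a new request arrives every 50 ms
    start = max(arrival, server_free_at)     # it may have to queue behind earlier work
    server_free_at = start + service_time
    waiting = start - arrival                # latency in the strict sense: time spent waiting
    response_time = waiting + service_time   # what the client experiences (minus network delays)
    print(f"req {i}: waited {waiting:.2f}s, served in {service_time:.2f}s, "
          f"response time {response_time:.2f}s")
```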

Approaches for coping with load

Scaling up vs. scaling out

It is conceivable that distributed data systems will become the default in the future, even for use cases that don’t handle large volumes of data or traffic.

Maintainability

Operability + Simplicity + Evolvability

2. Data Models and Query Languages

The network, hierarchical and document models