Simplified database consistency, Part 1: Obligatory CAP Theorem musings

Brewer’s Conjecture was formalized as the CAP theorem when this paper was published. It’s a good read. In this blog entry, I will talk about my philosophical take on the CAP theorem.

Let’s break down the three letters of the CAP Theorem.

  • Consistency – This represents the “correctness” of a database.
  • Availability – This represents the ability of a database to serve requests in a performant manner.
  • Partition Tolerance – This represents the ability of a database to operate with node or net failure.

The CAP theorem is commonly stated as “pick any two, you can’t have the third one.” In reality, there’s a tension between all points of the CAP triangle, and each edge in the triangle is a nebulous continuum. First, let’s break down what “consistency” really means.

The defining standard of a consistent database is ACID. I will restate it slightly differently for programmers who may not be database architects:

A database which follows ACID constructs will have the same state as a database that has a global read-write lock on the entire system, with the property that any unexpected write failure results in rollback of all changes before the write lock is released.

(This isn’t quite accurate, but is good enough for this blog.)

Obviously, we can’t have a global read-write lock in most scenarios: we’d be sacrificing too much availability. So, how do we achieve the appearance of a single global read-write lock while serving all requests fast enough? My contention is: you can’t. If it walks like a duck, if it quacks like a duck — then it’s a duck. The only way to achieve highest consistency levels is to have a global read-write lock or something just as inefficient. Therein lies the most obvious A-C tradeoff.

From there, the tension between A-C is a continuum which I would state as: “the more the system has to synchronize on, the more availability that is traded off.” NoSQL databases often describe themselves as fully “AP” or fully “CP”; in practice, “CP” and “AP” are meaningless descriptions, and it is intractable for a database to achieve either full availability or full consistency.

To simplistically demonstrate this continuum, consider these following assertions:

  • If a transaction relies on record-level consistency, then record locks are needed.
  • If a transaction relies on a range of rows being consistent, then multiple record locks are needed.
  • If a transaction relies on a table being consistent, then table locks are needed.
  • If a transaction relies on multiple tables being consistent, then multi-table locks are needed.

In addition to the above, if any of these applies over a distributed system, then there has to be some sort of global lock/synchronization/consensus dance which is hard to execute efficiently. Even the Paxos algorithm relies on the existence of some global monotonically increasing sequence to be perfect, which brings its own set of tradeoffs.

There are ways to push back on having actual write locks, usually at the expense of space: for instance, it’s possible to avoid non-repeatable reads without acquiring a table lock in a NSM format database (that is, append-only records; every update is a tombstone followed by an append of a new record) simply by tagging records with a timestamp, then note the current timestamp at the start of a read, and not reading any records with a timestamp above the current timestamp. But these pushbacks are effective only to a point; we don’t want to be creating a copy of a table to achieve table-level consistency, for instance.

What about the “P” component, partition tolerance? If one thinks of “P” as not just network partitions, but as network latency, out of order messages, and so on, then it becomes apparent that P cannot be code-sacrificed in a distributed system, simply because partitions are always ongoing to some degree or another in a distributed system. The only databases which can be honestly described as anywhere near “AC” are traditional RDBMSes.

One might be tempted to think of NoSQL databases such as Cassandra as pushing the end of the AP triangle. But there’s more to the rabbit hole. Column-oriented stores are the true extreme end of the AP triangle, breaking down record-level consistency to element-level consistency. More on this in Part II.

Posted in Databases | Tagged , | Leave a comment

Simplified database consistency, Part 0: Not-so-simplified reference tables

A proper Database Theory 101 class would begin with the CAP Theorem – that it is impossible to have consistency, availability, and partition tolerance in the same database. This is a first in a blog series explaining the CAP theorem and all its nuances, and how it applies to distributed databases.

Part 0 starts out with reference material with minimal explanation, because I spent a lot of time putting together definitions from various sources to one place, and I’d like it in the cloud instead of my aging stack of filled notebooks.

The real Part I will go back to the basics: the CAP theorem and how it has been proven.

Database anomalies (sometimes called “phenomena”), which I sourced from various points:

Anomaly Name Description
Dirty Read Transaction T1 can read uncommitted data from a different transaction T2.
Non-Repeatable Read / Fuzzy Read Two reads within the same transaction T1 are not guaranteed to contain the same data within rows due to transaction T2 having been executed a write to these rows in between.
Phantom Read Two queries within the same transaction T1 return different rows due to transaction T2 having inserted or deleted rows in between.
Read Skew T1 reads a row x. T2 then modifies rows x and y, and commits. T1 then reads modified y, which may have been dependent on a row x that had not been committed at a time. (Classic example: calculating bank balances)
Dirty Write Two transactions change the same record at the same time.
Lost Updates Two transactions update the same row. The transaction that did the update first is now lost and was never seen by the other transaction.
Write Skew Two transactions concurrently read an overlapping data set, make disjoint updates, and then commit without seeing updates. For example: A bank has a rule that the combined balance of an user’s accounts must be 0 or greater. An user with 2 accounts has a balance of $100 each for a total of $200. With write skew, the user can initiate simultaneous withdrawals of $200 from each account, withdrawing $400 total, which is not detected because each individual transaction thinks the net balance is $200.
Real-Time Violation Transaction T1 sees transaction t2, but not necessarily all of the transactions that have occurred before t2. Applicable to a replicated distributed database.
Forward Freshness Applicable to replicated distributed databases, T1 is started before T2. T2 is considered committed by the database, and T1 is allowed to read data from T2, meaning that T1 does not actually read off a “snapshot” at the time it started. Opacity is lost.

Table of isolation levels defined by ANSI/ISO SQL:

Isolation Level Locks Acquired Disallowed Phenomenons
READ UNCOMMITTED Write lock on record only for duration of operation N/A
READ COMMITTED Write lock on record for duration of transaction Dirty reads
REPEATABLE READ Both read and write locks acquired on affected rows for duration of transaction Non-repeatable reads
SERIALIZABLE Range locks for duration of transaction Phantom reads


The next two tables are mostly from this paper which was quite influential in my old company. Extended consistency levels in distributed databases:

Continue reading »

Posted in Databases | Tagged , | Leave a comment