Database page layouts

Just about every major database deals with memory in terms of pages. “Page” here refers to a continuous block of memory of fixed size that is typically a multiple of Linux kernel page size (historically, 4096B). Rows can be arranged within a page in various ways.

The granddaddy of all database page layouts is the NSM format, which most traditional RDBMSes use. Rows are appended sequentially to a page. Updates to a row are usually done by marking a tombstoned bit (or setting some snapshot identifier), following by appending the updated row.

NSM page format.

NSM page format.

  • Pros: Fastest write performance of all page layouts — simple sequential append to a logical file. Easy to access the whole row and return it. Lots of techniques to index the row itself exist.
  • Cons: Not easy to project on variable-length rows. SIMD performance not accessible on columns.

The traditional alternative to NSM is the DSM layout, which is the classical “column-oriented” database page layout. Individual elements within a row are teased out, and are grouped by their own columns.

Column-oriented DSM page layout.

Column-oriented DSM page layout.

It’s my belief that column-oriented databases sound amazing when one first hears of them, but think through them critically, and they’re not as amazing. Their classical advantage is in query performance — just slurp up a bunch of fixed-length elements while reading, maybe even load them to a SIMD register and perform calculations in parallel.

Except that… the overhead in splitting them up, storing them separately, and stitching columns back together is significant, especially when taking consistency in consideration. Whereas a row is the atomic block of consistency in NSM databases, the individual element is the atomic block in column-oriented stores.

The query performance advantage only applies to a narrow set of queries. It is non-trivial to do projections on columns that involve filtering on a separate column. It’s not surprising that HBase, probably the most widely used column-oriented database, advises against creating more than 2-4 “column families” (their terminology for separate groupings of elements). In short, the advantage of DSM is limited to a narrow set of queries that arguably do not see a lot of real world value.

  • Pros: Fast theoretical query & calculation performance. Possible improved update I/O performance due to not having to write the whole row on update.
  • Cons: Significant overhead in writes and reads of whole rows; consistency will add more overhead to queries due to a lower granularity to track. Best-case query & project scenarios are not necessarily common-case scenarios.

The PAX page layout groups rows by their constituent elements, on the same page.

PAX page layout.

  • Pros: Eliminates most of the major disadvantages of the DSM model.
  • Cons: Challenging implementation. Some disadvantages persist, they just don’t require hopping pages. Real-world performance data lacking.

In addition to the above three formats, an endless variety exist. Fractured mirrors take advantage of replication to combine multiple page layouts and to try and have the best of all worlds. NoSQL databases typically implement some variety of the above, or use their own format. HBase, for instance, records its updates to a WAL and otherwise keeps its rows in a memory cache, and flushes batches to an “immutable” logical file sorted by row identifier.

Posted in Databases | Tagged | Leave a comment

Simplified database consistency, Part 1: Obligatory CAP Theorem musings

Brewer’s Conjecture was formalized as the CAP theorem when this paper was published. It’s a good read. In this blog entry, I will talk about my philosophical take on the CAP theorem.

Let’s break down the three letters of the CAP Theorem.

  • Consistency – This represents the “correctness” of a database.
  • Availability – This represents the ability of a database to serve requests in a performant manner.
  • Partition Tolerance – This represents the ability of a database to operate with node or net failure.

The CAP theorem is commonly stated as “pick any two, you can’t have the third one.” In reality, there’s a tension between all points of the CAP triangle, and each edge in the triangle is a nebulous continuum. First, let’s break down what “consistency” really means.

The defining standard of a consistent database is ACID. I will restate it slightly differently for programmers who may not be database architects:

A database which follows ACID constructs will have the same state as a database that has a global read-write lock on the entire system, with the property that any unexpected write failure results in rollback of all changes before the write lock is released.

(This isn’t quite accurate, but is good enough for this blog.)

Obviously, we can’t have a global read-write lock in most scenarios: we’d be sacrificing too much availability. So, how do we achieve the appearance of a single global read-write lock while serving all requests fast enough? My contention is: you can’t. If it walks like a duck, if it quacks like a duck — then it’s a duck. The only way to achieve highest consistency levels is to have a global read-write lock or something just as inefficient. Therein lies the most obvious A-C tradeoff.

From there, the tension between A-C is a continuum which I would state as: “the more the system has to synchronize on, the more availability that is traded off.” NoSQL databases often describe themselves as fully “AP” or fully “CP”; in practice, “CP” and “AP” are meaningless descriptions, and it is intractable for a database to achieve either full availability or full consistency.

To simplistically demonstrate this continuum, consider these following assertions:

  • If a transaction relies on record-level consistency, then record locks are needed.
  • If a transaction relies on a range of rows being consistent, then multiple record locks are needed.
  • If a transaction relies on a table being consistent, then table locks are needed.
  • If a transaction relies on multiple tables being consistent, then multi-table locks are needed.

In addition to the above, if any of these applies over a distributed system, then there has to be some sort of global lock/synchronization/consensus dance which is hard to execute efficiently. Even the Paxos algorithm relies on the existence of some global monotonically increasing sequence to be perfect, which brings its own set of tradeoffs.

There are ways to push back on having actual write locks, usually at the expense of space: for instance, it’s possible to avoid non-repeatable reads without acquiring a table lock in a NSM format database (that is, append-only records; every update is a tombstone followed by an append of a new record) simply by tagging records with a timestamp, then note the current timestamp at the start of a read, and not reading any records with a timestamp above the current timestamp. But these pushbacks are effective only to a point; we don’t want to be creating a copy of a table to achieve table-level consistency, for instance.

What about the “P” component, partition tolerance? If one thinks of “P” as not just network partitions, but as network latency, out of order messages, and so on, then it becomes apparent that P cannot be code-sacrificed in a distributed system, simply because partitions are always ongoing to some degree or another in a distributed system. The only databases which can be honestly described as anywhere near “AC” are traditional RDBMSes.

One might be tempted to think of NoSQL databases such as Cassandra as pushing the end of the AP triangle. But there’s more to the rabbit hole. Column-oriented stores are the true extreme end of the AP triangle, breaking down record-level consistency to element-level consistency. More on this in Part II.

Posted in Databases | Tagged , | Leave a comment

Simplified database consistency, Part 0: Not-so-simplified reference tables

A proper Database Theory 101 class would begin with the CAP Theorem – that it is impossible to have consistency, availability, and partition tolerance in the same database. This is a first in a blog series explaining the CAP theorem and all its nuances, and how it applies to distributed databases.

Part 0 starts out with reference material with minimal explanation, because I spent a lot of time putting together definitions from various sources to one place, and I’d like it in the cloud instead of my aging stack of filled notebooks.

The real Part I will go back to the basics: the CAP theorem and how it has been proven.

Database anomalies (sometimes called “phenomena”), which I sourced from various points:

Anomaly Name Description
Dirty Read Transaction T1 can read uncommitted data from a different transaction T2.
Non-Repeatable Read / Fuzzy Read Two reads within the same transaction T1 are not guaranteed to contain the same data within rows due to transaction T2 having been executed a write to these rows in between.
Phantom Read Two queries within the same transaction T1 return different rows due to transaction T2 having inserted or deleted rows in between.
Read Skew T1 reads a row x. T2 then modifies rows x and y, and commits. T1 then reads modified y, which may have been dependent on a row x that had not been committed at a time. (Classic example: calculating bank balances)
Dirty Write Two transactions change the same record at the same time.
Lost Updates Two transactions update the same row. The transaction that did the update first is now lost and was never seen by the other transaction.
Write Skew Two transactions concurrently read an overlapping data set, make disjoint updates, and then commit without seeing updates. For example: A bank has a rule that the combined balance of an user’s accounts must be 0 or greater. An user with 2 accounts has a balance of $100 each for a total of $200. With write skew, the user can initiate simultaneous withdrawals of $200 from each account, withdrawing $400 total, which is not detected because each individual transaction thinks the net balance is $200.
Real-Time Violation Transaction T1 sees transaction t2, but not necessarily all of the transactions that have occurred before t2. Applicable to a replicated distributed database.
Forward Freshness Applicable to replicated distributed databases, T1 is started before T2. T2 is considered committed by the database, and T1 is allowed to read data from T2, meaning that T1 does not actually read off a “snapshot” at the time it started. Opacity is lost.

Table of isolation levels defined by ANSI/ISO SQL:

Isolation Level Locks Acquired Disallowed Phenomenons
READ UNCOMMITTED Write lock on record only for duration of operation N/A
READ COMMITTED Write lock on record for duration of transaction Dirty reads
REPEATABLE READ Both read and write locks acquired on affected rows for duration of transaction Non-repeatable reads
SERIALIZABLE Range locks for duration of transaction Phantom reads


The next two tables are mostly from this paper which was quite influential in my old company. Extended consistency levels in distributed databases:

Continue reading »

Posted in Databases | Tagged , | Leave a comment