Data Stuff: Storage Wars

Ellison and Amazon.

In the ongoing war of words between Amazon and Larry Ellison, Amazon published best practices for migrating an Oracle database to Amazon RDS, and it’s chock full of saucy details. For instance, I wasn’t aware of the “Amazon Database Freedom Program”, which just gives free AWS credits to anyone ‘migrating from commercial engines’. “Are you operating with old world databases”, a slide asks, with hilarious icons that make it very clear who they’re talking about. It’s quite incredible the number and assortment of tools that Amazon has built to “free” people from their Oracle installations. Take a look.

In a more eyebrow-raising move, Amazon Distinguished Engineer James Hamilton threw some more direct shade at Oracle, calling out Ellison for his constant lying about Amazon’s use of Oracle technologies. And there’s Andy Jassy’s tweet, which I’m just going to quote here in full:

In latest episode of "uh huh, keep talkin' Larry," Amazon’s Consumer business turned off its Oracle data warehouse Nov 1 and moved to Redshift. By end of 2018, they'll have 88% of their Oracle DBs (and 97% of critical system DBs) moved to Aurora and DynamoDB. #DBFreedom

You’ll note that with this much shade and anger, that number is still 88%. I expect that Larry will be auditing Amazon this December.

Storage.

If you’re building a new database, what are the criteria you’ll use to pick your storage layer? The textbook answer is that there’s a handful of questions you’d ask yourself:

  • What workloads do I need to support? The common wisdom is “B-Tree: fast reads, slow writes; LSM tree: fast writes, slow reads.” This is a little sketchy, because there’s a huge variety of possible compaction algorithms, which give you different flavors of read and write speed under different workloads (for instance, if your writes are bursty, the higher write amplification of an LSM doesn’t hurt you much, since compactions happen asynchronously).
  • How much do I care about write amplification? Typically LSMs have much worse write amplification than B-Trees due to the frequency of compactions, but again, you get to control when these happen.
  • Are we comparing to an in-place B-Tree (like InnoDB) or a copy-on-write B-Tree (which, as a nice benefit, provides Snapshot Isolation for free)? Importantly, what kinds of isolation guarantees are we relying on the storage layer to provide? LSMs also have the nice copy-on-write property, since compactions are inherently asynchronous, but if we’re relying on our storage engine to provide this, we need to compare strictly against copy-on-write B-Trees.
  • How much do I care about space amplification? B-Trees suffer from more fragmentation (due to unfilled leaf nodes), whereas LSM trees can trade off write amplification for space amplification by… further tweaking the compaction algorithm.
  • LSM-based storage engines are a very rich design space, because you can tweak your compaction algorithm without changing the underlying file and disk formats! But this also means that any fair comparison between B-Trees and LSMs requires looking at the correct comparable LSM compaction algorithm (a back-of-envelope sketch of this trade-off follows this list).
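
To put rough numbers on that last point, here’s a back-of-envelope sketch. It’s not tied to any particular engine, and every number in it is made up for illustration; the point is only how the knobs move. Under the usual rough approximations, leveled compaction rewrites a key about `fanout` times at each level, while tiered compaction rewrites it about once per level but tolerates several overlapping runs at a time.

    // Back-of-envelope LSM compaction math. All inputs are hypothetical;
    // only the direction of the trade-off matters, not the exact values.
    #include <cmath>
    #include <cstdio>

    int main() {
      const double data_gb     = 256.0;   // total logical data
      const double memtable_gb = 0.064;   // ~64 MB flushed per memtable
      const double fanout      = 10.0;    // size ratio between adjacent levels

      // Number of levels needed to hold the data at this fanout.
      const double levels =
          std::ceil(std::log(data_gb / memtable_gb) / std::log(fanout));

      // Leveled: a key is rewritten ~fanout times at each level on its way
      // down, but at most ~1/fanout of the data lives outside the last level.
      const double leveled_write_amp = fanout * levels;
      const double leveled_space_amp = 1.0 + 1.0 / fanout;

      // Tiered: a key is rewritten ~once per level, but up to ~fanout
      // overlapping runs can pile up on a level before they get merged.
      const double tiered_write_amp = levels;
      const double tiered_space_amp = fanout;  // worst case, pre-merge

      std::printf("levels: ~%.0f\n", levels);
      std::printf("leveled: write amp ~%.0f, space amp ~%.2fx\n",
                  leveled_write_amp, leveled_space_amp);
      std::printf("tiered:  write amp ~%.0f, space amp up to ~%.0fx\n",
                  tiered_write_amp, tiered_space_amp);
      return 0;
    }

Same data, same file formats; the only thing that changed is the compaction policy, and the two amplification numbers move by roughly an order of magnitude in opposite directions.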

The careful CTO will thoughtfully examine each of these trade-offs, discuss with stakeholders, run meticulous performance benchmarks, … and then choose RocksDB.

Look, I’m not even aware of any remotely mature competitors to RocksDB, and certainly none can match the trustworthiness of being in production at Facebook and, more generally, in widespread use by other databases. It supports a ridiculously long list of features. Every now and then somebody comes to my insect-themed database and asks if we can swap out RocksDB for another storage engine, and we start to enumerate the list of reasons why not… and quickly give up, because even 10% of the list is enough to make the point. Basically, if you’re designing a new database, there are so many other factors to keep in mind that RocksDB sort of takes the role of IBM: nobody’s database ever failed because they chose RocksDB.

We haven’t even discussed more sophisticated data structures like Bw-Trees. But fundamentally, if you want to build a B-Tree-based storage engine, all that engineering is on you. As a database vendor, you’ve got four main worries: A, C, I, and D. Pick RocksDB and you’re basically getting the A and D for free, and your worries are down to C and I. It’s not a difficult choice.
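
For a sense of what “A and D for free” looks like in practice, here’s a minimal sketch against RocksDB’s C++ API (the path and keys are made up): a WriteBatch applies atomically, and setting sync on the write options means the call doesn’t return until the write-ahead log has actually been flushed to disk.

    #include <cassert>
    #include <string>
    #include "rocksdb/db.h"
    #include "rocksdb/write_batch.h"

    int main() {
      rocksdb::DB* db;
      rocksdb::Options options;
      options.create_if_missing = true;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/example-db", &db);
      assert(s.ok());

      // Atomicity: everything in the batch is applied together or not at all.
      rocksdb::WriteBatch batch;
      batch.Put("account/alice", "90");
      batch.Put("account/bob", "110");

      // Durability: sync=true forces the WAL to disk before Write() returns.
      rocksdb::WriteOptions wopts;
      wopts.sync = true;
      s = db->Write(wopts, &batch);
      assert(s.ok());

      std::string value;
      s = db->Get(rocksdb::ReadOptions(), "account/alice", &value);
      assert(s.ok() && value == "90");

      delete db;
      return 0;
    }

Consistency and isolation are still your problem (which keys end up in the batch, and what readers are allowed to see while it’s in flight), but the crash-safety half of the acronym ships in the box.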

There’s a general principle at play here, which is that as a software component gets more complex, the reasons for using it shift from technical to social.

FoundationDB.

FoundationDB’s new release includes features for multi-region deployments:

FoundationDB 6.0 introduces native multi-region support to dramatically increase your database's global availability. Seamless failover between regions is now possible, allowing your cluster to survive the near-simultaneous loss of an entire region with no service interruption. These features can be deployed so clients experience low-latency, single-region writes.

The gist of what has changed is that they baked in “regions” as a concept that lets users express the globe-level topology of their cluster. As “planet-scale” becomes more of a buzzword, expect to see more of this, sooner in the lifecycles of products.
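
For flavor, the configuration looks roughly like the following. The datacenter IDs and knob values here are invented, and the exact field names should be checked against the 6.0 docs rather than taken from me, but the shape is: describe each region’s datacenters and their failover priorities in a JSON blob, hand it to fdbcli, and tell the cluster how many regions it’s allowed to use (each fdbserver process also needs to be told, via its locality configuration, which datacenter it lives in).

    $ cat regions.json
    "regions":[{
        "datacenters":[
            {"id":"us-west-1", "priority":1},
            {"id":"us-west-2", "priority":0, "satellite":1}
        ]
    },{
        "datacenters":[
            {"id":"us-east-1", "priority":0}
        ]
    }]

    $ fdbcli
    fdb> fileconfigure regions.json
    fdb> configure usable_regions=2

Here priority expresses which region should act as the primary, and usable_regions=2 asks the cluster to keep a full replica in both regions, which is what makes failing over between them something the database can do on its own.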

We’re at the point on the capability vs. suitability curve where we have real problems we need to solve (how do we wrangle the complexity of a database whose components’ communication is significantly limited by the speed of light?), but I’m not sure we yet have enough experience as an industry to say what the right abstractions are. Christopher Meiklejohn has been beating this drum for a while:

We posit that striving for distributed systems that provide “single system image” semantics is fundamentally flawed and at odds with how systems operate in the physical world.

I will note that this is a direct violation of Rule 11 of Codd’s 12 Rules of Databases, which only goes to show that every day we stray further from Codd’s light.