Data Stuff: Etched in Stone

This is the third edition of Data Stuff, a weekly newsletter about databases by Arjun Narayan and Justin Jaffray.

People are worried about durability.

Actually, no, I lied. When it comes to ACID, people are generally worried about the A, C, and I, but if you wanted to pick one of the ACID properties to declare "solved", there's a pretty good argument for picking D (durability) and not worrying about it further. Durability is pretty clear: if I acknowledge your write, I shouldn't lose it. That's not to say the meaning of durability hasn't evolved, or that people don't have different durability needs depending on their circumstances, or that ARIES isn’t tricky to get right; just that the space of solutions is much better understood, generally[1].

Before replicating your database became a mainstream expectation, a write was generally considered durable if it had been written to some medium that would survive a power cycle. Nowadays those with an investment in distributed databases (disclosure: I work for a company that sells insect-themed replication for money) would probably be happier if that definition were broadened to mean "written to multiple disks such that it can survive the permanent disappearance of some minority fraction of said disks." But the reality is that stronger durability guarantees necessarily come with trade-offs, like worse latency (you want those disks to be far apart because of floods or mushroom clouds or whatever your threat model is). If you value latency above durability, you might be okay with losing writes sometimes if it gets you a better response time. Some databases recognize that users have different needs here, and expose this tradeoff as a tactile knob affixed to the side of your server with "MULTI-NODE DURABILITY" scribbled in sharpie at one extreme and "LOW LATENCY" at the other.
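If the knob feels too metaphorical, here's a minimal sketch of the tradeoff it controls, with everything made up for illustration (the Replica class, the latencies, the mode names): acknowledge after the local write and you get low latency, or wait for a majority of replicas and you get writes that survive a minority of disks vanishing.

```python
import concurrent.futures
import time

class Replica:
    """Hypothetical stand-in for a node with its own disk; write() blocks
    until the record is (pretend-)persisted there."""
    def __init__(self, name, rtt_seconds):
        self.name = name
        self.rtt = rtt_seconds   # far-away disks cost you round trips
        self.log = []

    def write(self, record):
        time.sleep(self.rtt)
        self.log.append(record)
        return True

def replicated_write(record, local, remotes, mode="MULTI-NODE DURABILITY"):
    local.write(record)                      # always persist locally first
    if mode == "LOW LATENCY":
        # Ack immediately; replication happens "eventually". Fast, but a
        # local disk failure right now loses an acknowledged write.
        return "acked"

    total = 1 + len(remotes)
    majority = total // 2 + 1
    acks = 1                                 # the local write counts
    if acks >= majority:
        return "acked"                       # degenerate one-node "cluster"

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(remotes))
    futures = [pool.submit(r.write, record) for r in remotes]
    for fut in concurrent.futures.as_completed(futures):
        if fut.result():
            acks += 1
        if acks >= majority:
            pool.shutdown(wait=False)        # stragglers finish in the background
            return "acked"                   # survives a minority of disks vanishing
    pool.shutdown(wait=False)
    raise RuntimeError("no majority reachable; refusing to acknowledge")

local = Replica("local", 0.0)
remotes = [Replica("us-east", 0.03), Replica("eu-west", 0.09)]
print(replicated_write({"edition": 3}, local, remotes))
```

The latency cost of the durable setting is exactly the round trip to the slowest replica you need for a majority, which is why that knob gets twisted toward "LOW LATENCY" more often than anyone likes to admit.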

So it was in this context that MongoDB was once again Jepsened, with a focus on MongoDB Sharded Clusters:

We did, however, uncover problems with MongoDB’s causal consistency: it doesn’t work unless users use both read and write concern majority, and the causal consistency documentation made no mention of this. While MongoDB will reject causal requests with the safer linearizable read level, and the unsafe write concern unacknowledged, it will happily accept intermediate levels, like write concern 2 or read level local. Since many users use sub-majority operations for performance reasons, and since causal consistency is typically used for high-performance local operations which do not require coordination with other cluster nodes, users could have reasonably assumed that causal sessions would ensure causal safety for their sub-majority operations; they do not. […] This interpretation hinges on interpreting successful sub-majority writes as not necessarily successful: rather, a successful response is merely a suggestion that the write has probably occurred, or might later occur, or perhaps will occur, be visible to some clients, then un-occur, or perhaps nothing will happen whatsoever.

We note that this remains MongoDB’s default level of write safety.

There’s a common misconception that the value provided by Jepsen was in combining randomized testing with a deep understanding of distributed systems. That’s obviously crucial to the whole project, but the more human-centric approach is where Jepsen really shines — Jepsen explicitly calls attention to Mongo’s choice to turn the durability knob to the “LATENCY” angle by default.

First, the good news. If you read the early Jepsens, it’s pretty clear that the knob labeled “SAFETY” wasn’t connected to anything: just a piece of decorative plastic. The good news now is that the “DURABLE” side of the knob works! The bad news is that the performance guidelines, the documentation, and the default out-of-the-box setup all push you to twist the knob the other way. So we’ve advanced as a society from “unsafe at any speed” to “safe in first gear”. But like… the car still goes straight to second when you step on the gas. You do have the option to drive in first gear, though. I’m not sure how happy I am about this “advance”.
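For the record, “driving in first gear” looks something like this. This is my own PyMongo sketch, not anything lifted from the report or MongoDB’s docs; it just follows the report’s prescription of majority read and write concern plus a causal session, and the connection string, database, and collection names are placeholders.

```python
from pymongo import MongoClient
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

# Placeholder connection string; point this at your own replica set.
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
db = client.get_database("newsletter")

# Per the Jepsen report, causal sessions only give you causal safety
# when reads AND writes both use majority concern.
events = db.get_collection(
    "events",
    read_concern=ReadConcern("majority"),
    write_concern=WriteConcern("majority"),
)

with client.start_session(causal_consistency=True) as session:
    events.insert_one({"edition": 3, "topic": "durability"}, session=session)
    # A causally consistent read that should observe the write above.
    print(events.find_one({"edition": 3}, session=session))
```

The point being: none of this is the default. You have to reach over and turn the knob yourself.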

Now, to be fair, these tradeoffs are very real! But like, as an industry, we’re still not being entirely honest when marketing one side of the knob. Take for instance the paper introducing the YCSB benchmark: it briefly acknowledges that these tradeoffs exist, but that’s it. In retrospect, the paper captures the zeitgeist of the late aughts, in that it has a sense of moral urgency to strip down ACID and sell it for parts in order to fund its QPS addiction. But even today there aren’t really benchmarks that capture the nuance in this trade-off. You sort of have to go back to TPC-C[2] to even have a real discussion about needing isolation guarantees in your benchmarks. And even that one was declawed: snapshot isolation is enough to pass it, because its transaction mix happens not to produce any of the anomalies snapshot isolation allows, so you get serializable executions for free (a toy example of the kind of anomaly it never exercises is below). You would think we’d have some real benchmarks that weren’t, you know, 30 years old, and that were clearer about these tradeoffs.
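For what it’s worth, the classic anomaly snapshot isolation permits is write skew: two transactions each read two rows off their snapshots, each update a different one, and the combined result is something no serial order could have produced. A toy, in-memory illustration, no database involved:

```python
# Shared constraint: at least one doctor must stay on call.
db = {"alice_on_call": True, "bob_on_call": True}

def go_off_call(snapshot, me):
    """Return the write set this transaction wants to commit."""
    others_on_call = sum(v for k, v in snapshot.items() if k != me)
    if others_on_call >= 1:          # constraint check against the snapshot
        return {me: False}
    return {}

snap_t1 = dict(db)                   # both transactions snapshot the same state
snap_t2 = dict(db)

writes_t1 = go_off_call(snap_t1, "alice_on_call")
writes_t2 = go_off_call(snap_t2, "bob_on_call")

# The write sets don't overlap, so snapshot isolation commits both.
db.update(writes_t1)
db.update(writes_t2)

print(db)  # {'alice_on_call': False, 'bob_on_call': False}: nobody on call,
           # an outcome no serial execution of the two transactions allows.
```

TPC-C’s transactions never arrange this read-both-write-one shape, which is why snapshot isolation squeaks past its isolation requirements.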

Anyway, all of this stirs up a wave of nostalgia for the early Jepsens. In a sense MongoDB has always been driving our industry forward! Looking back at the first two analyses — of Redis and MongoDB — they are still remarkably entertaining reads, and they really spurred the industry to demand more rigor from vendors of distributed systems. MongoDB’s like the patron saint of Jepsen analyses, really, being analyzed four times now (disclosure, I work for a company that sells insect-themed Jepsen passing consistency for money).

And finally, in exciting news, the most recent Jepsen comes from a new author, expanding the team from one to two. This presumably means twice as much Jepsen content going forward. I’ll be sure to stay up to date with all things Jepsen in this newsletter. Anyway, thanks to Kyle, Carly Rae’s yearns will live on in our hearts, forever our muse of dropped messages and missed calls.

Persistence pays off.

Intel’s “Optane DC Persistent Memory” will finally be available on Google Cloud:

Today, we’re excited to announce the alpha availability of virtual machines with 7TB of total memory utilizing Intel Optane DC persistent memory. With native 7TB virtual machines, GCP customers have the ability to scale up their workloads while benefiting from all the infrastructure capabilities and flexibility of Google Cloud, including on-demand provisioning, Live Migration and flexible scaling up and down.

Presumably this announcement is intended primarily for today’s in-memory database vendors (there’s an explicit callout to SAP HANA); the average commercial user of a cloud platform doesn’t have many obvious actions they can take in response.

I’m nervous for our incoming persistent-memory future; a clear delineation between the uses of volatile and non-volatile storage was actually a pretty solid architectural restriction. I mean, to be honest, seeing how the sausage gets made has made me pretty respectful of, like… the recover-from-log part of the whole durability story, the part that bails you out when you screw up the internal state. But if you’re going to move to a world where there’s just ‘the state’, we’re going to need a lot more confidence in our ability to maintain rigorous invariants on our data structures no matter what happens. We’re losing the key invariant we’ve relied on as an industry since inception: turning it off and on again.
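To spell out what that invariant buys you today: the volatile half of the split is disposable precisely because the durable log can rebuild it from scratch. A minimal, made-up sketch of that recovery ritual (the log format and file name are hypothetical):

```python
import json
import os

LOG_PATH = "wal.log"   # hypothetical append-only log file

def append(op, key, value=None):
    """Durably log the operation before touching any in-memory state."""
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps({"op": op, "key": key, "value": value}) + "\n")
        f.flush()
        os.fsync(f.fileno())

def recover():
    """Rebuild the volatile state from scratch by replaying the log.
    Whatever garbage the in-memory structures held before the crash
    simply evaporates: that's the turn-it-off-and-on-again invariant."""
    state = {}
    if not os.path.exists(LOG_PATH):
        return state
    with open(LOG_PATH) as f:
        for line in f:
            entry = json.loads(line)
            if entry["op"] == "put":
                state[entry["key"]] = entry["value"]
            elif entry["op"] == "delete":
                state.pop(entry["key"], None)
    return state

append("put", "edition", 3)
append("put", "topic", "durability")
print(recover())   # a power cycle later, this is all you need
```

Corrupt the in-memory dict however you like; a restart plus a replay gets you back to something the log vouches for. Persistent memory, used naively, has no equivalent ritual.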

Intel also issued a press release regarding the widespread availability of Optane:

How It’s Different: As a part of today’s news, Intel announced unique capabilities delivered by Intel Optane DC persistent memory through two special operating modes – App Direct mode and Memory mode. Applications that have been specifically tuned can take advantage of App Direct mode to receive the full value of the product’s native persistence and larger capacity. In Memory mode, applications running in a supported operating system or virtual environment can use the product as volatile memory, taking advantage of the additional system capacity made possible from module sizes up to 512 GB without needing to rewrite software.

I guess the idea is like, if you don’t rewrite your software to specifically take advantage of the architectural changes persistent memory allows, you can treat it as big, cheap RAM. Which, I guess, sure. But if you stop there, it isn’t a speed-of-light performance improvement for most databases in the general case. The real ride is going to come from bold new rearchitectures that just always do things without the durable logging. But… I have my doubts. These database scientists are so preoccupied with whether or not they can, they aren’t stopping to think if they should. Are we ready, as an industry, to stand behind a new database that has to remain stable from first start and will never have the escape hatch of a power cycle?
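To caricature the two modes: Memory mode means nothing changes and your code stays oblivious; App Direct mode means the bytes you write to the mapped region are the durable state. The sketch below is not the real persistent-memory programming model (that world involves PMDK, cache-line flushes, and fences); it’s a rough analogy using an ordinary memory-mapped file to show what “your writes are the durable state, with no log behind them” implies.

```python
import mmap
import struct

# A memory-mapped file stands in for the persistent region.
with open("pmem_region.bin", "w+b") as f:
    f.truncate(16)
    region = mmap.mmap(f.fileno(), 16)

    # "App Direct"-style update: mutate the durable structure in place.
    # There is no log to replay, so a crash between these two writes
    # leaves a half-updated record that nothing will ever repair.
    struct.pack_into("<q", region, 0, 42)     # field one
    struct.pack_into("<q", region, 8, 2018)   # field two
    region.flush()                            # push it toward durability

    print(struct.unpack_from("<qq", region, 0))
    region.close()
```

A crash between the two pack_into calls leaves a torn record, and there is no replay step that will ever notice.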

Rockset.

I’m going to be honest. In 2018 the MVP for a data product is pretty complicated. Vendors have to convince potential users that they check all sorts of complicated boxes. So when I saw the announcement for Rockset, it honestly took me a while to figure out what exactly was being sold here. After reading through their architecture docs I’ve decided I’m just going to talk about the SQL dialect they’ve created.

And the SQL dialect is actually pretty smart! People love Postgres’s JSONB data type, but it falls victim to the operator-soup problem that tends to come up when you bolt a new data model on top of an old one, never mind that it gives up a lot of the querying capabilities of SQL. There are pushes to provide a more relational interface to this kind of schemaless data, but a SQL dialect designed from the ground up to support flexible schemas strikes me as a reasonable way to get something more usable.
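By “operator soup” I mean roughly the sort of thing below: a hypothetical query against a made-up events table, via psycopg2, where you juggle ->, ->>, #>> and @> to dig through the document.

```python
import psycopg2

# Placeholder connection parameters.
conn = psycopg2.connect("dbname=newsletter")
cur = conn.cursor()

# JSONB operator soup: -> returns jsonb, ->> returns text, #>> digs a path,
# @> is containment. All of it bolted onto SQL after the fact.
cur.execute(
    """
    SELECT payload->'user'->>'name',
           payload#>>'{address,city}'
    FROM events
    WHERE payload @> %s::jsonb
    """,
    ('{"type": "signup"}',),
)
for name, city in cur.fetchall():
    print(name, city)
cur.close()
conn.close()
```

It works, but the relational machinery treats the inside of the document as mostly opaque, which is exactly the gap a from-scratch dialect gets to close.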

I guess the idea behind the name “Rockset” is like, “it’s RocksDB, but for ETL, and, oh, also you don’t have to do the ‘L’ part,” since ideally the data from your data lakes just keeps streaming into your “converged index” (details are sparse on what the “converged index” actually is). Anyway, here are some thoughts from Peter Bailis introducing Rockset.

Things happen (in some serial order).

Uber introduces Peloton, a resource scheduler for diverse cluster workloads. So that’s another of Andy Pavlo’s names gone (P.S. hey Andy, renew your SSL cert), after Larry Ellison took ‘self-driving databases’. I guess it’s only a matter of time before T-Pain comes for OtterTune. Neo4j raised an 80 million dollar Series E. HashiCorp raised a 100 million dollar Series D. Yugabyte has an in-depth reply to Dan Abadi’s “Calvin vs Spanner” blog post.

To contact the author of this story email datastuff@ristret.com, editor@ristret.com, slashdevslashnull@ristret.com, or heck, mlevine51@ristret.com.

Footnotes.

[1] Although the introduction of new storage devices like non-volatile memory is expanding this space and demanding new research.

[2] The TPC-C specification, section 3.4, lays out the isolation phenomena that must not be present. It’s pretty rudimentary: a checklist of specific anomalies rather than a proper isolation requirement, and a historical approach that could use updating.