Data Stuff: Flora and Fauna

This is Data Stuff, a weekly newsletter about databases by the distributed team of Justin Jaffray and Arjun Narayan, speaking with the illusion of a single monolithic server.

This is the first edition, and due to publication issues is slightly behind schedule. Events in the last two days will be covered in the next edition of Data Stuff. The contents of this newsletter are purely the opinion of the composite fictional author, so, like, definitely not the opinion of Justin's or Arjun's employer.

Are we building too many databases?

Andy Pavlo is worried:

I am now aware of 4 major silicon valley companies building an internal distributed OLTP DBMS. Each story is almost the same:

  1. They want a version of Spanner w/o TrueTime.
  2. They evaluated @CockroachDB + TiDB + @Comdb2 + @YugaByte + @FoundationDB but decided to build their own

Perhaps this is what progress looks like? I mean, back in the heady days of NoSQL, engineers would be asked to ship an innocuous feature and discover that it required transactions or joins or some other bleeding-edge invention, and they'd spend a week accidentally writing client-side join algorithms in Ruby, or doing a write and then spinning on reads until they finally saw the value they just wrote. Those engineers might not even have noticed that they were building a database.
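For anyone lucky enough to have missed that era, here's roughly what the second anti-pattern looks like, sketched in Python rather than Ruby; the kv_store client and its get/put methods are hypothetical stand-ins for whatever eventually consistent store was involved:

```python
import time

def write_then_wait(kv_store, key, value, timeout=5.0, poll_interval=0.05):
    """Write a value, then poll reads until we observe it (or give up).

    This is the accidental read-your-writes pattern: kv_store is a
    hypothetical client for an eventually consistent store, with plain
    get/put methods and no transactions.
    """
    kv_store.put(key, value)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if kv_store.get(key) == value:
            return True            # we finally read what we wrote
        time.sleep(poll_interval)  # spin (politely) and try again
    return False                   # never saw our own write within the timeout
```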

Now, apparently some software companies are giving a few of their engineers time to focus and build transactions and join algorithms in a principled fashion. This is progress! Maybe one day they'll even limit themselves to one such piece of software per company, but that might be too much to ask for at the moment. Let's just be glad that we're moving companies from the “accidentally built a database” column to the “intentionally built a database” column.

Programmers don’t like clocks.

Appealing to programmers’ natural disdain for clocks is generally a good strategy. FaunaDB — a NewSQL database based on Calvin (disclaimer: I work at Cockroach Labs, a competitor to Fauna, and own some options and stuff) — has published a blog post describing how their strictly serializable transactions work. Importantly, unlike Spanner, Fauna doesn't need the magic clocks (this is a very good property to have, since unless you’re Spanner you don’t have the magic clocks).

The underlying algorithm used to achieve serializability is roughly comparable to what is described in “A Critique of Snapshot Isolation” by Yabandeh and Ferro:

  1. Perform all your reads at a snapshot in time, then
  2. when you go to commit, make sure none of those reads have changed (if they have, throw away everything and start over).

This is a form of OCC, and as usual the workhorse here is the validation phase, where it's verified that the reads you did are still acceptable to base your writes off of. If you’re willing to accept some (generally quite small) staleness in your read-only transactions, you can sidestep (2) and just read from a local replica. However, opting in to stale reads explicitly gives up strict serializability. In FaunaDB if you want strictly serializable reads they’re available to you, but you now need to go through global consensus and eat a round trip’s worth of latency. There’s a lively Hacker News discussion about this.
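Here's a rough sketch of that read-validate-retry loop in Python. It's an illustration of the general OCC shape described above, not FaunaDB's (or anyone else's) actual implementation, and the store object with its snapshot/read_at/version_of/apply methods is invented for the example:

```python
class ConflictError(Exception):
    """Raised when we give up after repeated validation failures."""

def run_occ_transaction(store, txn_body, max_attempts=10):
    """One optimistic-concurrency-control loop: read at a snapshot,
    validate the read set at commit time, retry on conflict.

    `store` is a hypothetical versioned key-value store exposing:
      snapshot()            -> a point-in-time version number
      read_at(key, version) -> the value of key as of that version
      version_of(key)       -> the latest committed version of key
      apply(writes)         -> atomically install a dict of writes
    `txn_body(read)` performs all of its reads through `read` and
    returns the dict of writes it wants to commit.
    """
    for _ in range(max_attempts):
        snapshot = store.snapshot()
        read_set = {}

        def read(key):
            read_set[key] = snapshot          # remember what our writes are based on
            return store.read_at(key, snapshot)

        writes = txn_body(read)               # 1. all reads happen at the snapshot

        # 2. validation: has anything we read changed since the snapshot?
        if any(store.version_of(key) > version for key, version in read_set.items()):
            continue                          # conflict: throw it all away, retry
        store.apply(writes)                   # no conflict: commit the write batch
        return writes
    raise ConflictError("gave up after repeated validation failures")
```

A real system also has to make the validation and the apply atomic with respect to other transactions; in a distributed setting, that's roughly where a consensus round comes in.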

Proofs.

I hope referencing HN comments doesn’t become a theme here, but it seems that every time there’s discussion of a distributed system a similar thread shows up (paraphrased):

Commenter 1: I wish the authors would publish a proof of their protocol. I would rather read that.
Commenter 2: provides links to a paper describing the algorithm and the protocol
Commenter 1: No no, I mean a proof, like in TLA+ or Coq.

First of all, I think the linked commenter is being a little leading here; they must know there almost certainly does not exist a proof of correctness written in those systems or anything resembling them. But I think HN's (and the rest of the programming world's) fascination with mechanical verification hints at the fact that programmers fundamentally misunderstand what “proof” is.

I guess the reasoning is like, if the computer tells me that a sequence of bits is correct, it’s inherently more convincing than if a trusted, competent human being tells me the same thing about a page of prose (ignoring that I probably need a trusted, competent human being to tell me if the computer-aided proof says anything of any value in the first place).

It seems unlikely to me that a proof written with the intent of being computer-verifiable will be more understandable than one written with the intent of being understandable. In general, making things more formal — and a computer-verifiable proof is, necessarily, as formal as possible — makes them harder to understand, not easier. This is a problem because proofs usually exist to convince humans that something is true. You could change the term “proof” to “sufficiently convincing statement” [1] and it would mean roughly the same thing. In fact, it would probably increase public understanding.

Anyway, the AWS people have a thoughtful piece from 2014 discussing how exactly they use computer-aided proofs to convince themselves that the systems they built are correct. But the high-level principles mostly involve writing precise specifications of the properties that they want, which is useful even if you don't mechanically verify that your production code meets your spec. And the mere existence of a TLA+ or Coq proof isn't convincing if you're not going to do the work of understanding the specification. Hacker News always wants to know whether a mechanical proof exists, but never really discusses the spec itself, which is the more important part.
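To illustrate the "write the spec down, even if you never mechanically check it" point, here's a property for a made-up replicated counter stated as plain Python rather than TLA+ or Coq. The Counter class is invented for the example; the precise statement of the invariant is the part that does the work:

```python
import random

class Counter:
    """A toy 'replicated' counter: three replicas each keep a local count,
    and merging is just summing them. Entirely invented for this example."""

    def __init__(self):
        self.replicas = [0, 0, 0]

    def increment(self, replica):
        self.replicas[replica] += 1

    def merged_value(self):
        return sum(self.replicas)

def check_counter_spec(trials=200, ops_per_trial=50):
    """The 'spec': after any sequence of increments, the merged value equals
    the number of increments issued. Stating the property precisely is the
    useful part, even though nothing here is mechanically verified."""
    for _ in range(trials):
        counter = Counter()
        ops = [random.randrange(3) for _ in range(ops_per_trial)]
        for replica in ops:
            counter.increment(replica)
        assert counter.merged_value() == len(ops), "spec violated"

if __name__ == "__main__":
    check_counter_spec()
```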

Open Source.

MongoDB created a new “open source” license, the Server Side Public License (SSPL), to prevent third-party vendors from selling its software as a service without sharing their source back. The main complaint from MongoDB seems to be that their existing license (Affero GPL v3) has some loopholes, which they're closing, but the spirit is basically still open source. Some people are upset about this because the OSI hasn't yet said this is kosher open source. Others are upset that the OSI has unilaterally decided that they're the arbiters of the phrase “open source”.

Other database vendors have expressed concerns about the broad language present in the SSPL:

I wonder if folks at MongoDB have given this enough thought. What if a cloud service does not use the MongoDB query language and instead offers a slightly different interface to query and save JSON objects? Would it be allowed?

I mean, realistically, MongoDB isn’t going to go around suing people who use their software for its intended purpose, but I guess it would be comforting if the option wasn’t there.

Snowflake.

The checks have barely cleared from Snowflake's $260 million round nine months ago, and Snowflake Computing has already raised another $450 million. Growing at this pace as the cloud-native data warehouse means supporting more than just AWS, and as part of that expansion they published a blog post, which I liked, about the challenges of building an Azure-hosted offering.

Things happen (in some serial order).

Mark Callaghan has interesting thoughts about LSM compaction algorithms. Stanford Professor Peter Bailis launched a startup in the AI + data analytics space, raising $14 million from Andreessen Horowitz. Postgres 11 was released, now with just-in-time compilation. There's a paper in last month’s VLDB surveying compiled and vectorized query execution. Cloudera and Hortonworks are merging. Elastic’s IPO was a success, almost doubling from the initial price.

[1]: For the appropriate definition of “sufficient”, of course.