Reporter might have to rollback Aurora story

This is the second edition of Data Stuff, a weekly newsletter about databases by Arjun Narayan and Justin Jaffray.

Housekeeping.

I know it's only been three days since the last newsletter, but I’m still working out the cadence of this thing. For now, let’s say there will be a newsletter each week; the exact day of release is still up in the air.

There will soon be an email version of this newsletter that you can subscribe to, hopefully before the next issue. There is an RSS feed, which is also linked on the page header.

Naked Screaming Andy Pavlo.

CNBC claims that Amazon’s Prime Day outage was caused by their switch to Aurora. Don’t miss the zinger at the end:

Patrick Moorhead, principal analyst at Moor Insights & Strategy, said the incident shows how hard it is for older applications, like those used in Amazon's warehouses, to move off Oracle, which has spent decades working with the world's largest enterprises. "AWS Aurora is designed for forward-looking applications and Oracle for more legacy applications," he said.

Amazon CTO Werner Vogels disputes the story, but confirms that zinger. He says that the problem is indeed in the Amazon software stack, not the AWS stack. So the real explanation is that Amazon is a poor fit as a customer for AWS Aurora, because Amazon is a legacy application.

But is this even a real story? The more interesting outcome is this Andy Pavlo tweetstorm about his interactions with the reporter (in which he backs up Vogels and calls into question the validity of the whole article). I’ll just fast forward to the best tweet:

What happened next was crazy. He called me up at 6am PST and started shouting at me. He called me a liar and said that I was being paid by Amazon to run cover about the outage. We yelled at each other for 30 minutes while I sat in my hotel in my underwear. It was awesome.

Honestly I’m just plain jealous of the reporter. I once tried prank calling the CMU database hotline (+1-844-88-CMUDB) at 3am hoping to catch Andy with his pants down (“Is your database running? Why don’t you go catch it?”), but alas it turns out the hotline is just a voicemail box. This reporter actually got to have all the fun.

But even if this story is blown out of proportion, it’s still a good parable about the switching costs of databases. One of the reasons database performance is notoriously difficult to measure is that it can’t really be boiled down to a concrete number; OLTP workloads are so highly varied that any database that hasn’t played workload whack-a-mole for a decade (hence the aphorism “it takes a decade to stabilize a new DBMS”) is bound to have some pathological cases. In Aurora’s case, the claimed culprit is SAVEPOINT handling:

In one question, engineers were asked why Amazon's warehouse database didn't face the same problem "during the previous peak when it was on Oracle." They responded by saying that "Oracle and Aurora PostgreSQL are two different [database] technologies" that handle "savepoints" differently.

I don’t know how the software in question was using savepoints in such a way that it exposed this alleged shortcoming in Aurora, but I do like to imagine that an Aurora engineer has recently asked the question “why were you using 10,000 savepoints in a transaction, anyway?”. Probably while rubbing their temples. I actually don’t even know if Aurora handles savepoints wildly differently from vanilla Postgres, but I mean, that doesn’t seem…completely…unreasonable? Another reminder of one of the cardinal rules of DBMS development:

Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.

Savepoints and rollbacks.

You generally want your relational database to maintain data integrity. For instance, you expect that all of your foreign key references are valid (this is called referential integrity). A cart_items table that represents the items in a user’s shopping cart should hopefully only refer to items which actually exist (for example, in the items table). Providing this is actually pretty hard! I don’t mean “hard” as in “tricky to implement”, I mean “hard” as in, like, “requires extra transaction isolation machinery.”
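
Here’s a minimal sketch of the kind of schema in question (the column names are made up for illustration):

create table items (id int primary key, name text);
create table cart_items (
    cart_id int,
    item_id int references items (id)  -- the reference the database promises to keep valid
);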

The default isolation level in Postgres is, perhaps surprisingly, read committed:

postgres=# show transaction isolation level;
 transaction_isolation
-----------------------
 read committed
(1 row)

read committed says that you can only read data from other transactions which have committed, or are guaranteed to commit (no “dirty read” anomalies). This is a pretty weak guarantee (it’s near the bottom left on the graph of consistency models), and it’s actually not sufficient to provide referential integrity. For instance, in a straightforward implementation of read committed, foreign keys would be broken all the time because of write skew, an anomaly that read committed permits. Postgres recognizes this and decides that referential integrity of foreign keys and secondary indexes is too important to be left up to the database user. So it quietly does extra work behind the scenes to ensure that operations on them take locks that are held until the end of the transaction, no matter what the isolation level is.
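
Concretely, the foreign key check that fires when you insert into cart_items behaves roughly like a locking read on the referenced row, and that lock sticks around until the transaction ends (this is a sketch of the idea, not the exact query Postgres runs internally):

select 1 from items where id = 42 for key share;  -- the 42 is whatever item_id you just inserted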

This sounds like a good thing! I want my database to silently make things safe. Isn’t not caring kind of the point? The problem with this hybrid isolation is that it’s “serializable for me, but not for thee”, and there are two ways in which this is misleading:

  1. it gives users the impression that their own hand-rolled denormalizations will be treated the same way as these special ones. They’re trained to expect that denormalizations “just work”, but if they try it themselves, there will be write skew all over the place (see the sketch after this list).
  2. it means some things which are generally true about an “idealized” read committed are not true about Postgres (or most other databases’) read committed.
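
To make (1) concrete, here’s a minimal sketch of a hand-rolled foreign key falling over under read committed (the parents and children tables are made up for illustration). Each transaction checks the invariant, sees nothing wrong, and then writes the row that invalidates the other’s check:

user1=# create table parents (id int primary key);
CREATE TABLE
user1=# create table children (parent_id int); -- no "references", integrity is enforced by hand
CREATE TABLE
user1=# insert into parents values (1);
INSERT 0 1
user1=# begin;
BEGIN
user1=# select count(*) from children where parent_id = 1; -- no children, so deleting the parent looks safe
 count
-------
     0
(1 row)
----on a separate connection----
user2=# begin;
BEGIN
user2=# select count(*) from parents where id = 1; -- parent exists, so inserting a child looks safe
 count
-------
     1
(1 row)
user2=# insert into children values (1);
INSERT 0 1
user2=# commit;
COMMIT
----back to user 1----
user1=# delete from parents where id = 1; -- nothing blocks, nothing errors
DELETE 1
user1=# commit; -- both transactions commit, and children now points at a parent that doesn't exist
COMMIT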

(1) is an obvious UX trap, but (2) is a bigger deal than it seems. read committed is supposed to be a world in which the database can always resolve conflicts on its own without escalating to the client; serializable and repeatable read (isolation levels that prohibit write skew) make no such promise. Users in Postgres read committed might believe they never have to deal with client-side retries, but consider this scenario:

  1. Table y references table x, which initially contains two records, called 1 and 2.

  2. Two transactions begin at the same time. One of them deletes record 1 and the other deletes record 2.

  3. Both transactions try to insert into y a record which references the record in x that the other deleted, and commit.

What should happen? We can’t commit both transactions, since the inserted records will be orphaned due to the records they reference being deleted (this is an example of write skew). In general, if a transaction can’t commit, SQL databases punt the problem back to the application, which must either give up or restart the transaction and try again. We can see this in action in Postgres:

user1=# create table x (a int primary key);
CREATE TABLE
user1=# create table y (b int references x (a));
CREATE TABLE
user1=# insert into x values (1), (2);
INSERT 0 2
user1=# begin;
BEGIN
user1=# delete from x where a = 1;
DELETE 1
----on a separate connection----
user2=# begin;
BEGIN
user2=# delete from x where a = 2;
DELETE 1
user2=# insert into y values (1); -- here, this transaction blocks, waiting to see what user1's transaction does
----back to user 1----
user1=# insert into y values (2);
ERROR:  deadlock detected

If Postgres wants to maintain referential integrity here, it has no choice but to force one of the transactions to abort and try again (which is what it does). But it’s not clear that many apps know they have to be ready for this! The Postgres docs themselves are a bit misleading here:

(under the section “13.2.2. Repeatable Read Isolation Level”) Applications using this level must be prepared to retry transactions due to serialization failures.

Which hints that applications at levels below this (like read committed) need not be prepared for retries, which, as we just saw, is not true. I strongly suspect there are many subtly broken apps out in the wild whose authors never realized this limitation of read committed.

One thing clients can do with long, complicated transactions is to create savepoints along the way — markers inside a transaction that give you intermediate points to roll back to, rather than rolling back all the way to the start (these are also sometimes called ‘nested transactions’). You can use these if you know that the stuff you did earlier is unlikely to conflict, so when you do get that deadlock detected error, you roll back to the savepoint and retry from there, rather than from the start[1].
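
Here’s roughly what that looks like in psql, reusing the x and y tables from above (the savepoint name is made up). Only the work after the savepoint gets undone; the insert before it survives and commits:

user1=# begin;
BEGIN
user1=# insert into x values (3);
INSERT 0 1
user1=# savepoint risky_bit;
SAVEPOINT
user1=# insert into y values (99); -- no such row in x, so the foreign key check fails
ERROR:  insert or update on table "y" violates foreign key constraint "y_b_fkey"
DETAIL:  Key (b)=(99) is not present in table "x".
user1=# rollback to savepoint risky_bit; -- undoes only the failed insert, the transaction is usable again
ROLLBACK
user1=# commit; -- the insert into x still goes through
COMMIT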

You might imagine that these savepoints are cheap to create but expensive to roll back to, a cost that’s dismissed by application developers because “rollbacks don’t happen in read committed”. Until you hammer the foreign keys, that is. Now, I don’t know if this is what happened with Aurora (there’s little to no public detail beyond ‘savepoints’), but it certainly is… plausible? Lots of database documentation implies there’s a usability upside (no deadlocks) to giving up serializability, but that’s questionable in practice (disclosure: I work at a company that sells insect-themed serializability for money).

Outage.

There was a large GitHub outage last week, primarily caused by a “network partition and subsequent database failure.” The one small upside to this was that people got yet another chance to make “but I thought git was decentralized” jokes.

It’s pretty clever (and funny!) the way GitHub signaled to users there was a human behind the status updates, by making them all slightly different:

02:51 EDT We are currently in the later stages of a restore operation, with the aim of serving fully consistent data within the next 2 hours.
03:35 EDT Our restore operations are proceeding as expected, on track for serving fully consistent data within the next 1.5 hours.
04:19 EDT Our restore operations are still on track to serve consistent data within the hour. We've posted an update at https://blog.github.com/2018-10-21-october21-incident-report/.
05:47 EDT We continue to monitor restores which are taking longer than anticipated. We estimate they will be caught up in an hour and a half.
07:56 EDT The majority of restore processes have completed. We anticipate all data stores will be fully consistent within the next hour.
11:09 EDT Background jobs remain paused as we near completion of validation of data consistency.

Amusing as they are, my heart goes out to the SREs who lived through the sad story these status updates and timestamps tell. I’m eagerly waiting on the promised post-mortem, since they claim it was a database failure due to a network partition. We’ll definitely be talking about it in detail. Meanwhile, we’ll have to tide ourselves over by rereading Dan Luu’s list of past post-mortems.

Things happen (in some serial order).

FoundationDB is having a summit, including talks like “How FoundationDB powers Snowflake’s metadata”. This Hacker News comment adds a little more color. Heidi Howard (of Flexible Paxos fame) gave a talk on consensus. Tim Cook warns against the data-industrial complex. Harvard’s Stratos Idreos published “The Periodic Table of Data Structures”. There are some B-trees and LSM-trees in there. Citus donates 1% of its equity to Postgres organizations. DigitalOcean launches managed Postgres and MySQL databases. MongoDB’s sharded clusters feature was Jepsened (to be covered in detail next newsletter). Larry Ellison says that Amazon won’t be able to migrate off Oracle by the 2020 target. He also says “Amazon Web Services Customers Can Hack AWS Cloud And Steal Data”, reminding me that this is a good opportunity to reread this great analysis of Oracle Cloud capex.

To contact the author of this story email datastuff@ristret.com, editor@ristret.com, slashdevslashnull@ristret.com, or heck, mlevine51@ristret.com.

Footnotes.

[1] This is unfortunately not true in MySQL, which obliterates savepoints when it detects deadlocks.