Dropbox wrote about their cross-shard transaction system:
While the product use-cases at Dropbox are usually a good fit for collocation, over time we found that certain ones just aren’t easily partitionable. As a simple example, an association between a user and the content they share with another user is unlikely to be collocated, since the users likely live on different shards. Even if we were to attempt to reorganize physical storage such that related colos land on the same physical shards, we would never get a perfect cut of data.
It seems a generally agreed-upon truth that no matter how carefully you partition your workload, there will always be some proportion of transactions that won’t be colocated. Dropbox pegs this at 5-10%:
Although two-phase commit was a fairly natural fit for Edgestore’s existing workload, it is not a silver bullet for those looking to improve their consistency guarantees. Edgestore data was already well-collocated, which meant that cross-shard transactions ended up being fairly rare in practice—only 5-10% of Edgestore transactions involve multiple shards.
I wonder if there’s some universal law of how many cross-shard transactions you can’t partition your way out of. TPC-C specifies that number to be 1%:
A supplying warehouse number (OL_SUPPLY_W_ID) is selected as the home warehouse 99% of the time and as a remote warehouse 1% of the time. This can be implemented by generating a random number [1 .. 100];
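To make that clause concrete, here is a rough Python sketch of the selection rule as quoted: draw a number in [1 .. 100] and go remote on exactly one of the hundred outcomes. The function names, the way a remote warehouse is picked, and the `lines_per_txn` parameter are my own assumptions for illustration, not something taken from the spec text above.

```python
import random


def pick_supply_warehouse(home_w_id, num_warehouses, rng):
    """Return the supplying warehouse id for one order line (ids are 1-based)."""
    # Draw x uniformly from [1 .. 100]; only x == 1 (1% of draws) goes remote.
    x = rng.randint(1, 100)
    if x > 1 or num_warehouses == 1:
        return home_w_id
    # Assumption: a remote warehouse is chosen uniformly among the others.
    remote = rng.randint(1, num_warehouses - 1)
    return remote if remote < home_w_id else remote + 1


def cross_shard_fraction(num_txns, lines_per_txn, num_warehouses, seed=0):
    """Estimate the fraction of transactions with at least one remote line."""
    rng = random.Random(seed)
    cross = 0
    for _ in range(num_txns):
        home = rng.randint(1, num_warehouses)
        if any(
            pick_supply_warehouse(home, num_warehouses, rng) != home
            for _ in range(lines_per_txn)
        ):
            cross += 1
    return cross / num_txns


if __name__ == "__main__":
    print(cross_shard_fraction(100_000, lines_per_txn=1, num_warehouses=10))
    print(cross_shard_fraction(100_000, lines_per_txn=10, num_warehouses=10))
```

One thing the sketch makes visible: the 1% applies per order line, so the per-transaction figure depends on how many lines a transaction has. With an assumed 10 lines per transaction, the expected fraction touching a remote warehouse is 1 - 0.99^10, or roughly 9.6%.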
It seems obviously better, even in a system that supports cross-shard transactions, to have as few of them as possible. 5-10% seems to me to be a pretty reasonable target.
If TPC-C intended their choice of 1% to be a moral judgement on what the appropriate number of cross-shard transactions is, Dropbox seems to say that they probably underestimate that number a bit. I think the more likely explanation is that in the days when TPC-C was conceived, cross-shard transactions in general-purpose databases weren’t really a mainstream thing, and so this clause is more likely to be saying “don’t shard your database” than it is to be saying “implement cross-shard transactions well”.
The Dropbox post is interesting, but it leaves some open questions for me. The scariest thing for me in 2PC is that a well-timed leader failure can leave all the participants stuck, and they don’t really go into how they resolve that in practice.
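The stuck-participant problem is easy to see in a toy model. The sketch below is a minimal in-memory simulation of textbook two-phase commit, not Dropbox’s implementation; `Participant`, `run_2pc`, and the `crash_after_prepare` flag are hypothetical names made up for the illustration.

```python
from enum import Enum


class State(Enum):
    WORKING = "working"
    PREPARED = "prepared"
    COMMITTED = "committed"
    ABORTED = "aborted"


class Participant:
    def __init__(self, name):
        self.name = name
        self.state = State.WORKING

    def prepare(self):
        # Voting yes is a promise: from here on this participant
        # may not decide the outcome on its own.
        self.state = State.PREPARED
        return True

    def finish(self, commit):
        self.state = State.COMMITTED if commit else State.ABORTED

    def on_timeout(self):
        # Before voting, a participant can safely abort unilaterally.
        # After voting yes, it has to keep waiting (and holding locks).
        if self.state is State.WORKING:
            self.state = State.ABORTED
        return self.state


def run_2pc(participants, crash_after_prepare=False):
    """Run both phases; return the decision, or None if the leader crashed."""
    votes = [p.prepare() for p in participants]  # phase 1: collect votes
    if crash_after_prepare:
        return None  # leader dies before announcing any decision
    decision = all(votes)
    for p in participants:  # phase 2: broadcast the decision
        p.finish(decision)
    return decision


if __name__ == "__main__":
    shards = [Participant("shard-a"), Participant("shard-b")]
    run_2pc(shards, crash_after_prepare=True)
    print([p.on_timeout() for p in shards])  # both are still PREPARED
```

Real systems get out of this state by making the decision durable somewhere the participants can find it, or by replicating the leader; which of those applies to Edgestore is exactly the part I would like to read more about.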
If someone is trying to sell you a distributed database, what questions can you ask to get a good high-level understanding of it? Some Guess Who-esque ideas that spring to mind:
The CAP theorem is pithy enough to have become the defining distributed systems trade-off of our time. The general idea is that if someone asks you what your friend on the other side of town wants to do for dinner, and her phone is dead, you can either
Much has been written about how the CAP theorem doesn’t capture many of the subtleties present in the design of a modern distributed system. That’s not to say it isn’t a good building block for a solid intuition around distributed systems. But it’s such a lossy projection of a much larger set of factors that using it to immediately sort systems into buckets isn’t especially helpful, and its impact on the day-to-day operation of real distributed systems is not that big. Its scope is pretty explicitly limited, as most discussions of it will quickly point out: taken at face value, it says that a short network partition will cause a correspondingly short loss of either consistency or availability.
Anyway, GitHub was down recently:
At 22:52 UTC on October 21, routine maintenance work to replace failing 100G optical equipment resulted in the loss of connectivity between our US East Coast network hub and our primary US East Coast data center. Connectivity between these locations was restored in 43 seconds, but this brief outage triggered a chain of events that led to 24 hours and 11 minutes of service degradation.
What I like about this outage is that it exposes the CAP theorem as the vast, vast oversimplification that it is. CAP is talked about as the fundamental trade-off of distributed systems, and yet in this situation it only predicts 43 seconds of downtime. Where is the chasm between theory and practice here? Does anyone really care about that 43 seconds of downtime? Well, yes, probably, but clearly whatever trade-off GitHub was forced to make due to CAP wasn’t really all that meaningful in the grand scheme of things.
Don’t miss the very poetic part of the story where GitHub engineers manually resolve merge conflicts:
Ultimately, no user data was lost; however manual reconciliation for a few seconds of database writes is still in progress.
The genesis of NoSQL was like, “ah, no, you don’t need that. Oh, you definitely don’t need this. This is just slowing you down,” so I find it sort of goofy that NoSQL vendors are slowly adding in these features. The moral arc of transactions is long, but it sure looks like it bends towards SQL:
Production-ready with its 3.0 release, Scylla’s global secondary indexes can scale to any cluster size, unlike the local-indexing approach adopted by Apache Cassandra. Secondary indexes allow the querying of data through non-primary key columns.
It’s kind of, like, pretty fundamental to the design of Cassandra that you don’t have global secondary indexes. Once you start to sneak up on “contrary to the reason your software was conceived” territory, I think it might be time to re-evaluate the tack you’re taking. This isn’t hating on Cassandra; it’s just that it was conceived as a tool whose value was in what it let you not do, and if you’re going to change it to let you do that, maybe you should take a step back and ask whether you should do something else.
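For anyone who hasn’t run into the local-versus-global distinction, here is a toy Python sketch of the trade-off. It is not how Cassandra or Scylla actually implement their indexes; the `Cluster` class and its methods are hypothetical, rows are insert-only, and “contacting a node” is just a counter.

```python
from collections import defaultdict


class Cluster:
    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.rows = [dict() for _ in range(num_nodes)]
        # Local index: each node indexes only the rows it stores.
        self.local_idx = [defaultdict(set) for _ in range(num_nodes)]
        # Global index: partitioned by the indexed value, not the row key.
        self.global_idx = [defaultdict(set) for _ in range(num_nodes)]

    def _node_for(self, key):
        return hash(key) % self.num_nodes

    def insert(self, key, value):
        node = self._node_for(key)
        self.rows[node][key] = value
        self.local_idx[node][value].add(key)  # same node as the row
        # The global index entry usually lives on a different node,
        # so every write becomes a multi-node write.
        self.global_idx[self._node_for(value)][value].add(key)

    def find_local(self, value):
        """Return (matching keys, nodes contacted) using local indexes."""
        keys = set()
        for idx in self.local_idx:  # no way to know who has matches: ask all
            keys |= idx.get(value, set())
        return keys, self.num_nodes

    def find_global(self, value):
        """Return (matching keys, index nodes contacted) using the global index."""
        node = self._node_for(value)
        return set(self.global_idx[node].get(value, set())), 1


if __name__ == "__main__":
    c = Cluster(8)
    for user_id in range(100):
        c.insert(user_id, "nyc" if user_id % 2 else "sf")
    print(c.find_local("nyc")[1], c.find_global("nyc")[1])
```

Reads get cheaper with the global index, but `insert` now touches two nodes, which is the kind of cross-partition coordination the original design let you avoid.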
MemSQL has a free tier now, for up to 128GB of the Mem. Heptio was acquired by VMware. Andy Pavlo has an update to his guide to applying for PhDs in databases. What he neglects to mention is that if you really want to do a databases PhD, you should seriously consider working in distributed systems! There’s more money, SOSP is in much better locations than SIGMOD, and there’s the huge benefit of doing the same database research while willfully ignoring the prior work (this situation is not unique to databases). Finally, if you’re looking for Raft shade, get it in two flavors from Joe Hellerstein or from Gün Sirer, take your pick. (Disclosure, the quoted tweets are comments on statements by my employer, who sells insect-themed Rafts at scale).
To contact the author of this story email firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, or heck, email@example.com.