First time seeing someone call Spanner, CockroachDB, and YugabyteDB a "distribut...

seedless-sensat · on Jan 9, 2024

I think this con is very real:

> Related tables and indexes are not necessarily stored together, meaning typical operations such as joins and evaluating foreign keys or even simple index lookups might incur an excessive number of internal network hops. The relatively strong transactional guarantees that involve additional locks and coordination can also become a drag on performance.

You handwaved this away saying you can just store an entire table on a single node, but that defeats many of the benefits of these sharded SQL databases.

Edit: Also, before attacking the author's biases, it seems fair to disclose you appear to work at Yugabyte

franckpachot · on Jan 9, 2024

In the case of YugabyteDB, here is how we avoid "excessive number" of networks hops

- true Index Only Scan. PostgreSQL doesn't store the MVCC visibility in indexes and have to look at the table even in case of Index Only Scan. YugabyteDB has a different implementation of MVCC with no bloat, no vacuum and true Index Only Scan. Here is an example: https://dev.to/yugabyte/boosts-secondary-index-queries-with-... This is also used for reference table (duplicate covering indexes in each regions)

- Batching reads and writes. It is not a problem to add 10ms because you join two tables or check a foreign key. What would be problematic is doing that for each rows. YugabyteDB batches the read/write operations as much as possible. Here are two examples: https://dev.to/franckpachot/series/25365

- Pushdowns to avoid sending rows that are discarded later. Each node can apply PostgreSQL expressions to offload filtering to the storage nodes. Examples: https://dev.to/yugabyte/yugabytedb-predicate-push-down-pbb

- Loose index scan. With YugabyteDB LSM-Tree indexes, one index scan can read multiple ranges, which avoids multiple roundtrips. An example: https://dev.to/yugabyte/select-distinct-pushdown-to-do-a-loo...

- Locality of transaction table. If a transaction touches to only one node, or zone, or region, a local transaction table is used, and is promoted to the right level depending on what the transaction reads and writes.

Most of the times when I've seen people asking to store tables together, it was premature optimization, based on opinions rather than facts. When they try (with the right indexes of course) they appreciate that the distribution is an implementation detail that the application doesn't have to know. Of course, there are more and more optimizations in each release. If you have a PostgreSQL application and see low performance, please open a git issue.

I'm also working for Yugabyte as Developer Advocate. I don't always feel the need to precise it as I'm writing about facts, not marketing opinions, and who pays my salary has no influence on the response time I see in execution plans ;)

fuy · on Jan 9, 2024

Hey Franck, just wanted to say I appreciate your database writings. I read a whole bunch of them over the years, and always found them interesting and educational.

magden · on Jan 9, 2024

> You handwaved this away, saying you can just store an entire table on a single node, but that defeats many of the benefits of these sharded SQL databases.

I just clarified one-liners listed under the closing "Cons" section. My intention was not to say that the author is utterly wrong. Marco is a recognized expert in the Postgres community. It only feels like he was too opinionated about distributed SQL DBs while wearing his Citus hat.

> Also, before attacking the author's biases, it seems fair to disclose that you appear to work at Yugabyte.

I'm sorry if I sounded biased in my response. I'm with the YugabyteDB team right now, but that's not my first and I bet not the last database company. Thus, when I respond on my personal accounts, I try to be as objective as possible and don't bother mentioning my current employment.

Anyway, I'm very positive to see that this article got traction on HN. As a Postgres community member, I truly love what's happening with the database and its ecosystem. The healthy competition within the Postgres ecosystem is a big driver for the database growth that's becoming the Linux of databases.

dikei · on Jan 9, 2024

>> First time seeing someone call Spanner, CockroachDB, and YugabyteDB a "distributed key-value store with SQL"

That was the first thing come to my mind when I read the paper on Spanner and CockroachDB (haven't read the paper on YugabyteDB yet) though, and surely I'm not the only one.

franckpachot · on Jan 10, 2024

Referring to Distributed SQL as a key-value store is like defining monolithic databases as a block storage. It reduces it to an internal structure. What makes it a database is what is on top: ACID transaction, full SQL features, relational tables, JSON document, foreign keys,... which are not available when sharding is done on top of SQL