
sql/kv: better story for fast churn / repaving of nodes wrt node IDs #47470

Open
knz opened this issue Apr 14, 2020 · 7 comments
Labels
A-kv-decom-rolling-restart Decommission and Rolling Restarts C-investigation Further steps needed to qualify. C-label will change. T-kv KV Team

Comments

@knz
Contributor

knz commented Apr 14, 2020

There are users out there who think it's a good idea to regularly drop a node and create a new node to replace it. This is also called "churning" or "repaving".

It seems OK on the surface (it fits nicely in the CockroachDB vision), but there's a technical gotcha: every new node gets a fresh node ID.

The system was not designed for, nor thoroughly tested with, the regular addition of new node IDs. (Indeed, certain parts of CockroachDB assume that node IDs remain "small", even though the data type is 32-bit.)

We need to create a list of features / mechanisms that rely on node IDs and audit them for resource consumption or correctness issues that grow with the total number of node IDs issued:

  • unique_rowid() combines some bits of a timestamp with some bits of the node ID. Beyond a certain value, we lose the pseudo time-ordering of the rowid column. (This seems like a major gotcha.)
  • memory and disk usage (think: node descriptors that linger)
  • network traffic (think: gossip)
  • keying of certain data structures (think: caches, lookup accelerators, etc)
  • generation of certain IDs (txn IDs, SQL session IDs, query cancellation IDs, etc)

Once we have this list, we need to create dedicated tests that inject large numbers of node IDs and exercise these items to verify that they behave in a way that's reasonable and expected.

For context, a 30-node cluster repaved every day (e.g. to follow OS upgrades) reaches node ID 2^15 after just 3 years. We don't want a user/customer knocking at our door saying "I have this 3-year-old deployment that was just fine a minute ago, and today all my traffic has stopped."
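As a quick sanity check of that figure, the arithmetic can be sketched in a few lines of Go (assuming one full repave of the 30-node cluster per day, i.e. 30 fresh node IDs issued daily):

```go
package main

import "fmt"

func main() {
	const nodesPerDay = 30 // one full repave of a 30-node cluster each day
	const days = 3 * 365   // three years of daily repaving
	issued := nodesPerDay * days
	// 32850 node IDs issued, which is just past 2^15 = 32768.
	fmt.Println(issued, issued > 1<<15) // 32850 true
}
```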

Jira issue: CRDB-4408

@knz knz added the C-investigation Further steps needed to qualify. C-label will change. label Apr 14, 2020
@knz
Contributor Author

knz commented Apr 14, 2020

cc @petermattis - for a particular customer going into deployment this year, I would really prefer if we could side-step this issue entirely and have them reuse their data directories.

@knz
Contributor Author

knz commented Apr 14, 2020

@bdarnell can you help me work through this? Andy K would like me to help make node additions more robust, and I think I don't understand node IDs enough.

@bdarnell
Contributor

> For context a 30 node cluster repaved every day (e.g. to follow OS upgrades) means that we get node ID 2^15 after just 3 years.

I believe the particular customer you're referring to is planning on a 14-day repaving cadence. If you're replacing every node every day you're going to have more immediate problems with the speed of decommissioning and rebalancing. So the 2^15 mark isn't quite so close, but it's still a good time to be thinking about these things.

@bdarnell
Contributor

For unique_rowid():

This function was designed with the assumption that node IDs would generally be small, although I think the actual requirement is that the difference between the max node ID and the minimum live node ID be less than 2^15. So uniform repaving actually fares pretty well here by refreshing every node's ID. I don't think there'd be a problem here, although I could be missing something.

What happens if the assumptions are violated and node IDs are too far apart (say you churn through a lot of nodes but also let some of your older nodes stick around; maybe some sort of auto-scaling setup with a minimum cluster size, so that the first N nodes never churn)?

If node IDs are not unique modulo 2^15, there is a chance unique_rowid() will return values that are not unique. (This function is roughly 15 bits of node ID combined with a timestamp at 10µs resolution; I think that means that if you're generating 1000 unique_rowids per node per second, about 1% of them will collide.) If that happens, the main consequences would be surprising unique constraint violations on tables using SERIAL primary keys, or, even more surprisingly, on tables with no PK at all (since our hidden rowid column uses unique_rowid() to generate its values). Retrying these inserts would (probably) work, although the unique constraint violation error doesn't look retryable.
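The modulo-2^15 hazard can be sketched in Go. The bit layout here is an assumption reconstructed from the description above (timestamp shifted left 15 bits, XORed with the node ID), not taken from the CockroachDB source, but it illustrates why node IDs that are equal mod 2^15 can produce colliding values:

```go
package main

import "fmt"

// makeRowID sketches a unique_rowid()-style value under an assumed
// layout: a 10µs-resolution timestamp shifted left 15 bits, XORed with
// the node ID. The low 15 bits come from the node ID; any node ID bits
// above bit 15 flip bits of the timestamp portion instead.
func makeRowID(timestamp10us uint64, nodeID uint32) uint64 {
	return (timestamp10us << 15) ^ uint64(nodeID)
}

func main() {
	// Two nodes whose IDs are equal modulo 2^15 (7 and 7+2^15).
	// Node ID bit 15 lands on the timestamp's lowest bit, so the
	// two nodes produce identical IDs at adjacent 10µs ticks.
	a := makeRowID(1000, 7)
	b := makeRowID(1001, 7+1<<15)
	fmt.Println(a == b) // true: a cross-node collision
}
```

This also shows the second effect described below: the extra node ID bits perturb the timestamp field, so two distinct generation times can map to the same ID, much like an added clock offset.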

Additionally, the high bits of the node ID (above 15) get mixed in with the timestamp, which helps a little bit with uniqueness, but means that the IDs are a little less likely to be monotonic across nodes. This acts a little like increasing clock offsets: 300k node IDs would be kind of like another millisecond of clock offset when determining the chance of one node generating a unique ID behind one generated previously. I'm not worried about this.

It's not clear that we can do a better job of generating unique 64-bit values locally on each node if node IDs can exceed 15 bits. Applications for which this is a concern can work around it in several ways:

  • Check for unique constraint violations when inserting serial values and retry them
  • Avoid the use of SERIAL unless it is set to be backed by a SEQUENCE instead of unique_rowid. Use a 128-bit value instead: currently in CRDB that would be a UUID; implementing ULID (builtins: consider adding ULID-based UUID generator #24733) may be useful if you want the ordering properties of SERIAL without the size constraint.
  • Ensure that all tables have an explicit PK to avoid the implicit serial rowid.

On the CRDB side, we could potentially alleviate problems here by defaulting to a UUID or ULID for tables without a PK, and auditing our own system tables to make sure that there are no problematic uses of SERIAL.

@knz
Contributor Author

knz commented Apr 14, 2020

(For completeness I'd like to also explore directions where we reuse old node IDs after their original nodes have been fully decommissioned. Does this even make sense?)

@bdarnell
Contributor

Yes, reusing old node IDs is worth discussing. I'm not sure exactly what that would require, but I think it's doable. Off the top of my head, the main things you'd need to do are ensure that the old node ID is no longer referenced in any range descriptors, and guarantee that two new nodes won't try to reuse the same old node ID.

@bdarnell
Contributor

One more component for the list: the timeseries system and the web UI.
