
sql/kv: better story for fast churn / repaving of nodes wrt node IDs #47470

Open
knz opened this issue Apr 14, 2020 · 7 comments
Labels
A-kv-decom-rolling-restart Decommission and Rolling Restarts C-investigation Further steps needed to qualify. C-label will change. T-kv KV Team

Comments

@knz
Contributor

knz commented Apr 14, 2020

There are users out there who think it's a good idea to regularly drop a node and create a new node to replace it. This is also called "churning" or "repaving".

It seems OK on the surface (it fits nicely in the CockroachDB vision), but there's a technical gotcha: every new node gets a fresh node ID.

The system was not designed for, nor thoroughly tested with, the regular addition of new node IDs. (Indeed, certain parts of CockroachDB assume that node IDs remain "small", even though the data type is 32-bit.)

We need to create a list of features / mechanisms that rely on node IDs and audit them for resource consumption or correctness issues that grow with the total number of node IDs issued:

  • unique_rowid() combines some bits of a timestamp with some bits of the node ID. Beyond a certain value, we lose the pseudo time-ordering of the rowid column. (This seems like a major gotcha.)
  • memory and disk usage (think: node descriptors that linger)
  • network traffic (think: gossip)
  • keying of certain data structures (think: caches, lookup accelerators, etc)
  • generation of certain IDs (txn IDs, SQL session IDs, query cancellation IDs, etc)

Once we have this list, we need to create dedicated tests that inject large numbers of node IDs and exercise these items to verify that they behave in a way that's reasonable and expected.

For context, a 30-node cluster repaved every day (e.g. to follow OS upgrades) reaches node ID 2^15 after just 3 years. We don't want a user/customer knocking at our door saying "I have this 3-year-old deployment that was just fine a minute ago, and today all my traffic has stopped."
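As a quick sanity check of that figure, the arithmetic can be sketched in a few lines of Go (assuming one full repave of the 30-node cluster per day, i.e. 30 fresh node IDs issued daily):

```go
package main

import "fmt"

func main() {
	const nodesPerDay = 30 // one full repave of a 30-node cluster each day
	const days = 3 * 365   // three years of daily repaving
	issued := nodesPerDay * days
	// 32850 node IDs issued, which is just past 2^15 = 32768.
	fmt.Println(issued, issued > 1<<15) // 32850 true
}
```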

Jira issue: CRDB-4408

@knz knz added the C-investigation Further steps needed to qualify. C-label will change. label Apr 14, 2020
@knz
Contributor Author

knz commented Apr 14, 2020

cc @petermattis - for a particular customer going into deployment this year, I would really prefer if we could side-step this issue entirely and have them reuse their data directories.

@knz
Contributor Author

knz commented Apr 14, 2020

@bdarnell can you help me work through this? Andy K would like me to help make node additions more robust, and I think I don't understand node IDs enough.

@bdarnell
Contributor

> For context a 30 node cluster repaved every day (e.g. to follow OS upgrades) means that we get node ID 2^15 after just 3 years.

I believe the particular customer you're referring to is planning on a 14-day repaving cadence. If you're replacing every node every day you're going to have more immediate problems with the speed of decommissioning and rebalancing. So the 2^15 mark isn't quite so close, but it's still a good time to be thinking about these things.

@bdarnell
Contributor

For unique_rowid():

This function was designed with the assumption that node IDs would generally be small, although I think the actual requirement is that the difference between the max node ID and the minimum live node ID be less than 2^15. So uniform repaving actually fares pretty well here by refreshing every node's ID. I don't think there'd be a problem here, although I could be missing something.

What happens if the assumptions are violated and node IDs are too far apart (say you churn through a lot of nodes but also let some of your older nodes stick around; maybe some sort of auto-scaling setup with a minimum cluster size, so that the first N nodes never churn)?

If node IDs are not unique modulo 2^15, there is a chance unique_rowid() will return values that are not unique. (This function is roughly 15 bits of node ID combined with a timestamp at 10µs resolution; I think that means that if you're generating 1000 unique_rowids per node per second, about 1% of them will collide.) If that happens, the main consequences would be surprising unique constraint violations on tables using SERIAL primary keys, or, even more surprisingly, on tables with no PK at all (since our hidden rowid column uses unique_rowid() to generate its values). Retrying these inserts would (probably) work, although the unique constraint violation error doesn't look retryable.
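The modulo-2^15 hazard can be sketched in Go. The bit layout here is an assumption reconstructed from the description above (timestamp shifted left 15 bits, XORed with the node ID), not taken from the CockroachDB source, but it illustrates why node IDs that are equal mod 2^15 can produce colliding values:

```go
package main

import "fmt"

// makeRowID sketches a unique_rowid()-style value under an assumed
// layout: a 10µs-resolution timestamp shifted left 15 bits, XORed with
// the node ID. The low 15 bits come from the node ID; any node ID bits
// above bit 15 flip bits of the timestamp portion instead.
func makeRowID(timestamp10us uint64, nodeID uint32) uint64 {
	return (timestamp10us << 15) ^ uint64(nodeID)
}

func main() {
	// Two nodes whose IDs are equal modulo 2^15 (7 and 7+2^15).
	// Node ID bit 15 lands on the timestamp's lowest bit, so the
	// two nodes produce identical IDs at adjacent 10µs ticks.
	a := makeRowID(1000, 7)
	b := makeRowID(1001, 7+1<<15)
	fmt.Println(a == b) // true: a cross-node collision
}
```

This also shows the second effect described below: the extra node ID bits perturb the timestamp field, so two distinct generation times can map to the same ID, much like an added clock offset.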

Additionally, the high bits of the node ID (above 15) get mixed in with the timestamp, which helps a little bit with uniqueness, but means that the IDs are a little less likely to be monotonic across nodes. This acts a little like increasing clock offsets: 300k node IDs would be kind of like another millisecond of clock offset when determining the chance of one node generating a unique ID behind one generated previously. I'm not worried about this.

It's not clear that we can do a better job of generating unique 64-bit values locally on each node if node IDs can exceed 15 bits. Applications for which this is a concern can work around it in several ways:

  • Check for unique constraint violations when inserting serial values and retry them
  • Avoid the use of SERIAL unless it is set to be backed by a SEQUENCE instead of unique_rowid. Use a 128-bit value instead: currently in CRDB that would be a UUID; implementing ULID (builtins: consider adding ULID-based UUID generator #24733) may be useful if you want the ordering properties of SERIAL without the size constraint.
  • Ensure that all tables have an explicit PK to avoid the implicit serial rowid.

On the CRDB side, we could potentially alleviate problems here by defaulting to a UUID or ULID for tables without a PK, and auditing our own system tables to make sure that there are no problematic uses of SERIAL.

@knz
Contributor Author

knz commented Apr 14, 2020

(For completeness I'd like to also explore directions where we reuse old node IDs after their original nodes have been fully decommissioned. Does this even make sense?)

@bdarnell
Contributor

Yes, reusing old node IDs is worth discussing. I'm not sure exactly what that would require, but I think it's doable. Off the top of my head, the main things you'd need to do are ensure that the old node ID is no longer referenced in any range descriptors, and guarantee that two new nodes won't try to reuse the same old node ID.

@bdarnell
Contributor

One more component for the list: the timeseries system and the web UI.
