sql/kv: better story for fast churn / repaving of nodes wrt node IDs #47470
cc @petermattis - for a particular customer going into deployment this year, I would really prefer if we could side-step this issue entirely and have them reuse their data directories.
@bdarnell can you help me work through this? Andy K would like me to help make node additions more robust, and I think I don't understand node IDs enough.
I believe the particular customer you're referring to is planning on a 14-day repaving cadence, not a daily one. (If you're replacing every node every day, you're going to have more immediate problems with the speed of decommissioning and rebalancing.) So the 2^15 mark isn't quite so close, but it's still a good time to be thinking about these things.
For `unique_rowid()`: this function was designed with the assumption that node IDs would generally be small, although I think the actual requirement is that the difference between the maximum node ID and the minimum live node ID be less than 2^15. So uniform repaving actually fares pretty well here, since it refreshes every node's ID; I don't think there'd be a problem, although I could be missing something.

What happens if that assumption is violated and node IDs are too far apart (say you churn through a lot of nodes but also let some of your older nodes stick around, e.g. an auto-scaling setup with a minimum cluster size so that the first N nodes never churn)? If the live node IDs are not unique modulo 2^15, there is a chance of two nodes generating the same rowid. Additionally, the high bits of the node ID (above bit 15) get mixed in with the timestamp, which helps a little with uniqueness but means the IDs are a little less likely to be monotonic across nodes. This acts a little like increasing clock offsets: 300k node IDs would be roughly like another millisecond of clock offset when estimating the chance of one node generating an ID that sorts behind one generated previously. I'm not worried about this; it's not clear we can do a better job of generating unique 64-bit values locally on each node if node IDs can exceed 15 bits. Applications for which this is a concern can work around it in several ways.
On the CRDB side, we could potentially alleviate problems here by defaulting to a UUID or ULID for tables without a PK, and auditing our own system tables to make sure that there are no problematic uses of SERIAL.
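To make the modulo-2^15 point concrete, here is a minimal Go sketch assuming the simplified bit layout described above (low 15 bits carry the node ID, the remaining bits carry a coarse timestamp, and node ID bits above bit 15 are mixed into the timestamp portion). This is not the actual CockroachDB implementation, only an illustration of how two nodes whose IDs are equal modulo 2^15 can produce the same value:

```go
package main

import "fmt"

// rowID is a simplified sketch of the layout described above, not the actual
// CockroachDB implementation: low 15 bits carry the node ID, the remaining
// bits carry a coarse timestamp, and node ID bits above bit 15 are mixed
// (XOR'd) into the timestamp portion.
func rowID(nodeID uint32, timestamp uint64) uint64 {
	return (timestamp << 15) ^ uint64(nodeID)
}

func main() {
	const ts = 1_000_000 // arbitrary even coarse timestamp
	a := rowID(42, ts+1)     // node 42 generates a value one tick later...
	b := rowID(42+1<<15, ts) // ...node 42+2^15 generates one a tick earlier.
	fmt.Println(a == b)      // true: two different nodes produce the same value
}
```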
(For completeness I'd like to also explore directions where we reuse old node IDs after their original nodes have been fully decommissioned. Does this even make sense?)
Yes, reusing old node IDs is worth discussing. I'm not sure what exactly that might require, but I think it's doable. Off the top of my head, the main things you'd need are to ensure that the old node ID is no longer referenced in any range descriptor, and to guarantee that two new nodes won't try to reuse the same old node ID.
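For illustration only, a purely hypothetical Go sketch of those two preconditions; none of these names exist in CockroachDB, and the in-memory maps merely stand in for real cluster metadata:

```go
package main

import "fmt"

// Purely hypothetical sketch of the two preconditions discussed above for
// reusing a decommissioned node's ID.
type cluster struct {
	rangeDescRefs map[int32]int  // node ID -> range descriptors still referencing it
	claimed       map[int32]bool // node ID -> already claimed by a joining node
}

// tryReuseNodeID succeeds only if the old ID is no longer referenced by any
// range descriptor and no other joining node has claimed it yet.
func (c *cluster) tryReuseNodeID(oldID int32) bool {
	if c.rangeDescRefs[oldID] > 0 {
		return false // still referenced somewhere; not safe to reuse
	}
	if c.claimed[oldID] {
		return false // another new node already grabbed this ID
	}
	// In a real cluster this claim would have to be an atomic conditional
	// write so that two joining nodes can never both succeed.
	c.claimed[oldID] = true
	return true
}

func main() {
	c := &cluster{
		rangeDescRefs: map[int32]int{3: 0, 4: 2},
		claimed:       map[int32]bool{},
	}
	fmt.Println(c.tryReuseNodeID(3)) // true: fully decommissioned, unclaimed
	fmt.Println(c.tryReuseNodeID(3)) // false: already claimed by another joiner
	fmt.Println(c.tryReuseNodeID(4)) // false: still referenced by a descriptor
}
```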
One more component for the list: the timeseries system and the web UI.
There are users out there who think it's a good idea to regularly drop a node and create a new node to replace it. It's also called "churning" or "repaving".
It seems OK on the surface (it fits nicely with the CockroachDB vision), but there's a technical gotcha: every new node gets a fresh node ID.
CockroachDB was neither designed for nor thoroughly tested with the regular addition of new node IDs. (In fact, certain parts of the code assume that node IDs remain "small", even though the data type is 32-bit.)
We need to create a list of features / mechanisms that are reliant on node ID and audit them for resource consumption or correctness issues that grow with the total number of node IDs issued:
- `unique_rowid()` shares some bits of the timestamp with some bits of the node ID. Beyond a certain value, we lose the pseudo time-ordering of the `rowid` column. (This seems like a major gotcha.)

Once we have this list, we need to create dedicated tests that inject large numbers of node IDs and exercise these items to verify that they behave in a way that's reasonable and expected.
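As a rough illustration of what such a test could check, here is a hedged Go sketch, again assuming the simplified bit layout (low 15 bits node ID, higher bits a coarse timestamp, node ID bits above bit 15 XOR'd into the timestamp); it injects a node ID above 2^15 and shows the pseudo time-ordering of `rowid` breaking:

```go
package main

import "fmt"

// rowID is the same simplified sketch as above (not CockroachDB's actual
// code): low 15 bits node ID, higher bits a coarse timestamp, node ID bits
// above bit 15 XOR'd into the timestamp portion.
func rowID(nodeID uint32, timestamp uint64) uint64 {
	return (timestamp << 15) ^ uint64(nodeID)
}

func main() {
	earlier := rowID(1, 101)     // small node ID, value generated first
	later := rowID(1+1<<16, 102) // node ID above 2^15, generated one tick later
	// The large node ID flips a timestamp bit, so the later value sorts before
	// the earlier one: pseudo time-ordering is lost once IDs exceed 15 bits.
	fmt.Println(later < earlier) // true
}
```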
For context, a 30-node cluster repaved every day (e.g. to follow OS upgrades) means we hit node ID 2^15 after just 3 years. We don't want a user/customer knocking at our door saying “I have this 3-year-old deployment that was just fine a minute ago, and today all my traffic has stopped.”
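A back-of-the-envelope check of that claim (assuming roughly 30 fresh node IDs allocated per day, as in the scenario above):

```go
package main

import "fmt"

func main() {
	// A 30-node cluster repaved daily allocates roughly 30 fresh node IDs/day.
	idsPerDay := 30
	days := 3 * 365
	fmt.Println(idsPerDay*days, 1<<15) // 32850 vs 32768: 2^15 crossed in ~3 years
}
```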
Jira issue: CRDB-4408