server: fix a race in tenant creation #107666
Previously, `scanTenantsForRunnableServices()` was not holding the mutex when SELECTing for the existing tenant names, which means that the following may happen:

- `scanTenantsForRunnableServices()` sees that only the system tenant exists.
- `createServerEntryLocked()` then adds another tenant while holding the mutex.
- `scanTenantsForRunnableServices()` takes the lock and stops the tenant that was just created, because only the system tenant should be alive (which is wrong).

This patch changes `scanTenantsForRunnableServices()` to take the mutex before SELECTing for the existing tenants in order to avoid the race.

Epic: none
Fixes: cockroachdb#107434
Release note: None
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Good catch
Let's merge this; I need it in a different PR too!

bors r+
This PR was included in a batch that timed out; it will be automatically retried.
Build failed (retrying...)
Build succeeded.
107820: db-console: delete unused vars and enforce eslint rule r=maryliag a=xinhaoz

This commit turns the eslint rule `no-unused-vars` into an error. It removes all unused vars in the db-console application.

Epic: none
Release note: None

107824: server: prevent deadlocks in server orchestration r=lidorcarmel,andrewbaptist a=knz

Fixes #107564.
Fixes #107791.
Supersedes #107666.

The previous fix in this area (5ca5703) correctly identified the case where `createServerEntryLocked()` was called concurrently with `scanTenantsForRunnableServices()`, in which case we ran the risk of immediately tearing down the new server because it hadn't been picked up by `getExpectedRunningTenants()`.

However, the fix was incorrect: it caused the controller mutex to be held through `getExpectedRunningTenants()`, which itself can hang. In that case, a cascading failure could result.

This patch changes the fix (and thus continues to solve the original problem) by ensuring we only look at entries to remove that existed prior to the call to `getExpectedRunningTenants()`. No mutex needs to be held across that call.

Release note: None
Epic: CRDB-28893

Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
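The snapshot approach from #107824 can be sketched in Go as follows. This is a hypothetical simplification, not the actual CockroachDB implementation: `controller`, `serverEntry`, `createTenant` and the `getExpected` parameter (standing in for `getExpectedRunningTenants()`) are illustrative names. The sketch takes the mutex only briefly to copy the existing entries and again to remove stale ones, and never holds it while the expected set is computed. The pointer-identity check before deleting is an extra assumption of this sketch, not something stated in the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// serverEntry is a hypothetical placeholder for a running tenant server.
// It has a field so that distinct entries have distinct addresses.
type serverEntry struct{ name string }

// controller is a hypothetical, simplified server controller.
type controller struct {
	mu      sync.Mutex
	entries map[string]*serverEntry // guarded by mu
}

// createServerEntryLocked registers a server; the caller must hold c.mu.
func (c *controller) createServerEntryLocked(name string) *serverEntry {
	e := &serverEntry{name: name}
	c.entries[name] = e
	return e
}

// createTenant is a hypothetical caller that acquires c.mu itself.
func (c *controller) createTenant(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.createServerEntryLocked(name)
}

// scanTenantsForRunnableServices sketches the snapshot-based fix.
// getExpected may block for a long time, so c.mu is not held while it runs.
func (c *controller) scanTenantsForRunnableServices(getExpected func() map[string]bool) {
	// 1. Snapshot the entries that exist before the (possibly slow) lookup.
	c.mu.Lock()
	before := make(map[string]*serverEntry, len(c.entries))
	for name, e := range c.entries {
		before[name] = e
	}
	c.mu.Unlock()

	// 2. Compute the expected set without holding c.mu.
	expected := getExpected()

	// 3. Only snapshotted entries are removal candidates, so a server
	// created during step 2 is never torn down by this pass.
	c.mu.Lock()
	defer c.mu.Unlock()
	for name, e := range before {
		// Identity check (sketch assumption): skip an entry that was
		// replaced by a new server with the same name in the meantime.
		if !expected[name] && c.entries[name] == e {
			delete(c.entries, name)
		}
	}
}

func main() {
	c := &controller{entries: map[string]*serverEntry{}}
	c.createTenant("system")
	c.createTenant("old") // no longer expected to run

	c.scanTenantsForRunnableServices(func() map[string]bool {
		// Simulate a tenant being created during the lookup. Because c.mu
		// is not held here, this call does not block.
		c.createTenant("app")
		return map[string]bool{"system": true}
	})

	_, hasApp := c.entries["app"]
	_, hasOld := c.entries["old"]
	fmt.Println(hasApp, hasOld) // true false
}
```

Under the #107666 ordering, with the mutex held through the lookup, the same callback would deadlock, because `createTenant()` would wait on a mutex the scan already holds. That is a small-scale analogue of the hang described above.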
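Previously, scanTenantsForRunnableServices() was not holding the mutex when SELECTing for the existing tenant names, which means that the following may happen:
- scanTenantsForRunnableServices() sees that only the system tenant exists
- createServerEntryLocked() then adds another tenant while holding the mutex
- scanTenantsForRunnableServices() takes the lock and stops the tenant that was just created because only the system tenant should be alive (which is wrong)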
This patch changes scanTenantsForRunnableServices() to take the mutex before SELECTing for the existing tenants in order to avoid the race.
Epic: none
Fixes: #107434
Fixes: #107343
Fixes: #107154
Release note: None