-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql/catalog/lease: high priority reads in lease acquisition of OFFLINE descs may starve bulk operations #61798
Comments
As an update, a reproduction was confirmed following the steps above. The command cockroach/pkg/sql/schema_changer.go Line 391 in c9dd3b1
|
The short term fix which we'll backport will be to lease descriptors in the |
Fixes: cockroachdb#61798 Previously, offline descriptors would never have their leases cached and they would be released once the reference count hit zero. This was inadequate because when attempting to online these tables again the lease acquisition could be pushed back by other operations, leading to starvation / live locks. To address this, this patch will allow the leases of offline descriptors to be cached. Release note (bug fix): Lease acquisitions of descriptor in a offline state may starve out bulk operations (backup / restore)
Fixes: cockroachdb#61798 Previously, offline descriptors would never have their leases cached and they would be released once the reference count hit zero. This was inadequate because when attempting to online these tables again the lease acquisition could be pushed back by other operations, leading to starvation / live locks. To address this, this patch will allow the leases of offline descriptors to be cached. Release note (bug fix): Lease acquisitions of descriptor in a offline state may starve out bulk operations (backup / restore)
62611: sql: lease acquisition of OFFLINE descs may starve bulk operations r=ajwerner a=fqazi Fixes: #61798 Previously, offline descriptors would never have their leases cached and they would be released once the reference count hit zero. This was inadequate because when attempting to online these tables again the lease acquisition could be pushed back by other operations, leading to starvation / live locks. To address this, this patch will allow the leases of offline descriptors to be cached. Release note (bug fix): Lease acquisitions of descriptor in a offline state may starve out bulk operations (backup / restore) 62764: sql: gate global_reads zone attribute on cluster version and enterprise license r=nvanbenschoten a=nvanbenschoten Fixes #61039. This commit gates the use of the global_reads zone configuration attribute on the cluster version and on the use of an enterprise license. The first half of this is necessary for correctness. The second half is not, because follower reads won't be served from global read followers without an enterprise license, but a check when the zone configs are set improves UX. 62778: geo/geomfn: refactor logic for point in polygon optimization r=otan a=andyyang890 Release note: None 62815: build: upgrade go to 1.15.10 r=otan a=rickystewart Pick up a patch to a compiler bug (see #62521). * [ ] Adjust the Pebble tests to run in new version. * [x] Adjust version in Docker image ([source](./builder/Dockerfile)). * [x] Rebuild and push the Docker image (following [Basic Process](#basic-process)) * [x] Bump the version in `WORKSPACE` under `go_register_toolchains`. You may need to bump [rules_go](https://github.com/bazelbuild/rules_go/releases). * [x] Bump the version in `builder.sh` accordingly ([source](./builder.sh#L6)). * [ ] Bump the version in `go-version-check.sh` ([source](./go-version-check.sh)), unless bumping to a new patch release. * [ ] Bump the go version in `go.mod`. You may also need to rerun `make vendor_rebuild` if vendoring has changed. * [x] Bump the default installed version of Go in `bootstrap-debian.sh` ([source](./bootstrap/bootstrap-debian.sh#L40-42)). * [x] Replace other mentions of the older version of go (grep for `golang:<old_version>` and `go<old_version>`). * [ ] Update the `builder.dockerImage` parameter in the TeamCity [`Cockroach`](https://teamcity.cockroachdb.com/admin/editProject.html?projectId=Cockroach&tab=projectParams) and [`Internal`](https://teamcity.cockroachdb.com/admin/editProject.html?projectId=Internal&tab=projectParams) projects. * [ ] Adjust `GO_VERSION` in the TeamCity agent image ([setup script](./packer/teamcity-agent.sh)) and ask the Developer Infrastructure team to deploy new images. Resolves #62809. Release note (general change): Upgrade the CRDB binary to Go 1.15.10. 62823: schemaexpr: remove inapplicable TODO r=mgartner a=mgartner Release note: None Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Andy Yang <ayang@cockroachlabs.com> Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com> Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Fixes: cockroachdb#61798 Previously, offline descriptors would never have their leases cached and they would be released once the reference count hit zero. This was inadequate because when attempting to online these tables again the lease acquisition could be pushed back by other operations, leading to starvation / live locks. To address this, this patch will allow the leases of offline descriptors to be cached. Release note (bug fix): Lease acquisitions of descriptor in a offline state may starve out bulk operations (backup / restore)
62954: sql: Re-enable multi_region_backup test r=arulajmani a=ajstorm With #60835 merged, this test no longer flakes. I've stressed it on my GCE worker now for a while an it's all good. Resolves #60773. Release note: None 62959: sql: lease acquisition of OFFLINE descs may starve bulk operations r=ajwerner a=fqazi Fixes: #61798 Previously, offline descriptors would never have their leases cached and they would be released once the reference count hit zero. This was inadequate because when attempting to online these tables again the lease acquisition could be pushed back by other operations, leading to starvation / live locks. To address this, this patch will allow the leases of offline descriptors to be cached. Release note (bug fix): Lease acquisitions of descriptor in a offline state may starve out bulk operations (backup / restore) Co-authored-by: Adam Storm <storm@cockroachlabs.com> Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Fixes: cockroachdb#61798 Previously, offline descriptors would never have their leases cached and they would be released once the reference count hit zero. This was inadequate because when attempting to online these tables again the lease acquisition could be pushed back by other operations, leading to starvation / live locks. To address this, this patch will allow the leases of offline descriptors to be cached. Release note (bug fix): Lease acquisitions of descriptor in a offline state may starve out bulk operations (backup / restore)
Fixes: cockroachdb#61798 Previously, offline descriptors would never have their leases cached and they would be released once the reference count hit zero. This was inadequate because when attempting to online these tables again the lease acquisition could be pushed back by other operations, leading to starvation / live locks. To address this, this patch will allow the leases of offline descriptors to be cached. Release note (bug fix): Lease acquisitions of descriptor in a offline state may starve out bulk operations (backup / restore)
Fixes: cockroachdb#61798 Previously, offline descriptors would never have their leases cached and they would be released once the reference count hit zero. This was inadequate because when attempting to online these tables again the lease acquisition could be pushed back by other operations, leading to starvation / live locks. To address this, this patch will allow the leases of offline descriptors to be cached. Release note (bug fix): Lease acquisitions of descriptor in a offline state may starve out bulk operations (backup / restore)
Describe the problem
In (#46170, #46384) we made changes to use high priority read transaction in lease acquisitions in order to break deadlocks. This solution isn't wonderful: if users use
PRIORITY HIGH
on their DDL transactions, they can still hit these deadlocks. Given people tend not to do that, this proved to be a pretty good and pragmatic solution to a very serious problem.When code attempts to resolve descriptors it goes through the leasing mechanism. This is true even if the descriptor does not exist. This is made much worse in a case where you've got a multi-region cluster.
The scenario that causes specific pain is when you've got an application running and trying to hit some tables but you're in the process of restoring those tables. As part of the restore, we need to update the descriptors. If the restore job is far from the descriptor table leaseholder, then there's a long period of time for interceding queries to push the intent. After getting pushed, the query will need to refresh. If the pushing is happening more frequently than the refresh's duration, we'll get stuck in livelock.
For what it's worth, all of this is modestly speculative. We saw the pushing but I don't know that we've captured enough data.
To Reproduce
In theory, start a restore in a global cluster and get the leaseholder on the system config span in a different region than the restore coordinator. Then spam the database being restored and see if you can hold off the restore.
Some ideas
Expected behavior
The restore finishes.
The text was updated successfully, but these errors were encountered: