schemachange: attempting to update succeeded job over and over #38088
cc @roncrdb
It seems like there's certainly something going wrong during truncation. We should fix that. All of that said, I'm wary that this or other bugs can get the schema changer into such a hot loop: acquire lease, hit error, release lease. It seems like it might be a good idea to rate-limit the acquisition of the lease for a given ID per node, to defend against cases like this in the future. I'm not quite sure where we should do that, since I don't really know where all the adoption stuff happens and whether there is a long-lived, stateful place we could do it in, or if we'd be better off just making a …
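For illustration, a minimal sketch of that per-node rate limit; `leaseLimiter`, `AllowAcquire`, and the `uint32` descriptor IDs are hypothetical names for this sketch, not CockroachDB APIs:

```go
// A sketch of per-descriptor, per-node rate limiting of lease acquisition,
// assuming some long-lived per-node place exists to hold the limiter.
package main

import (
	"fmt"
	"sync"
	"time"
)

type leaseLimiter struct {
	mu          sync.Mutex
	lastAttempt map[uint32]time.Time // descriptor ID -> last acquisition attempt
	minInterval time.Duration        // e.g. at most one attempt per minute per ID
}

func newLeaseLimiter(minInterval time.Duration) *leaseLimiter {
	return &leaseLimiter{
		lastAttempt: make(map[uint32]time.Time),
		minInterval: minInterval,
	}
}

// AllowAcquire reports whether this node may try to acquire the schema change
// lease for the given descriptor ID, recording the attempt if it is allowed.
func (l *leaseLimiter) AllowAcquire(id uint32) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	if last, ok := l.lastAttempt[id]; ok && now.Sub(last) < l.minInterval {
		return false // too soon after the last attempt; skip this one
	}
	l.lastAttempt[id] = now
	return true
}

func main() {
	lim := newLeaseLimiter(time.Minute)
	fmt.Println(lim.AllowAcquire(4191)) // true: first attempt for this ID
	fmt.Println(lim.AllowAcquire(4191)) // false: throttled within the minute
}
```

One attempt per minute per descriptor would turn the couple-of-failures-per-second loop described in this issue into a trickle.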
It has always seemed crazy to me to retry schema changes indefinitely on any unrecognized error, as we do. I think we should change that policy: sometimes revert the schema change, sometimes just pause it and require the admin to click a button. And separately, I think the schema change lease needs to get out of that descriptor.
Without our own reproduction, I'm wary of trying to fix this just by fixing the error itself, since I'm not completely sure how we're getting there, but I do think it'd make sense to maybe delay the on-error release of the lease by, say, a minute.
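A sketch of that delayed release, assuming hypothetical `acquireLease`/`releaseLease`/`runSchemaChange` helpers rather than the real schema-changer code:

```go
// Illustrative only: on failure, keep holding the lease for a minute before
// releasing it, so the acquire/fail/release hot loop writes at most one pair
// of descriptor updates per minute per node instead of several per second.
package main

import (
	"errors"
	"log"
	"time"
)

func acquireLease(id uint32) error { log.Printf("acquired lease for %d", id); return nil }
func releaseLease(id uint32)       { log.Printf("released lease for %d", id) }

func runSchemaChange(id uint32) error {
	return errors.New("cannot update the progress of a succeeded job")
}

func attemptSchemaChange(id uint32) {
	if err := acquireLease(id); err != nil {
		return
	}
	defer releaseLease(id)
	if err := runSchemaChange(id); err != nil {
		log.Printf("schema change failed: %v; delaying lease release", err)
		time.Sleep(time.Minute)
	}
}

func main() { attemptSchemaChange(4191) }
```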
So in the meantime there now appear to be 7 runaway schema changes on this cluster. I have a fresh dump of the data in the system range in question, and here are the counts for the descriptor versions.
The range is now 4GB in size :) - we've disabled the write backpressure on this range. There appears to be something funky with the GC process too, since it seems to be constantly 2 days behind the 1-day target; I'll track that separately. @dt I think we should do what we can to find out what's going on with these jobs. The users here were willing to run custom binaries in the past. Internal communication with them is in https://cockroachdb.zendesk.com/agent/tickets/3392. You interested in pulling on this? I'm interested in pulling on the GC angle.
If needed, we could manually delete the descriptors from system.descriptor, which would stop the schema changer from trying to mess with them.
Hmm, actually I guess even …
As far as I'm concerned, I'm more interested in understanding the bug that causes the zombie schema changes than in getting the system range back into good shape.
Seeing this happen here. |
45573: storage: trigger GC based on SysCount/SysBytes r=ajwerner a=tbg

If there is "a lot" of data in Sys{Bytes,Count}, then we are likely experiencing a large abort span. The abort span is not supposed to become that large, but it does happen and causes stability fallout, usually due to a combination of shortcomings:

1. there's no trigger for GC based on abort span size alone (before this commit)
2. transaction aborts tended to create unnecessary abort span entries, fixed (and 19.2-backported) in #42765
3. aborting transactions in a busy loop: #38088 (and we suspect this also happens in user apps occasionally)
4. large snapshots would never complete due to the queue time limits (addressed in #44952)

In an ideal world, we would factor the abort span into this method directly, but until then the condition guarding this block will do. At worst, there is some other reason for SysBytes to become that large while also incurring a large SysCount, but I'm not sure how this would happen. The only other things in this span are the versioned range descriptors (which follow regular GC, and it's only ever adding a SysCount of one) and transaction records (which expire like abort span records).

Release note (bug fix): Range garbage collection will now trigger based on a large abort span, adding defense-in-depth against ranges growing large (and eventually unstable).

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
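To make the heuristic concrete, here is a rough sketch of the kind of check that commit message describes; the thresholds and types are illustrative, not values taken from the actual change:

```go
// If the range-local system keyspace (abort span, transaction records, ...)
// is suspiciously large, queue the range for GC regardless of the usual score.
package main

import "fmt"

type rangeStats struct {
	SysBytes int64 // bytes in the range-local system keyspace
	SysCount int64 // number of keys in that keyspace
}

// shouldQueueForGC returns true when the system keyspace alone is large
// enough that GC is worth running even if no ordinary trigger has fired.
func shouldQueueForGC(s rangeStats) bool {
	const (
		sysBytesThreshold = 32 << 20 // 32 MiB, illustrative
		sysCountThreshold = 100000   // illustrative
	)
	return s.SysBytes > sysBytesThreshold || s.SysCount > sysCountThreshold
}

func main() {
	// A range like the 4GB system config range in this issue trips the check.
	fmt.Println(shouldQueueForGC(rangeStats{SysBytes: 4 << 30, SysCount: 2000000}))
	fmt.Println(shouldQueueForGC(rangeStats{SysBytes: 1 << 20, SysCount: 500}))
}
```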
Closing this now that schema changes run inside a job, which makes this error impossible. We also changed this:
We don't do this anymore; now if we fail, we try to roll back if applicable, but ultimately if we hit an error that's not one of the whitelisted "retriable" errors we'll fail the job permanently. Even for schema changes that are looping for some reason, you can now pause those jobs.
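For illustration, a sketch of that error-classification policy; the sentinel errors and the `isPermanent` helper are hypothetical, not the jobs package's actual API:

```go
// Only a small allowlist of retriable errors causes the schema change to be
// retried; anything else fails the job permanently (rolling back if possible).
package main

import (
	"errors"
	"fmt"
)

var (
	errLeaseExpired = errors.New("schema change lease expired") // retriable
	errNodeDraining = errors.New("node is draining")            // retriable
	errJobSucceeded = errors.New("cannot update progress on succeeded job")
)

var retriable = []error{errLeaseExpired, errNodeDraining}

func isPermanent(err error) bool {
	for _, r := range retriable {
		if errors.Is(err, r) {
			return false
		}
	}
	return true
}

func main() {
	for _, err := range []error{errLeaseExpired, errJobSucceeded} {
		if isPermanent(err) {
			fmt.Printf("%v -> fail the job permanently (roll back if applicable)\n", err)
		} else {
			fmt.Printf("%v -> retry the schema change\n", err)
		}
	}
}
```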
A customer cluster got all gunked up because a schema change (or a table truncation?) fails a couple of times a second with the following amusing message:
Why someone is trying to update the progress of a succeeded job, I do not know. Two nodes racing on finishing the schema change maybe?
The schema change in question is:
The table has id: 4191 and state: DROP and drop_job_id: 456021744522723331
These schema change retries kill us because, with every one, we seem to acquire and release the "schema change lease" for this table (I can see this by diffing consecutive versions of the descriptor), which eventually leads to the system config range being unable to accept writes because it's gotten too big and can't be split.
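As a rough illustration of why this bloats the range: each acquire/release pair rewrites the descriptor, and every rewrite is a new MVCC version retained until the GC TTL expires. The retry rate, descriptor size, and 25h default TTL below are assumptions for back-of-the-envelope math, not measurements from this cluster:

```go
// Order-of-magnitude estimate of the MVCC garbage one looping schema change
// can leave behind in the system config range before GC catches up.
package main

import "fmt"

func main() {
	const (
		retriesPerSecond = 2.0         // "fails a couple of times a second"
		writesPerRetry   = 2.0         // one descriptor write to acquire the lease, one to release it
		descriptorBytes  = 2 * 1024.0  // assumed size of one descriptor version
		gcTTLSeconds     = 25 * 3600.0 // old versions stick around until the GC TTL expires
	)
	retained := retriesPerSecond * writesPerRetry * descriptorBytes * gcTTLSeconds
	fmt.Printf("~%.1f GiB of retained descriptor versions per runaway table\n",
		retained/(1<<30))
}
```

That works out to roughly 0.7 GiB per table under these assumptions, so the seven runaway schema changes mentioned above land in the same ballpark as the 4GB range observed here.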
Debug.zip here (internal only)
@dt you want this one?