backfiller: Index/Column backfiller ResumeSpans need to be reworked #36601
This is not going to be straightforward to implement, because we read the checkpoint at the gateway before each backfill chunk to determine the next chunk to run the evaluation over (1-2 minutes). It probably makes the most sense to extend the chunk time to something like 15 minutes and set up a timer to ping the schema lease from the gateway?
Yeah, that's what I'm looking at doing: moving the coordinator to run the lease extender and distsql flows in parallel goroutines so the flow can run as long as it wants, then making the checkpoint interval control how long we go between writing checkpoints without forcing the processor to exit after each one. Even better might be to have the processors emit rows with finished spans and let the gateway coalesce those / choose when to checkpoint, since that would make it easy to frequency-cap, but I think just making the existing flow run longer between checkpoints would be a good first step.
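For context, here is a minimal sketch of that arrangement, assuming hypothetical function names (not the actual CockroachDB code): the distsql flow and a lease-extending ticker run in parallel goroutines, and the extender stops once the flow completes.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/errgroup"
)

// extendLease and runBackfillFlow are hypothetical stand-ins for extending the
// schema lease and running the distsql backfill flow; they are not the real
// CockroachDB functions.
func extendLease(ctx context.Context) error { fmt.Println("lease extended"); return nil }

func runBackfillFlow(ctx context.Context) error {
	time.Sleep(3 * time.Second) // pretend the flow runs for a while
	return nil
}

func main() {
	g, ctx := errgroup.WithContext(context.Background())
	flowDone := make(chan struct{})

	// The distsql flow runs for as long as it needs.
	g.Go(func() error {
		defer close(flowDone)
		return runBackfillFlow(ctx)
	})

	// In parallel, keep the schema lease alive until the flow finishes.
	g.Go(func() error {
		ticker := time.NewTicker(time.Second) // would be minutes in practice
		defer ticker.Stop()
		for {
			select {
			case <-flowDone:
				return nil
			case <-ctx.Done():
				return ctx.Err()
			case <-ticker.C:
				if err := extendLease(ctx); err != nil {
					return err
				}
			}
		}
	})

	if err := g.Wait(); err != nil {
		fmt.Println("backfill failed:", err)
	}
}
```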
You'll also have to update the logic on the gateway that updates the progress, since we don't want % progress to be refreshed only once every 15 minutes (or whatever interval you choose).
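One way to picture that decoupling, sketched with made-up intervals (scaled down so the example terminates quickly; the real cadences would be seconds vs. minutes): the gateway writes the cheap fraction-completed update often and the expensive resume-span checkpoint rarely.

```go
package main

import (
	"fmt"
	"time"
)

// Two cadences on the gateway: a small, frequent %-progress write and a large,
// infrequent resume-span checkpoint. Intervals here are illustrative only.
func main() {
	progressTick := time.NewTicker(200 * time.Millisecond)   // stands in for ~30s
	checkpointTick := time.NewTicker(600 * time.Millisecond) // stands in for ~15m
	defer progressTick.Stop()
	defer checkpointTick.Stop()

	deadline := time.After(1300 * time.Millisecond)
	for {
		select {
		case <-progressTick.C:
			fmt.Println("write fraction completed only (small, cheap)")
		case <-checkpointTick.C:
			fmt.Println("rewrite resume spans (large, rare)")
		case <-deadline:
			return
		}
	}
}
```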
@spaskob via email: @danhhz pointed me towards the stats Sampler processor, which sends back progress info like we discussed, but via distsql flow metadata as opposed to actual rows: https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/distsqlrun/sampler.go#L214 This looks like a nice, clean way to do it compared to sticking encoded bytes in the rows themselves. I'm not quite sure how to get at this metadata from outside the flow, so doing it this way might require inserting an extra processor into the flow itself (or tweaking the API on the receiver), so it might not be quite drop-in for the current backfiller flow. I'm looking at this for IMPORT and I'll let you know what I find.
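Sketching the shape of that idea with stand-in types (these are not the real distsqlrun types or APIs; the actual Sampler code linked above differs in detail): a processor pushes the spans it has finished as metadata alongside its row stream, rather than encoding them into result rows.

```go
package main

import "fmt"

// ProgressMetadata is a hypothetical stand-in for the metadata record a
// processor could emit alongside (not inside) its row stream.
type ProgressMetadata struct {
	CompletedSpans []string // key spans finished by this processor
	RowsProcessed  int64
}

// metadataSink abstracts "push metadata back to the gateway".
type metadataSink interface {
	PushMetadata(ProgressMetadata)
}

// backfillChunk pretends to process one chunk and then reports the finished
// span via metadata instead of encoding it into result rows.
func backfillChunk(span string, rows int64, sink metadataSink) {
	// ... the actual backfill work for `span` would happen here ...
	sink.PushMetadata(ProgressMetadata{
		CompletedSpans: []string{span},
		RowsProcessed:  rows,
	})
}

type printSink struct{}

func (printSink) PushMetadata(m ProgressMetadata) {
	fmt.Printf("gateway saw progress: %+v\n", m)
}

func main() {
	backfillChunk("/Table/53/2/{a-m}", 1000, printSink{})
}
```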
Before this change, index backfill workers would write their updated spans concurrently to the job's row, which can lead to bloat in the jobs table. This PR is a first step in improving this: instead of updating the finished spans in each worker, we propagate this info to the gateway node and do bulk updates there. The next step is to let the workers run longer by not canceling them at 2 minutes, which is currently done for historical reasons. Fixes cockroachdb#36601. Release note: None
39773: jobs: improve the status reporting of index backfill jobs r=spaskob a=spaskob Before this change, index backfill workers would write their updated spans concurrently to the job's row, which can lead to bloat in the jobs table. This PR is a first step in improving this: instead of updating the finished spans in each worker, we propagate this info to the gateway node and do bulk updates there. The next step is to let the workers run longer by not canceling them at 2 minutes, which is currently done for historical reasons. Fixes #36601. Release note: None Co-authored-by: Spas Bojanov <spas@cockroachlabs.com>
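A minimal sketch of the gateway-side coalescing this PR describes, using hypothetical names (the real coordinator lives in the schema-change backfill code): workers report finished spans, and the gateway folds them into one bulk jobs-table update per interval.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// spanCoalescer collects finished spans reported by backfill workers and
// flushes them to the job record in one bulk update per interval.
type spanCoalescer struct {
	mu       sync.Mutex
	finished []string
	flush    func(spans []string) // e.g. a single jobs-table update per call
}

func (c *spanCoalescer) add(span string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.finished = append(c.finished, span)
}

func (c *spanCoalescer) run(stop <-chan struct{}, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			c.flushNow() // final flush on shutdown
			return
		case <-ticker.C:
			c.flushNow()
		}
	}
}

func (c *spanCoalescer) flushNow() {
	c.mu.Lock()
	spans := c.finished
	c.finished = nil
	c.mu.Unlock()
	if len(spans) > 0 {
		c.flush(spans)
	}
}

func main() {
	c := &spanCoalescer{flush: func(s []string) { fmt.Println("bulk update:", s) }}
	stop := make(chan struct{})
	go c.run(stop, 100*time.Millisecond)
	c.add("/Table/53/2/{a-m}")
	c.add("/Table/53/2/{m-z}")
	time.Sleep(200 * time.Millisecond)
	close(stop)
	time.Sleep(50 * time.Millisecond)
}
```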
Currently, if a new node participates in a cluster with old-version nodes, the index backfill will busy-wait, since the different versions use different ways of reporting status. We detect this situation and revert to the old status-reporting style if necessary. Touches cockroachdb#36601. Release note: None
40288: jobs: make new index backfill status tracking bwd compatible r=spaskob a=spaskob Currently, if a new node participates in a cluster with old-version nodes, the index backfill will busy-wait, since the different versions use different ways of reporting status. We detect this situation and revert to the old status-reporting style if necessary. Touches #36601. Release note: None Co-authored-by: Spas Bojanov <spas@cockroachlabs.com>
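The fallback can be sketched as a simple gate on the active cluster version. The version numbers and function names below are illustrative, not the real cluster-version keys CockroachDB uses.

```go
package main

import "fmt"

// clusterSupportsMetadataProgress is a hypothetical stand-in for checking the
// active cluster version; mixed-version clusters must keep working, so new
// behavior is gated on the version all nodes have agreed to.
func clusterSupportsMetadataProgress(activeVersion, requiredVersion int) bool {
	return activeVersion >= requiredVersion
}

func reportProgress(activeVersion int) {
	const requiredVersion = 20 // illustrative version number, not a real one
	if clusterSupportsMetadataProgress(activeVersion, requiredVersion) {
		fmt.Println("new path: workers stream finished spans to the gateway")
		return
	}
	// Old-version nodes may be participating: fall back to the pre-change
	// behavior of each worker writing its own resume spans.
	fmt.Println("fallback: per-worker resume-span updates")
}

func main() {
	reportProgress(19)
	reportProgress(21)
}
```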
I’ve been looking at the chunk backfiller stuff a bit this morning/afternoon and I’m a bit worried about it / resume spans.
ResumeSpans are currently the to-do spans for the backfiller, and are stored in the job details. They start as a single span covering the whole table, but the partitioned processors each update them, splitting them into many subspans as they complete individual subspans.
This is similar to the pattern that previously got us in trouble with RESTORE and IMPORT: repeatedly rewriting the job details to mark spans done/not done. In addition to the normal load/contention concerns, the particular issue was that when keys included large fields (e.g. a longer string), these spans could be large, and repeatedly rewriting all of them on every update resulted in a large amount of MVCC revision data for a single job record -- indeed, a single job row could easily exceed the max range size and, since ranges cannot split within a row, cause oversize ranges.

The conclusion from that was that job details and job progress need to be separate: the job details can be large and document the definition of the job, what work it involves, etc. Execution of the job should not need to rewrite the details as it makes progress. The job progress is what can be rewritten often, but must not be large.
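To make the separation concrete, a toy sketch of the two records (field names are illustrative, not the actual jobs-package protobufs): details are large and effectively write-once, progress is small and rewritten often.

```go
package main

import "fmt"

// jobDetails is large and documents what the job is; it should not be
// rewritten as the job makes progress.
type jobDetails struct {
	Statement   string // the schema change that defines the job
	TargetTable string
	TotalSpan   string // the whole span the backfill must cover
}

// jobProgress is rewritten frequently, so it must stay small.
type jobProgress struct {
	FractionCompleted float32  // cheap to rewrite often
	ResumeSpans       []string // must stay small; rewritten only at checkpoints
}

func main() {
	d := jobDetails{Statement: "CREATE INDEX ...", TargetTable: "t", TotalSpan: "/Table/53/2"}
	p := jobProgress{FractionCompleted: 0.25, ResumeSpans: []string{"/Table/53/2/{m-z}"}}
	fmt.Printf("details (written once): %+v\nprogress (rewritten often): %+v\n", d, p)
}
```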
We should eventually move the frequently rewritten done/not done spans into the job Progress field where they belong, and consider whether we can find a cheaper representation (maybe track more in-memory, with only a lossy representation persisted, where the loss only matters if you need to resume from that persisted state).
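One possible lossy scheme, sketched under the assumption that redoing a little work after a restart is acceptable (the backfill is idempotent): keep the exact set of completed chunks in memory, but persist only a contiguous low-water mark. This is an illustration of the idea, not CockroachDB's implementation.

```go
package main

import "fmt"

// frontier tracks completed chunks precisely in memory but persists only a
// contiguous low-water mark. Chunks finished ahead of the mark are "lost" on
// restart and simply redone.
type frontier struct {
	completed map[int]bool
	persisted int // highest index i such that every chunk <= i is done
}

func newFrontier() *frontier {
	return &frontier{completed: map[int]bool{}, persisted: -1}
}

func (f *frontier) markDone(chunk int) { f.completed[chunk] = true }

// checkpoint advances and returns the durable low-water mark.
func (f *frontier) checkpoint() int {
	for f.completed[f.persisted+1] {
		delete(f.completed, f.persisted+1)
		f.persisted++
	}
	return f.persisted
}

func main() {
	f := newFrontier()
	for _, c := range []int{0, 1, 3, 4} { // chunk 2 is still in flight
		f.markDone(c)
	}
	fmt.Println("persisted mark:", f.checkpoint()) // 1: chunks 3 and 4 not durable yet
	f.markDone(2)
	fmt.Println("persisted mark:", f.checkpoint()) // 4
}
```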
In the short term, though, we should rewrite ResumeSpans less often: a given processor should be able to complete all the spans assigned to it before it is forced to rewrite the whole set of resume spans, and it should do time-based checkpoints less often too.
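A sketch of that short-term behavior, with hypothetical names: a processor defers the resume-span rewrite until it finishes everything assigned to it, or until a long minimum interval has elapsed.

```go
package main

import (
	"fmt"
	"time"
)

// chunkCheckpointer defers resume-span rewrites: the processor checkpoints
// only when it has finished everything assigned to it, or when a (long)
// minimum interval has elapsed. Illustrative sketch only.
type chunkCheckpointer struct {
	assigned       int
	done           int
	lastCheckpoint time.Time
	minInterval    time.Duration
	writeResume    func(remaining int) // rewrites the job's resume spans
}

func (c *chunkCheckpointer) finishChunk() {
	c.done++
	allDone := c.done == c.assigned
	if !allDone && time.Since(c.lastCheckpoint) < c.minInterval {
		return // keep working; don't rewrite the resume spans yet.
	}
	c.writeResume(c.assigned - c.done)
	c.lastCheckpoint = time.Now()
}

func main() {
	c := &chunkCheckpointer{
		assigned:       3,
		lastCheckpoint: time.Now(),
		minInterval:    10 * time.Minute,
		writeResume:    func(n int) { fmt.Println("checkpoint; spans remaining:", n) },
	}
	c.finishChunk() // skipped: too soon, and work remains
	c.finishChunk() // skipped
	c.finishChunk() // written: all assigned spans are done
}
```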