
sql: move schema change lease to schema change job. #34211

Closed
vivekmenezes opened this issue Jan 24, 2019 · 7 comments
Assignees
Labels
A-schema-descriptors Relating to SQL table/db descriptor handling. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

Comments

@vivekmenezes
Contributor

The schema change lease is stored in the table descriptor. Because the lease is renewed every 2-3 minutes, each renewal rewrites the descriptor, which causes the entire system config span to be gossiped. The renewal also modifies the descriptor without bumping its Version. It would be best to move the lease to the schema change job.

@vivekmenezes vivekmenezes added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-schema-descriptors Relating to SQL table/db descriptor handling. labels Jan 24, 2019
@andreimatei
Contributor

Worse than the gossip traffic: the range can grow too large, at which point the cluster is completely stuck.
Besides moving the lease away, I think we should also make sure that taking and releasing this lease doesn't leave old MVCC versions around, and that we're not taking and releasing it in a loop, as happened in #34082.
More generally, I think we need to be more judicious about schema change retries. I don't think it's a sane policy to just retry things endlessly like we apparently do. I'm thinking we should find ways of getting an operator into the loop when things go awry.

@andreimatei
Contributor

FWIW, this is the descriptor with the flipping lease. I thought I linked it from #34211, but it looks like I forgot:
https://gist.github.com/andreimatei/11566c4ed239dd6ce01cae8e550ddf12

@andreimatei andreimatei added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label May 20, 2019
@andreimatei
Contributor

@dt I believe this issue is a biggie. It came up again recently; we've gotten multiple reports on 19.1 of writes to the system config range being backpressured because the range is too large, at which point the cluster is pretty hosed.
We haven't figured out exactly how the system config range gets that big, but the suspicion is on schema changes: either one that runs for a long time, or one that keeps retrying, which, as I noted in #34082, seems to result in this schema change lease being taken over and over.

cc @nvanbenschoten

craig bot pushed a commit that referenced this issue May 20, 2019
37605: storage: don't backpressure writes to the system config span r=andreimatei a=nvanbenschoten

We're seeing in multiple issues (e.g. #35842, #37337, #37530, etc.) that write backpressure is kicking in on the system config span range itself. This is a serious issue because it means that write backpressure can't be disabled using the `kv.range.backpressure_range_size_multiplier` cluster setting.

The reason for the system config span's growth is still somewhat unclear, and it is being tracked in #34211 (comment). It likely has to do with a runaway schema change that continuously grabs and releases a schema change lease. That is an issue and should be fixed, but letting it cause such devastation to a cluster is a problem in itself. To address this, this commit disables write backpressure on the system config span.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@dt
Member

dt commented Jun 6, 2019

@andreimatei do you have any idea which schema change might be at fault here? They're supposed to renew their leases every 2.5 min, so with a 25h TTL I'd expect no more than 600 MVCC revisions of a given descriptor, but I heard someone say we're seeing thousands per minute, so something seems off.
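For reference, the back-of-the-envelope math behind the 600 figure (assuming the default 25h GC TTL and one lease renewal every 2.5 minutes):

```go
package main

import "fmt"

func main() {
	const ttlMinutes = 25 * 60 // 25h GC TTL, in minutes
	const renewalMinutes = 2.5 // lease renewal interval, in minutes
	// Each renewal leaves one MVCC revision of the descriptor; revisions
	// older than the TTL are garbage collected, so at steady state:
	fmt.Println(ttlMinutes / renewalMinutes) // prints 600
}
```

Thousands of revisions per minute implies renewals happening far faster than the nominal 2.5-minute interval, e.g. a retry loop re-taking the lease on every iteration.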

@andreimatei
Contributor

I have to dig through the logs, which I'll hopefully do later today (I'll send you a link to them). The last time we saw this, it was a schema change retrying endlessly.

@andreimatei
Contributor

The latest incarnation of this has #38088 as the cause for the schema change retries.

@thoszhang
Contributor

Closing since schema change leases (i.e. TableDescriptor.Lease) don't exist anymore.
