
sql: move schema change lease to schema change job. #34211

Closed
vivekmenezes opened this issue Jan 24, 2019 · 7 comments
Assignees
Labels
A-schema-descriptors Relating to SQL table/db descriptor handling. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

Comments

@vivekmenezes
Contributor

The schema change lease is stored in the table descriptor. Because the lease is renewed every 2-3 minutes, each renewal rewrites the descriptor, which causes the entire system config span to be gossiped. The renewal also modifies the descriptor without bumping its Version. It would be best to move the lease to the schema change job.

@vivekmenezes vivekmenezes added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-schema-descriptors Relating to SQL table/db descriptor handling. labels Jan 24, 2019
@andreimatei
Contributor

Worse than the gossip traffic: the range can grow too large, at which point the cluster is completely stuck.
Besides moving the lease away, I think we should also make sure that taking and releasing this lease doesn't leave old MVCC versions around, and that we're not taking and releasing it in a loop, as happened in #34082.
More generally, I think we need to be more judicious about schema change retries. I don't think it's a sane policy to just retry things endlessly like we apparently do. I'm thinking we should find ways of getting an operator into the loop when things go awry.

@andreimatei
Contributor

FWIW, this is the descriptor with the flipping lease. I thought I linked it from #34211, but it looks like I forgot:
https://gist.github.com/andreimatei/11566c4ed239dd6ce01cae8e550ddf12

@andreimatei andreimatei added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label May 20, 2019
@andreimatei
Contributor

@dt I believe this issue is a biggie. It came up again recently; we've gotten multiple reports on 19.1 of writes to the system config range being backpressured because the range is too large, at which point the cluster is pretty hosed.
We haven't figured out exactly how the system config range gets that big, but the suspicion is on schema changes: either one that runs for a long time, or one that keeps retrying, which, as I noted in #34082, seems to result in this schema change lease being taken over and over.

cc @nvanbenschoten

craig bot pushed a commit that referenced this issue May 20, 2019
37605: storage: don't backpressure writes to the system config span r=andreimatei a=nvanbenschoten

We're seeing in multiple issues (e.g. #35842, #37337, #37530, etc.) that write backpressure is kicking in on the system config span range itself. This is a serious issue because it means that write backpressure can't be disabled using the `kv.range.backpressure_range_size_multiplier` cluster setting.

The reason for the system config span's growth is still somewhat unclear, and it is being tracked in #34211 (comment). It likely has to do with a runaway schema change that continuously grabs and releases a schema change lease. That is an issue and should be fixed, but letting it cause such devastation to a cluster is a problem in itself. To address this, this commit disables write backpressure on the system config span.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@dt
Member

dt commented Jun 6, 2019

@andreimatei do you have any idea which schema change might be at fault here? They're supposed to renew their leases every 2.5 min, so with a 25h TTL I'd expect no more than 600 MVCC revisions of a given descriptor, but I heard someone say we're seeing thousands per minute, so something seems off.
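For reference, the back-of-the-envelope math behind the 600 figure (assuming the default 25h GC TTL and one lease renewal every 2.5 minutes):

```go
package main

import "fmt"

func main() {
	const ttlMinutes = 25 * 60 // 25h GC TTL, in minutes
	const renewalMinutes = 2.5 // lease renewal interval, in minutes
	// Each renewal leaves one MVCC revision of the descriptor; revisions
	// older than the TTL are garbage collected, so at steady state:
	fmt.Println(ttlMinutes / renewalMinutes) // prints 600
}
```

Thousands of revisions per minute implies renewals happening far faster than the nominal 2.5-minute interval, e.g. a retry loop re-taking the lease on every iteration.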

@andreimatei
Contributor

I have to dig through the logs, which I'll hopefully do later today (I'll send you a link to them). The last time we saw this, it was a schema change retrying endlessly.

@andreimatei
Contributor

The latest incarnation of this has #38088 as the cause for the schema change retries.

@thoszhang
Contributor

Closing since schema change leases (i.e. TableDescriptor.Lease) don't exist anymore.
