kv: investigate brown-out while scaling up under load #61005

Closed
tbg opened this issue Feb 23, 2021 · 7 comments
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team

Comments

@tbg
Member

tbg commented Feb 23, 2021

Describe the problem

It has been reported here (internal link) that scaling up under load results in a period of 0 qps.

To Reproduce

Try to reproduce this in a roachtest. Apparently this has happened under multiple workloads, so kv50 seems like a good starting point.

Expected behavior

We expect to see a reproduction! QPS dropping to zero for a time is the symptom that was reported.
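
To make "QPS dropping to zero for a time" checkable in a reproduction, something like the following could be run over the sampled QPS. This is only a rough sketch: `find_brownouts`, the `(timestamp, qps)` sample format, and the `threshold` / `min_duration` parameters are illustrative assumptions, not part of roachtest or any CockroachDB tooling.

```python
from typing import List, Tuple


def find_brownouts(
    samples: List[Tuple[float, float]],
    threshold: float = 0.0,
    min_duration: float = 1.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) windows where QPS stayed at or below `threshold`.

    `samples` is a time-ordered list of (timestamp_seconds, qps) pairs.
    A window is reported only if it spans at least `min_duration` seconds,
    measured from its first to its last low-QPS sample.
    """
    windows = []
    start = None  # timestamp of the first low sample in the current run
    last = None   # timestamp of the most recent low sample
    for ts, qps in samples:
        if qps <= threshold:
            if start is None:
                start = ts
            last = ts
        else:
            if start is not None and last - start >= min_duration:
                windows.append((start, last))
            start = None
    # Close out a run that extends to the end of the samples.
    if start is not None and last - start >= min_duration:
        windows.append((start, last))
    return windows


# Hypothetical data: one sample per second, QPS is zero from t=2 to t=4.
samples = [
    (0, 1200.0), (1, 1150.0), (2, 0.0), (3, 0.0),
    (4, 0.0), (5, 900.0), (6, 0.0), (7, 1100.0),
]
print(find_brownouts(samples, min_duration=2.0))  # [(2, 4)]
```

The single zero sample at t=6 is ignored because it is shorter than `min_duration`, which keeps one-off scrape gaps from being counted as a brown-out.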

Additional data / screenshots

The Slack link above contains a link to a recording where this occurs in a customer environment.

The Slack thread also mentions a possible connection to #37906. This is unsubstantiated (since we're purely scaling up) but it should be kept in mind while investigating. #37904, at the time of writing, is not backported to 20.2 (or rather, it was un-backported due to a subtle bug).

Environment:

20.2.x (probably .4)

Additional context

@tbg tbg self-assigned this Feb 23, 2021
@blathers-crl

blathers-crl bot commented Feb 23, 2021

Hi @tbg, I've guessed the C-ategory of your issue and suitably labeled it. Please re-label if inaccurate.

While you're here, please consider adding an A- label to help keep our repository tidy.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Feb 23, 2021
@mikeczabator
Contributor

[Four screenshots attached; images not preserved.]

@mikeczabator
Contributor

Note, this is occurring in 20.2.5, but we also saw it in 20.2.3 and 20.2.4.

@mikeczabator
Contributor

More food for thought. When we scale the cluster up, the Leaseholders per Store graph shows leases immediately shuffling around across many nodes. Since all of this happens so quickly after the scale-up, it may be correlated with the drop in QPS.

image
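
One rough way to put a number on the shuffling is to compare per-store leaseholder counts between two samples of that graph. The sketch below is only an illustration: `lease_churn` and the `{store_id: leaseholder_count}` snapshot format are assumptions made for the example, not an existing CockroachDB metric or API.

```python
from typing import Dict


def lease_churn(before: Dict[int, int], after: Dict[int, int]) -> int:
    """Sum of absolute per-store changes in leaseholder count.

    Stores missing from one snapshot (for example, newly added nodes)
    are treated as holding zero leases in that snapshot. When the total
    number of leases is unchanged, half of this value is a lower bound
    on the number of lease transfers between the two snapshots.
    """
    stores = set(before) | set(after)
    return sum(abs(after.get(s, 0) - before.get(s, 0)) for s in stores)


# Hypothetical snapshots: three stores before scale-up, four after.
before = {1: 100, 2: 100, 3: 100}
after = {1: 70, 2: 80, 3: 75, 4: 75}
print(lease_churn(before, after))  # 150
```

Computing this between consecutive samples and lining it up against the QPS graph would show whether the spike in lease movement coincides with the outage window.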

@ajwerner
Contributor

This lease transfer behavior feels intimately related to #51867.

@ajwerner
Contributor

The related PR was not backported to 20.2. I wonder if it would help. I suspect it might.

@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@mwang1026

closing in favor of #67740
