Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stop scheduler from advancing through empty buckets without accept from ratelimiter #18604

Merged
merged 2 commits into from
Feb 15, 2018

Conversation

jim-minter
Copy link
Contributor

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 13, 2018
@jim-minter
Copy link
Contributor Author

@bparees do you think this need backporting?

@bparees
Copy link
Contributor

bparees commented Feb 14, 2018

return (s.position + inc + len(s.buckets)) % len(s.buckets)

unrelated to your changes but since it hurt my brain, isn't that numerically identical to return (s.position + inc) % len(s.buckets) ?

isn't (x+y) % y == x % y ?

Copy link
Contributor

@bparees bparees left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this does fix the code to behave as the doc implies it was intended to ("If the bucket is empty, we wait for the rate limiter before returning.") but since i haven't studied this like you have, why is it bad to skip through the empty buckets? i'm clear how the buckets, rate limiter, and desired import schedule interact, so it's not obvious to me why treating empty buckets as "an item to process" is critical to it working properly.

for k, v := range last {
delete(last, k)
s.buckets[s.at(-1)][k] = v
return k, v, false
}
s.position = s.at(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the interest of (some) clarity, i think i'd be inclined to move this into the "if len(last) == 0" block above and return from within that if-check. The fact that the for-loop above is really an "if" statement makes it a bit non-obvious what the code flow is here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried it and don't believe it ends up clearer. The compiler ends up insisting on a return statement at the end of the function which is actually unreachable. Or you need to declare k and v outside the range and break. I've added comments.

@jim-minter
Copy link
Contributor Author

return (s.position + inc + len(s.buckets)) % len(s.buckets)

unrelated to your changes but since it hurt my brain, isn't that numerically identical to return (s.position + inc) % len(s.buckets) ?
isn't (x+y) % y == x % y ?

It's not numerically identical in the case that inc is negative: in this case the return value could be negative and therefore invalid. I suppose the intended contract here is that abs(inc) <= len(s.buckets).

@bparees
Copy link
Contributor

bparees commented Feb 14, 2018

It's not numerically identical in the case that inc is negative: in this case the return value could be negative and therefore invalid.

sigh, yes of course.

@jim-minter
Copy link
Contributor Author

so this does fix the code to behave as the doc implies it was intended to ("If the bucket is empty, we wait for the rate limiter before returning.") but since i haven't studied this like you have, why is it bad to skip through the empty buckets? i'm clear how the buckets, rate limiter, and desired import schedule interact, so it's not obvious to me why treating empty buckets as "an item to process" is critical to it working properly.

Say the scheduler has 4 buckets and the minimum refresh interval is 15 minutes. I think the idea was that the ratelimiter emits 4 tokens in 15 minutes so that the scheduler gets through all its buckets once in the period. If the scheduler immediately skips to the next bucket when done without waiting on the ratelimiter, it goes too fast (https://bugzilla.redhat.com/show_bug.cgi?id=1515058). Worse, if no scheduler bucket is empty, the ratelimiter is never called and the scheduler hot loops (a second rate limiter based on MaxScheduledImageImportsPerMinute meant that this wasn't super obvious). That's https://bugzilla.redhat.com/show_bug.cgi?id=1543446.

@jim-minter
Copy link
Contributor Author

At the moment I'm looking for backportable correctness fixes because I think this may be appropriate for backport. Subsequently I'd like to refactor this as I think there are several corner case race conditions and the whole thing is very opaque.

@bparees
Copy link
Contributor

bparees commented Feb 14, 2018

Worse, if no scheduler bucket is empty, the ratelimiter is never called and the scheduler hot loops

every bucket will eventually be empty since it's moving items out of the bucket as it goes, right?

but thanks for you other explanation, that makes sense (that the ratelimiter config and the number of buckets are configured together. I didn't want to dig into the code enough to see if that was the case)

@bparees
Copy link
Contributor

bparees commented Feb 14, 2018

As for backporting, this doesn't feel like it rises to the level of needing to be backported (especially if we can get it into 3.9) but that would be up to CEE to negotiate w/ the customer who hit the issue. I'd ask on the bug.

@jim-minter jim-minter force-pushed the bz1515058 branch 2 times, most recently from 4917c6b to aeddc60 Compare February 14, 2018 17:28
@jim-minter
Copy link
Contributor Author

every bucket will eventually be empty since it's moving items out of the bucket as it goes, right?

That's not relevant - because Add() spreads items across all buckets, any OCP instance with more scheduled imagestreams than buckets will hot loop. It's in https://bugzilla.redhat.com/show_bug.cgi?id=1543446 and the unit test I've added recreates it. The net effect is that the backend docker registries are queried constantly to the rate of MaxScheduledImageImportsPerMinute.

Copy link
Contributor

@bparees bparees left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one nit and lgtm

return x
}

type wallClock struct{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does not appear to be used

@jim-minter
Copy link
Contributor Author

wallClock comment added

@bparees
Copy link
Contributor

bparees commented Feb 14, 2018

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 14, 2018
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bparees, jim-minter

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2018
@bparees bparees added this to the 3.9.0 milestone Feb 14, 2018
@openshift-merge-robot
Copy link
Contributor

/test all [submit-queue is verifying that this PR is safe to merge]

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot
Copy link
Contributor

Automatic merge from submit-queue (batch tested with PRs 18505, 18617, 18604).

@openshift-merge-robot openshift-merge-robot merged commit 5535deb into openshift:master Feb 15, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants