Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make HandleError prevent hot-loops #40497

Merged
merged 2 commits into from
Jan 28, 2017

Conversation

lavalamp
Copy link
Member

Add an error "handler" that just sleeps for a bit if errors happen more
often than 500ms. Manually tested against #39816. This doesn't fix #39816 but it does keep it from crippling a cluster.

Prevent hotloops on error conditions, which could fill up the disk faster than log rotation can free space.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 26, 2017
@lavalamp lavalamp added release-note Denotes a PR that will be considered when it comes time to generate release notes. cherrypick-candidate labels Jan 26, 2017
@k8s-reviewable
Copy link

This change is Reviewable

@k8s-cherrypick-bot
Copy link

Removing label cherrypick-candidate because no release milestone was set. This is an invalid state and thus this PR is not being considered for cherry-pick to any release branch. Please add an appropriate release milestone and then re-add the label.

@lavalamp lavalamp added this to the v1.5 milestone Jan 26, 2017
@saad-ali saad-ali added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Jan 26, 2017
@lavalamp
Copy link
Member Author

I will make another PR with a test. This is the minimum fix and it should be easy to cherrypick.

// package for that to be accessible here.
lastErrorTime time.Time
minPeriod time.Duration
lock sync.Mutex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, lock should be the field above the thing it locks and should be named lasteErrorTimeLock or something.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, will fix here and in the cherrypick branch

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cherrypick branch is fixed.

@k8s-github-robot k8s-github-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 26, 2017
@k8s-github-robot
Copy link

[APPROVALNOTIFIER] Needs approval from an approver in each of these OWNERS Files:

We suggest the following people:
cc @smarterclayton
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@mikedanese
Copy link
Member

/lgtm

@mikedanese mikedanese closed this Jan 26, 2017
@mikedanese mikedanese reopened this Jan 26, 2017
@mikedanese mikedanese added approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jan 26, 2017
@lavalamp lavalamp added the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Jan 26, 2017
@lavalamp
Copy link
Member Author

I want to make the fix in the release branch here before this merges.

@lavalamp
Copy link
Member Author

(but I'm building and testing the release branch first)

Add an error "handler" that just sleeps for a bit if errors happen more
often than 500ms. Manually tested against kubernetes#39816.
@lavalamp lavalamp removed the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Jan 26, 2017
@lavalamp
Copy link
Member Author

OK, this should be mergable now.

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 26, 2017
@smarterclayton smarterclayton added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 26, 2017
k8s-github-robot pushed a commit that referenced this pull request Jan 26, 2017
…97-upstream-release-1.5

Automatic merge from submit-queue

Automated cherry pick of #40497

Cherry pick of #40497 on release-1.5.

#40497: Make HandleError prevent hot-loops
@deads2k
Copy link
Contributor

deads2k commented Jan 26, 2017

Added a tag to resolve questions about alternative solutions before this merges and affects all callers.

@ncdc
Copy link
Member

ncdc commented Jan 26, 2017

I'll second what @deads2k said. Let's focus on the code that's hot-looping and formalize our error handling functionality in the controller framework.

@deads2k
Copy link
Contributor

deads2k commented Jan 26, 2017

I'll second what @deads2k said. Let's focus on the code that's hot-looping and formalize our error handling functionality in the controller framework.

.AddRateLimited( is pretty common. apiservice, certificate, daemon, deployment, disruption, endpoints, jobs, namespaces, replicaset, quota, service accounts, tokens, statefulset.

@smarterclayton
Copy link
Contributor

smarterclayton commented Jan 26, 2017

So there's two different arguments going on here:

  1. The process has a spout that is unbounded (handle error) which is intended to be used for handling "shrug" scenarios (literally, we fail, and it's better than eating the error) - we want all code to use that spout, but the spout must be rate limited
  2. Controllers already have a mechanism for this, but it's not being used in this case

David is concerned that 2 should be fixed at the same time as 1, and that 1 makes 2 not work as well (because it allows other code that is not controllers to break controllers). I agree with this, but we'd need to police the mechanism for 2.

So:

  1. We must rate limit util.HandleError to non-system destructive levels (arguably this should be per handler type, because system logging is not the same as shipping errors off to third party systems, not global, but whatever)
  2. We need a way to allow proper rate limited things like controllers to bypass 1 safely
  3. We need to prevent other people in the system from abusing 2 to bypass 1. That's effectively something like import boss.

@smarterclayton
Copy link
Contributor

Doesn't glog have a rate limiter? Why wouldn't we also set that?

@deads2k
Copy link
Contributor

deads2k commented Jan 26, 2017

Controllers already have a mechanism for this, but it's not being used

It's be used by the majority, it's not being used by GC. This effectively breaks the backoff handling of most controllers in a way that unnecessarily blocks execution of that controller (AddRateLimited doesn't) and the majority of callers from the controller packages don't want a delay like this.

Further, if you're building something to just stop a hotloop on an unconsidered error (again, not the case in the majority of controllers), you can sleep for a very short period (10s of milliseconds) and you'd just do it unconditionally since the purpose is to just avoid ddos-ing yourself. However, since its the opposite of what the majority of controllers want, all the existing controllers with proper handling need to be updated to not use this new (or severely changed) method.

Ending up in this place means that instead of fixing rating limiting for the GC controller (clients have rate limiters) and instead of fixing the GC controller to AddRateLimited (libraries already exist), means that we've taken the biggest hammer we have and applied it across multiple processes and allowed errors in on go routine to negatively rate limit already rate limited controllers for one bad apple.

@smarterclayton
Copy link
Contributor

However, since its the opposite of what the majority of controllers want, all the existing controllers with proper handling need to be updated to not use this new (or severely changed) method.

Yes, that's 2, and we need 3 to prevent it from being abused.

Ending up in this place means that instead of fixing rating limiting for the GC controller (clients have rate limiters) and instead of fixing the GC controller to AddRateLimited (libraries already exist), means that we've taken the biggest hammer we have and applied it across multiple processes and allowed errors in on go routine to negatively rate limit already rate limited controllers for one bad apple.

I had the expectation that

  1. GC gets fixed
  2. We add the new method and use it in controllers
  3. We fix controllers that are sending too much to handle error to more accurately ignore common errors
  4. We reduce the delay imposed here to something much smaller.

I would expect us to fix that in 1.6.

@deads2k
Copy link
Contributor

deads2k commented Jan 26, 2017

#38679 switches GC to a ratelimited work queue for 1.6 and simple wait could be added here https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/garbagecollector/garbagecollector.go#L594 as a minimal touch in 1.5 instead of disrupting most of the other controllers in the 1.5 stream.

@lavalamp
Copy link
Member Author

I did this because I never want to have an afternoon destroyed by someone logging every 20 microseconds again, and this seemed like the most general place with the fewest side effects. I don't care if we make it 50ms instead of 500. Should we really be HandleErroring that frequently? I think that's a problem even if it's separate components.

@deads2k
Copy link
Contributor

deads2k commented Jan 26, 2017

We spoke in slack and decided that a 1ms delay would protect infrastructure with minimal impact to callers like controllers. Once its updated, I'm ok with sleeping here.

@mikedanese
Copy link
Member

@lavalamp can you cherrypick into 1.4 while you are at it?

@lavalamp
Copy link
Member Author

We had a discussion on slack and it seems 1ms is the number that makes everyone able to live with a global limit like this. I will update this PR and send an adjustment to the 1.5 branch.

@lavalamp
Copy link
Member Author

OK, number adjusted, in a second commit so it'll be easier to cherrypick.

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 26, 2017
@mikedanese
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 27, 2017
k8s-github-robot pushed a commit that referenced this pull request Jan 27, 2017
…97-upstream-release-1.4

Automatic merge from submit-queue

Automated cherry pick of #40497

Cherry pick of #40497 on release-1.4.

#40497: Make HandleError prevent hot-loops
@deads2k deads2k removed the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Jan 27, 2017
@lavalamp
Copy link
Member Author

@grodrigues3 @apelisse The bot seems confused about the LGTM ordering here? Or is there some other reason why this is stuck?

@apelisse
Copy link
Member

Agreed, there is a confusion. Thanks Lavalamp, somebody noticed this bug before but we couldn't figure out what was going wrong. I think I somewhat understand now.

@apelisse apelisse added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 28, 2017
@k8s-github-robot
Copy link

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit fe2829c into kubernetes:master Jan 28, 2017
@k8s-cherrypick-bot
Copy link

Commit found in the "release-1.5" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[GC] controller manager gets error "unable to get REST mapping for kind" for ownerRefs to TPR and add-on APIs