
Prow terminates e2e pods while test is still running which can cause resource leaks #7673

Closed
krzyzacy opened this issue Apr 12, 2018 · 62 comments
Labels
area/jobs · lifecycle/rotten · ¯\_(ツ)_/¯

Comments

@krzyzacy
Member

So, prow kills presubmit pods upon a new commit push, which can cause runs like https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/62467/pull-kubernetes-kubemark-e2e-gce-big/1392?log#log

Since we simply abort the ginkgo run, the entire cluster gets leaked. Currently the janitor is having issues doing bulk cleanup, and bootstrap does not handle timeouts properly.

I'll try to improve the cleanup logic for the janitor. I'm not sure if we need to make prow pods have a graceful termination period so that we can send a SIGINT/SIGTERM to kubetest, and kubetest can handle cluster cleanup properly.
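
Roughly the kind of handling I have in mind on the kubetest side - just a sketch, assuming the signal actually reaches the process; runTests and tearDownCluster are stand-ins, not real kubetest functions:

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// runTests stands in for the ginkgo / e2e run; it aborts when ctx is cancelled.
func runTests(ctx context.Context) {
	select {
	case <-ctx.Done():
		fmt.Println("test run aborted")
	case <-time.After(time.Hour):
		fmt.Println("test run finished")
	}
}

// tearDownCluster stands in for kube-down / cluster cleanup.
func tearDownCluster() {
	fmt.Println("tearing down cluster")
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// On SIGTERM/SIGINT, stop the tests early instead of dying mid-run.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	go func() {
		<-sigs
		cancel()
	}()

	runTests(ctx)
	tearDownCluster() // always attempt cleanup, even after an abort
}
```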

/area jobs

cc @shyamjvs @cjwagner @BenTheElder @fejta

@shyamjvs
Member

Ref kubernetes/kubernetes#62267

cc @krzysied @wojtek-t - This is one of the main problems causing failures in our presubmits.

@wojtek-t
Member

not sure if we need to make prow pods have a graceful termination period so that we can send a SIGINT/SIGTERM to kubetest, and kubetest can handle cluster cleanup properly.

I would say that this is more important than improving janitor.

@BenTheElder
Member

not sure if we need to make prow pods have a graceful termination period so that we can send a SIGINT/SIGTERM to kubetest, and kubetest can handle cluster cleanup properly.

This one may not be reasonable, as cluster cleanup can take a long time; it's not clear how long a reasonable grace period would be, and we'd also have to plumb the grace period through. The default appears to be 30s, which is likely not enough to do any useful cluster cleanup.

@BenTheElder
Member

/cc @stevekuznetsov
is openshift doing anything interesting with pod termination and the grace period?

@krzyzacy
Member Author

@shyamjvs can you use a pool of projects for your presubmit? Is default quota enough for your presubmits? If so, we can just switch to a boskos pool and the janitor can do cleanup after each run.

@wojtek-t
Member

This one may not be reasonable, as cluster cleanup can take a long time; it's not clear how long a reasonable grace period would be, and we'd also have to plumb the grace period through. The default appears to be 30s, which is likely not enough to do any useful cluster cleanup.

It's fine to set the grace period to 5 or 10 minutes, for example.
And in the majority of cases we will cleanly shut down the cluster.

In that case, cleaning up the cluster will be needed less often, and the current mechanisms for that should be enough.
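
For illustration, the knob itself is just one field on the pod spec that prow builds for the job - a minimal sketch, assuming prow were to plumb it through (the image name is only a placeholder):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Give the e2e pod 10 minutes to shut down cleanly instead of the 30s default.
	grace := int64(600)
	pod := corev1.Pod{
		Spec: corev1.PodSpec{
			TerminationGracePeriodSeconds: &grace,
			Containers: []corev1.Container{{
				Name:  "test",
				Image: "gcr.io/k8s-testimages/kubekins-e2e", // placeholder
			}},
		},
	}
	fmt.Println(*pod.Spec.TerminationGracePeriodSeconds)
}
```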

@wojtek-t
Member

Is default quota enough for your presubmits?

@krzyzacy - far from it. We need a lot of quota for it.

@BenTheElder
Member

It's fine to set the grace period to 5 or 10 minutes, for example.

It may not be fine to set this for all jobs: leaving extra pods around longer while starting more eats into disk on the nodes, which will trigger more pod evictions. The cluster defaults are not great around this... (no soft eviction AFAICT).

We're otherwise pretty aggressive about GCing pods (using our own controller) because of our workload to avoid this.

@wojtek-t
Member

It may not be fine to set this for all jobs: leaving extra pods around longer while starting more eats into disk on the nodes, which will trigger more pod evictions. The cluster defaults are not great around this... (no soft eviction AFAICT).

I don't understand that. The grace period is the maximum time that the kubelet will give you; if the pod has nothing else to do, it can finish earlier on its own. Assuming the test finished normally, there shouldn't be any difference.

@krzyzacy
Member Author

Does a pod automatically send signals to all running processes upon termination, or does that part need to be implemented?
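
(For reference: the kubelet only signals the container's main process - PID 1 in the container - and sends SIGKILL after the grace period. Anything that process spawns, e.g. bootstrap → kubetest → ginkgo, only sees the signal if the parent forwards it. A rough, purely illustrative sketch of such forwarding in a wrapper entrypoint:)

```go
package main

import (
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func main() {
	// Run the real payload (e.g. the bootstrap/kubetest invocation) as a child.
	cmd := exec.Command("/bin/sh", "-c", "sleep 3600") // placeholder command
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// The kubelet only signals this process, so forward termination signals
	// to the child explicitly.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	go func() {
		sig := <-sigs
		_ = cmd.Process.Signal(sig)
	}()

	_ = cmd.Wait()
}
```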

@BenTheElder
Member

I don't understand that. The grace period is the maximum time that the kubelet will give you; if the pod has nothing else to do, it can finish earlier on its own. Assuming the test finished normally, there shouldn't be any difference.

But we are scheduling another job while terminating the previous one, and tests are very often terminated early by a new commit. I think our existing pod creation logic / rate limiting may not play well with this change as-is.

To be clear: I suggested that this is probably the right route offline before @krzyzacy created this issue, but I also want to make sure we've handled this everywhere (plank, sinker, actually keeping soft-eviction enabled) before we go changing how we schedule things. We don't want more of #5700

@shyamjvs
Member

can you use a pool of projects for your presubmit?

I'm not sure if having a pool of projects will help here. It will just split this problem of resource leaks across different projects. The problem here is that we have a quota of X (it doesn't matter how many projects), and when there are more than Y leaked runs (which we should ensure is NOT the case), we can't run enough presubmits in parallel anymore.

Is default quota enough for your presubmits?

Like Wojtek said, this is far from the default quota of most (all?) of our boskos projects.

It's fine to set the grace period to 5 or 10 minutes, for example.

I agree too.
A 5-10 minute grace period is totally worth the time IMO (and that might be enough to delete the cluster), rather than waiting for the janitor to reap garbage resources after 3 hours :)

@krzyzacy
Member Author

I'm not sure if having a pool of projects will help here. It will just split this problem of resource leaks across different projects. The problem here is that we have a quota of X (it doesn't matter how many projects), and when there are more than Y leaked runs (which we should ensure is NOT the case), we can't run enough presubmits in parallel anymore.

It will be in a better situation - with a pool, right after each e2e run we clean up the project and return it to the pool for the next e2e job, so that every e2e job gets a clean project.

Currently the janitor naively runs every 3 hours and cleans up resources older than 3h in the project, so within that 3h window resources can pile up.
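
To make the flow concrete - a minimal sketch; the ProjectPool interface here is a hypothetical stand-in for the boskos client, not its real API:

```go
package main

import "fmt"

// ProjectPool is a hypothetical stand-in for the boskos client.
// The per-run flow is: acquire a clean project, run, hand it back dirty.
type ProjectPool interface {
	Acquire() (project string, err error)
	ReleaseDirty(project string) error // the janitor cleans it before reuse
}

type fakePool struct{}

func (fakePool) Acquire() (string, error) { return "e2e-project-01", nil }
func (fakePool) ReleaseDirty(p string) error {
	fmt.Println("released", p, "for cleanup")
	return nil
}

func runE2E(pool ProjectPool) error {
	project, err := pool.Acquire()
	if err != nil {
		return err
	}
	// Even if the tests leak resources, the project goes back dirty and the
	// janitor controller cleans it before handing it to the next job.
	defer pool.ReleaseDirty(project)

	fmt.Println("running e2e in", project)
	return nil
}

func main() {
	if err := runE2E(fakePool{}); err != nil {
		panic(err)
	}
}
```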

@shyamjvs
Member

It will be in a better situation - with a pool, right after each e2e run we clean up the project and return it to the pool for the next e2e job, so that every e2e job gets a clean project.

Not sure I understand this. Let me clarify my earlier point:

  • let's say we have 12 projects, each with a quota of 100 nodes, instead of a single project with a quota of 1200
  • if everything goes fine, we should be able to continuously have 12 runs in parallel (in both cases)
  • however, if some run got evicted and leaked 100 nodes - in either case, we need to wait for the janitor to reclaim those 100 nodes before we can reuse them
  • IIUC, in the worst case we need to wait 3h (in either case) for that to happen

Let me know how having multiple projects would help here.

@krzyzacy
Member Author

krzyzacy commented Apr 12, 2018

@shyamjvs boskos has a janitor controller; it is triggered as soon as kubetest finishes and spends, say, 5 minutes cleaning up the project, so the project is reusable again after 5 minutes.

Currently, if all 12 runs are busted, we need to wait at least 3h, or 6h in the worst case, for a CI janitor run to clean the resources up.

(The difference is that with boskos the janitor knows the project is not being used, whereas currently the janitor cannot tell whether a resource is stale or not, so it assumes everything older than 3h is stale.)

@BenTheElder
Member

Per @cjwagner we may not actually be deleting pods, as allow cancellations is not on for our deployment...

if c.ca.Config().Plank.AllowCancellations {

@krzyzacy
Member Author

Maybe sinker is deleting the pod then, since the prowjob is completed.

@cjwagner
Member

As noted during standup, Prow doesn't cancel old pods by default (see https://github.com/kubernetes/test-infra/blob/master/prow/plank/controller.go#L264), but it does cancel the old ProwJob.
Sinker should probably be waiting for the pod to complete in addition to waiting for the ProwJob to complete and the pod to reach the max age: https://github.com/kubernetes/test-infra/blob/master/prow/cmd/sinker/main.go#L189
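
Something like the following condition, heavily simplified (the types here are stand-ins; the real sinker objects and config differ):

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-ins for the real ProwJob / Pod objects.
type prowJob struct{ complete bool }
type pod struct {
	finished  bool
	startTime time.Time
}

// shouldDelete sketches the suggested rule: only garbage-collect a pod once
// its ProwJob is complete, the pod itself has finished, and the pod has
// exceeded the configured max age.
func shouldDelete(pj prowJob, p pod, maxPodAge time.Duration) bool {
	return pj.complete && p.finished && time.Since(p.startTime) > maxPodAge
}

func main() {
	p := pod{finished: true, startTime: time.Now().Add(-2 * time.Hour)}
	fmt.Println(shouldDelete(prowJob{complete: true}, p, time.Hour)) // true
}
```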

@BenTheElder
Member

Sinker should probably be waiting for the pod to complete in addition to waiting for the ProwJob to complete and the pod to reach the max age: https://github.com/kubernetes/test-infra/blob/master/prow/cmd/sinker/main.go#L189

If we do this without signaling the pods for deletion we will have a big pileup of pods because the jobs will simply keep running for the full length of the job.

@cjwagner
Member

If we do this without signaling the pods for deletion we will have a big pileup of pods because the jobs will simply keep running for the full length of the job.

Isn't that the desired behavior when AllowCancellations is false? If we want to terminate pods when the ProwJob is terminated we should enable AllowCancellations.

@BenTheElder
Member

Isn't that the desired behavior when AllowCancellations is false? If we want to terminate pods when the ProwJob is terminated we should enable AllowCancellations.

This behavior is somewhat orthogonal to Sinker, but yes that sounds right to me. My point is just that if we do fix sinker in this way we will immediately have issues as we're currently relying on this aggressive GC...

We might be able to rebuild the prow cluster first with nodes that have larger boot disks to help alleviate this. The cluster needs some reshaping as-is.

@stevekuznetsov
Contributor

We still run 99% of our jobs on Jenkins, and cancelling the job runs the post-steps, so we're fine :)

@krzyzacy
Member Author

Sounds like the easiest thing would be to introduce a different GC interval for running pods with finished prowjobs?

@krzyzacy
Member Author

@shyamjvs how do you feel about having a pool of projects instead of one giant project?

@shyamjvs
Member

how do you feel about having a pool of projects instead of one giant project?

I'm not a fan of that approach for at least the following reasons:

  • it'll mean we need to maintain multiple projects, and record-keeping / changing project properties (e.g. quotas) will be much more tedious
  • we don't want to keep creating new projects as we increase the parallelism of those jobs
  • the behavior you mention in #7673 (comment) seems to me like only an artificial difference. Even in the case of one big project, what's stopping us from doing the same kind of project cleanup after a test run for the resources it created?

@shyamjvs
Member

Actually after thinking a bit more, I have an idea to solve this problem easily (without needing either grace-period or prow changes).
Given that we see this problem only for PR jobs (where we kill the run bluntly when a new commit is pushed), and assuming that "a PR should only have a single instance of a particular presubmit" holds, we can name the cluster we create for a presubmit job as:

e2e-<PR#>-<hash of job name>

The above naming has the following nice properties:

  • It is unique for a given (presubmit job, PR) pair, so there will be no clashes across PRs or across presubmits for a single PR
  • Since the new run of the same presubmit job uses the same cluster name, it will bring Down the earlier cluster automatically as part of our setup scripts before starting the new one

One potential consequence I can think of here is increased run time due to the extra "down" step. But from historical data, the TearDown step takes just ~7 mins even for our 100-node presubmit, which is quite reasonable IMO.
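
A rough sketch of how the name could be derived - the hash function and truncation length here are illustrative assumptions, not necessarily what the actual change implements:

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// clusterName builds e2e-<PR#>-<short hash of job name>, so the name is stable
// across reruns of the same presubmit on the same PR but unique otherwise.
func clusterName(prNumber int, jobName string) string {
	sum := sha1.Sum([]byte(jobName))
	return fmt.Sprintf("e2e-%d-%x", prNumber, sum[:3]) // first 6 hex chars of the digest
}

func main() {
	fmt.Println(clusterName(68096, "pull-kubernetes-e2e-gce-100-performance"))
}
```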

If that sounds good, I can easily make the change.

@shyamjvs
Member

shyamjvs commented Apr 13, 2018

One more case that I missed earlier is that of a presubmit running on a batch of PRs. In that case, we might either use the first PR's number in the batch, or concatenate all the numbers and take a hash. Both schemes are sensitive to the ordering of the PRs, but I'm not sure we should care enough about that.

[EDIT]: Actually, a better approach seems to be to use 'batch' instead of the PR# for such cases (just like we're already doing in prow). This is simple and also a clean way to handle leaks from batch runs.

@shyamjvs
Member

I've sent #7682 implementing my idea above - let me know what you think of it.

@BenTheElder
Member

we should really move to aborting old jobs :(

@shyamjvs
Member

shyamjvs commented Aug 31, 2018

That smells like a broken teardown script. I think we should just fix that by adding retries - I'll try looking into it next week. In the long run it seems like this won't be a problem anymore once we move to boskos.

@krzyzacy
Member Author

I think bad things can happen, like running kube-down while kube-up is still in progress... It will be hard to fix, so yes, let's move to boskos :-)

@shyamjvs
Member

shyamjvs commented Sep 1, 2018

I agree about the boskos part, but I'm not sure why bad things can happen with kube-up and kube-down (we should only ever have one of them happening at any point in time, which should be OK AFAIK).

@krzyzacy
Member Author

krzyzacy commented Sep 1, 2018

compare https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/68096/pull-kubernetes-e2e-gce-100-performance/18880/ and https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/68096/pull-kubernetes-e2e-gce-100-performance/18882/

Both of them share the same cluster name: e2e-68096-95a39

18882 was triggered 2 minutes after 18880, so while 18880 was running kube-up, 18882 was probably also executing kube-down or kube-up, which races.

@shyamjvs
Member

shyamjvs commented Sep 1, 2018

Hmm... why wasn't 18880 killed when 18882 was about to start?

@krzyzacy
Member Author

krzyzacy commented Sep 1, 2018

/shrug
#7673 (comment)
we haven't done anything here yet

@k8s-ci-robot added the ¯\_(ツ)_/¯ label Sep 1, 2018
@BenTheElder
Member

I think we should enable termination, add a large grace period for non-boskos e2e jobs, and switch to random cluster IDs. Then we'll tear down the old cluster and spin up the new one concurrently.
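
Something as simple as a per-run random suffix would do - a sketch; the prefix and suffix length are arbitrary:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// randomClusterName returns a unique name per run, so tearing down the old
// cluster and bringing up the new one never race on the same name.
func randomClusterName() string {
	b := make([]byte, 3)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return "e2e-" + hex.EncodeToString(b)
}

func main() {
	fmt.Println(randomClusterName()) // e.g. e2e-a1b2c3
}
```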

@shyamjvs
Member

shyamjvs commented Sep 1, 2018 via email

@BenTheElder
Member

BenTheElder commented Sep 1, 2018 via email

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Nov 30, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Dec 30, 2018
@wojtek-t
Member

wojtek-t commented Jan 4, 2019

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label Jan 4, 2019
@mithrav
Contributor

mithrav commented Jan 4, 2019

What can we do to move this along? It has been over 6 months.
/remove-lifecycle stale

@BenTheElder
Member

Jobs should not depend on exit traps / cleanup within the task. This is a wontfix.

Even if we changed k8s / Prow, we can't guarantee that infra won't ever fail. Cleanup needs to be reentrant and managed by controllers or follow-up tasks (i.e. use boskos).

@BenTheElder
Member

@amwat has been moving more of this to boskos

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Apr 4, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label May 4, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
