
Prow terminates e2e pods while test is still running which can cause resource leaks #7673

Closed
krzyzacy opened this issue Apr 12, 2018 · 62 comments
Labels
area/jobs · lifecycle/rotten · ¯\_(ツ)_/¯

Comments

@krzyzacy
Member

So, prow kills presubmit pods upon a new commit push, which can cause runs like https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/62467/pull-kubernetes-kubemark-e2e-gce-big/1392?log#log

Since we simply abort the ginkgo run, the entire cluster gets leaked. Currently the janitor is having issues doing bulk cleanup, and bootstrap does not handle timeouts properly.

I'll try to improve the cleanup logic for the janitor. I'm not sure if we need to make prow pods have a graceful termination period so that we can send a SIGINT/SIGTERM to kubetest, and kubetest can handle cluster cleanup properly.
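
Roughly the kind of handling I have in mind on the kubetest side - just a sketch, assuming the signal actually reaches the process; runTests and tearDownCluster are stand-ins, not real kubetest functions:

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// runTests stands in for the ginkgo / e2e run; it aborts when ctx is cancelled.
func runTests(ctx context.Context) {
	select {
	case <-ctx.Done():
		fmt.Println("test run aborted")
	case <-time.After(time.Hour):
		fmt.Println("test run finished")
	}
}

// tearDownCluster stands in for kube-down / cluster cleanup.
func tearDownCluster() {
	fmt.Println("tearing down cluster")
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// On SIGTERM/SIGINT, stop the tests early instead of dying mid-run.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	go func() {
		<-sigs
		cancel()
	}()

	runTests(ctx)
	tearDownCluster() // always attempt cleanup, even after an abort
}
```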

/area jobs

cc @shyamjvs @cjwagner @BenTheElder @fejta

@shyamjvs
Member

Ref kubernetes/kubernetes#62267

cc @krzysied @wojtek-t - This is one of the main problems causing failures in our presubmits.

@wojtek-t
Member

not sure if we need to make prow pods have a graceful termination period so that we can send a SIGINT/SIGTERM to kubetest, and kubetest can handle cluster cleanup properly.

I would say that this is more important than improving janitor.

@BenTheElder
Member

not sure if we need to make prow pods have a graceful termination period so that we can send a SIGINT/SIGTERM to kubetest, and kubetest can handle cluster cleanup properly.

This one may not be reasonable, as cluster cleanup can take a long time; it's not clear how long a reasonable grace period would be, and we'd also have to plumb the grace period through. The default appears to be 30s, which is likely not enough to do any useful cluster cleanup.

@BenTheElder
Member

/cc @stevekuznetsov
is openshift doing anything interesting with pod termination and the grace period?

@krzyzacy
Member Author

@shyamjvs can you use a pool of projects for your presubmit? Is default quota enough for your presubmits? If so, we can just switch to a boskos pool and the janitor can do cleanup after each run.

@wojtek-t
Member

This one may not be reasonable, as cluster cleanup can take a long time; it's not clear how long a reasonable grace period would be, and we'd also have to plumb the grace period through. The default appears to be 30s, which is likely not enough to do any useful cluster cleanup.

It's fine to set the grace period to 5 or 10 minutes, for example.
And in the majority of cases we will cleanly shut down the cluster.

In that case, cleaning up the cluster will be needed less often, and the current mechanisms for that should be enough.
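
For illustration, the knob itself is just one field on the pod spec that prow builds for the job - a minimal sketch, assuming prow were to plumb it through (the image name is only a placeholder):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Give the e2e pod 10 minutes to shut down cleanly instead of the 30s default.
	grace := int64(600)
	pod := corev1.Pod{
		Spec: corev1.PodSpec{
			TerminationGracePeriodSeconds: &grace,
			Containers: []corev1.Container{{
				Name:  "test",
				Image: "gcr.io/k8s-testimages/kubekins-e2e", // placeholder
			}},
		},
	}
	fmt.Println(*pod.Spec.TerminationGracePeriodSeconds)
}
```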

@wojtek-t
Member

Is default quota enough for your presubmits?

@krzyzacy - far from it. We need a lot of quota for it.

@BenTheElder
Member

It's fine to set the grace period to 5 or 10 minutes, for example.

It may not be fine to set this for all jobs: leaving extra pods around longer while starting more eats into disk on the nodes, which will trigger more pod evictions. The cluster defaults are not great around this... (no soft eviction AFAICT).

We're otherwise pretty aggressive about GCing pods (using our own controller) because of our workload to avoid this.

@wojtek-t
Member

It may not be fine to set this for all jobs: leaving extra pods around longer while starting more eats into disk on the nodes, which will trigger more pod evictions. The cluster defaults are not great around this... (no soft eviction AFAICT).

I don't understand that. The grace period is the maximum time that the kubelet will give you; if the pod has nothing else to do, it can finish earlier on its own. Assuming the test finished normally, there shouldn't be any difference.

@krzyzacy
Member Author

Does a pod automatically send signals to all running processes upon termination, or does that part need to be implemented?
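
(For reference: the kubelet only signals the container's main process - PID 1 in the container - and sends SIGKILL after the grace period. Anything that process spawns, e.g. bootstrap → kubetest → ginkgo, only sees the signal if the parent forwards it. A rough, purely illustrative sketch of such forwarding in a wrapper entrypoint:)

```go
package main

import (
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func main() {
	// Run the real payload (e.g. the bootstrap/kubetest invocation) as a child.
	cmd := exec.Command("/bin/sh", "-c", "sleep 3600") // placeholder command
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// The kubelet only signals this process, so forward termination signals
	// to the child explicitly.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	go func() {
		sig := <-sigs
		_ = cmd.Process.Signal(sig)
	}()

	_ = cmd.Wait()
}
```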

@BenTheElder
Member

I don't understand that. The grace period is the maximum time that the kubelet will give you; if the pod has nothing else to do, it can finish earlier on its own. Assuming the test finished normally, there shouldn't be any difference.

But we are scheduling another job while terminating the previous one, and tests are very often terminated early by a new commit. I think our existing pod creation logic / rate limiting may not play well with this change as-is.

To be clear: I suggested that this is probably the right route offline before @krzyzacy created this issue, but I also want to make sure we've handled this everywhere (plank, sinker, actually keeping soft-eviction enabled) before we go changing how we schedule things. We don't want more of #5700

@shyamjvs
Member

can you use a pool of projects for your presubmit?

I'm not sure if having a pool of projects will help here. It will just split this problem of resource leaks across different projects. The problem here is that we have a quota of X (it doesn't matter how many projects), and when there are more than Y leaked runs (which we should ensure is NOT the case), we can't run enough presubmits in parallel anymore.

Is default quota enough for your presubmits?

Like Wojtek said, this is far from the default quota of most (all?) of our boskos projects.

It's fine to set the grace period to 5 or 10 minutes, for example.

I agree too.
A 5-10 minute grace period is totally worth the time IMO (and that might be enough to delete the cluster), rather than waiting for the janitor to reap garbage resources after 3 hours :)

@krzyzacy
Member Author

I'm not sure if having a pool of projects will help here. It will just split this problem of resource leaks across different projects. The problem here is that we have a quota of X (it doesn't matter how many projects), and when there are more than Y leaked runs (which we should ensure is NOT the case), we can't run enough presubmits in parallel anymore.

It will be in a better situation - with a pool, right after each e2e run we clean up the project and return it to the pool for the next e2e job, so that every e2e job gets a clean project.

Currently the janitor naively runs every 3 hours and cleans up resources older than 3h in the project, so within that 3h window resources can pile up.
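
To make the flow concrete - a minimal sketch; the ProjectPool interface here is a hypothetical stand-in for the boskos client, not its real API:

```go
package main

import "fmt"

// ProjectPool is a hypothetical stand-in for the boskos client.
// The per-run flow is: acquire a clean project, run, hand it back dirty.
type ProjectPool interface {
	Acquire() (project string, err error)
	ReleaseDirty(project string) error // the janitor cleans it before reuse
}

type fakePool struct{}

func (fakePool) Acquire() (string, error) { return "e2e-project-01", nil }
func (fakePool) ReleaseDirty(p string) error {
	fmt.Println("released", p, "for cleanup")
	return nil
}

func runE2E(pool ProjectPool) error {
	project, err := pool.Acquire()
	if err != nil {
		return err
	}
	// Even if the tests leak resources, the project goes back dirty and the
	// janitor controller cleans it before handing it to the next job.
	defer pool.ReleaseDirty(project)

	fmt.Println("running e2e in", project)
	return nil
}

func main() {
	if err := runE2E(fakePool{}); err != nil {
		panic(err)
	}
}
```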

@shyamjvs
Member

It will be in a better situation - with a pool, right after each e2e run we clean up the project and return it to the pool for the next e2e job, so that every e2e job gets a clean project.

Not sure I understand this. Let me clarify my earlier point:

  • let's say we have 12 projects, each with a quota of 100 nodes, instead of a single project with a quota of 1200
  • if everything goes fine, we should be able to continuously have 12 runs in parallel (in both cases)
  • however, if some run got evicted and leaked 100 nodes - in either case, we need to wait for the janitor to reclaim those 100 nodes before we can reuse them
  • IIUC, in the worst case we need to wait 3h (in either case) for that to happen

Let me know how having multiple projects would help here.

@krzyzacy
Member Author

krzyzacy commented Apr 12, 2018

@shyamjvs boskos has a janitor controller; it is triggered as soon as kubetest finishes and spends, say, 5 minutes cleaning up the project, so the project is reusable again after 5 minutes.

Currently, if all 12 runs are busted, we need to wait at least 3h, or 6h in the worst case, for a CI janitor run to clean the resources up.

(The difference is that with boskos the janitor knows the project is not being used, whereas currently the janitor cannot tell whether a resource is stale or not, so it assumes everything older than 3h is stale.)

@BenTheElder
Member

Per @cjwagner we may not actually be deleting pods, as allow cancellations is not on for our deployment...

if c.ca.Config().Plank.AllowCancellations {

@krzyzacy
Member Author

Maybe sinker is deleting the pod then, since the prowjob is completed.

@cjwagner
Member

As noted during standup, Prow doesn't cancel old pods by default (see https://github.com/kubernetes/test-infra/blob/master/prow/plank/controller.go#L264), but it does cancel the old ProwJob.
Sinker should probably be waiting for the pod to complete in addition to waiting for the ProwJob to complete and the pod to reach the max age: https://github.com/kubernetes/test-infra/blob/master/prow/cmd/sinker/main.go#L189
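
Something like the following condition, heavily simplified (the types here are stand-ins; the real sinker objects and config differ):

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-ins for the real ProwJob / Pod objects.
type prowJob struct{ complete bool }
type pod struct {
	finished  bool
	startTime time.Time
}

// shouldDelete sketches the suggested rule: only garbage-collect a pod once
// its ProwJob is complete, the pod itself has finished, and the pod has
// exceeded the configured max age.
func shouldDelete(pj prowJob, p pod, maxPodAge time.Duration) bool {
	return pj.complete && p.finished && time.Since(p.startTime) > maxPodAge
}

func main() {
	p := pod{finished: true, startTime: time.Now().Add(-2 * time.Hour)}
	fmt.Println(shouldDelete(prowJob{complete: true}, p, time.Hour)) // true
}
```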

@BenTheElder
Member

Sinker should probably be waiting for the pod to complete in addition to waiting for the ProwJob to complete and the pod to reach the max age: https://github.com/kubernetes/test-infra/blob/master/prow/cmd/sinker/main.go#L189

If we do this without signaling the pods for deletion we will have a big pileup of pods because the jobs will simply keep running for the full length of the job.

@cjwagner
Member

If we do this without signaling the pods for deletion we will have a big pileup of pods because the jobs will simply keep running for the full length of the job.

Isn't that the desired behavior when AllowCancellations is false? If we want to terminate pods when the ProwJob is terminated we should enable AllowCancellations.

@BenTheElder
Member

Isn't that the desired behavior when AllowCancellations is false? If we want to terminate pods when the ProwJob is terminated we should enable AllowCancellations.

This behavior is somewhat orthogonal to Sinker, but yes that sounds right to me. My point is just that if we do fix sinker in this way we will immediately have issues as we're currently relying on this aggressive GC...

We might be able to rebuild the prow cluster first with nodes that have larger boot disks to help alleviate this. The cluster needs some reshaping as-is.

@stevekuznetsov
Contributor

We still run 99% of our jobs on Jenkins, and cancelling the job runs the post-steps, so we're fine :)

@krzyzacy
Member Author

Sounds like the easiest thing would be to introduce a different GC interval for running pods with finished prowjobs?

@krzyzacy
Member Author

@shyamjvs how do you feel about having a pool of projects instead of one giant project?

@shyamjvs
Member

how do you feel about having a pool of projects instead of one giant project?

I'm not a fan of that approach for at least the following reasons:

  • it'll mean we need to maintain multiple projects, and record-keeping / changing project properties (e.g. quotas) will be much more tedious
  • we don't want to keep creating new projects as we increase the parallelism of those jobs
  • the behavior you mention in #7673 (comment) seems to me like only an artificial difference. Even in the case of one big project, what's stopping us from doing the same kind of project cleanup after a test run for the resources it created?

@shyamjvs
Member

Actually after thinking a bit more, I have an idea to solve this problem easily (without needing either grace-period or prow changes).
Given that we see this problem only for PR jobs (where we kill the run bluntly when a new commit is pushed), and assuming that "a PR should only have a single instance of a particular presubmit" holds, we can name the cluster we create for a presubmit job as:

e2e-<PR#>-<hash of job name>

The above naming has the following nice properties:

  • It is unique for a given (presubmit job, PR) pair, so there will be no clashes across PRs or across presubmits for a single PR
  • Since the new run of the same presubmit job uses the same cluster name, it will bring Down the earlier cluster automatically as part of our setup scripts before starting the new one

One potential consequence I can think of here is increased run time due to the extra "down" step. But from historical data, the TearDown step takes just ~7 mins even for our 100-node presubmit, which is quite reasonable IMO.
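
A rough sketch of how the name could be derived - the hash function and truncation length here are illustrative assumptions, not necessarily what the actual change implements:

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// clusterName builds e2e-<PR#>-<short hash of job name>, so the name is stable
// across reruns of the same presubmit on the same PR but unique otherwise.
func clusterName(prNumber int, jobName string) string {
	sum := sha1.Sum([]byte(jobName))
	return fmt.Sprintf("e2e-%d-%x", prNumber, sum[:3]) // first 6 hex chars of the digest
}

func main() {
	fmt.Println(clusterName(68096, "pull-kubernetes-e2e-gce-100-performance"))
}
```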

If that sounds good, I can easily make the change.

@shyamjvs
Member

shyamjvs commented Apr 13, 2018

One more case that I missed earlier is that of a presubmit running on a batch of PRs. In that case, we might either use the first PR's number in the batch, or concatenate all the numbers and take a hash. Both schemes are sensitive to the ordering of the PRs, but I'm not sure we should care enough about that.

[EDIT]: Actually, a better approach seems to be to use 'batch' instead of the PR# for such cases (just like we're already doing in prow). This is simple and also a clean way to handle leaks from batch runs.

@shyamjvs
Member

I've sent #7682 implementing my idea above - let me know what you think of it.

@BenTheElder
Member

we should really move to aborting old jobs :(

@shyamjvs
Member

shyamjvs commented Aug 31, 2018

That smells like a broken teardown script. I think we should just fix that by adding retries - I'll try looking into it next week. In the long run it seems like this won't be a problem anymore once we move to boskos.

@krzyzacy
Member Author

I think bad things can happen, like running kube-down while kube-up is still in progress... It will be hard to fix, so yes, let's move to boskos :-)

@shyamjvs
Member

shyamjvs commented Sep 1, 2018

I agree about the boskos part, but I'm not sure why bad things can happen with kube-up and kube-down (we should only ever have one of them happening at any point in time, which should be OK AFAIK).

@krzyzacy
Member Author

krzyzacy commented Sep 1, 2018

compare https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/68096/pull-kubernetes-e2e-gce-100-performance/18880/ and https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/68096/pull-kubernetes-e2e-gce-100-performance/18882/

Both of them share the same cluster name: e2e-68096-95a39

18882 was triggered 2 minutes after 18880, so while 18880 was running kube-up, 18882 was probably also executing kube-down or kube-up, which races.

@shyamjvs
Member

shyamjvs commented Sep 1, 2018

Hmm... why wasn't 18880 killed when 18882 was about to start?

@krzyzacy
Member Author

krzyzacy commented Sep 1, 2018

/shrug
#7673 (comment)
we haven't done anything here yet

@k8s-ci-robot added the ¯\_(ツ)_/¯ label Sep 1, 2018
@BenTheElder
Member

I think we should enable termination, add a large grace period for non-boskos e2e jobs, and switch to random cluster IDs. Then we'll tear down the old cluster and spin up the new one concurrently.
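
Something as simple as a per-run random suffix would do - a sketch; the prefix and suffix length are arbitrary:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// randomClusterName returns a unique name per run, so tearing down the old
// cluster and bringing up the new one never race on the same name.
func randomClusterName() string {
	b := make([]byte, 3)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return "e2e-" + hex.EncodeToString(b)
}

func main() {
	fmt.Println(randomClusterName()) // e.g. e2e-a1b2c3
}
```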

@shyamjvs
Member

shyamjvs commented Sep 1, 2018 via email

@BenTheElder
Member

BenTheElder commented Sep 1, 2018 via email

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Nov 30, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Dec 30, 2018
@wojtek-t
Member

wojtek-t commented Jan 4, 2019

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label Jan 4, 2019
@mithrav
Contributor

mithrav commented Jan 4, 2019

What can we do to move this along? It has been over 6 months.
/remove-lifecycle stale

@BenTheElder
Member

Jobs should not depend on exit traps / cleanup within the task. This is a wontfix.

Even if we changed k8s / Prow, we can't guarantee that infra won't ever fail. Cleanup needs to be reentrant and managed by controllers or follow-up tasks (i.e. use boskos).

@BenTheElder
Member

@amwat has been moving more of this to boskos

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Apr 4, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label May 4, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
