Prow terminates e2e pods while test is still running which can cause resource leaks #7673
Ref kubernetes/kubernetes#62267 cc @krzysied @wojtek-t - This is one of the main problems causing failures in our presubmits. |
I would say that this is more important than improving janitor. |
This one may not be reasonable: cluster cleanup can take a long time, it's not clear how long a reasonable grace period is, and we'd also have to plumb the grace period through. The default appears to be 30s, which is likely not enough to do any useful cluster cleanup. |
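For illustration, a minimal sketch of what plumbing a longer grace period through to the test pod could look like, assuming the pod spec is built with client-go types; the helper name and the 600s value are assumptions, not anything that exists in plank today:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// podSpecForJob builds a pod spec with an extended termination grace period,
// so the test process has time to run cluster teardown after SIGTERM.
// The kubelet default is 30s; for e2e teardown something like 600s is assumed.
func podSpecForJob(container corev1.Container, gracePeriodSeconds int64) corev1.PodSpec {
	return corev1.PodSpec{
		Containers:                    []corev1.Container{container},
		RestartPolicy:                 corev1.RestartPolicyNever,
		TerminationGracePeriodSeconds: &gracePeriodSeconds,
	}
}

func main() {
	spec := podSpecForJob(corev1.Container{Name: "test", Image: "kubetest"}, 600)
	fmt.Println(*spec.TerminationGracePeriodSeconds) // 600
}
```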
/cc @stevekuznetsov |
@shyamjvs can you use a pool of projects for your presubmit? Is the default quota enough for your presubmits? If so, we can just switch to a boskos pool and the janitor can do cleanups after each run. |
It's fine to set the grace period to 5 or 10 minutes, for example. In that case, cleaning up the cluster will be needed less often, and the current mechanisms for that should then be enough. |
@krzyzacy - by far not. We need a looot of quota for it. |
It may not be fine to set this for all jobs; leaving extra pods around longer while starting more eats into disk on the nodes, which will trigger more pod evictions. The cluster defaults are not great around this (no soft eviction, AFAICT). We're otherwise pretty aggressive about GCing pods (using our own controller) because of our workload, to avoid this. |
I don't understand that. The grace period is the max time that the kubelet will give you; if you don't have anything else to do, the pod can finish itself earlier. Assuming the test finished normally, there shouldn't be any difference. |
Does a pod automatically send signals to all running processes upon termination, or does that part need to be implemented? |
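For what it's worth, the kubelet only delivers the termination signal to each container's main process; forwarding it to children and turning it into cleanup is up to the entrypoint itself. A rough sketch (not kubetest's actual code) of what catching it could look like:

```go
package main

import (
	"context"
	"fmt"
	"os/signal"
	"syscall"
)

func main() {
	// ctx is cancelled when the pod is being terminated (SIGTERM) or on Ctrl-C.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	select {
	case <-runTests():
		fmt.Println("tests finished normally")
	case <-ctx.Done():
		// Only useful if the grace period is long enough for teardown to finish.
		fmt.Println("terminating early, running cluster teardown")
		tearDownCluster()
	}
}

// runTests stands in for launching the ginkgo suite; a real runner would also
// forward the signal to (or kill the process group of) its children.
func runTests() <-chan struct{} {
	done := make(chan struct{})
	go func() { close(done) }() // placeholder: finish immediately
	return done
}

// tearDownCluster stands in for `kubetest --down` / kube-down.sh.
func tearDownCluster() {}
```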
But we are scheduling another job while terminating the previous one, and tests are very often terminated early by a new commit. I think our existing pod creation logic / rate limiting may not play well with this change as-is. To be clear: I suggested that this is probably the right route offline before @krzyzacy created this issue, but I also want to make sure we've handled this everywhere (plank, sinker, actually keeping soft-eviction enabled) before we go changing how we schedule things. We don't want more of #5700 |
I'm not sure if having a pool of projects will help here. It will just split this problem of resource leaks across different projects. The problem here is that we have a quota of X (it doesn't matter across how many projects) and when there are more than Y leaked runs (which we should ensure is NOT the case), we can't run enough presubmits in parallel anymore.
Like Wojtek said, this is far beyond the default quota of most (all?) of our boskos projects.
I agree too. |
It will be in a better situation. With a pool, right after each e2e run we clean up the project and return it to the pool for the next e2e job, so every e2e job gets a clean project. Currently the janitor naively runs every 3 hours and cleans up resources older than 3h in the project, and within that 3h window resources can pile up. |
Not sure if I understand this. Maybe let me clarify my earlier point:
Let me know how having multiple projects will help here? |
@shyamjvs boskos has a janitor controller; it will be triggered as soon as kubetest finishes and, say, spend 5min cleaning up the project, so the project will be reusable again after 5min. Currently, if all 12 runs are busted, they need to wait at least 3h, or in the worst case 6h, for a CI janitor run to clean the resources up. (The difference is that in boskos the janitor knows the project is not being used, whereas currently the janitor cannot tell whether a resource is stale or not, so it assumes everything older than 3h is stale.) |
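To make that lifecycle concrete, here is a rough sketch of the flow being described, using a hypothetical pool interface rather than the real boskos client API (whose method names and signatures differ):

```go
package main

import "fmt"

// ProjectPool is a hypothetical stand-in for a boskos client; it only
// illustrates the lifecycle: acquire a clean project, run the job, then hand
// the project back as dirty so the janitor scrubs it right away instead of
// on a fixed 3-hour schedule.
type ProjectPool interface {
	Acquire(resourceType string) (name string, err error)
	Release(name string, dirty bool) error
}

func runE2E(pool ProjectPool) error {
	project, err := pool.Acquire("gce-project")
	if err != nil {
		return fmt.Errorf("no clean project available: %v", err)
	}
	// Hand the project back when the run ends, even on test failure; if the
	// pod dies outright, the pool is assumed to reclaim expired leases itself.
	defer pool.Release(project, true)

	fmt.Printf("running kube-up / tests / kube-down in %s\n", project)
	return nil
}

func main() {}
```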
Per @cjwagner we may not actually be deleting pods, as allow cancellations is not on for our deployment (see test-infra/prow/plank/controller.go, line 264 at 9fc2342). |
Maybe sinker is deleting the pod then, since the prowjob is completed. |
As noted during standup, Prow doesn't cancel old pods by default (see https://github.com/kubernetes/test-infra/blob/master/prow/plank/controller.go#L264), but it does cancel the old ProwJob. |
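For readers following along, the gist of the code being pointed at, paraphrased with stub types (this is not the actual controller.go source): the old ProwJob is marked aborted, but its pod is only deleted when allow_cancellations is on.

```go
package main

import "fmt"

// Stub types standing in for the real plank controller's types.
type prowJob struct {
	Name    string
	PodName string
	State   string
}

type plankConfig struct{ AllowCancellations bool }

// terminateOlderRuns paraphrases the behavior under discussion: older runs of
// the same job are aborted, but their pods are deleted only when
// allow_cancellations is enabled (it is off in this deployment).
func terminateOlderRuns(cfg plankConfig, older []prowJob) {
	for i := range older {
		older[i].State = "aborted"
		if cfg.AllowCancellations {
			fmt.Println("deleting pod", older[i].PodName)
		} else {
			fmt.Println("leaving pod to finish or be GCed:", older[i].PodName)
		}
	}
}

func main() {
	terminateOlderRuns(plankConfig{AllowCancellations: false},
		[]prowJob{{Name: "pull-kubernetes-e2e-gce", PodName: "abc123"}})
}
```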
If we do this without signaling the pods for deletion we will have a big pileup of pods because the jobs will simply keep running for the full length of the job. |
Isn't that the desired behavior when |
This behavior is somewhat orthogonal to sinker, but yes, that sounds right to me. My point is just that if we fix sinker in this way, we will immediately have issues, because we're currently relying on this aggressive GC. We might be able to rebuild the prow cluster first with nodes that have more boot disk, to help alleviate this. The cluster needs some reshaping as-is. |
We still run 99% of our jobs on Jenkins, and cancelling the job runs the post-steps, so we're fine :) |
Sounds like the easiest thing is to introduce a different GC interval for running pods with finished prowjobs? |
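A sketch of what that could mean, with made-up names and TTL values (not sinker's actual code): one short TTL for pods that have already exited, and a longer one for pods still running after their prowjob finished.

```go
package main

import (
	"fmt"
	"time"
)

type podRecord struct {
	Name               string
	ProwJobCompletedAt time.Time
	PodFinished        bool
}

// shouldGC applies two different TTLs: finished pods are reaped quickly to
// protect node disk, while pods that are still running (e.g. tearing down a
// cluster) get a longer grace window before being deleted.
func shouldGC(p podRecord, now time.Time, finishedPodTTL, runningPodTTL time.Duration) bool {
	age := now.Sub(p.ProwJobCompletedAt)
	if p.PodFinished {
		return age > finishedPodTTL
	}
	return age > runningPodTTL
}

func main() {
	now := time.Now()
	p := podRecord{Name: "e2e-run", ProwJobCompletedAt: now.Add(-45 * time.Minute)}
	// ProwJob finished 45m ago but the pod is still running its teardown:
	// with a 30m TTL for finished pods and a 2h TTL for running ones, keep it.
	fmt.Println(shouldGC(p, now, 30*time.Minute, 2*time.Hour)) // false
}
```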
@shyamjvs how do you feel about having a pool of projects instead of one giant project? |
I'm not a fan of that approach for at least the following reasons:
|
Actually after thinking a bit more, I have an idea to solve this problem easily (without needing either grace-period or prow changes).
The above naming has the following nice properties:
One potential consequence I can think of here is increased time for the run due to the extra "down" step, but from historical data that doesn't look significant. If that sounds good, I can easily make the change. |
One more case that I missed earlier is that of a presubmit running on a batch of PRs. In such a case, we might either use the first PR's number in the batch or append all the numbers and take a hash. Both these schemes are sensitive to the ordering of the PRs, but I'm not sure if we should care enough about that. [EDIT]: Actually, a better approach seems to be to use 'batch' instead of the PR# for such cases (just like how we're doing it in prow already). This is simple and also a clean way to handle leaks from batch runs. |
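A rough sketch of how that naming could be derived; this is an assumption about the approach, not the change in #7682 itself (PULL_NUMBER and JOB_TYPE are env vars Prow already exposes to jobs):

```go
package main

import (
	"fmt"
	"os"
)

// clusterName derives a stable e2e cluster name per PR, so every run of a
// presubmit on the same PR reuses one name and an initial "down" step can
// tear down whatever an earlier aborted run leaked.
func clusterName(jobSuffix string) string {
	if os.Getenv("JOB_TYPE") == "batch" {
		// Batch runs span several PRs; use a fixed marker instead of a PR
		// number (an order-sensitive hash doesn't seem worth the trouble).
		return fmt.Sprintf("e2e-%s-batch", jobSuffix)
	}
	return fmt.Sprintf("e2e-%s-%s", jobSuffix, os.Getenv("PULL_NUMBER"))
}

func main() {
	fmt.Println(clusterName("gce-100-perf"))
}
```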
I've sent #7682 implementing my idea above - let me know what you think of it. |
we should really move to aborting old jobs :( |
That smells like a broken teardown script. I think we should just fix that by adding retries - I'll try looking into it next week. For the long run it seems like this won't be a problem anymore when we move to boskos. |
I think bad things can happen, like while running kube-up, you run kube-down... it will be hard to fix, so, yes let's move to boskos :-) |
I agree about the boskos part, but I'm not sure why bad things can happen with kube-up and kube-down (we should only have any one of them happening at any point in time, which should be ok AFAIK). |
Compare https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/68096/pull-kubernetes-e2e-gce-100-performance/18880/ and https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/68096/pull-kubernetes-e2e-gce-100-performance/18882/: both of them share the same cluster name. 18882 was triggered 2 min after 18880, so while 18880 was running kube-up, 18882 was probably also executing kube-down or kube-up, which is a race. |
Hmm.. Why wasn't 18880 killed as 18882 was about to start? |
/shrug |
I think we should enable termination, add a large grace period for non-boskos e2e jobs, and switch to random cluster ids. Then we'll tear down the old and spin up the new concurrently. |
I don't think that would work, as we would end up with just the same problem we had before: leaked resources eating up all the quota time and again (which won't let us run the presubmit at the wanted concurrency). |
Not if the teardown is actually working and we respect max concurrency...? Right now we're frequently wasting money and time on runs that will never pass due to this issue. |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle rotten |
What can we do to move this along? Has been over 6 months. |
Jobs should not depend on exit traps / cleanup within the task. This is a wontfix. Even if we changed k8s / Prow, we can't guarantee that infra won't ever fail. Cleanup needs to be reentrant and managed by controllers or follow-up tasks (i.e. use boskos). |
@amwat has been moving more of this to boskos |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
So, prow kills presubmit pods upon a new commit push, which can cause runs like https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/62467/pull-kubernetes-kubemark-e2e-gce-big/1392?log#log
Since we simply abort the ginkgo run, the entire cluster is going to be leaked. And currently the janitor is having issues doing bulk clean-up, while bootstrap does not handle timeouts properly.
I'll try to improve the clean-up logic for the janitor; I'm not sure if we need to give prow pods a graceful termination period so that we can send a SIGINT/SIGTERM to kubetest and let kubetest handle cluster cleanup properly.
/area jobs
cc @shyamjvs @cjwagner @BenTheElder @fejta