Change cluster naming convention for e2e CI/PR jobs #7682

shyamjvs · 2018-04-13T16:41:53Z

Ref #7673

Does this look reasonable to you?

/cc @krzyzacy @BenTheElder @cjwagner

krzyzacy · 2018-04-13T18:04:01Z

/cc @rmmh
looks like this will work - not sure if there's a case where we can trigger multiple batch jobs?

shyamjvs · 2018-04-16T09:39:35Z

@rmmh Could you PTAL? The rationale of this change is explained here - #7673 (comment)

rmmh · 2018-04-16T18:07:36Z

jenkins/bootstrap.py

@@ -657,9 +657,10 @@ def pr_paths(base, repos, job, build):
    # Batch merges are those with more than one PR specified.
    pr_nums = pull_numbers(pull)
    if len(pr_nums) > 1:
-        pull = os.path.join(prefix, 'batch')
+        os.environ[PULL_ENV] = 'batch'


don't change bootstrap.py -- PULL_NUMBER is set by Prow.

Done - PTAL.

rmmh · 2018-04-16T18:14:28Z

scenarios/kubernetes_e2e.py

+    # This ensures no conflict across runs of different jobs (see #7592).
+    # For PR jobs, we use PR number instead of build number to ensure the
+    # name is constant across different runs of the presubmit on the PR.
+    # This helps clean potentially leaked resources from earlier run that


if the cluster already exists, it will be deleted first, right?

That's right - as part of k/k's e2e-up.sh script.

shyamjvs · 2018-04-16T18:21:30Z

@rmmh Addressed the comments - PTAL.

krzyzacy · 2018-04-16T18:26:35Z

/hold

rmmh · 2018-04-16T18:29:12Z

scenarios/kubernetes_e2e.py

+    # name is constant across different runs of the presubmit on the PR.
+    # This helps clean potentially leaked resources from earlier run that
+    # could've got evicted midway (see #7673).
+    suffix = os.getenv('BUILD_NUMBER', 0)


make this dispatch explicit based on JOB_TYPE, and reduce the amount of magic happening in getenv:

    job_type = os.getenv('JOB_TYPE')
    if job_type == 'batch':
        suffix = 'batch'
    elif job_type == 'presubmit':
        suffix = '%s' % os.environ['PULL_NUMBER']
    else:
        suffix = 'b%s' % os.getenv('BUILD_NUMBER', 0)
    if len(suffix) > 10:
        suffix = hashlib.md5(suffix).hexdigest()[:10]
    job_hash = hashlib.md5(os.getenv('JOB_NAME', '')).hexdigest()[:5]
    return 'e2e-%s-%s' % (suffix, job_hash)
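For anyone who wants to try the suggested dispatch locally, here is a minimal Python 3 sketch of it. The function name `cluster_name` and the explicit `env` parameter are assumptions added for testability and are not names from the PR. The sketch also encodes strings before hashing, which Python 3 requires.

```python
import hashlib
import os


def cluster_name(env=None):
    """Build an e2e cluster name from Prow-style job environment variables.

    Hypothetical helper: the name and the `env` parameter are illustrative.
    """
    env = os.environ if env is None else env
    job_type = env.get('JOB_TYPE')
    if job_type == 'batch':
        suffix = 'batch'
    elif job_type == 'presubmit':
        # Constant across reruns of the same presubmit on the same PR.
        suffix = str(env['PULL_NUMBER'])
    else:
        suffix = 'b%s' % env.get('BUILD_NUMBER', 0)
    if len(suffix) > 10:
        # Keep the name short; hashlib needs bytes on Python 3.
        suffix = hashlib.md5(suffix.encode('utf-8')).hexdigest()[:10]
    job_hash = hashlib.md5(env.get('JOB_NAME', '').encode('utf-8')).hexdigest()[:5]
    return 'e2e-%s-%s' % (suffix, job_hash)
```

Passing a plain dict instead of `os.environ` makes it easy to check that two runs of the same presubmit on the same PR get the same name, while CI runs with different build numbers do not.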

krzyzacy · 2018-04-16T18:30:10Z

my concern here:

you have a PR:

presubmit-scale run 1 is triggered
- cluster abc is created
new commit is pushed
presubmit-scale run 2 is triggered, run 1 is still running
- cluster abc is killed, then created again
run 1 is finished, or terminated gracefully after 30min
- cluster abc is killed from run 1
run 2 is going to be in a broken state now
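The interleaving described above can be reproduced with a toy in-memory model. This is only a sketch: `FakeCloud` and `simulate_overlap` are hypothetical names, and the "delete before create" behaviour is taken from the earlier remark in this thread that an existing cluster is deleted first.

```python
class FakeCloud:
    """In-memory stand-in for a cloud project; not a real API."""

    def __init__(self):
        self.clusters = {}  # cluster name -> run that created it

    def up(self, name, run):
        # An existing cluster with the same name is deleted before a new
        # one is created.
        self.clusters.pop(name, None)
        self.clusters[name] = run

    def down(self, name):
        # Teardown is by name only, so it cannot tell whose cluster it is.
        self.clusters.pop(name, None)


def simulate_overlap():
    """Replay the scenario and report whether run 2 still has a cluster."""
    cloud = FakeCloud()
    cloud.up('abc', run=1)  # run 1 creates cluster abc
    cloud.up('abc', run=2)  # new commit pushed: run 2 recreates abc
    cloud.down('abc')       # run 1 finishes and tears down "its" cluster
    return 'abc' in cloud.clusters
```

In this model `simulate_overlap()` returns `False`: run 1's teardown removes the cluster that run 2 is using, which is the broken state the comment warns about.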

shyamjvs · 2018-04-16T18:33:07Z

@krzyzacy With this change, we wouldn't need to set a grace-period for the scalability jobs (and in fact for any job that's creating a k8s cluster) :)

[EDIT: To clarify, reaping the leaked resources would be part of the next run's setup scripts]

rmmh · 2018-04-16T18:44:32Z

we should probably retain a long enough grace period for the test logs to upload

/lgtm
/hold

shyamjvs · 2018-04-18T20:29:50Z

I think we should switch everything to boskos, but that may take time..

This may not be possible currently for our scalability presubmits due to need for various kinds of quota (unless we can somehow change boskos to account for that).

BenTheElder · 2018-04-18T20:30:40Z

> This may not be possible currently for our scalability presubmits due to need for various kinds of quota (unless we can somehow change boskos to account for that).

Boskos supports multiple resource pools, gpu related testing has its own collection of projects with special quota. We can do something similar for scalability?

krzyzacy · 2018-04-18T20:30:54Z

@shyamjvs there's ways to manage quota for internal projects, you don't have to do it manually

shyamjvs · 2018-04-18T20:38:23Z

> Boskos supports multiple resource pools, gpu related testing has its own collection of projects with special quota. We can do something similar for scalability?

I see, thanks. If we can somehow ensure that our jobs land on specific project(s), that should work.

> there's ways to manage quota for internal projects, you don't have to do it manually

By "manage quota" do you mean manage increasing/decreasing quota for our scalability projects or manage allocation of projects to our jobs by test-infra?

BenTheElder · 2018-04-18T21:22:44Z

> I see, thanks. If we can somehow ensure that our jobs land on specific project(s), that should work.

Yeah, for gpu jobs we have the flag --gcp-project-type=gpu-project, we should be able to do --gcp-project-type=scalability or similar, @krzyzacy is the expert on this.
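To make the idea of typed project pools concrete, here is a toy sketch. It is not the Boskos API: `ProjectPool`, its methods, and the project names are all hypothetical, and `scalability` is only the type name floated in this thread.

```python
class ProjectPool:
    """Toy pool of projects grouped by type; not the Boskos API."""

    def __init__(self, projects_by_type):
        # Copy the lists so the caller's data is not mutated.
        self.free = {t: list(p) for t, p in projects_by_type.items()}
        self.busy = {}  # project -> (type, owner)

    def acquire(self, project_type, owner):
        """Hand out a free project of the requested type."""
        free = self.free.get(project_type, [])
        if not free:
            raise RuntimeError('no free project of type %r' % project_type)
        project = free.pop(0)
        self.busy[project] = (project_type, owner)
        return project

    def release(self, project):
        """Return a project to the pool it came from."""
        project_type, _ = self.busy.pop(project)
        self.free[project_type].append(project)
```

A job asking for type `scalability` would then only ever land on projects registered under that type, which is the property asked for above.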

krzyzacy · 2018-04-18T21:37:58Z

chatted offline with @shyamjvs, I'm fine with stabilizing the tests first, then tackling #7769

shyamjvs · 2018-04-20T14:38:48Z

Can we get this PR in if there are no other concerns?
This is important to fix the scalability presubmits that are currently failing due to leaked resources eating up quota:

If there are some discussions that can happen independent of this PR (like moving scale jobs to boskos) - I'd prefer not blocking this on those.
And regarding this PR itself, to clarify:

- this doesn't change behaviour of CI jobs (as I'm still continuing to use build-number as part of cluster name)
- this doesn't change behaviour of batch presubmit runs (same as above)
- this changes behaviour for normal presubmit runs as follows: if a presubmit run is killed suddenly, then the next run will clean the leaked resources due to the same cluster name (reducing dependency on janitor)
- once we have a proper graceful-deletion based solution, we may be able to get rid of this. But until then I'd suggest having this change to mitigate the leakage issues.

Does it SG?
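The effect on leaks can be illustrated with a small model. This is a sketch under the assumption stated in this thread that each run's setup deletes any existing cluster with its own name before creating a new one. `leaked_clusters` and the cluster names are hypothetical.

```python
def leaked_clusters(names):
    """Return clusters left behind when every run is killed before teardown.

    `names` is the sequence of cluster names used by successive runs. Each
    run's setup first deletes any cluster with its own name, then creates it.
    """
    existing = set()
    for name in names:
        existing.discard(name)  # setup reaps a same-named leftover
        existing.add(name)      # cluster is created, then the run is killed
    return existing


# Three killed runs with per-build names leave three clusters behind.
per_build = leaked_clusters(['e2e-b1', 'e2e-b2', 'e2e-b3'])
# Three killed runs sharing a per-PR name leave only the most recent one.
per_pr = leaked_clusters(['e2e-7682', 'e2e-7682', 'e2e-7682'])
```

Note that the last cluster still leaks until another run on the same PR (or the janitor) removes it, so this reduces leaks rather than eliminating them.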

shyamjvs · 2018-04-20T15:04:17Z

For now, I manually cleaned up the resources leaked by the many earlier runs. With this PR in, there should be far fewer leaks.

krzyzacy · 2018-04-20T15:49:48Z

/lgtm
/hold
I'm OOO today... @cjwagner maybe flip AllowCancellations and then merge this one?

BenTheElder · 2018-04-20T18:07:08Z

I would prefer we not check in new code using a deprecated environment variable, FWIW.

cjwagner · 2018-04-20T22:46:41Z

> I'm OOO today... @cjwagner maybe flip AllowCancellations and then merge this one?

Flipping AllowCancellations should not make a difference because sinker is deleting the pods anyways currently. Basically we are always allowing cancellations.

The fix that I am suggesting would just fix the AllowCancellations: false case so that pods still run to completion if their prowjob is aborted. This case wouldn't apply to our Prow instance because we would set AllowCancellations: true.

shyamjvs · 2018-04-23T12:10:27Z

> I would prefer we not check in new code using a deprecated environment variable, FWIW.

@BenTheElder If the problem is using BUILD_NUMBER (which is being deprecated) instead of BUILD_ID.. then I can easily change that :)

BenTheElder · 2018-04-23T18:53:21Z

> @BenTheElder If the problem is using BUILD_NUMBER (which is being deprecated) instead of BUILD_ID.. then I can easily change that :)

Please do. We should minimize the number of environment variables in use where reasonable.

shyamjvs · 2018-04-23T19:00:25Z

@BenTheElder Done, PTAL

BenTheElder · 2018-04-23T20:16:05Z

scenarios/kubernetes_e2e.py

+    job_type = os.getenv('JOB_TYPE')
+    if job_type == 'batch':
+        suffix = 'batch-%s' % os.getenv('BUILD_ID', 0)
+    elif job_type == 'presubmit':


why aren't we just using BUILD_ID for all cases?

That's the reason to have this PR in the first place :)
See #7673 (comment) for the rationale.

If we use BUILD_ID for the presubmit runs then the whole purpose (of having the same cluster-name for a given presubmit+PR pair) is defeated - and we end up in pretty much the same state that we are already in.

Er so we're going to depend on cluster naming for the same project + PR number + job and the side effect that kubetest will tear it down ... ? 😯
This feels like a brittle workaround that will come back to bite us. In the future this will get refactored... We don't want to need the "scenarios" long term..

BenTheElder

/lgtm
/hold cancel
I'm merging this because we have very real problems with the affected jobs right now, but for the record I don't think this is a strong solution. New projects should be managed by boskos and we should improve the boskos janitor redundancy.

k8s-ci-robot · 2018-04-23T21:57:09Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder, krzyzacy, rmmh, shyamjvs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [BenTheElder,krzyzacy,rmmh,shyamjvs]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

shyamjvs · 2018-04-23T22:18:09Z

Thanks Ben. That does sound like a reasonable long-term plan.

shyamjvs · 2018-04-25T17:21:16Z

Ref kubernetes/kubernetes#62267

shyamjvs requested review from BenTheElder and krzyzacy as code owners April 13, 2018 16:41

k8s-ci-robot requested a review from cjwagner April 13, 2018 16:41

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 13, 2018

shyamjvs mentioned this pull request Apr 13, 2018

Prow terminates e2e pods while test is still running which can cause resource leaks #7673

Closed

k8s-ci-robot requested a review from rmmh April 13, 2018 18:04

shyamjvs force-pushed the make-presubmit-cluster-names-different branch from 7180b60 to 9fa0b98 Compare April 15, 2018 19:31

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 15, 2018

rmmh reviewed Apr 16, 2018

shyamjvs force-pushed the make-presubmit-cluster-names-different branch from 9fa0b98 to cbb1934 Compare April 16, 2018 18:16

k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 16, 2018

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 16, 2018

rmmh reviewed Apr 16, 2018

shyamjvs force-pushed the make-presubmit-cluster-names-different branch from cbb1934 to 256b8d8 Compare April 16, 2018 18:39

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 16, 2018

shyamjvs force-pushed the make-presubmit-cluster-names-different branch from 256b8d8 to 0e0e047 Compare April 16, 2018 18:41

k8s-ci-robot assigned rmmh Apr 16, 2018

k8s-ci-robot assigned krzyzacy Apr 20, 2018

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 20, 2018

BenTheElder mentioned this pull request Apr 20, 2018

prow: investifgate not exposing any job configuration to the test container #7810

Closed

Change cluster naming convention for e2e CI/PR jobs

b82a294

shyamjvs force-pushed the make-presubmit-cluster-names-different branch from cca348b to b82a294 Compare April 23, 2018 18:59

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 23, 2018

BenTheElder reviewed Apr 23, 2018

BenTheElder approved these changes Apr 23, 2018

k8s-ci-robot assigned BenTheElder Apr 23, 2018

k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Apr 23, 2018

k8s-ci-robot merged commit 2c43ee2 into kubernetes:master Apr 23, 2018

shyamjvs deleted the make-presubmit-cluster-names-different branch April 23, 2018 22:18
