Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change cluster naming convention for e2e CI/PR jobs #7682

Conversation

shyamjvs
Copy link
Member

@shyamjvs shyamjvs commented Apr 13, 2018

Ref #7673

Does this look reasonable to you?

/cc @krzyzacy @BenTheElder @cjwagner

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 13, 2018
@krzyzacy
Copy link
Member

/cc @rmmh
looks like this will work - not sure if there's case that we can trigger multiple batch jobs?

@k8s-ci-robot k8s-ci-robot requested a review from rmmh April 13, 2018 18:04
@shyamjvs shyamjvs force-pushed the make-presubmit-cluster-names-different branch from 7180b60 to 9fa0b98 Compare April 15, 2018 19:31
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 15, 2018
@shyamjvs
Copy link
Member Author

@rmmh Could you PTAL? The rationale of this change is explained here - #7673 (comment)

@@ -657,9 +657,10 @@ def pr_paths(base, repos, job, build):
# Batch merges are those with more than one PR specified.
pr_nums = pull_numbers(pull)
if len(pr_nums) > 1:
pull = os.path.join(prefix, 'batch')
os.environ[PULL_ENV] = 'batch'
Copy link
Contributor

@rmmh rmmh Apr 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't change bootstrap.py -- PULL_NUMBER is set by Prow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - PTAL.

# This ensures no conflict across runs of different jobs (see #7592).
# For PR jobs, we use PR number instead of build number to ensure the
# name is constant across different runs of the presubmit on the PR.
# This helps clean potentially leaked resources from earlier run that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the cluster already exists, it will be deleted first, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right - as part of k/k's e2e-up.sh script.

@shyamjvs shyamjvs force-pushed the make-presubmit-cluster-names-different branch from 9fa0b98 to cbb1934 Compare April 16, 2018 18:16
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 16, 2018
@shyamjvs
Copy link
Member Author

@rmmh Fixed on comments - PTAL.

@krzyzacy
Copy link
Member

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 16, 2018
# name is constant across different runs of the presubmit on the PR.
# This helps clean potentially leaked resources from earlier run that
# could've got evicted midway (see #7673).
suffix = os.getenv('BUILD_NUMBER', 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this dispatch explicit based on JOB_TYPE, and reduce the amount of magic happening in getenv:

job_type = os.getenv('JOB_TYPE')
if job_type == 'batch':
    suffix = 'batch'
elif job_type == 'presubmit':
    suffix = '%s' % os.environ['PULL_NUMBER']
else:
    suffix = 'b%s' % os.getenv('BUILD_NUMBER', 0)
if len(suffix) > 10:
    suffix = hashlib.md5(suffix).hexdigest([:10])
job_hash = hashlib.md5(os.getenv('JOB_NAME', '')).hexdigest()[:5]
return 'e2e-%s-%s' % (suffix, job_hash)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@krzyzacy
Copy link
Member

my concern here:

you have a PR:

  • presubmit-scale run 1 is triggered
    • cluster abc is created
  • new commit is pushed
  • presubmit-scale run 2 is triggered, run 1 is still running
    • cluster abc is killed, then created again
  • run 1 is finished, or terminated gracefully after 30min
    • cluster abc is killed from run 1
  • run 2 is going to be in a broken state now

@shyamjvs
Copy link
Member Author

shyamjvs commented Apr 16, 2018

@krzyzacy With this change, we wouldn't need to set a grace-period for the scalability jobs (and in fact for any job that's creating a k8s cluster) :)

[EDIT: To clarify, reaping the leaked resources would be part of the next run's setup scripts]

@shyamjvs shyamjvs force-pushed the make-presubmit-cluster-names-different branch from cbb1934 to 256b8d8 Compare April 16, 2018 18:39
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 16, 2018
@shyamjvs shyamjvs force-pushed the make-presubmit-cluster-names-different branch from 256b8d8 to 0e0e047 Compare April 16, 2018 18:41
@rmmh
Copy link
Contributor

rmmh commented Apr 16, 2018

we should probably retain a long enough grace period for the test logs to upload

/lgtm
/hold

@shyamjvs
Copy link
Member Author

I think we should switch everything to boskos, but that may take time..

This may not be possible currently for our scalability presubmits due to need for various kinds of quota (unless we can somehow change boskos to account for that).

@BenTheElder
Copy link
Member

BenTheElder commented Apr 18, 2018

This may not be possible currently for our scalability presubmits due to need for various kinds of quota (unless we can somehow change boskos to account for that).

Boskos supports multiple resource pools, gpu related testing has its own collection of projects with special quota. We can do something similar for scalability?

@krzyzacy
Copy link
Member

@shyamjvs there's ways to manage quota for internal projects, you don't have to do it manually

@shyamjvs
Copy link
Member Author

Boskos supports multiple resource pools, gpu related testing has its own collection of projects with special quota. We can do something similar for scalability?

I see, thanks. If we can somehow ensure that our jobs land on specific project(s), that should work.

there's ways to manage quota for internal projects, you don't have to do it manually

By "manage quota" do you mean manage increasing/decreasing quota for our scalability projects or manage allocation of projects to our jobs by test-infra?

@BenTheElder
Copy link
Member

I see, thanks. If we can somehow ensure that our jobs land on specific project(s), that should work.

Yeah, for gpu jobs we have the flag --gcp-project-type=gpu-project, we should be able to do --gcp-project-type=scalability or similar, @krzyzacy is the expert on this.

@krzyzacy
Copy link
Member

chatted offline with @shyamjvs , I'm fine with stabilize the tests first, then tackle #7769

@shyamjvs
Copy link
Member Author

Can we get this PR in if there are no other concerns?
This is important to fix the scalability presubmits that are currently failing due to leaked resources eating up quota:

If there are some discussions that can happen independent of this PR (like moving scale jobs to boskos) - I'd prefer not blocking this on those.
And regarding this PR itself, to clarify:

  • this doesn't change behaviour of CI jobs (as I'm still continuing to use build-number as part of cluster name)
  • this doesn't change behaviour of batch presubmit runs (same as above)
  • this changes behaviour for normal presubmits runs as follows: If a presubmit run is killed suddenly, then next run will clean the leaked resources due to same cluster name (reducing dependency on janitor)
  • once we have a proper graceful-deletion based solution - we may be able to get rid of this. But until then I'd suggest having this change to mitigate the leakage issues.

Does it SG?

@shyamjvs
Copy link
Member Author

For now, I manually cleaned up the leaked resources from so many runs. With this PR in, there should be way less leaks.

@krzyzacy
Copy link
Member

/lgtm
/hold
I'm OOO today... @cjwagner maybe flip AllowCancellations and then merge this one?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 20, 2018
@BenTheElder
Copy link
Member

I would prefer we not check in new code using a deprecated environment variable, FWIW.

@cjwagner
Copy link
Member

I'm OOO today... @cjwagner maybe flip AllowCancellations and then merge this one?

Flipping AllowCancellations should not make a difference because sinker is deleting the pods anyways currently. Basically we are always allowing cancellations.

The fix that I am suggesting would just fix the AllowCancellations: false case so that pods still run to completion if their prowjob is aborted. This case wouldn't apply to our Prow instance because we would set AllowCancellations: true.

@shyamjvs
Copy link
Member Author

shyamjvs commented Apr 23, 2018

I would prefer we not check in new code using a deprecated environment variable, FWIW.

@BenTheElder If the problem is using BUILD_NUMBER (which is being deprecated) instead of BUILD_ID.. then I can easily change that :)

@BenTheElder
Copy link
Member

@BenTheElder If the problem is using BUILD_NUMBER (which is being deprecated) instead of BUILD_ID.. then I can easily change that :)

Please do. We should minimize the number of environment variables in use where reasonable.

@shyamjvs shyamjvs force-pushed the make-presubmit-cluster-names-different branch from cca348b to b82a294 Compare April 23, 2018 18:59
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 23, 2018
@shyamjvs
Copy link
Member Author

@BenTheElder Done, PTAL

job_type = os.getenv('JOB_TYPE')
if job_type == 'batch':
suffix = 'batch-%s' % os.getenv('BUILD_ID', 0)
elif job_type == 'presubmit':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why aren't we just using BUILD_ID for all cases?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the reason to have this PR in the first place :)
See #7673 (comment) for the rationale.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use BUILD_ID for the presubmit runs then the whole purpose (of having the same cluster-name for a given presubmit+PR pair) is defeated - and we end being pretty much in the same state that we already are in.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Er so we're going to depend on cluster naming for the same project + PR number + job and the side effect that kubetest will tear it down ... ? 😯
This feels like a brittle work around that will come to bite us. In the future this will get refactored... We don't want to need the "scenarios" long term..

Copy link
Member

@BenTheElder BenTheElder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/hold cancel
I'm merging this because we have very real problems with the affected jobs right now, but for the record I don't think this is a strong solution, new projects should be managed by boskos and we should improve the boskos janitor redundancy.

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Apr 23, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder, krzyzacy, rmmh, shyamjvs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [BenTheElder,krzyzacy,rmmh,shyamjvs]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 2c43ee2 into kubernetes:master Apr 23, 2018
@shyamjvs
Copy link
Member Author

shyamjvs commented Apr 23, 2018 via email

@shyamjvs shyamjvs deleted the make-presubmit-cluster-names-different branch April 23, 2018 22:18
@shyamjvs
Copy link
Member Author

Ref kubernetes/kubernetes#62267

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants