
RFC: Move boskos testing projects pool to kubernetes.io #390

Closed
cblecker opened this issue Oct 5, 2019 · 19 comments
@cblecker (Member) commented Oct 5, 2019

I'd like to start looking at moving the boskos pool over to publicly owned projects.

Things I see on the surface we'd need to do:

  • Template to create the projects
  • Understand what the "bare" state of these projects is
  • Who needs what permissions to these projects (should we set up an RBAC group for on-call? what service accounts do we need for prow/boskos/janitor?)
  • What quotas do we need to request for the projects
  • Probably other unknown things

cc: @kubernetes/test-infra-admins

@stevekuznetsov

/retitle RFC: Move boskos testing projects pool to kubernetes.io

@k8s-ci-robot changed the title from "RFC: Move bozkos testing projects pool to kubernetes.io" to "RFC: Move boskos testing projects pool to kubernetes.io" Oct 7, 2019
@BenTheElder (Member)

  • Template to create the projects

Roughly, it's just projects with the CI service account and some admins having access, plus quota depending on which pool they're going into.

  • Understand what the "bare" state of these projects is

Literally bare. No resources. Just a namespace w/ quota

  • Who needs what permissions to these projects (should we set up an RBAC group for on-call? what service accounts do we need for prow/boskos/janitor?)

Some humans should have backup access, but primarily the CI service account needs access.

That would be pr-kubekins@kubernetes-jenkins-pull.iam.gserviceaccount.com (this is visible from the CI logs)

In the future this should be some service account from a publicly owned prow.
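For reference, granting that kind of access is only a couple of gcloud commands per project. A minimal sketch, assuming a hypothetical community-owned project name and on-call group (only the CI service account above is real):

```bash
# Hypothetical project name; the real naming scheme is part of what this issue needs to decide.
PROJECT="k8s-infra-e2e-boskos-001"
CI_SA="pr-kubekins@kubernetes-jenkins-pull.iam.gserviceaccount.com"

# Let the CI service account create and delete e2e resources in the project.
gcloud projects add-iam-policy-binding "${PROJECT}" \
  --member="serviceAccount:${CI_SA}" \
  --role="roles/editor"

# Optional break-glass access for humans (group address is made up here).
gcloud projects add-iam-policy-binding "${PROJECT}" \
  --member="group:k8s-infra-oncall@kubernetes.io" \
  --role="roles/viewer"
```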

  • What quotas do we need to request for the projects

Each boskos pool is defined by the kind of quota present. I don't think the GCP non-GKE pool is particularly special (and the GKE pool should be managed by GKE...)

There are also pools for e.g. GPU testing, which need quota for that, and I think for scale testing (which of course needs more of basically all resources).

  • Probably other unknown things

We should consider the fact that the state of "is this project available" is in CRDs in the build cluster.

@BenTheElder (Member)

... continuing (accidentally hit enter)

As long as the state is in the build cluster, that means to switch prow over we'll either have serious disruption (need to spin down the pool) or need a whole new pool.

Humans generally have no need to access these projects, so in terms of getting the community access to the infra, the boskos projects are uninteresting: they're ~100% controlled by automation via public config already.

In terms of spending CNCF GCP credits, they're somewhat more interesting I suppose, if that's what we're going for.

If we're interested in migrating things just because we should migrate things, it would be much more useful to migrate boskos along with its management and state, and generally replace the legacy prow.k8s.io service accounts, etc. (can you tell by the jenkins in the name?) ...

@dims (Member) commented Oct 16, 2019

/assign @thockin

@thockin (Member) commented Oct 16, 2019

This seems like something we can and should enable ASAP. Christoph started with great questions. I'd like to add a couple:

How can we break down the billing or attribution for this? With a single big pool and a single CI service account, I have no idea who spent what money on what things. I think we need to do better than this.

  • Can we use a service account for each coarse "purpose"?
  • Can we use a distinct pool of projects for each coarse purpose?
  • Both?

Should we be EOL'ing projects after some number of uses (one? ten?) just for sanity?

Can quota requests be automated?

Who owns this, that we can have this conversation?

The net result of this is probably a script which ensures the requisite projects exist and have the correct IAM for the appropriate CI SA, plus a link to docs explaining what they are for. That alone seems straightforward, but without an owner to drive it, I don't think we can reasonably do much.
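For illustration only (this isn't the actual tooling, and every name, count, and billing account below is a placeholder), such a script could look roughly like:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholders: real values would come from whoever owns the migration.
BILLING_ACCOUNT="000000-000000-000000"
CI_SA="pr-kubekins@kubernetes-jenkins-pull.iam.gserviceaccount.com"

for i in $(seq -f "%03g" 1 40); do
  project="k8s-infra-e2e-boskos-${i}"

  # Create the project if it doesn't already exist, labeled by purpose
  # so billing can later be broken down per pool.
  if ! gcloud projects describe "${project}" >/dev/null 2>&1; then
    gcloud projects create "${project}" \
      --labels="purpose=e2e,pool=gce-project"
  fi

  # Attach the CNCF billing account.
  gcloud beta billing projects link "${project}" \
    --billing-account="${BILLING_ACCOUNT}"

  # Give the CI service account the access it needs.
  gcloud projects add-iam-policy-binding "${project}" \
    --member="serviceAccount:${CI_SA}" \
    --role="roles/editor"
done
```

Since GCP billing export includes project labels, labeling each pool's projects by purpose would give at least a coarse answer to the "who spent what" question.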

@BenTheElder (Member) commented Oct 17, 2019

Can we use a service account for each coarse "purpose"?

We can, but the CI users will need to correctly activate their unique service account.

These service accounts need to make their way into Prow, and not much prevents someone from using the wrong SA (the Prow cluster is so old, having been upgraded in place, that it doesn't have RBAC...).

Older-style "bootstrap.py" prowjobs (don't worry about the details; most of our CI jobs are these, though...) automagically activate a default service account before we get to testing.
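For illustration, "activating" a specific SA in a job is roughly the following; the key-file path is an assumption about where a Prow preset would mount it, not a documented location:

```bash
# Path is an assumption; Prow presets typically mount the key wherever
# GOOGLE_APPLICATION_CREDENTIALS points in the job's pod spec.
gcloud auth activate-service-account \
  --key-file="${GOOGLE_APPLICATION_CREDENTIALS:-/etc/service-account/service-account.json}"

# Subsequent gcloud calls in the job now run as that service account.
gcloud config list account
```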

Can we use a distinct pool of projects for each coarse purpose?

Yes, we have a few pools today. The GCP projects are monitored here, showing a few types (e.g. GPU):
http://velodrome.k8s.io/dashboard/db/boskos-dashboard?orgId=1

The full set of resources (including AWS) is here: https://github.com/kubernetes/test-infra/blob/d8449cb095fb6dc791958bbaf8940c7c1007410c/prow/cluster/boskos-resources.yaml
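For context, entries in that file follow the Boskos config shape: a resource type, a starting state, and a list of named resources. A trimmed sketch (the project names here are illustrative, not the real ones):

```yaml
resources:
  - type: gce-project          # generic GCP e2e pool
    state: dirty               # janitor cleans resources before they become free
    names:
      - k8s-boskos-gce-project-01
      - k8s-boskos-gce-project-02
  - type: gpu-project          # pool whose projects carry GPU quota
    state: dirty
    names:
      - k8s-boskos-gpu-project-01
```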

The biggest trick is just figuring out what a distinct use is and carving these up...
Unfortunately a ton of our CI tests are relatively ownerless so this may be tricky.

Most tests use the generic GCE pool but they don't have to.

Should we be EOL'ing projects after some number of uses (one? ten?) just for sanity?

Prow runs O(12000) builds/tests a day; if only 25% of those are GCP e2e, we'd churn through ~300 projects a day at 10 uses before retiring. I think this probably wouldn't scale.
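Spelling that out: 12,000 runs/day × 25% GCP e2e ≈ 3,000 project leases/day, and 3,000 leases/day ÷ 10 uses per project ≈ 300 projects to create and retire every day.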

Can quota requests be automated?

I took a quick look now and didn't see an API, but I'm not sure.

Who owns this, that we can have this conversation?

  • boskos the tool? => I might have an answer, but waiting for confirmation
  • or this migration? ... unsure
  • the prow.k8s.io deployment? => nominally the test-infra maintainers / Google EngProd team at the moment; the infra runs in the build/test workload cluster.

@thockin (Member) commented Oct 17, 2019 via email

@BenTheElder (Member)

As we consider moving prow into community space, we will HAVE to get a better story around this.

Agreed. I'm certainly not thrilled about the current state...

That said, I generally don't think we can consider the presubmit testing to be trustworthy, and scheduling with boskos is cooperative. Changing that would be a bit involved.

Cull the herd?

Yes and no.

A lot of valuable signal shouldn't be culled, IMO, but it still doesn't have a clear owner 😞 (e.g. who owns the periodic integration and unit testing...?)

We probably need to enforce ownership better somehow. I'm not sure how.

I'd like to set the objective at EVERY test identifies which pool it belongs to and then as needed we can split those pools to better indicate the prime spenders.

We can do that incrementally with the new community owned pools we set up, I have no idea what the right granularity would be though.

A project takes 30-45 seconds to create.

... that is a lot faster than I thought. If we can get this to work, that would be a neat trick! 🙃

Yes

ACK, I'm hoping for an official "stepping up to the plate" in the next couple of days ... will circle back. @sebastienvas may serve as a transitional owner (he previously worked on this).

Yes

ACK ... I can certainly help; I'm also hoping for more help though, perhaps from @dims who raised this :-)

@dims (Member) commented Oct 28, 2019

/assign

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 26, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 25, 2020
@cblecker (Member, Author)

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 25, 2020
@dims dims removed their assignment Apr 29, 2020
@spiffxp (Member) commented May 13, 2020

I have done some of this under #752

Template to create the projects
Understand what the "bare" state of these projects is

  • ensure-e2e-projects.sh

Who needs what permissions to these projects (should we set up an RBAC group for on-call? what service accounts do we need for prow/boskos/janitor?)

Permissions for the projects, to facilitate community member troubleshooting, are TBD under #844

What quotas do we need to request for the projects

I have thus far only worked out what is needed to match the 'scalability' pool (#851); the others, for ingress and GPU, are TBD

Probably other unknown things

How do we do billing per-job or per-sig?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 11, 2020
@BenTheElder (Member)

I think this is progressing somewhat?

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 13, 2020
@spiffxp (Member) commented Oct 6, 2020

/remove-lifecycle rotten

Yes, this is progressing:

Revisiting the description

  • Template to create the projects

Done. Unfortunately it's not possible to adjust project quota via API. Everything else is scripted today via https://github.com/kubernetes/k8s.io/blob/master/infra/gcp/ensure-e2e-projects.sh

  • Understand what the "bare" state of these projects is

Done. This was worked out during development of ensure-e2e-projects.sh. "Bare" (as in no quota adjustments) projects are in the gce-project pool.

  • Who needs what permissions to these projects (should we set up an RBAC group for on-call? what service accounts do we need for prow/boskos/janitor?)

Done for the scope of this issue.

  • What quotas do we need to request for the projects

Done for all pools except ingress projects and AWS accounts (which I'm excluding from this issue since they're not GCP projects).

  • Probably other unknown things

There is the question of billing per-job or billing per-sig, in a way that accounts for both cluster-usage and project-usage. I think we should call that out of scope for this issue.


I'm personally ready to close this out. What follow-up work do folks think we should have tracking issues for?

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 6, 2020
@spiffxp spiffxp added this to In Progress in sig-k8s-infra Oct 14, 2020
@spiffxp (Member) commented Oct 28, 2020

/close

@k8s-ci-robot (Contributor)

@spiffxp: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sig-k8s-infra automation moved this from In Progress to Done Oct 28, 2020