
RFC: Move boskos testing projects pool to kubernetes.io #390

Closed
cblecker opened this issue Oct 5, 2019 · 19 comments
@cblecker (Member) commented Oct 5, 2019

I'd like to start looking at moving the boskos pool over to publicly owned projects.

Things I see on the surface we'd need to do:

  • Template to create the projects
  • Understand what the "bare" state of these projects is
  • Who needs what permissions to these projects (should we set up an RBAC group for on-call? what service accounts do we need for prow/boskos/janitor?)
  • What quotas do we need to request for the projects
  • Probably other unknown things

cc: @kubernetes/test-infra-admins

@stevekuznetsov

/retitle RFC: Move boskos testing projects pool to kubernetes.io

@k8s-ci-robot changed the title from "RFC: Move bozkos testing projects pool to kubernetes.io" to "RFC: Move boskos testing projects pool to kubernetes.io" Oct 7, 2019
@BenTheElder (Member)

  • Template to create the projects

Roughly, it's just projects with the CI service account and some admins having access, plus quota depending on which pool they're going into.

  • Understand what the "bare" state of these projects is

Literally bare. No resources. Just a namespace w/ quota

  • Who needs what permissions to these projects (should we set up an RBAC group for on-call? what service accounts do we need for prow/boskos/janitor?)

Some humans should have backup access, but primarily the CI service account needs access.

That would be pr-kubekins@kubernetes-jenkins-pull.iam.gserviceaccount.com (this is visible from the CI logs)

In the future this should be some service account from a publicly owned prow.
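For reference, granting that kind of access is only a couple of gcloud commands per project. A minimal sketch, assuming a hypothetical community-owned project name and on-call group (only the CI service account above is real):

```bash
# Hypothetical project name; the real naming scheme is part of what this issue needs to decide.
PROJECT="k8s-infra-e2e-boskos-001"
CI_SA="pr-kubekins@kubernetes-jenkins-pull.iam.gserviceaccount.com"

# Let the CI service account create and delete e2e resources in the project.
gcloud projects add-iam-policy-binding "${PROJECT}" \
  --member="serviceAccount:${CI_SA}" \
  --role="roles/editor"

# Optional break-glass access for humans (group address is made up here).
gcloud projects add-iam-policy-binding "${PROJECT}" \
  --member="group:k8s-infra-oncall@kubernetes.io" \
  --role="roles/viewer"
```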

  • What quotas do we need to request for the projects

Each boskos pool is defined by the kind of quota present. I don't think the GCP non-GKE pool is particularly special (and the GKE pool should be managed by GKE...)

There are also pools for e.g. GPU testing, which need quota for that, and I think for scale testing (which of course needs more of basically all resources).

  • Probably other unknown things

We should consider the fact that the state of "is this project available" is in CRDs in the build cluster.

@BenTheElder (Member)

... continuing (accidentally hit enter)

As long as the state is in the build cluster, that means to switch prow over we'll either have serious disruption (need to spin down the pool) or need a whole new pool.

Humans generally have no need to access these projects, so in terms of getting the community access to the infra, the boskos projects are uninteresting: they're ~100% controlled by automation via public config already.

In terms of spending CNCF GCP credits, they're somewhat more interesting I suppose, if that's what we're going for.

If we're interested in migrating things just because we should migrate things, it would be much more useful to migrate boskos along with its management and state, and generally replace the legacy prow.k8s.io service accounts, etc. (can you tell by the jenkins in the name?) ...

@dims (Member) commented Oct 16, 2019

/assign @thockin

@thockin (Member) commented Oct 16, 2019

This seems like something we can and should enable ASAP. Christoph started with great questions. I'd like to add a couple:

How can we break down the billing or attribution for this? With a single big pool and a single CI service account, I have no idea who spent what money on what things. I think we need to do better than this.

  • Can we use a service account for each coarse "purpose"?
  • Can we use a distinct pool of projects for each coarse purpose?
  • Both?

Should we be EOL'ing projects after some number of uses (one? ten?) just for sanity?

Can quota requests be automated?

Who owns this, that we can have this conversation?

The net result of this is probably a script which ensures the requisite projects exist and have the correct IAM for the appropriate CI SA, plus a link to docs explaining what they are for. That alone seems straightforward, but without an owner to drive it, I don't think we can reasonably do much.
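For illustration only (this isn't the actual tooling, and every name, count, and billing account below is a placeholder), such a script could look roughly like:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholders: real values would come from whoever owns the migration.
BILLING_ACCOUNT="000000-000000-000000"
CI_SA="pr-kubekins@kubernetes-jenkins-pull.iam.gserviceaccount.com"

for i in $(seq -f "%03g" 1 40); do
  project="k8s-infra-e2e-boskos-${i}"

  # Create the project if it doesn't already exist, labeled by purpose
  # so billing can later be broken down per pool.
  if ! gcloud projects describe "${project}" >/dev/null 2>&1; then
    gcloud projects create "${project}" \
      --labels="purpose=e2e,pool=gce-project"
  fi

  # Attach the CNCF billing account.
  gcloud beta billing projects link "${project}" \
    --billing-account="${BILLING_ACCOUNT}"

  # Give the CI service account the access it needs.
  gcloud projects add-iam-policy-binding "${project}" \
    --member="serviceAccount:${CI_SA}" \
    --role="roles/editor"
done
```

Since GCP billing export includes project labels, labeling each pool's projects by purpose would give at least a coarse answer to the "who spent what" question.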

@BenTheElder (Member) commented Oct 17, 2019

Can we use a service account for each coarse "purpose"?

We can, but the CI users will need to correctly activate their unique service account.

These service accounts need to make their way into Prow, and not much prevents someone from using the wrong SA (the Prow cluster is so old, having been upgraded in place, that it doesn't have RBAC...).

Older-style "bootstrap.py" prowjobs (don't worry about the details; most of our CI jobs are these, though...) automagically activate a default service account before we get to testing.
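For illustration, "activating" a specific SA in a job is roughly the following; the key-file path is an assumption about where a Prow preset would mount it, not a documented location:

```bash
# Path is an assumption; Prow presets typically mount the key wherever
# GOOGLE_APPLICATION_CREDENTIALS points in the job's pod spec.
gcloud auth activate-service-account \
  --key-file="${GOOGLE_APPLICATION_CREDENTIALS:-/etc/service-account/service-account.json}"

# Subsequent gcloud calls in the job now run as that service account.
gcloud config list account
```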

Can we use a distinct pool of projects for each coarse purpose?

Yes, we have a few pools today. The GCP projects are monitored here, showing a few types (e.g. GPU):
http://velodrome.k8s.io/dashboard/db/boskos-dashboard?orgId=1

The full set of resources (including AWS) is here: https://github.com/kubernetes/test-infra/blob/d8449cb095fb6dc791958bbaf8940c7c1007410c/prow/cluster/boskos-resources.yaml
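For context, entries in that file follow the Boskos config shape: a resource type, a starting state, and a list of named resources. A trimmed sketch (the project names here are illustrative, not the real ones):

```yaml
resources:
  - type: gce-project          # generic GCP e2e pool
    state: dirty               # janitor cleans resources before they become free
    names:
      - k8s-boskos-gce-project-01
      - k8s-boskos-gce-project-02
  - type: gpu-project          # pool whose projects carry GPU quota
    state: dirty
    names:
      - k8s-boskos-gpu-project-01
```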

The biggest trick is just figuring out what a distinct use is and carving these up...
Unfortunately a ton of our CI tests are relatively ownerless so this may be tricky.

Most tests use the generic GCE pool but they don't have to.

Should we be EOL'ing projects after some number of uses (one? ten?) just for sanity?

Prow runs O(12000) builds/tests a day; if only 25% of those are GCP e2e, we'd churn through ~300 projects a day at 10 uses before retiring. I think this probably wouldn't scale.
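Spelling that out: 12,000 runs/day × 25% GCP e2e ≈ 3,000 project leases/day, and 3,000 leases/day ÷ 10 uses per project ≈ 300 projects to create and retire every day.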

Can quota requests be automated?

I took a quick look now and didn't see an API, but I'm not sure.

Who owns this, that we can have this conversation?

  • boskos the tool? => I might have an answer, but waiting for confirmation
  • or this migration? ... unsure
  • the prow.k8s.io deployment? => nominally the test-infra maintainers / Google EngProd team at the moment; the infra runs in the build/test workload cluster.

@thockin (Member) commented Oct 17, 2019 via email

@BenTheElder (Member)

As we consider moving prow into community space, we will HAVE to get a better story around this.

Agreed. I'm certainly not thrilled about the current state...

That said, I generally don't think we can consider the presubmit testing to be trustworthy, and scheduling with boskos is cooperative. Changing that would be a bit involved.

Cull the herd?

Yes and no.

A lot of valuable signal shouldn't be culled, IMO, but it still doesn't have a clear owner 😞 (e.g. who owns the periodic integration and unit testing...?)

We probably need to enforce ownership better somehow. I'm not sure how.

I'd like to set the objective at EVERY test identifies which pool it belongs to and then as needed we can split those pools to better indicate the prime spenders.

We can do that incrementally with the new community owned pools we set up, I have no idea what the right granularity would be though.

A project takes 30-45 seconds to create.

... that is a lot faster than I thought. If we can get this to work, that would be a neat trick! 🙃

Yes

ACK, I'm hoping for an official "stepping up to the plate" in the next couple of days ... will circle back. @sebastienvas may serve as a transitional owner (he previously worked on this).

Yes

ACK ... I can certainly help; I'm also hoping for more help though, perhaps from @dims who raised this :-)

@dims (Member) commented Oct 28, 2019

/assign

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 26, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 25, 2020
@cblecker (Member, Author)

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 25, 2020
@dims dims removed their assignment Apr 29, 2020
@spiffxp (Member) commented May 13, 2020

I have done some of this under #752

Template to create the projects
Understand what the "bare" state of these projects is

  • ensure-e2e-projects.sh

Who needs what permissions to these projects (should we set up an RBAC group for on-call? what service accounts do we need for prow/boskos/janitor?)

Permissions for the projects, to facilitate community member troubleshooting, are TBD under #844

What quotas do we need to request for the projects

I have thus far only worked out what is needed to match the 'scalability' pool (#851); the others, for ingress and GPU, are TBD

Probably other unknown things

How do we do billing per-job or per-sig?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 11, 2020
@BenTheElder (Member)

I think this is progressing somewhat?

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 13, 2020
@spiffxp (Member) commented Oct 6, 2020

/remove-lifecycle rotten

Yes, this is progressing:

Revisiting the description

  • Template to create the projects

Done. Unfortunately it's not possible to adjust project quota via API. Everything else is scripted today via https://github.com/kubernetes/k8s.io/blob/master/infra/gcp/ensure-e2e-projects.sh

  • Understand what the "bare" state of these projects is

Done. This was worked out during development of ensure-e2e-projects.sh. "Bare" (as in no quota adjustments) projects are in the gce-project pool.

  • Who needs what permissions to these projects (should we set up an RBAC group for on-call? what service accounts do we need for prow/boskos/janitor?)

Done for the scope of this issue.

  • What quotas do we need to request for the projects

Done for all pools except ingress projects and AWS accounts (which I'm excluding from this issue since they're not GCP projects).

  • Probably other unknown things

There is the question of billing per-job or billing per-sig, in a way that accounts for both cluster-usage and project-usage. I think we should call that out of scope for this issue.


I'm personally ready to close this out. What follow-up work do folks think we should have tracking issues for?

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 6, 2020
@spiffxp spiffxp added this to In Progress in sig-k8s-infra Oct 14, 2020
@spiffxp (Member) commented Oct 28, 2020

/close

@k8s-ci-robot (Contributor)

@spiffxp: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sig-k8s-infra automation moved this from In Progress to Done Oct 28, 2020