Migrate prowjobs to new infrastructure #8689

Closed
chrischdi opened this issue May 17, 2023 · 45 comments
Assignees
Labels
area/ci, kind/feature, triage/accepted

Comments

@chrischdi
Member

What would you like to be added (User Story)?

This issue is for discussing and tracking efforts to migrate existing prowjobs over to the new infrastructure provided by test-infra.

Detailed Description

One point raised at the office hours on May 10th, 2023 was:

  • [Fabrizio] PSA: from the latest SIG Chair/TL meeting (more detailed communications in-flight)
    • It is possible to start using Prow to take advantage of AWS credits by simply changing a field in the prow config (examples); see the sketch below
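
For illustration, a minimal sketch of what such a change could look like in a job definition, assuming the eks-prow-build-cluster target discussed later in this thread; the job name, image, and command are placeholders, not real jobs:

    periodics:
    - name: periodic-cluster-api-example-e2e      # hypothetical job name, for illustration only
      interval: 24h
      cluster: eks-prow-build-cluster             # the field that selects the build cluster ("default" = the Google-owned cluster)
      spec:
        containers:
        - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest   # placeholder image tag
          command: ["runner.sh"]
          args: ["./scripts/ci-e2e.sh"]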

Migrating jobs over from the Google infrastructure might also make it easier for the test-infra folks to look into issues (e.g. to help with debugging), because everyone in test-infra has access to the non-default target Prow clusters.

Anything else you would like to add?

No response

Label(s) to be applied

/kind feature
/area ci

@k8s-ci-robot added the kind/feature, area/ci, and needs-triage labels on May 17, 2023
@killianmuldoon
Contributor

killianmuldoon commented May 17, 2023

/triage accepted

Definitely think we should see how this works. We currently have almost duplicate runs of e2e-full - one with IPv6 enabled and one without - introduced here: kubernetes/test-infra#29519. The IPv6 e2e seems to perform about as well as the normal e2e run, with a couple of additional flakes from the dualstack tests (which only exist in the IPv6 variant).

WDYT about moving one of those to the AWS cluster to see how it works? This would give us good coverage across a lot of tests on AWS, but wouldn't reduce any of our coverage from what we had ~last week.

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on May 17, 2023
@sbueringer
Member

sbueringer commented May 17, 2023

WDYT about moving all 1.2 tests over for now?

We can test if everything works and don't introduce a potential additional factor into the tests we care about.

If that looks good, we can continue to move more.

@killianmuldoon
Contributor

That makes sense +1 to moving over all the 1.2 tests.

@fabriziopandini
Member

+1 for starting from 1.2; this gives us time to experiment without affecting the release calendar.

@chrischdi
Member Author

chrischdi commented May 22, 2023

Looks like it's not just about moving the jobs. There are also additional properties being enforced around resources (CPU limits and memory limits + requests), e.g.:

    jobs_test.go:1133: periodic-cluster-api-e2e-mink8s-release-1-2: container '' must have resources.limits[cpu] specified
    jobs_test.go:1133: periodic-cluster-api-e2e-mink8s-release-1-2: container '' resources.limits[cpu] must be non-zero
    jobs_test.go:1133: periodic-cluster-api-e2e-mink8s-release-1-2: container '' must have resources.limits[memory] specified
    jobs_test.go:1133: periodic-cluster-api-e2e-mink8s-release-1-2: container '' must have resources.requests[memory] specified
    jobs_test.go:1133: periodic-cluster-api-e2e-mink8s-release-1-2: container '' resources.limits[memory] must be non-zero
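
For reference, a sketch of the kind of resources block the job's container would need to satisfy these checks; the numbers below are illustrative placeholders only, the actual values are discussed in the following comments:

    spec:
      containers:
      - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest   # placeholder image tag
        resources:
          requests:
            cpu: "4"        # illustrative value, not a measured requirement
            memory: 4Gi     # illustrative value, not a measured requirement
          limits:
            cpu: "4"        # limits kept equal to requests (see the CPU discussion below)
            memory: 4Gi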

@sbueringer
Member

sbueringer commented May 22, 2023

Hm for CPU we can use requests=limits. This should solve a few cases, maybe all (+/- looking at similar jobs).

I have no idea how much memory we need. Maybe we should ask upstream how we can find out the usage of our current jobs, or with which values we should start.

Too low memory values are definitely not fun as we have to deal with random errors because of OOM kills :/

@killianmuldoon
Contributor

AFAIK our current CPU request is cpu: 7300m, which we could use for both limits and requests.

For memory it looks like CAPZ is setting it at 9Gi right now. Not sure how comparable their jobs are, but maybe it's a good starting point?

@sbueringer
Member

CAPZ is not running the workload clusters in the ProwJob pod

@chrischdi
Member Author

Notes from talking to @ameukam:

  • For now we should quickly move to k8s-infra-prow-build.
    • This one is owned by the community and runs on GCP.
  • At a second stage we could then move to eks-prow-build-cluster.

There is no good method to determine how much we need, or a baseline to start from, so we kind of need to test and iterate.

@ameukam
Member

ameukam commented May 22, 2023

We should probably start with low values (2 GB?) for reqs and limits and iterate if the jobs get OOM-killed.

@sbueringer
Member

sbueringer commented May 22, 2023

This will require quite a lot of trial and error. We did this in the past on our own Prow instance, and it's sometimes hard to figure out that your job failed because of an OOM kill.

(also, memory usage of a job is usually not constant over longer periods of time)

@sbueringer
Member

sbueringer commented May 22, 2023

Do we have to set memory on k8s-infra-prow-build as well? Otherwise we could probably (?) get some data for memory from there.

@sbueringer
Member

Ah, I think we can run the jobs locally with pj-on-kind.sh and then check via metrics / k top

@chrischdi
Member Author

That's a good idea. But maybe better with Prometheus and metrics-server, to hopefully catch the peak value (instead of manually grabbing it from k top).

@ameukam
Member

ameukam commented May 22, 2023

Do we have to set memory on k8s-infra-prow-build as well? Otherwise we could probably (?) get some data for memory from there.

Yes. CPU/memory reqs and limits are required to run on community-owned (GKE/EKS) clusters.

@lentzi90
Contributor

lentzi90 commented Jun 15, 2023

@fabriziopandini this was the issue for migrating prowjobs mentioned in the office hours yesterday.
Edit: note also that the umbrella issue linked above links to many (all?) of the affected providers.

@fabriziopandini
Member

@lentzi90 thanks, reporting here some notes from the office hours discussion:

Call for action about moving CI jobs to EKS:

  • Still some limitations:
    • this doesn't work for jobs that create external resources or rely on GCP secrets
  • Resources (requests + limits) need to be set on jobs
  • Some PRs are already open (see umbrella issue)
  • My proposal:
    • Bank on the recent efforts for chasing flakes and keep things stable during the last part of the 1.5.0 release cycle
    • Staff the next CI team with this job

@rjsadow

rjsadow commented Aug 1, 2023

Hey all, with the release of 1.5.0 on the books, would it be a good time to start moving some of the CAPI jobs over to EKS? Should we start with the 1.3 jobs and see how it goes?

@sbueringer
Member

+1

@furkatgofurov7
Member

+1

/cc @nawazkh

@chrischdi
Member Author

/unassign

@chrischdi
Member Author

/assign rjsadow

Because you already opened and merged the first PR 🎉

Thanks for taking this over!

@chrischdi
Member Author

Note: a dashboard that helps fine-tune the memory/CPU requests: https://monitoring-eks.prow.k8s.io/d/96Q8oOOZk/builds?orgId=1&refresh=30s&var-org=kubernetes-sigs&var-repo=cluster-api&var-job=periodic-cluster-api-e2e-workload-upgrade-1-18-1-19-release-1-3&var-build=All

@sbueringer
Member

sbueringer commented Aug 10, 2023

Just an update: merged a bunch of PRs. Now all 1.3 jobs should run on the EKS cluster. Please verify, though.

@sbueringer
Member

sbueringer commented Aug 10, 2023

But it looks like the test-infra PRs did not all link this issue. Can we please correct that? It's very hard to track what we did otherwise. Maybe someone can post a quick summary here as well.

@nawazkh
Member

nawazkh commented Aug 11, 2023

Summary:

@furkatgofurov7
Member

For completeness on ^, you missed kubernetes/test-infra#30340

;)

@chrischdi
Member Author

chrischdi commented Aug 14, 2023

xref: #8426 (comment)

It looks like this is an even more substantial problem on the community EKS cluster. The release-1.3 e2e testing jobs are failing ~80 percent of the time due to this error:

Ref:

The second link isn't obvious from the triage page, but clicking through, it's the same error, just hit at an earlier point during clusterctl init.

@sbueringer
Member

As Killian wrote in kubernetes/test-infra#30365 (review), let's hold off on further migrations until release-1.3 is fixed and then stable for a bit.

@rjsadow

rjsadow commented Sep 11, 2023

It seems like the 1.3 jobs have stabilized relatively well since the public IP changes. How does everyone feel about pushing forward with the 1.4 migration in kubernetes/test-infra#30365?

@sbueringer
Member

It would be good to get these flakes fixed: #9379

@chrischdi
Member Author

Looks like the last occurrence of Job execution failed: Pod got deleted unexpectedly was yesterday, October 23rd:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-workload-upgrade-1-23-1-24-release-1-3/1716538879793696768

@sbueringer
Member

sbueringer commented Nov 27, 2023

@furkatgofurov7 Can we already open a PR to move the 1.5 jobs? Just so we can already start reviewing it to get it ready.

@furkatgofurov7
Member

@furkatgofurov7 Can we already open a PR to move the 1.5 jobs? Just so we can already start reviewing it to get it ready.

As discussed offline, CI team members can handle this (@nawazkh @adilGhaffarDev @kranurag7 Sunnat); otherwise I am happy to prepare it, please let me know.

@adilGhaffarDev
Contributor

As discussed offline, CI team members can handle this (@nawazkh @adilGhaffarDev @kranurag7 Sunnat); otherwise I am happy to prepare it, please let me know.

@kranurag7 is making the PR.

@sbueringer
Member

@ameukam kubernetes/test-infra#31386 is getting merged now.

Do you have an easy way to check whether we migrated all jobs (just to make sure we didn't miss any)?

@schrej
Member

schrej commented Dec 7, 2023

We could check https://prow.k8s.io/?repo=kubernetes-sigs%2Fcluster-api&cluster=default to see if any more jobs are running on the default cluster after that PR is merged.

@kranurag7
Contributor

We had a list being tracked in #9609 (comment).

I just validated quickly, and I think we migrated all the jobs on the list. I'll do a second round of validation using the link shared by Jakob above.

@sbueringer
Member

Thx! Yup, I was mostly asking Arnaud because I think he has a tool/script to generate these lists.

@rjsadow

rjsadow commented Dec 7, 2023

We have a tool in test-infra to track the cluster migrations in hack/cluster-migrations. It looks like that PR finishes out the CAPI jobs! Well done. There are still a few lingering provider jobs that I know some folks are aware of.

→ go run main.go --repo-report
...

Repository                                               Complete   Total(Eligible)      Remaining  (Percent)
---------------------------------------------------------------------------------------------------------------
...
cluster-api                                              88         88(88)               0          (100.00%)
cluster-api-addon-provider-helm                          6          6(6)                 0          (100.00%)
cluster-api-ipam-provider-in-cluster                     1          1(1)                 0          (100.00%)
cluster-api-operator                                     15         17(15)               0          (100.00%)
cluster-api-provider-aws                                 62         62(62)               0          (100.00%)
cluster-api-provider-azure                               7          51(10)               3          (70.00%)
cluster-api-provider-cloudstack                          3          4(4)                 1          (75.00%)
cluster-api-provider-digitalocean                        11         32(11)               0          (100.00%)
cluster-api-provider-gcp                                 42         42(42)               0          (100.00%)
cluster-api-provider-ibmcloud                            43         44(44)               1          (97.73%)
cluster-api-provider-nested                              9          9(9)                 0          (100.00%)
cluster-api-provider-openstack                           3          10(10)               7          (30.00%)
cluster-api-provider-vsphere                             32         61(32)               0          (100.00%)

Specific Jobs

→ go run main.go --repo cluster-api
pull-cluster-api-build-main                                            is done
pull-cluster-api-apidiff-main                                          is done
pull-cluster-api-verify-main                                           is done
pull-cluster-api-test-main                                             is done
pull-cluster-api-test-mink8s-main                                      is done
pull-cluster-api-e2e-mink8s-main                                       is done
pull-cluster-api-e2e-main                                              is done
pull-cluster-api-e2e-full-dualstack-and-ipv6-main                      is done
pull-cluster-api-e2e-full-main                                         is done
pull-cluster-api-e2e-workload-upgrade-1-28-latest-main                 is done
pull-cluster-api-e2e-scale-main-experimental                           is done
pull-cluster-api-build-release-1-4                                     is done
pull-cluster-api-apidiff-release-1-4                                   is done
pull-cluster-api-verify-release-1-4                                    is done
pull-cluster-api-test-release-1-4                                      is done
pull-cluster-api-test-mink8s-release-1-4                               is done
pull-cluster-api-e2e-release-1-4                                       is done
pull-cluster-api-e2e-informing-release-1-4                             is done
pull-cluster-api-e2e-informing-ipv6-release-1-4                        is done
pull-cluster-api-e2e-full-release-1-4                                  is done
pull-cluster-api-e2e-workload-upgrade-1-26-1-27-release-1-4            is done
pull-cluster-api-build-release-1-5                                     is done
pull-cluster-api-apidiff-release-1-5                                   is done
pull-cluster-api-verify-release-1-5                                    is done
pull-cluster-api-test-release-1-5                                      is done
pull-cluster-api-test-mink8s-release-1-5                               is done
pull-cluster-api-e2e-mink8s-release-1-5                                is done
pull-cluster-api-e2e-release-1-5                                       is done
pull-cluster-api-e2e-informing-release-1-5                             is done
pull-cluster-api-e2e-full-dualstack-and-ipv6-release-1-5               is done
pull-cluster-api-e2e-full-release-1-5                                  is done
pull-cluster-api-e2e-workload-upgrade-1-27-1-28-release-1-5            is done
pull-cluster-api-e2e-scale-release-1-5-experimental                    is done
pull-cluster-api-build-release-1-6                                     is done
pull-cluster-api-apidiff-release-1-6                                   is done
pull-cluster-api-verify-release-1-6                                    is done
pull-cluster-api-test-release-1-6                                      is done
pull-cluster-api-test-mink8s-release-1-6                               is done
pull-cluster-api-e2e-mink8s-release-1-6                                is done
pull-cluster-api-e2e-release-1-6                                       is done
pull-cluster-api-e2e-full-dualstack-and-ipv6-release-1-6               is done
pull-cluster-api-e2e-full-release-1-6                                  is done
pull-cluster-api-e2e-workload-upgrade-1-28-latest-release-1-6          is done
pull-cluster-api-e2e-scale-release-1-6-experimental                    is done
periodic-cluster-api-e2e-workload-upgrade-1-23-1-24-main               is done
periodic-cluster-api-e2e-workload-upgrade-1-24-1-25-main               is done
periodic-cluster-api-e2e-workload-upgrade-1-25-1-26-main               is done
periodic-cluster-api-e2e-workload-upgrade-1-26-1-27-main               is done
periodic-cluster-api-e2e-workload-upgrade-1-27-1-28-main               is done
periodic-cluster-api-e2e-workload-upgrade-1-28-latest-main             is done
periodic-cluster-api-test-main                                         is done
periodic-cluster-api-test-mink8s-main                                  is done
periodic-cluster-api-e2e-main                                          is done
periodic-cluster-api-e2e-dualstack-and-ipv6-main                       is done
periodic-cluster-api-e2e-mink8s-main                                   is done
periodic-cluster-api-e2e-workload-upgrade-1-21-1-22-release-1-4        is done
periodic-cluster-api-e2e-workload-upgrade-1-22-1-23-release-1-4        is done
periodic-cluster-api-e2e-workload-upgrade-1-23-1-24-release-1-4        is done
periodic-cluster-api-e2e-workload-upgrade-1-24-1-25-release-1-4        is done
periodic-cluster-api-e2e-workload-upgrade-1-25-1-26-release-1-4        is done
periodic-cluster-api-e2e-workload-upgrade-1-26-1-27-release-1-4        is done
periodic-cluster-api-test-release-1-4                                  is done
periodic-cluster-api-test-mink8s-release-1-4                           is done
periodic-cluster-api-e2e-release-1-4                                   is done
periodic-cluster-api-e2e-mink8s-release-1-4                            is done
periodic-cluster-api-e2e-workload-upgrade-1-22-1-23-release-1-5        is done
periodic-cluster-api-e2e-workload-upgrade-1-23-1-24-release-1-5        is done
periodic-cluster-api-e2e-workload-upgrade-1-24-1-25-release-1-5        is done
periodic-cluster-api-e2e-workload-upgrade-1-25-1-26-release-1-5        is done
periodic-cluster-api-e2e-workload-upgrade-1-26-1-27-release-1-5        is done
periodic-cluster-api-e2e-workload-upgrade-1-27-1-28-release-1-5        is done
periodic-cluster-api-test-release-1-5                                  is done
periodic-cluster-api-test-mink8s-release-1-5                           is done
periodic-cluster-api-e2e-release-1-5                                   is done
periodic-cluster-api-e2e-dualstack-and-ipv6-release-1-5                is done
periodic-cluster-api-e2e-mink8s-release-1-5                            is done
periodic-cluster-api-e2e-workload-upgrade-1-23-1-24-release-1-6        is done
periodic-cluster-api-e2e-workload-upgrade-1-24-1-25-release-1-6        is done
periodic-cluster-api-e2e-workload-upgrade-1-25-1-26-release-1-6        is done
periodic-cluster-api-e2e-workload-upgrade-1-26-1-27-release-1-6        is done
periodic-cluster-api-e2e-workload-upgrade-1-27-1-28-release-1-6        is done
periodic-cluster-api-e2e-workload-upgrade-1-28-latest-release-1-6      is done
periodic-cluster-api-test-release-1-6                                  is done
periodic-cluster-api-test-mink8s-release-1-6                           is done
periodic-cluster-api-e2e-release-1-6                                   is done
periodic-cluster-api-e2e-dualstack-and-ipv6-release-1-6                is done
periodic-cluster-api-e2e-mink8s-release-1-6                            is done
post-cluster-api-push-images                                           is done

@sbueringer
Member

Perfect. Then I would close this issue for core CAPI.

/close

@k8s-ci-robot
Contributor

@sbueringer: Closing this issue.

In response to this:

Perfect. Then I would close this issue for core CAPI.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
