
Optimize backups #677

Merged
merged 8 commits into from
Apr 2, 2020
Conversation

listx
Contributor

@listx listx commented Mar 20, 2020

This implements the optimizations described in kubernetes/release#270 (comment) and #666.

/hold
/wip

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. wg/k8s-infra labels Mar 20, 2020
@listx
Contributor Author

listx commented Mar 20, 2020

Looks like the secret was deleted (error from the pull-k8sio-backup test):

The pod could not start because it could not mount the volume "creds": secret "k8s-gcr-backup-test-prod-bak-service-account" not found

I'll take this setback as an opportunity to move to Workload Identity.

@listx
Contributor Author

listx commented Mar 21, 2020

I am going with kubernetes/test-infra#16883 first to see how this Workload Identity stuff plays out in a simpler context. The context there is simpler because we already actuated WI KSA empowerment stuff in #655. For us to use WI here, I would first have to add another special case for the promoter GCP SA (specifically, k8s-infra-gcr-promoter@k8s-gcr-backup-test-prod-bak.iam.gserviceaccount.com as per kubernetes/test-infra#15398 (comment)).

Once kubernetes/test-infra#16883 is merged, the other changes described above will follow. After all that, then the steps to re-enable backups are as follows:

(1) create another workload identity empowerment PR like #655 for k8s-infra-gcr-promoter@k8s-gcr-backup-test-prod-bak.iam.gserviceaccount.com
(2) update the pull-k8sio-backup job in test-infra repo to use Workload Identity
(3) re-test the pull-k8sio-backup check to make sure we are not breaking any backup behaviors
(4) merge this PR
(5) re-enable the periodic backup job (ci-k8sio-backup; i.e., revert kubernetes/test-infra#16835) --- to run 1x per day to ensure we stay beneath quota limits (the initial runs will take hours and fail with a job timeout, most likely, because we would need to copy all 30K images for the very first time to each region). I can also run the backups on a separate machine to speed up the process.
(6) observe how long backups take once the initial set of images have been copied over; that is, how long would a NOP backup job (due to no delta) take? Experimentally we have seen numbers around the ~5min range
(7) if (6) shows ~5min, then we can increase the backup frequency back to hourly.
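For reference, the per-region copy that steps (5) and (6) describe might be sketched as a dry run like the one below. This is a hypothetical illustration, not the PR's actual script; the project names are the ones discussed in this thread, and the real job would execute the `gcrane` command instead of printing it.

```shell
#!/bin/sh
# Hypothetical dry run of a per-region backup pass (cf. steps 5 and 6).
# The real job would execute the gcrane command instead of echoing it.
print_backup_cmds() {
  for region in asia eu us; do
    echo "gcrane cp -r ${region}.gcr.io/k8s-artifacts-prod ${region}.gcr.io/k8s-artifacts-prod-bak"
  done
}

print_backup_cmds
```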

@spiffxp
Member

spiffxp commented Mar 31, 2020

/approve
/lgtm
I leave the /hold cancel to you @listx

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: listx, spiffxp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 31, 2020
@bartsmykla
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 1, 2020
@listx
Contributor Author

listx commented Apr 1, 2020

@spiffxp I need to wait for the workload identity stuff to get submitted first, namely #710 and kubernetes/test-infra#17048. At that point, I can use the pull-k8sio-backup presubmit check here to verify that WI is correctly working (and that gcrane can handle it --- which it should).

This PR is blocked by pull-k8sio-backup passing. (step 3 in my previous outline)

@listx
Contributor Author

listx commented Apr 2, 2020

/retest

listx and others added 2 commits April 1, 2020 18:52
This is because we will be using workload identity for these jobs, where
the jobs will start out already authenticated as the GCP service
accounts.
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 2, 2020
@listx
Contributor Author

listx commented Apr 2, 2020

/test pull-k8sio-backup

@listx
Contributor Author

listx commented Apr 2, 2020

@jonjohnsonjr I've moved pull-k8sio-backup to use Workload Identity instead of GOOGLE_APPLICATION_CREDENTIALS=..., and it appears that even though it authenticates as the same service account that it used to reach via ADC, this is not the same as ADC, and so it's failing.

For example I can see in the pull-k8sio-backup test's logs:

+ gcloud auth list
                               Credentialed Accounts
ACTIVE  ACCOUNT
*       k8s-infra-gcr-promoter@k8s-gcr-backup-test-prod-bak.iam.gserviceaccount.com

but it is failing with:

+ /home/prow/go/src/github.com/google/go-containerregistry/cmd/gcrane/gcrane ls -r us.gcr.io/k8s-gcr-backup-test-prod
+ xargs -n1 /home/prow/go/src/github.com/google/go-containerregistry/cmd/gcrane/gcrane delete
2020/04/02 07:13:14 No matching credentials were found, falling back on anonymous
2020/04/02 07:13:15 deleting us.gcr.io/k8s-gcr-backup-test-prod/artifact-promoter/cip@sha256:f61e2ff9d38993b0e8acd93a8d3d9fbafedf67d97845c244bf4d329aadf0e3c4: GET https://us.gcr.io/v2/token?scope=repository%3Ak8s-gcr-backup-test-prod%2Fartifact-promoter%2Fcip%3Apush%2Cpull&service=us.gcr.io: UNAUTHORIZED: You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication

I guess gcrane always requires ADC? Do you know if it is possible to make Workload Identity work with ADC?

@listx
Contributor Author

listx commented Apr 2, 2020

For more context the gcrane version was built from 3d03ed9b1ca2ad5d78d43832e8e46adc31d2b961 (master HEAD).

@spiffxp
Member

spiffxp commented Apr 2, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 2, 2020
@listx
Contributor Author

listx commented Apr 2, 2020

/wip

Still working on resolving the auth issues post-WI...!

Linus Arver added 4 commits April 2, 2020 12:34
If gcrane fails to delete the images, then this loop might execute
forever until timeout. Instead, fail after 5 attempts.
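The bounded retry described in that commit message might look roughly like the sketch below. This is not the PR's actual script; `run_deletes` is a stand-in for the real `gcrane ls ... | xargs gcrane delete` pipeline, and here it always fails so the bound is exercised.

```shell
#!/bin/sh
# Sketch: fail after 5 attempts instead of looping until the job times out.
# run_deletes is a placeholder for the real gcrane delete pipeline; it
# always fails here, for demonstration.
run_deletes() {
  false
}

bounded_delete() {
  attempt=1
  max_attempts=5
  until run_deletes; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "delete loop failed after ${max_attempts} attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
  done
}

bounded_delete || echo "job would abort here"
```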
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 2, 2020
@listx
Contributor Author

listx commented Apr 2, 2020

Woohoo, it passed!

FTR the auth was not working for gcrane because we didn't do gcloud auth configure-docker. This is now necessary because we no longer pass in ADC creds as an env var.
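For anyone hitting the same thing: `gcloud auth configure-docker` registers `gcloud` as a Docker credential helper for the GCR hosts, roughly equivalent to having `credHelpers` entries like the ones below in `~/.docker/config.json`. This sketch writes an example to a scratch file so nothing real is touched; the exact host list gcloud writes may differ. Without such entries, gcrane's keychain finds no credentials and falls back to anonymous access, which is the failure seen above.

```shell
#!/bin/sh
# Sketch of the kind of credHelpers config that `gcloud auth
# configure-docker` sets up, written to a scratch file for illustration.
cat > /tmp/docker-config-example.json <<'EOF'
{
  "credHelpers": {
    "gcr.io": "gcloud",
    "us.gcr.io": "gcloud",
    "eu.gcr.io": "gcloud",
    "asia.gcr.io": "gcloud"
  }
}
EOF
```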

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 2, 2020
@spiffxp
Member

spiffxp commented Apr 2, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 2, 2020
@k8s-ci-robot k8s-ci-robot merged commit 7b50ca9 into kubernetes:master Apr 2, 2020
@listx listx deleted the optimize-backups branch April 2, 2020 22:27
listx pushed a commit to listx/k8s.io that referenced this pull request Apr 4, 2020
This gives `k8s-prow.svc.id.goog[test-pods/k8s-infra-gcr-promoter-bak]`
access to authenticate as `$(svc_acct_email "${PRODBAK_PROJECT}"
"${PROMOTER_SVCACCT}")`, which currently resolves to
`k8s-infra-gcr-promoter@k8s-artifacts-prod-bak.iam.gserviceaccount.com`.

This is a preparatory step before we can re-introduce the backup job
that was optimized in kubernetes#677.
listx pushed a commit to listx/k8s.io that referenced this pull request Apr 4, 2020
This empowers the `k8s-infra-gcr-promoter-bak` KSA in the `test-pods`
K8s namespace in the `k8s-prow` GCP Project (where the Prow trusted
cluster lives) to authenticate as
`k8s-infra-gcr-promoter@k8s-artifacts-prod-bak.iam.gserviceaccount.com`.

The `k8s-infra-gcr-promoter-bak` KSA does not exist yet and will be
created when we re-introduce the backup job (pulled on 2020-03-18 due to
quota issues). The backup job itself was optimized in
kubernetes#677.
listx pushed a commit to listx/test-infra that referenced this pull request Apr 8, 2020
This reverts commit a6a2655.

The backup job has received some optimizations in
kubernetes/k8s.io#677. In addition, the
k8s-artifacts-prod-bak GCR has been manually pre-populated with all ~30K
images in k8s-artifacts-prod for the ASIA, EU, and US regions, which
will result in jobs taking just minutes to run (as subsequent runs are
mostly NOP runs).

For more discussion on the backup job, please see
https://docs.google.com/document/d/11eiosJvm2xEVUhPRU3-luANxxTPL5FqQdJXVrLPImyQ/edit?usp=sharing.
@listx
Contributor Author

listx commented Apr 9, 2020

Just following up on the work items:

(5) re-enable the periodic backup job (ci-k8sio-backup; i.e., revert kubernetes/test-infra#16835) --- to run 1x per day to ensure we stay beneath quota limits (the initial runs will take hours and fail with a job timeout, most likely, because we would need to copy all 30K images for the very first time to each region). I can also run the backups on a separate machine to speed up the process.

This was done here: kubernetes/test-infra#17150

(6) observe how long backups take once the initial set of images have been copied over; that is, how long would a NOP backup job (due to no delta) take? Experimentally we have seen numbers around the ~5min range

The first successful run, which backed up a handful of images, is here: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-k8sio-backup/1248214661472456704

This took 19 minutes, which isn't too bad considering it covers 3 regions serially, so ~6-7 minutes per region. We could speed this up by making the 3 copies parallel (using something like GNU parallel), but I don't think it's worth it (we would most likely have to install it at the beginning of the job, because I doubt it's in the image we use).
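As an aside, plain shell job control could also parallelize the three regional copies without installing anything extra in the job image. A hedged sketch, with `do_backup` as a stand-in for the real per-region gcrane copy (here it just prints, for demonstration):

```shell
#!/bin/sh
# Sketch: run the three regional backups concurrently with & / wait,
# instead of installing GNU parallel. do_backup is a placeholder.
do_backup() {
  echo "backed up $1"
}

backup_all() {
  for region in asia eu us; do
    do_backup "$region" &
  done
  wait
}

backup_all
```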

(7) if (6) shows ~5min, then we can increase the backup frequency back to hourly.

According to this comment thread, a NOP copy would only use ~3K requests to list all images for 1 region. If we ran the job hourly, that would be a minimum of 3K * 3 regions, or ~9K requests per hour, which is 24 (hours) * 9K, or ~216K requests to GCR per day. Both figures are well under the quota limits of ~50K requests per 10 minutes and 1000K requests per day. However, the request count still feels too heavy to me to be doing this hourly.
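Spelling out that arithmetic (the figures are the rough estimates from this thread, not measured values):

```shell
#!/bin/sh
# Rough quota math for hourly NOP backups, using the ~3K-listings-per-region
# estimate from the comment above.
requests_per_region=3000
regions=3
per_hour=$((requests_per_region * regions))  # ~9K requests/hour
per_day=$((per_hour * 24))                   # ~216K requests/day
echo "hourly NOP backups: ${per_hour} req/hour, ${per_day} req/day"
```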

I think it's OK to leave it at 12h intervals for now.
