Implement Bazel Remote Caching [Tracking Issue] #6808

Closed
12 tasks done
BenTheElder opened this issue Feb 13, 2018 · 13 comments
Labels: area/bazel, area/jobs, kind/feature, kind/velocity-improvement

Comments

@BenTheElder
Member

BenTheElder commented Feb 13, 2018

Bazel 0.10.0 is now in use in test-infra and kubernetes (release-1.10/master). This release contains some nice improvements to the HTTP remote caching system, and we should leverage it instead of our existing approach of "use persistent storage for the local cache".

Why?

  • using a remote cache means the cache is global, so all jobs can share it
  • the remote cache is (mostly) content-addressed and designed for sharing; the local cache is not so much
  • we can't have a broken node with a bad cache to hunt down if we don't put the cache on the node
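
To illustrate why a content-addressed cache is safe to share between jobs, here is a minimal in-memory sketch. It is not Bazel's implementation: the class and method names are hypothetical, and a real remote cache serves these lookups over HTTP instead of from a dict. The split into an action cache and a content-addressable store (CAS) only mirrors the "action cache" and "CAS" terms used later in this thread.

```python
import hashlib


class ContentAddressedCache:
    """Toy model of a shared build cache (hypothetical, for illustration only)."""

    def __init__(self):
        self.cas = {}           # digest of a blob -> blob bytes
        self.action_cache = {}  # digest of an action description -> output digest

    @staticmethod
    def digest(data: bytes) -> str:
        # Keys are derived from content, never from node-local paths or state.
        return hashlib.sha256(data).hexdigest()

    def put_blob(self, data: bytes) -> str:
        key = self.digest(data)
        self.cas[key] = data
        return key

    def record_action(self, action: bytes, output: bytes) -> None:
        # Store the output in the CAS and point the action digest at it.
        self.action_cache[self.digest(action)] = self.put_blob(output)

    def lookup_action(self, action: bytes):
        out_digest = self.action_cache.get(self.digest(action))
        if out_digest is None:
            return None  # cache miss: the job has to run the action itself
        return self.cas.get(out_digest)


cache = ContentAddressedCache()
# Job A runs an action and uploads the result.
cache.record_action(b"compile foo.go with flags -O2", b"foo.o contents")
# Job B, on a different node, describes the same action and gets a hit.
hit = cache.lookup_action(b"compile foo.go with flags -O2")
# A different action description hashes to a different key and misses.
miss = cache.lookup_action(b"compile foo.go with flags -O0")
```

Because the key depends only on what the action is, any job that describes the same action gets the same entry, regardless of which node it runs on.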

Why not?

Action Items

/area bazel
/area jobs
/assign

@BenTheElder
Member Author

Some more notes:

  • The invalid cache sharing workaround seems to work well, with the canary jobs at least
  • Experimental Jobs using the cache can be much faster:
    • pull-kubernetes-bazel-test currently takes 25-30m; once the cache is hot, pull-kubernetes-bazel-test-canary typically takes ~5min. There is a lot of variation for both, though, mostly due to the load on the node the job runs on
    • pull-test-infra-bazel takes 8-10m currently, pull-test-infra-bazel-canary takes ~3 min (about two minutes of which is spent installing python deps and running pylint...)
    • pull-kubernetes-bazel-build-canary is not caching well currently; we probably need to mark things like hyperkube and the tarballs as no-cache

@BenTheElder
Member Author

FYI @perotinus we can probably look at using this with the cluster-registry soon; test-infra is using it now 😄

@BenTheElder
Member Author

BenTheElder commented Feb 21, 2018

Tested eviction a bit more with: https://github.com/BenTheElder/test-infra/blob/20d7d58ac34d59e241eddfb107e3b735398cd8d7/experiment/fill_cache.sh
I will PR some logging changes, but so far it is working as intended (WAI).
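
For readers unfamiliar with cache eviction, the behaviour being tested can be sketched roughly as below. The thread does not say which eviction policy the cache uses, so least-recently-used (LRU) eviction under a byte budget is an assumption here, and `LRUByteCache` is a hypothetical name, not code from test-infra.

```python
from collections import OrderedDict


class LRUByteCache:
    """Size-bounded cache that evicts least-recently-used entries (assumed policy)."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> bytes, least recently used first

    def get(self, key):
        data = self._entries.get(key)
        if data is not None:
            # A read counts as a use, so move the entry to the "newest" end.
            self._entries.move_to_end(key)
        return data

    def put(self, key, data: bytes):
        # Replacing an existing key first releases its old size.
        if key in self._entries:
            self.used_bytes -= len(self._entries.pop(key))
        self._entries[key] = data
        self.used_bytes += len(data)
        evicted = []
        # Drop the oldest entries until the cache fits its budget again.
        while self.used_bytes > self.max_bytes and self._entries:
            old_key, old_data = self._entries.popitem(last=False)
            self.used_bytes -= len(old_data)
            evicted.append(old_key)
        return evicted


cache = LRUByteCache(max_bytes=10)
cache.put("a", b"1234")
cache.put("b", b"1234")
cache.get("a")                      # "a" is now more recently used than "b"
evicted = cache.put("c", b"1234")   # 12 bytes > 10, so "b" is evicted
```

Filling the cache past its budget and checking which keys survive is, in spirit, what a fill script exercises against the real cache.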

Edit: see also, results of turning this on for test-infra:
image

@BenTheElder
Member Author

BenTheElder commented Feb 22, 2018

@BenTheElder
Member Author

A test run on a new test-infra PR appears to have had 3468 action cache hits, 7 action cache misses, and 1920 CAS hits (!)
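
To put those numbers in perspective, the action cache hit rate can be worked out as below. This is just arithmetic on the figures quoted above; `hit_rate` is a hypothetical helper. No CAS miss count is given, so a CAS hit rate cannot be derived the same way.

```python
def hit_rate(hits: int, misses: int) -> float:
    """Fraction of lookups that were served from the cache."""
    total = hits + misses
    if total == 0:
        raise ValueError("no lookups recorded")
    return hits / total


# Figures quoted in the comment above.
action_cache_rate = hit_rate(hits=3468, misses=7)
print(f"action cache hit rate: {action_cache_rate:.2%}")  # prints 99.80%
```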

@BenTheElder
Member Author

I've now turned this on for the kubernetes CI bazel-build and bazel-test jobs with great results*
screen shot 2018-02-24 at 11 30 28 am
screen shot 2018-02-24 at 11 42 01 am
* Note: the build job only runs in post-submit, and once every 6 hours, currently. Once this job is properly continuous the results for it will be more obvious.

We also have a monitoring dashboard now:
screen shot 2018-02-24 at 11 31 27 am

@BenTheElder
Member Author

https://k8s-testgrid.appspot.com/presubmits-kubernetes-blocking#pull-kubernetes-bazel-test&graph-metrics=test-duration-minutes

As expected, instead of ~25+ minutes we're seeing ~5-6 minutes for pull-kubernetes-bazel-test after switching this on today.

@buchgr

buchgr commented Mar 2, 2018

Absolutely amazing work @BenTheElder. Congratulations!

@BenTheElder
Member Author

BenTheElder commented Mar 2, 2018 via email

@BenTheElder
Member Author

kops is now using the remote cache: testing went from 8+ minutes to ~2 minutes

https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/pr-logs/pull/kops/4565/pull-kops-bazel-test/

@BenTheElder
Member Author

I think once #7205 is in we can close this. I've significantly upped the cache storage, and we're flipping it on for pretty much all other presubmits that should leverage caching.

@BenTheElder
Member Author

BenTheElder commented Mar 12, 2018

/close
We've rolled this out to many more jobs; pull-kubernetes-bazel-build in particular is now trending towards 5-8 minutes instead of 13+
image

@buchgr

buchgr commented Mar 13, 2018

woohooo!!! 🍾 🎆 🎉

cc @ulfjack


3 participants