Speedup CI setup to <20s #30706

Open
adeebshihadeh opened this issue Dec 13, 2023 · 16 comments
Labels
bounty CI / testing good first issue Feasible for new contributers

Comments

@adeebshihadeh
Contributor

adeebshihadeh commented Dec 13, 2023

The best-case time of the setup-with-retry stage that runs in most of our CI jobs is ~1m4s. All it does is set up the openpilot environment, and most of that time is spent pulling an already-built docker image. This puts a hard limit on how fast our jobs can finish; a job that finishes in 1m is 10x better than one that finishes in 2-3m.

Some possible strategies:

  • make the docker image smaller
  • move the docker cache somewhere faster
  • ditch docker, install directly on the 20.04 runner, and cache the apt and pip packages

Requirements for the bounty:

  • all setup-with-retry runs on the final PR commit must finish in <20s

Sub-bounty of $100 for <40s if you can't get to <20s. $500 is for <20s. Bounties don't stack.
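
One way to read the payout rules above, as a minimal sketch. The function name is hypothetical, and it assumes the <40s sub-bounty is judged the same way as the <20s one, i.e. the slowest setup-with-retry run on the final commit decides the tier.

```python
def bounty_usd(setup_durations_s):
    """Return the payout in USD given every setup-with-retry duration (seconds)
    measured on the final PR commit. Assumption: the slowest run decides."""
    if not setup_durations_s:
        raise ValueError("need at least one measured setup-with-retry run")
    slowest = max(setup_durations_s)
    if slowest < 20:
        return 500  # full bounty
    if slowest < 40:
        return 100  # sub-bounty only; the two do not stack
    return 0
```

A single slow job is enough to drop a PR to the lower tier, which is why every job that uses the stage has to be checked.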

https://github.com/commaai/openpilot/blob/master/.github/workflows/setup-with-retry/action.yaml

adeebshihadeh pinned this issue Dec 13, 2023
adeebshihadeh changed the title from "[$100 bounty] Speedup CI setup to <40s" to "[$500 bounty] Speedup CI setup to <20s" Dec 18, 2023
@nelsonjchen
Contributor

wild idea, compatibility unknown, gain (or loss) unknown:

use containerd or a similar drop-in runtime instead of the stock GHA moby, with lazy loading of compatible eStargz docker image "pages".

https://github.com/crazy-max/ghaction-setup-containerd

@jimbrend

jimbrend commented Dec 20, 2023

@lukechilds what do you think?
just out of curiosity on first glance, either way

@jimbrend

  • ditch docker, install directly on the 20.04 runner, and cache the apt and pip packages

what are the cons here?

@mbiernat42

That sleep 30 seems questionably long

@profknow

If it's just loading previously configured env, then why not just operate on Flash Drive, leaving the state where it was when last active.

Even on my own computer, I often keep copies of working system/environments that I simply dump into active memory without "booting up". It saves SO MUCH time when you already know the final state anyway.

--Loren Grayson

@sanjams2

sanjams2 commented Jun 4, 2024

Throwing some thoughts down:

So taking an example from a recent run, it looks like the setup-with-retry step takes 75 seconds. The majority of this time is spent downloading and extracting the cache layers of the openpilot_base image from ghcr.io/commaai/openpilot-base. This step alone seems to take about 60 seconds, and you can validate this yourself by running docker pull ghcr.io/commaai/openpilot-base on a machine and timing it.
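
A minimal sketch of timing that pull from Python rather than by hand. The helper name `time_command` is made up for illustration; it only wraps `subprocess.run` with a monotonic clock.

```python
import subprocess
import time


def time_command(cmd):
    """Run cmd (a list of arguments) and return (elapsed_seconds, exit_code)."""
    start = time.monotonic()
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.monotonic() - start, result.returncode


# Example (needs docker and network access):
#   elapsed, code = time_command(["docker", "pull", "ghcr.io/commaai/openpilot-base"])
#   print(f"pull took {elapsed:.1f}s (exit code {code})")
```

The number is only comparable to a clean CI worker if the image is not already present locally, so remove it first (for example with docker rmi) before timing.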

What you can see when you do that is that there is one long-pole layer, which is the installation of the Python dependencies. In the Dockerfile that's done here. There is also another large layer that installs the Ubuntu dependencies, but this is not the bottleneck (at the moment). Using dive, we can also see the sizes of the different layers, which confirms that this Python dependency layer is the biggest one.

[Image: dive output showing the layer sizes]
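
For anyone without dive installed, a rough equivalent is to ask the docker CLI for per-layer sizes with docker history. This is a sketch, not the method used for the screenshot above: the helper names are hypothetical, and docker history reports uncompressed on-disk sizes, so the numbers will not match the compressed bytes transferred during a pull.

```python
import subprocess


def parse_history(output):
    """Parse lines of '<size in bytes><TAB><created by>' into (size, command)
    tuples, largest layer first."""
    layers = []
    for line in output.splitlines():
        size, _, created_by = line.partition("\t")
        if size.strip().isdigit():
            layers.append((int(size), created_by.strip()))
    return sorted(layers, key=lambda layer: layer[0], reverse=True)


def layer_sizes(image):
    """Query the local docker CLI for the layers of an already pulled image."""
    out = subprocess.run(
        ["docker", "history", "--no-trunc", "--human=false",
         "--format", "{{.Size}}\t{{.CreatedBy}}", image],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_history(out)


# Example (needs docker and a pulled image):
#   for size, cmd in layer_sizes("ghcr.io/commaai/openpilot-base")[:5]:
#       print(f"{size / 1e6:8.0f} MB  {cmd[:80]}")
```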

So what can we do about this?

There are likely more ways, but I can think of two ways to go about addressing this:

  1. The "simple" way would be to divide this large layer up into more layers. Docker by default will only download 3 layers concurrently (docs). While one could increase the concurrency, doing that alone wouldn't help since the bits of the layer itself are still downloaded serially. To get higher concurrency within the layer, we would also want to divide this layer up into smaller layers. To do this, one could in theory install the different Python dependencies in different layers. It would probably make sense to pull the largest Python dependencies out into their own layers. Using dive again, we can see the following Python dependencies sorted by size:
    [Image: dive output listing the Python dependencies sorted by size]
    The drawback of this approach is maintainability. Dynamically defining layers in the Dockerfile is not trivial (nor is it probably a good practice). For this to work, you would likely have to statically define some dependencies (and their versions) in the Dockerfile. However, this duplicates the dependency definitions in the pyproject.toml file. Perhaps this is an OK tradeoff for speed if you limit the statically defined dependencies in the Dockerfile to a small subset. It's also not clear how far this could really take you; once you optimize downloads, you still have to deal with extraction. Docker layers are compressed with gzip, which has to be decompressed serially by nature, so in the end the absolute best you can do may still not be enough. Going back to our numbers, roughly 15 of the 75 seconds are spent outside the pull, so we only have 5 seconds to play with to get to < 20 seconds, and this method would almost certainly come up short.
  2. A more complex way that I haven't dug into is to use the GitHub Actions cache to cache the Python dependencies. This seems like a way to potentially get a faster download of the Python dependencies, assuming the download from the GitHub Actions cache to the GitHub worker is faster than the download from the GitHub container registry (I suspect this is actually the case). You would then need to ensure that the cache is properly wired up in all the docker run commands throughout the build process. A similar mechanism is already being used for the scons cache (example). The GitHub Actions cache has its own complexities, with limits per project, limitations on sharing caches between branches, and more.
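
The layer-splitting argument in option 1 can be sketched with a toy model. Everything here is an assumption for illustration: the function name, the layer sizes, and the fixed 50 MB/s per-layer rate are made up, total bandwidth is treated as unlimited, and extraction time is ignored entirely.

```python
import heapq


def pull_time(layer_sizes_mb, max_concurrent=3, mb_per_s_per_layer=50.0):
    """Estimate download time when at most `max_concurrent` layers download at
    once and each layer streams serially at a fixed per-connection rate."""
    # Min-heap of the times at which each download slot becomes free.
    slots = [0.0] * max_concurrent
    heapq.heapify(slots)
    finish = 0.0
    for size in layer_sizes_mb:  # layers are started in the order given
        start = heapq.heappop(slots)
        end = start + size / mb_per_s_per_layer
        heapq.heappush(slots, end)
        finish = max(finish, end)
    return finish


# Hypothetical image: one 1500 MB layer dominates.
original = [50, 400, 1500, 100]
# Same bytes, with the big layer split into five 300 MB layers.
split = [50, 400, 300, 300, 300, 300, 300, 100]

print(pull_time(original))                   # bounded by the biggest layer
print(pull_time(original, max_concurrent=8)) # more concurrency alone: no change
print(pull_time(split))                      # splitting helps
print(pull_time(split, max_concurrent=8))    # splitting plus concurrency helps most
```

Under these assumptions the unsplit image takes the same time at any concurrency, because the largest layer sets the floor. Real pulls share bandwidth between connections and then pay for extraction, so the gains would be smaller than the model suggests.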

Note that with either method, you may have to repeat the exercise for the Ubuntu dependencies as well, given that the size of that layer is on the same order of magnitude as the Python dependency layer.
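
For the caching route (option 2), the cache has to be invalidated whenever the dependency definitions change. A minimal sketch of deriving such a key from file contents follows; it is similar in spirit to the hashFiles expression in GitHub Actions but is not the same algorithm, and the function name and key prefix are hypothetical.

```python
import hashlib
from pathlib import Path


def cache_key(prefix, paths):
    """Build a cache key that changes whenever any of the given files changes."""
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        # Hash the file name and contents; the absolute directory is left out
        # so the key is the same on every worker.
        digest.update(path.name.encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return f"{prefix}-{digest.hexdigest()[:16]}"


# Example (run from the repository root):
#   print(cache_key("python-deps", ["pyproject.toml"]))
```

The same helper could key a second cache for the Ubuntu packages by passing whichever files define them.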

Both of these methods continue to rely on docker, with option 1 in some ways doubling down on it. I personally do not think docker overhead is really the issue at hand here, and I believe there are likely benefits to continuing to use containers for portability. To me this ultimately seems like a problem of having a large amount of dependency bits and finding the fastest way to move them onto a clean GitHub worker. Docker makes some of this more challenging (the layer concurrency piece) but doesn't completely block a speedy build. In some ways, it might make things easier. One final way to go about this would be to trim the dependency fat and hope that slims the layers enough to download in a reasonable amount of time. There is no telling if that would be enough, and furthermore, once you do trim, it is a cat-and-mouse game since dependencies will likely be added in the future.

@knownotunknown

Is this bounty still open @adeebshihadeh? I see setup-with-retry is running in < 20 seconds in a lot of the CI runs (ex: https://github.com/commaai/openpilot/actions/runs/9432165207/job/25981593463).

Also, looking at the latest code on master it seems like we've ultimately decided to use a self-hosted runner (which previously wasn't considered a viable solution for this bounty)?

@adeebshihadeh
Contributor Author

Still open.

We're now using namespace runners for internal branches, but I'd love to move back to the GitHub-hosted runners at some point.

adeebshihadeh changed the title from "[$500 bounty] Speedup CI setup to <20s" to "Speedup CI setup to <20s" Jul 7, 2024
@ADITYA1720

Is the issue open?

@jimbrend

jimbrend commented Aug 30, 2024

Is the issue open?

it looks open to me @ADITYA1720

@naaa760

naaa760 commented Sep 23, 2024

Is this issue open, please??

@BBBmau
Contributor

BBBmau commented Sep 23, 2024

@ADITYA1720 @jimbrend @naaa760 for those asking: if an issue is marked as Open, then it's open for anyone to start work on, and it will be marked as Locked once a PR has been submitted that shows a considerable amount of progress. Issues aren't assigned to those who request them; a bounty is usually given to the one who has opened the PR.

It seems like the contributing guidelines don't include this, and it also looks like the BOUNTIES.md file, which had that info, was removed.

@jimbrend

jimbrend commented Oct 2, 2024

thank you @BBBmau

@andrewchambers

andrewchambers commented Oct 21, 2024

@adeebshihadeh Just to clarify, I can disregard the namespace runners if I get it working in github actions fully in under 20s?

@andrewchambers

andrewchambers commented Oct 21, 2024

I have put up a WIP PR with my work at #33831 if it is possible to lock this bounty. If not I will continue to work on it regardless.

edit: I closed the PR so I don't trigger your github actions while testing on my fork.

@adeebshihadeh
Contributor Author

@adeebshihadeh Just to clarify, I can disregard the namespace runners if I get it working in github actions fully in under 20s?

correct

Projects
Status: Open