Speedup CI setup to <20s #30706

adeebshihadeh · 2023-12-13T01:26:14Z

The best-case time of the setup-with-retry stage that runs in most of our CI jobs is ~1m4s. All it does is set up the openpilot environment, and most of that time is spent pulling an already built docker image. This puts a hard limit on how fast our jobs can finish; a job that finishes in 1m is 10x better than one that finishes in 2-3m.

Some possible strategies:

make the docker image smaller
move the docker cache somewhere faster
ditch docker, install directly on the 20.04 runner, and cache the apt and pip packages

Requirements for the bounty:

all setup-with-retry runs on the final PR commit must finish in under 20s

Sub-bounty of $100 for <40s if you can't get to <20s. $500 is for <20s. Bounties don't stack.

https://github.com/commaai/openpilot/blob/master/.github/workflows/setup-with-retry/action.yaml

nelsonjchen · 2023-12-18T02:41:36Z

wild idea, compatibility unknown, gain (or loss) unknown:

use containerd or similar drop-in runtime instead of stock GHA moby with lazy loading of compatible estargz docker image "pages".

https://github.com/crazy-max/ghaction-setup-containerd

jimbrend · 2023-12-20T13:44:31Z

@lukechilds what do you think?
just out of curiosity on first glance, either way

jimbrend · 2023-12-20T13:47:00Z

ditch docker, install directly on the 20.04 runner, and cache the apt and pip packages

what are the cons here?

mbiernat42 · 2024-02-26T17:54:05Z

That sleep 30 seems questionably long

profknow · 2024-05-16T22:17:59Z

If it's just loading previously configured env, then why not just operate on Flash Drive, leaving the state where it was when last active.

Even on my own computer, I often keep copies of working system/environments that I simply dump into active memory without "booting up". It saves SO MUCH time when you already know the final state anyway.

--Loren Grayson

sanjams2 · 2024-06-04T23:03:11Z

Throwing some thoughts down:

So taking an example from a recent run, it looks like the setup-with-retry step takes 75 seconds. The majority of this time is spent downloading and extracting the cached layers of the openpilot_base image from ghcr.io/commaai/openpilot-base. This step alone seems to take about 60 seconds, and you can validate this yourself by running docker pull ghcr.io/commaai/openpilot-base on a machine and timing it.
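
To reproduce that measurement, a minimal sketch along these lines could be used. The helper name `time_command` is made up for illustration; the only real command involved is the `docker pull` quoted above. Note that pulling an image that is already present locally returns almost immediately, so the local copy has to be removed first (for example with `docker rmi`) to get a number comparable to a clean CI runner.

```python
import subprocess
import sys
import time


def time_command(cmd: list[str]) -> float:
    """Run cmd to completion and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    # check=True raises CalledProcessError if the command exits non-zero.
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Default to the pull described above; pass another command to time that instead.
    cmd = sys.argv[1:] or ["docker", "pull", "ghcr.io/commaai/openpilot-base"]
    print(f"{' '.join(cmd)}: {time_command(cmd):.1f}s")
```

Timings from a home connection will differ from a GitHub-hosted runner, so this is only a rough sanity check of where the time goes.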

What you can see when you do that is that there is one long-pole layer which is the installation of python dependencies. In the Dockerfile that's done here. There is also another large layer to install the ubuntu dependencies but this is not the bottleneck (at the moment). Using dive, we can see the sizes of the different layers as well which confirms this python dependency layer is the big boy.

So what can we do about this?

There are likely more ways, but I can think of two ways to go about addressing this:

The "simple" way would be to divide this large layer up into more layers. Docker by default will only download 3 layers concurrently (docs). While one could increase the concurrency, increasing concurrency alone wouldn't help since the bits of the layer itself are still downloaded serially. To get higher concurrency within the layer, we would also want to divide this layer up into smaller layers. To do this, one could in theory install the different python dependencies in different layers. It would probably make sense to pull the largest python dependencies out into their own layers. Using dive again, we can see the following python dependencies sorted by size:

The drawback of this approach is maintainability. Dynamically defining layers in the Dockerfile is not trivial (nor is it probably a good practice). In order for this to work then, you would likely have to statically define some dependencies to install (and which version) in the Dockerfile. However, this causes duplication of dependency definition with the pyproject.toml file. Perhaps this is an OK tradeoff for speed if you limit the statically defined dependencies in the Dockerfile to be a small subset. It's also not clear how far this could really take you; once you optimize downloads, you still have to deal with extraction. Docker layers are compressed with gzip — which has to be decompressed serially by nature — so in the end, the absolute best you can do may still not be enough. Going back to our numbers, we only have 5 seconds to play with to get < 20 seconds, so this method would almost certainly come up short.
A more complex way I haven't dug into is to use the GitHub Actions cache to cache the python dependencies. This seems like a way to potentially get a faster download of the python dependencies, assuming the GitHub Actions cache download to the GitHub worker is faster than the download from the GitHub container registry (I suspect this is actually the case). You would then need to ensure that the cache is properly wired up in all the docker run commands throughout the build process. A similar mechanism is already being used for the scons cache (example). The GitHub Actions cache has its own complexities with limits per project, limitations on sharing caches between branches, and more.
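
For the caching approach, the cache entry needs a key that changes whenever the dependency definition changes. In a workflow file that role is normally played by the `hashFiles` expression; the sketch below shows the same idea in Python so it can be tried locally. The function name `cache_key` and the `pip` prefix are hypothetical, and hashing only pyproject.toml is an assumption about which files define the dependencies.

```python
import hashlib
from pathlib import Path


def cache_key(prefix: str, files: list[Path]) -> str:
    """Build a cache key that changes whenever any listed file changes."""
    digest = hashlib.sha256()
    for path in sorted(files):
        # Include the file name so that renames also invalidate the key.
        digest.update(path.name.encode())
        digest.update(path.read_bytes())
    return f"{prefix}-{digest.hexdigest()[:16]}"


if __name__ == "__main__":
    # Assumes it is run from a checkout that contains pyproject.toml.
    print(cache_key("pip", [Path("pyproject.toml")]))
```

An unchanged pyproject.toml yields the same key and therefore a cache hit; any edit to it produces a new key and forces a fresh install.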

Note that with either method, you may have to repeat the exercise for the ubuntu dependencies as well given the size of that layer is on the same order of magnitude as the python dependency layer.
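
To make the layer-splitting argument concrete, here is a rough model of the download phase only. It assumes each active download gets the same fixed throughput and that a new layer starts as soon as one of the 3 slots frees up; both are simplifications, and the layer sizes and the 30 MB/s figure are invented for illustration, not measurements of the openpilot image.

```python
import heapq


def pull_time(layer_mb: list[float], mb_per_s: float, max_concurrent: int = 3) -> float:
    """Estimate wall-clock download time for layers fetched in list order.

    Assumes every active download runs at the same fixed throughput
    (mb_per_s) and that a new layer starts as soon as a slot frees up.
    """
    slots = [0.0] * max_concurrent  # time at which each download slot becomes free
    heapq.heapify(slots)
    finish = 0.0
    for size in layer_mb:
        start = heapq.heappop(slots)  # earliest free slot
        end = start + size / mb_per_s
        heapq.heappush(slots, end)
        finish = max(finish, end)
    return finish


# Hypothetical sizes in MB: one big layer versus the same bytes split in three.
single = [900.0, 300.0, 50.0]
split = [300.0, 300.0, 300.0, 300.0, 50.0]
print(pull_time(single, mb_per_s=30.0))
print(pull_time(split, mb_per_s=30.0))
```

Under these assumptions the split image downloads in 20s instead of 30s, because the big layer no longer occupies a single slot for the whole pull. The model ignores extraction entirely, so the real gain would be smaller.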

Both of these methods do continue to rely on docker, with option 1 in some ways doubling down on it. I personally do not think docker overhead is really the issue at hand here and believe there are likely benefits to continuing to use containers for portability. To me this ultimately seems like a problem of having a large amount of dependency bits and finding the fastest way to move them onto a clean GitHub worker. Docker makes some of this more challenging (the layer concurrency piece) but doesn't completely block a speedy build. In some ways, it might make things easier. One final way to go about this would be to trim the dependency fat and hope that slims the layers enough to download in a reasonable amount of time. There is no telling if that would be enough, and furthermore, once you do trim, it is a cat-and-mouse game since dependencies will likely be added in the future.

knownotunknown · 2024-06-09T19:41:18Z

Is this bounty still open @adeebshihadeh? I see setup-with-retry is running in < 20 seconds in a lot of the CI runs (ex: https://github.com/commaai/openpilot/actions/runs/9432165207/job/25981593463).

Also, looking at the latest code on master it seems like we've ultimately decided to use a self-hosted runner (which previously wasn't considered a viable solution for this bounty)?

adeebshihadeh · 2024-06-09T19:44:36Z

Still open.

We're now using namespace runners for internal branches, but I'd love to move back to the GitHub-hosted runners at some point.

ADITYA1720 · 2024-08-30T00:28:38Z

Is the issue open?

jimbrend · 2024-08-30T19:33:44Z

Is the issue open?

it looks open to me @ADITYA1720

naaa760 · 2024-09-23T07:20:49Z

Is this issue open, please??

BBBmau · 2024-09-23T18:56:36Z

@ADITYA1720 @jimbrend @naaa760 for those asking: if an issue is marked as Open, then it's open for anyone to start work on, and it will be marked as Locked once a PR has been submitted that shows a considerable amount of progress. Issues aren't assigned to those who request them; they usually go to whoever has opened the PR.

It seems like the contributing guidelines don't include this, and it also looks like the BOUNTIES.md file, which had that info, was removed.

jimbrend · 2024-10-02T00:16:12Z

thank you @BBBmau

andrewchambers · 2024-10-21T04:51:35Z

@adeebshihadeh Just to clarify, I can disregard the namespace runners if I get it working in github actions fully in under 20s?

andrewchambers · 2024-10-21T05:39:03Z

I have put up a WIP PR with my work at #33831 if it is possible to lock this bounty. If not I will continue to work on it regardless.

edit: I closed the PR so I don't trigger your github actions while testing on my fork.

adeebshihadeh · 2024-10-21T16:21:50Z

@adeebshihadeh Just to clarify, I can disregard the namespace runners if I get it working in github actions fully in under 20s?

correct

adeebshihadeh added good first issue Feasible for new contributers CI / testing bounty labels Dec 13, 2023

adeebshihadeh pinned this issue Dec 13, 2023

adeebshihadeh changed the title ~~[$100 bounty] Speedup CI setup to <40s~~ [$500 bounty] Speedup CI setup to <20s Dec 18, 2023

adeebshihadeh mentioned this issue Dec 18, 2023

dependency: remove pycurl package #30771

Merged

bongbui321 mentioned this issue Dec 31, 2023

Set-up-on-retry Openpilot on GHA runners (WIP) #30873

Closed

knownotunknown mentioned this issue Feb 27, 2024

WIP: Trying to run Setup-on-retry without Docker by installing directly. #31606

Closed

BBBmau mentioned this issue Mar 12, 2024

WIP: Shorten openpilot CI Setup Time to <40s #31836

Closed

NripeshN mentioned this issue May 3, 2024

feat: optimize setup-with-retry CI to run faster #32338

Closed

This was referenced May 15, 2024

leanify dockerfile JustinMiehle/openpilot#1

Open

wip: Optimize setup-with-retry to run faster by lowering the size #32437

Closed

adeebshihadeh changed the title ~~[$500 bounty] Speedup CI setup to <20s~~ Speedup CI setup to <20s Jul 7, 2024

andrewchambers mentioned this issue Oct 21, 2024

WIP: do openpilot ci build in a cached ubuntu chroot instead of docker. #33831

Closed
