CI keeps outdated caches until it runs out of space, so we should delete them manually #79919
Comments
As a tangential note, we can also split up some builds and their tests. This would allow us to restart tests without restarting builds if there are issues outside of our control (and perhaps that would clean up some space on runners too, between the build and the test). We can do this by uploading the artifacts as early as we can (immediately after the build is done). This is something that has been requested for other reasons too, so it would be a good improvement regardless. Then some of the tests can be moved to a dedicated job that depends on the build job. The test job would download the artifact and use it, since most of the time it doesn't need any source code access. We can also move building our test …
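As an illustration of that split, here is a minimal sketch of separate build and test jobs; the job names, artifact name, and SCons invocation are assumptions for the example, not the actual workflow's contents:

```yaml
# Hypothetical layout: build once, upload the binary right away,
# then run tests in a job that only downloads the artifact.
jobs:
  build-linux:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build the editor
        run: scons platform=linuxbsd target=editor
      - name: Upload the binary immediately after the build
        uses: actions/upload-artifact@v4
        with:
          name: linux-editor
          path: bin/godot.linuxbsd.editor.x86_64

  test-linux:
    needs: build-linux  # can be rerun without rebuilding the engine
    runs-on: ubuntu-latest
    steps:
      - name: Download the prebuilt binary (no source checkout needed)
        uses: actions/download-artifact@v4
        with:
          name: linux-editor
      - name: Run a smoke test against the prebuilt binary
        run: |
          chmod +x godot.linuxbsd.editor.x86_64
          ./godot.linuxbsd.editor.x86_64 --headless --version
```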
Are the caches created by PRs actually all that useful, or should we maybe just disable writing them from PRs? TL;DR: I think just not caching things from PRs would make for a better cache selection and also better cache retention, as not so many nearly identical entries end up in the cache, filling it almost immediately.
That's what I've been thinking as well. Most PRs don't really benefit from having a cache. Sure, it would speed up consecutive builds of the same PR, but not a lot of PRs are that different from …

Plus, we now have a multi-stage setup, and the static checks action guards the CI from running many invalid PRs. Those would previously benefit from caches during reruns, but now they don't even build and cache anything, so having the cache enabled doesn't benefit them at all.

And finally, even in a hypothetical situation where some PR doesn't change anything about the existing cache it took, it still produces a new cache entry, wasting space needlessly. So I agree, we could likely disable caches for everything other than the main development branches.

One major takeaway from the whole thing is that it's not about how much space all caches take; it's about how much space individual caches take. The problem with the CI comes from the fact that one particular cache gets too big, then it gets loaded onto the runner, and that makes the runner run out of allocated space. So keeping individual caches small is important. The issue seems to be that one "bad" cache can poison the well. If one cache gets too big for some reason, or if caches bloat over time organically, at some point the cache is going to be too large and cause this kind of outage.
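If we go that route, one way to express it would be to restore the cache on every run but only save it on pushes to the main development branches. The following is only a sketch, assuming the `actions/cache/restore` and `actions/cache/save` sub-actions and an illustrative key layout rather than the workflow's real keys:

```yaml
# Restore for every run (PRs included)...
- name: Restore the SCons cache
  uses: actions/cache/restore@v4
  with:
    path: .scons_cache
    key: linux-editor-${{ github.ref }}-${{ github.sha }}
    restore-keys: |
      linux-editor-${{ github.ref }}
      linux-editor

# ... build steps go here ...

# ...but only write a new cache entry for main development branches.
# The condition would need to be extended for 3.x and other maintained branches.
- name: Save the SCons cache (main branches only)
  if: github.event_name == 'push' && github.ref == 'refs/heads/master'
  uses: actions/cache/save@v4
  with:
    path: .scons_cache
    key: linux-editor-${{ github.ref }}-${{ github.sha }}
```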
That's not entirely true. The current strategy is to find the closest match using the action name, the target (base) branch, the PR branch, and the PR SHA hash.
This basically means that it will first check for the exact match (same hash and all), which it is unlikely to find. Then it will check for a match that has the same PR branch. This should fetch the old cache from the same PR, if one exists. If it does not, it will pick the most recent cache based on the target/base branch only. This is where things get tricky: technically all PRs would match this, so an arbitrary cache could be used here. One way to solve this would be to add a direct lookup for the base branch, e.g.:
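Something along these lines, as a sketch; the key prefix follows the examples used in this thread, and the exact inputs in the real workflow may differ:

```yaml
# Restore order: exact key, then the same PR branch, then caches produced by
# the base branch itself, and only then anything with the same prefix.
- uses: actions/cache@v4
  with:
    path: .scons_cache
    key: windows-template-master-${{ github.ref }}-${{ github.sha }}
    restore-keys: |
      windows-template-master-${{ github.ref }}
      windows-template-master-refs/heads/master
      windows-template-master
```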
We'd need to keep those in sync with the name of the branch the CI itself is stored in (so we'd need to change it for the 4.1 branch, 4.0 branch, 3.x branch, 3.6 branch, etc.). Another option is, once again, to only keep the master/3.x/etc. build results cached.
It's a shame we don't have any historic data on this, since it's hard to identify a pattern. We cache the entire …

But as I've mentioned, we don't have any historic data. We might've been getting huge caches for some time already, and just reached an unfortunate mix of increased runner requirements (from SCU builds, perhaps) and having a big cache already there. Whatever it is, …
Right, I missed to …

To keep the size of the caches down, we should probably investigate trimming the cache entries more aggressively based on atime, so that only items actually used in the current build get uploaded again. This should prevent the cache from growing any larger than absolutely necessary to effectively have incremental builds in CI.

Alternatively (or for later) we could also ditch GitHub's caching entirely and use the SCons cache as intended, by mounting a network share with the cache into the runner. That would mean that a) the cache itself doesn't take any disk space on the runner, and b) only artifacts that are actually used or produced by that run are transferred over the network. Of course this approach comes with a whole lot of other complications, so I don't think this is a good short-term goal.

Re: cache size, that is kinda to be expected when … requires rebuilding the entire engine, as the cache will then contain both all the previous object files (in case they ever become useful again; not really ever the case for the current setup) and the newly built versions of the same files, doubling its size.

As a very quick workaround for the space issues, we could probably also just disable the cache for the GCC SAN run entirely. That would mean it always takes an hour, but at least the runner would have 1 GB more free disk space and not run out all the time.
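A rough sketch of the atime-based trimming idea, assuming the SCons cache lives in `.scons_cache`; note that `relatime` mounts only update access times sporadically, so a real implementation would need to verify that reads are actually being recorded:

```yaml
# ... the cache restore step goes here ...

# Record a marker after the cache is restored but before the build, so that
# only files the build actually reads end up with a newer access time.
- name: Mark the build start
  run: touch .build_started

# ... the SCons build step goes here ...

- name: Trim cache entries not used by this build
  if: always()
  run: |
    # "! -anewer" selects files not accessed since .build_started was created,
    # i.e. files the build never touched; delete them before re-uploading.
    find .scons_cache -type f ! -anewer .build_started -delete
```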
Removing the Android toolchain saves 14 GiB, which gives us more room for growth and to avoid running into out-of-space errors in the Linux sanitizers + debug symbols builds. Related to godotengine#79919, though the caches were just one part of the problem; the real issue is that our Linux sanitizers builds take 12 GiB, and adding godot-cpp on top with 2 GiB leaves only a few GiB for the cache itself.
We've discussed this a bit with @akien-mga and measured the impact of each build step. It seems that the Ubuntu runner comes with 84 GB of space total, with around 64 GB taken by the OS and other prerequisites. While our own dependencies can take a gig or two depending on the build configuration, for the most part this is the amount of space that is deterministically taken every time we spin up an Ubuntu image. This leaves us with around 20 GB for the build. With the san builds (whether Clang or GCC) we are using around 12-13 GB, without any cache. The following …

There is no way to use a different distribution for GitHub-hosted runners; it's only Ubuntu (latest, 22.04, or 20.04). However, there are certainly ways to debloat it. After all, 60 gigs for Linux looks like a lot! We are not the first to come up with such an idea, as there are a couple of publicly available actions that do just that: they remove unnecessary pre-installed tools and dependencies. So that's something we could do too, as the results of these tools look very promising. (Edit: Here's a PR with more details: #80115)

For reference, here are some of the test runs; you can see the disk usage stats from their logs: …
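For illustration, a sketch of the kind of cleanup step such actions perform; the paths below are the ones commonly reported for the hosted `ubuntu-latest` image (the Android one matches what #80115 removes), but they are not guaranteed to stay the same across image updates:

```yaml
- name: Free disk space on the hosted runner
  run: |
    df -h /                             # space before cleanup
    sudo rm -rf /usr/local/lib/android  # Android SDK/NDK, roughly 14 GiB
    sudo rm -rf /usr/share/dotnet       # .NET SDKs, several GiB
    sudo rm -rf /opt/ghc                # Haskell toolchain, if present
    df -h /                             # space after cleanup
```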
I'd like to work on this. Would calling a Python script that implements some of the logic at the end of the workflows be an acceptable solution?
@brno32 I don't think there is anything that needs to be worked on here right now. #80115 resolved most of the pressing issues and we haven't had CI troubles with space ever since. We are kind of wasting space with old caches, but that doesn't seem to bother GitHub much. With changes in #80091 we also use more relevant caches than when this issue was submitted. And we don't compile Godot and godot-cpp on the same runner anymore. So while all these findings are still interesting and educational, I think the issue can be closed as resolved.
Godot version
n/a
System information
n/a
Issue description
Our GitHub Actions CI has a caching mechanism that reuses the results of previous builds to speed up subsequent builds. The caching itself works pretty well; however, we do have an issue with the space allocated to our runners.
Take #76252 for example. It has been pushed several times in the last day, which created multiple cache records that all take up space, even though only the latest one is useful.
In theory, there is an automatic mechanism for flushing outdated caches, but it's only triggered when the cache storage is about to run out of space. So until then, 3 of these 4 cached packages lie idle and take up space that we desperately need, as our builds require a lot of it.
Steps to reproduce
What we could do to fix this is to make sure we always clean up outdated caches.
First of all, we can safely remove any cache associated with a recently merged PR. In theory, these caches could be used by the `master` branch (and all other main branches) too, but in practice that would almost never happen. The lookup for a matching cache would first check for caches starting with `windows-template-master-refs/heads/master`. Only if those are missing would it take the latest cache starting with `windows-template-master`, which isn't even guaranteed to be the one we need and can come from any recently built PR. So all in all, PR caches are useless to us the moment we merge the PR.

Second, we should definitely remove previous caches associated with any PR once they are no longer useful. So in the situation demonstrated above, we would start the first build without any caches and generate a cache labeled `windows-editor-3.x-refs/pull/76252/merge-cff55472ae5752a1f9b3026a95c21485357697ab`. The second build would want to use that cache, so we start the build as normal, but once it's done and a new cache is available (`windows-editor-3.x-refs/pull/76252/merge-0fbd955076278d0112419b34b67731952992faf6`), we need to remove `windows-editor-3.x-refs/pull/76252/merge-cff55472ae5752a1f9b3026a95c21485357697ab`.

This would make sure we don't store anything that is no longer usable. The implementation should be pretty straightforward (store the old cache key, do the build, delete the old cache key, finish the action).
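A minimal sketch of that last step, using GitHub's REST API for Actions caches via the `gh` CLI. `OLD_CACHE_KEY` is a hypothetical value here, taken from the output of an assumed `restore` step id, and the job's token needs `actions: write` permission:

```yaml
- name: Delete the superseded cache entry
  if: success()
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    OLD_CACHE_KEY: ${{ steps.restore.outputs.cache-matched-key }}  # assumed step id
  run: |
    # Removes every cache in this repository whose key matches OLD_CACHE_KEY.
    gh api --method DELETE \
      "/repos/${{ github.repository }}/actions/caches?key=${OLD_CACHE_KEY}"
```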
Minimal reproduction project
n/a