Registry fails to initialize for highly parallelized builds #285
Comments
168 images doesn't seem like a lot and shouldn't put pressure on the number of open TCP ports. Ideally, the registry toolchain would allow changing the hardcoded timeout, which is 5 seconds, or would retry with exponential backoff.
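To make the backoff idea concrete, here is a minimal sketch of waiting for a local registry port with exponential backoff instead of a single fixed timeout. It is not what rules_oci does: the function name `wait_for_port`, its parameters, and the default values are all hypothetical and chosen only for illustration.

```python
import socket
import time


def wait_for_port(host, port, total_timeout=30.0, initial_delay=0.05, max_delay=2.0):
    """Poll host:port until it accepts a TCP connection or total_timeout elapses.

    Returns the number of connection attempts on success and raises
    TimeoutError otherwise. Names and defaults here are hypothetical.
    """
    deadline = time.monotonic() + total_timeout
    delay = initial_delay
    attempts = 0
    while True:
        attempts += 1
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return attempts
        except OSError:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise TimeoutError(f"{host}:{port} not ready after {total_timeout}s")
            # Sleep, then double the delay up to max_delay for the next attempt.
            time.sleep(min(delay, remaining))
            delay = min(delay * 2, max_delay)
```

On a heavily loaded machine, short early delays keep the common case fast, while the longer overall deadline gives a slow-starting registry more room than a fixed 5 seconds.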
We're running around 48 builds/runs in parallel on a 48-core machine, which might slow things down enough for the startup to take longer. The following patch "fixes" the issue for now on our side and can easily be applied via the
Ok, we saw this again, but with a more concrete error message:
So it looks like our ulimit is too low. EDIT: For reference, the ulimit was set to
Evaluating this further: it seems like 1048576 open files should be plenty to build even ~170 containers concurrently.
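One way to check which limit a build process actually sees is to read it from inside that process, since the soft limit can differ from what the shell's ulimit reports. The following is a minimal sketch using Python's standard `resource` module (Unix only). The helper names `open_file_limits` and `has_headroom` are hypothetical, and the 1024 in the usage note is just an example value, not a figure from this thread.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file-descriptor limits of this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def has_headroom(needed, limit=None):
    """Report whether `needed` descriptors fit under the soft limit.

    `limit` overrides the process's own soft limit, which is handy for
    what-if checks. Both helper names are hypothetical.
    """
    soft = open_file_limits()[0] if limit is None else limit
    # RLIM_INFINITY means "no limit", so any request fits.
    return soft == resource.RLIM_INFINITY or soft >= needed


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"soft={soft} hard={hard}")
```

For example, `has_headroom(170, limit=1048576)` is true, while `has_headroom(2000, limit=1024)` is false. The number of descriptors a single registry instance needs is not stated in this thread, so the `needed` value would have to be measured.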
I'd like to clear things up a little. First, neither crane nor zot runs as a daemon. By design, crane needs to talk to a registry in order to manipulate and assemble containers. Therefore we spin up a local registry instance, either crane or zot, to stage changes and then pull from it to complete the action. This is a conscious choice we made when we designed rules_oci, to keep complexity low so that we can afford effective maintenance. I'm not surprised that you bumped into this issue: we have seen it with one of our clients, and the workaround was to increase the limits to allow more open files. 'zot' and 'crane registry serve' are the only two implementations that we support at the moment. While zot writes everything to disk, crane keeps it in memory. So ideally, switching to crane as the registry should make this issue disappear, at the cost of more memory usage. Would it be possible for you to remove "ZOT_VERSION" from your workspace and run like that for a while to see if your setup works?
We initially used only crane without zot, as our base images still used the Docker manifest. Our builds almost immediately OOMed on a machine with 64GB RAM. We're mostly bundling Java apps, and our jar layer can easily be 300-500MB. We also have a few images with ML models where the resulting image is about 1.5GB compressed. So without zot we wouldn't have been able to migrate to rules_oci at all.
This might be relevant: google/go-containerregistry#1731
Fixed by #550
We're building around 168 oci_images in our monorepo build. During one of our CI runs I saw:
See: https://github.com/bazel-contrib/rules_oci/blob/main/oci/private/image.sh.tpl#L60
I assume this happened because the registry didn't start in time and timed out. I'm wondering whether there's a safer way to initialize the registry or a way to increase the timeout.