
Registry fails to initialize for highly parallelized builds #285

Closed
hanikesn opened this issue Jul 7, 2023 · 8 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@hanikesn

hanikesn commented Jul 7, 2023

We're building around 168 oci_image targets in our monorepo build. During one of our CI runs I saw:

bazel-out/k8-opt/bin/XXX_image.sh: line 60: $2: unbound variable

See: https://github.com/bazel-contrib/rules_oci/blob/main/oci/private/image.sh.tpl#L60

I assume this happened because the registry didn't start in time and timed out. I'm wondering whether there's a safer way to initialize the registry or a way to increase the timeout.

@alexeagle added the "bug" and "help wanted" labels on Jul 12, 2023
@thesayyn
Collaborator

168 doesn't seem like a lot and shouldn't put pressure on the number of open TCP ports. Ideally, we could make the registry toolchain allow changing the hardcoded timeout, which is 5 seconds, or add a retry mechanism with exponential backoff.
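Such a retry with exponential backoff could look roughly like this — a hypothetical sketch, not the actual launcher code; the probe command (e.g. a curl against the registry's /v2/ endpoint) is passed in as arguments, and `REGISTRY_PORT` is a placeholder:

```shell
#!/usr/bin/env bash
# Generic retry helper with exponential backoff (a sketch, not rules_oci code).
# Usage: retry_with_backoff <max_attempts> <command> [args...]
retry_with_backoff() {
  local max_attempts="$1"; shift
  local attempt=1 delay=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "${attempt}" -ge "${max_attempts}" ]; then
      echo "command failed after ${max_attempts} attempts" >&2
      return 1
    fi
    sleep "${delay}"
    delay=$(( delay * 2 ))   # back off: 1s, 2s, 4s, ...
    attempt=$(( attempt + 1 ))
  done
}

# Example (hypothetical): wait for a local registry to answer.
# retry_with_backoff 6 curl -fsS "http://localhost:${REGISTRY_PORT}/v2/"
```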

@hanikesn
Author

hanikesn commented Jul 13, 2023

168 doesn't seem a lot

We're running around 48 builds/runs in parallel on a 48-core machine, which might slow things down enough for the startup to take longer.

The following patch "fixes" the issue for now on our side and can easily be applied via the patches field on the http_archive rule for rules_oci:

--- oci/private/image.sh.tpl  2023-07-10 18:24:57.019088204 +0200
+++ oci/private/image.sh.tpl	2023-07-10 18:26:54.093602950 +0200
@@ -85,3 +85,3 @@
 source "${REGISTRY_LAUNCHER}"
-readonly REGISTRY=$(start_registry "${STORAGE_DIR}" "${STDERR}")
+readonly REGISTRY=$(start_registry "${STORAGE_DIR}" "${STDERR}" 20)
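For context, applying such a patch via http_archive might look like the following — a hypothetical WORKSPACE sketch in which the patch label is a placeholder and the url/sha256 are elided:

```starlark
http_archive(
    name = "rules_oci",
    # url and sha256 as in your existing WORKSPACE (omitted here)
    patches = ["//third_party/patches:rules_oci_registry_timeout.patch"],
    patch_args = ["-p1"],
)
```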

@hanikesn
Author

hanikesn commented Jul 21, 2023

Ok, we saw this again, but with a more concrete error message:

Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
HTTP port 0
panic: too many open files

goroutine 1 [running]:
zotregistry.io/zot/pkg/cli.newServeCmd.func1(0xc0002a2a00?, {0xc000bb8dd0, 0x1, 0x1?})
	zotregistry.io/zot/pkg/cli/root.go:54 +0xbe
github.com/spf13/cobra.(*Command).execute(0xc0002a2a00, {0xc000bb8da0, 0x1, 0x1})
	github.com/spf13/cobra@v1.5.0/command.go:876 +0x67b
github.com/spf13/cobra.(*Command).ExecuteC(0xc000236f00)
	github.com/spf13/cobra@v1.5.0/command.go:990 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
	github.com/spf13/cobra@v1.5.0/command.go:918
main.main()
	zotregistry.io/zot/cmd/zot/main.go:10 +0x1e
registry didn't become ready within 60s.
bazel-out/k8-opt/XXX_image.sh: line 60: $2: unbound variable
Target //buildsys/docker:push failed to build

So it looks like our ulimit is too low. rules_oci definitely puts more strain on the system than rules_docker did. I'm wondering how feasible it would be to replace zot, or modify it to run in a standalone, non-daemon mode, as it seems quite expensive for what it's doing.

EDIT: For reference, the ulimit was set to 1048576.
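For anyone debugging similar failures, the per-process file-descriptor limits can be inspected from a shell — a generic diagnostic sketch, not rules_oci code:

```shell
# Print the soft and hard limits on open file descriptors for this shell.
# The hard limit is the ceiling a non-root process may raise its soft limit to.
soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
echo "soft=${soft} hard=${hard}"
```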

@hanikesn
Author

Evaluating this further: it seems like 1048576 open files should be plenty to build even ~170 containers concurrently.

@thesayyn
Collaborator

I'd like to clear things up a little bit. First, neither crane nor zot runs as a daemon. By design, crane needs to talk to a registry in order to manipulate/assemble containers, so we spin up a local registry instance, either crane or zot, to stage changes and pull from to complete the action. This is a conscious choice we made when we designed rules_oci: keep the complexity low so that we can afford effective maintenance.

I'm not surprised that you bumped into this issue, as we have seen it with one of our clients; the workaround was to raise the limit on open files.

'zot' and 'crane registry serve' are the only two implementations we support at the moment. While zot writes everything to disk, crane stores everything in memory, so switching to crane as the registry should make this issue disappear, but would lead to more memory usage. Would it be possible for you to remove "ZOT_VERSION" from your workspace and let it run like that for a while to see if your setup works?

@hanikesn
Author

Would it be possible for you to remove “ZOT_VERSION” from your workspace and let it run like that for a while to see if your setup works?

We initially only used krane without zot, as our base images still used the Docker manifest. Our builds almost immediately OOMed on a machine with 64 GB of RAM. We're mostly bundling Java apps, and our jar layer can easily be 300–500 MB; we also have a few images with ML models where the resulting image is about 1.5 GB compressed. So without zot we wouldn't have been able to migrate to rules_oci at all.

@jonjohnsonjr

This might be relevant: google/go-containerregistry#1731

@thesayyn
Collaborator

thesayyn commented May 7, 2024

Fixed by #550.

@thesayyn thesayyn closed this as completed May 7, 2024