Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fetching paths from cache.nixos.org gets slow in certain Docker environments #11258

Open
nebez opened this issue Aug 5, 2024 · 15 comments
Open
Labels
bug fetching Networking with the outside (non-Nix) world, input locking performance

Comments

@nebez
Copy link

nebez commented Aug 5, 2024

Describe the bug

When fetching paths from cache.nixos.org, the operation gets stuck depending on the environment. This is a follow-up to my comment in NixOS/nixpkgs#160289 . Given enough time, the operation seems to make progress. For a simple operation of installing a single package from nixpkgs, this takes ~3 minutes. For something slightly more complex (our production app), the operation hasn't yet completed and has been running for 2 hours. For comparison, it used to take ~10 minutes on nixos/nix:2.22.3.

This behaviour was introduced in nixos/nix:2.23.0-arm64. It was last working in nixos/nix:2.22.3-arm64. I've tested every new tag published since (until 2.24.1-arm64) and observed the same behaviour.

What's been difficult about narrowing this bug down has been that it is consistent in its reproduction, but behaves very differently depending on invocation. Using the same Dockerfile on my darwin vs. linux produces different results. On darwin, I don't face this issue. I assume this is because the of docker-desktop virtualization/differences. Secondly, invoking the offending command via docker run works fine!

I will describe below.

Steps To Reproduce

On linux, create Dockerfile and fill contents:

FROM nixos/nix:2.23.0-arm64
RUN nix-env --file https://github.com/NixOS/nixpkgs/archive/47b604b07d1e8146d5398b42d3306fdebd343986.tar.gz --install --attr sops

docker build .

Observe getting stuck at: copying path '/nix/store/y359d58sp2j91i86aclsg16p8h414n16-sops-3.8.1' from 'https://cache.nixos.org'...

Wait for a few minutes, and eventually the process will continue. On an arm64 c7g.2xlarge, this operation takes 227 seconds.

Expected behavior

Change base image in your Dockerfile to nixos/nix:2.22.3-arm64

docker build .

You will not get stuck copying path [...]... step.

Process will complete in a much more reasonable 33 seconds on the same arm64 c7g.2xlarge machine.

Additional context

It seems this only happens in the docker builder. When running these same two commands, you will not see any difference in behaviour. Both operations complete in a reasonable 30s.

time docker run --rm nixos/nix:2.22.3 nix-env --file https://github.com/NixOS/nixpkgs/archive/47b604b07d1e8146d5398b42d3306fdebd343986.tar.gz --install --attr sops
time docker run --rm nixos/nix:2.23.0 nix-env --file https://github.com/NixOS/nixpkgs/archive/47b604b07d1e8146d5398b42d3306fdebd343986.tar.gz --install --attr sops

These commands respectively return:

# nixos/nix:2.22.3
real	0m37.988s

# nixos/nix:2.23.0
real	0m41.800s

Notify maintainers

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

[user@system:~]$ docker run --rm nixos/nix:2.22.3 nix-shell -p nix-info --run "nix-info -m"
 - system: `"aarch64-linux"`
 - host os: `Linux 6.1.90-99.173.amzn2023.aarch64`
 - multi-user?: `yes`
 - sandbox: `no`
 - version: `nix-env (Nix) 2.22.3`
 - channels(root): `""`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixpkgs`
[user@system:~]$ docker run --rm nixos/nix:2.23.0 nix-shell -p nix-info --run "nix-info -m"
 - system: `"aarch64-linux"`
 - host os: `Linux 6.1.90-99.173.amzn2023.aarch64`
 - multi-user?: `yes`
 - sandbox: `no`
 - version: `nix-env (Nix) 2.23.0`
 - channels(root): `""`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixpkgs`
[user@system:~]$ docker -v
Docker version 25.0.3, build 4debf41

Add a 👍 reaction to issues you find important.

@Ericson2314
Copy link
Member

I wonder if the problem #11257 is fixing is at all related.

@Ericson2314
Copy link
Member

CC @L-as

@L-as
Copy link
Member

L-as commented Aug 5, 2024

I wouldn't think so, that was introduced in 2.24. I can still try to investigate this since I'm so familiar with the build/worker code now.

@roberth
Copy link
Member

roberth commented Aug 11, 2024

This may be improved by

However, it should never hang for 2 hours. Do we have a timeout that applies during the handling of the connection? Something akin to max-silent-time but for libcurl?

@nebez
Copy link
Author

nebez commented Aug 11, 2024

I have a retry mechanism on my side during my build phase, but unfortunately i'm just realizing now that it's silent. I can't be 100% sure that the connection isn't timing out something like ~30 minutes and then trying the nix-env command again.

I'll add some more verbosity to it and report back, as well as try the fix applied in #11171.

@nebez
Copy link
Author

nebez commented Aug 11, 2024

Just tested 2.24.2 which includes 6e3bba5 – result is the same.

Takes 226.3s to run the nix-env command that took 34.6s on 2.22.3. Both just ran on the same host machine. output of nix-info:

docker run --rm nixos/nix:2.24.2 nix-shell -p nix-info --run "nix-info -m"
 - system: `"aarch64-linux"`
 - host os: `Linux 6.1.90-99.173.amzn2023.aarch64`
 - multi-user?: `yes`
 - sandbox: `no`
 - version: `nix-env (Nix) 2.24.2`
 - channels(root): `""`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixpkgs`

The Dockerfile to test:

FROM nixos/nix:2.24.2-arm64
RUN nix-env --file https://github.com/NixOS/nixpkgs/archive/47b604b07d1e8146d5398b42d3306fdebd343986.tar.gz --install --attr sops

@roberth roberth added the fetching Networking with the outside (non-Nix) world, input locking label Aug 11, 2024
@roberth roberth changed the title Fetching paths from cache.nixos.org gets stuck in certain Docker environments Fetching paths from cache.nixos.org gets slow in certain Docker environments Aug 12, 2024
@roberth
Copy link
Member

roberth commented Aug 12, 2024

I've changed the title because it is likely to only be slow now, since #11171

@mischavandenburg
Copy link

I'm having this issue as well. I'm running dev containers using Nix as my package manager. nix is installed as a dev container feature.

It always gets stuck on this specific package:

copying path '/nix/store/6a7qxqcm33y3lwr1v55hpp5hg7akja92-stdenv-linux' from 'https://cache.nixos.org'...

.devcontainer.json

{
  "image": "mcr.microsoft.com/devcontainers/base:debian",
  "features": {
    "ghcr.io/devcontainers/features/nix:1": {}
  },
}

@nixos-discourse
Copy link

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2024-08-12-nix-team-meeting-minutes-168/50561/1

@roberth
Copy link
Member

roberth commented Aug 13, 2024

DIscussed during the aforementioned Nix maintainer team meeting:

Possibly same cause: #11249, but @edolstra could not reproduce.

Could this be libgit2 overhead of the new tarball-cache? In a Docker container we may be dealing with an overhead of many fs overlays, amplifying the overhead (normally on the order of 2× at most?).
This might also make reproducing dependent on the layer complexity of the image?

@kjeremy
Copy link
Contributor

kjeremy commented Aug 13, 2024

I'm seeing this in a docker container running 2.18.5 as well.

@nebez
Copy link
Author

nebez commented Aug 13, 2024

I can donate an aws linux machine for you to reproduce this with. it only has docker installed and the issue is consistently reproducible. send me your public key, i'll get you on it.

@mischavandenburg
Copy link

Tested just now, and the issue still persists. I tried hacking the dev container feature to bring it back to an older Nix version, but unfortunately that hasn't worked.

@PkmX
Copy link

PkmX commented Aug 31, 2024

We hit a similar problem with extremely slow builds in docker which turns to be caused by a huge default OPEN_MAX value:

$ docker run -it --rm alpine:latest getconf OPEN_MAX
1073741816

This causes closeMostFDs to be stuck in a loop trying to close them:

int maxFD = 0;
#if HAVE_SYSCONF
maxFD = sysconf(_SC_OPEN_MAX);
#endif
for (int fd = 0; fd < maxFD; ++fd)
if (!exceptions.count(fd))
close(fd); /* ignore result */

For reasons I didn't investigate, the loop above that iterates over /proc/self/fd on Linux throws an exception inside my docker environment, so this fallback method gets called.

Maybe check if running the container with --ulimit nofile=1024:1024 helps?

@nebez
Copy link
Author

nebez commented Sep 2, 2024

This does indeed fix it, @PkmX ! Thank you!

Using this Dockerfile:

FROM nixos/nix:2.23.0-arm64
RUN nix-env --file https://github.com/NixOS/nixpkgs/archive/47b604b07d1e8146d5398b42d3306fdebd343986.tar.gz --install --attr sops

Running docker build --ulimit nofile=65534:65534 . finishes in a very reasonable 41.7 seconds. Incredible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug fetching Networking with the outside (non-Nix) world, input locking performance
Projects
None yet
Development

No branches or pull requests

8 participants