Nix 2.0: nix copy stalls #1988

Closed
chris-martin opened this issue Mar 19, 2018 · 24 comments

@chris-martin
Contributor

chris-martin commented Mar 19, 2018

I've been deploying a server with nix copy for a while now, but suddenly I find it's no longer working.

$ nix copy /nix/store/cg47i25qrhp3lcd3hirgwc28pd7yxw0v-nixos-system-tc-webserver1-18.03pre-git --to ssh://example.com --no-check-sigs --verbose
copying 11 paths...
copying path '/nix/store/lkr614grgjfm7dghyw8zj9x5i4dwhmcy-qemu-2.11.1' to 'ssh://example.com'...
warning: dumping very large path (> 256 MiB); this may run out of memory
[1/0/11 copied (406.0/410.7 MiB)] copying path '/nix/store/lkr614grgjfm7dghyw8zj9x5i4dwhmcy-qemu-2.11.1' to 'ssh://example.com'

The command pauses (apparently indefinitely) with no clue as to why. If I interrupt it and run it again, I get exactly the same behavior.

Watching top on the server, I see that a nix-store --serve --write process is running.
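
For anyone reproducing this, here is a minimal sketch of how the memory use of that server-side process could be read from a shell rather than from top. It assumes a Linux host with /proc. The names `rss_mib` and `PROC_ROOT` are made up for this illustration and are not part of Nix.

```shell
# Print the resident set size (in MiB) of the process with the given PID,
# read from the VmRSS line of its /proc status file (Linux only).
# PROC_ROOT is a hypothetical override that makes the function easy to test.
rss_mib() {
  awk '/^VmRSS:/ { printf "%.1f\n", $2 / 1024 }' "${PROC_ROOT:-/proc}/$1/status"
}
```

Assuming pgrep is installed, something like `rss_mib "$(pgrep -o -f 'nix-store --serve')"` could be run repeatedly on the server to see whether the process keeps growing during the copy.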

I've tried restarting sshd and nix-daemon on the server, and I've rebooted my laptop.


Update: After about an hour, the nix copy command printed this and halted:

error: Nix daemon out of memory
reaping 3 worker threads
killing process 9152
[0 copied (408.5 MiB)]
error: unexpected end-of-file
@AmineChikhaoui
Member

It would be helpful if you could test with the latest master branch. As far as I know, Eelco made some changes there to reduce memory consumption (48662d1).

@chris-martin
Contributor Author

Hi @AmineChikhaoui, I'm trying to do that. I think the manual is out of date - it says the first build step is to run ./configure, but there's no such script.

@AmineChikhaoui
Member

I generally override the nix package in my configuration.nix. Here is an example from an old setup (you need to substitute the latest revision and checksum):

nixpkgs.config.packageOverrides = pkgs: {
  nixUnstable = pkgs.nixUnstable.overrideAttrs (oldAttrs: rec {
    src = pkgs.fetchFromGitHub {
      owner = "NixOS";
      repo = "nix";
      rev = "179b896acb6deb8fea9614dfbddeaf3b23797bf5";
      sha256 = "15dd0f8jx5c4vhjnis6rrccyw8qdm86l5fakqp07ifvb6l0xlfzb";
    };
  });
};

@chris-martin
Contributor Author

Added this to my package overrides:

    nixUnstable = pkgs.nixUnstable.overrideAttrs (oldAttrs: rec {
      src = pkgs.fetchFromGitHub {
        owner = "NixOS";
        repo = "nix";
        rev = "d53970d31bdf9e4133a6ddd42d6a8d6db15903c4";
        sha256 = "1ssk37s1byp4gy63iaxlyfwxzfnxirqpg6g92m8dxv5jgy4qgk78";
      };
    });

This is the build result:

installing 'nix-2.0'
these derivations will be built:
  /nix/store/mvmdcaw565vrq90pg9cxdqlfw270qi10-nix-2.0.drv
building '/nix/store/mvmdcaw565vrq90pg9cxdqlfw270qi10-nix-2.0.drv'...
unpacking sources
unpacking source archive /nix/store/kqx30r3cns93sqbv842z6f1bn6fhc3vr-source
source root is source
patching sources
configuring
no configure script, doing nothing
building
build flags: -j1 -l1 SHELL=/nix/store/zqh3l3lyw32q1ayb15bnvg9f24j5v2p0-bash-4.4-p12/bin/bash profiledir=\$\(out\)/etc/profile.d
  GEN    Makefile.config
/nix/store/zqh3l3lyw32q1ayb15bnvg9f24j5v2p0-bash-4.4-p12/bin/bash: ./config.status: No such file or directory
  GEN    src/libexpr/parser-tab.cc
/nix/store/zqh3l3lyw32q1ayb15bnvg9f24j5v2p0-bash-4.4-p12/bin/bash: bison: command not found
make: *** [src/libexpr/local.mk:24: src/libexpr/parser-tab.cc] Error 127
builder for '/nix/store/mvmdcaw565vrq90pg9cxdqlfw270qi10-nix-2.0.drv' failed with exit code 2
error: build of '/nix/store/mvmdcaw565vrq90pg9cxdqlfw270qi10-nix-2.0.drv' failed

@chris-martin
Contributor Author

It does seem like memory may be relevant. This is what I get when I try to build qemu on the server directly:

$ nix-shell -p qemu
these paths will be fetched (60.93 MiB download, 408.51 MiB unpacked):
  /nix/store/l3nz0167a4xhb78ldcjns5p4dwvlm1cw-stdenv
  /nix/store/lkr614grgjfm7dghyw8zj9x5i4dwhmcy-qemu-2.11.1
copying path '/nix/store/lkr614grgjfm7dghyw8zj9x5i4dwhmcy-qemu-2.11.1' from 'https://cache.nixos.org'...
error: Nix daemon out of memory

Do I read this correctly? Is Nix somehow running out of memory while merely trying to download a file?

@teto
Member

teto commented Mar 20, 2018

I may have a similar problem. With Nix 2.0, simply adding environment.systemPackages = [ pkgs.qemu ]; and trying to deploy with nixops (master) gives:

server> copying path '/nix/store/9lharz6d9i2zp92zl6w4v7ifks15m775-qemu-2.11.1' to 'ssh://root@192.168.122.222'...
client> warning: dumping very large path (> 256 MiB); this may run out of memory
client> error: out of memory

(export GC_MAXIMUM_HEAP_SIZE=10G didn't help either.) At the time I had 3.5 GB of RAM available.

@chris-martin
Contributor Author

chris-martin commented Mar 20, 2018

Although it does so happen that qemu is a particularly large derivation, sometimes nix copy fails to copy derivations that are as small as a few kilobytes.

@dhess

dhess commented Mar 21, 2018

Since upgrading to Nix 2.0, I am also having this problem trying to copy GHC to a BeagleBone Black, which has only 512MiB of RAM. I bumped swap on the host to 1GiB, and even that wasn't sufficient.

@chris-martin
Contributor Author

Workaround: I have gone back to using nix-copy-closure, and it still seems to work reliably in version 2.0.
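
A sketch of that fallback as a small shell function. Only `nix-copy-closure --to` comes from this thread. The names `copy_closure` and `DRY_RUN` are hypothetical and exist just for this example.

```shell
# Copy the closure of a store path to a remote host with nix-copy-closure.
# With DRY_RUN set to a non-empty value, print the command instead of running it.
copy_closure() {
  host=$1
  path=$2
  ${DRY_RUN:+echo} nix-copy-closure --to "$host" "$path"
}
```

With the host and path from the original report, this would be called as `copy_closure example.com /nix/store/cg47i25qrhp3lcd3hirgwc28pd7yxw0v-nixos-system-tc-webserver1-18.03pre-git`. Prefixing the call with `DRY_RUN=1` shows the command without contacting the server.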

@dhess

dhess commented Mar 22, 2018

@chris-martin FYI, nix-copy-closure from Nix 2.0 is what I was using when I had my issues. I had to bump swap to 2GiB just to copy the ghc-8.0.2 closure alone.

@teto
Member

teto commented Mar 22, 2018 via email

@dtzWill
Member

dtzWill commented Mar 22, 2018

Strongly recommend using the latest Nix (git), which has huge improvements regarding memory usage. (If you're still unable to make your override work, LMK.)

@chris-martin
Contributor Author

chris-martin commented Mar 22, 2018

@dtzWill Yes, the nixUnstable build is still broken. This is the latest master commit.

      nixUnstable = pkgs.nixUnstable.overrideAttrs (oldAttrs: rec {
        src = pkgs.fetchFromGitHub {
          owner = "NixOS";
          repo = "nix";
          rev = "2bc6cfe1adc89673626a173ee38ef300588be3d1";
          sha256 = "1qcvmngdxpdn6l6was8j6biaxc2hc6ywhsv4hk6lg21llkb2203s";
        };
      });
$ nix install nixUnstable
installing 'nix-2.0'
these derivations will be built:
  /nix/store/v0zsbbbgp97yijh7kfdhiy39s1d3i7bi-nix-2.0.drv
building '/nix/store/v0zsbbbgp97yijh7kfdhiy39s1d3i7bi-nix-2.0.drv'...
unpacking sources
unpacking source archive /nix/store/cwgndzm925ryqs4b7hk3a9mzlk4nyjql-source
source root is source
patching sources
configuring
no configure script, doing nothing
building
build flags: -j1 -l1 SHELL=/nix/store/zqh3l3lyw32q1ayb15bnvg9f24j5v2p0-bash-4.4-p12/bin/bash profiledir=\$\(out\)/etc/profile.d
  GEN    Makefile.config
/nix/store/zqh3l3lyw32q1ayb15bnvg9f24j5v2p0-bash-4.4-p12/bin/bash: ./config.status: No such file or directory
  GEN    src/libexpr/parser-tab.cc
/nix/store/zqh3l3lyw32q1ayb15bnvg9f24j5v2p0-bash-4.4-p12/bin/bash: bison: command not found
make: *** [src/libexpr/local.mk:24: src/libexpr/parser-tab.cc] Error 127
builder for '/nix/store/v0zsbbbgp97yijh7kfdhiy39s1d3i7bi-nix-2.0.drv' failed with exit code 2
error: build of '/nix/store/v0zsbbbgp97yijh7kfdhiy39s1d3i7bi-nix-2.0.drv' failed

@dtzWill
Member

dtzWill commented Mar 22, 2018

Okay, well that's a bummer. Looks like there's no convenient nixUnstable that knows how to build from git. This should work, it's basically what I use for using a version of Nix built from my fork or a specific branch: https://gist.github.com/0aa8821f53c358a5c4b61a334ff9e953

If you don't set nixpkgs it'll use the default which is currently nixos-18.03. Using the default is more supported, but if you're using recent nixpkgs you may want to ensure your nixpkgs is being used.

Not quite as clean as an override but works for me and hopefully helps.

EDIT: grafting bits into an override works, but eep: https://gist.github.com/dtzWill/d3fb86978f8fb8dcb3a8f726d7db0522

@dtzWill
Member

dtzWill commented Mar 22, 2018

nix install

Where'd this come from? 😁

@chris-martin
Contributor Author

Where'd this come from?

Oh, right, nix install x is my fish alias for nix-env --keep-going --file '<nixpkgs>' --install --attr x 😛

shlevy added the backlog label Apr 1, 2018
shlevy self-assigned this Apr 1, 2018
@teto
Member

teto commented Apr 6, 2018

I've tried enabling memory overcommit, setting GC_INITIAL_HEAP_SIZE=128k, and installing Nix from master, and even with 5 GB of free memory I still get:

(ins)[nix-shell:~/nixops]$ nixops deploy
server> connecting...
client> connecting...
building all machine configurations...
client> copying closure...
server> copying closure...
client> copying 11 paths...
server> copying 11 paths...
server> copying path '/nix/store/j7x4cjlfrdhagmciqk5zc58g1icg91rb-qemu-2.11.1' to 'ssh://root@192.168.122.241'...
client> copying path '/nix/store/j7x4cjlfrdhagmciqk5zc58g1icg91rb-qemu-2.11.1' to 'ssh://root@192.168.122.98'...
server> copying path '/nix/store/zr8gkk7fh45dpi9q6yqcrv1zaycn82ms-configuration.nix' to 'ssh://root@192.168.122.241'...
client> error: out of memory
client> error (ignored): writing to file: Broken pipe
client> error: writing to file: Broken pipe
server> error: out of memory
server> error (ignored): writing to file: Broken pipe
server> error: writing to file: Broken pipe
error: Multiple exceptions (2): 
  * client: command ‘['nix-copy-closure', '--to', 'root@192.168.122.98', u'/nix/store/sg56hkjv5rkhs0qprvryd8l7mgz8smv7-nixos-system-client-18.09.git.94a99c0']’ failed on machine ‘client’ (exit code 1)
  * server: command ‘['nix-copy-closure', '--to', 'root@192.168.122.241', u'/nix/store/74hn3jdkxypkq7rn16wcm06kq9mrszgk-nixos-system-server-18.09.git.94a99c0']’ failed on machine ‘server’ (exit code 1)

I am really looking forward to a fix.

@NCrashed

NCrashed commented Apr 8, 2018

I have the same issue: nixops fails to copy closures to a VPS with 2 GB of RAM:

building all machine configurations...
node2....> copying closure...
node6.> copying closure...
node5...> copying closure...
node3> copying closure...
node1.> copying closure...
mainServer..> copying closure...
node2....> copying path '/nix/store/g9baqjsh28swdy6mvvp4hp2by8xvd7af-ghc-8.2.2' from 'https://cache.nixos.org'...
node4..> copying closure...
node3> copying path '/nix/store/g9baqjsh28swdy6mvvp4hp2by8xvd7af-ghc-8.2.2' from 'https://cache.nixos.org'...
node4..> copying path '/nix/store/g9baqjsh28swdy6mvvp4hp2by8xvd7af-ghc-8.2.2' from 'https://cache.nixos.org'...
node1.> copying path '/nix/store/g9baqjsh28swdy6mvvp4hp2by8xvd7af-ghc-8.2.2' from 'https://cache.nixos.org'...
node2....> got EOF while expecting 8 bytes from remote side
smtpServer..> copying closure...
node5...> copying path '/nix/store/g9baqjsh28swdy6mvvp4hp2by8xvd7af-ghc-8.2.2' from 'https://cache.nixos.org'...
node5...> got EOF while expecting 8 bytes from remote side
node4..> got EOF while expecting 8 bytes from remote side
node1.> got EOF while expecting 8 bytes from remote side
node3> got EOF while expecting 8 bytes from remote side
error: Multiple exceptions (5): 
  * node1: command ‘['nix-copy-closure', '--to', 'root@172.11.10.1', u'/nix/store/adxbmnfd0lrajz7dd0qkgf9ljpbyar89-nixos-system-node1-18.03pre123927.5be70c39f3e', '--use-substitutes']’ failed on machine ‘node1’ (exit code 255)
  * node2: command ‘['nix-copy-closure', '--to', 'root@172.11.10.2', u'/nix/store/908smr8wnfp53l4jq8bpnl3qid1gkimr-nixos-system-node2-18.03pre123927.5be70c39f3e', '--use-substitutes']’ failed on machine ‘node2’ (exit code 255)
  * node3: command ‘['nix-copy-closure', '--to', 'root@172.11.10.3', u'/nix/store/hxsqvgw1gd11i1vnmkj86jpirchndv62-nixos-system-node3-18.03pre123927.5be70c39f3e', '--use-substitutes']’ failed on machine ‘node3’ (exit code 255)
  * node4: command ‘['nix-copy-closure', '--to', 'root@172.11.10.4', u'/nix/store/l5s5ccyhw8zsbaynpjlw2nbl82dhh8jx-nixos-system-node4-18.03pre123927.5be70c39f3e', '--use-substitutes']’ failed on machine ‘node4’ (exit code 255)
  * node5: command ‘['nix-copy-closure', '--to', 'root@172.11.10.5', u'/nix/store/8v35mxc2gx7hl5w4ivpaqb06k4v19p0x-nixos-system-node5-18.03pre123927.5be70c39f3e', '--use-substitutes']’ failed on machine ‘node5’ (exit code 255)

------------------------------
Traceback (most recent call last):
Traceback (most recent call last):
  File "/nix/store/i9d97clz3l0pp65icjvkvdjfzlhv1yp0-nixops-1.6/bin/..nixops-wrapped-wrapped", line 995, in <module>
    e.print_all_backtraces()
  File "/nix/store/i9d97clz3l0pp65icjvkvdjfzlhv1yp0-nixops-1.6/lib/python2.7/site-packages/nixops/parallel.py", line 20, in print_all_backtraces
    traceback.print_exception(e[0], e[1], e[2])
  File "/nix/store/sz9x7gpmlpk05q952b9vq87hbgn26hkc-python-2.7.14/lib/python2.7/traceback.py", line 125, in print_exception
    print_tb(tb, limit, file)
  File "/nix/store/sz9x7gpmlpk05q952b9vq87hbgn26hkc-python-2.7.14/lib/python2.7/traceback.py", line 61, in print_tb
    f = tb.tb_frame
AttributeError: 'unicode' object has no attribute 'tb_frame'

Setting the overcommit flag suppresses the out-of-memory error, but the copy still fails.

judah added a commit to FormationAI/rules_haskell that referenced this issue May 8, 2018
Previously, CI used `nixos/nix` without a tag.  7 hours ago, a `2.0` release
was pushed to Docker Hub:

https://hub.docker.com/r/nixos/nix/tags/

Unfortunately, that seems to trigger a bug when copying the "ghc" package; see
discussion in tweag#239.

Resolved for now by fixing the Linux build to a specific docker tag.
Not sure about the corresponding macOS failures.

We should separately investigate the issues with nix-2; they may be related
to:
NixOS/nix#1988
@nh2
Contributor

nh2 commented Jun 2, 2018

Looks like this is totally breaking nixops for me too.

I can no longer deploy to AWS machines which even have 4 GB ram. Once nix-store --serve --write exceeds 50%, it crashes with node-1..> error: out of memory.

(Specifically, this happens when copying path '/nix/store/36xxgd34g5q24zdh2dvklpkwx47g7bwp-cudatoolkit-9.1.85.1' to 'ssh://root@1.2.3.4', which is a very large derivation.)

How would we even go about fixing this for nixops? When an AWS machine boots up, we have little control over what nix version is running on the other side (it's defined by the image, which even with 18.03 doesn't have a nix that takes less memory).

@nh2
Contributor

nh2 commented Jun 2, 2018

I'd also like to know why it crashes at 50%. Maybe I could at least raise that to 90% of the 4 GB for a higher chance of success?

@nh2
Contributor

nh2 commented Jun 2, 2018

How would we even go about fixing this for nixops? When an AWS machine boots up, we have little control over what nix version is running on the other side

I have a workaround: Deploy my EC2 instance with a 17.09 base image.

For that purpose, I added an option to nixops so I can simply set deployment.ec2.nixosAmiVersion = "17.09";:

https://github.com/NixOS/nixops/compare/v1.6...nh2:nixosAmiVersion-v1.6?expand=1

nh2 added a commit to nh2/nix that referenced this issue Jun 3, 2018
Fixes `error: out of memory` of `nix-store --serve --write`
when receiving packages via SSH (and perhaps other sources).

See NixOS#1681 NixOS#1969 NixOS#1988 NixOS/nixpkgs#38808.

Performance improvement on `nix-store --import` of a 2.2 GB cudatoolkit closure:

When the store path already exists:
  Before:
    10.82user 2.66system 0:20.14elapsed 66%CPU (0avgtext+0avgdata   12556maxresident)k
  After:
    11.43user 2.94system 0:16.71elapsed 86%CPU (0avgtext+0avgdata 4204664maxresident)k
When the store path doesn't yet exist (after `nix-store --delete`):
  Before:
    11.15user 2.09system 0:13.26elapsed 99%CPU (0avgtext+0avgdata 4204732maxresident)k
  After:
     5.27user 1.48system 0:06.80elapsed 99%CPU (0avgtext+0avgdata   12032maxresident)k

The reduction is 4200 MB -> 12 MB RAM usage, and it also takes less time.
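
The figures above look like the default summary line of GNU time, which reports peak resident memory in KiB as `maxresident`. That the measurements were taken with GNU time is an assumption, and `max_rss_mib` is a name invented here. Under that assumption, a small sketch for extracting the number and converting it to MiB:

```shell
# Read GNU time's default summary line(s) on stdin and print the peak
# resident set size in MiB (GNU time reports "maxresident" in KiB).
max_rss_mib() {
  sed -n 's/.*avgdata *\([0-9][0-9]*\)maxresident.*/\1/p' |
    awk '{ printf "%.0f\n", $1 / 1024 }'
}
```

Fed the two lines for the case where the store path doesn't yet exist, it prints 4106 for the "Before" line and 12 for the "After" line.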
@deepfire

deepfire commented Jun 29, 2018

I think we might potentially have two different problems here.

  1. The symptoms of copy stalling.
  2. The Nix out-of-memory.

I have (1) but not (2). In my case it was SSH itself that stalled, so I don't think Nix is at fault here.

So, for the record:

  • version: nix-env (Nix) 2.0, nixpkgs: /nix/store/jj8hjkf34j2ar0zz63jycxkkkym5kvpq-release-18.03.tar.gz
  • EC2
  • Target machines exhibit extremely weird behavior, where even just cat-ting some files over ssh makes the SSH session stall. Such files when copied retain that same stalling property. The same issue happens when nix-copy-closure tries to copy them over SSH.

UPDATE: this was traced to MTU issues. AWS DHCP sometimes advertises an MTU of 9001, and in some cases it needs to be configured to 1500, sometimes even for VPC-internal traffic. Go figure.
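
A minimal sketch for spotting this kind of MTU setting on a Linux machine, assuming the interface MTU is exposed under /sys/class/net. The names `report_mtu` and `SYSFS_NET` are hypothetical and used only for this example.

```shell
# Print an interface's MTU and flag it when it is above the common 1500 bytes.
# SYSFS_NET is a hypothetical override for the sysfs directory (useful in tests).
report_mtu() {
  mtu=$(cat "${SYSFS_NET:-/sys/class/net}/$1/mtu") || return 1
  if [ "$mtu" -gt 1500 ]; then
    echo "$1: MTU $mtu exceeds 1500"
  else
    echo "$1: MTU $mtu"
  fi
}
```

If an interface reports 9001, lowering it with `ip link set dev eth0 mtu 1500` (as root, with `eth0` as a placeholder interface name) is one way to check whether the stall goes away.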

nh2 pushed a commit to nh2/nix that referenced this issue Jun 27, 2019
It adds a new operation, cmdAddToStoreNar, that does the same thing as
the corresponding nix-daemon operation, i.e. call addToStore(). This
replaces cmdImportPaths, which has the major issue that it sends the
NAR first and the store path second, thus requiring us to store the
incoming NAR either in memory or on disk until we decide what to do
with it.

For example, this reduces the memory usage of

  $ nix copy --to 'ssh://localhost?remote-store=/tmp/nix' /nix/store/95cwv4q54dc6giaqv6q6p4r02ia2km35-blender-2.79

from 267 MiB to 12 MiB.

Probably fixes NixOS#1988.

(cherry picked from commit 2825e05)
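
To illustrate why the order of the fields matters, here is a toy sketch in shell. It is not the Nix protocol. When the name arrives first, the receiver can stream the payload straight to its destination. When the name arrives last, the whole payload has to be held in memory or on disk until the name is known. The function `receive_name_first` and the line-based framing are assumptions made for this example.

```shell
# Toy receiver: the first line on stdin names the destination file and the
# rest is the payload, so the payload can be streamed to disk as it arrives
# without being buffered in memory. No validation of the name is done here.
receive_name_first() {
  dest_dir=$1
  IFS= read -r name || return 1
  cat > "$dest_dir/$name"
}
```

A sender would write the name before the data, for example `{ echo item; cat payload.bin; } | receive_name_first /tmp/out`, where `item`, `payload.bin`, and `/tmp/out` are placeholders.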
@teto
Member

teto commented Sep 17, 2019

Just wanted to mention that I had a similar problem with nixops libvirtd. One of my bridges had an MTU of 9000. I'm not sure where the fault lies, but nix-copy-closure was stuck until it eventually timed out. Setting a more common MTU of 1400 fixed it.

@onixie

onixie commented Mar 10, 2020

I encountered a similar issue to the one @teto mentioned when using an OVS bridge with nixops libvirtd. But I think the MTU issue is outside nixops's control.
