nixos-unstable blocked on evaluation error “error: unexpected EOF reading a line” #162317

Closed
andersk opened this issue Mar 1, 2022 · 24 comments
Labels
0.kind: bug Something is broken

Comments

@andersk
Contributor

andersk commented Mar 1, 2022

nixos-unstable has been blocked for a week, and currently fails to evaluate with the dreaded “error: unexpected EOF reading a line”, which gives no information about where the real failure originated.

Previous/related issues:

@andersk added the "0.kind: bug" and "1.severity: channel blocker" labels Mar 1, 2022
@vcunat
Member

vcunat commented Mar 1, 2022

The failure is almost surely running out of memory. At least that's what was reported several days ago.

@vcunat
Member

vcunat commented Mar 1, 2022

Hydra has made many evaluation attempts in the past few days, as I increased the frequency to 10/day in the hope that at least sometimes we could get lucky and fit into RAM, but apparently memory usage is too high now. I have now also tried with an empty limitedSupportedSystems.

/cc @grahamc

@vcunat
Member

vcunat commented Mar 1, 2022

I wonder if that limitedSupportedSystems change broke it, resulting in:

{UNKNOWN}: Died at /nix/store/19xkba2a2psaknwn09yi76sdmxm4l0d9-hydra-0.1.20220222.a254612/bin/.hydra-eval-jobset-wrapped line 802. at /nix/store/2d7iizyv1mkg6wklbsv92qzkq0vizn7j-hydra-perl-deps/lib/perl5/site_perl/5.34.0/Catalyst/Model/DBIC/Schema.pm line 526

@vcunat
Member

vcunat commented Mar 1, 2022

I suspect that the few exceptional successes a week ago were thanks to lower RAM usage soon after an evaluator restart. I have no idea why that would happen, but that's what I thought after looking at the graphs of ceres https://monitoring.nixos.org/grafana/d/hkRCcV0mk/instance-metrics?orgId=1&from=1645236595934&to=1645487941388 where usage only very rarely gets low (<20%), like some kind of "memory leak".

@andersk
Contributor Author

andersk commented Mar 1, 2022

On my system, with the version of hydra-unstable in nixpkgs, hydra-eval-jobs -I . nixos/release-combined.nix --option allow-import-from-derivation false --verbose dies with

error: aggregate job 'tested' references non-existent job 'nixos.tests.nano.x86_64-linux'

presumably due to

With the newer version of hydra-unstable from #160202, this isn’t fatal to hydra-eval-jobs (NixOS/hydra#1025), though I’m guessing it would still be channel-blocking.
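To illustrate the kind of check behind the "references non-existent job" error above, here is a minimal sketch in Python. It is not Hydra's actual implementation: the function name `missing_constituents`, the data layout, and the job name `nixos.tests.example.x86_64-linux` are assumptions made for illustration only.

```python
def missing_constituents(jobs, aggregates):
    """Return {aggregate name: [constituent names absent from jobs]}.

    `jobs` is an iterable of evaluated job names; `aggregates` maps each
    aggregate job name to the list of jobs it references. Both shapes are
    hypothetical and only model the idea, not Hydra's data structures.
    """
    known = set(jobs)
    missing = {}
    for name, constituents in aggregates.items():
        absent = [c for c in constituents if c not in known]
        if absent:
            missing[name] = absent
    return missing


if __name__ == "__main__":
    # Hypothetical evaluation result: the nano test no longer exists.
    jobs = ["nixos.tests.example.x86_64-linux", "tested"]
    aggregates = {
        "tested": [
            "nixos.tests.example.x86_64-linux",
            "nixos.tests.nano.x86_64-linux",
        ]
    }
    for agg, absent in missing_constituents(jobs, aggregates).items():
        for job in absent:
            print(f"error: aggregate job '{agg}' references non-existent job '{job}'")
```

Whether a missing constituent aborts the whole evaluation or is only reported is a policy decision; per the comment above, the newer hydra-unstable no longer treats it as fatal in hydra-eval-jobs.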

@vcunat
Member

vcunat commented Mar 1, 2022

Thanks, dropped in 18bd82e.

@vcunat
Member

vcunat commented Mar 1, 2022

Wow, we got a successful evaluation: https://hydra.nixos.org/eval/1746348

It's a shame that the "Evaluation Errors" tab is so often unusable (in our practice).

@vcunat
Member

vcunat commented Mar 2, 2022

Well, we got a channel bump, but it still won't evaluate successfully in default setting.

@Artturin removed the "1.severity: channel blocker" label Mar 3, 2022
@vcunat
Member

vcunat commented Mar 3, 2022

OK, so I dropped aarch64-linux from limitedSupportedSystems, as it seems better than not getting any evaluation, but we're apparently near our RAM limits. Locally the eval succeeded with 32G RAM:

$ hydra-eval-jobs nixos/release-combined.nix -I . --verbose
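For anyone who wants to reproduce the local evaluation above and see how close it gets to a RAM limit, the following is a rough sketch of a wrapper that runs a command and reports peak memory. The helper name `run_and_report_peak_rss` and the script name used below are hypothetical. It relies on `resource.getrusage(RUSAGE_CHILDREN)`, which reports the peak resident set size of the largest single waited-for child (in KiB on Linux), so for an evaluator that spreads work over several processes the total footprint can be higher than the number printed.

```python
import resource
import subprocess
import sys


def run_and_report_peak_rss(argv):
    """Run argv to completion and return (exit code, peak RSS).

    The peak RSS is that of the largest waited-for child process, as
    reported by getrusage; the unit is KiB on Linux (bytes on macOS).
    """
    proc = subprocess.run(argv)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return proc.returncode, usage.ru_maxrss


if __name__ == "__main__" and len(sys.argv) > 1:
    code, peak = run_and_report_peak_rss(sys.argv[1:])
    # Assumes Linux units (KiB) when converting to GiB.
    print(f"exit code {code}, peak RSS of largest child: {peak / 1024 / 1024:.2f} GiB")
```

Assuming the script is saved as `peak_rss.py`, it could be invoked as `python peak_rss.py hydra-eval-jobs nixos/release-combined.nix -I . --verbose`.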

@leiserfg
Contributor

leiserfg commented Mar 22, 2022

Is it happening again? One week without unstable updates already.

@drupol
Contributor

drupol commented Mar 22, 2022

Is it happening again? One week without unstable updates already.

I got an update yesterday, but I don't know what is happening today; it does seem to be stuck (https://hydra.nixos.org/job/nixos/trunk-combined/tested#tabs-constituents).

@vcunat
Member

vcunat commented Mar 22, 2022

I can't see anything stuck. nixos-unstable is on a two-day-old commit, and evals happen every day. You need to be more specific about what you think is wrong.

@drupol
Contributor

drupol commented Mar 22, 2022

You need to be more specific about what you think is wrong.

I'm not saying something is wrong on my side. However, to my NixOS rookie eye, builds sometimes finish really fast (like the last 3 successful builds 2 days ago) and sometimes take much longer, which gives me the impression that something is stuck somewhere.

@leiserfg
Contributor

leiserfg commented Mar 22, 2022

Sorry, it is nixpkgs-unstable that is blocked.
https://status.nixos.org/

@vcunat
Member

vcunat commented Mar 22, 2022

Yes, that's because our darwin resources were low, so it's been taking longer.

@drupol
Contributor

drupol commented Mar 23, 2022

Yes, that's because our darwin resources were low, so it's been taking longer.

Yes indeed, it's extremely slow.

@leiserfg
Contributor

8 days already, versionitis is killing me 😄.

@vcunat
Member

vcunat commented Apr 7, 2022

Thanks to the foundation and Graham, the evaluator was migrated to a much more powerful machine. So these RAM issues shouldn't happen anymore, and I lifted the temporary restriction of aarch64-linux from the primary jobset, since https://hydra.nixos.org/eval/1753830

@vcunat
Member

vcunat commented Apr 7, 2022

Let's close this, so that we don't accumulate more not-that-much related issues into the same thread.

@vcunat closed this as completed Apr 7, 2022
@nh2
Contributor

nh2 commented Apr 7, 2022

the evaluator was migrated to a much more powerful machine. So these RAM issues shouldn't happen anymore

@vcunat @grahamc Out of interest, how much RAM does that machine have? I would like to know how much one should provision for this type of task.

@vcunat
Member

vcunat commented Apr 7, 2022

The old, not-quite-sufficient one was a PX62-NVME: 6 cores, 64G RAM, https://www.hetzner.com/dedicated-rootserver/matrix-px

Note that the postgres DB runs on a different machine.

@nh2
Contributor

nh2 commented Apr 7, 2022

And the new sufficient one?

@vcunat
Member

vcunat commented Apr 7, 2022

I don't know exactly, but the RAM insufficiency came just from the large number of NixOS tests in a single evaluation. 100k normal builds in a single evaluation was OK.

@ncfavier
Member

ncfavier commented Apr 7, 2022

I think I saw Hetzner AX101 mentioned on Matrix? (that would be 128 GB)
