Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experiment: use Tweag's self-hosted runners #2037

Draft
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

yannham
Copy link
Member

@yannham yannham commented Sep 12, 2024

Try to use our self hosted runners, which should be beefier, don't need to install Nix again and again, and should be able to re-use the local store as a cache mechanism as well.

Copy link
Contributor

github-actions bot commented Sep 12, 2024

🐰 Bencher Report

Branch2037/merge
Testbedubuntu-latest

⚠️ WARNING: The following Measure does not have a Threshold. Without a Threshold, no Alerts will ever be generated!

Click here to create a new Threshold
For more information, see the Threshold documentation.
To only post results if a Threshold exists, set the --ci-only-thresholds CLI flag.

Click to view all benchmark results
BenchmarkLatencynanoseconds (ns)
fibonacci 10📈 view plot
⚠️ NO THRESHOLD
505,360.00
pidigits 100📈 view plot
⚠️ NO THRESHOLD
3,282,800.00
product 30📈 view plot
⚠️ NO THRESHOLD
843,220.00
scalar 10📈 view plot
⚠️ NO THRESHOLD
1,546,100.00
sum 30📈 view plot
⚠️ NO THRESHOLD
837,780.00
🐰 View full continuous benchmarking report in Bencher

@yannham
Copy link
Member Author

yannham commented Sep 13, 2024

Thanks for your precious help @YorikSar . It seems the macOS workflow is failing because of some security stuff. We have several possibilities, I guess, from here:

  1. Revert back to use GH hosted runners (but would be sad)
  2. Revert the macOS runner to be a GH runner (but then the whole approach is moot, because we have to wait for the slowest workflow to finish)
  3. Disable Cachix action on the self-hosted runners, and have separate runners (say, every upgrade, or every week) that build and upload artifacts to Cachix from master (honestly it might be even dubious to upload store paths built from PRs to Cachix)
  4. Disabe Cachix action but only on the macOS runner. Either not upload macOS artifacts at all, or upload them in a separate workflow as in 3.
  5. Dig deeper to try to fix the Cachix action issue

@YorikSar
Copy link
Member

I think we still need to investigate results on our Linux runner because even there the job seems to spend 24m when a random different run on master finished in 3m.

As for macOS, I think we could bring this up to Cachix maintainers. Calling out to security and failing in rather simple hosted runner would probably not be something expected.

@yannham
Copy link
Member Author

yannham commented Sep 13, 2024

I think we still need to investigate results on our Linux runner because even there the job seems to spend 24m when a random different run on master finished in 3m.

I think the continuous integration for merge queue (the one you linked) is different - I don't know what happens exactly but some caching comes into play. If you take the CI on a random PR, it's usually rather around 20+ minutes so 24 minutes don't stand out, I mean, it does because it should still be quite faster on our own runner. To be honest we seem to recompile a lot of stuff on each run, and we've not really been able to entirely understand if Cachix is being hit every time, if crane doesn't really play well with something else, causing from scratch recompilation, but it does look like we're doing quite a lot of work on each new PR. My plan is to take a deeper look at that, but moving the runners to self hosted has been the first move.

As for macOS, I think we could bring this up to Cachix maintainers. Calling out to security and failing in rather simple hosted runner would probably not be something expected.

Fair. I can take care of looking for a similar issue, and if not, opening one.

@yannham
Copy link
Member Author

yannham commented Sep 16, 2024

Issue opened, and Domen suggest to bump Nixpkgs: cachix/cachix-action#190 (comment) (@YorikSar). I guess this means the nixpkgs used by the build machines, as cachix will use that one, not the input of the Nickel flake.

@YorikSar
Copy link
Member

@yannham it’s used from this flake’s inputs:

nix build --print-out-paths --inputs-from . nixpkgs#cachix.bin

@yannham
Copy link
Member Author

yannham commented Sep 16, 2024

Oh, my bad.

@YorikSar
Copy link
Member

@yannham Looks like jobs are not getting triggered. Maybe because of the conflicts?

@yannham
Copy link
Member Author

yannham commented Sep 16, 2024

Ok, Cachix seems to work fine now, but we get a meson error when trying to build nix from the latest unstable.

@YorikSar
Copy link
Member

From logs from macOS:

error: builder for '/nix/store/qfjwky2wmfjhnn15nk5lq28spgncdci0-nix-store-2.25.0pre.drv' failed with exit code 1;
       last 10 log lines:
       >                            ^^^^^^^^^^
       >   File "/nix/store/9pj4rzx5pbynkkxq1srzwjhywmcfxws3-python3-3.12.5/lib/python3.12/pathlib.py", line 875, in is_dir
       >     return S_ISDIR(self.stat().st_mode)
       >                    ^^^^^^^^^^^
       >   File "/nix/store/9pj4rzx5pbynkkxq1srzwjhywmcfxws3-python3-3.12.5/lib/python3.12/pathlib.py", line 840, in stat
       >     return os.stat(self, follow_symlinks=follow_symlinks)
       >            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       > PermissionError: [Errno 1] Operation not permitted: '/nix/var'
       >
       > ERROR: Unhandled python OSError. This is probably not a Meson bug, but an issue with your build environment.
       For full logs, run 'nix log /nix/store/qfjwky2wmfjhnn15nk5lq28spgncdci0-nix-store-2.25.0pre.drv'.
error: 1 dependencies of derivation '/nix/store/crpmjh1glrbknz2dlkd0p8m6a8c784iq-nickel-lang-clippy-1.8.0.drv' failed to build

It seems this derivation wants to access something outside its sandbox... Although there is also this line at the top that might be related:

warning: Ignoring setting 'impure-env' because experimental feature 'configurable-impure-env' is not enabled

What does this nix-store derivation do and where does it come from?..

@yannham
Copy link
Member Author

yannham commented Sep 16, 2024

There is an experimental feature to evaluate a Nix expression in Nickel. It's been written before the Nix C API, so it binds directly to the Nix C++ API. Before we just used the default package of the github:nixos/nix flake and everything worked well. I guess some refactoring has been done in between, because after updating the flake.lock Nickel (well pkg-config to be precise) couldn't find the libraries nix-expr, nix-store, etc. any more.

I saw that the nixos/nix flake is exporting additional outputs, that match precisely the lib that we are using, so I added those to the buildInputs of the corresponding derivation and it built fine on my system again.

@yannham
Copy link
Member Author

yannham commented Sep 16, 2024

Maybe one solution would be to try to migrate that to use the Nix C API (we only really need to evaluate expressions, which should be quite straightforward), but I initially didn't want to go this route for this PR that has nothing to do with that, so I tried to get it working with minimal tweaking first.

@YorikSar
Copy link
Member

I would assume something was merged related to using meson in nixos/nix repo. Our macOS runner has sandbox enabled in Nix which is disabled by default, which is probably why this issue didn't come up before on hosted runners.

@yannham
Copy link
Member Author

yannham commented Sep 16, 2024

Right, it seems at the beginning of this summer, they started to switch to meson. I guess we're up for an issue.

"owner": "nixos",
"repo": "nix",
"rev": "0ab9369572f64b1ab70a8db29f79ae730ff31ab6",
"rev": "2c42a9dbaa805f4f29561d9a1c10b41dfe98dcfa",
"type": "github"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yannham I see you've updated all inputs here. Maybe rollback nixos/nix input to the previous value to see if this helps?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we can do that of course since Nix isn't taken from Nixpkgs. Good catch, I'll try that

@YorikSar
Copy link
Member

Another option would be to revert changes to flake.lock and just add a new input for nixpkgs from which we'll install cachix. We could also install it from cachix repo directly instead. I think it should exclude issues that have risen because of this update.

@yannham
Copy link
Member Author

yannham commented Sep 16, 2024

On the other hand we haven't been updating the nixpkgs input for some time, and I think it's good that we do - although I can cherry-pick the changes into a separate PR. But the thing is that now the error seems related to the extra Nix input that should have been reverted to a previous version. And I can't get it to not do the checks for some reason.

@yannham
Copy link
Member Author

yannham commented Sep 16, 2024

Ah, but the inputs of nix-input follows nixpkgs, so that is still a difference with master. Apparently they fixed like a few hours ago the meson build on Darwin, so let me try to update even more nix, and if that doesn't work, maybe your approach is the most sensible one.

@yannham
Copy link
Member Author

yannham commented Sep 16, 2024

John ask if it's possible that build02 identifies as something else than macos: NixOS/nix#2503 (comment). I have no idea.

@YorikSar
Copy link
Member

Even though Linux job failed to build Nix this time, it looks like macOS one did something for 6 minutes, so we can assume it didn't break on Nix build. I think we should make that weird step output logs always with something like trap. By the way, I see NixOS/nix#8949 mentioned there, but from the discussion there it seems like piping into cat should work just as well as into file. Maybe we should we do that instead? Another option would be to run tail -f logs & before starting the build.

@yannham
Copy link
Member Author

yannham commented Sep 24, 2024

It's really annoying - I'm basically trying many different tags but either Boehm or Nix checks fail locally (so not even on MacOS). It's surprising that something as central as Nix is not reproducible to build on a pretty standard config (Debian unstable with Nix as a package manager).

@yannham
Copy link
Member Author

yannham commented Sep 24, 2024

Oh, I think this is because we follow the nixpkgs inputs from the main flake. It seems the Boehmgc update 8.2.4 -> 8.2.6 made a Nix-specific patch obsolete. The version of Nix I'm trying to build are anterior to this patch removal but I'm using latest Nixpkgs, I guess.

@yannham
Copy link
Member Author

yannham commented Sep 24, 2024

The checks now work locally, but fails on the Linux builder for some reason:

nix>     error: path '/build/nix-test/nars/out/â' already exists
nix>     ++(nars.sh:97) onError
nix>     ++(/build/source/tests/functional/common/functions.sh:237) set +x
nix>     nars.sh: test failed at:
nix>       main in nars.sh:97

This is the line in question: https://github.com/NixOS/nix/blob/322d2c767f2a3f8ef2ac3d1ba46c19caf9a1ffce/tests/functional/nars.sh#L97

@yannham
Copy link
Member Author

yannham commented Sep 24, 2024

This is the complete test: https://github.com/NixOS/nix/blob/master/tests/functional/nars.sh#L91-L100

@YorikSar What's the precise system of build01?

@YorikSar
Copy link
Member

What's the precise system of build01?

Which system are you referring to? It's Zen 4 EPYC with x86_64 Linux (NixOS 24.05). Filesystem is ZFS.

@yannham
Copy link
Member Author

yannham commented Sep 24, 2024

Thanks, I meant precisely this to include in a potential issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants