NetworkManager's nm-online kills nixos-rebuild #180175

Open
tbidne opened this issue Jul 4, 2022 · 81 comments · Fixed by #344678
Labels
0.kind: bug (Something is broken)
6.topic: nixos (Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS)

Comments

@tbidne
Contributor

tbidne commented Jul 4, 2022

Describe the bug

The systemd service NetworkManager-wait-online.service can prevent nixos-rebuild from succeeding:

warning: the following units failed: NetworkManager-wait-online.service

× NetworkManager-wait-online.service - Network Manager Wait Online
     Loaded: loaded (/etc/systemd/system/NetworkManager-wait-online.service; enabled; vendor preset: enabled)
    Drop-In: /nix/store/k5yq51spcggip2h6aq1y0bydkpr4zahc-system-units/NetworkManager-wait-online.service.d
             └─overrides.conf
     Active: failed (Result: exit-code) since Tue 2022-07-05 10:18:52 NZST; 36ms ago
       Docs: man:nm-online(1)
    Process: 1258376 ExecStart=/nix/store/b4yhg54s70i0v0k1qnnv8vnja6018yrh-networkmanager-1.38.2/bin/nm-online -s -q (code=exited, status=1/FAILURE)
   Main PID: 1258376 (code=exited, status=1/FAILURE)
         IP: 0B in, 0B out
        CPU: 22ms

Jul 05 10:17:52 nixos systemd[1]: Starting Network Manager Wait Online...
Jul 05 10:18:52 nixos systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Jul 05 10:18:52 nixos systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.

This service runs nm-online -s -q, and the nm-online man page says:

-s | --wait-for-startup
           Wait for NetworkManager startup to complete, rather than waiting for network connectivity specifically. Startup is
           considered complete once NetworkManager has activated (or attempted to activate) every auto-activate connection
           which is available given the current network state. This corresponds to the moment when NetworkManager logs "startup
           complete". This mode is generally only useful at boot time. After startup has completed, nm-online -s will just
           return immediately, regardless of the current network state.

           There are various ways to affect when startup complete is reached. For details see NetworkManager-wait-
           online.service(8).

This corresponds to the moment when NetworkManager logs "startup complete". This mode is generally only useful at boot time.

I am not familiar with this tool, but my experience is that after my laptop has been up for some time (e.g. days), nm-online will often return an error code rather than correctly determine the network is up, thus killing any future nixos-rebuild commands.

Steps To Reproduce

Steps to reproduce the behavior:

  1. Get your machine in a state where nm-online -s -q does not return success (not sure how to do this on demand).
  2. Witness nixos-rebuild failure.

Expected behavior

nixos-rebuild should not fail due to an erroneous network check.

Additional context

This is tricky as it is not a nix issue per se but rather an issue with a presumably flaky systemd service. It is easy enough to disable this service manually:

systemd.services.NetworkManager-wait-online.enable = false;

And perhaps this is the best solution. But a number of my coworkers all ran into this issue independently, so I thought it merited an issue for discoverability, if nothing else. My gut reaction is that a flaky check should probably not be required by default, but I don't know enough about this service's importance/fragility to say.

This issue was noticed only recently, both on nixos-unstable and nixos-22.05.

Notify maintainers

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

[user@system:~]$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 5.15.47, NixOS, 22.05 (Quokka)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.8.1`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`

Thanks!

tbidne added the "0.kind: bug" label Jul 4, 2022
@tbidne
Contributor Author

tbidne commented Jul 5, 2022

Possibly it relates to this? #178046

Edit: Never mind, that commit is not in nixos-22.05.

veprbl added the "6.topic: nixos" label Jul 5, 2022
@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/how-to-disable-networkmanager-wait-online-service-in-the-configuration-file/19963/4

@lorenz
Contributor

lorenz commented Jul 16, 2022

This is related to udev not initializing devices. NetworkManager never completes startup because a WireGuard interface is never initialized by udev. A workaround is just putting the affected device into networking.networkmanager.unmanaged.
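
For illustration, a minimal sketch of that workaround in configuration.nix. The interface name wg0 is a placeholder (an assumption, not taken from this thread); substitute whichever device NetworkManager is waiting on. The entry uses NetworkManager's device-specifier syntax.

```nix
{
  # Tell NetworkManager to leave this interface alone so that it no longer
  # holds up "startup complete" (and therefore nm-online -s).
  # "wg0" is a hypothetical interface name; use your own.
  networking.networkmanager.unmanaged = [ "interface-name:wg0" ];
}
```

This only sidesteps the wait for that one device; it does not address the underlying udev initialization problem.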

@DeskworkTrickster

DeskworkTrickster commented Aug 9, 2022

I'm wondering if there is any solution to this, since this always triggers for me. So, I'm not even sure if my nixos-rebuild switch does actually complete.

As mentioned in the initial description of the issue:

> This corresponds to the moment when NetworkManager logs "startup complete". This mode is generally only useful at boot time.

This never seems to be the case, since the actual message is "Started Network Manager.".
Apart from that I'm wondering why it relies on a log message string, when the status of NetworkManager.service would be way less prone to errors.

EDIT: no idea why, but it went away ...
Everything works as expected again.

@oati
Contributor

oati commented Aug 22, 2022

I think I'm running into this.

systemd-networkd-wait-online.service

Aug 22 08:38:49 erin-laptop systemd[1]: Starting Wait for Network to be Configured...
Aug 22 08:40:49 erin-laptop systemd-networkd-wait-online[26249]: Timeout occurred while waiting for network connectivity.
Aug 22 08:40:49 erin-laptop systemd[1]: systemd-networkd-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Aug 22 08:40:49 erin-laptop systemd[1]: systemd-networkd-wait-online.service: Failed with result 'exit-code'.
Aug 22 08:40:49 erin-laptop systemd[1]: Failed to start Wait for Network to be Configured.

It only happens after I've connected my USB-C dock with an ethernet connection at least once after boot. (Note: I'm running tmpfs on root, so my system should "forget" everything about my dock on reboot.)

I'm also running systemd-networkd, not NetworkManager.

jammus added a commit to jammus/dotfiles that referenced this issue Aug 23, 2022
@NorfairKing
Contributor

@Stale don't you dare!

svrana added a commit to svrana/nix-home that referenced this issue Oct 5, 2022
Attempt to fix nm-online-service from stalling on tailscale
interface. See NixOS/nixpkgs#180175
svrana added a commit to svrana/nix-home that referenced this issue Oct 5, 2022
@pjones
Contributor

pjones commented Oct 10, 2022

@ncfavier Any chance you can help with this?

@ncfavier
Member

I don't use NetworkManager so I wouldn't know, but in the case of systemd-networkd there are relevant options under systemd.network.wait-online: anyInterface and ignoredInterfaces. I recommend at least setting the former to true on laptops.
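
As a rough sketch, the two options mentioned above could be set like this. The interface name wg0 is a placeholder (an assumption for illustration); list whichever interfaces should not block the wait.

```nix
{
  systemd.network.wait-online = {
    # Consider the network online as soon as any one interface is online,
    # rather than waiting for all of them.
    anyInterface = true;
    # Interfaces that systemd-networkd-wait-online should skip.
    # "wg0" is a hypothetical interface name.
    ignoredInterfaces = [ "wg0" ];
  };
}
```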

> So, I'm not even sure if my nixos-rebuild switch does actually complete.

Warning about failed units is pretty much the last thing that the activation script does, so it's probably fine (but the failure should be fixed, of course).

@lorenz
Contributor

lorenz commented Oct 10, 2022

BTW I've "fixed" this by setting

# udev 250 doesn't reliably reinitialize devices after restart
systemd.services.systemd-udevd.restartIfChanged = false;

But this is really an upstream systemd bug.

jack-michaud added a commit to jack-michaud/nix that referenced this issue Oct 15, 2022
Temporarily fixed by disabling nm-wait-online
NixOS/nixpkgs#180175
@blaggacao
Contributor

blaggacao commented Nov 7, 2022

My Ubuntu has:

❯ systemctl cat NetworkManager-wait-online.service
# /lib/systemd/system/NetworkManager-wait-online.service
[Unit]
Description=Network Manager Wait Online
Documentation=man:nm-online(1)
Requires=NetworkManager.service
After=NetworkManager.service
Before=network-online.target

[Service]
# `nm-online -s` waits until the point when NetworkManager logs
# "startup complete". That is when startup actions are settled and
# devices and profiles reached a conclusive activated or deactivated
# state. It depends on which profiles are configured to autoconnect and
# also depends on profile settings like ipv4.may-fail/ipv6.may-fail,
# which affect when a profile is considered fully activated.
# Check NetworkManager logs to find out why wait-online takes a certain
# time.

Type=oneshot
ExecStart=/usr/bin/nm-online -s -q
RemainAfterExit=yes

# Set $NM_ONLINE_TIMEOUT variable for timeout in seconds.
# Edit with `systemctl edit NetworkManager-wait-online`.
#
# Note, this timeout should commonly not be reached. If your boot
# gets delayed too long, then the solution is usually not to decrease
# the timeout, but to fix your setup so that the connected state
# gets reached earlier.
Environment=NM_ONLINE_TIMEOUT=60

[Install]

My latest NixOS (22.05) config has:

> nix-repl> c.config.systemd.services.NetworkManager-wait-online
{ after = [ ... ]; aliases = [ ... ]; before = [ ... ]; bindsTo = [ ... ]; confinement = { ... }; conflicts = [ ... ]; description = ""; documentation = [ ... ]; enable = false; environment = { ... }; jobScripts = [ ... ]; onFailure = [ ... ]; partOf = [ ... ]; path = [ ... ]; postStart = ""; postStop = ""; preStart = ""; preStop = ""; reload = ""; reloadIfChanged = false; reloadTriggers = [ ... ]; requiredBy = [ ... ]; requires = [ ... ]; requisite = [ ... ]; restartIfChanged = true; restartTriggers = [ ... ]; runner = error: attribute 'ExecStart' missing

       at /nix/store/6dgpkrc0gxlndr4j2524ihlsr8209ph7-source/nixos/modules/testing/service-runner.nix:65:9:

           64|       my $cmd = <<END_CMD;
           65|       ${service.serviceConfig.ExecStart}
             |         ^
           66|       END_CMD
«derivation

based on this definition:

    systemd.services.NetworkManager-wait-online = {
      wantedBy = [ "network-online.target" ];
    };

Where is this ExecStart coming from in your configurations?

Also:

nixpkgs on fix/teamviewer-service-deps [$] rg 'nm-online'
[ nothing ]

@bjornfor
Contributor

bjornfor commented Nov 7, 2022

@blaggacao: The reference to nm-online comes from upstream service unit NetworkManager-wait-online.service, not from nixpkgs itself.

@domenkozar
Member

I'd vote for disabling this service until we can make it reliable. It's doing no good currently.

@pinpox
Member

pinpox commented Mar 17, 2023

I've been tripping over this bug for quite some time now and it is annoying for users. As mentioned above, the error can be worked around with:

systemd.services.NetworkManager-wait-online.enable = lib.mkForce false;
systemd.services.systemd-networkd-wait-online.enable = lib.mkForce false;

I was concerned that there might be other dependencies or services that require this to be enabled, so I grepped through nixpkgs for both. These are the mentions:

NetworkManager-wait-online.service

systemd-networkd-wait-online.service

  • modules/system/boot/networkd Service definition
  • Used in various nixos tests, which should not be relevant to normal operation of the system (?)
    • nixos/tests/systemd-networkd-dhcpserver-static-leases.nix
    • nixos/tests/kea.nix
    • nixos/tests/systemd-networkd.nix
    • nixos/tests/systemd-bpf.nix
    • nixos/tests/systemd-networkd-dhcpserver.nix

TL;DR

At first glance, the usage of these two services seems minimal to me, and they are causing more problems than they solve. Agreeing with @domenkozar's proposal, I'd vote to disable them by default. If this is agreed upon, I can submit a PR.

@matthiasbeyer
Contributor

I've been running with that service disabled for 6 months now and have not experienced a single issue. Don't count my voice too heavily, though 😉 ! 👍

@lorenz
Contributor

lorenz commented Mar 17, 2023

If we're going to work around this I'd still prefer systemd.services.systemd-udevd.restartIfChanged = false; as the other workaround just masks the issue while udev's still half-broken.

gorschu added a commit to gorschu/nix-config that referenced this issue Mar 18, 2023
archer-65 added a commit to archer-65/nix-dotfiles that referenced this issue Mar 21, 2023
@pinpox
Member

pinpox commented Mar 22, 2023

> If we're going to work around this I'd still prefer systemd.services.systemd-udevd.restartIfChanged = false; as the other workaround just masks the issue while udev's still half-broken.

For some reason, that didn't work for me. On rebuild it said "not restarting service"

kalbasit added a commit to kalbasit/soxincfg that referenced this issue Apr 3, 2023
@supermarin
Contributor

Yep confirming what @pinpox said:

updating GRUB 2 menu...
NOT restarting the following changed units: systemd-udevd.service

Still seeing the issue on HEAD. Disabling wait-online as mentioned previously fixes nixos-rebuild.

systemd.services.NetworkManager-wait-online.enable = lib.mkForce false;
systemd.services.systemd-networkd-wait-online.enable = lib.mkForce false;

@AkechiShiro
Contributor

AkechiShiro commented Sep 28, 2024

Is this only fixed for tailscaled users? Which of the dozen workarounds shared here should users who don't run tailscaled apply? Is this really a good solution for new users? Shouldn't something also be added to the NetworkManager module?

Quoting previous comments:
#180175 (comment)

@supermarin
Contributor

@AkechiShiro have you observed this without using Tailscale?

As far as I understand, the issue comes down to Tailscale starting before NetworkManager while depending on it being online, which essentially ends in a deadlock.

@pinpox
Member

pinpox commented Sep 28, 2024

The same issue happened to me with wireguard in the past. I'll try the suggested fix

@AkechiShiro
Contributor

@supermarin I have yes and I believe I'm not the only one

@supermarin
Contributor

supermarin commented Sep 28, 2024

Mind posting your configuration.nix in a gist?
We should be able to create a minimal flake and reproduce this with build-vm. Maybe cut out the VM's network access on purpose.

We should reopen this issue then if this happens outside of just Tailscale & NM.

EDIT: do you use WireGuard? Tailscale uses WireGuard, so that could be the lowest common denominator.

@AkechiShiro
Contributor

AkechiShiro commented Sep 28, 2024

I do use Wireguard yes, so it could be I have the same issue as @pinpox

Regarding my configuration.nix, it is all over the place currently. A minimal flake would be great, but I can't provide my whole config at the moment. If I have time, I'll try in a VM with a single WireGuard interface set up.

@Atemu
Member

Atemu commented Sep 28, 2024

Can anyone confirm whether they still experience this issue without using tailscaled?

@sgraf812
Contributor

sgraf812 commented Sep 30, 2024

As recently as two years ago, I had this issue without knowingly using tailscaled. (Posted to #59603.)
Since then, I've been using the workaround from the OP and have lived happily.

The problem is perhaps that tailscaled is not the only service that activates NetworkManager-wait-online.service, so it doesn't make sense to fix this issue in tailscaled.

Perhaps this issue has since been fixed by an unrelated patch upstream (however that would happen), but I'm not optimistic. Multiple people (for example #180175 (comment), linking to #182449) have pointed out that this is probably an upstream bug in systemd/udev. Let me try to get rid of the workaround and see if it is still an issue.

@AkechiShiro
Contributor

Thanks @sgraf812. I may try to reach out to the systemd/udev maintainers about issue #182449; hopefully we can nail this issue down once and for all.

@AkechiShiro
Contributor

We did receive an answer from Poettering, but I don't know how to reply properly. If anyone more experienced could pitch in: systemd/systemd#34585

github-actions bot pushed a commit that referenced this issue Oct 5, 2024
The wait will only be enabled on machines with NetworkManager enabled.

Closes #180175

(cherry picked from commit 0d822cc)
presto8 pushed a commit to presto8/nixpkgs that referenced this issue Oct 9, 2024
The wait will only be enabled on machines with NetworkManager enabled.

Closes NixOS#180175

(cherry picked from commit 0d822cc)
@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/rebuild-error-failed-to-start-network-manager-wait-online/41977/5

@Atemu
Member

Atemu commented Oct 16, 2024

Just ran into this again with tailscaled.

@Atemu Atemu reopened this Oct 16, 2024
@supermarin
Contributor

@Atemu ran into it as well, but already had a running system with tailscale up. Manually shut down tailscale, and rebuild succeeded (and tailscale was back up). Tested with several rebuilds and restarts, works ok so far

@Atemu
Member

Atemu commented Oct 16, 2024

An idea I had on this is that we could perhaps hack around this using a systemd unit that Conflicts the units which appear to block nm-online.target such as tailscaled. It'd be Required by the nm-online target and we'd then have another target which wants all those units and is wanted by the network-online target but After the nm-online target.

srid added a commit to srid/nixos-config that referenced this issue Oct 22, 2024
@inmaldrerah
Contributor

Maybe this would help? (I closed because earlier I thought it was obsolete)

And I've made a small flake version for testing purpose: https://github.com/inmaldrerah/nixos-extensions

At least the last time I used this, I didn't have to systemctl stop tailscaled beforehand when tailscale was updated.

@supermarin
Contributor

I can try testing with it.
Think the repro steps are:

  1. Have a system with Tailscale and NM-wait-online
  2. Disable nm-wait-online with the hack above
  3. Put nm-wait-online back in

On the next rebuild it should get stuck. I'll try this later today and report back if it reproduces consistently. Note: the Tailscale fix was backported to 24.05, so you need to pin nixpkgs to a commit prior to that.

@Atemu
Member

Atemu commented Oct 25, 2024

Note that this issue occurs even with the supposed fix.

aftix added a commit to aftix/cfg that referenced this issue Nov 7, 2024
….service timing out on generation activation

This is caused by nmcli waiting on the mullvad WireGuard network device to be up,
which never happens. The workaround is adding the device, wg0-mullvad, to the unmanaged devices list.
Upstream issue is NixOS/nixpkgs#180175 .
@JonnieCache

Also still experiencing this with tailscaled on nixpkgs unstable bc947f541ae55e999ffdb4013441347d83b00feb (I think that's the right sha.)

If I stop the tailscaled unit I can then rebuild OK, and the service comes back up again.

@supermarin
Contributor

Yeah just ran into it as well :(
@inmaldrerah mind posting what's the best way to test your flake?
I looked into it quickly but not 100% clear on how to plug in switch-to-configuration-ng.

huang12zheng pushed a commit to huang12zheng/nixos-config that referenced this issue Dec 5, 2024
@Munksgaard
Contributor

I'm still running into this problem. However, I'm not using NetworkManager, but systemd-networkd. Adding the following line to my config seemed to fix things:

systemd.services.tailscaled.after = ["systemd-networkd-wait-online.service"];

Perhaps we should also add that line to the tailscale service definition?

@inmaldrerah
Contributor

inmaldrerah commented Dec 18, 2024

> Yeah just ran into it as well :( @inmaldrerah mind posting what's the best way to test your flake? I looked into it quickly but not 100% clear on how to plug in switch-to-configuration-ng.

That flake can be used by applying it as an overlay over nixpkgs, disabling system.switch.enable, and enabling system.switch.enableNg in configuration.nix.

I haven't been using my flake for a while now, but I still keep system.switch.enable = false and system.switch.enableNg = true, and for some reason I am no longer experiencing this problem on the unstable branch (up to a73246e2eef4c6ed172979932bc80e1404ba2d56).

Edit: I checked the code of nixos/modules/system/activation/switchable-system.nix and found that I don't have to disable system.switch.enable now, and system.switch.enableNg = system.switch.enable (= true) is now default.
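
For reference, a sketch of the two settings described above as they would appear in configuration.nix. This assumes a nixpkgs revision where the two switch implementations were still separate options; per the edit above, newer revisions enable the Ng implementation by default, and these options may behave differently or no longer exist there. The overlay for the flake itself is not shown.

```nix
{
  # Turn off the original switch-to-configuration implementation ...
  system.switch.enable = false;
  # ... and use switch-to-configuration-ng instead.
  # Assumption: only meaningful on revisions that still have this option.
  system.switch.enableNg = true;
}
```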

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/error-network-wait-online-service/57902/3
