Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nixos/acme: nginx configuration test fails during nixos-rebuild switch when you have an existing virtualhost and add a new one #60180

Closed
arianvp opened this issue Apr 24, 2019 · 12 comments · Fixed by #60219

Comments

@arianvp
Copy link
Member

arianvp commented Apr 24, 2019

Issue description

nginx failed to restart after I added a second virtualHost to my nginx config.
running nixos-rebuild switch a second time fixed the issue. But my website went down due to a nixos-rebuild switch which seems like there is a serious bug in the enableACME logic in the nginx module

Friends of me also ran into this multiple times last week at their company

Steps to reproduce

  1. Have a virtualHost with enableACME = true
  2. Add an extra virtualHost with enableACME = true and nixos-rebuild switch
  3. The nginx prestart script (nginx -t) fails

This was the specific commit that failed to deploy: arianvp/nixos-stuff@4fe0a77

Logs:

Apr 24 14:27:56 arianvp.me systemd[1]: nginx.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Apr 24 14:27:56 arianvp.me systemd[1]: Stopping Nginx Web Server...
-- Subject: Unit nginx.service has begun shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nginx.service has begun shutting down.
Apr 24 14:27:56 arianvp.me systemd[1]: Stopped Nginx Web Server.
-- Subject: Unit nginx.service has finished shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nginx.service has finished shutting down.
Apr 24 14:27:56 arianvp.me systemd[1]: Starting Nginx Web Server...
-- Subject: Unit nginx.service has begun start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nginx.service has begun starting up.
Apr 24 14:27:56 arianvp.me s0966gsjr6cilylnniqzh5l35z99jyng-unit-script-nginx-pre-start[18364]: nginx: [emerg] BIO_new_file("/var/lib/acme/techstock.photos/fullchain.pem") fai>
Apr 24 14:27:56 arianvp.me s0966gsjr6cilylnniqzh5l35z99jyng-unit-script-nginx-pre-start[18364]: nginx: configuration file /nix/store/5g561hwc1lw3bbc29qdr83fwl5iykjnv-nginx.con>
Apr 24 14:27:56 arianvp.me systemd[1]: nginx.service: Control process exited, code=exited status=1
Apr 24 14:27:56 arianvp.me systemd[1]: nginx.service: Failed with result 'exit-code'.
Apr 24 14:27:56 arianvp.me systemd[1]: Failed to start Nginx Web Server.
-- Subject: Unit nginx.service has failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nginx.service has failed.
-- 
-- The result is RESULT.

Technical details

Nixos 19.03 but also happened on 18.03 to me in the past

@arianvp
Copy link
Member Author

arianvp commented Apr 24, 2019

I can reliably reproduce the bug now.

First deploy:

{ ... }:
{
 services.nginx.virtualHosts."a.example.com" = {
      forceSSL = true;
      enableACME = true;
};
}

and after that, deploy:

{ ... }:
{
 services.nginx.virtualHosts."a.example.com" = {
      forceSSL = true;
      enableACME = true;
};
 services.nginx.virtualHosts."b.example.com" = {
      forceSSL = true;
      enableACME = true;
};
}

Nginx should fail to start now.

If you manually sudo systemctl restart nginx a few seconds later, everything works. So
it seems like some kind of race coniditon

However, im not sure how to make a regression
test for bugs in switching configurations. #20886 seems relevant for that.

I noticed actually another bug here. if a preStart script failed for a systemd unit, nixos-rebuild switch
doesn't call systemctl status in the failed unit. So nginx.service status is not shown eventhough it failed, whilst the subsequent acme-* service that failed (because nginx is down) did show its status

[arian@t430s:~/Projects/nixos-stuff]$ ./deploy.sh arianvp.me root@arianvp.me switch
Building closure
Built /nix/store/qz9bbiz8yfqkka4jxx62f48pimzf3sln-nixos-system-unnamed-19.03pre-git
Copying closure
C_GetAttributeValue failed: 18
Enter PIN for 'YubiKey PIV': 
Setting profile
Switching to configuration
updating GRUB 2 menu...
activating the configuration...
setting up /etc...
reloading user units for root...
setting up tmpfiles
restarting the following units: nginx.service
Job for nginx.service failed because the control process exited with error code.
See "systemctl status nginx.service" and "journalctl -xe" for details.
the following new units were started: acme-techstock.photos.timer
warning: the following units failed: acme-techstock.photos.service

● acme-techstock.photos.service - Renew ACME Certificate for techstock.photos
   Loaded: loaded (/nix/store/v1gbcjamgdskip29anw979kd7xqbjjv1-unit-acme-techstock.photos.service/acme-techstock.photos.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2019-04-24 20:12:35 UTC; 1s ago
  Process: 20689 ExecStopPost=/nix/store/nqr5924pcspanpzfpy284c6wmkbvwf9c-unit-script-acme-techstock.photos-post-stop (code=exited, status=0/SUCCESS)
  Process: 20657 ExecStart=/nix/store/wms6xpqb88axk26y9l6q9jj0iqzkv5i8-unit-script-acme-techstock.photos-start (code=exited, status=2)
  Process: 20637 ExecStartPre=/nix/store/d2hww5ld6fsplrwzm618g6q3wpx212cn-unit-script-acme-techstock.photos-pre-start (code=exited, status=0/SUCCESS)
 Main PID: 20657 (code=exited, status=2)

Apr 24 20:12:35 arianvp.me wms6xpqb88axk26y9l6q9jj0iqzkv5i8-unit-script-acme-techstock.photos-start[20657]:   File "/nix/store/sdxdf6j87iwzlz4r5d5bphka0937wlxh-simp_le-client-0.9.0/lib/python3.7/site-packages/simp_le.py", line 563, in save
Apr 24 20:12:35 arianvp.me wms6xpqb88axk26y9l6q9jj0iqzkv5i8-unit-script-acme-techstock.photos-start[20657]:     pems = [self.dump_cert(data.cert)]
Apr 24 20:12:35 arianvp.me wms6xpqb88axk26y9l6q9jj0iqzkv5i8-unit-script-acme-techstock.photos-start[20657]:   File "/nix/store/sdxdf6j87iwzlz4r5d5bphka0937wlxh-simp_le-client-0.9.0/lib/python3.7/site-packages/simp_le.py", line 449, in dump_cert
Apr 24 20:12:35 arianvp.me wms6xpqb88axk26y9l6q9jj0iqzkv5i8-unit-script-acme-techstock.photos-start[20657]:     return OpenSSL.crypto.dump_certificate(self.typ, data.wrapped).strip()
Apr 24 20:12:35 arianvp.me wms6xpqb88axk26y9l6q9jj0iqzkv5i8-unit-script-acme-techstock.photos-start[20657]: AttributeError: 'NoneType' object has no attribute 'wrapped'
Apr 24 20:12:35 arianvp.me wms6xpqb88axk26y9l6q9jj0iqzkv5i8-unit-script-acme-techstock.photos-start[20657]: Unhandled error has happened, traceback is above
Apr 24 20:12:35 arianvp.me wms6xpqb88axk26y9l6q9jj0iqzkv5i8-unit-script-acme-techstock.photos-start[20657]: Debugging tips: -v improves output verbosity. Help is available under --help.
Apr 24 20:12:35 arianvp.me systemd[1]: acme-techstock.photos.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 24 20:12:35 arianvp.me systemd[1]: acme-techstock.photos.service: Failed with result 'exit-code'.
Apr 24 20:12:35 arianvp.me systemd[1]: Failed to start Renew ACME Certificate for techstock.photos.

@arianvp
Copy link
Member Author

arianvp commented Apr 24, 2019

cc @globin

@arianvp arianvp changed the title nginx configuration test fails during nixos-rebuild switch when you have an existing virtualhost and add a new one nixos/acme: nginx configuration test fails during nixos-rebuild switch when you have an existing virtualhost and add a new one Apr 24, 2019
@arianvp
Copy link
Member Author

arianvp commented Apr 24, 2019

I think the bug is as follows:

Nginx has a wants on acme-certificates.target and acme-selfsigned-certificates.target through:

          servicesAttr = listToAttrs services;
          injectServiceDep = {
            after = [ "acme-selfsigned-certificates.target" ];
            wants = [ "acme-selfsigned-certificates.target" "acme-certificates.target" ];
          };
        in
          servicesAttr //
          (if config.services.nginx.enable then { nginx = injectServiceDep; } else {}) //
          (if config.services.lighttpd.enable then { lighttpd = injectServiceDep; } else {});

Now, if you deploy a single virtualHost (and subsequently start nginx and before that acme-selfsigned-certificates.target) then everything goes well. fullchain.pem is generated before nginx is started after a restart.

However, once you deploy the second virtualHost in the next nixos-rebuild switch, selfsign-certificate.target is already in the active state. Thus, nginx's Wants directives are already met, and nginx restart immediatelly, before fullchain.pem is even generated.

Proposed fix

Using a target here is not good enough, as a .target is only activated once for the first virtualHost deploy and and remains in active state afterwards and then causes any subsequent deploys of other virtualHosts to fail.

Instead, we should create a .path unit for each certificate, and let nginx/httpd depend on the existence of each certificate individually https://www.freedesktop.org/software/systemd/man/systemd.path.html

So:

systemd.services.nginx.wants = [ "acme-a.example.com.path" "acme-b.example.com.path" ]

Where

systemd.paths."acme-a.example.com".pathConfig = {
  PathExists = "/var/lib/acme/a.example.com/fullchain.pem";
};
systemd.paths."acme-b.example.com".pathConfig = {
  PathExists = "/var/lib/acme/b.example.com/fullchain.pem";
};

This way, nginx will always wait for the corresponding cert to show up. not only on first deploy

@arianvp arianvp mentioned this issue Apr 25, 2019
10 tasks
@arianvp
Copy link
Member Author

arianvp commented Apr 25, 2019

I have created PR that demonstrates the bug using a regression test. But don't have an actual fix implemented so far #60219

@arianvp
Copy link
Member Author

arianvp commented Apr 26, 2019

I have misinterpreted what path units do. I don't think we can use them in this way

@arianvp
Copy link
Member Author

arianvp commented Apr 27, 2019

We can instead use a target unit per certificate. I'll WIP up a PR today

@wmertens
Copy link
Contributor

wmertens commented May 7, 2019

@arianvp Any luck?

@arianvp
Copy link
Member Author

arianvp commented May 11, 2019

@wmertens still working on this; progress can be seen on my draft PR. While i'm at it I'm also fixing #53852

@solson
Copy link
Member

solson commented Jul 1, 2019

I just ran into this. Same cause, same symptoms, same easy workaround of manually starting nginx (thankfully). This is the relevant part of the nginx logs, showing that it's starting before ACME gets a chance to create the files it needs:

Jul 01 08:34:09 shannon systemd[1]: Starting Nginx Web Server...
Jul 01 08:34:09 shannon c9k38wc85dgg55vmxaw5fcwjzpgmwjgq-unit-script-nginx-pre-start[6661]: nginx: [emerg] BIO_new_file("/var/lib/acme/dev.solson.me/fullchain.pem") failed (SSL: error:02001002:system library:fopen:No such file or directory:fopen('/var/lib/acme/dev.solson.me/fullchain.pem','r') error:2006D080:BIO routines:BIO_new_file:no such file)
Jul 01 08:34:09 shannon c9k38wc85dgg55vmxaw5fcwjzpgmwjgq-unit-script-nginx-pre-start[6661]: nginx: configuration file /nix/store/2xj6haxpdshy0pbwmk14c4ycx2nr08vm-nginx.conf test failed
Jul 01 08:34:09 shannon systemd[1]: nginx.service: Control process exited, code=exited status=1
Jul 01 08:34:09 shannon systemd[1]: nginx.service: Failed with result 'exit-code'.
Jul 01 08:34:09 shannon systemd[1]: Failed to start Nginx Web Server.

@arianvp
Copy link
Member Author

arianvp commented Jul 2, 2019

@solson ive made some more progress last weekend. Please have a look at the referenced PR if you have time. It could use a review + feedback

@nh2
Copy link
Contributor

nh2 commented Jul 14, 2019

Also happened to me just now on 1 of 3 nixops machines running 19.03:

machine2.................................> activation finished successfully
machine3.................................> activation finished successfully
machine1.................................> warning: the following units failed: acme-deadpager.com.service
machine1.................................> 
machine1.................................> ● acme-deadpager.com.service - Renew ACME Certificate for deadpager.com
machine1.................................>    Loaded: loaded (/nix/store/krldvvxfh4r659h7a48s699197mz2jiv-unit-acme-deadpager.com.service/acme-deadpager.com.service; enabled; vendor preset: enabled)
machine1.................................>    Active: failed (Result: exit-code) since Sun 2019-07-14 21:43:17 UTC; 30ms ago
machine1.................................>   Process: 5829 ExecStopPost=/nix/store/bs7aydb9ynh1zrcg1va4gkfqr6lj67yf-unit-script-acme-deadpager.com-post-stop (code=exited, status=0/SUCCESS)
machine1.................................>   Process: 5737 ExecStart=/nix/store/l95avd3g50xbgbbhck6vblavy0g2yxkc-unit-script-acme-deadpager.com-start (code=exited, status=2)
machine1.................................>   Process: 5729 ExecStartPre=/nix/store/4cj9ifn95j8c014rdh7q4gcwziv4yy7k-unit-script-acme-deadpager.com-pre-start (code=exited, status=0/SUCCESS)
machine1.................................>  Main PID: 5737 (code=exited, status=2)
machine1.................................> 
machine1.................................> Jul 14 21:43:17 machine1 l95avd3g50xbgbbhck6vblavy0g2yxkc-unit-script-acme-deadpager.com-start[5737]:   File "/nix/store/lgsqhw0cyfw6zy0bs38q7rpx30j9kqv8-simp_le-client-0.9.0/lib/python3.7/site-packages/simp_le.py", line 563, in save
machine1.................................> Jul 14 21:43:17 machine1 l95avd3g50xbgbbhck6vblavy0g2yxkc-unit-script-acme-deadpager.com-start[5737]:     pems = [self.dump_cert(data.cert)]
machine1.................................> Jul 14 21:43:17 machine1 l95avd3g50xbgbbhck6vblavy0g2yxkc-unit-script-acme-deadpager.com-start[5737]:   File "/nix/store/lgsqhw0cyfw6zy0bs38q7rpx30j9kqv8-simp_le-client-0.9.0/lib/python3.7/site-packages/simp_le.py", line 449, in dump_cert
machine1.................................> Jul 14 21:43:17 machine1 l95avd3g50xbgbbhck6vblavy0g2yxkc-unit-script-acme-deadpager.com-start[5737]:     return OpenSSL.crypto.dump_certificate(self.typ, data.wrapped).strip()
machine1.................................> Jul 14 21:43:17 machine1 l95avd3g50xbgbbhck6vblavy0g2yxkc-unit-script-acme-deadpager.com-start[5737]: AttributeError: 'NoneType' object has no attribute 'wrapped'
machine1.................................> Jul 14 21:43:17 machine1 l95avd3g50xbgbbhck6vblavy0g2yxkc-unit-script-acme-deadpager.com-start[5737]: Unhandled error has happened, traceback is above
machine1.................................> Jul 14 21:43:17 machine1 l95avd3g50xbgbbhck6vblavy0g2yxkc-unit-script-acme-deadpager.com-start[5737]: Debugging tips: -v improves output verbosity. Help is available under --help.
machine1.................................> Jul 14 21:43:17 machine1 systemd[1]: acme-deadpager.com.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
machine1.................................> Jul 14 21:43:17 machine1 systemd[1]: acme-deadpager.com.service: Failed with result 'exit-code'.
machine1.................................> Jul 14 21:43:17 machine1 systemd[1]: Failed to start Renew ACME Certificate for deadpager.com.

@arianvp
Copy link
Member Author

arianvp commented Jul 14, 2019

@nh2 there is a PR that fixes this that is waiting for review here: #60219

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants