Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Testing nixos-rebuild switch #20886

Open
danbst opened this issue Dec 3, 2016 · 15 comments
Open

Testing nixos-rebuild switch #20886

danbst opened this issue Dec 3, 2016 · 15 comments
Labels
0.kind: enhancement Add something new 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS

Comments

@danbst
Copy link
Contributor

danbst commented Dec 3, 2016

There are many tests in nixos to test some functionality but only a few of them test activating or deactivating that functionality. Because of that, NixOS tends to be "don't forget to reboot machine after nixos-rebuild switch"

Specifically, avahi service works, but enabling it doesn't #19034 . Fix is proposed (#20871), but I think there should be an automated way to test enabling/disabling/upgrade of a service.

I tried some time ago to write such test (for #15815), but was hit with immutable /nix/store in test machine, so I have no idea how to do that properly.

cast @kampfschlaefer for suggestions

(of course, not everything can be enabled/disabled on switch. Perhaps this can be documented as test too)

@kampfschlaefer
Copy link
Contributor

Hey @danbst, what about https://nixos.org/nixos/manual/index.html#sec-writing-nixos-tests where it says:

virtualisation.writableStore
By default, the Nix store in the VM is not writable. If you enable this option, a writable union file system is mounted on top of the Nix store to make it appear writable. This is necessary for tests that run Nix operations that modify the store.

Isn't that what you are looking for in terms of a writable store?

Do you have your trials for this kind of test available somewhere?

@danbst
Copy link
Contributor Author

danbst commented Dec 4, 2016

Thanks for a pointer. Looks like I skipped large part of documentation. Also, looking at firewall test, I do understand now that my intent can be implemented. But first let's consider two cases:

  1. We are changing system configuration via /etc/nixos/configuration.nix and applying changes with $machine->succeed("NIXOS_CONFIG=/etc/nixos/configuration.nix nixos-rebuild switch"
    This is where virtualisation.writableStore should come handy.
  2. We are switching between prebuilt configurations, with switch-to-configuration instead of nixos-rebuild switch. This is very much the same, because we care mostly about activation step.

So, first case is closer to user experience, but hard to implement (how would you know what is configuration.nix content in virtual environment? and how would you express the change?)

As for the second case, it is doable. Here is an example that shows problem with networking.nat:

let execSwitchTo = cfg: "${cfg.config.system.build.toplevel}/bin/switch-to-configuration test";

in import <nixpkgs/nixos/tests/make-test.nix> rec {

  nodes.machine = {
    networking.firewall.enable = false;
  };
  nodes.enableNat = {
    networking.firewall.enable = false;    
    networking.nat.enable = true;
  };

  testScript = { nodes, ...}: ''
      $machine->start;
      $machine->waitForUnit("default.target");

      # This shows that NAT is started automatically after boot
      $enableNat->start;
      $enableNat->waitForUnit("default.target");
      $enableNat->succeed("systemctl status nat");

      # This shows that NAT isn't started after nixos-rebuild switch, it
      # should be started manually
      $machine->succeed("${execSwitchTo nodes.enableNat}");
      $machine->waitForUnit("default.target");
      $machine->fail("systemctl status nat");
      $machine->succeed("systemctl start nat");
      $machine->succeed("systemctl status nat");
    '';
}

But it looks hacky.

  • instead of expressing a change of a system, we do the switch to completely different system, with different IP
  • we don't express a change, we supply full configs every time (I hope there is a function, that do same merge operations as imports = [ ... ] do)

@kampfschlaefer
Copy link
Contributor

Regarding your second solution: I don't think its as hacky as you think. There is no difference between switching to a modified version of your current profile and switching to a completely different profile. Both are treated as a new profile and nix/nixos only cares about the changes as a convenience to not restart the whole system.

Of course your current implementation of 2) does the IP-address change, but that is because of the magic the test-system does with the defined nodes. If there is a way to define the same node with your changes maybe within the 'let' block without the IP-address change, then its perfectly valid. Maybe you can mimic the stuff the test system does to the nodes (basically its configuration for eth0). If we found a way to define several states for one node, we could even add something like machine->switchTo() to the test API to make this more official.

Maybe you are onto something here, maybe all tests should be of the kind:

  • start a simple default config
  • switch to the configuration-to-test
  • do the tests
  • maybe switch to a second config
  • do some more tests for reload and start/stop

@vcunat vcunat added 0.kind: enhancement Add something new 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS labels Dec 5, 2016
@layus
Copy link
Member

layus commented Dec 5, 2016

I really like the idea of always starting a minimal nixos iso, then switching to the intended config. This should really be baked into the testing procedure. This would be tremendous !

@cleverca22
Copy link
Contributor

related, when i was trying to fix xen, a nixos-rebuild switch would restart all xen related services, but messed something up

about 2 minutes after that, a watchdog within the hypervisor would trip, and the machine would hard-reset

it made it rather difficult to debug stuff until i had figured that out, and switched to nixos-rebuild boot && shutdown -r now

but automatically testing that with nixos's qemu framework would be even more difficult

@kampfschlaefer
Copy link
Contributor

Proposal:

  1. Add switch to test configurations to prevent it from setting IP addresses on network devices.
  2. Check switching of configurations at test runtime.
  3. Add "$machine.switchToConfiguration()" to perl test system to make it more convenient.
  4. Maybe change "$machine.start()" to start with minimal default configuration and then switch to intended configuration. This could break a lot of tests. But well, thats what it will show.

@danbst
Copy link
Contributor Author

danbst commented Feb 11, 2017

Some more insights on this issue.

  1. There exists option nesting.clone which actually solves the problem with IP and network devices. It just needs to be nicely integrated into test-runner API
    1a. Though I don't like that nesting.clone accepts list. attrsOf nixosSubmodule would be better, because you can specify name (which can be used as configuration name later). Do you agree, @nbp ?
  2. Another very-very useful test is regression testing switch from stable to unstable. For example, simulate situation when user changes a channel from 16.09 to 17.03. This feature strongly relates to reverted Add a way to pin a NixOS version within the module system. #11106, I think

@danbst
Copy link
Contributor Author

danbst commented May 30, 2017

One more datapoint for such tests - https://stackoverflow.com/questions/44220338/taskserver-fails-to-start-on-nixos

The test should:

  • set NixOS version previous before stable
  • start an instance of a service
  • rebuild NixOS against stable
  • check that service was restarted and works. If it wasn't restarted, reboot machine and check it works
  • rebuild NixOS against unstable/master
  • check that service was restarted and works. If it wasn't restarted, reboot machine and check it works

@danbst
Copy link
Contributor Author

danbst commented May 30, 2017

(cc @aszlig btw ---^)

@aszlig
Copy link
Member

aszlig commented May 30, 2017

Testing for switch to a new configuration is already done in the Taskserver test, but the stackoverflow issue could have been triggered more easily if we'd have a test from an older version to a newer one, not just stable/unstable.

I personally have a similar use-case for something like this in my deployments, where I test whether database migrations work successfully. Unfortunately, this needs a lot of imports from derivations and I usually have eval times of several minutes on these deployments :-/

So I guess testing between releases should probably end up in a dedicated jobset, which has two inputs like nixpkgs_old and nixpkgs_new, so we have a low eval time. Gong further, this could probably be automated as well, if we just save the disk state for each test in nixpkgs_old and run the same test with that disk state against nixpkgs_new.

Unfortunately, I think this won't work for more complicated tests, so I'd propose something like a snapshot command in the Perl test driver, which would save the current machine state (including a memory dump).

Having such a snapshot command would also help for speeding up boot times in tests, because all that's needed is to restore the state instead of going through the full boot. The idea of starting with a basic machine would also have that advantage.

Also, I've been thinking about making it easier to switch between system configurations, but didn't yet come up with a very good solution, but these are the ones I had in mind:

  • If a node isn't a function or and attribute set, but a list, use it as a series of generations and make it switchable using something like $machine->switch(10);.
    Cons:

    • Could end up in lots of duplicated configuration attributes.
    • Using integer indices would probably make tests very hard to read, imagine 20 generations, you want to switch to generation 13 and now you need to count through all this mess.
  • Use a separate attribute, which in turn contains an attribute set mapping from generation name to the corresponding configuration.
    Pros:

    • No counting of indices anymore.

    Cons:

    • Still with duplicated configuration values.
  • Same as both options before, but use some kind of a delta DSL for transforming the configuration.
    Cons:

    • Complicated, because people would need to learn the DSL.
    • Even more complicated, because it's not so sure which config you want to change.
  • (Mis-)use the module key attribute to define conditions for specific generations, for example:

    {
      machine = { lib, ... }: {
        imports = lib.singleton {
          key = "disable-foo";
          services.foo.enable = lib.mkForce false;
        };
        services.foo.enable = true;
      };
    }

    So the initial generation has service foo enabled, but if you do $machine->switch('disable-foo'); the foo service is disabled.

  • Ditch all of these and implement these generations as ordinary module options, so you have something like config.generations which reference the configuration values to merge.
    Pros:

    • Easy to understand if you know the module system already.
      Cons:
    • Very tricky to implement because we'd need to merge top-level module attributes, which could trigger an infinite recursion if we reference top-level config attributes.
    • Can be very ugly if lists are involved, because you can't easily patch out single items or filter them (might however be a good idea for the module system in general as well, something like lib.filterList <prio> <fun>) so you need to mkOverride whole lists.
  • Similar to the previous one but instead of merging configurations, expose these options as boolean values, so the main config would look like:

    {
      machine = { generation, lib, ... }: {
        generations = [ "enabled-foo" ];
        services.foo.enable = generation.enabled-foo;
      };
    }

Had some more ideas, but I'm currently writing-impaired (broken shoulder and typing with one hand) so I won't list all of them as I didn't come up with a good solution that's both simple and declarative yet.

@danbst
Copy link
Contributor Author

danbst commented Feb 21, 2019

Another case that should have been caught by rebuild-switch tests: #56134

@arianvp
Copy link
Member

arianvp commented May 25, 2019

I tried the suggested methods to write a test that tests switch-to-configuration to aid fixing #60180 but with no success.

The problem is that both webserver and webserver2 add a DNS entry for a.example.com so
when trying to request a letsencrypt cert, sometimes letsencrypt tries to contact webserver2 which
is only there for building up the switchToNewServer command, and isn't actually running,
in which the test fails.

I want a way to build a config which I can switch to, but not create a bogus node for that config, as that
is actually undesired behaviour in this case, as it screws with the tests

let
  commonConfig = ./common/letsencrypt/common.nix;
in import ./make-test.nix {
  name = "acme";

  nodes = rec {
    letsencrypt = ./common/letsencrypt;

    webserver = { config, pkgs, ... }: {
      imports = [ commonConfig ];
      networking.firewall.allowedTCPPorts = [ 80 443 ];

      networking.extraHosts = ''
        ${config.networking.primaryIPAddress} a.example.com
      '';

      services.nginx.enable = true;
      services.nginx.virtualHosts."a.example.com" = {
        enableACME = true;
        forceSSL = true;
        locations."/".root = pkgs.runCommand "docroot" {} ''
          mkdir -p "$out"
          echo hello world > "$out/index.html"
        '';
      };
    };

    webserver2 = { config, pkgs, ... }: {
      imports = [ webserver ];
      networking.extraHosts = ''
        ${config.networking.primaryIPAddress} b.example.com
      '';
      services.nginx.virtualHosts."b.example.com" = {
        enableACME = true;
        forceSSL = true;
        locations."/".root = pkgs.runCommand "docroot" {} ''
          mkdir -p "$out"
          echo hello world > "$out/index.html"
        '';
      };
    };

    client = commonConfig;
  };

  testScript = {nodes, ...}: 
    let
      newServerSystem = nodes.webserver2.config.system.build.toplevel;
      switchToNewServer = "${newServerSystem}/bin/switch-to-configuration test";
    in
    ''
      $client->waitForUnit("default.target");

      $letsencrypt->waitForUnit("default.target");
      $letsencrypt->waitForUnit("boulder.service");


      $webserver->waitForUnit("nginx.service");
      # This step fails already, as "a.example.com" points to two servers
      $webserver->waitForUnit("acme-a.example.com.service");
      $client->succeed('curl https://a.example.com/ | grep -qF "hello world"');

      $webserver->succeed("${switchToNewServer}");
      $webserver->waitForUnit("acme-b.example.com.service");
      $client->succeed('curl https://b.example.com/ | grep -qF "hello world"');

    '';
}

methods doesn't work in this particular case, because both nodes register their domain names and now I have two nodes with the same domain name running, which obviously breaks requesting certificates. So now the tests already fail before doing switch-to-configuration

Any way to have a "new config" and "old config" without creating two nodes?

@arianvp
Copy link
Member

arianvp commented May 25, 2019

@danbst thanks for pointing me to nesting.clone that very much seems to be what we're looking for here.

However, it doesn't seem to work in this case. build-vms.nix injects the evaluated nodes argument
into the nixos modules it is evaluating (https://github.com/NixOS/nixpkgs/blob/master/nixos/lib/build-vms.nix#L27)

whilst nesting.clone doesn't preserve this argument (it calls eval-config.nix manually, whilst build-vms-nix does too, but passes different arguments)
https://github.com/NixOS/nixpkgs/blob/master/nixos/modules/system/activation/top-level.nix#L13-L23

So when you try to use it in a NixOS test, you'll get the following error:

while evaluating the module argument `nodes' in "/home/arian/Projects/nixpkgs/nixos/tests/common/letsencrypt/common.nix":
while evaluating the attribute '_module.args.nodes' at undefined position:
attribute 'nodes' missing, at /home/arian/Projects/nixpkgs/lib/modules.nix:163:28

Example test here:

let
  commonConfig = ./common/letsencrypt/common.nix;
in import ./make-test.nix {
  name = "acme";

  nodes = {
    letsencrypt = ./common/letsencrypt;

    webserver = { config, pkgs, ... }: {
      imports = [ commonConfig ];
      networking.firewall.allowedTCPPorts = [ 80 443 ];

      networking.extraHosts = ''
        ${config.networking.primaryIPAddress} a.example.com
      '';

      services.nginx.enable = true;
      services.nginx.virtualHosts."a.example.com" = {
        enableACME = true;
        forceSSL = true;
        locations."/".root = pkgs.runCommand "docroot" {} ''
          mkdir -p "$out"
          echo hello world > "$out/index.html"
        '';
      };

      nesting.clone = [ 
        ({config, pkgs, ... }: {
          networking.extraHosts = ''
            ${config.networking.primaryIPAddress} b.example.com
          '';
          services.nginx.virtualHosts."b.example.com" = {
            enableACME = true;
            forceSSL = true;
            locations."/".root = pkgs.runCommand "docroot" {} ''
              mkdir -p "$out"
              echo hello world > "$out/index.html"
            '';
          };
        })
      ];
    };

    client = commonConfig;
  };

  testScript =
    ''
      $client->waitForUnit("default.target");

      $letsencrypt->waitForUnit("default.target");
      $letsencrypt->waitForUnit("boulder.service");


      $webserver->waitForUnit("nginx.service");
      $webserver->waitForUnit("acme-a.example.com.service");
      $client->succeed('curl https://a.example.com/ | grep -qF "hello world"');


      # Incrementally enabling virtual hosts should just work
      $webserver->succeed("/run/current-system/fine-tune/child-1/bin/switch-to-configuration test");
      $webserver->waitForUnit("acme-b.example.com.service");
      $client->succeed('curl https://b.example.com/ | grep -qF "hello world"');

    ''; 
}

arianvp added a commit to arianvp/nixpkgs that referenced this issue May 25, 2019
Because nesting.clone calls 'eval-config.nix' manually,
without the 'extraArgs' argument that provides the 'nodes'
argument to nixos modules in nixos tests, evaluating
of 'nesting.clone' definitions would fail with the following error

while evaluating the module argument `nodes' in "<redacted>"
while evaluating the attribute '_module.args.nodes' at undefined position:
attribute 'nodes' missing, at <redacted./nixpkgs/lib/modules.nix:163:28

by not using 'extraArgs' but a nixos module instead, the nodes parameter
gets propagated to the 'eval-config.nix' call that  'nesting.clone'
makes too -  getting rid of the error.

See  https://github.com/NixOS/nixpkgs/blob/master/nixos/modules/system/activation/top-level.nix#L13-L23
See  https://github.com/NixOS/nixpkgs/blob/master/nixos/lib/build-vms.nix#L27
See  NixOS#20886 (comment)
@arianvp
Copy link
Member

arianvp commented May 25, 2019

I just created a PR that fixes these issues.

@stale
Copy link

stale bot commented Jun 2, 2020

Thank you for your contributions.

This has been automatically marked as stale because it has had no activity for 180 days.

If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity.

Here are suggestions that might help resolve this more quickly:

  1. Search for maintainers and people that previously touched the related code and @ mention them in a comment.
  2. Ask on the NixOS Discourse.
  3. Ask on the #nixos channel on irc.freenode.net.

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jun 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0.kind: enhancement Add something new 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS
Projects
None yet
Development

No branches or pull requests

7 participants