Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add "live tests" #6427

Merged
merged 28 commits into from
Aug 28, 2024
Merged

add "live tests" #6427

merged 28 commits into from
Aug 28, 2024

Conversation

davepacheco
Copy link
Collaborator

(some of this text is copied from the new README)

This PR adds a new Omicron package called omicron-live-tests to contain automated tests that operate in the context of an already-deployed "real" Oxide system (e.g., a4x2 or our london or madrid test environments). The motivation is to provide a home for automated tests for all kinds of Reconfigurator behavior (e.g., add/expunge of all zones, add/expunge sled, upgrades, etc.), though it could be used for non-Reconfigurator behavior, too.

What makes these tests different from the rest of the test suite is that they require connectivity to the underlay network of the deployed system and they make API calls to various components in that system and they assume that this will behave like a real production system. By contrast, the normal tests instead set up a bunch of components using simulated sled agents and localhost networking, which is great for starting from a predictable state and running tests in parallel, but the simulated sled agents and networking make it impossible to exercise quite a lot of Reconfigurator's functionality.

There are some safeguards so that these tests won't run on production systems: they refuse to run if they find any Oxide-hardware sleds in the system whose serial numbers don't correspond to known test environments.

This is similar to end-to-end-tests in a lot of ways but I didn't think it made sense to use the same package because the test environment is pretty different. The end-to-end-tests run on a Helios system on which we have deployed Sled Agent and a bunch of components in zones. But there aren't multiple sleds and there isn't a real underlay network. I don't think it's faithful enough to carry out a lot of the Reconfigurator tests that we'd like to do.

Like the end-to-end-tests, this package is not built or tested by default because the tests generally can't work in a dev environment and there's no way to have cargo build and check them but not run the tests by default.

Eventually I hope we can find a way to run these tests automatically in CI, but that's future work. For now, there are instructions for running these by hand on an Omicron system. Start by running cargo xtask live-tests to build an archive and then follow the instructions:

$ cargo xtask live-tests
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.96s
     Running `target/debug/xtask live-tests`
using temporary directory: /dangerzone/omicron_tmp/.tmp0ItZUD
will create archive file:  /dangerzone/omicron_tmp/.tmp0ItZUD/live-tests-archive/omicron-live-tests.tar.zst
output tarball:            /home/dap/omicron-work/target/live-tests-archive.tgz

running: /home/dap/.rustup/toolchains/1.80.1-x86_64-unknown-illumos/bin/cargo "nextest" "archive" "--package" "omicron-live-tests" "--archive-file" "/dangerzone/omicron_tmp/.tmp0ItZUD/live-tests-archive/omicron-live-tests.tar.zst"
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.89s
info: experimental features enabled: setup-scripts
   Archiving 1 binary, 1 build script output directory, and 1 linked path to /dangerzone/omicron_tmp/.tmp0ItZUD/live-tests-archive/omicron-live-tests.tar.zst
    Archived 35 files to /dangerzone/omicron_tmp/.tmp0ItZUD/live-tests-archive/omicron-live-tests.tar.zst in 0.31s
running: bash "-c" "tar cf - Cargo.toml .config/nextest.toml live-tests | tar xf - -C \"/dangerzone/omicron_tmp/.tmp0ItZUD/live-tests-archive\""
running: tar "cf" "/home/dap/omicron-work/target/live-tests-archive.tgz" "-C" "/dangerzone/omicron_tmp/.tmp0ItZUD" "live-tests-archive"
created: /home/dap/omicron-work/target/live-tests-archive.tgz

To use this:

1. Copy the tarball to the switch zone in a deployed Omicron system.

     e.g., scp \
              /home/dap/omicron-work/target/live-tests-archive.tgz \
              root@YOUR_SCRIMLET_GZ_IP:/zone/oxz_switch/root/root

2. Copy the `cargo-nextest` binary to the same place.

     e.g., scp \
              $(which cargo-nextest) \
              root@YOUR_SCRIMLET_GZ_IP:/zone/oxz_switch/root/root

3. On that system, unpack the tarball with:

     tar xzf live-tests-archive.tgz

4. On that system, run tests with:

     TMPDIR=/var/tmp ./cargo-nextest nextest run \
         --archive-file live-tests-archive/omicron-live-tests.tar.zst \
         --workspace-remap live-tests-archive

Follow the instructions, run the tests, and you'll see the usual nextest-style output:

root@oxz_switch:~# TMPDIR=/var/tmp ./cargo-nextest nextest run          --archive-file live-tests-archive/omicron-live-tests.tar.zst          --workspace-remap live-tests-archive
  Extracting 1 binary, 1 build script output directory, and 1 linked path to /var/tmp/nextest-archive-Lqx9VZ
   Extracted 35 files to /var/tmp/nextest-archive-Lqx9VZ in 1.01s
info: experimental features enabled: setup-scripts
    Starting 1 test across 1 binary (run ID: a5fc9163-9dd5-4b23-b89f-55f8f39ebbbc, nextest profile: default)
        SLOW [> 60.000s] omicron-live-tests::test_nexus_add_remove test_nexus_add_remove
        PASS [  61.975s] omicron-live-tests::test_nexus_add_remove test_nexus_add_remove
------------
     Summary [  61.983s] 1 test run: 1 passed (1 slow), 0 skipped
root@oxz_switch:~# 

@davepacheco davepacheco self-assigned this Aug 23, 2024
@davepacheco davepacheco marked this pull request as draft August 23, 2024 22:06
@davepacheco davepacheco marked this pull request as ready for review August 23, 2024 22:20
Copy link
Contributor

@jgallagher jgallagher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Just a few questions

{
#input_func

let ctx = crate::common::LiveTestContext::new(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just confirming - the use of crate:: here means this macro is only useful within the live-tests crate, right? (Or technically some other crate that defines this same module / type / method, but ignoring that pathology...) I'd maybe note that in the docstring.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrote this comment in c7cd0ed.

///
/// We use this instead of implementing Drop on LiveTestContext because we want
/// the teardown to only happen when the test doesn't fail (which causes a panic
/// and unwind).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you feel about an API like

LiveTestContext::with_context("test_name", |lc: &mut LiveTestContext| {
    // ... my test ...
})

which gives the same opportunity to clean up only on success without needing a proc macro? It's certainly a little uglier and causes a level of indentation. I'd live with that but maybe I dislike proc macros more than most.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong feeling either way but I built this to parallel nexus_test and I think there's some benefit to consistency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hah, that's fair; a big part of my comment was because I don't love nexus_test because it's hard to find exactly what it's doing (IMO).

Copy link
Collaborator Author

@davepacheco davepacheco Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went down this path in a follow-on branch:
https://github.com/oxidecomputer/omicron/compare/dap/experiments/live-tests-no-macro

A few notes:

  • I wanted to avoid type parameters in LiveTestContext::run_test because I didn't want it monomorphized for every test that we add. So I had run_test accept a &dyn Fn instead of F: Fn and that closure returns a BoxFuture.
  • rustfmt seems to have lost its mind on the code and it looks awful. I have spent no time trying to figure out why.
  • It doesn't quite compile due to a lifetime issue. I gave up (not at all insurmountable, I was just spending more time on it than I had hoped to for just an experiment).

There are a couple of problems:

  • the rustfmt problem, but I'll assume that is surmountable
  • it's much easier to copy/paste the code and get the wrong test_name in the argument
  • [tokio::test] and so [live_test] allow your test function to return nothing at all or a Result type. I had this version return a Result. It would be tricky to support both here. (But most of our tests probably just return () and we could probably just have it support that.)

In summary, I don't like it better but I'm still open to it if people really do!

Copy link
Contributor

@sunshowers sunshowers Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh thanks! Looking at 3f43f78:

  • The rustfmt issue is because of the additional nesting involved + our shorter-than-usual 80 char limit
  • I think monomorphizing that is okay? The function looks pretty small to me, and the dependent functions aren't generic.
  • Passing in borrowed data to closures is difficult. async closures solve this but aren't stable yet. I ran into this with the update engine, and that has to pass an owned context in for that reason.

opctx: &OpContext,
datastore: &DataStore,
) -> Result<(), anyhow::Error> {
const ALLOWED_GIMLET_SERIALS: &[&str] = &[
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to take extra allowed serials as input at runtime, maybe via an env var? It would be annoying to get this all built and transferred over to madrid/london, then fail because one of the gimlets had been swapped out for a new one.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could. For what it's worth, when I touch this file and rebuild, it takes 13.5s. You do have the extra copying step(s) that are also annoying.

//
// - that after adding:
// - the new Nexus appears in external DNS
// - we can _use_ the new Nexus from the outside
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we connect to an external IP from within the switch zone? Agreed on the other points (checking external DNS before and after).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly not today. I suspect figuring this out will be worth the investment. Maybe we can use the techport interface for the external API?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but I'm not sure it gains us much. If we're in the switch zone we can either connect to the TCP proxy wicketd runs or directly to the "Nexus external API exposed on the underlay" that wicketd forwards to. But in either case, it wouldn't really confirm external connectivity, because it's not actually using the external IP. It does confirm the new Nexus would be reachable via the techport, which is something though.

Copy link
Contributor

@sunshowers sunshowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so cool, thanks for doing it!

.config/nextest.toml Show resolved Hide resolved
# While most Omicron tests operate with their own simulated control plane, the
# live-tests operate on a more realistic, shared control plane and test
# behaviors that conflict with each other. They need to be run serially.
live-tests = { max-threads = 1 }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Makes sense!
  2. As mentioned in the DM, you can now filter out omicron-live-tests from the default set. (You'll want to bump the required and recommended versions in this file, as well as the version used in CI, to 0.9.76)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is cool. I'd prefer we do that in a separate PR.

dev-tools/xtask/src/live_tests.rs Show resolved Hide resolved
dev-tools/xtask/src/live_tests.rs Show resolved Hide resolved
Comment on lines +187 to +190
"WARNING: temporary directory appears to be under /tmp, \
which is generally tmpfs. Consider setting \
TMPDIR=/var/tmp to avoid runaway tests using too much\
memory and swap."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check for a minimum amount of free space required for TMPDIR?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. Management of tmp space seems outside the scope of all this. I almost didn't bother with this warning at all except that the failure mode of running out of space in /tmp (i.e., swap) is usually pretty bad for the whole zone, and that could be pretty annoying to clean up in the switch zone. That said, I'm barely pro for even this warning -- I could be convinced to just remove this altogether.

Also: there are so many ways I've seen that kind of check be wrong, in part because it's really hard to know how much physical disk space something is going to need before it writes it. Example: false positive errors when using zfs with compression=on because a thing thinks it needs a GiB even though after compression it'll only use 150 MiB of space. Plus other things can use space in the meantime or space can free up in the meantime.

Comment on lines +176 to +178
// We could also just go ahead and use /var/tmp, but it's not clear we can
// reliably do that at this point (if Rust or other components have cached
// TMPDIR) and it would be hard to override.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, you could use a setup script to set TMPDIR outside of the test process.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the test process inherit the environment of the setup script?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if you write it out to $NEXTEST_ENV: https://nexte.st/docs/configuration/setup-scripts/#environment-variables

(This is how the crdb-seed setup script works)

@davepacheco
Copy link
Collaborator Author

I was hoping to land this at a point where the live tests built with this branch passed against a deployment built from the tip of this branch. But I can't keep up with "main" today -- it's moving faster than I can iterate. I have run the live tests built with ab5448b (current tip of this PR) against e85d012 from this branch, which is approximately main at 758818a. Although that's only about 6 hours old, it's 8 commits behind.

@davepacheco
Copy link
Collaborator Author

Okay, from the tip of this branch, locally, I merged with main @ 3dcf1e766c4d44c56df5ce80f3180fbbea00de4a, deployed a system with that, and ran the live-tests with that and that worked, too.

@davepacheco davepacheco enabled auto-merge (squash) August 27, 2024 23:32
@davepacheco davepacheco merged commit 9f4ba06 into main Aug 28, 2024
24 checks passed
@davepacheco davepacheco deleted the dap/drafts/reconfigurator-tests branch August 28, 2024 01:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants