-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add "live tests" #6427
add "live tests" #6427
Conversation
…r, more accurate tarball)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Just a few questions
{ | ||
#input_func | ||
|
||
let ctx = crate::common::LiveTestContext::new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just confirming - the use of crate::
here means this macro is only useful within the live-tests
crate, right? (Or technically some other crate that defines this same module / type / method, but ignoring that pathology...) I'd maybe note that in the docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rewrote this comment in c7cd0ed.
/// | ||
/// We use this instead of implementing Drop on LiveTestContext because we want | ||
/// the teardown to only happen when the test doesn't fail (which causes a panic | ||
/// and unwind). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you feel about an API like
LiveTestContext::with_context("test_name", |lc: &mut LiveTestContext| {
// ... my test ...
})
which gives the same opportunity to clean up only on success without needing a proc macro? It's certainly a little uglier and causes a level of indentation. I'd live with that but maybe I dislike proc macros more than most.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong feeling either way but I built this to parallel nexus_test
and I think there's some benefit to consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hah, that's fair; a big part of my comment was because I don't love nexus_test
because it's hard to find exactly what it's doing (IMO).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went down this path in a follow-on branch:
https://github.com/oxidecomputer/omicron/compare/dap/experiments/live-tests-no-macro
A few notes:
- I wanted to avoid type parameters in
LiveTestContext::run_test
because I didn't want it monomorphized for every test that we add. So I hadrun_test
accept a&dyn Fn
instead ofF: Fn
and that closure returns aBoxFuture
. - rustfmt seems to have lost its mind on the code and it looks awful. I have spent no time trying to figure out why.
- It doesn't quite compile due to a lifetime issue. I gave up (not at all insurmountable, I was just spending more time on it than I had hoped to for just an experiment).
There are a couple of problems:
- the rustfmt problem, but I'll assume that is surmountable
- it's much easier to copy/paste the code and get the wrong
test_name
in the argument [tokio::test]
and so[live_test]
allow your test function to return nothing at all or a Result type. I had this version return aResult
. It would be tricky to support both here. (But most of our tests probably just return()
and we could probably just have it support that.)
In summary, I don't like it better but I'm still open to it if people really do!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh thanks! Looking at 3f43f78:
- The rustfmt issue is because of the additional nesting involved + our shorter-than-usual 80 char limit
- I think monomorphizing that is okay? The function looks pretty small to me, and the dependent functions aren't generic.
- Passing in borrowed data to closures is difficult. async closures solve this but aren't stable yet. I ran into this with the update engine, and that has to pass an owned context in for that reason.
opctx: &OpContext, | ||
datastore: &DataStore, | ||
) -> Result<(), anyhow::Error> { | ||
const ALLOWED_GIMLET_SERIALS: &[&str] = &[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to take extra allowed serials as input at runtime, maybe via an env var? It would be annoying to get this all built and transferred over to madrid/london, then fail because one of the gimlets had been swapped out for a new one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could. For what it's worth, when I touch this file and rebuild, it takes 13.5s. You do have the extra copying step(s) that are also annoying.
// | ||
// - that after adding: | ||
// - the new Nexus appears in external DNS | ||
// - we can _use_ the new Nexus from the outside |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we connect to an external IP from within the switch zone? Agreed on the other points (checking external DNS before and after).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly not today. I suspect figuring this out will be worth the investment. Maybe we can use the techport interface for the external API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could, but I'm not sure it gains us much. If we're in the switch zone we can either connect to the TCP proxy wicketd runs or directly to the "Nexus external API exposed on the underlay" that wicketd forwards to. But in either case, it wouldn't really confirm external connectivity, because it's not actually using the external IP. It does confirm the new Nexus would be reachable via the techport, which is something though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so cool, thanks for doing it!
# While most Omicron tests operate with their own simulated control plane, the | ||
# live-tests operate on a more realistic, shared control plane and test | ||
# behaviors that conflict with each other. They need to be run serially. | ||
live-tests = { max-threads = 1 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Makes sense!
- As mentioned in the DM, you can now filter out
omicron-live-tests
from the default set. (You'll want to bump the required and recommended versions in this file, as well as the version used in CI, to 0.9.76)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is cool. I'd prefer we do that in a separate PR.
"WARNING: temporary directory appears to be under /tmp, \ | ||
which is generally tmpfs. Consider setting \ | ||
TMPDIR=/var/tmp to avoid runaway tests using too much\ | ||
memory and swap." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check for a minimum amount of free space required for TMPDIR
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. Management of tmp space seems outside the scope of all this. I almost didn't bother with this warning at all except that the failure mode of running out of space in /tmp (i.e., swap) is usually pretty bad for the whole zone, and that could be pretty annoying to clean up in the switch zone. That said, I'm barely pro for even this warning -- I could be convinced to just remove this altogether.
Also: there are so many ways I've seen that kind of check be wrong, in part because it's really hard to know how much physical disk space something is going to need before it writes it. Example: false positive errors when using zfs with compression=on because a thing thinks it needs a GiB even though after compression it'll only use 150 MiB of space. Plus other things can use space in the meantime or space can free up in the meantime.
// We could also just go ahead and use /var/tmp, but it's not clear we can | ||
// reliably do that at this point (if Rust or other components have cached | ||
// TMPDIR) and it would be hard to override. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, you could use a setup script to set TMPDIR
outside of the test process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the test process inherit the environment of the setup script?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, if you write it out to $NEXTEST_ENV
: https://nexte.st/docs/configuration/setup-scripts/#environment-variables
(This is how the crdb-seed setup script works)
I was hoping to land this at a point where the live tests built with this branch passed against a deployment built from the tip of this branch. But I can't keep up with "main" today -- it's moving faster than I can iterate. I have run the live tests built with ab5448b (current tip of this PR) against e85d012 from this branch, which is approximately main at 758818a. Although that's only about 6 hours old, it's 8 commits behind. |
Okay, from the tip of this branch, locally, I merged with main @ 3dcf1e766c4d44c56df5ce80f3180fbbea00de4a, deployed a system with that, and ran the live-tests with that and that worked, too. |
(some of this text is copied from the new README)
This PR adds a new Omicron package called
omicron-live-tests
to contain automated tests that operate in the context of an already-deployed "real" Oxide system (e.g.,a4x2
or ourlondon
ormadrid
test environments). The motivation is to provide a home for automated tests for all kinds of Reconfigurator behavior (e.g., add/expunge of all zones, add/expunge sled, upgrades, etc.), though it could be used for non-Reconfigurator behavior, too.What makes these tests different from the rest of the test suite is that they require connectivity to the underlay network of the deployed system and they make API calls to various components in that system and they assume that this will behave like a real production system. By contrast, the normal tests instead set up a bunch of components using simulated sled agents and localhost networking, which is great for starting from a predictable state and running tests in parallel, but the simulated sled agents and networking make it impossible to exercise quite a lot of Reconfigurator's functionality.
There are some safeguards so that these tests won't run on production systems: they refuse to run if they find any Oxide-hardware sleds in the system whose serial numbers don't correspond to known test environments.
This is similar to
end-to-end-tests
in a lot of ways but I didn't think it made sense to use the same package because the test environment is pretty different. The end-to-end-tests run on a Helios system on which we have deployed Sled Agent and a bunch of components in zones. But there aren't multiple sleds and there isn't a real underlay network. I don't think it's faithful enough to carry out a lot of the Reconfigurator tests that we'd like to do.Like the end-to-end-tests, this package is not built or tested by default because the tests generally can't work in a dev environment and there's no way to have
cargo
build and check them but not run the tests by default.Eventually I hope we can find a way to run these tests automatically in CI, but that's future work. For now, there are instructions for running these by hand on an Omicron system. Start by running
cargo xtask live-tests
to build an archive and then follow the instructions:Follow the instructions, run the tests, and you'll see the usual
nextest
-style output: