
roachtest: skip tests that fail during "setup" #33377

Open
petermattis opened this issue Dec 27, 2018 · 13 comments · May be fixed by #101814
Labels
A-testing Testing tools and infrastructure C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) no-issue-activity T-testeng TestEng Team

Comments

@petermattis
Collaborator

petermattis commented Dec 27, 2018

We periodically see nightly roachtest failures which are due to a failure during setup and not a failure of the functionality being tested. For example, if the nightly roachtests run too long, then tests which start after the timeout will fail during cluster setup. We've seen backup2TB fail because the store dumps cannot be downloaded. We've seen cdc tests fail because of docker flakiness and we've seen other tests fail because of apt-get flakiness.

In addition to trying to improve the reliability of the setup steps, we should avoid considering failure during setup as test failure. I think this can be achieved by marking tests as "skipped" if they fail during setup. Perhaps "skipped" is the wrong term and tests should be marked as "setup-failed". One possibility for achieving this is to mark a test as skipped if failure occurs before cluster.Start is invoked. That could be a bit too subtle. Another possibility is to add a cluster.Setup() function that holds all of the steps performed during setup. If a test doesn't use cluster.Setup then any failure would be considered a real test failure. I think that default is fine.
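
A minimal sketch of how a Setup callback that downgrades fatals to skips could behave. All of the names here are hypothetical and only illustrate the shape, not the actual roachtest interface; a real Fatal would of course also abort the test rather than just record the outcome.

```go
package main

import "fmt"

type test struct {
	setupPhase bool
	skipped    bool
	failed     bool
}

// Setup runs fn; any Fatal raised while it runs is recorded as a skip
// rather than a failure.
func (t *test) Setup(fn func()) {
	t.setupPhase = true
	defer func() { t.setupPhase = false }()
	fn()
}

// Fatal records a skip if we are still in the setup phase, and a real
// failure otherwise. (A real implementation would also stop the test.)
func (t *test) Fatal(args ...interface{}) {
	if t.setupPhase {
		t.skipped = true
		fmt.Println("setup failed, skipping:", fmt.Sprint(args...))
		return
	}
	t.failed = true
	fmt.Println("test failed:", fmt.Sprint(args...))
}

func main() {
	t := &test{}
	t.Setup(func() {
		// Stand-in for cluster creation, store-dump download, apt-get, docker pull...
		t.Fatal("store dump download timed out")
	})
	fmt.Printf("skipped=%v failed=%v\n", t.skipped, t.failed)
}
```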

@danhhz do you have any thoughts about this?

Epic CRDB-10428

Jira issue: CRDB-4691

@petermattis petermattis added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-testing Testing tools and infrastructure labels Dec 27, 2018
@andreimatei
Contributor

Out of all the badness you've enumerated, I'm only impressed by "tests which start after the timeout will fail during cluster setup". How exactly do the future tests fail? I'll make sure to have a better story here.

Other than this, I don't quite see a need for any common "setup" infrastructure. Individual tests can, of course, deal with their individual hardships and retry / mark themselves skipped / do whatever they see fit for their particular situation.

@danhhz
Contributor

danhhz commented Dec 27, 2018

Yeah, there's definitely a real problem here. In an ideal world, we'd add retries/backoff/rescheduling as necessary to get everything to consistently pass, but I agree that it's not worth the time it would take to root out every single one of these.

If we had something like you propose, the first thing I'd do is go mark everything up to and including the fixture restore in all the cdc tests as setup. If I could, I'd also mark any errors coming back from workload run as a skip. (These happen all the time and are purely a burden from a "testing cdc" perspective. See #33317, #32813 (comment), and https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/cdc.go#L101-L102)

The downside to this is that these things I'd skip have historically been under-tested. For example, for quite a while, cdc was the only thing regularly testing a nightly tpcc-1000. This has gotten better recently, but it's still something we should be careful of.

As for a concrete proposal, I'd suggest something parallel to StartTimer and StopTimer on (*testing.B). So a StartSetup/StopSetup or the inverse semantics StartInterestingTest/StopInterestingTest that turns fatals into skips. I don't have a strong opinion on "skipped" vs "setup-failed".
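
For reference, a minimal example of the stdlib pattern being borrowed here: work between (*testing.B).StopTimer and StartTimer is excluded from the measurement, just as fatals between a hypothetical StartSetup and StopSetup would be excluded from "real" failures.

```go
package bench

import "testing"

// Fixture construction is excluded from the timing by stopping the timer
// around it; only the loop body is measured.
func BenchmarkWithSetup(b *testing.B) {
	b.StopTimer()
	data := buildFixture() // expensive setup, not measured
	b.StartTimer()

	for i := 0; i < b.N; i++ {
		process(data)
	}
}

func buildFixture() []int {
	s := make([]int, 1<<16)
	for i := range s {
		s[i] = i
	}
	return s
}

func process(data []int) int {
	sum := 0
	for _, v := range data {
		sum += v
	}
	return sum
}
```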

@andreimatei
Contributor

Can't the CDC tests do whatever they want themselves by using their own helpers or t wrapper? If possible, let's pilot things there for a while. Please don't add things lightly to the interface between the tests and the test runner.
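
A rough sketch of the kind of per-test helper that could mean, assuming nothing more from the runner than a way to skip; the helper name and signatures are made up for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// setupOrSkip retries a flaky setup step a few times; if it still fails,
// the failure is reported through the skip callback instead of as a test
// failure.
func setupOrSkip(skip func(reason string), step func() error) {
	var err error
	for attempt := 1; attempt <= 3; attempt++ {
		if err = step(); err == nil {
			return
		}
		time.Sleep(100 * time.Millisecond)
	}
	skip(fmt.Sprintf("setup failed after retries: %v", err))
}

func main() {
	skipped := false
	skip := func(reason string) {
		skipped = true
		fmt.Println("SKIP:", reason)
	}
	// Stand-in for docker pull / apt-get / fixture restore flakiness.
	setupOrSkip(skip, func() error { return errors.New("docker pull: TLS handshake timeout") })
	fmt.Println("skipped:", skipped)
}
```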

I don't have a strong opinion on "skipped" vs "setup-failed".

You seem to have an opinion. Or, rather, what's the difference y'all are seeking to produce between "skipped" and "failed"? Or, conversely, between "setup-failed" and "failed"?

As for a concrete proposal, I'd suggest something parallel to StartTimer and StopTimer on (*testing.B).

How would having tests use that be less burdensome than having them... not call t.Fatal() inside the respective block?

@petermattis
Collaborator Author

The purpose of some sort of cluster.Setup or test.Setup infrastructure is to make the pattern for specifying some steps as part of setup obvious. There is currently only a single example of a test "skipping" itself (disk-full). I would hazard a guess that most folks don't know that is possible. Why do I want to distinguish failure from setup-failure? Because the latter should not cause an issue to be filed and should look different in the summary slack message.

@danhhz StartSetup / StopSetup would also work. Not sure if that has advantages over test.Setup(fn func()).

How would having tests use that be less burdensome than having them... not call t.Fatal() inside the respective block?

The t.Fatal() call might be hidden from the test's control.

@danhhz
Contributor

danhhz commented Dec 27, 2018

@danhhz StartSetup / StopSetup would also work. Not sure if that has advantages over test.Setup(fn func()).

For one, we'd have to change the name but it could also be used for teardown. Also, dunno that I have any examples for setup in current roachtests, but I've found the StartTimer/StopTimer/b.N pattern much more flexible for benchmarks than, for example, what rust does with just a callback function. However, I think either is workable.

@andreimatei
Contributor

The t.Fatal() call might be hidden from the test's control.

That's a harmful anti-pattern! t.Fatal() should always be under the test's control!

Why do I want to distinguish failure from setup-failure? Because the latter should not cause an issue to be filed and should look different in the summary slack message.

Right. But it seems to me that exactly what behavior we want is still nebulous. We do want some issue to be filed for "setup failure", right? But perhaps not one per test, but one per failure type. So perhaps what's needed is more control over issue creation, not a blunt setup/non-setup false dichotomy.

@petermattis
Collaborator Author

That's a harmful anti-pattern! t.Fatal() should always be under the test's control!

Except for cluster creation, t.Fatal is under the test's control, though sometimes it is hidden within various cluster convenience methods. That seems fine to me. And it seems fine to add a convenience so that those fatals are not actually fatal if they happen during a "setup" phase.

Right. But it seems to me that exactly what behavior we want is still nebulous. We do want some issue to be filed for "setup failure", right? But perhaps not one per test, but one per failure type. So perhaps what's needed is more control over issue creation, not a blunt setup/non-setup false dichotomy.

Well, no. I don't want an issue filed for "setup failure". I do want notification via slack. I don't think an issue is the right medium for this "expected" failure.

@tbg
Member

tbg commented Jan 2, 2019

I think we should let these tests fail nominally, though we may want to hold off on posting an issue. I'm open to introducing a separate error state for this (setup-failed) but I would rather not just mark them as skipped. Part of getting reliable tests is to invest in a clean setup. If we're finding that tests fail regularly during setup, we need to invest in improving that.

For example, data imports on GCE can be baked into the image; creating an image from an existing disk seems straightforward. We could bake this kind of detection into downloadStoreDumps and create the image with the --delete-in flag to avoid build-up of gunk. (And then we'd need to do something similar on any other cloud platform that we run into problems in.)

We're obviously short on resources to work on this sort of thing (but I think roles in that department will be listed soon).

@andreimatei
Contributor

Since #34687 there's a t.Skip() which is used when cluster creation fails.

@knz
Contributor

knz commented Oct 1, 2019

Copying this comment over from #41227 which I mistakenly filed as a separate issue this morning:


The following issues are caused by an I/O timeout during bulk I/O, which prevents the test from proceeding beyond its initialization stage:

Although the cause of these failures is serious and needs its own investigation, it is immaterial to the main feature being tested by the various roachtests.

To facilitate triage and enable different team members to focus on specific issues (as opposed to the same investigation of a common cause by many team members), it would be useful to ensure that roachtest files separate GitHub issues depending on which phase of a test fails.


so my take on this is to have a marker (possibly a Go string) of which "phase" a test is in and add this to the string used to file the GitHub issue.
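
A rough sketch of that phase-marker idea; the phase names, types, and issueTitle helper below are hypothetical, not how roachtest currently files issues.

```go
package main

import "fmt"

type phase string

const (
	phaseClusterCreate phase = "cluster-create"
	phaseFixtureImport phase = "fixture-import"
	phaseTestBody      phase = "test-body"
)

type test struct {
	name  string
	phase phase
}

func (t *test) setPhase(p phase) { t.phase = p }

// issueTitle folds the current phase into the title used when filing a
// GitHub issue, so that e.g. all fixture-import timeouts group together
// instead of producing one issue per affected test.
func issueTitle(t *test, err error) string {
	return fmt.Sprintf("roachtest: %s failed [phase=%s]: %v", t.name, t.phase, err)
}

func main() {
	t := &test{name: "backup2TB", phase: phaseClusterCreate}
	t.setPhase(phaseFixtureImport)
	fmt.Println(issueTitle(t, fmt.Errorf("I/O timeout during bulk I/O")))
}
```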

@knz knz mentioned this issue Oct 1, 2019
@andreimatei
Contributor

There are two things that need custom handling, in my opinion.

  1. cluster operation failing - in particular roachprod put and roachprod start.
  2. imports and restores

What I think I would do is make these operations file issues separately (not even tied to the particular test that was running), by isolating them in dedicated helpers that handle issue creation, or perhaps in some other way.
Then, separately, I think we can also introduce this idea of a test "phase" that you talk about and let tests use phases as they see fit. But, in my opinion, if we address these two particular cases, we'll be in a much better place.
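
A rough sketch of what such a dedicated helper might look like; fileInfraIssue, runRoachprodPut, and their signatures are all made up for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// fileInfraIssue would file (or dedupe into) a single infrastructure issue
// keyed by the operation, not by whichever test happened to be running.
func fileInfraIssue(op string, err error) {
	fmt.Printf("would file/append to issue %q: %v\n", "roachtest infra: "+op, err)
}

// runRoachprodPut wraps the cluster operation so that its failures are
// reported as infrastructure problems and surface to the caller as a skip.
func runRoachprodPut(put func() error) (skip bool) {
	if err := put(); err != nil {
		fileInfraIssue("roachprod put", err)
		return true // the calling test should skip rather than fail
	}
	return false
}

func main() {
	skip := runRoachprodPut(func() error {
		return errors.New("scp: connection reset by peer")
	})
	fmt.Println("skip test:", skip)
}
```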

@irfansharif
Contributor

I think addressing this, or something similar, would go a long way toward fixing the very annoying amount of integration test noise we have at CRL. I'll give a very recent example: #59562 (see the 200+ linked issues). When we make changes to our testing infrastructure (like roachprod, workload, or CI images), or there are real infra issues (apt-get doesn't work), or restoring from a fixture fails (maybe we can't do much about this one with this issue), the way we find out about them is that the single underlying issue ends up tripping up virtually every integration test we have, and our test filer files an issue for each one. This is just too much noise to wade through, and real failures very likely end up getting lost because of it.

If we could massage our test filer to file "differently" for failures that occurred after "setup", it could improve the signal-to-noise ratio tremendously.

+cc @kenliu / @jlinder.

srosenberg added a commit to srosenberg/cockroach that referenced this issue Apr 21, 2023
TBD: summarize

Epic: none

Release note: None

Resolves: cockroachdb#33377
srosenberg added a commit to srosenberg/cockroach that referenced this issue Apr 25, 2023
TBD: summarize

Epic: none

Release note: None

Resolves: cockroachdb#33377

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!
