Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test node operations with streaming (NOT with RBNO) #8289

Closed
mykaul opened this issue Aug 7, 2024 · 9 comments · Fixed by #8402
Closed

Test node operations with streaming (NOT with RBNO) #8289

mykaul opened this issue Aug 7, 2024 · 9 comments · Fixed by #8402
Assignees

Comments

@mykaul
Copy link
Contributor

mykaul commented Aug 7, 2024

We apparently do not test node operations with streaming - only with RBNO, which is the default. That's fine for the majority of the tests, but we need some sanity around streaming. Please make sure we have simple add/decommission/replace nemsis tests with streaming (RBNO disabled).

CC @kbr-scylla

@mykaul mykaul removed their assignment Aug 7, 2024
@pehala
Copy link
Contributor

pehala commented Aug 8, 2024

@kbr-scylla @fruch
Are you aware of any problems that GrowShrinkClusterNemesis, TerminateAndRemoveNodeMonkey, NodeTerminateAndReplace, DecommissionMonkey might have with streaming instead of RBNO? If not, then this task would be just adding the respective jobs/testcases into our regular CI?

@mykaul
Copy link
Contributor Author

mykaul commented Aug 8, 2024

@kbr-scylla @fruch Are you aware of any problems that GrowShrinkClusterNemesis, TerminateAndRemoveNodeMonkey, NodeTerminateAndReplace, DecommissionMonkey might have with streaming instead of RBNO? If not, then this task would be just adding the respective jobs/testcases into our regular CI?

The only question is if we wish to clone more jobs and change the method from RBNO to streaming. I prefer converting SOME of the existing jobs to use streaming instead of RBNO. I understand the concern with randomizing - it is less predicatable - but it also has its value as well, as it'll ensure we continue to test more features, they will be tested with streaming. I do suggest we just change some existing longevities to use streaming for the time being.

@fruch
Copy link
Contributor

fruch commented Aug 8, 2024

@kbr-scylla @fruch Are you aware of any problems that GrowShrinkClusterNemesis, TerminateAndRemoveNodeMonkey, NodeTerminateAndReplace, DecommissionMonkey might have with streaming instead of RBNO? If not, then this task would be just adding the respective jobs/testcases into our regular CI?

The only question is if we wish to clone more jobs and change the method from RBNO to streaming. I prefer converting SOME of the existing jobs to use streaming instead of RBNO. I understand the concern with randomizing - it is less predicatable - but it also has its value as well, as it'll ensure we continue to test more features, they will be tested with streaming. I do suggest we just change some existing longevities to use streaming for the time being.

Adding new, or converting some, both are o.k.

Someone that owns the feature and its testing can take those calls.

Since for quite some time it wasn't tested with streaming, we don't have any information regarding any part that is ready for that or not, at some point they did work with streaming.

Picking the cases, should be random, but it might.

We are not gonna randomize it at test run time, every time we did such a thing, it waste x10 of people time, chasing the wind with this no on remembered is randomized

@kbr-scylla
Copy link

We are not gonna randomize it at test run time, every time we did such a thing, it waste x10 of people time, chasing the wind with this no on remembered is randomized

What exactly was the problem? Yes it can waste people's time, but I think it won't if we do it in a controlled manner: it must be clear which parameters were randomized in this test run, and how to rerun the test with exactly the same param values (most conveniently by passing the seed that was used)

@kbr-scylla
Copy link

Note that there's a lot of randomness in longevity already. cassandra-stress loads are generated by random distributions. And this is the less convenient randomization case: you cannot really repeat what cassandra-stress did in a given run, it all depends on timing and the environment etc.

The kind of randomization I propose is much more manageable

@fruch
Copy link
Contributor

fruch commented Aug 8, 2024

Note that there's a lot of randomness in longevity already. cassandra-stress loads are generated by random distributions. And this is the less convenient randomization case: you cannot really repeat what cassandra-stress did in a given run, it all depends on timing and the environment etc.

The kind of randomization I propose is much more manageable

I.e. someone would need to manage it, are you volunteering ? :)

@soyacz
Copy link
Contributor

soyacz commented Aug 12, 2024

one idea would be enhancing SCT to include randomized params along with disruption name in Nemesis tab in Argus (and a log message in sct.log, close to disruption start if possible). So it's much clearer how/what we tested.

@fruch
Copy link
Contributor

fruch commented Aug 12, 2024

one idea would be enhancing SCT to include randomized params along with disruption name in Nemesis tab in Argus (and a log message in sct.log, close to disruption start if possible). So it's much clearer how/what we tested.

it might be relevant, if it's a nemesis level call,
but we are talking about scylla configuration, that control which operation is RBNO and which is not, so it's on a test case level.

@pehala
Copy link
Contributor

pehala commented Aug 12, 2024

one idea would be enhancing SCT to include randomized params along with disruption name in Nemesis tab in Argus (and a log message in sct.log, close to disruption start if possible). So it's much clearer how/what we tested.

I think that is good idea regardless, I think SCT needs more transparency in general, but I do not think it solves the randomization problem completely. By adding another layer of randomization we are lowering chances that the configuration we want happens, add additional level of review requires (did we run it with streaming or rbno, do we need to run again to test the other one as well?) and make in even less transparent that it is today.

I also think randomization is not scalable, truth be told, current proposed solution (i.e. switching some tests to streaming) is also not scalable, and we need to discuss after this "hotfix" how to deal with these issues in the future, but at least it does not make SCT even less transparent

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants