Test node operations with streaming (NOT with RBNO) #8289

mykaul · 2024-08-07T14:28:54Z

We apparently do not test node operations with streaming - only with RBNO, which is the default. That's fine for the majority of the tests, but we need some sanity around streaming. Please make sure we have simple add/decommission/replace nemsis tests with streaming (RBNO disabled).

CC @kbr-scylla

pehala · 2024-08-08T10:02:38Z

@kbr-scylla @fruch
Are you aware of any problems that GrowShrinkClusterNemesis, TerminateAndRemoveNodeMonkey, NodeTerminateAndReplace, DecommissionMonkey might have with streaming instead of RBNO? If not, then this task would be just adding the respective jobs/testcases into our regular CI?

mykaul · 2024-08-08T10:52:24Z

@kbr-scylla @fruch Are you aware of any problems that GrowShrinkClusterNemesis, TerminateAndRemoveNodeMonkey, NodeTerminateAndReplace, DecommissionMonkey might have with streaming instead of RBNO? If not, then this task would be just adding the respective jobs/testcases into our regular CI?

The only question is if we wish to clone more jobs and change the method from RBNO to streaming. I prefer converting SOME of the existing jobs to use streaming instead of RBNO. I understand the concern with randomizing - it is less predicatable - but it also has its value as well, as it'll ensure we continue to test more features, they will be tested with streaming. I do suggest we just change some existing longevities to use streaming for the time being.

fruch · 2024-08-08T11:04:23Z

@kbr-scylla @fruch Are you aware of any problems that GrowShrinkClusterNemesis, TerminateAndRemoveNodeMonkey, NodeTerminateAndReplace, DecommissionMonkey might have with streaming instead of RBNO? If not, then this task would be just adding the respective jobs/testcases into our regular CI?

The only question is if we wish to clone more jobs and change the method from RBNO to streaming. I prefer converting SOME of the existing jobs to use streaming instead of RBNO. I understand the concern with randomizing - it is less predicatable - but it also has its value as well, as it'll ensure we continue to test more features, they will be tested with streaming. I do suggest we just change some existing longevities to use streaming for the time being.

Adding new, or converting some, both are o.k.

Someone that owns the feature and its testing can take those calls.

Since for quite some time it wasn't tested with streaming, we don't have any information regarding any part that is ready for that or not, at some point they did work with streaming.

Picking the cases, should be random, but it might.

We are not gonna randomize it at test run time, every time we did such a thing, it waste x10 of people time, chasing the wind with this no on remembered is randomized

kbr-scylla · 2024-08-08T16:22:22Z

We are not gonna randomize it at test run time, every time we did such a thing, it waste x10 of people time, chasing the wind with this no on remembered is randomized

What exactly was the problem? Yes it can waste people's time, but I think it won't if we do it in a controlled manner: it must be clear which parameters were randomized in this test run, and how to rerun the test with exactly the same param values (most conveniently by passing the seed that was used)

kbr-scylla · 2024-08-08T16:23:32Z

Note that there's a lot of randomness in longevity already. cassandra-stress loads are generated by random distributions. And this is the less convenient randomization case: you cannot really repeat what cassandra-stress did in a given run, it all depends on timing and the environment etc.

The kind of randomization I propose is much more manageable

fruch · 2024-08-08T17:01:22Z

Note that there's a lot of randomness in longevity already. cassandra-stress loads are generated by random distributions. And this is the less convenient randomization case: you cannot really repeat what cassandra-stress did in a given run, it all depends on timing and the environment etc.

The kind of randomization I propose is much more manageable

I.e. someone would need to manage it, are you volunteering ? :)

soyacz · 2024-08-12T07:15:20Z

one idea would be enhancing SCT to include randomized params along with disruption name in Nemesis tab in Argus (and a log message in sct.log, close to disruption start if possible). So it's much clearer how/what we tested.

fruch · 2024-08-12T07:21:56Z

one idea would be enhancing SCT to include randomized params along with disruption name in Nemesis tab in Argus (and a log message in sct.log, close to disruption start if possible). So it's much clearer how/what we tested.

it might be relevant, if it's a nemesis level call,
but we are talking about scylla configuration, that control which operation is RBNO and which is not, so it's on a test case level.

pehala · 2024-08-12T07:23:23Z

one idea would be enhancing SCT to include randomized params along with disruption name in Nemesis tab in Argus (and a log message in sct.log, close to disruption start if possible). So it's much clearer how/what we tested.

I think that is good idea regardless, I think SCT needs more transparency in general, but I do not think it solves the randomization problem completely. By adding another layer of randomization we are lowering chances that the configuration we want happens, add additional level of review requires (did we run it with streaming or rbno, do we need to run again to test the other one as well?) and make in even less transparent that it is today.

I also think randomization is not scalable, truth be told, current proposed solution (i.e. switching some tests to streaming) is also not scalable, and we need to discuss after this "hotfix" how to deal with these issues in the future, but at least it does not make SCT even less transparent

mykaul assigned pehala Aug 7, 2024

github-actions bot assigned mykaul Aug 7, 2024

mykaul removed their assignment Aug 7, 2024

pehala added the area/streaming label Aug 8, 2024

pehala mentioned this issue Aug 19, 2024

Use streaming instead of RBNO for a few longevity tests #8402

Merged

4 tasks

fruch closed this as completed in #8402 Sep 11, 2024

mergify bot mentioned this issue Sep 11, 2024

[Backport 2024.2] Use streaming instead of RBNO for a few longevity tests #8648

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Test node operations with streaming (NOT with RBNO) #8289

Test node operations with streaming (NOT with RBNO) #8289

mykaul commented Aug 7, 2024

pehala commented Aug 8, 2024

mykaul commented Aug 8, 2024

fruch commented Aug 8, 2024

kbr-scylla commented Aug 8, 2024

kbr-scylla commented Aug 8, 2024

fruch commented Aug 8, 2024

soyacz commented Aug 12, 2024

fruch commented Aug 12, 2024

pehala commented Aug 12, 2024

Test node operations with streaming (NOT with RBNO) #8289

Test node operations with streaming (NOT with RBNO) #8289

Comments

mykaul commented Aug 7, 2024

pehala commented Aug 8, 2024

mykaul commented Aug 8, 2024

fruch commented Aug 8, 2024

kbr-scylla commented Aug 8, 2024

kbr-scylla commented Aug 8, 2024

fruch commented Aug 8, 2024

soyacz commented Aug 12, 2024

fruch commented Aug 12, 2024

pehala commented Aug 12, 2024