CreateIndex nemesis was started (and failed) on a node that was previously terminated by NodeTerminateAndReplace parallel nemesis #8198

Closed
dimakr opened this issue Jul 30, 2024 · 2 comments · Fixed by #8381

dimakr commented Jul 30, 2024

Packages

Scylla version: 2024.1.8-20240724.fc3e399a25f3 with build-id 646cf933d8926947ade5b2a7cbc5bacb145df4fb
Kernel Version: 5.15.0-1066-aws

Issue description

During the enterprise-2024.1/longevity/longevity-multidc-schema-topology-changes-12h-test#26 run, the disrupt_terminate_and_replace_node and disrupt_create_index nemeses were started in parallel and both targeted the same node, node-7.
NodeTerminateAndReplace nemesis started node termination at 02:43:52 and finished at 02:49:38:

2024-07-25 02:43:52,385 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-07-25 02:43:52.383: (InfoEvent Severity.NORMAL) period_type=not-set event_id=b225fcb9-0252-4021-b079-81bdd9c5508c: message=StartEvent - Terminate node and wait 5 minutes
...
2024-07-25 02:49:38,553 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-07-25 02:49:38.552: (InfoEvent Severity.NORMAL) period_type=not-set event_id=7f4d8643-394f-4b82-b161-8b1661ea938f: message=FinishEvent - target_node was terminated

CreateIndex tried to start index creation on node-7 at 02:54:47:

2024-07-25 02:54:47,644 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-07-25 02:54:47.641: (InfoEvent Severity.NORMAL) period_type=not-set event_id=3f17c704-6a15-44d2-8442-7f30371e414f: message=Starting creating index: keyspace1.standard2(c1)

and eventually failed as the node was no longer available, with the error:

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5094, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4758, in disrupt_create_index
    wait_for_index_to_be_built(self.target_node, ks, index_name, timeout=timeout * 2)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/nemesis_utils/indexes.py", line 73, in wait_for_index_to_be_built
    wait_for_view_to_be_built(node=node, ks=ks, view_name=f'{index_name}_index', timeout=timeout)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/nemesis_utils/indexes.py", line 80, in wait_for_view_to_be_built
    result = node.run_nodetool(f"viewbuildstatus {ks}.{view_name}",
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2605, in run_nodetool
    runner(cmd, timeout=timeout, ignore_status=ignore_status, verbose=verbose, retry=retry)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 614, in run
    result = _run()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 70, in inner
    return func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 607, in _run
    if self._run_on_retryable_exception(exc, new_session):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_libssh_cmd_runner.py", line 78, in _run_on_retryable_exception
    raise RetryableNetworkException(str(exc), original=exc)
sdcm.remote.base.RetryableNetworkException: Failed to run a command due to exception!
Command: '/usr/bin/nodetool  viewbuildstatus keyspace1.standard2_c1_nemesis_index '
Stdout:
Stderr:
Exception:  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 588, in run
    self.connect()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 524, in connect
    raise ConnectTimeout(ex_msg) from exc
Failed to connect in 60 seconds, last error: (ConnectError)Error connecting to host '10.3.1.199:22' - timed out

Impact

Parallel nemeses affected one another in a disruptive manner.

How frequently does it reproduce?

No other occurrences of the issue were noticed.

Installation details

Cluster size: 12 nodes (i3en.2xlarge)

Scylla Nodes used in this run:
No resources left at the end of the run

OS / Image: ami-072fc07743bf86cd3 ami-09a43832bb62c9b19 (aws: undefined_region)

Test: longevity-multidc-schema-topology-changes-12h-test
Test id: 97c11d18-65ec-4dfa-9b9d-70ba669c3f11
Test name: enterprise-2024.1/longevity/longevity-multidc-schema-topology-changes-12h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 97c11d18-65ec-4dfa-9b9d-70ba669c3f11
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 97c11d18-65ec-4dfa-9b9d-70ba669c3f11

Logs:

Jenkins job URL
Argus

@dimakr dimakr removed their assignment Jul 30, 2024

soyacz commented Jul 31, 2024

Hmm, shouldn't a nemesis select a node that is not already the target_node of a parallel nemesis?

fruch commented Aug 15, 2024

The problem is that both nemeses were the first to pick this node:

sdcm.nemesis.SisyphusMonkey: Current Target: Node parallel-topology-schema-changes-mu-db-node-97c11d18-7 [13.40.68.247 | 10.3.1.199] (dc name: eu-west-2scylla_node_west, rack: 2a) with running nemesis: None

`running nemesis: None` is the problem: it means both nemeses selected the node while it wasn't yet decorated with a running nemesis.
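
The race described above can be illustrated with a minimal sketch. It is not the SCT code: `Node`, `pick_free_node` and `mark_node` are hypothetical stand-ins for the separate "select a target" and "record the running nemesis" steps, and the interleaving is written out by hand instead of using threads so that it is deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    running_nemesis: Optional[str] = None  # None means "free to be targeted"


def pick_free_node(nodes):
    """Step 1 (hypothetical): choose the first node no nemesis has claimed yet."""
    return next(node for node in nodes if node.running_nemesis is None)


def mark_node(node, nemesis_name):
    """Step 2 (hypothetical): record which nemesis is running on the node."""
    node.running_nemesis = nemesis_name


nodes = [Node("node-7"), Node("node-8")]

# Both nemeses run step 1 before either of them reaches step 2,
# so both still see running_nemesis == None on node-7.
target_a = pick_free_node(nodes)
target_b = pick_free_node(nodes)
mark_node(target_a, "disrupt_terminate_and_replace_node")
mark_node(target_b, "disrupt_create_index")

print(target_a is target_b)  # True: both nemeses ended up on node-7
```

Because selecting and marking are two separate steps, nothing prevents the second nemesis from choosing a node that the first one has picked but not yet marked.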

fruch added a commit to fruch/scylla-cluster-tests that referenced this issue Aug 15, 2024
…ion`

before there were two separate calls to `set_target_node` and to
`set_current_disruption` that could end up with `set_target_node`
setting None to the `target_node.running_nemesis`

this fix moves `set_current_disruption` into `set_target_node`
to avoid this problem, and also introduces a new parameter to
`set_target_node` so any user of it can set the data
that would be saved on the target node (it's only for debugging)

Fixes: scylladb#8198
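
The idea in this commit message, selecting the node and recording the running nemesis in a single step, might look roughly like the sketch below. It is a simplified illustration under stated assumptions, not the actual change in `sdcm/nemesis.py`: the `Node` class, the module-level `_selection_lock`, the `nodes` argument and the `current_disruption` parameter name are all hypothetical.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    running_nemesis: Optional[str] = None  # None means "free to be targeted"


# Hypothetical lock shared by every nemesis thread.
_selection_lock = threading.Lock()


def set_target_node(nodes, current_disruption):
    """Pick a free node and mark it as busy in one step, under the lock."""
    with _selection_lock:
        for node in nodes:
            if node.running_nemesis is None:
                # Marking happens before the lock is released, so no other
                # nemesis can observe this node as free.
                node.running_nemesis = current_disruption
                return node
    raise RuntimeError("no free node available")


nodes = [Node("node-7"), Node("node-8")]
results = {}


def run(nemesis_name):
    results[nemesis_name] = set_target_node(nodes, current_disruption=nemesis_name)


names = ("disrupt_terminate_and_replace_node", "disrupt_create_index")
threads = [threading.Thread(target=run, args=(name,)) for name in names]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

With the check and the assignment inside the same critical section, two parallel nemeses always receive different nodes, whichever thread acquires the lock first.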
@fruch fruch self-assigned this Aug 15, 2024
fruch added a commit that referenced this issue Aug 27, 2024
mergify bot pushed a commit that referenced this issue Aug 27, 2024
(cherry picked from commit a12ee91)
mergify bot pushed a commit that referenced this issue Aug 27, 2024
(cherry picked from commit a12ee91)

# Conflicts:
#	sdcm/nemesis.py
fruch added a commit that referenced this issue Aug 28, 2024
(cherry picked from commit a12ee91)