
disrupt_decommission_streaming_err - "status in nodetool.status is UL, but status in gossip NORMAL" #7067

Closed
roydahan opened this issue Jan 4, 2024 · 8 comments


roydahan commented Jan 4, 2024

Happens in:

https://argus.scylladb.com/test/a5d1f97b-064a-40ed-a517-70e2092b51c2/runs?additionalRuns[]=38e1b036-3163-4f2a-92f4-5f66f3b0a116,

Discussion from @temichus:

I see that we hit:

    except NodeStayInClusterAfterDecommission:
        self.log.debug('The decommission of target node is successfully interrupted')

which produced this log line:

< t:2023-12-31 19:05:59,665 f:nemesis.py l:3777 c:sdcm.nemesis p:DEBUG > sdcm.nemesis.SisyphusMonkey: The decommission of target node is successfully interrupted

nodetool status output:

< t:2023-12-31 19:14:16,553 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Status=Up/Down
< t:2023-12-31 19:14:16,553 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > |/ State=Normal/Leaving/Joining/Moving
< t:2023-12-31 19:14:16,555 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > --  Address      Load       Tokens       Owns    Host ID                               Rack
< t:2023-12-31 19:14:16,560 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > UL  10.12.6.205  14.73 GB   256          ?       cb214edc-0c52-4878-8301-9166f480e1f3  1b
< t:2023-12-31 19:14:16,561 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > UN  10.12.1.56   14.09 GB   256          ?       3e776526-92ec-45e3-b7c4-0f189d5692a6  1a
< t:2023-12-31 19:14:16,563 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > UN  10.12.5.26   25.76 GB   256          ?       f43c8b8b-ed43-41b6-a21e-6e83ca454ac8  1b
< t:2023-12-31 19:14:16,566 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > UN  10.12.2.171  15.7 GB    256          ?       9e0b03b0-65a0-442c-acd4-4d7bbc78215d  1a
< t:2023-12-31 19:14:16,567 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > UN  10.12.8.37   14.88 GB   256          ?       781999f6-f067-4b47-bbaf-320fd5b17b83  1c
< t:2023-12-31 19:14:16,569 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > UN  10.12.8.243  14.92 GB   256          ?       ad7c88b4-ae78-445e-9370-2043e9e4d1f7  1c

One node is in UL state, and then the function wait_node_fully_start is called, which for some reason wants "all nodes to be Up Normal".

But I think the UL state will be removed only after:

    self.target_node.run_nodetool(sub_cmd="rebuild", retry=0)

It looks like an SCT logic issue. @fruch please double-check me.
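For context, a minimal sketch (not SCT's actual implementation) of what an "all nodes must be Up Normal" check like wait_node_fully_start effectively does; the helper names, the parsing, and the get_status callable are hypothetical:

    import re
    import time

    def all_nodes_up_normal(nodetool_status_output: str) -> bool:
        """Return True only if every data row of `nodetool status` reports UN."""
        states = []
        for line in nodetool_status_output.splitlines():
            # Data rows start with a two-letter state code (UN, UL, DN, ...) followed by an IP.
            match = re.match(r"^([UD][NLJM])\s+\d+\.\d+\.\d+\.\d+", line.strip())
            if match:
                states.append(match.group(1))
        return bool(states) and all(state == "UN" for state in states)

    def wait_all_nodes_up_normal(get_status, timeout=600, poll=10):
        """Poll until every node is UN; get_status returns raw `nodetool status` output.

        A node left in UL (Up/Leaving) by an interrupted decommission keeps this
        loop failing until the timeout, which is the situation shown above.
        """
        deadline = time.time() + timeout
        while time.time() < deadline:
            if all_nodes_up_normal(get_status()):
                return
            time.sleep(poll)
        raise TimeoutError(f"not all nodes reached UN within {timeout} seconds")

With the status output above, such a check keeps failing because of the UL row for 10.12.6.205.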

cc @aleksbykov


roydahan commented Jan 4, 2024

Something should take the node out of UL. Maybe the behavior changed: the reboot that used to stop the decommission also used to restore the node from UL to UN.

We need to make sure this nemesis isn't broken and won't fail our runs.


fruch commented Jan 4, 2024

Rebuild certainly doesn't change any state.

This nemesis has so many options that we can't really tell which flow it is running, or whether the behavior is correct.

I still think it should be split into separate nemeses, each doing a specific, well-defined process; there are way too many degrees of freedom in this one.

@aleksbykov

The problem is that nodetool decommission had been running for longer than 3 minutes and the command was aborted; the decommission itself ran for longer than 10 minutes. These two factors caused the decommission-interrupt process to exit too early.
The decommission ran longer because reshaping of the CDC tables was additionally running:

Dec 31 19:13:47.487871 longevity-cdc-100gb-4h-5-4-db-node-38e1b036-4 scylla[6146]:  [shard  8:stre] compaction_manager - Starting off-strategy compaction for cdc_test.test_table_preimage_postimage_scylla_cdc_log compaction_group=0/1, 11 candidates were found
Dec 31 19:13:47.488168 longevity-cdc-100gb-4h-5-4-db-node-38e1b036-4 scylla[6146]:  [shard  8:stre] compaction - [Reshape cdc_test.test_table_preimage_postimage_scylla_cdc_log 

As a quick patch, we can increase the timeouts for nodetool.
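For illustration only, a minimal sketch of the "increase the timeouts" idea, using a plain subprocess call rather than SCT's run_nodetool wrapper; the helper name and the timeout value are assumptions, not the actual patch:

    import subprocess

    # Assumed value for illustration: decommission can legitimately take hours
    # (streaming plus reshaping of the CDC tables), so the limit must be far
    # larger than the few minutes that caused the abort here.
    DECOMMISSION_TIMEOUT = 4 * 60 * 60  # seconds

    def run_decommission(timeout: int = DECOMMISSION_TIMEOUT) -> None:
        # Run `nodetool decommission` and only give up after the generous timeout.
        subprocess.run(
            ["nodetool", "decommission"],
            check=True,       # raise CalledProcessError on a non-zero exit
            timeout=timeout,  # raise subprocess.TimeoutExpired if exceeded
        )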


roydahan commented Jan 9, 2024

The problem is that nodetool decommission had been running for longer than 3 minutes and the command was aborted; the decommission itself ran for longer than 10 minutes. These two factors caused the decommission-interrupt process to exit too early. The decommission ran longer because reshaping of the CDC tables was additionally running:

Dec 31 19:13:47.487871 longevity-cdc-100gb-4h-5-4-db-node-38e1b036-4 scylla[6146]:  [shard  8:stre] compaction_manager - Starting off-strategy compaction for cdc_test.test_table_preimage_postimage_scylla_cdc_log compaction_group=0/1, 11 candidates were found
Dec 31 19:13:47.488168 longevity-cdc-100gb-4h-5-4-db-node-38e1b036-4 scylla[6146]:  [shard  8:stre] compaction - [Reshape cdc_test.test_table_preimage_postimage_scylla_cdc_log

As a quick patch, we can increase the timeouts for nodetool.

@aleksbykov I'm not sure I understand: what ran for more than 3 minutes? The decommission command, or the nodetool status command?
Decommission can take hours...

temichus removed their assignment Jan 19, 2024
@aleksbykov

With the fix, the job passes.
Several jobs are running for different versions: master / 5.4 / 2024.1

@aleksbykov

PR: #7144


temichus commented Feb 1, 2024

pr merged

temichus closed this as completed Feb 1, 2024