
disrupt_decommission_streaming_err - "status in nodetool.status is UL, but status in gossip NORMAL" #7067

Closed
roydahan opened this issue Jan 4, 2024 · 8 comments


roydahan commented Jan 4, 2024

Happens in:

https://argus.scylladb.com/test/a5d1f97b-064a-40ed-a517-70e2092b51c2/runs?additionalRuns[]=38e1b036-3163-4f2a-92f4-5f66f3b0a116,

Discussion from @temichus:

I see that we hit:

    except NodeStayInClusterAfterDecommission:
        self.log.debug('The decommission of target node is successfully interrupted')

which produced this log line:

< t:2023-12-31 19:05:59,665 f:nemesis.py l:3777 c:sdcm.nemesis p:DEBUG > sdcm.nemesis.SisyphusMonkey: The decommission of target node is successfully interrupted

nodetool status output:

< t:2023-12-31 19:14:16,553 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Status=Up/Down
< t:2023-12-31 19:14:16,553 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > |/ State=Normal/Leaving/Joining/Moving
< t:2023-12-31 19:14:16,555 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > --  Address      Load       Tokens       Owns    Host ID                               Rack
< t:2023-12-31 19:14:16,560 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > UL  10.12.6.205  14.73 GB   256          ?       cb214edc-0c52-4878-8301-9166f480e1f3  1b
< t:2023-12-31 19:14:16,561 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > UN  10.12.1.56   14.09 GB   256          ?       3e776526-92ec-45e3-b7c4-0f189d5692a6  1a
< t:2023-12-31 19:14:16,563 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > UN  10.12.5.26   25.76 GB   256          ?       f43c8b8b-ed43-41b6-a21e-6e83ca454ac8  1b
< t:2023-12-31 19:14:16,566 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > UN  10.12.2.171  15.7 GB    256          ?       9e0b03b0-65a0-442c-acd4-4d7bbc78215d  1a
< t:2023-12-31 19:14:16,567 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > UN  10.12.8.37   14.88 GB   256          ?       781999f6-f067-4b47-bbaf-320fd5b17b83  1c
< t:2023-12-31 19:14:16,569 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > UN  10.12.8.243  14.92 GB   256          ?       ad7c88b4-ae78-445e-9370-2043e9e4d1f7  1c

One node is in UL state, and then the function wait_node_fully_start is called, which for some reason wants "all nodes to be Up Normal".

But I think the UL state will be removed only after:

    self.target_node.run_nodetool(sub_cmd="rebuild", retry=0)

It looks like an SCT logic issue. @fruch please double-check me.
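For context, a minimal sketch (not SCT's actual implementation) of what an "all nodes must be Up Normal" check like wait_node_fully_start effectively does; the helper names, the parsing, and the get_status callable are hypothetical:

    import re
    import time

    def all_nodes_up_normal(nodetool_status_output: str) -> bool:
        """Return True only if every data row of `nodetool status` reports UN."""
        states = []
        for line in nodetool_status_output.splitlines():
            # Data rows start with a two-letter state code (UN, UL, DN, ...) followed by an IP.
            match = re.match(r"^([UD][NLJM])\s+\d+\.\d+\.\d+\.\d+", line.strip())
            if match:
                states.append(match.group(1))
        return bool(states) and all(state == "UN" for state in states)

    def wait_all_nodes_up_normal(get_status, timeout=600, poll=10):
        """Poll until every node is UN; get_status returns raw `nodetool status` output.

        A node left in UL (Up/Leaving) by an interrupted decommission keeps this
        loop failing until the timeout, which is the situation shown above.
        """
        deadline = time.time() + timeout
        while time.time() < deadline:
            if all_nodes_up_normal(get_status()):
                return
            time.sleep(poll)
        raise TimeoutError(f"not all nodes reached UN within {timeout} seconds")

With the status output above, such a check keeps failing because of the UL row for 10.12.6.205.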

cc @aleksbykov


roydahan commented Jan 4, 2024

Something should take the node out of UL. Maybe the behavior changed: the reboot that used to stop the decommission also used to restore the node from UL to UN.

We need to make sure this nemesis isn't broken and won't fail our runs.


fruch commented Jan 4, 2024

Rebuild certainly doesn't change any state.

This nemesis has so many options that we can't really tell which flow it is running, or whether the behavior is correct.

I still think it should be split into separate nemeses, each doing a specific, well-defined process; there are way too many degrees of freedom in this one.

@aleksbykov

The problem is that nodetool decommission had been running for longer than 3 minutes and the command was aborted; the decommission itself ran for longer than 10 minutes. These two factors caused the decommission-interrupt process to exit too early.
The decommission ran longer because reshaping of the CDC tables was additionally running:

Dec 31 19:13:47.487871 longevity-cdc-100gb-4h-5-4-db-node-38e1b036-4 scylla[6146]:  [shard  8:stre] compaction_manager - Starting off-strategy compaction for cdc_test.test_table_preimage_postimage_scylla_cdc_log compaction_group=0/1, 11 candidates were found
Dec 31 19:13:47.488168 longevity-cdc-100gb-4h-5-4-db-node-38e1b036-4 scylla[6146]:  [shard  8:stre] compaction - [Reshape cdc_test.test_table_preimage_postimage_scylla_cdc_log 

As a quick patch, we can increase the timeouts for nodetool.
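For illustration only, a minimal sketch of the "increase the timeouts" idea, using a plain subprocess call rather than SCT's run_nodetool wrapper; the helper name and the timeout value are assumptions, not the actual patch:

    import subprocess

    # Assumed value for illustration: decommission can legitimately take hours
    # (streaming plus reshaping of the CDC tables), so the limit must be far
    # larger than the few minutes that caused the abort here.
    DECOMMISSION_TIMEOUT = 4 * 60 * 60  # seconds

    def run_decommission(timeout: int = DECOMMISSION_TIMEOUT) -> None:
        # Run `nodetool decommission` and only give up after the generous timeout.
        subprocess.run(
            ["nodetool", "decommission"],
            check=True,       # raise CalledProcessError on a non-zero exit
            timeout=timeout,  # raise subprocess.TimeoutExpired if exceeded
        )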


roydahan commented Jan 9, 2024

The problem is that nodetool decommission had been running for longer than 3 minutes and the command was aborted; the decommission itself ran for longer than 10 minutes. These two factors caused the decommission-interrupt process to exit too early. The decommission ran longer because reshaping of the CDC tables was additionally running:

Dec 31 19:13:47.487871 longevity-cdc-100gb-4h-5-4-db-node-38e1b036-4 scylla[6146]:  [shard  8:stre] compaction_manager - Starting off-strategy compaction for cdc_test.test_table_preimage_postimage_scylla_cdc_log compaction_group=0/1, 11 candidates were found
Dec 31 19:13:47.488168 longevity-cdc-100gb-4h-5-4-db-node-38e1b036-4 scylla[6146]:  [shard  8:stre] compaction - [Reshape cdc_test.test_table_preimage_postimage_scylla_cdc_log

As a quick patch, we can increase the timeouts for nodetool.

@aleksbykov I'm not sure I understand: what ran for more than 3 minutes? The decommission command, or the nodetool status command?
Decommission can take hours...

temichus removed their assignment Jan 19, 2024
@aleksbykov

With the fix, the job passes.
Several jobs are running for different versions: master / 5.4 / 2024.1

@aleksbykov

PR: #7144


temichus commented Feb 1, 2024

pr merged

temichus closed this as completed Feb 1, 2024