disrupt_disable_binary_gossip_execute_major_compaction nemesis always fails on its 'gate closed' assertion, not checking CQL and UN state as it should #6819

Closed
vponomaryov opened this issue Nov 17, 2023 · 2 comments · Fixed by #7137


vponomaryov commented Nov 17, 2023

When the disrupt_disable_binary_gossip_execute_major_compaction nemesis is run against a DB node with Scylla version 2023.1.2 on K8S, the following error appears every time:

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5080, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4973, in disrupt_disable_binary_gossip_execute_major_compaction
    assert not gate_closed_appearing, \
AssertionError: After re-enabling binary and gossip, 'gate closed' messages continue to appear
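
For context, here is a simplified, hypothetical sketch (not the actual sdcm/nemesis.py code) of the kind of log-based check behind this assertion: after re-enabling binary and gossip, the nemesis looks for new 'gate closed' lines in the DB log and fails if any appeared.

# Hypothetical sketch of the log-based check; log_path and start_offset are
# illustrative parameters, the real nemesis uses SCT's log followers.
import time


def gate_closed_lines_since(log_path: str, start_offset: int) -> list[str]:
    """Return 'gate closed' log lines written after the given file offset."""
    with open(log_path, encoding="utf-8", errors="replace") as log:
        log.seek(start_offset)
        return [line for line in log if "gate closed" in line]


def assert_no_gate_closed(log_path: str, start_offset: int, grace: int = 60) -> None:
    # Give the node a grace period after re-enabling binary/gossip,
    # then assert that no new "gate closed" messages showed up.
    time.sleep(grace)
    gate_closed_appearing = gate_closed_lines_since(log_path, start_offset)
    assert not gate_closed_appearing, \
        "After re-enabling binary and gossip, 'gate closed' messages continue to appear"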

The DB node logs do indeed contain 'gate closed' messages for a long time.
However, the flood of those messages is the only problem I observe:

INFO  2023-11-15 14:42:17,680 [shard  6] compaction - [Compact keyspace1.standard1 22988180-83c5-11ee-ad04-52268e54852b] Compacted 2 sstables to [/var/lib/scylla/data/keyspace1/standard1-ed6d804083c011ee892052258e54852b/me-48-big-Data.db:level=0]. 1GB to 1GB (~100% of original) in 16927ms = 86MB/s. ~1349632 total partitions merged to 1349466.
INFO  2023-11-15 14:49:17,084 [shard  0] cql_server_controller - CQL server stopped
I1115 14:49:25.269323       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-eu-west-1-rack-1-0"
I1115 14:49:25.272549       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-eu-west-1-rack-1-0"
INFO  2023-11-15 14:49:27,440 [shard  3] api - stop_gossiping
WARN  2023-11-15 14:49:27,440 [shard  0] storage_service - Stopping gossip by operator request
INFO  2023-11-15 14:49:27,440 [shard  0] gossip - My status = NORMAL
INFO  2023-11-15 14:49:27,440 [shard  0] gossip - Announcing shutdown
INFO  2023-11-15 14:49:27,446 [shard  0] storage_service - Node 10.4.9.183 state jump to normal
...
INFO  2023-11-15 14:49:27,470 [shard  0] gossip - Sending a GossipShutdown to 10.4.8.142 with generation 1700055546
INFO  2023-11-15 14:49:27,470 [shard  0] gossip - Sending a GossipShutdown to 10.0.8.238 with generation 1700055546
INFO  2023-11-15 14:49:27,470 [shard  0] gossip - Sending a GossipShutdown to 10.0.11.23 with generation 1700055546
INFO  2023-11-15 14:49:27,470 [shard  0] gossip - Sending a GossipShutdown to 10.0.11.5 with generation 1700055546
INFO  2023-11-15 14:49:27,470 [shard  0] gossip - Sending a GossipShutdown to 10.4.9.133 with generation 1700055546
...
INFO  2023-11-15 14:49:27,496 [shard  4] compaction - [Compact system.peers 2cdf9ec0-83c6-11ee-a9e1-52218e54852b] Compacted 2 sstables to [/var/lib/scylla/data/system/peers-37f71aca7dc2383ba70672528af04d4f/me-46-big-Data.db:level=0]. 28kB to 14kB (~50% of original) in 8ms = 3MB/s. ~256 total partitions merged to 1.
INFO  2023-11-15 14:49:28,942 [shard  0] rpc - client 10.4.9.133:50792 msg_id 2:  exception "gate closed" in no_wait handler ignored
INFO  2023-11-15 14:49:29,323 [shard  0] rpc - client 10.4.9.133:50792 msg_id 3:  exception "gate closed" in no_wait handler ignored
INFO  2023-11-15 14:49:29,323 [shard  0] rpc - client 10.4.9.133:50792 msg_id 4:  exception "gate closed" in no_wait handler ignored
INFO  2023-11-15 14:49:29,470 [shard  0] gossip - Disable and wait for gossip loop started
INFO  2023-11-15 14:49:30,447 [shard  0] rpc - client 10.4.8.142:59864 msg_id 4:  exception "gate closed" in no_wait handler ignored
INFO  2023-11-15 14:49:31,016 [shard  0] gossip - failure_detector_loop: Finished main loop
INFO  2023-11-15 14:49:31,017 [shard  0] gossip - Gossip is now stopped
I1115 14:49:35.267692       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-eu-west-1-rack-1-0"
I1115 14:49:45.267849       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-eu-west-1-rack-1-0"
I1115 14:49:55.268491       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-eu-west-1-rack-1-0"
I1115 14:50:05.268160       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-eu-west-1-rack-1-0"
INFO  2023-11-15 14:50:05,769 [shard  0] compaction_manager - User initiated compaction started on behalf of system_traces.node_slow_log_time_idx
...
INFO  2023-11-15 14:50:52,282 [shard  0] compaction - [Compact keyspace1.standard1 4f840740-83c6-11ee-8920-52258e54852b] Compacted 2 sstables to [/var/lib/scylla/data/keyspace1/standard1-ed6d804083c011ee892052258e54852b/me-98-big-Data.db:level=0]. 2GB to 1GB (~66% of original) in 26691ms = 90MB/s. ~2207488 total partitions merged to 1463046.
I1115 14:50:53.229163       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-eu-west-1-rack-1-0"
INFO  2023-11-15 14:50:53,538 [shard 12] api - start_gossiping
WARN  2023-11-15 14:50:53,538 [shard  0] storage_service - Starting gossip by operator request
INFO  2023-11-15 14:50:53,553 [shard  0] storage_service - Node 10.4.9.183 state jump to normal
INFO  2023-11-15 14:50:53,587 [shard  0] gossip - failure_detector_loop: Started main loop
INFO  2023-11-15 14:50:53,653 [shard 12] rpc - client 10.0.11.142:61332 msg_id 19:  exception "gate closed" in no_wait handler ignored
INFO  2023-11-15 14:50:54,416 [shard  0] cql_server_controller - Enabling encrypted CQL connections between client and server
INFO  2023-11-15 14:50:54,467 [shard  0] cql_server_controller - Starting listening for CQL clients on 0.0.0.0:9042 (unencrypted, non-shard-aware)
INFO  2023-11-15 14:50:54,467 [shard  0] cql_server_controller - Starting listening for CQL clients on 0.0.0.0:19042 (unencrypted, shard-aware)
INFO  2023-11-15 14:50:54,493 [shard  0] cql_server_controller - Starting listening for CQL clients on 0.0.0.0:9142 (encrypted, non-shard-aware)
INFO  2023-11-15 14:50:54,493 [shard  0] cql_server_controller - Starting listening for CQL clients on 0.0.0.0:19142 (encrypted, shard-aware)
INFO  2023-11-15 14:50:55,629 [shard 12] rpc - client 10.0.11.142:61332 msg_id 20:  exception "gate closed" in no_wait handler ignored
INFO  2023-11-15 14:50:55,722 [shard  2] rpc - client 10.0.10.164:22304 msg_id 18:  exception "gate closed" in no_wait handler ignored

According to https://github.com/scylladb/scylla-enterprise/issues/2897, we should check that CQL and the gossiper work.
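
As a hedged illustration (not the SCT helper API), the gossiper side of that check can be done with nodetool statusgossip, which prints "running" when gossip is active:

# Illustrative sketch; the host argument is a placeholder.
import subprocess


def gossip_is_running(host: str) -> bool:
    out = subprocess.run(
        ["nodetool", "-h", host, "statusgossip"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.strip().lower() == "running"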

So, if we look at the nodetool status results, we see that the target node is reported as UN:

2023-11-15 14:51:42,056 f:remote_base.py  l:521  c:KubernetesCmdRunner  p:DEBUG > Running command "/usr/bin/nodetool  status "...
2023-11-15 14:51:44,486 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > Datacenter: eu-north-1
2023-11-15 14:51:44,486 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > ======================
2023-11-15 14:51:44,487 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > Status=Up/Down
2023-11-15 14:51:44,487 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > |/ State=Normal/Leaving/Joining/Moving
2023-11-15 14:51:44,493 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > --  Address     Load       Tokens       Owns    Host ID                               Rack
2023-11-15 14:51:44,494 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.0.11.5   31.73 GB   256          ?       91fc2434-72dc-4ce1-b86a-ea77cb9e5f39  rack-1
2023-11-15 14:51:44,495 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.0.11.23  33.48 GB   256          ?       bb581f72-5b7a-416a-a913-d358f99061fa  rack-1
2023-11-15 14:51:44,497 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.0.8.238  31.73 GB   256          ?       d0820b1c-a78c-43d2-aaeb-32d91ed0b747  rack-1
2023-11-15 14:51:44,497 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > Datacenter: eu-west-1
2023-11-15 14:51:44,497 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > =====================
2023-11-15 14:51:44,497 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > Status=Up/Down
2023-11-15 14:51:44,497 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > |/ State=Normal/Leaving/Joining/Moving
2023-11-15 14:51:44,498 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > --  Address     Load       Tokens       Owns    Host ID                               Rack
2023-11-15 14:51:44,500 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.4.9.133  31.72 GB   256          ?       08f2fd07-15b0-45f7-92a6-c0d220d9185d  rack-1
2023-11-15 14:51:44,501 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.4.9.183  31.34 GB   256          ?       3298b3b1-a043-49e3-b8c2-be264bde09d9  rack-1
2023-11-15 14:51:44,501 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.4.8.142  31.72 GB   256          ?       db040b36-7297-46b1-8915-78acbe137076  rack-1
2023-11-15 14:51:44,503 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG >               
2023-11-15 14:51:44,503 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
2023-11-15 14:51:44,552 f:base.py         l:142  c:KubernetesCmdRunner  p:DEBUG > Command "/usr/bin/nodetool  status " finished with status 0
...
2023-11-15 14:58:41,975 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > Datacenter: eu-north-1
2023-11-15 14:58:41,976 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > ======================
2023-11-15 14:58:41,976 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > Status=Up/Down
2023-11-15 14:58:41,976 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > |/ State=Normal/Leaving/Joining/Moving
2023-11-15 14:58:41,977 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > --  Address     Load       Tokens       Owns    Host ID                               Rack
2023-11-15 14:58:41,979 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.0.11.5   29.14 GB   256          ?       91fc2434-72dc-4ce1-b86a-ea77cb9e5f39  rack-1
2023-11-15 14:58:41,981 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.0.11.23  30.72 GB   256          ?       bb581f72-5b7a-416a-a913-d358f99061fa  rack-1
2023-11-15 14:58:41,982 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.0.8.238  29.14 GB   256          ?       d0820b1c-a78c-43d2-aaeb-32d91ed0b747  rack-1
2023-11-15 14:58:41,982 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > Datacenter: eu-west-1
2023-11-15 14:58:41,983 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > =====================
2023-11-15 14:58:41,983 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > Status=Up/Down
2023-11-15 14:58:41,983 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > |/ State=Normal/Leaving/Joining/Moving
2023-11-15 14:58:41,984 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > --  Address     Load       Tokens       Owns    Host ID                               Rack
2023-11-15 14:58:41,984 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.4.9.133  29.14 GB   256          ?       08f2fd07-15b0-45f7-92a6-c0d220d9185d  rack-1
2023-11-15 14:58:41,987 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.4.9.183  33.53 GB   256          ?       3298b3b1-a043-49e3-b8c2-be264bde09d9  rack-1
2023-11-15 14:58:41,988 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.4.8.142  29.13 GB   256          ?       db040b36-7297-46b1-8915-78acbe137076  rack-1
2023-11-15 14:58:41,988 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG >               
2023-11-15 14:58:41,988 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
2023-11-15 14:58:42,068 f:base.py         l:142  c:KubernetesCmdRunner  p:DEBUG > Command "/usr/bin/nodetool  status " finished with status 0
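
A minimal sketch (illustrative, not SCT's actual parser) of asserting the property visible above: every node line in the raw nodetool status output, i.e. without the SCT log prefixes, reports the UN (Up/Normal) state.

# Illustrative sketch of a UN-state check over raw `nodetool status` output.
import re
import subprocess

NODE_LINE = re.compile(r"^([UD][NLJM])\s+\d{1,3}(?:\.\d{1,3}){3}\s", re.MULTILINE)


def all_nodes_un(status_output: str) -> bool:
    states = NODE_LINE.findall(status_output)
    return bool(states) and all(state == "UN" for state in states)


status = subprocess.run(["nodetool", "status"], capture_output=True, text=True).stdout
assert all_nodes_un(status), "Not all nodes are in UN state"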

Then, monitoring shows that the load on the node is dynamic, which we can take as an indication that it is serving requests:
Screenshot from 2023-11-17 13-07-46
Screenshot from 2023-11-17 13-08-00

So, it turns out that the referenced nemesis checks the wrong thing.
It should check the UN state and CQL availability, which was the original problem.
Also, it must restart the node when the bug is detected, because a restart is the remedy for that bug.
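
A hedged sketch of that proposed end-of-nemesis flow, reusing the all_nodes_un helper sketched above; the cassandra-driver dependency, the finalize_nemesis/cql_is_available names, and the restart command are assumptions (on K8S the operator restarts the Scylla pod rather than systemd).

# Hedged sketch of the proposed check: UN state + working CQL, with a node
# restart as the remedy if either check fails.
import subprocess

from cassandra.cluster import Cluster  # assumption: cassandra-driver is installed


def cql_is_available(host: str, port: int = 9042) -> bool:
    cluster = Cluster([host], port=port)
    try:
        cluster.connect().execute("SELECT release_version FROM system.local")
        return True
    except Exception:
        return False
    finally:
        cluster.shutdown()


def finalize_nemesis(target_host: str) -> None:
    status = subprocess.run(["nodetool", "status"],
                            capture_output=True, text=True, check=True).stdout
    if all_nodes_un(status) and cql_is_available(target_host):
        return
    # Restarting the node is the remedy for the underlying bug; on a plain VM
    # it would look roughly like this, on K8S it is a pod restart instead.
    subprocess.run(["sudo", "systemctl", "restart", "scylla-server"], check=True)
    raise AssertionError(
        "Node did not return to UN state with working CQL after "
        "re-enabling binary and gossip; the node was restarted as a remedy")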

Impact

Probably a false nemesis failure?

How frequently does it reproduce?

100%, using Scylla version 2023.1.2

Installation details

Kernel Version: 5.10.198-187.748.amzn2.x86_64
Scylla version (or git commit hash): 2023.1.2-20231001.646df23cc4b3 with build-id 367fcf1672d44f5cbddc88f946cf272e2551b85a

Operator Image: scylladb/scylla-operator:1.11.0
Operator Helm Version: v1.11.0
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/stable
Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run (pod IPs are ephemeral):
sct-cluster-eu-north-1-rack-1-0 | 10.0.11.23
sct-cluster-eu-north-1-rack-1-1 | 10.0.8.238
sct-cluster-eu-north-1-rack-1-2 | 10.0.11.5
sct-cluster-eu-west-1-rack-1-0 | 10.4.9.183
sct-cluster-eu-west-1-rack-1-1 | 10.4.9.133
sct-cluster-eu-west-1-rack-1-2 | 10.4.8.142

OS / Image: `` (k8s-eks: multi-dc: `eu-north-1` and `eu-west-1`)

Test: vp-longevity-scylla-operator-multidc-12h-eks
Test id: 181576c2-fca5-483c-a520-ce2108b9874a
Test name: scylla-staging/valerii/vp-longevity-scylla-operator-multidc-12h-eks
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 181576c2-fca5-483c-a520-ce2108b9874a
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 181576c2-fca5-483c-a520-ce2108b9874a

Logs:

Jenkins job URL
Argus

vponomaryov commented:

@shlomi, @fruch, @roy ^


fruch commented Nov 17, 2023

Since @ShlomiBalalis isn't available currently, we'll need to handle it ourselves

roydahan assigned fruch and unassigned roydahan and fruch Nov 26, 2023
vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this issue Jan 23, 2024
Check the gossip status and CQL workability in the end of the
'disrupt_disable_binary_gossip_execute_major_compaction' nemesis
instead of looking for the 'gate closed' message in DB logs.

Fixes: scylladb#6819
fruch pushed a commit that referenced this issue Jan 23, 2024
Check the gossip status and CQL workability in the end of the
'disrupt_disable_binary_gossip_execute_major_compaction' nemesis
instead of looking for the 'gate closed' message in DB logs.

Fixes: #6819
fruch pushed a commit that referenced this issue Jan 23, 2024
Check the gossip status and CQL workability in the end of the
'disrupt_disable_binary_gossip_execute_major_compaction' nemesis
instead of looking for the 'gate closed' message in DB logs.

Fixes: #6819
(cherry picked from commit 6f48e10)