disrupt_disable_binary_gossip_execute_major_compaction nemesis always fails on its 'gate closed' assertion, not checking CQL and UN state as it should #6819

Closed
vponomaryov opened this issue Nov 17, 2023 · 2 comments · Fixed by #7137


vponomaryov commented Nov 17, 2023

When the disrupt_disable_binary_gossip_execute_major_compaction nemesis is run against a DB node with Scylla version 2023.1.2 on K8S, the following error appears every time:

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5080, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4973, in disrupt_disable_binary_gossip_execute_major_compaction
    assert not gate_closed_appearing, \
AssertionError: After re-enabling binary and gossip, 'gate closed' messages continue to appear
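
For context, here is a simplified, hypothetical sketch (not the actual sdcm/nemesis.py code) of the kind of log-based check behind this assertion: after re-enabling binary and gossip, the nemesis looks for new 'gate closed' lines in the DB log and fails if any appeared.

# Hypothetical sketch of the log-based check; log_path and start_offset are
# illustrative parameters, the real nemesis uses SCT's log followers.
import time


def gate_closed_lines_since(log_path: str, start_offset: int) -> list[str]:
    """Return 'gate closed' log lines written after the given file offset."""
    with open(log_path, encoding="utf-8", errors="replace") as log:
        log.seek(start_offset)
        return [line for line in log if "gate closed" in line]


def assert_no_gate_closed(log_path: str, start_offset: int, grace: int = 60) -> None:
    # Give the node a grace period after re-enabling binary/gossip,
    # then assert that no new "gate closed" messages showed up.
    time.sleep(grace)
    gate_closed_appearing = gate_closed_lines_since(log_path, start_offset)
    assert not gate_closed_appearing, \
        "After re-enabling binary and gossip, 'gate closed' messages continue to appear"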

The DB node logs do indeed contain 'gate closed' messages for a long time.
However, the flood of those messages is the only problem I observe:

INFO  2023-11-15 14:42:17,680 [shard  6] compaction - [Compact keyspace1.standard1 22988180-83c5-11ee-ad04-52268e54852b] Compacted 2 sstables to [/var/lib/scylla/data/keyspace1/standard1-ed6d804083c011ee892052258e54852b/me-48-big-Data.db:level=0]. 1GB to 1GB (~100% of original) in 16927ms = 86MB/s. ~1349632 total partitions merged to 1349466.
INFO  2023-11-15 14:49:17,084 [shard  0] cql_server_controller - CQL server stopped
I1115 14:49:25.269323       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-eu-west-1-rack-1-0"
I1115 14:49:25.272549       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-eu-west-1-rack-1-0"
INFO  2023-11-15 14:49:27,440 [shard  3] api - stop_gossiping
WARN  2023-11-15 14:49:27,440 [shard  0] storage_service - Stopping gossip by operator request
INFO  2023-11-15 14:49:27,440 [shard  0] gossip - My status = NORMAL
INFO  2023-11-15 14:49:27,440 [shard  0] gossip - Announcing shutdown
INFO  2023-11-15 14:49:27,446 [shard  0] storage_service - Node 10.4.9.183 state jump to normal
...
INFO  2023-11-15 14:49:27,470 [shard  0] gossip - Sending a GossipShutdown to 10.4.8.142 with generation 1700055546
INFO  2023-11-15 14:49:27,470 [shard  0] gossip - Sending a GossipShutdown to 10.0.8.238 with generation 1700055546
INFO  2023-11-15 14:49:27,470 [shard  0] gossip - Sending a GossipShutdown to 10.0.11.23 with generation 1700055546
INFO  2023-11-15 14:49:27,470 [shard  0] gossip - Sending a GossipShutdown to 10.0.11.5 with generation 1700055546
INFO  2023-11-15 14:49:27,470 [shard  0] gossip - Sending a GossipShutdown to 10.4.9.133 with generation 1700055546
...
INFO  2023-11-15 14:49:27,496 [shard  4] compaction - [Compact system.peers 2cdf9ec0-83c6-11ee-a9e1-52218e54852b] Compacted 2 sstables to [/var/lib/scylla/data/system/peers-37f71aca7dc2383ba70672528af04d4f/me-46-big-Data.db:level=0]. 28kB to 14kB (~50% of original) in 8ms = 3MB/s. ~256 total partitions merged to 1.
INFO  2023-11-15 14:49:28,942 [shard  0] rpc - client 10.4.9.133:50792 msg_id 2:  exception "gate closed" in no_wait handler ignored
INFO  2023-11-15 14:49:29,323 [shard  0] rpc - client 10.4.9.133:50792 msg_id 3:  exception "gate closed" in no_wait handler ignored
INFO  2023-11-15 14:49:29,323 [shard  0] rpc - client 10.4.9.133:50792 msg_id 4:  exception "gate closed" in no_wait handler ignored
INFO  2023-11-15 14:49:29,470 [shard  0] gossip - Disable and wait for gossip loop started
INFO  2023-11-15 14:49:30,447 [shard  0] rpc - client 10.4.8.142:59864 msg_id 4:  exception "gate closed" in no_wait handler ignored
INFO  2023-11-15 14:49:31,016 [shard  0] gossip - failure_detector_loop: Finished main loop
INFO  2023-11-15 14:49:31,017 [shard  0] gossip - Gossip is now stopped
I1115 14:49:35.267692       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-eu-west-1-rack-1-0"
I1115 14:49:45.267849       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-eu-west-1-rack-1-0"
I1115 14:49:55.268491       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-eu-west-1-rack-1-0"
I1115 14:50:05.268160       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-eu-west-1-rack-1-0"
INFO  2023-11-15 14:50:05,769 [shard  0] compaction_manager - User initiated compaction started on behalf of system_traces.node_slow_log_time_idx
...
INFO  2023-11-15 14:50:52,282 [shard  0] compaction - [Compact keyspace1.standard1 4f840740-83c6-11ee-8920-52258e54852b] Compacted 2 sstables to [/var/lib/scylla/data/keyspace1/standard1-ed6d804083c011ee892052258e54852b/me-98-big-Data.db:level=0]. 2GB to 1GB (~66% of original) in 26691ms = 90MB/s. ~2207488 total partitions merged to 1463046.
I1115 14:50:53.229163       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-eu-west-1-rack-1-0"
INFO  2023-11-15 14:50:53,538 [shard 12] api - start_gossiping
WARN  2023-11-15 14:50:53,538 [shard  0] storage_service - Starting gossip by operator request
INFO  2023-11-15 14:50:53,553 [shard  0] storage_service - Node 10.4.9.183 state jump to normal
INFO  2023-11-15 14:50:53,587 [shard  0] gossip - failure_detector_loop: Started main loop
INFO  2023-11-15 14:50:53,653 [shard 12] rpc - client 10.0.11.142:61332 msg_id 19:  exception "gate closed" in no_wait handler ignored
INFO  2023-11-15 14:50:54,416 [shard  0] cql_server_controller - Enabling encrypted CQL connections between client and server
INFO  2023-11-15 14:50:54,467 [shard  0] cql_server_controller - Starting listening for CQL clients on 0.0.0.0:9042 (unencrypted, non-shard-aware)
INFO  2023-11-15 14:50:54,467 [shard  0] cql_server_controller - Starting listening for CQL clients on 0.0.0.0:19042 (unencrypted, shard-aware)
INFO  2023-11-15 14:50:54,493 [shard  0] cql_server_controller - Starting listening for CQL clients on 0.0.0.0:9142 (encrypted, non-shard-aware)
INFO  2023-11-15 14:50:54,493 [shard  0] cql_server_controller - Starting listening for CQL clients on 0.0.0.0:19142 (encrypted, shard-aware)
INFO  2023-11-15 14:50:55,629 [shard 12] rpc - client 10.0.11.142:61332 msg_id 20:  exception "gate closed" in no_wait handler ignored
INFO  2023-11-15 14:50:55,722 [shard  2] rpc - client 10.0.10.164:22304 msg_id 18:  exception "gate closed" in no_wait handler ignored

According to https://github.com/scylladb/scylla-enterprise/issues/2897, we should check that CQL and the gossiper work.
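
As a hedged illustration (not the SCT helper API), the gossiper side of that check can be done with nodetool statusgossip, which prints "running" when gossip is active:

# Illustrative sketch; the host argument is a placeholder.
import subprocess


def gossip_is_running(host: str) -> bool:
    out = subprocess.run(
        ["nodetool", "-h", host, "statusgossip"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.strip().lower() == "running"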

So, if we look at the nodetool status results, we see that the target node is reported as UN:

2023-11-15 14:51:42,056 f:remote_base.py  l:521  c:KubernetesCmdRunner  p:DEBUG > Running command "/usr/bin/nodetool  status "...
2023-11-15 14:51:44,486 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > Datacenter: eu-north-1
2023-11-15 14:51:44,486 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > ======================
2023-11-15 14:51:44,487 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > Status=Up/Down
2023-11-15 14:51:44,487 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > |/ State=Normal/Leaving/Joining/Moving
2023-11-15 14:51:44,493 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > --  Address     Load       Tokens       Owns    Host ID                               Rack
2023-11-15 14:51:44,494 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.0.11.5   31.73 GB   256          ?       91fc2434-72dc-4ce1-b86a-ea77cb9e5f39  rack-1
2023-11-15 14:51:44,495 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.0.11.23  33.48 GB   256          ?       bb581f72-5b7a-416a-a913-d358f99061fa  rack-1
2023-11-15 14:51:44,497 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.0.8.238  31.73 GB   256          ?       d0820b1c-a78c-43d2-aaeb-32d91ed0b747  rack-1
2023-11-15 14:51:44,497 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > Datacenter: eu-west-1
2023-11-15 14:51:44,497 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > =====================
2023-11-15 14:51:44,497 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > Status=Up/Down
2023-11-15 14:51:44,497 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > |/ State=Normal/Leaving/Joining/Moving
2023-11-15 14:51:44,498 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > --  Address     Load       Tokens       Owns    Host ID                               Rack
2023-11-15 14:51:44,500 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.4.9.133  31.72 GB   256          ?       08f2fd07-15b0-45f7-92a6-c0d220d9185d  rack-1
2023-11-15 14:51:44,501 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.4.9.183  31.34 GB   256          ?       3298b3b1-a043-49e3-b8c2-be264bde09d9  rack-1
2023-11-15 14:51:44,501 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.4.8.142  31.72 GB   256          ?       db040b36-7297-46b1-8915-78acbe137076  rack-1
2023-11-15 14:51:44,503 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG >               
2023-11-15 14:51:44,503 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
2023-11-15 14:51:44,552 f:base.py         l:142  c:KubernetesCmdRunner  p:DEBUG > Command "/usr/bin/nodetool  status " finished with status 0
...
2023-11-15 14:58:41,975 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > Datacenter: eu-north-1
2023-11-15 14:58:41,976 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > ======================
2023-11-15 14:58:41,976 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > Status=Up/Down
2023-11-15 14:58:41,976 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > |/ State=Normal/Leaving/Joining/Moving
2023-11-15 14:58:41,977 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > --  Address     Load       Tokens       Owns    Host ID                               Rack
2023-11-15 14:58:41,979 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.0.11.5   29.14 GB   256          ?       91fc2434-72dc-4ce1-b86a-ea77cb9e5f39  rack-1
2023-11-15 14:58:41,981 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.0.11.23  30.72 GB   256          ?       bb581f72-5b7a-416a-a913-d358f99061fa  rack-1
2023-11-15 14:58:41,982 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.0.8.238  29.14 GB   256          ?       d0820b1c-a78c-43d2-aaeb-32d91ed0b747  rack-1
2023-11-15 14:58:41,982 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > Datacenter: eu-west-1
2023-11-15 14:58:41,983 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > =====================
2023-11-15 14:58:41,983 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > Status=Up/Down
2023-11-15 14:58:41,983 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > |/ State=Normal/Leaving/Joining/Moving
2023-11-15 14:58:41,984 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > --  Address     Load       Tokens       Owns    Host ID                               Rack
2023-11-15 14:58:41,984 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.4.9.133  29.14 GB   256          ?       08f2fd07-15b0-45f7-92a6-c0d220d9185d  rack-1
2023-11-15 14:58:41,987 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.4.9.183  33.53 GB   256          ?       3298b3b1-a043-49e3-b8c2-be264bde09d9  rack-1
2023-11-15 14:58:41,988 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > UN  10.4.8.142  29.13 GB   256          ?       db040b36-7297-46b1-8915-78acbe137076  rack-1
2023-11-15 14:58:41,988 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG >               
2023-11-15 14:58:41,988 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
2023-11-15 14:58:42,068 f:base.py         l:142  c:KubernetesCmdRunner  p:DEBUG > Command "/usr/bin/nodetool  status " finished with status 0
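
A minimal sketch (illustrative, not SCT's actual parser) of asserting the property visible above: every node line in the raw nodetool status output, i.e. without the SCT log prefixes, reports the UN (Up/Normal) state.

# Illustrative sketch of a UN-state check over raw `nodetool status` output.
import re
import subprocess

NODE_LINE = re.compile(r"^([UD][NLJM])\s+\d{1,3}(?:\.\d{1,3}){3}\s", re.MULTILINE)


def all_nodes_un(status_output: str) -> bool:
    states = NODE_LINE.findall(status_output)
    return bool(states) and all(state == "UN" for state in states)


status = subprocess.run(["nodetool", "status"], capture_output=True, text=True).stdout
assert all_nodes_un(status), "Not all nodes are in UN state"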

Then, monitoring shows that the load on the node is dynamic, which we can take as an indication that it is serving requests:
Screenshot from 2023-11-17 13-07-46
Screenshot from 2023-11-17 13-08-00

So, it turns out that the referenced nemesis checks the wrong thing.
It should check the UN state and CQL availability, which was the original problem.
Also, it must restart the node when the bug is detected, because a restart is the remedy for that bug.
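
A hedged sketch of that proposed end-of-nemesis flow, reusing the all_nodes_un helper sketched above; the cassandra-driver dependency, the finalize_nemesis/cql_is_available names, and the restart command are assumptions (on K8S the operator restarts the Scylla pod rather than systemd).

# Hedged sketch of the proposed check: UN state + working CQL, with a node
# restart as the remedy if either check fails.
import subprocess

from cassandra.cluster import Cluster  # assumption: cassandra-driver is installed


def cql_is_available(host: str, port: int = 9042) -> bool:
    cluster = Cluster([host], port=port)
    try:
        cluster.connect().execute("SELECT release_version FROM system.local")
        return True
    except Exception:
        return False
    finally:
        cluster.shutdown()


def finalize_nemesis(target_host: str) -> None:
    status = subprocess.run(["nodetool", "status"],
                            capture_output=True, text=True, check=True).stdout
    if all_nodes_un(status) and cql_is_available(target_host):
        return
    # Restarting the node is the remedy for the underlying bug; on a plain VM
    # it would look roughly like this, on K8S it is a pod restart instead.
    subprocess.run(["sudo", "systemctl", "restart", "scylla-server"], check=True)
    raise AssertionError(
        "Node did not return to UN state with working CQL after "
        "re-enabling binary and gossip; the node was restarted as a remedy")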

Impact

Probably a false nemesis failure?

How frequently does it reproduce?

100%, using Scylla version 2023.1.2

Installation details

Kernel Version: 5.10.198-187.748.amzn2.x86_64
Scylla version (or git commit hash): 2023.1.2-20231001.646df23cc4b3 with build-id 367fcf1672d44f5cbddc88f946cf272e2551b85a

Operator Image: scylladb/scylla-operator:1.11.0
Operator Helm Version: v1.11.0
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/stable
Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run (pod IPs are ephemeral):
sct-cluster-eu-north-1-rack-1-0 | 10.0.11.23
sct-cluster-eu-north-1-rack-1-1 | 10.0.8.238
sct-cluster-eu-north-1-rack-1-2 | 10.0.11.5
sct-cluster-eu-west-1-rack-1-0 | 10.4.9.183
sct-cluster-eu-west-1-rack-1-1 | 10.4.9.133
sct-cluster-eu-west-1-rack-1-2 | 10.4.8.142

OS / Image: `` (k8s-eks: multi-dc: `eu-north-1` and `eu-west-1`)

Test: vp-longevity-scylla-operator-multidc-12h-eks
Test id: 181576c2-fca5-483c-a520-ce2108b9874a
Test name: scylla-staging/valerii/vp-longevity-scylla-operator-multidc-12h-eks
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 181576c2-fca5-483c-a520-ce2108b9874a
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 181576c2-fca5-483c-a520-ce2108b9874a

Logs:

Jenkins job URL
Argus

vponomaryov commented:

@shlomi, @fruch, @roy ^


fruch commented Nov 17, 2023

Since @ShlomiBalalis isn't available currently, we'll need to handle it ourselves

roydahan assigned fruch and unassigned roydahan and fruch Nov 26, 2023
vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this issue Jan 23, 2024
Check the gossip status and CQL workability in the end of the
'disrupt_disable_binary_gossip_execute_major_compaction' nemesis
instead of looking for the 'gate closed' message in DB logs.

Fixes: scylladb#6819
fruch pushed a commit that referenced this issue Jan 23, 2024
Check the gossip status and CQL workability in the end of the
'disrupt_disable_binary_gossip_execute_major_compaction' nemesis
instead of looking for the 'gate closed' message in DB logs.

Fixes: #6819
fruch pushed a commit that referenced this issue Jan 23, 2024
Check the gossip status and CQL workability in the end of the
'disrupt_disable_binary_gossip_execute_major_compaction' nemesis
instead of looking for the 'gate closed' message in DB logs.

Fixes: #6819
(cherry picked from commit 6f48e10)