fullscan thread failed after running for 2 minutes #8287

yarongilor · 2024-08-07T12:55:28Z

Packages

Scylla version: 6.1.0~dev-20240625.c80dc5715668 with build-id bf0032dbaafe5e4d3e01ece0dcb7785d2ec7a098

Kernel Version: 5.15.0-1063-aws

Issue description

This issue is a regression.
It is unknown if this issue is a regression.

Describe your issue in detail and steps it took to produce it.

A (side effect) followup of scylladb/scylladb#19573 (comment)

the run_fullscan is configured to run every minute:

run_fullscan: ['{"mode": "table_and_aggregate", "ks_cf": "keyspace1.standard1", "interval": 60}']

So probably the thread failed for some reason (not expected).

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

longevity-tls-50gb-3d-master-db-node-04447069-7 (52.49.241.60 | 10.4.22.189) (shards: 14)
longevity-tls-50gb-3d-master-db-node-04447069-6 (108.128.184.168 | 10.4.22.96) (shards: 14)
longevity-tls-50gb-3d-master-db-node-04447069-5 (54.220.255.133 | 10.4.22.193) (shards: 14)
longevity-tls-50gb-3d-master-db-node-04447069-4 (34.241.17.128 | 10.4.22.222) (shards: 14)
longevity-tls-50gb-3d-master-db-node-04447069-3 (52.210.187.188 | 10.4.23.139) (shards: 14)
longevity-tls-50gb-3d-master-db-node-04447069-2 (52.16.185.73 | 10.4.23.195) (shards: 14)
longevity-tls-50gb-3d-master-db-node-04447069-1 (54.155.186.240 | 10.4.21.83) (shards: 14)

OS / Image: ami-09006ca344092e50b (aws: undefined_region)

Test: longevity-50gb-3days-test
Test id: 04447069-14c9-47f9-8cf8-fe71a0ec1276
Test name: scylla-master/tier1/longevity-50gb-3days-test

Test config file(s):

longevity-50GB-3days-authorization-and-tls-ssl.yaml

Logs and commands

Restore Monitor Stack command: $ hydra investigate show-monitor 04447069-14c9-47f9-8cf8-fe71a0ec1276
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 04447069-14c9-47f9-8cf8-fe71a0ec1276

Logs:

db-cluster-04447069.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/db-cluster-04447069.tar.gz
sct-runner-events-04447069.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/sct-runner-events-04447069.tar.gz
sct-04447069.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/sct-04447069.log.tar.gz
loader-set-04447069.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/loader-set-04447069.tar.gz
monitor-set-04447069.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/04447069-14c9-47f9-8cf8-fe71a0ec1276/20240630_041812/monitor-set-04447069.tar.gz

Jenkins job URL
Argus

The text was updated successfully, but these errors were encountered:

fruch · 2024-08-07T13:31:53Z

@yarongilor from whom wrote that feature, I would expect to have a bit more insights into why this has failed, and why the check aren't continuing.

i.e. I don't expect to ever see ThreadFailedEvent, if they pop up, it mean the code didn't handled some exception, and that thread stop executing

yarongilor · 2024-08-08T10:45:41Z

probably failed for the same root cause of #8292

fruch · 2024-09-12T17:04:34Z

probably failed for the same root cause of #8292

@yarongilor
#8292 is now closed, that meaning this can be closed ?

yarongilor · 2024-09-15T12:30:33Z

probably failed for the same root cause of #8292

@yarongilor #8292 is now closed, that meaning this can be closed ?

No, sorry, it was mistakenly not related issue.
As for what happened here, it seems that during a refactor by @temichus , in this commit:
64ded4b#diff-75fbcbc6f5d81770abbb95177a9f3b80c7a7f809503152dcc17d64a779637a52

the line of result.fetch_next_page() is deleted from def fetch_result_pages(self, result, read_pages)
so it should probably put back in its place.
The thread failed quietly like:

< t:2024-06-29 03:30:13,846 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-06-29 03:30:13.844: (FullScanEvent Severity.NORMAL) period_type=end event_id=9df0e7db-f66b-487d-8b67-316809a0aa05 duration=0s node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=FullScanOperation operation ended successfully
< t:2024-06-29 03:31:13,899 f:operations_thread.py l:147  c:ScanOperationThread  p:DEBUG > Going to run operation FullScanOperation
< t:2024-06-29 03:31:14,221 f:scan_operation_thread.py l:123  c:FullScanOperation    p:DEBUG > Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s
< t:2024-06-29 03:31:14,400 f:scan_operation_thread.py l:188  c:FullScanOperation    p:DEBUG > Will fetch up to 1000 result pages..
< t:2024-06-29 03:31:14,428 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-06-29 03:31:14.426: (FullScanEvent Severity.NORMAL) period_type=end event_id=bdd695d8-55ab-4795-8730-d3a42f60b4ef duration=0s node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=FullScanOperation operation ended successfully
< t:2024-06-29 03:32:14,467 f:operations_thread.py l:147  c:ScanOperationThread  p:DEBUG > Going to run operation FullScanOperation
< t:2024-06-29 03:32:14,811 f:scan_operation_thread.py l:123  c:FullScanOperation    p:DEBUG > Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s
< t:2024-06-29 03:32:15,044 f:scan_operation_thread.py l:188  c:FullScanOperation    p:DEBUG > Will fetch up to 0 result pages..

temichus · 2024-09-15T12:50:31Z

2 minutes?

i see 1st result report

 t:2024-06-29 03:22:52,894 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > +------------------------+--------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+
< t:2024-06-29 03:22:52,894 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > |        op_type         |      duration      | exceptions | nemesis_at_start | nemesis_at_end | success |                                    cmd                                    |
< t:2024-06-29 03:22:52,894 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > +------------------------+--------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+
< t:2024-06-29 03:22:52,894 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > | FullScanAggregateEvent | 24.877952337265015 |            |       None       |      None      |   True  | SELECT count(*) FROM keyspace1.standard1 BYPASS CACHE USING TIMEOUT 1800s |
< t:2024-06-29 03:22:52,894 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > +------------------------+--------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+

ant last

< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > Thread stats:
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > +------------------------+---------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > |        op_type         |       duration      | exceptions | nemesis_at_start | nemesis_at_end | success |                                    cmd                                    |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > +------------------------+---------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > | FullScanAggregateEvent |  24.877952337265015 |            |       None       |      None      |   True  | SELECT count(*) FROM keyspace1.standard1 BYPASS CACHE USING TIMEOUT 1800s |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > |     FullScanEvent      |  0.3228580951690674 |            |       None       |      None      |   True  |     SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s     |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > | FullScanAggregateEvent |  28.062369346618652 |            |       None       |      None      |   True  | SELECT count(*) FROM keyspace1.standard1 BYPASS CACHE USING TIMEOUT 1800s |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > | FullScanAggregateEvent |  21.24626350402832  |            |       None       |      None      |   True  | SELECT count(*) FROM keyspace1.standard1 BYPASS CACHE USING TIMEOUT 1800s |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > | FullScanAggregateEvent |  28.70813512802124  |            |       None       |      None      |   True  | SELECT count(*) FROM keyspace1.standard1 BYPASS CACHE USING TIMEOUT 1800s |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > |     FullScanEvent      | 0.17992639541625977 |            |       None       |      None      |   True  |     SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s     |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > |     FullScanEvent      | 0.12150192260742188 |            |       None       |      None      |   True  |     SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s     |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > |     FullScanEvent      | 0.17914700508117676 |            |       None       |      None      |   True  |     SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s     |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > +------------------------+---------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+
<

last executing attempt that I see

< t:2024-06-29 03:32:14,467 f:operations_thread.py l:140  c:ScanOperationThread  p:DEBUG > Operation instance map: {'table_and_aggregate': <function ScanOperationThread.__init__.<locals>.<lambda> at 0x7f5e66e96dd0>, 'random': <function ScanOperationThread.__init__.<locals>.<lambda> at 0x7f5e66e97910>, 'table': <function ScanOperationThread.__init__.<locals>.<lambda> at 0x7f5e66e975b0>, 'partition': <function ScanOperationThread.__init__.<locals>.<lambda> at 0x7f5e66e96ef0>, 'aggregate': <function ScanOperationThread.__init__.<locals>.<lambda> at 0x7f5e66e979a0>}
< t:2024-06-29 03:32:14,467 f:operations_thread.py l:147  c:ScanOperationThread  p:DEBUG > Going to run operation FullScanOperation
< t:2024-06-29 03:32:14,469 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-06-29 03:32:14.467: (FullScanEvent Severity.NORMAL) period_type=begin event_id=709bd269-6371-49a2-9fb0-5bd0e4a5f8e0 node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s

after that - silence, not clear why

but thread is alive at the end of tests

2024-06-30 04:17:59,475 f:threads_and_processes_alive.py l:66   c:sdcm.utils.threads_and_processes_alive p:ERROR > There are some threads left alive from following modules: sdcm.tester,multiprocessing.queues,socketserver,cassandra.io.libevreactor,sdcm.utils.operations_thread,sdcm.nemesis,cassandra.cluster,cassandra.connection,concurrent.futures.thread,sdcm.remote.libssh2_client

temichus · 2024-09-15T12:52:17Z

probably failed for the same root cause of #8292

@yarongilor #8292 is now closed, that meaning this can be closed ?

No, sorry, it was mistakenly not related issue. As for what happened here, it seems that during a refactor by @temichus , in this commit: 64ded4b#diff-75fbcbc6f5d81770abbb95177a9f3b80c7a7f809503152dcc17d64a779637a52

the line of result.fetch_next_page() is deleted from def fetch_result_pages(self, result, read_pages) so it should probably put back in its place. The thread failed quietly like:

< t:2024-06-29 03:30:13,846 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-06-29 03:30:13.844: (FullScanEvent Severity.NORMAL) period_type=end event_id=9df0e7db-f66b-487d-8b67-316809a0aa05 duration=0s node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=FullScanOperation operation ended successfully
< t:2024-06-29 03:31:13,899 f:operations_thread.py l:147  c:ScanOperationThread  p:DEBUG > Going to run operation FullScanOperation
< t:2024-06-29 03:31:14,221 f:scan_operation_thread.py l:123  c:FullScanOperation    p:DEBUG > Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s
< t:2024-06-29 03:31:14,400 f:scan_operation_thread.py l:188  c:FullScanOperation    p:DEBUG > Will fetch up to 1000 result pages..
< t:2024-06-29 03:31:14,428 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-06-29 03:31:14.426: (FullScanEvent Severity.NORMAL) period_type=end event_id=bdd695d8-55ab-4795-8730-d3a42f60b4ef duration=0s node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=FullScanOperation operation ended successfully
< t:2024-06-29 03:32:14,467 f:operations_thread.py l:147  c:ScanOperationThread  p:DEBUG > Going to run operation FullScanOperation
< t:2024-06-29 03:32:14,811 f:scan_operation_thread.py l:123  c:FullScanOperation    p:DEBUG > Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s
< t:2024-06-29 03:32:15,044 f:scan_operation_thread.py l:188  c:FullScanOperation    p:DEBUG > Will fetch up to 0 result pages..

this commit was merge 1.5 years ago

and before this log thread is working

< t:2024-06-29 03:31:14,400 f:scan_operation_thread.py l:188  c:FullScanOperation    p:DEBUG > Will fetch up to 1000 result pages..

yarongilor · 2024-09-15T13:16:14Z

probably failed for the same root cause of #8292

@yarongilor #8292 is now closed, that meaning this can be closed ?

No, sorry, it was mistakenly not related issue. As for what happened here, it seems that during a refactor by @temichus , in this commit: 64ded4b#diff-75fbcbc6f5d81770abbb95177a9f3b80c7a7f809503152dcc17d64a779637a52
the line of result.fetch_next_page() is deleted from def fetch_result_pages(self, result, read_pages) so it should probably put back in its place. The thread failed quietly like:

< t:2024-06-29 03:30:13,846 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-06-29 03:30:13.844: (FullScanEvent Severity.NORMAL) period_type=end event_id=9df0e7db-f66b-487d-8b67-316809a0aa05 duration=0s node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=FullScanOperation operation ended successfully
< t:2024-06-29 03:31:13,899 f:operations_thread.py l:147  c:ScanOperationThread  p:DEBUG > Going to run operation FullScanOperation
< t:2024-06-29 03:31:14,221 f:scan_operation_thread.py l:123  c:FullScanOperation    p:DEBUG > Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s
< t:2024-06-29 03:31:14,400 f:scan_operation_thread.py l:188  c:FullScanOperation    p:DEBUG > Will fetch up to 1000 result pages..
< t:2024-06-29 03:31:14,428 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-06-29 03:31:14.426: (FullScanEvent Severity.NORMAL) period_type=end event_id=bdd695d8-55ab-4795-8730-d3a42f60b4ef duration=0s node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=FullScanOperation operation ended successfully
< t:2024-06-29 03:32:14,467 f:operations_thread.py l:147  c:ScanOperationThread  p:DEBUG > Going to run operation FullScanOperation
< t:2024-06-29 03:32:14,811 f:scan_operation_thread.py l:123  c:FullScanOperation    p:DEBUG > Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s
< t:2024-06-29 03:32:15,044 f:scan_operation_thread.py l:188  c:FullScanOperation    p:DEBUG > Will fetch up to 0 result pages..

this commit was merge 1.5 years ago

and before this log thread is working

< t:2024-06-29 03:31:14,400 f:scan_operation_thread.py l:188  c:FullScanOperation    p:DEBUG > Will fetch up to 1000 result pages..

@temichus , it only get the problem when read_pages = 0.
It is left running in an endless loop of:

        while result.has_more_pages and pages <= read_pages:
            if read_pages > 0:
                pages += 1

sending a fix now

…pages fetch_result_pages had a "while" loop where a call to result.fetch_next_page() is missing. Adding it back. Fixes: scylladb#8287

…pages fetch_result_pages had a "while" loop where a call to result.fetch_next_page() is missing. Adding it back. Fixes: #8287 (cherry picked from commit 3f0b2f3)

yarongilor self-assigned this Aug 7, 2024

yarongilor mentioned this issue Sep 15, 2024

fix(full scan thread): Added missing fetch-next-page to fetch_result_pages #8705

Merged

2 tasks

fruch closed this as completed in #8705 Oct 31, 2024

fruch closed this as completed in 3f0b2f3 Oct 31, 2024

mergify bot mentioned this issue Oct 31, 2024

[Backport 6.0] fix(full scan thread): Added missing fetch-next-page to fetch_result_pages #9107

Merged

2 tasks

mergify bot mentioned this issue Oct 31, 2024

[Backport 6.1] fix(full scan thread): Added missing fetch-next-page to fetch_result_pages #9108

Merged

2 tasks

mergify bot mentioned this issue Oct 31, 2024

[Backport 2024.1] fix(full scan thread): Added missing fetch-next-page to fetch_result_pages #9109

Merged

2 tasks

mergify bot mentioned this issue Oct 31, 2024

[Backport 2024.2] fix(full scan thread): Added missing fetch-next-page to fetch_result_pages #9110

Merged

2 tasks

mergify bot mentioned this issue Nov 1, 2024

[Backport 6.2] fix(full scan thread): Added missing fetch-next-page to fetch_result_pages #9115

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fullscan thread failed after running for 2 minutes #8287

fullscan thread failed after running for 2 minutes #8287

yarongilor commented Aug 7, 2024

Logs:

fruch commented Aug 7, 2024

yarongilor commented Aug 8, 2024

fruch commented Sep 12, 2024

yarongilor commented Sep 15, 2024

temichus commented Sep 15, 2024 •

edited

Loading

temichus commented Sep 15, 2024

yarongilor commented Sep 15, 2024

fullscan thread failed after running for 2 minutes #8287

fullscan thread failed after running for 2 minutes #8287

Comments

yarongilor commented Aug 7, 2024

Packages

Issue description

Impact

How frequently does it reproduce?

Installation details

Logs:

fruch commented Aug 7, 2024

yarongilor commented Aug 8, 2024

fruch commented Sep 12, 2024

yarongilor commented Sep 15, 2024

temichus commented Sep 15, 2024 • edited Loading

temichus commented Sep 15, 2024

yarongilor commented Sep 15, 2024

temichus commented Sep 15, 2024 •

edited

Loading