Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI Failure (KgoVerifier failed waiting for worker) in CloudStorageChunkReadTest.test_read_when_cache_smaller_than_segment_size #11449

Closed
andijcr opened this issue Jun 15, 2023 · 9 comments · Fixed by #11773
Assignees
Labels
area/cloud-storage Shadow indexing subsystem area/storage ci-failure kind/bug Something isn't working

Comments

@andijcr
Copy link
Contributor

andijcr commented Jun 15, 2023

https://buildkite.com/redpanda/redpanda/builds/31319#0188bbd1-22a0-41f9-b16e-153d93522399

failure similar to #9544

Module: rptest.tests.cloud_storage_chunk_read_path_test
Class:  CloudStorageChunkReadTest
Method: test_read_when_cache_smaller_than_segment_size
====================================================================================================
test_id:    rptest.tests.cloud_storage_chunk_read_path_test.CloudStorageChunkReadTest.test_read_when_cache_smaller_than_segment_size
status:     FAIL
run time:   3 minutes 50.100 seconds


    TimeoutError(None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 79, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/cloud_storage_chunk_read_path_test.py", line 213, in test_read_when_cache_smaller_than_segment_size
    self._consume_baseline(timeout=180)
  File "/root/tests/rptest/tests/cloud_storage_chunk_read_path_test.py", line 190, in _consume_baseline
    consumer.wait(timeout_sec=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/services/service.py", line 261, in wait
    if not self.wait_node(node, end - now):
  File "/root/tests/rptest/services/kgo_verifier_services.py", line 181, in wait_node
    self._redpanda.wait_until(lambda: self._status.active is False or self.
  File "/root/tests/rptest/services/redpanda.py", line 1991, in wait_until
    wait_until(wrapped,
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: None
@abhijat
Copy link
Contributor

abhijat commented Jun 16, 2023

In this one the consumer seems completely stuck:

[INFO  - 2023-06-14 21:52:17,507 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:52:22,510 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:52:27,513 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:52:32,515 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:52:37,518 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:52:42,521 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:52:47,523 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:52:52,525 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:52:57,528 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:53:02,530 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:53:07,533 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:53:12,536 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:53:17,538 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:53:22,540 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:53:27,543 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:53:32,546 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:53:37,549 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:53:42,551 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:53:47,554 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:53:52,556 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:53:57,558 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:54:02,561 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:54:07,563 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:54:12,566 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:54:17,569 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:54:22,572 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:54:27,575 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:54:32,577 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:54:37,580 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:54:42,583 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:54:47,585 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:54:52,588 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:54:57,590 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:55:02,593 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:55:07,596 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:55:12,598 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>
[INFO  - 2023-06-14 21:55:17,600 - kgo_verifier_services - _ingest_status - lineno:304]: Worker KgoVerifierSeqConsumer-0-139778511994016 status: ConsumerStatus<True, 0, ValidatorStatus<0 0 0>>

@piyushredpanda
Copy link
Contributor

Ugh, that is sev/high then, @abhijat ?

@abhijat
Copy link
Contributor

abhijat commented Jun 16, 2023

No, I think it might a problem with the consumer, but will be able to say more after investigating

@abhijat
Copy link
Contributor

abhijat commented Jun 16, 2023

The consumer doesn't seem to send any fetch requests to broker:

TRACE 2023-06-14 21:52:17,005 [shard 0] kafka - requests.cc:94 - [172.16.16.6:36252] processing name:api_versions, key:18, version:3 for KgoVerifierSeqConsumer-0-139778511994016, mem_units: 8128, ctx_size: 13
TRACE 2023-06-14 21:52:17,005 [shard 0] kafka - request_context.h:189 - [172.16.16.6:36252] sending 18:api_versions for {KgoVerifierSeqConsumer-0-139778511994016}, response {error_code={ error_code: none [0] } api_keys={{api_key=0 min_version=0 max_version=7}, {api_key=1 min_version=4 max_version=11}, {api_key=2 min_version=0 max_version=4}, {api_key=3 min_version=0 max_version=7}, {api_key=9 min_version=1 max_version=7}, {api_key=47 min_version=0 max_version=0}, {api_key=10 min_version=0 max_version=3}, {api_key=16 min_version=0 max_version=3}, {api_key=18 min_version=0 max_version=3}, {api_key=11 min_version=0 max_version=6}, {api_key=12 min_version=0 max_version=4}, {api_key=13 min_version=0 max_version=4}, {api_key=14 min_version=0 max_version=4}, {api_key=19 min_version=0 max_version=6}, {api_key=8 min_version=1 max_version=8}, {api_key=32 min_version=0 max_version=4}, {api_key=33 min_version=0 max_version=2}, {api_key=20 min_version=0 max_version=5}, {api_key=15 min_version=0 max_version=5}, {api_key=17 min_version=0 max_version=1}, {api_key=36 min_version=0 max_version=2}, {api_key=44 min_version=0 max_version=1}, {api_key=42 min_version=0 max_version=2}, {api_key=29 min_version=0 max_version=2}, {api_key=35 min_version=0 max_version=2}, {api_key=30 min_version=0 max_version=2}, {api_key=31 min_version=0 max_version=2}, {api_key=22 min_version=0 max_version=3}, {api_key=24 min_version=0 max_version=3}, {api_key=28 min_version=0 max_version=3}, {api_key=25 min_version=0 max_version=1}, {api_key=26 min_version=0 max_version=3}, {api_key=37 min_version=0 max_version=3}, {api_key=23 min_version=0 max_version=4}, {api_key=45 min_version=0 max_version=0}, {api_key=46 min_version=0 max_version=0}, {api_key=61 min_version=0 max_version=0}, {api_key=65 min_version=0 max_version=0}, {api_key=66 min_version=0 max_version=0}} throttle_time_ms=0}
TRACE 2023-06-14 21:52:17,008 [shard 0] kafka - requests.cc:94 - [172.16.16.6:36252] processing name:metadata, key:3, version:7 for KgoVerifierSeqConsumer-0-139778511994016, mem_units: 30822, ctx_size: 18
TRACE 2023-06-14 21:52:17,010 [shard 0] kafka - handler.h:69 - [client_id: {KgoVerifierSeqConsumer-0-139778511994016}] handling metadata v7 request {topics={{{name={panda-topic}}}} allow_auto_topic_creation=false include_cluster_authorized_operations=false include_topic_authorized_operations=false}
TRACE 2023-06-14 21:52:17,010 [shard 0] kafka - request_context.h:189 - [172.16.16.6:36252] sending 3:metadata for {KgoVerifierSeqConsumer-0-139778511994016}, response {throttle_time_ms=0 brokers={{node_id=1 host=docker-rp-10 port=9092 rack={nullopt}}, {node_id=3 host=docker-rp-14 port=9092 rack={nullopt}}, {node_id=2 host=docker-rp-19 port=9092 rack={nullopt}}} cluster_id={redpanda.c8071057-d122-424f-a3a1-81728ce51bb3} controller_id=1 topics={{error_code={ error_code: none [0] } name={panda-topic} is_internal=false partitions=[{error_code={ error_code: none [0] } partition_index=0 leader_id=1 leader_epoch=1 replica_nodes={{1}, {2}, {3}} isr_nodes={{1}, {2}, {3}} offline_replicas={}},] topic_authorized_operations=0}} cluster_authorized_operations=-2147483648}
TRACE 2023-06-14 21:52:17,026 [shard 1] kafka - requests.cc:94 - [172.16.16.6:36266] processing name:api_versions, key:18, version:3 for KgoVerifierSeqConsumer-0-139778511994016, mem_units: 8128, ctx_size: 13
TRACE 2023-06-14 21:52:17,027 [shard 1] kafka - request_context.h:189 - [172.16.16.6:36266] sending 18:api_versions for {KgoVerifierSeqConsumer-0-139778511994016}, response {error_code={ error_code: none [0] } api_keys={{api_key=0 min_version=0 max_version=7}, {api_key=1 min_version=4 max_version=11}, {api_key=2 min_version=0 max_version=4}, {api_key=3 min_version=0 max_version=7}, {api_key=9 min_version=1 max_version=7}, {api_key=47 min_version=0 max_version=0}, {api_key=10 min_version=0 max_version=3}, {api_key=16 min_version=0 max_version=3}, {api_key=18 min_version=0 max_version=3}, {api_key=11 min_version=0 max_version=6}, {api_key=12 min_version=0 max_version=4}, {api_key=13 min_version=0 max_version=4}, {api_key=14 min_version=0 max_version=4}, {api_key=19 min_version=0 max_version=6}, {api_key=8 min_version=1 max_version=8}, {api_key=32 min_version=0 max_version=4}, {api_key=33 min_version=0 max_version=2}, {api_key=20 min_version=0 max_version=5}, {api_key=15 min_version=0 max_version=5}, {api_key=17 min_version=0 max_version=1}, {api_key=36 min_version=0 max_version=2}, {api_key=44 min_version=0 max_version=1}, {api_key=42 min_version=0 max_version=2}, {api_key=29 min_version=0 max_version=2}, {api_key=35 min_version=0 max_version=2}, {api_key=30 min_version=0 max_version=2}, {api_key=31 min_version=0 max_version=2}, {api_key=22 min_version=0 max_version=3}, {api_key=24 min_version=0 max_version=3}, {api_key=28 min_version=0 max_version=3}, {api_key=25 min_version=0 max_version=1}, {api_key=26 min_version=0 max_version=3}, {api_key=37 min_version=0 max_version=3}, {api_key=23 min_version=0 max_version=4}, {api_key=45 min_version=0 max_version=0}, {api_key=46 min_version=0 max_version=0}, {api_key=61 min_version=0 max_version=0}, {api_key=65 min_version=0 max_version=0}, {api_key=66 min_version=0 max_version=0}} throttle_time_ms=0}
TRACE 2023-06-14 21:52:17,036 [shard 1] kafka - requests.cc:94 - [172.16.16.6:36266] processing name:metadata, key:3, version:7 for KgoVerifierSeqConsumer-0-139778511994016, mem_units: 30822, ctx_size: 18
TRACE 2023-06-14 21:52:17,037 [shard 1] kafka - handler.h:69 - [client_id: {KgoVerifierSeqConsumer-0-139778511994016}] handling metadata v7 request {topics={{{name={panda-topic}}}} allow_auto_topic_creation=false include_cluster_authorized_operations=false include_topic_authorized_operations=false}
TRACE 2023-06-14 21:52:17,038 [shard 1] kafka - request_context.h:189 - [172.16.16.6:36266] sending 3:metadata for {KgoVerifierSeqConsumer-0-139778511994016}, response {throttle_time_ms=0 brokers={{node_id=1 host=docker-rp-10 port=9092 rack={nullopt}}, {node_id=3 host=docker-rp-14 port=9092 rack={nullopt}}, {node_id=2 host=docker-rp-19 port=9092 rack={nullopt}}} cluster_id={redpanda.c8071057-d122-424f-a3a1-81728ce51bb3} controller_id=1 topics={{error_code={ error_code: none [0] } name={panda-topic} is_internal=false partitions=[{error_code={ error_code: none [0] } partition_index=0 leader_id=1 leader_epoch=1 replica_nodes={{1}, {2}, {3}} isr_nodes={{1}, {2}, {3}} offline_replicas={}},] topic_authorized_operations=0}} cluster_authorized_operations=-2147483648}
TRACE 2023-06-14 21:52:17,044 [shard 0] kafka - requests.cc:94 - [172.16.16.6:36274] processing name:api_versions, key:18, version:3 for KgoVerifierSeqConsumer-0-139778511994016, mem_units: 8128, ctx_size: 13
TRACE 2023-06-14 21:52:17,044 [shard 0] kafka - request_context.h:189 - [172.16.16.6:36274] sending 18:api_versions for {KgoVerifierSeqConsumer-0-139778511994016}, response {error_code={ error_code: none [0] } api_keys={{api_key=0 min_version=0 max_version=7}, {api_key=1 min_version=4 max_version=11}, {api_key=2 min_version=0 max_version=4}, {api_key=3 min_version=0 max_version=7}, {api_key=9 min_version=1 max_version=7}, {api_key=47 min_version=0 max_version=0}, {api_key=10 min_version=0 max_version=3}, {api_key=16 min_version=0 max_version=3}, {api_key=18 min_version=0 max_version=3}, {api_key=11 min_version=0 max_version=6}, {api_key=12 min_version=0 max_version=4}, {api_key=13 min_version=0 max_version=4}, {api_key=14 min_version=0 max_version=4}, {api_key=19 min_version=0 max_version=6}, {api_key=8 min_version=1 max_version=8}, {api_key=32 min_version=0 max_version=4}, {api_key=33 min_version=0 max_version=2}, {api_key=20 min_version=0 max_version=5}, {api_key=15 min_version=0 max_version=5}, {api_key=17 min_version=0 max_version=1}, {api_key=36 min_version=0 max_version=2}, {api_key=44 min_version=0 max_version=1}, {api_key=42 min_version=0 max_version=2}, {api_key=29 min_version=0 max_version=2}, {api_key=35 min_version=0 max_version=2}, {api_key=30 min_version=0 max_version=2}, {api_key=31 min_version=0 max_version=2}, {api_key=22 min_version=0 max_version=3}, {api_key=24 min_version=0 max_version=3}, {api_key=28 min_version=0 max_version=3}, {api_key=25 min_version=0 max_version=1}, {api_key=26 min_version=0 max_version=3}, {api_key=37 min_version=0 max_version=3}, {api_key=23 min_version=0 max_version=4}, {api_key=45 min_version=0 max_version=0}, {api_key=46 min_version=0 max_version=0}, {api_key=61 min_version=0 max_version=0}, {api_key=65 min_version=0 max_version=0}, {api_key=66 min_version=0 max_version=0}} throttle_time_ms=0}
TRACE 2023-06-14 21:52:17,049 [shard 0] kafka - requests.cc:94 - [172.16.16.6:36274] processing name:list_offsets, key:2, version:4 for KgoVerifierSeqConsumer-0-139778511994016, mem_units: 8184, ctx_size: 42
TRACE 2023-06-14 21:52:17,049 [shard 0] kafka - handler.h:69 - [client_id: {KgoVerifierSeqConsumer-0-139778511994016}] handling list_offsets v4 request {replica_id=-1 isolation_level=0 topics={{name={panda-topic} partitions={{partition_index=0 current_leader_epoch=-1 timestamp={timestamp: missing} max_num_offsets=0}}}}}
TRACE 2023-06-14 21:52:17,062 [shard 0] kafka - request_context.h:189 - [172.16.16.6:36274] sending 2:list_offsets for {KgoVerifierSeqConsumer-0-139778511994016}, response {throttle_time_ms=0 topics={{name={panda-topic} partitions={{partition_index=0 error_code={ error_code: none [0] } old_style_offsets={} timestamp={timestamp: missing} offset=50000 leader_epoch=1}}}}}
TRACE 2023-06-14 21:52:17,070 [shard 1] kafka - requests.cc:94 - [172.16.16.6:36282] processing name:api_versions, key:18, version:3 for KgoVerifierSeqConsumer-0-139778511994016, mem_units: 8128, ctx_size: 13
TRACE 2023-06-14 21:52:17,070 [shard 1] kafka - request_context.h:189 - [172.16.16.6:36282] sending 18:api_versions for {KgoVerifierSeqConsumer-0-139778511994016}, response {error_code={ error_code: none [0] } api_keys={{api_key=0 min_version=0 max_version=7}, {api_key=1 min_version=4 max_version=11}, {api_key=2 min_version=0 max_version=4}, {api_key=3 min_version=0 max_version=7}, {api_key=9 min_version=1 max_version=7}, {api_key=47 min_version=0 max_version=0}, {api_key=10 min_version=0 max_version=3}, {api_key=16 min_version=0 max_version=3}, {api_key=18 min_version=0 max_version=3}, {api_key=11 min_version=0 max_version=6}, {api_key=12 min_version=0 max_version=4}, {api_key=13 min_version=0 max_version=4}, {api_key=14 min_version=0 max_version=4}, {api_key=19 min_version=0 max_version=6}, {api_key=8 min_version=1 max_version=8}, {api_key=32 min_version=0 max_version=4}, {api_key=33 min_version=0 max_version=2}, {api_key=20 min_version=0 max_version=5}, {api_key=15 min_version=0 max_version=5}, {api_key=17 min_version=0 max_version=1}, {api_key=36 min_version=0 max_version=2}, {api_key=44 min_version=0 max_version=1}, {api_key=42 min_version=0 max_version=2}, {api_key=29 min_version=0 max_version=2}, {api_key=35 min_version=0 max_version=2}, {api_key=30 min_version=0 max_version=2}, {api_key=31 min_version=0 max_version=2}, {api_key=22 min_version=0 max_version=3}, {api_key=24 min_version=0 max_version=3}, {api_key=28 min_version=0 max_version=3}, {api_key=25 min_version=0 max_version=1}, {api_key=26 min_version=0 max_version=3}, {api_key=37 min_version=0 max_version=3}, {api_key=23 min_version=0 max_version=4}, {api_key=45 min_version=0 max_version=0}, {api_key=46 min_version=0 max_version=0}, {api_key=61 min_version=0 max_version=0}, {api_key=65 min_version=0 max_version=0}, {api_key=66 min_version=0 max_version=0}} throttle_time_ms=0}
TRACE 2023-06-14 21:52:17,074 [shard 1] kafka - requests.cc:94 - [172.16.16.6:36282] processing name:metadata, key:3, version:7 for KgoVerifierSeqConsumer-0-139778511994016, mem_units: 30822, ctx_size: 18
TRACE 2023-06-14 21:52:17,076 [shard 1] kafka - handler.h:69 - [client_id: {KgoVerifierSeqConsumer-0-139778511994016}] handling metadata v7 request {topics={{{name={panda-topic}}}} allow_auto_topic_creation=false include_cluster_authorized_operations=false include_topic_authorized_operations=false}
TRACE 2023-06-14 21:52:17,076 [shard 1] kafka - request_context.h:189 - [172.16.16.6:36282] sending 3:metadata for {KgoVerifierSeqConsumer-0-139778511994016}, response {throttle_time_ms=0 brokers={{node_id=1 host=docker-rp-10 port=9092 rack={nullopt}}, {node_id=3 host=docker-rp-14 port=9092 rack={nullopt}}, {node_id=2 host=docker-rp-19 port=9092 rack={nullopt}}} cluster_id={redpanda.c8071057-d122-424f-a3a1-81728ce51bb3} controller_id=1 topics={{error_code={ error_code: none [0] } name={panda-topic} is_internal=false partitions=[{error_code={ error_code: none [0] } partition_index=0 leader_id=1 leader_epoch=1 replica_nodes={{1}, {2}, {3}} isr_nodes={{1}, {2}, {3}} offline_replicas={}},] topic_authorized_operations=0}} cluster_authorized_operations=-2147483648}

@abhijat
Copy link
Contributor

abhijat commented Jun 16, 2023

Compared with a passing run of the test where the fetch requests are sent to broker and processed and responded to:

TRACE 2023-06-16 07:59:50,984 [shard 0] kafka - requests.cc:94 - [172.18.0.5:35540] processing name:fetch, key:1, version:11 for KgoVerifierSeqConsumer-0-140352342039840, mem_units: 8260, ctx_size: 80
TRACE 2023-06-16 07:59:50,984 [shard 0] kafka - handler.h:69 - [client_id: {KgoVerifierSeqConsumer-0-140352342039840}] handling fetch v11 request {replica_id=-1 max_wait_ms=5000 min_bytes=1 max_bytes=52428800 isolation_level=read_uncommitted session_id=0 session_epoch=0 topics={{name={panda-topic} fetch_partitions={{partition_index=0 current_leader_epoch=1 fetch_offset=0 log_start_offset=-1 max_bytes=1048576}}}} forgotten={} rack_id={}}
TRACE 2023-06-16 07:59:51,043 [shard 0] kafka - request_context.h:189 - [172.18.0.5:35540] sending 1:fetch for {KgoVerifierSeqConsumer-0-140352342039840}, response {throttle_time_ms=0 error_code={ error_code: none [0] } session_id=1 topics={{name={panda-topic} partitions={{partition_index=0 error_code={ error_code: none [0] } high_watermark=50000 last_stable_offset=50000 log_start_offset=0 aborted={nullopt} preferred_read_replica=-1 records={{size 1048165}}}}}}}
TRACE 2023-06-16 07:59:51,046 [shard 0] kafka - requests.cc:94 - [172.18.0.5:35540] processing name:fetch, key:1, version:11 for KgoVerifierSeqConsumer-0-140352342039840, mem_units: 8260, ctx_size: 80
TRACE 2023-06-16 07:59:51,046 [shard 0] kafka - handler.h:69 - [client_id: {KgoVerifierSeqConsumer-0-140352342039840}] handling fetch v11 request {replica_id=-1 max_wait_ms=5000 min_bytes=1 max_bytes=52428800 isolation_level=read_uncommitted session_id=1 session_epoch=1 topics={{name={panda-topic} fetch_partitions={{partition_index=0 current_leader_epoch=1 fetch_offset=946 log_start_offset=-1 max_bytes=1048576}}}} forgotten={} rack_id={}}
TRACE 2023-06-16 07:59:51,047 [shard 0] kafka - request_context.h:189 - [172.18.0.5:35540] sending 1:fetch for {KgoVerifierSeqConsumer-0-140352342039840}, response {throttle_time_ms=0 error_code={ error_code: none [0] } session_id=1 topics={{name={panda-topic} partitions={{partition_index=0 error_code={ error_code: none [0] } high_watermark=50000 last_stable_offset=50000 log_start_offset=0 aborted={nullopt} preferred_read_replica=-1 records={{size 168583}}}}}}}
TRACE 2023-06-16 07:59:51,048 [shard 0] kafka - requests.cc:94 - [172.18.0.5:35540] processing name:fetch, key:1, version:11 for KgoVerifierSeqConsumer-0-140352342039840, mem_units: 8260, ctx_size: 80
TRACE 2023-06-16 07:59:51,048 [shard 0] kafka - handler.h:69 - [client_id: {KgoVerifierSeqConsumer-0-140352342039840}] handling fetch v11 request {replica_id=-1 max_wait_ms=5000 min_bytes=1 max_bytes=52428800 isolation_level=read_uncommitted session_id=1 session_epoch=2 topics={{name={panda-topic} fetch_partitions={{partition_index=0 current_leader_epoch=1 fetch_offset=1098 log_start_offset=-1 max_bytes=1048576}}}} forgotten={} rack_id={}}
TRACE 2023-06-16 07:59:51,049 [shard 0] kafka - request_context.h:189 - [172.18.0.5:35540] sending 1:fetch for {KgoVerifierSeqConsumer-0-140352342039840}, response {throttle_time_ms=0 error_code={ error_code: none [0] } session_id=1 topics={{name={panda-topic} partitions={{partition_index=0 error_code={ error_code: none [0] } high_watermark=50000 last_stable_offset=50000 log_start_offset=0 aborted={nullopt} preferred_read_replica=-1 records={{size 1048165}}}}}}}
TRACE 2023-06-16 07:59:51,050 [shard 0] kafka - requests.cc:94 - [172.18.0.5:35540] processing name:fetch, key:1, version:11 for KgoVerifierSeqConsumer-0-140352342039840, mem_units: 8260, ctx_size: 80
TRACE 2023-06-16 07:59:51,050 [shard 0] kafka - handler.h:69 - [client_id: {KgoVerifierSeqConsumer-0-140352342039840}] handling fetch v11 request {replica_id=-1 max_wait_ms=5000 min_bytes=1 max_bytes=52428800 isolation_level=read_uncommitted session_id=1 session_epoch=3 topics={{name={panda-topic} fetch_partitions={{partition_index=0 current_leader_epoch=1 fetch_offset=2044 log_start_offset=-1 max_bytes=1048576}}}} forgotten={} rack_id={}}
TRACE 2023-06-16 07:59:51,051 [shard 0] kafka - request_context.h:189 - [172.18.0.5:35540] sending 1:fetch for {KgoVerifierSeqConsumer-0-140352342039840}, response {throttle_time_ms=0 error_code={ error_code: none [0] } session_id=1 topics={{name={panda-topic} partitions={{partition_index=0 error_code={ error_code: none [0] } high_watermark=50000 last_stable_offset=50000 log_start_offset=0 aborted={nullopt} preferred_read_replica=-1 records={{size 86590}}}}}}}
TRACE 2023-06-16 07:59:51,051 [shard 0] kafka - requests.cc:94 - [172.18.0.5:35540] processing name:fetch, key:1, version:11 for KgoVerifierSeqConsumer-0-140352342039840, mem_units: 8260, ctx_size: 80
TRACE 2023-06-16 07:59:51,051 [shard 0] kafka - handler.h:69 - [client_id: {KgoVerifierSeqConsumer-0-140352342039840}] handling fetch v11 request {replica_id=-1 max_wait_ms=5000 min_bytes=1 max_bytes=52428800 isolation_level=read_uncommitted session_id=1 session_epoch=4 topics={{name={panda-topic} fetch_partitions={{partition_index=0 current_leader_epoch=1 fetch_offset=2122 log_start_offset=-1 max_bytes=1048576}}}} forgotten={} rack_id={}}
TRACE 2023-06-16 07:59:51,052 [shard 0] kafka - request_context.h:189 - [172.18.0.5:35540] sending 1:fetch for {KgoVerifierSeqConsumer-0-140352342039840}, response {throttle_time_ms=0 error_code={ error_code: none [0] } session_id=1 topics={{name={panda-topic} partitions={{partition_index=0 error_code={ error_code: none [0] } high_watermark=50000 last_stable_offset=50000 log_start_offset=0 aborted={nullopt} preferred_read_replica=-1 records={{size 1048165}}}}}}}

@travisdowns
Copy link
Member

@VladLazar
Copy link
Contributor

FAIL test: CloudStorageChunkReadTest.test_read_when_cache_smaller_than_segment_size (12/52 runs)
  failure at 2023-06-28T08:26:12.886Z: TimeoutError(None)
      on (arm64, container) in job https://buildkite.com/redpanda/redpanda/builds/32109#018900b1-a2a0-4f8a-b5d8-68cf8d985c2c
  failure at 2023-06-28T07:40:30.513Z: TimeoutError(None)
      on (arm64, container) in job https://buildkite.com/redpanda/redpanda/builds/32106#018900a9-375b-4412-abc2-e3bb9302405e
  failure at 2023-06-28T00:15:11.882Z: TimeoutError(None)
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/32093#0188ff20-b0a6-4ce0-9dc5-feedab13e733
  failure at 2023-06-27T19:43:20.178Z: TimeoutError(None)
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/32077#0188fe20-ba96-4b9f-9d49-405d0ffce4eb
  failure at 2023-06-27T19:14:31.630Z: TimeoutError(None)
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/32076#0188fe06-b403-46d3-8039-128f0f0ab626
  failure at 2023-06-27T19:24:43.524Z: TimeoutError(None)
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/32080#0188fe1c-ec9f-4832-a4a9-fca94c088661
  failure at 2023-06-28T00:00:33.911Z: TimeoutError(None)
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/32093#0188ff13-47e1-4cf8-99f8-8cb801059695
  failure at 2023-06-28T07:08:43.605Z: TimeoutError(None)
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/32109#0189009d-1aff-470e-b3a2-452a7eede513
  failure at 2023-06-28T06:59:48.961Z: TimeoutError(None)
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/32106#01890094-13de-4b77-a73f-899ed785380f
  failure at 2023-06-28T07:07:06.517Z: TimeoutError(None)
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/32103#0189009a-ac00-46f4-8b9d-de77aaf5f42d
  failure at 2023-06-28T07:42:50.537Z: TimeoutError(None)
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/32113#018900ba-c7f7-464e-80fd-0da24a893f99
  failure at 2023-06-28T06:53:34.855Z: TimeoutError(None)
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/32103#0189008e-32a0-4d2a-ab80-859291ddba6a

@dotnwat
Copy link
Member

dotnwat commented Jun 28, 2023

@abhijat
Copy link
Contributor

abhijat commented Jun 29, 2023

There seem to be a couple of reasons for the recent increased frequency of this test failing:

[INFO  - 2023-06-28 06:12:01,655 - kgo_verifier_services - _ingest_status - lineno:327]: Worker KgoVerifierSeqConsumer-0-140346127718896 status: ConsumerStatus<True, 0, ValidatorStatus<39528 0 0>>

[INFO  - 2023-06-28 06:45:12,517 - kgo_verifier_services - _ingest_status - lineno:327]: Worker KgoVerifierSeqConsumer-0-139709294245440 status: ConsumerStatus<True, 0, ValidatorStatus<43208 0 0>>

[INFO  - 2023-06-28 06:37:13,423 - kgo_verifier_services - _ingest_status - lineno:327]: Worker KgoVerifierSeqConsumer-0-139967800634736 status: ConsumerStatus<True, 0, ValidatorStatus<38990 0 0>>

[INFO  - 2023-06-27 18:43:13,479 - kgo_verifier_services - _ingest_status - lineno:327]: Worker KgoVerifierSeqConsumer-0-140549645997184 status: ConsumerStatus<True, 0, ValidatorStatus<39001 0 0>>

[INFO  - 2023-06-27 18:30:01,346 - kgo_verifier_services - _ingest_status - lineno:327]: Worker KgoVerifierSeqConsumer-0-139989579734208 status: ConsumerStatus<True, 0, ValidatorStatus<38976 0 0>>

[INFO  - 2023-06-27 18:58:37,171 - kgo_verifier_services - _ingest_status - lineno:327]: Worker KgoVerifierSeqConsumer-0-139928648099024 status: ConsumerStatus<True, 0, ValidatorStatus<39005 0 0>>

[INFO  - 2023-06-27 23:42:47,401 - kgo_verifier_services - _ingest_status - lineno:327]: Worker KgoVerifierSeqConsumer-0-140626917587312 status: ConsumerStatus<False, 0, ValidatorStatus<50000 0 0>>

[INFO  - 2023-06-28 07:12:42,253 - kgo_verifier_services - _ingest_status - lineno:327]: Worker KgoVerifierSeqConsumer-0-281472822534800 status: ConsumerStatus<True, 0, ValidatorStatus<40578 0 0>>

[INFO  - 2023-06-28 07:57:51,089 - kgo_verifier_services - _ingest_status - lineno:327]: Worker KgoVerifierSeqConsumer-0-281473762691920 status: ConsumerStatus<True, 0, ValidatorStatus<42247 0 0>>

With the change in coarse index entry calculation #11705 we are creating chunks much closer to the requested size of 1MB:

TRACE 2023-06-28 07:54:50,504 [shard 1] cloud_storage - remote_segment_index.cc:181 - building coarse index from file offset index with 0 rows
TRACE 2023-06-28 07:54:50,504 [shard 1] cloud_storage - remote_segment_index.cc:219 - adding entry to coarse index, current file pos: 1134863, step size: 1048576, span size: 1134864
TRACE 2023-06-28 07:54:50,504 [shard 1] cloud_storage - remote_segment_index.cc:219 - adding entry to coarse index, current file pos: 2267404, step size: 1048576, span size: 1132541
TRACE 2023-06-28 07:54:50,504 [shard 1] cloud_storage - remote_segment_index.cc:219 - adding entry to coarse index, current file pos: 3472011, step size: 1048576, span size: 1204607
TRACE 2023-06-28 07:54:50,504 [shard 1] cloud_storage - remote_segment_index.cc:219 - adding entry to coarse index, current file pos: 4623343, step size: 1048576, span size: 1151332

Before the above change, an offset index with 0 rows (see offset index with 0 rows above) would result in the entire segment being treated as one chunk, because with 0 rows we could not build a coarse index. With the change we also check the arrays in offset index and are able to create 1MiB sized chunks as seen above.

An entire segment being treated as a chunk results in the consumer being quite fast because it just has to download the segment once, overshooting the cache size (for this test the cache is 1MiB and segment is 5MiB) but resulting in very few calls to cloud storage.

With the new calculation we are actually downloading 1MiB chunks, thus resulting in more calls to cloud storage and having to wait after each chunk is finished to get the next one.

Additionally with a 1MiB chunk and only 4MiB cache, we now also see frequent throttling of cache trimming:

[abhijat@fedora logs]$ grep throttled redpanda\ \(1\).log
INFO  2023-06-28 07:54:50,547 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4990ms
INFO  2023-06-28 07:54:55,553 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4985ms
INFO  2023-06-28 07:55:00,550 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4989ms
INFO  2023-06-28 07:55:05,552 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4988ms
INFO  2023-06-28 07:55:10,553 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4989ms
INFO  2023-06-28 07:55:15,555 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4989ms
INFO  2023-06-28 07:55:20,557 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4988ms
INFO  2023-06-28 07:55:25,559 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4987ms
INFO  2023-06-28 07:55:30,559 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4990ms
INFO  2023-06-28 07:55:35,561 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4989ms
INFO  2023-06-28 07:55:40,561 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4991ms
INFO  2023-06-28 07:55:45,566 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4988ms
INFO  2023-06-28 07:55:50,568 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4989ms
INFO  2023-06-28 07:55:55,570 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4989ms
INFO  2023-06-28 07:56:00,573 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4988ms
INFO  2023-06-28 07:56:05,576 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4988ms
INFO  2023-06-28 07:56:10,579 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4987ms
INFO  2023-06-28 07:56:15,580 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4989ms
INFO  2023-06-28 07:56:20,582 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4990ms
INFO  2023-06-28 07:56:25,585 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4989ms
INFO  2023-06-28 07:56:30,586 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4990ms
INFO  2023-06-28 07:56:35,592 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4986ms
INFO  2023-06-28 07:56:40,592 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4989ms
INFO  2023-06-28 07:56:45,596 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4988ms
INFO  2023-06-28 07:56:50,597 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4990ms
INFO  2023-06-28 07:56:55,600 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4989ms
INFO  2023-06-28 07:57:00,605 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4988ms
INFO  2023-06-28 07:57:05,620 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4975ms
INFO  2023-06-28 07:57:10,610 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4989ms
INFO  2023-06-28 07:57:15,623 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4978ms
INFO  2023-06-28 07:57:20,614 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4991ms
INFO  2023-06-28 07:57:25,621 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4987ms
INFO  2023-06-28 07:57:30,623 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4989ms
INFO  2023-06-28 07:57:35,635 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4980ms
INFO  2023-06-28 07:57:40,631 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4988ms
INFO  2023-06-28 07:57:45,635 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4988ms
INFO  2023-06-28 07:57:50,640 [shard 0] cloud_storage - cache_service.cc:227 - Cache trimming throttled, waiting 4987ms

These 5 second waits slow down the consumer further. A fix here would be to change the assertions so that we ensure that we are able to read data from cloud storage, but not necessarily to the finish. Since the purpose of the test is to ensure that the read path can work in the extreme case where cache is smaller than segment, being able to read from the cache should be enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cloud-storage Shadow indexing subsystem area/storage ci-failure kind/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants