CI Failure (assert target_offset <= _insync_offset) in SIPartitionMovementTest.test_cross_shard #11659

Closed
rystsov opened this issue Jun 23, 2023 · 4 comments · Fixed by #12155
Labels: area/raft, ci-failure, kind/bug (Something isn't working), sev/high (loss of availability, pathological performance degradation, recoverable corruption)

rystsov commented Jun 23, 2023

https://buildkite.com/redpanda/redpanda/builds/31828

Module: rptest.tests.partition_movement_test
Class: SIPartitionMovementTest
Method: test_cross_shard
Arguments: {
    "num_to_upgrade": 0,
    "cloud_storage_type": 2
}
test_id:    SIPartitionMovementTest.test_cross_shard
status:     FAIL
run time:   89.903 seconds

<NodeCrash docker-rp-20: ERROR 2023-06-23 07:16:22,112 [shard 1] assert - Assert failure: (/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-016a1ffb2b0ea9d7f-1/redpanda/redpanda/src/v/cluster/persisted_stm.cc:330) 'target_offset <= _insync_offset' [{kafka/topic/0} (tx.snapshot)]  after we waited for target_offset (2764) _insync_offset (2604) should have matched it or bypassed
>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 159, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 84, in create_connection
    raise err
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 74, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.10/http/client.py", line 1282, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1328, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1277, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1037, in _send_output
    self.send(msg)
  File "/usr/lib/python3.10/http/client.py", line 975, in send
    self.connect()
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 187, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 171, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fb187bd74f0>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/retry.py", line 446, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='docker-rp-20', port=9644): Max retries exceeded with url: /v1/partitions/kafka/topic/0 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb187bd74f0>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/tests/rptest/services/cluster.py", line 79, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/utils/mode_checks.py", line 63, in f
    return func(*args, **kwargs)
  File "/root/tests/rptest/tests/partition_movement_test.py", line 1046, in test_cross_shard
    self._wait_post_move(topic, partition, assignments, 360)
  File "/root/tests/rptest/tests/partition_movement.py", line 125, in _wait_post_move
    wait_until(status_done, timeout_sec=timeout_sec, backoff_sec=2)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 53, in wait_until
    raise e
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 44, in wait_until
    if condition():
  File "/root/tests/rptest/tests/partition_movement.py", line 115, in status_done
    info = admin.get_partitions(topic, partition, node=n)
  File "/root/tests/rptest/services/admin.py", line 594, in get_partitions
    return self._request('get', path, node=node).json()
  File "/root/tests/rptest/services/admin.py", line 334, in _request
    r = self._session.request(verb, url, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='docker-rp-20', port=9644): Max retries exceeded with url: /v1/partitions/kafka/topic/0 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb187bd74f0>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 100, in wrapped
    redpanda.raise_on_crash(log_allow_list=log_allow_list)
  File "/root/tests/rptest/services/redpanda.py", line 2429, in raise_on_crash
    raise NodeCrash(crashes)
rptest.services.utils.NodeCrash: <NodeCrash docker-rp-20: ERROR 2023-06-23 07:16:22,112 [shard 1] assert - Assert failure: (/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-016a1ffb2b0ea9d7f-1/redpanda/redpanda/src/v/cluster/persisted_stm.cc:330) 'target_offset <= _insync_offset' [{kafka/topic/0} (tx.snapshot)]  after we waited for target_offset (2764) _insync_offset (2604) should have matched it or bypassed
>
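
The assertion text in the crash line reads as a post-condition: once the wait for `target_offset` returns, `_insync_offset` is expected to be at or past it. The snippet below is only a toy restatement of that check in Python, not the code from `persisted_stm.cc`; `check_after_wait` and `InvariantViolation` are hypothetical names, and the two offsets are the ones printed in the NodeCrash message above.

```python
class InvariantViolation(AssertionError):
    """Raised when the post-wait invariant does not hold (hypothetical name)."""


def check_after_wait(target_offset: int, insync_offset: int) -> None:
    # Post-condition quoted in the crash message: after waiting for
    # target_offset, the in-sync offset must have matched or passed it.
    if not target_offset <= insync_offset:
        raise InvariantViolation(
            f"after we waited for target_offset ({target_offset}) "
            f"_insync_offset ({insync_offset}) should have matched it or bypassed"
        )


if __name__ == "__main__":
    # Offsets taken from the NodeCrash line in this report.
    try:
        check_after_wait(target_offset=2764, insync_offset=2604)
    except InvariantViolation as exc:
        print(f"invariant violated: {exc}")
```

With the reported values the check fails because 2764 > 2604, which is the state the node was in when it aborted.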
rystsov added the kind/bug (Something isn't working) and ci-failure labels on Jun 23, 2023
piyushredpanda added the area/cloud-storage (Shadow indexing subsystem) label on Jun 27, 2023
andrwng added the sev/high (loss of availability, pathological performance degradation, recoverable corruption) and area/raft labels and removed the area/cloud-storage (Shadow indexing subsystem) label on Jul 7, 2023

andrwng commented Jul 8, 2023

Switching labels since the crash is in persisted_stm of rm_stm.

It could be caused by some of the delete records changes. Maybe it's been fixed already (perhaps #11852?)

mmaslankaprv self-assigned this on Jul 10, 2023
mmaslankaprv commented

This seems to be fixed by the recent changes in log_eviction_stm. I've tested this hundreds of times here:

mmaslankaprv commented

I am closing this one.


BenPope commented Jul 17, 2023

BenPope reopened this on Jul 17, 2023
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jul 17, 2023
…fset

Previously the `rm_stm` took a snapshot only up to `log::start_offset`.
This is wasteful, as the stm can easily snapshot up to the insync offset,
which spares the stm from reading the whole log.

This change in behavior also fixes an issue where a snapshot might
have been created with an offset lower than the offset of a snapshot that
was already applied.

Fixes: redpanda-data#11659

Signed-off-by: Michal Maslanka <michal@redpanda.com>
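
The commit message contrasts two choices of snapshot offset. The toy model below is a sketch of that contrast under stated assumptions, not the `rm_stm` implementation: `ToyStm`, its fields, and its methods are hypothetical; it assumes that applying a snapshot advances the insync offset to the snapshot's offset; and the offsets 100 and 50 are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyStm:
    """Hypothetical stand-in for a state machine that tracks two offsets."""

    insync_offset: int = -1            # highest offset reflected in the state
    applied_snapshot_offset: int = -1  # offset of the last snapshot applied

    def apply_snapshot(self, offset: int) -> None:
        # Assumption: applying a snapshot moves the state up to its offset.
        self.applied_snapshot_offset = offset
        self.insync_offset = max(self.insync_offset, offset)

    def snapshot_offset_old(self, log_start_offset: int) -> int:
        # "Previously": snapshot only up to the log start offset.
        return log_start_offset

    def snapshot_offset_new(self) -> int:
        # After the change: snapshot up to the insync offset.
        return self.insync_offset


if __name__ == "__main__":
    stm = ToyStm()
    stm.apply_snapshot(100)  # illustrative offset
    old = stm.snapshot_offset_old(log_start_offset=50)  # illustrative offset
    new = stm.snapshot_offset_new()
    print("old choice is behind the applied snapshot:", old < stm.applied_snapshot_offset)
    print("new choice is behind the applied snapshot:", new < stm.applied_snapshot_offset)
```

In this model a snapshot taken at the insync offset cannot fall behind a snapshot that was already applied, whereas one taken at the log start offset can whenever the log start lags the applied snapshot.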