Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI Failure (key symptom) in PartitionMoveInterruption.test_cancelling_partition_move #20758

Closed
vbotbuildovich opened this issue Jun 29, 2024 · 5 comments · Fixed by #21324
Closed
Assignees
Labels
area/raft auto-triaged used to know which issues have been opened from a CI job ci-failure ci-rca/redpanda CI Root Cause Analysis - Redpanda Issue

Comments

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Jun 29, 2024

https://buildkite.com/redpanda/redpanda/builds/50894

Module: rptest.tests.partition_move_interruption_test
Class: PartitionMoveInterruption
Method: test_cancelling_partition_move
Arguments: {
    "recovery": "restart_recovery",
    "compacted": false,
    "unclean_abort": true,
    "replication_factor": 1
}
test_id:    PartitionMoveInterruption.test_cancelling_partition_move
status:     FAIL
run time:   16.925 seconds

ConnectionError(MaxRetryError("HTTPConnectionPool(host='docker-rp-12', port=9644): Max retries exceeded with url: /v1/security/users?redirect=1 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4fcd860880>: Failed to establish a new connection: [Errno 111] Connection refused'))"))
Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/urllib3/connection.py", line 159, in _new_conn
    conn = connection.create_connection(
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/urllib3/util/connection.py", line 84, in create_connection
    raise err
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/urllib3/util/connection.py", line 74, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.10/http/client.py", line 1283, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1329, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1038, in _send_output
    self.send(msg)
  File "/usr/lib/python3.10/http/client.py", line 976, in send
    self.connect()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/urllib3/connection.py", line 187, in connect
    conn = self._new_conn()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/urllib3/connection.py", line 171, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f4fcd860880>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/urllib3/util/retry.py", line 446, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='docker-rp-12', port=9644): Max retries exceeded with url: /v1/security/users?redirect=1 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4fcd860880>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 276, in run_test
    return self.test_context.function(self.test)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 535, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 105, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_move_interruption_test.py", line 158, in test_cancelling_partition_move
    self.start_redpanda(num_nodes=5,
  File "/root/tests/rptest/tests/end_to_end.py", line 129, in start_redpanda
    self.redpanda.start(nodes=started_nodes,
  File "/root/tests/rptest/services/redpanda.py", line 2731, in start
    self._admin.create_user(*self._superuser)
  File "/root/tests/rptest/services/admin.py", line 1128, in create_user
    self._request("POST",
  File "/root/tests/rptest/services/admin.py", line 640, in _request
    r = self._session.request(verb, url, **kwargs)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='docker-rp-12', port=9644): Max retries exceeded with url: /v1/security/users?redirect=1 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4fcd860880>: Failed to establish a new connection: [Errno 111] Connection refused'))
Test requested 7 nodes, used only 5```


JIRA Link: [CORE-5034](https://redpandadata.atlassian.net/browse/CORE-5034)
@vbotbuildovich vbotbuildovich added auto-triaged used to know which issues have been opened from a CI job ci-failure labels Jun 29, 2024
@dotnwat
Copy link
Member

dotnwat commented Jul 6, 2024

node crashed

ERROR 2024-06-28 20:41:19,249 [shard 0:admi] assert - Assert failure: (/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-02363aa4813e8de7f-1/redpanda/redpanda/src/v/raft/mux_state_machine.h:344) 'it != _results.end()' last applied offset 33 is greater than returned replicate result 33. this must imply existence of a result in results map
ERROR 2024-06-28 20:41:19,249 [shard 0:admi] assert - Backtrace below:
0xb1b26 /opt/redpanda_installs/ci/lib/libseastar.so+0x221563b /opt/redpanda_installs/ci/lib/libseastar.so+0x2217128 /opt/redpanda_installs/ci/lib/libseastar.so+0x2217ac0 /opt/redpanda_installs/ci/lib/libv_v_cluster.so+0xca498f8 /opt/redpanda_installs/ci/lib/libv_v_cluster.so+0xca4860a /opt/redpanda_installs/ci/lib/libv_v_cluster.so+0xca48217 /opt/redpanda_installs/ci/lib/libv_v_application.so+0x43574dc /opt/redpanda_installs/ci/lib/libv_v_application.so+0x4357053 /opt/redpanda_installs/ci/lib/libv_v_application.so+0x4356e06 /opt/redpanda_installs/ci/lib/libseastar.so+0x1ac2792 /opt/redpanda_installs/ci/lib/libseastar.so+0x1ac8d36 /opt/redpanda_installs/ci/lib/libseastar.so+0x1acb8b5 /opt/redpanda_installs/ci/lib/libseastar.so+0x1ac9d47 /opt/redpanda_installs/ci/lib/libseastar.so+0x185ca65 /opt/redpanda_installs/ci/lib/libseastar.so+0x185adff /opt/redpanda_installs/ci/lib/libv_v_application.so+0x4a6eb8a 0x143a67 /opt/redpanda_installs/ci/lib/libc.so.6+0x2a087 /opt/redpanda_installs/ci/lib/libc.so.6+0x2a14a 0x6ca84
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>, seastar::futurize<seastar::future<void>>::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>>(seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&, seastar::future_state<seastar::internal::monostate>&&), void>
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>, seastar::futurize<seastar::future<void>>::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>>(seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&, seastar::future_state<seastar::internal::monostate>&&), void>
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type

@mmaslankaprv mmaslankaprv self-assigned this Jul 8, 2024
@mmaslankaprv mmaslankaprv added ci-rca/redpanda CI Root Cause Analysis - Redpanda Issue area/raft labels Jul 8, 2024
@mmaslankaprv
Copy link
Member

The assertion was triggered because the same base offset was assigned twice to different batches:

TRACE 2024-06-28 20:41:19,236 [shard 0:admi] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:168 - Leader append result: {time_since_append: 20, base_offset: 33, last_offset: 33, last_term: 1, byte_size: 297}


TRACE 2024-06-28 20:41:19,242 [shard 0:main] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:168 - Leader append result: {time_since_append: 3, base_offset: 33, last_offset: 33, last_term: 1, byte_size: 75}

@mmaslankaprv
Copy link
Member

The problem is caused by a race condition i a storage layer

@abhijat
Copy link
Contributor

abhijat commented Sep 2, 2024

Seen this same assert failure in https://buildkite.com/redpanda/redpanda/builds/53808 several times

docker-rp-14/redpanda.log:TRACE 2024-08-30 10:59:34,874 [shard 0:main] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:168 - Leader append result: {time_since_append: 1, base_offset: 24, last_offset: 24, last_term: 1, byte_size: 107}
docker-rp-14/redpanda.log:TRACE 2024-08-30 10:59:42,874 [shard 0:main] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:168 - Leader append result: {time_since_append: 0, base_offset: 24, last_offset: 24, last_term: 1, byte_size: 107}

In the above build a couple of tests also fail with timeout during alter topic config, but looking at the logs there the same assert failure has crashed the broker, although the timeout occurred a couple of seconds before the crash, so maybe it is a separate issue (still looking into that)

WARN  2024-08-30 10:59:41,498 [shard 0:main] kafka - config_utils.h:286 - Failed to alter topic properties of topic(s) {{kafka/panda-topic}} error_code observed: cluster::errc::timeout

followed by:

ERROR 2024-08-30 10:59:42,876 [shard 0:main] assert - Assert failure: (/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-0c0dcb9e76459e3f4-1/redpanda/redpanda/src/v/raft/mux_state_machine.h:344) 'it != _results.end()' last applied offset 24 is greater than returned replicate result 24. this must imply existence of a result in results map

@piyushredpanda
Copy link
Contributor

Closing older-bot-filed CI issues as we transition to a more reliable system.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/raft auto-triaged used to know which issues have been opened from a CI job ci-failure ci-rca/redpanda CI Root Cause Analysis - Redpanda Issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants