Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI Failure (timeout on start) in PartitionMoveInterruption.test_cancellations_interrupted_with_restarts #10497

Closed
VladLazar opened this issue May 2, 2023 · 17 comments · Fixed by #11382 or #11519
Assignees
Labels
area/redpanda ci-failure kind/bug Something isn't working sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages

Comments

@VladLazar
Copy link
Contributor

VladLazar commented May 2, 2023

https://buildkite.com/redpanda/redpanda/builds/28393#0187dc0c-8f3d-4563-941e-187afc6df1b0

Module: rptest.tests.partition_move_interruption_test
Class:  PartitionMoveInterruption
Method: test_cancellations_interrupted_with_restarts
Arguments:
{
  "replication_factor": 1
}
test_id:    rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancellations_interrupted_with_restarts.replication_factor=1
status:     FAIL
run time:   16 minutes 4.721 seconds


TimeoutError('Redpanda service docker-rp-12 failed to start within 20 sec')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_move_interruption_test.py", line 671, in test_cancellations_interrupted_with_restarts
    self.redpanda.start_node(n)
  File "/root/tests/rptest/services/redpanda.py", line 1784, in start_node
    self.start_service(node, start_rp)
  File "/root/tests/rptest/services/redpanda.py", line 1855, in start_service
    start()
  File "/root/tests/rptest/services/redpanda.py", line 1775, in start_rp
    wait_until(
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Redpanda service docker-rp-12 failed to start within 20 sec
@VladLazar
Copy link
Contributor Author

I'm not too sure what happened here. docker-rp-12 seems to have had a slow start.

[DEBUG - 2023-05-02 11:27:39,402 - remoteaccount - _log - lineno:166]: root@docker-rp-12: Running ssh command: ulimit -Sc unlimited; ... // start command
[...]
[DEBUG - 2023-05-02 11:27:58,502 - redpanda - __is_status_ready - lineno:1710]: node docker-rp-12 not yet accepting connections
[WARNING - 2023-05-02 11:27:59,504 - redpanda - start_service - lineno:1859]: Failed to start on docker-rp-12, gathering node ps and netstat...

From the logs of docker-rp-12, the HTTP server started around the time of the test timing out, but there's no requests for /v1/status/ready` in the log:

INFO  2023-05-02 11:27:59,109 [shard 0] admin_api_server - admin_server.cc:242 - Started HTTP admin service listening at {{:{host: 127.0.0.1, port: 9644}}, {:{host: docker-rp-12, port: 9644}}, {iplistener:{host: 172.16.16.23, port: 9647}}}

Marking this as sev/low as it looks like a harmless race.

@VladLazar VladLazar added the sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages label May 2, 2023
@michael-redpanda
Copy link
Contributor

@abhijat
Copy link
Contributor

abhijat commented May 22, 2023

@VladLazar
Copy link
Contributor Author

FAIL test: PartitionMoveInterruption.test_cancellations_interrupted_with_restarts.replication_factor=1 (5/67 runs)
  failure at 2023-05-22T14:36:59.342Z: TimeoutError('Redpanda service docker-rp-2 failed to start within 20 sec')
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/29583#0188439d-c75b-455d-9c3b-a1208e189e65
  failure at 2023-05-22T11:46:58.768Z: TimeoutError('Redpanda service docker-rp-2 failed to start within 20 sec')
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/29568#01884313-c688-4ea5-ab03-d05ad7afd33e
  failure at 2023-05-21T06:30:41.249Z: TimeoutError('Redpanda service docker-rp-13 failed to start within 20 sec')
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/29538#01883ccc-16ac-43f9-a776-ef4ea1905f7d
  failure at 2023-05-20T07:13:48.587Z: TimeoutError('Redpanda service docker-rp-12 failed to start within 20 sec')
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/29525#018837ba-2409-49dd-95ef-8aa136464689
  failure at 2023-05-20T08:00:19.452Z: TimeoutError('Redpanda service docker-rp-13 failed to start within 20 sec')
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/29533#018837dc-9692-49d1-aabb-ba8defaba029

@michael-redpanda
Copy link
Contributor

@jcsp
Copy link
Contributor

jcsp commented May 23, 2023

Failures linked here are all debug builds. The one I saw on #10287 (https://buildkite.com/redpanda/redpanda/builds/29681#01884882-9a65-4fd9-a8eb-976bddadcf9e) is missing timeout deadline by <1000ms.

Several other tests recently became unstable in debug mode. I suspect there is some rogue test making the rest run slowly, compounded by our change to debug build to use shared libraries which makes everything run a bit slower.

@NyaliaLui
Copy link
Contributor

@NyaliaLui
Copy link
Contributor

FAIL test: PartitionMoveInterruption.test_cancellations_interrupted_with_restarts.replication_factor=1 (3/31 runs)
  failure at 2023-05-25T23:26:01.234Z: TimeoutError('Redpanda service docker-rp-13 failed to start within 20 sec')
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/29928#01885502-1e1b-4cd6-8d2b-654d2ae0ac48
FAIL test: PartitionMoveInterruption.test_cancellations_interrupted_with_restarts.replication_factor=3 (1/31 runs)
  failure at 2023-05-25T23:28:19.897Z: TimeoutError('Redpanda service docker-rp-3 failed to start within 20 sec')
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/29928#01885502-1e1d-4952-af58-5440720806b5

@abhijat
Copy link
Contributor

abhijat commented May 30, 2023

https://buildkite.com/redpanda/redpanda/builds/30108#018868e9-8cf5-4149-a046-793962299e58

[INFO  - 2023-05-29 19:46:44,176 - runner_client - log - lineno:278]: RunnerClient: rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancellations_interrupted_with_restarts.replication_factor=1: Summary: TimeoutError('Redpanda service docker-rp-3 failed to start within 20 sec')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_move_interruption_test.py", line 670, in test_cancellations_interrupted_with_restarts
    self.redpanda.start_node(n)
  File "/root/tests/rptest/services/redpanda.py", line 2144, in start_node
    self.start_service(node, start_rp)
  File "/root/tests/rptest/services/redpanda.py", line 2216, in start_service
    start()
  File "/root/tests/rptest/services/redpanda.py", line 2135, in start_rp
    wait_until(
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Redpanda service docker-rp-3 failed to start within 20 sec

@michael-redpanda
Copy link
Contributor

@ztlpn
Copy link
Contributor

ztlpn commented Jun 15, 2023

mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jun 18, 2023
Disabled `test_cancellations_interrupted_with_restarts` test running in
debug mode.

Fixes: redpanda-data#10497

Signed-off-by: Michal Maslanka <michal@redpanda.com>
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jun 18, 2023
Disabled `test_cancellations_interrupted_with_restarts` test running in
debug mode.

Fixes: redpanda-data#10497

Signed-off-by: Michal Maslanka <michal@redpanda.com>
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jun 18, 2023
Disabled `test_cancellations_interrupted_with_restarts` test running in
debug mode.

Fixes: redpanda-data#10497

Signed-off-by: Michal Maslanka <michal@redpanda.com>
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Jun 18, 2023
Disabled `test_cancellations_interrupted_with_restarts` test running in
debug mode.

Fixes: redpanda-data#10497

Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit 010888d)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/redpanda ci-failure kind/bug Something isn't working sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages
Projects
None yet
9 participants