CI Failure (seg fault) in RandomNodeOperationsTest.test_node_operations #13301

Closed
rockwotj opened this issue Sep 7, 2023 · 6 comments
Labels: area/controller, ci-failure, ci-ignore (Automatic ci analysis tools ignore this issue), kind/bug (Something isn't working), sev/high (loss of availability, pathological performance degradation, recoverable corruption)

@rockwotj
Contributor

rockwotj commented Sep 7, 2023

https://buildkite.com/redpanda/redpanda/builds/36519#018a6bf2-9147-41f8-8232-4b4eb571d148

Module: rptest.tests.random_node_operations_test
Class:  RandomNodeOperationsTest
Method: test_node_operations
Arguments:
{
  "enable_controller_snapshots": true,
  "enable_failures": true,
  "num_to_upgrade": 0
}
test_id:    rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=True.num_to_upgrade=0.enable_controller_snapshots=True
status:     FAIL
run time:   3 minutes 35.365 seconds


    <NodeCrash docker-rp-11: Segmentation fault on shard 0.
>
Traceback (most recent call last):
  File "/root/tests/rptest/services/cluster.py", line 82, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/random_node_operations_test.py", line 316, in test_node_operations
    executor.execute_operation(op)
  File "/root/tests/rptest/utils/node_operations.py", line 432, in execute_operation
    self.wait_for_rebalanced(operation.node)
  File "/root/tests/rptest/utils/node_operations.py", line 406, in wait_for_rebalanced
    node_id = self.node_id(idx)
  File "/root/tests/rptest/utils/node_operations.py", line 234, in node_id
    return self.redpanda.node_id(self.redpanda.get_node(idx),
  File "/root/tests/rptest/services/redpanda.py", line 1100, in node_id
    node_id = wait_until_result(
  File "/root/tests/rptest/util.py", line 90, in wait_until_result
    wait_until(wrapped_condition, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: couldn't reach admin endpoint for docker-rp-11

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/utils/mode_checks.py", line 63, in f
    return func(*args, **kwargs)
  File "/root/tests/rptest/services/cluster.py", line 103, in wrapped
    redpanda.raise_on_crash(log_allow_list=log_allow_list)
  File "/root/tests/rptest/services/redpanda.py", line 2505, in raise_on_crash
    raise NodeCrash(crashes)
rptest.services.utils.NodeCrash: <NodeCrash docker-rp-11: Segmentation fault on shard 0.
>

JIRA Link: CORE-1433

rockwotj added the kind/bug, ci-failure, and sev/high labels Sep 7, 2023
@rockwotj
Contributor Author

rockwotj commented Sep 7, 2023

Segmentation fault on shard 0.
Backtrace:
  0x6f507a3
  0x6fa4f8b
  /opt/redpanda_installs/ci/lib/libc.so.6+0x42abf
  0x5b9c568
  0x5b95545
  0x22d88ba
  0x6f48dcf
  0x6f4c511
  0x6f496d6
  0x6e49eb0
  0x6e482a8
  0x21b4776
  0x72b1deb
  /opt/redpanda_installs/ci/lib/libc.so.6+0x2d58f
  /opt/redpanda_installs/ci/lib/libc.so.6+0x2d648
  0x21ae424
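
The raw frames above mix addresses in the main binary with `library+offset` entries for libc, so they have to be resolved against different files. A minimal sketch of grouping them and building `addr2line` invocations follows; the binary path `path/to/redpanda` and the helper names are hypothetical, and the commands are only constructed, not executed.

```python
import re

# Matches either "0x6f507a3" or "/path/to/lib.so+0x42abf".
FRAME_RE = re.compile(r"^\s*(?:(?P<lib>/\S+)\+)?(?P<addr>0x[0-9a-fA-F]+)\s*$")


def frames_by_binary(backtrace, main_binary):
    """Group backtrace addresses by the file they should be resolved against."""
    grouped = {}
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line)
        if match is None:
            continue  # skip headers such as "Backtrace:"
        binary = match.group("lib") or main_binary
        grouped.setdefault(binary, []).append(match.group("addr"))
    return grouped


def addr2line_command(binary, addrs):
    """Build an addr2line command: -f function names, -C demangle, -i inlined frames."""
    return ["addr2line", "-e", binary, "-f", "-C", "-i", *addrs]


sample = """Backtrace:
  0x6f507a3
  0x6fa4f8b
  /opt/redpanda_installs/ci/lib/libc.so.6+0x42abf
  0x5b9c568
"""

# "path/to/redpanda" is a placeholder for a binary from the same build.
for binary, addrs in frames_by_binary(sample, "path/to/redpanda").items():
    print(" ".join(addr2line_command(binary, addrs)))
```

The addresses only resolve correctly against the exact binary (with debug symbols) that produced the crash.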

@rockwotj
Contributor Author

rockwotj commented Sep 7, 2023

This failure was found while running CI for #13293, which only added a couple of stubs to the admin API and should not be affecting that test...

@mmaslankaprv
Member

The failure comes from cluster::partition_balancer_planner

{/home/mmaslanka/dev/rp-binary/libexec/redpanda} 0x6f507a3: void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at /v/build/v_deps_build/seastar-prefix/src/seastar/include/seastar/util/backtrace.hh:64
 (inlined by) seastar::backtrace_buffer::append_backtrace() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:824
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:854
 (inlined by) seastar::print_with_backtrace(char const*, bool) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:866
{/home/mmaslanka/dev/rp-binary/libexec/redpanda} 0x6fa4f8b: seastar::sigsegv_action() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:4150
 (inlined by) operator() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:4131
 (inlined by) __invoke at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:4127
addr2line: '/opt/redpanda_installs/ci/lib/libc.so.6': No such file
{/opt/redpanda_installs/ci/lib/libc.so.6} 0x42abf: /opt/redpanda_installs/ci/lib/libc.so.6 0x42abf
{/home/mmaslanka/dev/rp-binary/libexec/redpanda} 0x5b9c568: absl::lts_20220623::base_internal::PrefetchT2(void const*) at /vectorized/include/absl/base/internal/prefetch.h:103
 (inlined by) absl::lts_20220623::container_internal::raw_hash_set<absl::lts_20220623::container_internal::FlatHashSetPolicy<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > >, absl::lts_20220623::hash_internal::Hash<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > >, std::__1::equal_to<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > > >::prefetch_heap_block() const at /vectorized/include/absl/container/internal/raw_hash_set.h:2246
 (inlined by) absl::lts_20220623::container_internal::raw_hash_set<absl::lts_20220623::container_internal::FlatHashSetPolicy<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > >, absl::lts_20220623::hash_internal::Hash<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > >, std::__1::equal_to<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > > >::const_iterator absl::lts_20220623::container_internal::raw_hash_set<absl::lts_20220623::container_internal::FlatHashSetPolicy<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > >, absl::lts_20220623::hash_internal::Hash<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > >, std::__1::equal_to<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > > >::find<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > >(detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > const&) const at /vectorized/include/absl/container/internal/raw_hash_set.h:1766
 (inlined by) bool absl::lts_20220623::container_internal::raw_hash_set<absl::lts_20220623::container_internal::FlatHashSetPolicy<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > >, absl::lts_20220623::hash_internal::Hash<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > >, std::__1::equal_to<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > > >::contains<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > >(detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > const&) const at /vectorized/include/absl/container/internal/raw_hash_set.h:1772
 (inlined by) operator() at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-09c3d84999b646d24-1/redpanda/redpanda/src/v/cluster/partition_balancer_planner.cc:427
 (inlined by) std::__1::iterator_traits<std::__1::__wrap_iter<model::broker_shard const*> >::difference_type std::__1::count_if[abi:v160004]<std::__1::__wrap_iter<model::broker_shard const*>, cluster::has_quorum(absl::lts_20220623::flat_hash_set<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> >, absl::lts_20220623::hash_internal::Hash<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > >, std::__1::equal_to<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > > > const&, std::__1::vector<model::broker_shard, std::__1::allocator<model::broker_shard> > const&)::$_0>(std::__1::__wrap_iter<model::broker_shard const*>, std::__1::__wrap_iter<model::broker_shard const*>, cluster::has_quorum(absl::lts_20220623::flat_hash_set<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> >, absl::lts_20220623::hash_internal::Hash<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > >, std::__1::equal_to<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > > > const&, std::__1::vector<model::broker_shard, std::__1::allocator<model::broker_shard> > const&)::$_0) at /vectorized/llvm/bin/../include/c++/v1/__algorithm/count_if.h:28
 (inlined by) cluster::has_quorum(absl::lts_20220623::flat_hash_set<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> >, absl::lts_20220623::hash_internal::Hash<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > >, std::__1::equal_to<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<detail::base_named_type<int, model::node_id_model_type, std::__1::integral_constant<bool, true> > > > const&, std::__1::vector<model::broker_shard, std::__1::allocator<model::broker_shard> > const&) at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-09c3d84999b646d24-1/redpanda/redpanda/src/v/cluster/partition_balancer_planner.cc:423
 (inlined by) auto cluster::partition_balancer_planner::request_context::do_with_partition<seastar::noncopyable_function<seastar::bool_class<seastar::stop_iteration_tag> (cluster::partition_balancer_planner::partition&)> >(model::ntp const&, std::__1::vector<model::broker_shard, std::__1::allocator<model::broker_shard> > const&, seastar::noncopyable_function<seastar::bool_class<seastar::stop_iteration_tag> (cluster::partition_balancer_planner::partition&)>&) at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-09c3d84999b646d24-1/redpanda/redpanda/src/v/cluster/partition_balancer_planner.cc:744
{/home/mmaslanka/dev/rp-binary/libexec/redpanda} 0x5b95545: cluster::partition_balancer_planner::request_context::for_each_partition_random_order(seastar::noncopyable_function<seastar::bool_class<seastar::stop_iteration_tag> (cluster::partition_balancer_planner::partition&)>) at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-09c3d84999b646d24-1/redpanda/redpanda/src/v/cluster/partition_balancer_planner.cc:828
{/home/mmaslanka/dev/rp-binary/libexec/redpanda} 0x22d88ba: std::__1::coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume[abi:v160004]() const at /vectorized/llvm/bin/../include/c++/v1/__coroutine/coroutine_handle.h:169
 (inlined by) seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at /vectorized/include/seastar/core/coroutine.hh:125
{/home/mmaslanka/dev/rp-binary/libexec/redpanda} 0x6f48dcf: seastar::reactor::run_tasks(seastar::reactor::task_queue&) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:2750
 (inlined by) seastar::reactor::run_some_tasks() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:3213
{/home/mmaslanka/dev/rp-binary/libexec/redpanda} 0x6f4c511: seastar::reactor::do_run() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:3397
{/home/mmaslanka/dev/rp-binary/libexec/redpanda} 0x6f496d6: seastar::reactor::run() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:3265
{/home/mmaslanka/dev/rp-binary/libexec/redpanda} 0x6e49eb0: seastar::app_template::run_deprecated(int, char**, std::__1::function<void ()>&&) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/app-template.cc:276
{/home/mmaslanka/dev/rp-binary/libexec/redpanda} 0x6e482a8: seastar::app_template::run(int, char**, std::__1::function<seastar::future<int> ()>&&) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/app-template.cc:167
{/home/mmaslanka/dev/rp-binary/libexec/redpanda} 0x21b4776: application::run(int, char**) at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-09c3d84999b646d24-1/redpanda/redpanda/src/v/redpanda/application.cc:348
{/home/mmaslanka/dev/rp-binary/libexec/redpanda} 0x72b1deb: main at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-09c3d84999b646d24-1/redpanda/redpanda/src/v/redpanda/main.cc:22
{/opt/redpanda_installs/ci/lib/libc.so.6} 0x2d58f: /opt/redpanda_installs/ci/lib/libc.so.6 0x2d58f
{/opt/redpanda_installs/ci/lib/libc.so.6} 0x2d648: /opt/redpanda_installs/ci/lib/libc.so.6 0x2d648
{/home/mmaslanka/dev/rp-binary/libexec/redpanda} 0x21ae424: _start at ??:?
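
Reading the decoded frames, the fault is raised inside `flat_hash_set::contains`, called from a lambda passed to `std::count_if` in `cluster::has_quorum`, which takes a set of node ids and a vector of `broker_shard` replicas. A rough Python sketch of the shape of such a check is below. The majority rule and the meaning of the set (unavailable nodes) are assumptions, not taken from the Redpanda source, and Python cannot reproduce the memory fault itself; this only illustrates what the crashing frame was computing.

```python
from __future__ import annotations

from typing import NamedTuple


class BrokerShard(NamedTuple):
    node_id: int
    shard: int


def has_quorum(unavailable_nodes: set[int], replicas: list[BrokerShard]) -> bool:
    # Assumption: quorum means a strict majority of replicas sit on available nodes.
    majority = len(replicas) // 2 + 1
    # Mirrors the count_if + contains pattern visible in the backtrace.
    available = sum(1 for bs in replicas if bs.node_id not in unavailable_nodes)
    return available >= majority


# Replica set taken from the log line for kafka/topic-tjimmhjrhu/2.
replicas = [BrokerShard(0, 1), BrokerShard(4, 1), BrokerShard(5, 0)]
print(has_quorum({4}, replicas))     # one of three replicas unavailable
print(has_quorum({4, 5}, replicas))  # two of three replicas unavailable
```

Since the logic is a plain membership test, a segfault at this point suggests the set or the replica vector being read was no longer valid memory, rather than a flaw in the check itself.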

@mmaslankaprv
Member

This issue was triggered when the node executing the rebalancing action was suspended by the failure injector. After execution resumed, the crash occurred:

INFO  2023-09-06 19:46:46,033 [shard 0:main] cluster - partition_balancer_planner.cc:935 - ntp {kafka/topic-tjimmhjrhu/2} (size: 1200, orig replicas: {{node_id: 0, shard: 1}, {node_id: 4, shard: 1}, {node_id: 5, shard: 0}}): scheduling replica move 0 -> 1, reason: partition_count_rebalancing
INFO  2023-09-06 19:46:54,098 [shard 0:main] cluster - partition_balancer_planner.cc:935 - ntp {kafka/topic-cvusynduwo/18} (size: 1077, orig replicas: {{node_id: 0, shard: 0}, {node_id: 4, shard: 1}, {node_id: 5, shard: 0}}): scheduling replica move 0 -> 1, reason: partition_count_rebalancin
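
The pause is visible in the timestamps of the two planner log lines above. A small sketch of measuring that gap follows; it assumes the `LEVEL DATE TIME,millis` prefix seen in these lines, and the helper names are made up for illustration.

```python
from datetime import datetime


def log_timestamp(line):
    """Parse the 'LEVEL YYYY-MM-DD HH:MM:SS,mmm' prefix of a log line."""
    parts = line.split()
    return datetime.strptime(f"{parts[1]} {parts[2]}", "%Y-%m-%d %H:%M:%S,%f")


def gap_seconds(first, second):
    """Seconds elapsed between two log lines."""
    return (log_timestamp(second) - log_timestamp(first)).total_seconds()


# Prefixes of the two log lines quoted above; the message bodies are omitted.
before = "INFO  2023-09-06 19:46:46,033 [shard 0:main] cluster - partition_balancer_planner.cc:935"
after = "INFO  2023-09-06 19:46:54,098 [shard 0:main] cluster - partition_balancer_planner.cc:935"

print(f"{gap_seconds(before, after):.3f} s between consecutive planner log lines")
```

For these two lines the gap is about 8 seconds inside a single planning pass, which is consistent with the process having been suspended in between.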

@ztlpn
Contributor

ztlpn commented May 14, 2024

Likely fixed by #18305

ztlpn closed this as completed May 14, 2024