[BUG] redis-plus-plus core dump / crash at AsyncRedisCluster reset #578

Open
jzkiss opened this issue Jul 5, 2024 · 1 comment

jzkiss commented Jul 5, 2024

Describe the bug
Resetting AsyncRedisCluster causes a core dump if one of the Redis masters was killed beforehand.

To Reproduce
[1.] The async client is defined and used in the following way:

::std::shared_ptr<::sw::redis::AsyncRedisCluster> m_redis_cluster;
m_redis_cluster.reset(new ::sw::redis::AsyncRedisCluster(opts, pool_opts, ::sw::redis::Role::MASTER));

[2.] Continuous traffic is generated.

[3.] One Redis master exits (kill -9 redis-server-pid, or a Kubernetes rolling upgrade of the Redis pods).

[4.] The user code built on redis-plus-plus detects that, for 4 seconds, no response has arrived for the requests directed to the unreachable Redis master (based on hash slot).

[5.] The user code resets AsyncRedisCluster with the IP address and port of a reachable Redis master:
m_redis_cluster.reset(new ::sw::redis::AsyncRedisCluster(opts, pool_opts, ::sw::redis::Role::MASTER));

[6.] About 0.7 s after the reset (restart: 14:59:39.710346104Z, core dump: 14:59:40.398587602Z) a core dump is detected:

[New LWP 1407]
[New LWP 1486]
[New LWP 1484]
[New LWP 1483]
[New LWP 1485]
[New LWP 1405]
[New LWP 1404]
[New LWP 1400]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `'.
Program terminated with signal SIGABRT, Aborted.
#0 0x0000000009625acf in raise () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x122c1700 (LWP 1407))]
...
(gdb) bt full
#0 0x0000000009625acf in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00000000095f8ea5 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x0000000007e7d96a in uv_async_send.cold () from /lib64/libuv.so.1
No symbol table info available.
#3 0x0000000007c31856 in sw::redis::AsyncConnection::send (this=0xe82ee20, event=std::unique_ptr<sw::redis::AsyncEvent> = {...})
at /usr/include/c++/8/bits/shared_ptr_base.h:251
No locals.
#4 0x0000000007c42d96 in sw::redis::AsyncShardsPool::_redeliver_events (this=0xfdbbf90,
events=std::queue wrapping: std::deque with 6 elements = {...}) at /usr/include/c++/8/bits/move.h:74
async_event =
pool = std::shared_ptr<sw::redis::AsyncConnectionPool> (use count 3, weak count 1) = {get() = 0xfe40990}
connection = {_pool = std::shared_ptr<sw::redis::AsyncConnectionPool> (use count 3, weak count 1) = {get() = 0xfe40990},
_connection = std::shared_ptr<sw::redis::AsyncConnection> (use count 3, weak count 1) = {get() = 0xe82ee20}}
event =
should_stop_worker = false
#5 0x0000000007c44530 in sw::redis::AsyncShardsPool::_run (this=0xfdbbf90)
at /.../redis++/rpm/BUILD/src/sw/redis++/async_shards_pool.cpp:191
events = std::queue wrapping: std::deque with 6 elements = {{key = "USER_KEY_297",
event = std::unique_ptr<sw::redis::AsyncEvent> = {get() = 0x0}}, {key = "USER_KEY_302",
event = std::unique_ptr<sw::redis::AsyncEvent> = {get() = 0xe8b2d30}}, {
key = "USER_KEY_303", event = std::unique_ptr<sw::redis::AsyncEvent> = {
get() = 0xe8a38a0}}, {key = "USER_KEY_522",
event = std::unique_ptr<sw::redis::AsyncEvent> = {get() = 0xe8d7a60}}, {
key = "USER_KEY_306", event = std::unique_ptr<sw::redis::AsyncEvent> = {
get() = 0xe6d7e20}}, {key = "", event = std::unique_ptr<sw::redis::AsyncEvent> = {get() = 0xe6c8dc0}}}
#6 0x0000000008d6ab23 in execute_native_thread_routine () from /lib64/libstdc++.so.6
No symbol table info available.
#7 0x00000000083591ca in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#8 0x0000000009610e73 in clone () from /lib64/libc.so.6
No symbol table info available.
(gdb) Quit

USER_KEY_297, ..., USER_KEY_306 are anonymized keys, but all of them belong to the slot range of the killed Redis master.
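
For readers who want to reproduce steps 4 and 5, the timeout detection can be sketched roughly as below. This is only an illustration, not the actual application code: `ResponseWatchdog` and all of its members are hypothetical names, and the callback is where the `m_redis_cluster.reset(new ...)` call from step 5 would go. Time points are passed in explicitly so the logic can be exercised without waiting.

```cpp
#include <chrono>
#include <functional>
#include <utility>

// Hypothetical helper (not part of redis-plus-plus): fires a callback when
// requests are pending and no response has arrived within `timeout`.
class ResponseWatchdog {
public:
    using Clock = std::chrono::steady_clock;

    ResponseWatchdog(std::chrono::milliseconds timeout, std::function<void()> on_timeout)
        : _timeout(timeout), _on_timeout(std::move(on_timeout)) {}

    // Call when a request is sent. The silence window starts with the
    // first request that has no other request pending before it.
    void on_request(Clock::time_point now) {
        if (_pending == 0) {
            _window_start = now;
        }
        ++_pending;
    }

    // Call when any response arrives; a response restarts the window.
    void on_response(Clock::time_point now) {
        if (_pending > 0) {
            --_pending;
        }
        _window_start = now;
    }

    // Call periodically. Returns true if the callback was fired.
    bool check(Clock::time_point now) {
        if (_pending == 0 || now - _window_start < _timeout) {
            return false;
        }
        _pending = 0;   // forget the old requests so the callback fires once
        _on_timeout();  // assumed place for the AsyncRedisCluster reset
        return true;
    }

private:
    std::chrono::milliseconds _timeout;
    std::function<void()> _on_timeout;
    Clock::time_point _window_start{};
    int _pending = 0;
};
```

In the scenario above the watchdog would be constructed with a 4 second timeout, and `check` would be called from a timer in the application.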

Expected behavior
No crash; traffic should stabilize.

Environment:
OS: Rocky Linux 8.2-20.el8.0.1
Compiler: gcc version 8.5.0
hiredis version: hiredis 1.2.0
redis-plus-plus version: 1.3.12

Additional context
A Redis cluster with 3 masters and 3 slaves is used.

jzkiss commented Dec 5, 2024

Hello,

Update:
The problem is solved by patching redis-plus-plus 1.3.12 in the following way (the test logs were added only for debugging):

void EventLoop::_notify() {
    // assert(_event_async);
    std::cout << "ctrlog redispp 1.3.12: In EventLoop::_notify()\n";
    if (!_event_async) {
        std::cout << "ctrlog redispp 1.3.12: In EventLoop::_notify():NULL _event_async, ASSERT WOULD CRASH THE APPLICATION!!!\n";
        return;
    }

    uv_async_send(_event_async.get());
}
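
The idea behind the patch, a runtime null check in place of the assert, can be tried in isolation with the sketch below. It does not use libuv or redis-plus-plus: `FakeAsyncHandle` and `FakeEventLoop` are hypothetical stand-ins, and incrementing a counter takes the place of the real `uv_async_send` call.

```cpp
#include <memory>

// Hypothetical stand-in for the libuv async handle; not a real libuv type.
struct FakeAsyncHandle {
    int sends = 0;
};

// Hypothetical stand-in for the event loop that owns the handle.
class FakeEventLoop {
public:
    void start() { _event_async.reset(new FakeAsyncHandle()); }

    void stop() { _event_async.reset(); }

    // Returns false when the handle is already gone, instead of asserting.
    bool notify() {
        if (!_event_async) {
            return false;  // loop was stopped; nothing to wake up
        }
        ++_event_async->sends;  // stand-in for uv_async_send(_event_async.get())
        return true;
    }

    int sends() const { return _event_async ? _event_async->sends : 0; }

private:
    std::unique_ptr<FakeAsyncHandle> _event_async;
};
```

Returning a status instead of silently returning is a choice made for this sketch only; it lets a caller tell that the notification was dropped.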

Unfortunately this is not enough for redis-plus-plus 1.3.13, because the following assertion is sometimes also hit:
ctrlog /redis-plus-plus-1.3.13/src/sw/redis++/async_connection.cpp:608: sw::redis::GuardedAsyncConnection::GuardedAsyncConnection(const AsyncConnectionPoolSPtr&): Assertion `!_connection->broken()' failed.

In my case this is line 608:

GuardedAsyncConnection::GuardedAsyncConnection(const AsyncConnectionPoolSPtr &pool) :
    _pool(pool), _connection(_pool->fetch()) {
    std::cout << "ctrlog redispp 1.3.13: In GuardedAsyncConnection::GuardedAsyncConnection() entered\n";
    assert(!_connection->broken());  // line 608
}
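
One possible direction, shown purely as an illustration and not as how redis-plus-plus works or should be fixed, is to check the broken state at runtime when fetching instead of asserting on it. `FakeConnection`, `FakePool` and `fetch_healthy` below are hypothetical names that only mimic the `broken()` check from the assertion.

```cpp
#include <cstddef>
#include <deque>
#include <memory>
#include <utility>

// Hypothetical stand-in for a connection with a broken() state.
struct FakeConnection {
    bool is_broken = false;
    bool broken() const { return is_broken; }
};

// Hypothetical pool that skips broken connections when handing one out.
class FakePool {
public:
    void add(std::shared_ptr<FakeConnection> connection) {
        _idle.push_back(std::move(connection));
    }

    // Returns the first healthy idle connection and discards any broken
    // ones found before it. Returns nullptr if no healthy one is left.
    std::shared_ptr<FakeConnection> fetch_healthy() {
        while (!_idle.empty()) {
            auto connection = std::move(_idle.front());
            _idle.pop_front();
            if (!connection->broken()) {
                return connection;
            }
        }
        return nullptr;
    }

    std::size_t idle() const { return _idle.size(); }

private:
    std::deque<std::shared_ptr<FakeConnection>> _idle;
};
```

A caller of such a pool would have to handle the nullptr case, for example by creating a new connection or failing the request.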

Br, Jozsef
