Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix START/STOP SLAVE deadlock caused by slave stats daemon #1377

Open
wants to merge 1 commit into
base: fb-mysql-8.0.32
Choose a base branch
from

Conversation

laurynas-biveinis
Copy link
Contributor

@laurynas-biveinis laurynas-biveinis commented Oct 19, 2023

Fix START/STOP SLAVE deadlock caused by slave stats daemon

Under load, if START SLAVE IO_THREAD and STOP SLAVE execute concurrently, the
following deadlock is possible:

  • Thread 44 is executing STOP SLAVE, has read-locked the channel map, and is waiting for the slave I/O thread to quit.
  • Thread 55 is executing START SLAVE, and is blocked waiting for the channel map write lock.
  • Thread 54 is the SLAVE I/O thread, and is trying to read-lock the channel map lock.

The request of T54 is compatible with the current lock state, but according to POSIX, once a write request is pending, it is up to the implementation whether to satisfy them or block.

For the fix, observe that the starting replica I/O thread only tries to signal the stats thread to start, thus move this code to the START REPLICA command-executing thread instead, which already happens to hold the channel map lock. This also forces to move the stopping of the stats thread from the replica I/O thread to the STOP REPLICA command-executing thread.

This fixes intermittent but often-seen failures on rpl.rpl_multi_source_channel_map_stress under macOS.

Squash with b015dd3

Stacktraces:

Thread 44:

    frame #5: 0x0000000106483108 mysqld`inline_mysql_cond_timedwait(that=0x000000014585d438, mutex=0x000000014585d2f0, abstime=0x000000016db61588, src_file="/Users/laurynas/vilniusdb/fb-mysql/sql/rpl_replica.cc", src_line=2367) at mysql_cond.h:224:16
    frame #6: 0x0000000106460818 mysqld`terminate_slave_thread(thd=0x00000001458fb800, term_lock=0x000000014585d2f0, term_cond=0x000000014585d438, slave_running=0x000000014585d4f4, stop_wait_timeout=0x000000016db61700, need_lock_term=false, force=false) at rpl_replica.cc:2367:9
    frame #7: 0x0000000106460188 mysqld`terminate_slave_threads(mi=0x000000014585ce00, thread_mask=1, stop_wait_timeout=31536000, need_lock_term=false) at rpl_replica.cc:2204:18
    frame #8: 0x000000010645873c mysqld`stop_slave(thd=0x0000000145952800, mi=0x000000014585ce00, net_report=true, for_one_channel=true, push_temp_tables_warning=0x000000016db61a37, invoked_by_raft=false) at rpl_replica.cc:10032:9
    frame #9: 0x00000001064593c4 mysqld`stop_slave_cmd(thd=0x0000000145952800) at rpl_replica.cc:971:13

Thread 55:

    frame #2: 0x000000018e5bf73c libsystem_pthread.dylib`_pthread_rwlock_lock_slow + 720
    frame #3: 0x0000000104c3d2ac mysqld`native_rw_wrlock(rwp=0x00006000039dc0e8) at thr_rwlock.h:101:10
    frame #4: 0x0000000104c3d21c mysqld`inline_mysql_rwlock_wrlock(that=0x00006000039dc0e8, src_file="/Users/laurynas/vilniusdb/fb-mysql/sql/rpl_gtid.h", src_line=473) at mysql_rwlock.h:410:12
    frame #5: 0x0000000104c3c6f0 mysqld`Checkable_rwlock::wrlock(this=0x00006000039dc0e0) at rpl_gtid.h:473:5
    frame #6: 0x00000001054dc6e0 mysqld`Multisource_info::wrlock(this=0x000000010a182980) at rpl_msr.h:441:25
    frame #7: 0x0000000106458bb0 mysqld`start_slave_cmd(thd=0x0000000130008a00) at rpl_replica.cc:832:15

Thread 54:

    frame #2: 0x000000018e5bf73c libsystem_pthread.dylib`_pthread_rwlock_lock_slow + 720
    frame #3: 0x0000000104b62d90 mysqld`native_rw_rdlock(rwp=0x00006000039dc0e8) at thr_rwlock.h:82:10
    frame #4: 0x0000000104b62d00 mysqld`inline_mysql_rwlock_rdlock(that=0x00006000039dc0e8, src_file="/Users/laurynas/vilniusdb/fb-mysql/sql/rpl_gtid.h", src_line=464) at mysql_rwlock.h:340:12
    frame #5: 0x0000000104b62bcc mysqld`Checkable_rwlock::rdlock(this=0x00006000039dc0e0) at rpl_gtid.h:464:5
    frame #6: 0x0000000104b61cec mysqld`Multisource_info::rdlock(this=0x000000010a182980) at rpl_msr.h:415:25
    frame #7: 0x0000000104b618f0 mysqld`start_handle_slave_stats_daemon() at slave_stats_daemon.cc:233:15
    frame #8: 0x00000001064627dc mysqld`handle_slave_io(arg=0x000000014585ce00) at rpl_replica.cc:6050:36

@laurynas-biveinis
Copy link
Contributor Author

@hermanlee , @luqun , I am PR'ing a possible fix but note that I'm not a replication stats thread expert nor do I see any specific MTR tests I could run for it.

@laurynas-biveinis laurynas-biveinis force-pushed the start-stop-replica-deadlockj branch 2 times, most recently from fcbd7f7 to d07cb23 Compare October 24, 2023 08:16
@@ -230,18 +230,16 @@ static void *handle_slave_stats_daemon(void *arg MY_ATTRIBUTE((unused))) {
bool start_handle_slave_stats_daemon() {
DBUG_ENTER("start_handle_slave_stats_daemon");

channel_map.rdlock();
channel_map.assert_some_lock();
Copy link
Contributor

@george-reynya george-reynya Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this working when start_handle_slave_stats_daemon() is called from register_raft_followers()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this working when start_handle_slave_stats_daemon() is called from register_raft_followers()?

@george-reynya , is there any way I can check this with the open source build?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, is the problem that you don't have the raft plugin? Let me check what the options are.

Copy link
Contributor

@george-reynya george-reynya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for delay. Left a question about having the channel map lock in the raft case.

Under load, if START SLAVE IO_THREAD and STOP SLAVE execute concurrently, the
following deadlock is possible:

- Thread 44 is executing STOP SLAVE, has read-locked the channel map, and is
  waiting for the slave I/O thread to quit.
- Thread 55 is executing START SLAVE, and is blocked waiting for the channel map
  write lock.
- Thread 54 is the SLAVE I/O thread, and is trying to read-lock the channel map
  lock.

The request of T54 is compatible with the current lock state, but according to
POSIX, once a write request is pending, it is up to the implementation whether
to satisfy them or block.

For the fix, observe that the starting replica I/O thread only tries to signal
the stats thread to start, thus move this code to the START REPLICA
command-executing thread instead, which already happens to hold the channel map
lock. This also forces to move the stopping of the stats thread from the replica
I/O thread to the STOP REPLICA command-executing thread.

This fixes intermittent but often-seen failures on
rpl.rpl_multi_source_channel_map_stress under macOS.

Squash with b015dd3

Stacktraces:

Thread 44:

    frame facebook#5: 0x0000000106483108 mysqld`inline_mysql_cond_timedwait(that=0x000000014585d438, mutex=0x000000014585d2f0, abstime=0x000000016db61588, src_file="/Users/laurynas/vilniusdb/fb-mysql/sql/rpl_replica.cc", src_line=2367) at mysql_cond.h:224:16
    frame facebook#6: 0x0000000106460818 mysqld`terminate_slave_thread(thd=0x00000001458fb800, term_lock=0x000000014585d2f0, term_cond=0x000000014585d438, slave_running=0x000000014585d4f4, stop_wait_timeout=0x000000016db61700, need_lock_term=false, force=false) at rpl_replica.cc:2367:9
    frame facebook#7: 0x0000000106460188 mysqld`terminate_slave_threads(mi=0x000000014585ce00, thread_mask=1, stop_wait_timeout=31536000, need_lock_term=false) at rpl_replica.cc:2204:18
    frame facebook#8: 0x000000010645873c mysqld`stop_slave(thd=0x0000000145952800, mi=0x000000014585ce00, net_report=true, for_one_channel=true, push_temp_tables_warning=0x000000016db61a37, invoked_by_raft=false) at rpl_replica.cc:10032:9
    frame facebook#9: 0x00000001064593c4 mysqld`stop_slave_cmd(thd=0x0000000145952800) at rpl_replica.cc:971:13

Thread 55:

    frame facebook#2: 0x000000018e5bf73c libsystem_pthread.dylib`_pthread_rwlock_lock_slow + 720
    frame facebook#3: 0x0000000104c3d2ac mysqld`native_rw_wrlock(rwp=0x00006000039dc0e8) at thr_rwlock.h:101:10
    frame facebook#4: 0x0000000104c3d21c mysqld`inline_mysql_rwlock_wrlock(that=0x00006000039dc0e8, src_file="/Users/laurynas/vilniusdb/fb-mysql/sql/rpl_gtid.h", src_line=473) at mysql_rwlock.h:410:12
    frame facebook#5: 0x0000000104c3c6f0 mysqld`Checkable_rwlock::wrlock(this=0x00006000039dc0e0) at rpl_gtid.h:473:5
    frame facebook#6: 0x00000001054dc6e0 mysqld`Multisource_info::wrlock(this=0x000000010a182980) at rpl_msr.h:441:25
    frame facebook#7: 0x0000000106458bb0 mysqld`start_slave_cmd(thd=0x0000000130008a00) at rpl_replica.cc:832:15

Thread 54:

    frame facebook#2: 0x000000018e5bf73c libsystem_pthread.dylib`_pthread_rwlock_lock_slow + 720
    frame facebook#3: 0x0000000104b62d90 mysqld`native_rw_rdlock(rwp=0x00006000039dc0e8) at thr_rwlock.h:82:10
    frame facebook#4: 0x0000000104b62d00 mysqld`inline_mysql_rwlock_rdlock(that=0x00006000039dc0e8, src_file="/Users/laurynas/vilniusdb/fb-mysql/sql/rpl_gtid.h", src_line=464) at mysql_rwlock.h:340:12
    frame facebook#5: 0x0000000104b62bcc mysqld`Checkable_rwlock::rdlock(this=0x00006000039dc0e0) at rpl_gtid.h:464:5
    frame facebook#6: 0x0000000104b61cec mysqld`Multisource_info::rdlock(this=0x000000010a182980) at rpl_msr.h:415:25
    frame facebook#7: 0x0000000104b618f0 mysqld`start_handle_slave_stats_daemon() at slave_stats_daemon.cc:233:15
    frame facebook#8: 0x00000001064627dc mysqld`handle_slave_io(arg=0x000000014585ce00) at rpl_replica.cc:6050:36
@laurynas-biveinis laurynas-biveinis changed the base branch from fb-mysql-8.0.28 to fb-mysql-8.0.32 November 28, 2023 13:18
@laurynas-biveinis
Copy link
Contributor Author

Rebased on 8.0.32. "This branch has conflicts that must be resolved" looks bogus

@facebook-github-bot
Copy link

@george-reynya has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants