Potential crash when deleting a replicated backend #6555

Open
progier389 opened this issue Jan 28, 2025 · 2 comments
Labels: needs triage (The issue will be triaged during scrum)

progier389 (Contributor) commented Jan 28, 2025

Issue Description
When removing a replicated backend (and probably also when disabling replication), a crash may occur because the connection cleanup code attempts to use a freed replica:

#0  ___pthread_mutex_lock (mutex=mutex@entry=0x7373616c4374636d) at pthread_mutex_lock.c:80
#1  0x00007fba78007796 in PR_EnterMonitor (mon=0x7373616c43746365) at ../../../../nspr/pr/src/pthreads/ptsynch.c:563
#2  0x00007fba73c305c7 in replica_lock (lock=<optimized out>) at ldap/servers/plugins/replication/repl5_replica.c:109
#3  replica_relinquish_exclusive_access (r=0x7fb707089140, connid=0, opid=-1) at ldap/servers/plugins/replication/repl5_replica.c:676
#4  0x00007fba73c16475 in consumer_connection_extension_destructor (ext=<optimized out>, object=<optimized out>, parent=<optimized out>) at ldap/servers/plugins/replication/repl_connext.c:91
#5  0x00007fba789e0ad3 in factory_destroy_extension (type=<optimized out>, object=0x7fb710e01a90, parent=0x0, extension=0x7fb710e01bc8) at ldap/servers/slapd/factory.c:366
#6  factory_destroy_extension (type=<optimized out>, object=0x7fb710e01a90, parent=0x0, extension=0x7fb710e01bc8) at ldap/servers/slapd/factory.c:348
#7  0x0000555bc877d76c in connection_cleanup (conn=0x7fb710e01a90) at ldap/servers/slapd/connection.c:181
#8  0x0000555bc879117c in connection_table_move_connection_out_of_active_list.isra.0 (ct=ct@entry=0x7fba743977c0, c=c@entry=0x7fb710e01a90) at ldap/servers/slapd/conntable.c:470
#9  0x0000555bc87827db in setup_pr_read_pds (ct=0x7fba743977c0, listnum=<optimized out>) at ldap/servers/slapd/daemon.c:1596
#10 ct_list_thread (threadnum=<optimized out>) at ldap/servers/slapd/daemon.c:1356
#11 0x00007fba7800e3b7 in _pt_root (arg=0x7fba7722bbc0) at ../../../../nspr/pr/src/pthreads/ptthread.c:191
#12 0x00007fba78797057 in start_thread (arg=<optimized out>) at pthread_create.c:448
#13 0x00007fba7881af4c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

This happens in CI with the recently added test test_multi_subsuffix_replication.

Package Version and Platform:

  • Platform: Fedora
  • Package and version: main
  • Browser: N/A

Steps to Reproduce
Steps to reproduce the behavior:

  1. Look at the CI results for the replication_acceptance tests
  2. If some tests (typically test_new_suffix) end with status ERROR, download the pytest-replication-acceptance_test.py artifacts,
     unzip them, and check whether there are cores in the assets/cores directory

Expected results
No crash should occur

Additional context
This crash is also described in #6531

progier389 added the needs triage label on Jan 28, 2025
progier389 added a commit that referenced this issue Jan 30, 2025
Problem: connection cleanup may try to access a freed replica
Solution: Ensure that the replica still exists before using its pointer

Issue: #6555

Reviewed by: @tbordaz, @droideck (Thanks!)
tbordaz (Contributor) commented Feb 5, 2025

On the main branch, the patch triggers (not systematically) an infinite loop at shutdown:

Thread 1 (Thread 0x7f20c99b4700 (LWP 3080510) "ns-slapd"):
#0  __pthread_rwlock_rdlock_full64 (rwlock=0x7f20c66a7400, clockid=0, abstime=0x0) at /usr/src/debug/glibc-2.39-33.fc40.x86_64/nptl/pthread_rwlock_common.c:506
#1  ___pthread_rwlock_rdlock (rwlock=0x7f20c66a7400) at pthread_rwlock_rdlock.c:26
#2  0x00007f20cab3d8ee in slapi_rwlock_rdlock (rwlock=<optimized out>) at ldap/servers/slapd/slapi2runtime.c:288
#3  0x00007f20c5fd2634 in replica_check_validity (replica=<optimized out>) at ldap/servers/plugins/replication/repl5_replica_hash.c:213
#4  0x00007f20c5fab461 in consumer_connection_extension_destructor (ext=<optimized out>, object=<optimized out>, parent=<optimized out>) at ldap/servers/plugins/replication/repl_connext.c:64
#5  0x00007f20caad7b33 in factory_destroy_extension (type=, object=0x7f1bbf000b70, parent=0x0, extension=0x7f1bbf000ca8) at ldap/servers/slapd/factory.c:366
#6  factory_destroy_extension (type=, object=0x7f1bbf000b70, parent=0x0, extension=0x7f1bbf000ca8) at ldap/servers/slapd/factory.c:348
#7  0x000056051977671c in connection_cleanup (conn=conn@entry=0x7f1bbf000b70) at ldap/servers/slapd/connection.c:181
#8  0x000056051977ac43 in connection_done (conn=0x7f1bbf000b70) at ldap/servers/slapd/connection.c:148
#9  connection_table_free (ct=<optimized out>) at ldap/servers/slapd/conntable.c:229
#10 slapd_daemon (ports=0x7ffec39290a0) at ldap/servers/slapd/daemon.c:1287
#11 0x000056051976b7c5 in main (argc=5, argv=0x7ffec39294f8) at ldap/servers/slapd/main.c:1152

I suspect that replica_check_validity should not be called at shutdown, as s_hash (the list of replicas) looks somehow broken.
Possibly consumer_connection_extension_destructor should just return if g_get_shutdown() is true, for example:

progier389 (Contributor, Author) commented:

Are you sure it is an infinite loop?

It rather looks like a locking issue (deadlock or a deallocated lock?): we are not yet trying to access the hash table in this stack.
And there is no way consumer_connection_extension_destructor could be called in a loop on the same address (because that address is freed).

Now it is possible that replica_destroy_name_hash has been called before the connection closure.

IMHO replica_destroy_name_hash should set s_hash and s_lock to NULL after releasing them,
and replica_check_validity should check that s_lock is not NULL, along these lines:

progier389 added a commit that referenced this issue Feb 6, 2025
…6585)

* Issue 6555 - Potential crash when deleting a replicated backend - hang at shutdown
Prevent the previous 6555 fix from hanging at shutdown.

Issue: #6555

Reviewed by: @tbordaz