Potential crash when deleting a replicated backend #6555

Open
progier389 opened this issue Jan 28, 2025 · 2 comments
Labels: needs triage (The issue will be triaged during scrum)

progier389 (Contributor) commented Jan 28, 2025

Issue Description
When removing a replicated backend (and probably also when disabling replication), a crash may occur because the connection cleanup code attempts to use a freed replica:

#0  ___pthread_mutex_lock (mutex=mutex@entry=0x7373616c4374636d) at pthread_mutex_lock.c:80
#1  0x00007fba78007796 in PR_EnterMonitor (mon=0x7373616c43746365) at ../../../../nspr/pr/src/pthreads/ptsynch.c:563
#2  0x00007fba73c305c7 in replica_lock (lock=<optimized out>) at ldap/servers/plugins/replication/repl5_replica.c:109
#3  replica_relinquish_exclusive_access (r=0x7fb707089140, connid=0, opid=-1) at ldap/servers/plugins/replication/repl5_replica.c:676
#4  0x00007fba73c16475 in consumer_connection_extension_destructor (ext=<optimized out>, object=<optimized out>, parent=<optimized out>) at ldap/servers/plugins/replication/repl_connext.c:91
#5  0x00007fba789e0ad3 in factory_destroy_extension (type=<optimized out>, object=0x7fb710e01a90, parent=0x0, extension=0x7fb710e01bc8) at ldap/servers/slapd/factory.c:366
#6  factory_destroy_extension (type=<optimized out>, object=0x7fb710e01a90, parent=0x0, extension=0x7fb710e01bc8) at ldap/servers/slapd/factory.c:348
#7  0x0000555bc877d76c in connection_cleanup (conn=0x7fb710e01a90) at ldap/servers/slapd/connection.c:181
#8  0x0000555bc879117c in connection_table_move_connection_out_of_active_list.isra.0 (ct=ct@entry=0x7fba743977c0, c=c@entry=0x7fb710e01a90) at ldap/servers/slapd/conntable.c:470
#9  0x0000555bc87827db in setup_pr_read_pds (ct=0x7fba743977c0, listnum=<optimized out>) at ldap/servers/slapd/daemon.c:1596
#10 ct_list_thread (threadnum=<optimized out>) at ldap/servers/slapd/daemon.c:1356
#11 0x00007fba7800e3b7 in _pt_root (arg=0x7fba7722bbc0) at ../../../../nspr/pr/src/pthreads/ptthread.c:191
#12 0x00007fba78797057 in start_thread (arg=<optimized out>) at pthread_create.c:448
#13 0x00007fba7881af4c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

This happens in CI with the recently added test test_multi_subsuffix_replication.

Package Version and Platform:

  • Platform: Fedora
  • Package and version: main
  • Browser: N/A

Steps to Reproduce
Steps to reproduce the behavior:

  1. Look at the CI results for the replication_acceptance tests
  2. If some tests (typically test_new_suffix) end with status ERROR, download the pytest-replication-acceptance_test.py artifacts,
     unzip them, and check whether there are cores in the assets/cores directory

Expected results
No crash should occur

Additional context
This crash is also described in #6531

progier389 added the needs triage label on Jan 28, 2025
progier389 added a commit that referenced this issue Jan 30, 2025
Problem: connection cleanup may try to access a freed replica
Solution: Ensure that the replica still exists before using its pointer

Issue: #6555

Reviewed by: @tbordaz, @droideck (Thanks!)
tbordaz (Contributor) commented Feb 5, 2025

On the main branch, the patch triggers (not systematically) an infinite loop at shutdown:

Thread 1 (Thread 0x7f20c99b4700 (LWP 3080510) "ns-slapd"):
#0  __pthread_rwlock_rdlock_full64 (rwlock=0x7f20c66a7400, clockid=0, abstime=0x0) at /usr/src/debug/glibc-2.39-33.fc40.x86_64/nptl/pthread_rwlock_common.c:506
#1  ___pthread_rwlock_rdlock (rwlock=0x7f20c66a7400) at pthread_rwlock_rdlock.c:26
#2  0x00007f20cab3d8ee in slapi_rwlock_rdlock (rwlock=<optimized out>) at ldap/servers/slapd/slapi2runtime.c:288
#3  0x00007f20c5fd2634 in replica_check_validity (replica=<optimized out>) at ldap/servers/plugins/replication/repl5_replica_hash.c:213
#4  0x00007f20c5fab461 in consumer_connection_extension_destructor (ext=<optimized out>, object=<optimized out>, parent=<optimized out>) at ldap/servers/plugins/replication/repl_connext.c:64
#5  0x00007f20caad7b33 in factory_destroy_extension (type=, object=0x7f1bbf000b70, parent=0x0, extension=0x7f1bbf000ca8) at ldap/servers/slapd/factory.c:366
#6  factory_destroy_extension (type=, object=0x7f1bbf000b70, parent=0x0, extension=0x7f1bbf000ca8) at ldap/servers/slapd/factory.c:348
#7  0x000056051977671c in connection_cleanup (conn=conn@entry=0x7f1bbf000b70) at ldap/servers/slapd/connection.c:181
#8  0x000056051977ac43 in connection_done (conn=0x7f1bbf000b70) at ldap/servers/slapd/connection.c:148
#9  connection_table_free (ct=<optimized out>) at ldap/servers/slapd/conntable.c:229
#10 slapd_daemon (ports=0x7ffec39290a0) at ldap/servers/slapd/daemon.c:1287
#11 0x000056051976b7c5 in main (argc=5, argv=0x7ffec39294f8) at ldap/servers/slapd/main.c:1152

I suspect that replica_check_validity should not be called at shutdown, as s_hash (the list of replicas) looks somehow broken.
Possibly consumer_connection_extension_destructor should just return if g_get_shutdown() is true, for example:

progier389 (Contributor, Author) commented:

Are you sure it is an infinite loop?

It rather looks like a locking issue (deadlock or a deallocated lock?): we are not yet trying to access the hash table in this stack.
And there is no way consumer_connection_extension_destructor could be called in a loop on the same address (because that address is freed).

Now it is possible that replica_destroy_name_hash has been called before the connection closure.

IMHO replica_destroy_name_hash should set s_hash and s_lock to NULL after releasing them,
and replica_check_validity should check that s_lock is not NULL, along these lines:

progier389 added a commit that referenced this issue Feb 6, 2025
…6585)

* Issue 6555 - Potential crash when deleting a replicated backend - hang at shutdown
Prevent the previous 6555 fix from hanging at shutdown.

Issue: #6555

Reviewed by: @tbordaz