Describe the bug
A data race occurs in the vrb_open_ep function in the verbs provider when creating multiple endpoints using multiple threads. The race is detected by Thread Sanitizer and involves concurrent modifications of the global variable vrb_ep_ops. The issue causes unpredictable behavior when using the FI_THREAD_SAFE threading model, which should guarantee safe concurrent operations.
To Reproduce
Steps to reproduce the behavior:
1. Set up an environment with Libfabric 1.22.0 on Ubuntu 22.04 using the verbs provider.
2. Request the FI_THREAD_SAFE threading model.
3. Call fi_endpoint from multiple threads simultaneously to create multiple endpoints.
4. Run the application under Thread Sanitizer and observe the reported data race.
Expected behavior
Under the FI_THREAD_SAFE threading model, multiple threads should be able to create endpoints concurrently without encountering data races.
Output
Thread Sanitizer detects the following data race:
WARNING: ThreadSanitizer: data race (pid=76726)
  Write of size 8 at 0x7ff67d69a3b8 by thread T2:
    #0 vrb_open_ep /path/to/libfabric/prov/verbs/src/verbs_ep.c:1397:31 (libfabric.so.1+0x100f27)
    #1 fi_endpoint(fid_domain*, fi_info*, fid_ep**, void*) /path/to/libfabric/include/rdma/fi_endpoint.h:187:9 (application+0x6658b0)

  Previous write of size 8 at 0x7ff67d69a3b8 by thread T4:
    #0 vrb_open_ep /path/to/libfabric/prov/verbs/src/verbs_ep.c:1397:31 (libfabric.so.1+0x100f27)
    #1 fi_endpoint(fid_domain*, fi_info*, fid_ep**, void*) /path/to/libfabric/include/rdma/fi_endpoint.h:187:9 (application+0x6658b0)
The data race occurs at https://github.com/ofiwg/libfabric/blob/v1.22.0/prov/verbs/src/verbs_ep.c#L1397, a line that modifies the global variable vrb_ep_ops.
The global variable vrb_ep_ops is defined here: https://github.com/ofiwg/libfabric/blob/v1.22.0/prov/verbs/src/verbs_ep.c#L1159
The issue was introduced in commit: da62d0f
Environment:
Libfabric 1.22.0, Ubuntu 22.04, verbs provider
Additional context
The issue appears to arise because vrb_ep_ops is a global variable shared across threads, and the modifications are not protected by a mutex or any thread-synchronization mechanism.
This breaks the FI_THREAD_SAFE threading model, where thread safety is expected when using multiple threads concurrently.
Potential Fix: Synchronize access to vrb_ep_ops using a mutex or move to a per-instance structure to avoid shared mutable state.
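To illustrate the per-instance option, here is a minimal, self-contained sketch. It is not the libfabric source and not necessarily the fix the maintainers would choose. Every name in it (demo_ops, demo_ep, demo_open_ep, demo_run) is hypothetical. It assumes the provider could keep a read-only template of the ops table and copy it into each endpoint before patching an entry, so that no thread writes to shared state.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the real libfabric types. */
struct demo_ep;

struct demo_ops {
    size_t (*size_left)(const struct demo_ep *ep);
};

struct demo_ep {
    struct demo_ops ops;   /* per-instance copy instead of a shared global */
    int large;
};

static size_t size_left_small(const struct demo_ep *ep) { (void)ep; return 16; }
static size_t size_left_large(const struct demo_ep *ep) { (void)ep; return 1024; }

/* Read-only template: never written after initialization, so sharing it
 * between threads cannot race. */
static const struct demo_ops demo_ops_template = {
    .size_left = size_left_small,
};

/* Each caller patches only its own endpoint's copy of the table. */
static void demo_open_ep(struct demo_ep *ep, int large)
{
    assert(ep != NULL);
    ep->ops = demo_ops_template;
    ep->large = large;
    if (large)
        ep->ops.size_left = size_left_large;
}

#define DEMO_THREADS 4

static void *demo_worker(void *arg)
{
    struct demo_ep *ep = arg;
    demo_open_ep(ep, ep->large);
    return NULL;
}

/* Opens DEMO_THREADS endpoints concurrently. Returns the number of endpoints
 * whose callback does not match their configuration, or -1 if a thread could
 * not be created. */
int demo_run(void)
{
    pthread_t tid[DEMO_THREADS];
    struct demo_ep eps[DEMO_THREADS];
    int created = 0, bad = 0, i;

    for (i = 0; i < DEMO_THREADS; i++) {
        eps[i].large = i % 2;   /* set before the thread starts */
        if (pthread_create(&tid[i], NULL, demo_worker, &eps[i]) != 0)
            break;
        created++;
    }
    for (i = 0; i < created; i++)
        pthread_join(tid[i], NULL);
    if (created != DEMO_THREADS)
        return -1;

    for (i = 0; i < DEMO_THREADS; i++) {
        size_t expect = eps[i].large ? 1024 : 16;
        if (eps[i].ops.size_left(&eps[i]) != expect)
            bad++;
    }
    return bad;
}
```

Calling demo_run() from a main function built with -fsanitize=thread -pthread should produce no report for this pattern, because each thread writes only to its own demo_ep. The mutex alternative would remove the write-write race, but every reader of the table would also need the lock, and all endpoints would still share whichever value was written last.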