Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/verbs: Data race in vrb_open_ep function (Libfabric v1.22.0) #10569

Open
piotrchmiel opened this issue Nov 21, 2024 · 1 comment
Open

prov/verbs: Data race in vrb_open_ep function (Libfabric v1.22.0) #10569

piotrchmiel opened this issue Nov 21, 2024 · 1 comment
Labels

Comments

@piotrchmiel
Copy link
Contributor

Describe the bug
A data race occurs in the vrb_open_ep function in the verbs provider when creating multiple endpoints using multiple threads. The race is detected by Thread Sanitizer and involves concurrent modifications of the global variable vrb_ep_ops. The issue causes unpredictable behavior when using the FI_THREAD_SAFE threading model, which should guarantee safe concurrent operations.

To Reproduce
Steps to reproduce the behavior:

  1. Set up an environment with Libfabric 1.22.0, Ubuntu 22.04, and the verbs provider.
  2. Use the FI_THREAD_SAFE threading mode.
  3. Create multiple endpoints (fi_endpoint) from multiple threads simultaneously.
  4. Observe the data race using Thread Sanitizer.

Expected behavior
The FI_THREAD_SAFE threading mode should ensure thread safety, allowing multiple threads to create endpoints concurrently without encountering data races.

Output
Thread Sanitizer detects the following data race:

WARNING: ThreadSanitizer: data race (pid=76726)
  Write of size 8 at 0x7ff67d69a3b8 by thread T2:
    #0 vrb_open_ep /path/to/libfabric/prov/verbs/src/verbs_ep.c:1397:31 (libfabric.so.1+0x100f27)
    #1 fi_endpoint(fid_domain*, fi_info*, fid_ep**, void*) /path/to/libfabric/include/rdma/fi_endpoint.h:187:9 (application+0x6658b0)

 Previous write of size 8 at 0x7ff67d69a3b8 by thread T4:
    #0 vrb_open_ep /path/to/libfabric/prov/verbs/src/verbs_ep.c:1397:31 (libfabric.so.1+0x100f27)
    #1 fi_endpoint(fid_domain*, fi_info*, fid_ep**, void*) /path/to/libfabric/include/rdma/fi_endpoint.h:187:9 (application+0x6658b0)

The data race occurs at: https://github.com/ofiwg/libfabric/blob/v1.22.0/prov/verbs/src/verbs_ep.c#L1397

This line modifies the global variable vrb_ep_ops:

(*ep_fid)->fid.ops->ops_open = vrb_ep_ops_open;

The global variable vrb_ep_ops is defined here: https://github.com/ofiwg/libfabric/blob/v1.22.0/prov/verbs/src/verbs_ep.c#L1159

The issue was introduced in commit: da62d0f

Environment:

  • OS: Ubuntu 22.04
  • Provider: Verbs
  • Endpoint Type: Message Endpoint (FI_EP_MSG)
  • Libfabric Version: 1.22.0
  • Threading Mode: FI_THREAD_SAFE

Additional context
The issue appears to arise because vrb_ep_ops is a global variable shared across threads, and the modifications are not protected by a mutex or any thread-synchronization mechanism.

This breaks the FI_THREAD_SAFE threading model, where thread safety is expected when using multiple threads concurrently.

Potential Fix: Synchronize access to vrb_ep_ops using a mutex or move to a per-instance structure to avoid shared mutable state.

@sydidelot
Copy link
Member

@piotrchmiel Thanks for the bug report! I guess this issue can easily be fixed with the following patch:

diff --git a/prov/verbs/src/verbs_ep.c b/prov/verbs/src/verbs_ep.c
index 63aea8277..ebd7646d9 100644
--- a/prov/verbs/src/verbs_ep.c
+++ b/prov/verbs/src/verbs_ep.c
@@ -1161,7 +1161,7 @@ static struct fi_ops vrb_ep_ops = {
        .close = vrb_ep_close,
        .bind = vrb_ep_bind,
        .control = vrb_ep_control,
-       .ops_open = fi_no_ops_open,
+       .ops_open = vrb_ep_ops_open,
 };
 
 static struct fi_ops_cm vrb_dgram_cm_ops = {
@@ -1394,7 +1394,6 @@ int vrb_open_ep(struct fid_domain *domain, struct fi_info *info,
        *ep_fid = &ep->util_ep.ep_fid;
        ep->util_ep.ep_fid.fid.ops = &vrb_ep_ops;
        ep->util_ep.ep_fid.ops = &vrb_ep_base_ops;
-       (*ep_fid)->fid.ops->ops_open = vrb_ep_ops_open;
 
        vrb_prof_func_end("vrb_open_ep");
 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants