Memory Leak Detected in TCP Provider with unbound Event Queue after fi_shutdown #10545

Open
piotrchmiel opened this issue Nov 15, 2024 · 1 comment

Comments

@piotrchmiel
Contributor

Describe the bug
AddressSanitizer reports a memory leak when operations are performed on an endpoint that has already been shut down with fi_shutdown. The leak occurs only with the TCP provider in RDM mode, and only when no event queue is bound to the domain (ep->util_ep.eq is unset).

To Reproduce
Steps to reproduce the behavior:

  1. Use the TCP provider in RDM mode.
  2. Create an endpoint without binding an event queue to the domain (ep->util_ep.eq remains unset).
  3. Perform fi_shutdown on the endpoint.
  4. Perform any additional operations on the endpoint after fi_shutdown.

Expected behavior
The memory allocated in xnet_ep_disable (specifically err_entry.err_data = mem_dup(err_data, err_data_size);) should be released on every path, including when no event queue is bound, so that no leak is reported.
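
The ownership rule described above can be sketched in isolation. This is a minimal model under stated assumptions, not the libfabric code: every `sketch_*` name is hypothetical, the one-slot queue stands in for a real event queue, and the assumption is that a failed or skipped queue write leaves the duplicated buffer owned by the caller, who must then free it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; these are NOT the libfabric types. */
struct sketch_err_entry {
    void  *err_data;
    size_t err_data_size;
};

struct sketch_eq {
    struct sketch_err_entry pending; /* one-slot queue, for brevity */
    int has_pending;
};

/* Copy a buffer onto the heap; returns NULL on empty input or failure. */
void *sketch_mem_dup(const void *src, size_t size)
{
    void *dst;

    if (!src || size == 0)
        return NULL;
    dst = malloc(size);
    if (dst)
        memcpy(dst, src, size);
    return dst;
}

/* On success (0) the queue takes ownership of entry->err_data.
 * On failure (-1: no queue bound, or slot full) ownership stays
 * with the caller. */
int sketch_eq_write(struct sketch_eq *eq, const struct sketch_err_entry *entry)
{
    if (!eq || eq->has_pending)
        return -1;
    eq->pending = *entry;
    eq->has_pending = 1;
    return 0;
}

/* Report a shutdown. Returns 1 if the event was queued, 0 if it was
 * dropped. In the dropped case the duplicated payload is freed here,
 * which is the release the report asks for. */
int sketch_report_shutdown(struct sketch_eq *eq, const void *err_data,
                           size_t size)
{
    struct sketch_err_entry entry = {0};

    entry.err_data = sketch_mem_dup(err_data, size);
    if (entry.err_data)
        entry.err_data_size = size;

    if (sketch_eq_write(eq, &entry) != 0) {
        free(entry.err_data); /* nobody else will ever see this copy */
        return 0;
    }
    return 1;
}
```

Calling `sketch_report_shutdown(NULL, &err, sizeof err)` models the unbound-queue case: the copy is made and then released in the same function, so a leak checker has nothing to report. Whether the real fix belongs at the allocation site or in the queue-write path is a question for the maintainers.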

Output
AddressSanitizer's LeakSanitizer reports the following memory leak:

2024-11-14T12:56:28.7675977Z ==73885==ERROR: LeakSanitizer: detected memory leaks
2024-11-14T12:56:28.7676337Z 
2024-11-14T12:56:28.7676557Z Direct leak of 8 byte(s) in 1 object(s) allocated from:
2024-11-14T12:56:28.7677701Z     #0 0x55c2874a72c3 in malloc (/test/test+0x6d52c3) (BuildId: 10d8cef421d2609343e1feb371ea248a68039137)
2024-11-14T12:56:28.7679278Z     #1 0x7f2e491dea68 in mem_dup /test/third_party/libfabric/./include/ofi_mem.h:81:15
2024-11-14T12:56:28.7680923Z     #2 0x7f2e491de493 in xnet_ep_disable /test/third_party/libfabric/prov/tcp/src/xnet_ep.c:458:25
2024-11-14T12:56:28.7682293Z     #3 0x7f2e491d5819 in xnet_req_done /test/third_party/libfabric/prov/tcp/src/xnet_cm.c:209:2
2024-11-14T12:56:28.7683669Z     #4 0x7f2e491f30d5 in xnet_run_ep /test/third_party/libfabric/prov/tcp/src/xnet_progress.c:1468:3
2024-11-14T12:56:28.7685215Z     #5 0x7f2e491ee15a in xnet_handle_events /test/third_party/libfabric/prov/tcp/src/xnet_progress.c:1505:4
2024-11-14T12:56:28.7686681Z     #6 0x7f2e491edf8a in xnet_run_progress /test/third_party/libfabric/prov/tcp/src/xnet_progress.c:1562:3
2024-11-14T12:56:28.7688089Z     #7 0x7f2e491e96c6 in xnet_cq_progress /test/third_party/libfabric/prov/tcp/src/xnet_cq.c:84:2
2024-11-14T12:56:28.7689621Z     #8 0x7f2e49129be0 in ofi_cq_readfrom /test/third_party/libfabric/prov/util/src/util_cq.c:270:2
2024-11-14T12:56:28.7690989Z     #9 0x7f2e491e9d89 in xnet_cq_readfrom /test/third_party/libfabric/prov/tcp/src/xnet_cq.c:50:8
2024-11-14T12:56:28.7692541Z     #10 0x55c287ad9c7f in fi_cq_readfrom(fid_cq*, void*, unsigned long, unsigned long*) /test/third_party/libfabric/include/rdma/fi_eq.h:402:9

Environment:
OS: Ubuntu 22.04
Provider: TCP
Mode: RDM
Libfabric: 1.22.0

Additional context
The memory leak originates from the function xnet_ep_disable at the line:
err_entry.err_data = mem_dup(err_data, err_data_size);
The leak occurs only when no event queue is bound to the domain (ep->util_ep.eq is unset) and operations are performed on the endpoint after it has been shut down with fi_shutdown.

@sydidelot
Member

@piotrchmiel The leaked memory belongs to a FI_SHUTDOWN event that is added to the Event Queue after the endpoint shuts down. I'm not familiar with RDM, but I guess there is a bug where err_entry.err_data is not freed after the EQ event is consumed by RDM.
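
The consumer-side half of that guess can be illustrated with a small model. It is a sketch only: the `demo_*` names are hypothetical and do not correspond to libfabric functions, and it assumes that whoever reads the event out of the queue becomes responsible for freeing its payload.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical event record; NOT a libfabric structure. */
struct demo_event {
    int    err;
    void  *err_data;      /* heap copy attached by the producer */
    size_t err_data_size;
};

/* One-slot stand-in for an event queue. */
struct demo_queue {
    struct demo_event slot;
    int full;
};

/* Move the pending event into *out. Returns 1 if an event was read,
 * 0 if the queue was empty. Ownership of err_data passes to the caller. */
int demo_queue_read(struct demo_queue *q, struct demo_event *out)
{
    if (!q->full)
        return 0;
    *out = q->slot;
    memset(&q->slot, 0, sizeof q->slot);
    q->full = 0;
    return 1;
}

/* Consume one shutdown event. Returns its error code, or 0 if there
 * was nothing to read. The payload is freed once the event has been
 * handled; skipping this free is the kind of bug guessed at above. */
int demo_consume_shutdown(struct demo_queue *q)
{
    struct demo_event ev;
    int err;

    if (!demo_queue_read(q, &ev))
        return 0;
    err = ev.err;
    free(ev.err_data); /* free(NULL) is a no-op, so no guard is needed */
    return err;
}
```

With this shape, every event that leaves the queue has exactly one owner, so the consumer's cleanup is a single `free` at the point where the event is finished with.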
