Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core/logging: Fix race in writing to vrb_ep_ops #10588

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

dsciebu
Copy link

@dsciebu dsciebu commented Nov 28, 2024

'vrb_ep_ops' is a static object which field 'ops_open' gets overriden in 'vrb_open_ep'. If the operation is performed by multiple threads at once, which is totally possible, we end up with a data race. Switching it to thread local fixes it.

Signed-off-by: Dariusz Sciebura dariusz.sciebura@gmail.com

'vrb_ep_ops' is a static object which field 'ops_open' gets
overriden in 'vrb_open_ep'. If the operation is performed by
multiple threads at once, which is totally possible, we end
up with a data race. Switching it to thread local fixes it.

Signed-off-by: Dariusz Sciebura <dariusz.sciebura@gmail.com>
@dsciebu
Copy link
Author

dsciebu commented Nov 28, 2024

There is a bunch of static variables spread all over the place. IMHO they all should be marked as _Thread_local, as, even if they are not overridden somewhere now, they can be in future leading to another race. If you agree, just let know - I can modify my PR. Otherwise, let's stick to this ver.

@piotrchmiel
Copy link
Contributor

Isn't it duplicate of #10571 and #10569 ?

@dsciebu
Copy link
Author

dsciebu commented Nov 28, 2024

Isn't it duplicate of #10571 and #10569 ?

It is in a sense that it generally solves the same issue. However what I am promoting here is a general rule of marking static globals with _Thread_local as probably there are still used concurrently in several places and will probably be used that way in the future. IMO the best solution would be to rethink the architecture and refactor the existing code, but as a quick fix adding thread locality to variables like the one mentioned may help fix a lot of UBs for pretty much 'free'.

@dsciebu
Copy link
Author

dsciebu commented Nov 28, 2024

I accidentally found another spot with an identical bug, this time in xnet_ep.c. The line:
(*ep_fid)->fid.ops->ops_open = xnet_ep_ops_open;
overwrites the global assigned to (*ep_fid)->fid.ops assigned in the following line:
(*ep_fid)->fid.ops = &xnet_ep_fi_ops;
I will push a fix for it in a separate PR, but please treat it as an argument for my idea of marking the globals with thread_local.
Edit: The fix mentioned is #10590.

@dsciebu
Copy link
Author

dsciebu commented Nov 29, 2024

I accidentally found another spot with an identical bug, this time in xnet_ep.c. The line: (*ep_fid)->fid.ops->ops_open = xnet_ep_ops_open; overwrites the global assigned to (*ep_fid)->fid.ops assigned in the following line: (*ep_fid)->fid.ops = &xnet_ep_fi_ops; I will push a fix for it in a separate PR, but please treat it as an argument for my idea of marking the globals with thread_local. Edit: The fix mentioned is #10590.

@shefty - counting on your voice!

@shefty
Copy link
Member

shefty commented Dec 2, 2024

The function pointers here must be set per endpoint that's created, not per thread. If the same thread creates 2 endpoints, the second endpoint should not change the function pointers for the first endpoint. The original change that modified ops_open is what's wrong and needs to be changed.

We should just need this:

static struct fi_ops vrb_ep_ops = {
	.size = sizeof(struct fi_ops),
	.close = vrb_ep_close,
	.bind = vrb_ep_bind,
	.control = vrb_ep_control,
	.ops_open = vrb_ep_ops_open,  <-- update this
};

The xnet change should be similar.

If a provider wants to customize a function pointer for an individual object, they need to dynamically allocate the struct fi_xxx_ops structure, which is almost certainly a bad idea and shouldn't be done.

@shefty
Copy link
Member

shefty commented Dec 2, 2024

Correct fix is #10571.

The static structs referencing function pointers is desirable. Consider an application which allocates 10k endpoints which are nearly identical. Each endpoint has a single pointer to a static struct, where the function pointers are stored. When an app calls fi_send(), fi_write(), etc. those functions are inline wrappers which access the correct underlying function.

The benefit is that the size of each endpoint only increases by 1 pointer, rather than the size of the entire struct. It also allows adding new functions to the end of the struct while maintaining backwards compatibility. That wouldn't be possible if the structs were embedded directly into the user visible object. The cost is that 2 pointers must be dereferenced when making a function call, rather than only 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants