mlx5: invalidate cq->cur_rsc when QP is destroyed inside a polling batch #1526

Merged (1 commit) on Dec 2, 2024

FujiZ
Contributor

@FujiZ FujiZ commented Nov 27, 2024

When using the cq_ex interface, if the user destroys the QP associated with the current work completion, the next ibv_next_poll() call triggers a use-after-free, because get_req_context() accesses the already-destroyed QP through cq->cur_rsc.

Fix this by resetting cq->cur_rsc in __mlx5_cq_clean() if it is associated with the QP being destroyed.
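
As a rough illustration of the fix described above, here is a simplified, self-contained model. It is not the actual providers/mlx5/cq.c code: the `model_resource` and `model_cq` types, the `rsn` field used to identify a QP, and the `model_cq_clean()` helper are hypothetical stand-ins. Only the `cur_rsc` name and the idea of clearing it when it refers to the QP being destroyed come from the description.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a provider resource such as a QP. */
struct model_resource {
	uint32_t rsn; /* assumed identifier, e.g. the QP number */
};

/* Hypothetical stand-in for the CQ; only the cached pointer is modeled. */
struct model_cq {
	struct model_resource *cur_rsc; /* resource of the last parsed completion */
};

/*
 * Model of the cleanup step that runs when the QP identified by rsn is
 * being destroyed: drop the cached pointer if it refers to that QP, so
 * a later poll cannot dereference freed memory through it.
 */
static void model_cq_clean(struct model_cq *cq, uint32_t rsn)
{
	if (cq->cur_rsc && cq->cur_rsc->rsn == rsn)
		cq->cur_rsc = NULL;
}
```

In this model, cleaning a QP other than the cached one leaves `cur_rsc` untouched, so the cache keeps working for the remaining QPs.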

@FujiZ FujiZ force-pushed the fz/fix-cq-cur-rsc branch 3 times, most recently from 1110f71 to c48b366, November 27, 2024 03:47
providers/mlx5/cq.c (outdated)
* Reset the cq->cur_rsc if it is associated with the QP to be
* destroyed in order to prevent use-after-free errors in the
* next ibv_next_poll().
*/
Member

What scenario are you referring to here? A CQ that serves more than a single QP?

In addition, if the 'lock' mode is used, this code runs only after mlx5_end_poll(), so the next mlx5_start_poll() should do the work by setting the pointer to NULL on entry.

Note:
The code should not protect against incorrect application behavior (e.g., destroying a QP while still polling for its completions), especially in areas that might impact the data path.

Contributor Author

@FujiZ FujiZ Nov 27, 2024

What scenario are you referring to here? A CQ that serves more than a single QP?

Yes, in this case multiple QPs are attached to the same CQ.
The code sequence that reproduces the problem looks like this:

ibv_start_poll(); // get a work completion associated with the QP A
ibv_destroy_qp(); // destroy QP A here since we get a completion error
ibv_next_poll(); // try to get the next work completion for other QPs from the same CQ. UAF error is triggered here
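
The sequence above can be traced with a small self-contained model. This is a sketch under stated assumptions, not the mlx5 provider code: `model_qp`, `model_cq`, `model_resolve()`, `model_create_qp()` and `model_destroy_qp()` are hypothetical names, and the lookup table indexed by `rsn` is an assumed stand-in for however the provider maps a completion to its QP. It shows where a cached pointer would be dereferenced on the next poll and how clearing it at destroy time avoids touching freed memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_MAX_QPS 4

/* Hypothetical QP: only the identifier is modeled. */
struct model_qp {
	uint32_t rsn;
};

/* Hypothetical CQ shared by several QPs. */
struct model_cq {
	struct model_qp *table[MODEL_MAX_QPS]; /* live QPs indexed by rsn */
	struct model_qp *cur_rsc;              /* QP resolved for the last completion */
};

/*
 * Resolve the QP for a completion carrying rsn, reusing the cached
 * pointer when it matches. Reading cur_rsc->rsn is the access that
 * becomes a use-after-free if cur_rsc points to a freed QP.
 */
static struct model_qp *model_resolve(struct model_cq *cq, uint32_t rsn)
{
	if (!cq->cur_rsc || cq->cur_rsc->rsn != rsn)
		cq->cur_rsc = rsn < MODEL_MAX_QPS ? cq->table[rsn] : NULL;
	return cq->cur_rsc;
}

/* Allocate a QP and register it with the CQ; returns NULL on failure. */
static struct model_qp *model_create_qp(struct model_cq *cq, uint32_t rsn)
{
	struct model_qp *qp;

	if (rsn >= MODEL_MAX_QPS)
		return NULL;
	qp = malloc(sizeof(*qp));
	if (!qp)
		return NULL;
	qp->rsn = rsn;
	cq->table[rsn] = qp;
	return qp;
}

/* Unregister and free a QP, invalidating the cache if it points to it. */
static void model_destroy_qp(struct model_cq *cq, struct model_qp *qp)
{
	cq->table[qp->rsn] = NULL;
	if (cq->cur_rsc == qp)
		cq->cur_rsc = NULL; /* the invalidation step under discussion */
	free(qp);
}
```

In this model, resolving a completion for QP A, destroying QP A, and then resolving a completion for QP B goes through the table lookup instead of comparing against the freed QP A.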

destroying a QP while still polling for its completions

Do you mean that destroying a QP between ibv_start_poll() and ibv_end_poll() is not permitted? However, I haven't found any manual that describes this behaviour.

Member

It's permitted, but it doesn't look like good practice. A few notes below.

You are referring only to a single-threaded application, correct? In a multi-threaded application, the call to ibv_destroy_qp() will remain blocked until ibv_end_poll() is invoked, ensuring that ibv_next_poll() is safe to use.
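
The locking argument in the previous paragraph can be sketched with a minimal model. Everything here is an assumption made for illustration: `model_locked_cq`, `model_start_poll()`, `model_end_poll()` and `model_try_clean()` are hypothetical names, a pthread mutex stands in for whatever lock the provider uses, and a non-blocking trylock replaces the blocking acquire so the effect is observable from a single thread.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical CQ in 'lock' mode: only the lock is modeled. */
struct model_locked_cq {
	pthread_mutex_t lock;
};

/* The polling batch takes the CQ lock when it starts ... */
static void model_start_poll(struct model_locked_cq *cq)
{
	pthread_mutex_lock(&cq->lock);
}

/* ... and releases it only when the batch ends. */
static void model_end_poll(struct model_locked_cq *cq)
{
	pthread_mutex_unlock(&cq->lock);
}

/*
 * Model of the cleanup done on QP destruction. A real implementation
 * would block on the lock; this sketch uses trylock and returns EBUSY
 * while a polling batch is in progress, 0 once cleanup could run.
 */
static int model_try_clean(struct model_locked_cq *cq)
{
	int rc = pthread_mutex_trylock(&cq->lock);

	if (rc != 0)
		return rc;
	/* Per-QP cleanup would run here, outside any polling batch. */
	pthread_mutex_unlock(&cq->lock);
	return 0;
}
```

Under these assumptions, cleanup can never interleave with a polling batch, which is why the issue in this thread is specific to CQs that do not take the lock.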

Additionally, we are only discussing a scenario where the CQ serves multiple QPs. Otherwise, it would not make sense to destroy a QP and continue polling its CQ, as this would clearly indicate incorrect application behavior.

The commit log should be rephrased to clarify the exact use case that we are talking about.

So, in the specific scenario that you are talking about, it can make sense to set the pointer to NULL in __mlx5_cq_clean(), while narrowing the comment near this line to be more specific, as mentioned above.

In any case, no need for a change in mlx5_end_poll() as I already mentioned.

Contributor Author

@FujiZ FujiZ Nov 28, 2024

You are referring only to a single-threaded application, correct? In a multi-threaded application

Yes, the CQ is configured in single-threaded mode by specifying IBV_CREATE_CQ_ATTR_SINGLE_THREADED.

I have updated the comment to describe the use case that we have discussed so far.

@FujiZ FujiZ force-pushed the fz/fix-cq-cur-rsc branch 2 times, most recently from 56c07f5 to cdeef23, November 28, 2024 08:06
For CQ created in single threaded mode serving multiple QPs, if
the user destroys a QP between ibv_start_poll() and ibv_end_poll(),
then cq->cur_rsc should be invalidated since it may point to the QP
that is being destroyed, which may cause UAF error in the next
ibv_next_poll() call.

Signed-off-by: ZHOU Huaping <zhouhuaping.san@bytedance.com>
@FujiZ FujiZ force-pushed the fz/fix-cq-cur-rsc branch from cdeef23 to 8452205, November 28, 2024 08:29
@FujiZ FujiZ changed the title from "mlx5: fix a use-after-free error in mlx5_next_poll" to "mlx5: invalidate cq->cur_rsc when QP is destroyed inside a polling batch" Nov 28, 2024
@FujiZ FujiZ requested a review from yishaih November 28, 2024 08:32
@yishaih yishaih merged commit b0da1d5 into linux-rdma:master Dec 2, 2024
14 checks passed
@FujiZ FujiZ deleted the fz/fix-cq-cur-rsc branch December 9, 2024 06:25