mlx5: invalidate cq->cur_rsc when QP is destroyed inside a polling batch #1526
1110f71 to c48b366
 * Reset the cq->cur_rsc if it is associated with the QP to be
 * destroyed in order to prevent use-after-free errors in the
 * next ibv_next_poll().
 */
What scenario are you referring to here? A CQ that serves more than a single QP?
In addition, in 'lock' mode this code runs only after mlx5_end_poll(), so the next mlx5_start_poll() already does the work by setting the pointer to NULL on entry.
Note:
The code should not protect against incorrect application behavior (e.g., destroying a QP while still polling for its completions), especially in areas that might impact the data path.
> What scenario are you referring to here? A CQ that serves more than a single QP?
Yes, in this case multiple QPs are attached to the same CQ.
The code sequence to reproduce this problem would look like this:
ibv_start_poll(); // get a work completion associated with the QP A
ibv_destroy_qp(); // destroy QP A here since we get a completion error
ibv_next_poll(); // try to get the next work completion for other QPs from the same CQ. UAF error is triggered here
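To make the failure mode in this sequence easier to reason about, here is a minimal toy model in C. It is only a sketch: `toy_rsc`, `toy_cq`, and every `toy_*` function are hypothetical stand-ins, not the mlx5 provider's structures or code, and an `alive` flag replaces a real `free()` so the stale pointer can be detected without actually touching freed memory.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; NOT the real mlx5 structures. */
struct toy_rsc {
    uint32_t rsn;  /* resource (QP) number */
    int alive;     /* 0 models "already freed" without freeing anything */
};

struct toy_cq {
    struct toy_rsc *cur_rsc;  /* resource cached from the previous completion */
};

/* Linear search standing in for the provider's QP-number lookup. */
static struct toy_rsc *toy_find_rsc(struct toy_rsc *tbl, size_t n, uint32_t qpn)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].alive && tbl[i].rsn == qpn)
            return &tbl[i];
    return NULL;
}

/* Start of a polling batch: the cache is cleared on entry. */
static void toy_start_poll(struct toy_cq *cq)
{
    cq->cur_rsc = NULL;
}

/*
 * Resolve the owner of a completion for `qpn`, reusing the cache when the
 * number matches. Returns 0 on success, -1 if the cache points at a destroyed
 * resource (in the real scenario this read would be the use-after-free),
 * and -2 if no live resource has that number.
 */
static int toy_handle_cqe(struct toy_cq *cq, struct toy_rsc *tbl, size_t n,
                          uint32_t qpn)
{
    if (cq->cur_rsc && !cq->cur_rsc->alive)
        return -1;
    if (!cq->cur_rsc || cq->cur_rsc->rsn != qpn)
        cq->cur_rsc = toy_find_rsc(tbl, n, qpn);
    return cq->cur_rsc ? 0 : -2;
}

/* Destroy that leaves the CQ cache untouched (behaviour before the fix). */
static void toy_destroy_qp_unfixed(struct toy_rsc *rsc)
{
    rsc->alive = 0;
}
```

With two entries in the table (QP 1 and QP 2), calling `toy_start_poll()`, handling a completion for QP 1, destroying QP 1 with `toy_destroy_qp_unfixed()`, and then handling a completion for QP 2 makes `toy_handle_cqe()` return -1, which corresponds to the point where the real code would read through the dangling pointer.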
> destroying a QP while still polling for its completions
Do you mean that destroying a QP between ibv_start_poll() and ibv_end_poll() is not permitted? I haven't found any manual that describes this behaviour.
It's permitted, but it doesn't look like good practice. A few notes below.
You are referring only to a single-threaded application, correct? In a multi-threaded application, the call to ibv_destroy_qp() will remain blocked until ibv_end_poll() is invoked, ensuring that ibv_next_poll() is safe to use.
Additionally, we are only discussing a scenario where the CQ serves multiple QPs. Otherwise, it would not make sense to destroy a QP and continue polling its CQ, as this would clearly indicate incorrect application behavior.
The commit log should be rephrased to clarify the exact use case that we are talking about.
So, in the specific scenario that you are talking about, it can make sense to set the pointer to NULL in __mlx5_cq_clean(), while narrowing the comment near this line to be more specific, as mentioned above.
In any case, there is no need for a change in mlx5_end_poll(), as I already mentioned.
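The multi-threaded argument above (the destroy path cannot run while a locked polling batch is open) can be illustrated with a small sketch. Everything here is an assumption made for illustration: `toy_cq` and the `toy_*` functions are hypothetical, a C11 `atomic_flag` stands in for the CQ lock, and `toy_try_clean()` is a non-blocking probe, whereas the real destroy path would wait for the lock rather than give up.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical CQ with a lock; NOT the real mlx5 structure. */
struct toy_cq {
    atomic_flag lock;  /* stands in for the CQ lock */
};

static void toy_cq_init(struct toy_cq *cq)
{
    atomic_flag_clear(&cq->lock);
}

/* Locked start_poll: take the lock and keep it for the whole batch. */
static void toy_start_poll(struct toy_cq *cq)
{
    while (atomic_flag_test_and_set(&cq->lock))
        ;  /* spin until the lock is free */
}

/* end_poll closes the batch and releases the lock. */
static void toy_end_poll(struct toy_cq *cq)
{
    atomic_flag_clear(&cq->lock);
}

/*
 * Probe of the destroy-side clean step: returns true if the clean could run
 * right now, false if a polling batch still holds the lock.
 */
static bool toy_try_clean(struct toy_cq *cq)
{
    if (atomic_flag_test_and_set(&cq->lock))
        return false;  /* batch in progress: a real destroy would block here */
    /* ... the clean step would run here, under the lock ... */
    atomic_flag_clear(&cq->lock);
    return true;
}
```

In this model `toy_try_clean()` fails between `toy_start_poll()` and `toy_end_poll()` and succeeds afterwards, which is why the stale-pointer window only exists when the CQ is used without the lock, i.e. in single-threaded mode.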
> You are referring only to a single-threaded application, correct? In a multi-threaded application
Yes, the CQ is configured in single-threaded mode by specifying IBV_CREATE_CQ_ATTR_SINGLE_THREADED.
I have updated the comment to describe the use case that we have discussed so far.
56c07f5 to cdeef23
For a CQ created in single-threaded mode and serving multiple QPs, if the user destroys a QP between ibv_start_poll() and ibv_end_poll(), then cq->cur_rsc should be invalidated, since it may point to the QP that is being destroyed, which may cause a UAF error in the next ibv_next_poll() call.

Signed-off-by: ZHOU Huaping <zhouhuaping.san@bytedance.com>
cdeef23 to 8452205
When using the cq_ex interface, if the user destroys the QP associated with the current work completion, the next ibv_next_poll() call will cause a use-after-free error, since it accesses the already destroyed QP through cq->cur_rsc inside get_req_context().
Fix this error by resetting cq->cur_rsc in __mlx5_cq_clean() if it is associated with the QP to be destroyed.
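As a rough illustration of the fix described in this commit message, the invalidation step might look like the sketch below. This is not the actual patch: `toy_rsc`, `toy_cq`, and `toy_cq_clean()` are hypothetical names, and matching the cached resource by its number (rather than by pointer) is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; NOT the real mlx5 structures. */
struct toy_rsc {
    uint32_t rsn;  /* resource (QP) number */
};

struct toy_cq {
    struct toy_rsc *cur_rsc;  /* resource cached from the previous completion */
};

/*
 * Toy analogue of the clean step run while a QP is being destroyed.
 * The cached resource is still valid at this point, so reading its number
 * is safe; if it belongs to the QP being destroyed, drop the cache so the
 * next poll step performs a fresh lookup instead of reusing the pointer.
 */
static void toy_cq_clean(struct toy_cq *cq, uint32_t rsn)
{
    if (cq->cur_rsc && cq->cur_rsc->rsn == rsn)
        cq->cur_rsc = NULL;
    /* A real provider would also discard this QP's pending CQEs here. */
}
```

Destroying a QP other than the cached one leaves `cur_rsc` untouched, so the cache keeps working for the remaining QPs on the same CQ; only a matching resource number clears it.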