
[19696][20409] Fix data race on PDP #4220

Merged
merged 7 commits into master from bugfix/pdp_data_race on Feb 22, 2024

Conversation

@Tempate (Contributor) commented Jan 9, 2024

Description

This PR intends to solve a race condition in the PDP discovery phase, in which a data UP is received and processed while its corresponding, previously received data P is still being processed.

As a result, the participant proxy that was created and registered can be cleared before calling assignRemoteEndpoints, as the PDP mutex is released just before (on purpose, to avoid interlocking problems).

This issue has been reported by TSAN in one particular scenario: the data P is received in the UDP multicast thread and then the data UP in the unicast one. However, this problem should also affect other scenarios, such as:

  • The same, but with SHM transport.
  • The same, but receiving the data P in the unicast thread and the data UP in the multicast one.
  • Losing the remote participant's liveliness (events thread) while still processing its data P.

The proposed solution is to copy the received participant proxy data before releasing the PDP mutex, and then pass this copy on to the methods requiring this information.
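
For illustration only, here is a minimal sketch of that pattern with simplified, hypothetical stand-in types (not the actual Fast DDS classes or API):

```cpp
#include <cstdint>
#include <mutex>
#include <string>

// Simplified, hypothetical stand-in for the real ParticipantProxyData type.
struct ParticipantProxyData
{
    std::string guid;
    uint32_t available_builtin_endpoints = 0;
};

// Simplified, hypothetical stand-in for the PDP object.
struct Pdp
{
    std::recursive_mutex mutex;
    ParticipantProxyData registered_proxy;  // may be cleared by another thread (data UP, lost liveliness)

    // Uses only the data passed to it, so a stack copy is enough to keep it valid.
    void assign_remote_endpoints(const ParticipantProxyData& /*pdata*/) {}

    void on_data_p(const ParticipantProxyData& incoming)
    {
        ParticipantProxyData pdata_copy;
        {
            std::lock_guard<std::recursive_mutex> guard(mutex);
            registered_proxy = incoming;    // create/update the proxy in the PDP database
            pdata_copy = registered_proxy;  // copy taken while the mutex is still held
        }  // mutex released here on purpose, as before, to avoid interlocking problems

        // Even if another thread removes the registered proxy at this point,
        // the copy remains valid for matching the remote builtin endpoints.
        assign_remote_endpoints(pdata_copy);
    }
};
```

In the actual PR, a dedicated copy method is used instead of plain assignment to ensure all the attributes are copied correctly (see the commit list below).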

@Mergifyio backport 2.12.x 2.10.x 2.6.x

Contributor Checklist

  • Commit messages follow the project guidelines.
  • The code follows the style guidelines of this project.
  • Tests that thoroughly check the new feature have been added/Regression tests checking the bug and its fix have been added; the added tests pass locally
  • N/A Any new/modified methods have been properly documented using Doxygen.
  • Changes are ABI compatible.
  • Changes are API compatible.
  • N/A New feature has been added to the versions.md file (if applicable).
  • N/A New feature has been documented/Current behavior is correctly described in the documentation.
  • Applicable backports have been included in the description.

Reviewer Checklist

  • The PR has a milestone assigned.
  • Check contributor checklist is correct.
  • Check CI results: changes do not issue any warning.
  • Check CI results: failing tests are unrelated with the changes.

@EduPonz EduPonz added this to the v2.13.2 milestone Jan 10, 2024
@Tempate Tempate changed the title from "Fix data race on PDP [19696]" to "[19696] Fix data race on PDP" on Jan 10, 2024
@Tempate Tempate force-pushed the bugfix/pdp_data_race branch 2 times, most recently from 3f6008b to 4e428af on January 15, 2024 08:33
@Tempate (Contributor Author) commented Jan 15, 2024

@richiprosima please test this

@Tempate (Contributor Author) commented Jan 16, 2024

@richiprosima please test this

@Tempate Tempate marked this pull request as ready for review January 19, 2024 10:51
@MiguelCompany MiguelCompany (Member) left a comment

  1. There is a test that has failed with a segmentation fault on both Mac and Windows.
  2. I don't like the solution of copying the whole ParticipantProxyData; it might be better to change the API of the methods receiving it so they only use the necessary information, and just copy that information.

@Tempate (Contributor Author) commented Jan 26, 2024

@richiprosima please test this

@JesusPoderoso JesusPoderoso added the ci-pending (PR which CI is running) label Jan 29, 2024
@Tempate (Contributor Author) commented Jan 31, 2024

@richiprosima please test this

@Mario-DL Mario-DL mentioned this pull request Jan 31, 2024
@EduPonz EduPonz added the needs-review (PR that is ready to be reviewed) label Feb 1, 2024
@EduPonz EduPonz added the to-do label Feb 9, 2024
@EduPonz EduPonz modified the milestones: v2.13.2, v2.13.3 Feb 14, 2024
@juanlofer-eprosima (Contributor) commented

@richiprosima please test this

@JesusPoderoso (Contributor) commented

@juanlofer-eprosima please address the linter issue. Is the failing Linux discovery test related to this PR?

@JesusPoderoso JesusPoderoso removed the ci-pending (PR which CI is running) label Feb 20, 2024
@juanlofer-eprosima juanlofer-eprosima changed the title from "[19696] Fix data race on PDP" to "[19696][20409] Fix data race on PDP" on Feb 21, 2024
@Mario-DL Mario-DL (Member) left a comment

Good job on this PR, as it effectively solves the data race it describes. I also like the simplification in the PDPClient to use is_connected, avoiding storing the entire ParticipantProxyData. I am approving, though leaving a reflection.

Giving a second thought to what @MiguelCompany suggested, I think this work may admit an even finer-grained solution to avoid copying the entire ParticipantProxyData on the stack. But it is also true that we are highly tied to the entire pdata within PDP::assign_remote_endpoints(), especially when we are notifying the user with the DiscoveredParticipantInfo. The rest of the methods, matched_reader/writer_add() and assign_low_level_endpoints(), do not make intensive use of the ParticipantProxyData, and it would be enough to pass them only some of the fields: some locator lists, the GUID, and m_availableBuiltinEndpoints, in essence.

Finally, I am attaching a possible patch for a different approach instead of stack-copying, which explores the option of synchronizing PDP::remove_remote_participant() and PDP::assign_remote_endpoints() only in the case that we are trying to delete a pdata that is currently being processed. The patch is quick and dirty, just to give an idea of the approach, and I would still need to think about whether it would cover all the situations that we are solving by making the stack copy.

pdp_patch_another_approach.txt

Approving!
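
For reference, a purely hypothetical sketch of the finer-grained alternative described in the comment above (names made up for illustration, not the actual Fast DDS API): the matching methods would receive only the fields they need, gathered while the PDP mutex is still held.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical reduced view of a remote participant; built under the PDP mutex
// and then passed to matched_reader_add()/matched_writer_add()/
// assign_low_level_endpoints() instead of the whole ParticipantProxyData.
struct RemoteParticipantEndpointInfo
{
    std::string guid_prefix;                    // remote participant GUID prefix
    uint32_t available_builtin_endpoints = 0;   // m_availableBuiltinEndpoints
    std::vector<std::string> metatraffic_unicast_locators;
    std::vector<std::string> metatraffic_multicast_locators;
};
```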

@Mario-DL Mario-DL added the ci-pending (PR which CI is running) label and removed the needs-review (PR that is ready to be reviewed) and ci-pending (PR which CI is running) labels Feb 22, 2024
@Mario-DL (Member) commented

I included the backports in the PR description

@Mario-DL Mario-DL added the ready-to-merge (Ready to be merged. CI and changes have been reviewed and approved.) label Feb 22, 2024
@EduPonz EduPonz merged commit 3053832 into master Feb 22, 2024
11 of 17 checks passed
@EduPonz EduPonz deleted the bugfix/pdp_data_race branch February 22, 2024 11:30
@EduPonz commented Feb 22, 2024

@Mergifyio backport 2.12.x 2.10.x 2.6.x

mergify bot (Contributor) commented Feb 22, 2024

backport 2.12.x 2.10.x 2.6.x

✅ Backports have been created

@juanlofer-eprosima (Contributor) commented

I also agree that the solution @MiguelCompany suggested was better, but we decided to proceed with the copy approach given the effort required, and also considering that the root of the problem (releasing the PDP mutex before completing an entire discovery operation, which might lead to inconsistent states besides data races like the one we are dealing with here) would still remain unsolved. We also discussed that a possible way to tackle the root of the problem would be to delegate all discovery actions to a dedicated thread.

Regarding your proposal, I believe it is a bit risky to wait, as some lower-level resources might be blocked by that thread, leading to potential deadlocks. Two additional comments:

  • I believe that (in remove_remote_participant) locking the PDP mutex should be done after calling wait (to avoid an intraprocess deadlock).
  • I also believe that a PDP instance could process more than one data P before processing a data UP, so I guess the complexity of this solution would need to grow.

@Mario-DL Mario-DL (Member) commented Feb 22, 2024

Yeah, I agree that it would be risky as a consequence of relying on a cond var.

A further lock-free approach I can think of would be for the thread executing remove_remote_participant() on the participant proxy that is currently being processed in assign_remote_endpoints() to set a flag in the PDPListener, meaning: "I am not removing the participant proxy now, as you are using it, but when you finish assign_remote_endpoints() you will have to delete it on my behalf".
In that case, the same discovery thread could call remove_remote_participant() after executing assign_remote_endpoints() if the flag is set, and then reset it. What do you think?
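
A rough, hypothetical sketch of that flag-based idea (names invented for illustration, not taken from the attached patch):

```cpp
#include <mutex>

// Hypothetical sketch only: the remover defers the deletion to the discovery
// thread when the proxy is currently being used by assign_remote_endpoints().
struct PdpListenerSketch
{
    std::mutex mtx;
    bool processing = false;        // proxy currently in use by assign_remote_endpoints()
    bool pending_removal = false;   // "delete it on my behalf when you finish"

    void remove_remote_participant()
    {
        std::lock_guard<std::mutex> guard(mtx);
        if (processing)
        {
            pending_removal = true;  // do not remove now, just leave the flag
            return;
        }
        do_remove();
    }

    void process_data_p()
    {
        {
            std::lock_guard<std::mutex> guard(mtx);
            processing = true;
        }
        assign_remote_endpoints();   // proxy used here without blocking the remover

        std::lock_guard<std::mutex> guard(mtx);
        processing = false;
        if (pending_removal)
        {
            pending_removal = false;
            do_remove();             // same discovery thread performs the deferred removal
        }
    }

    void assign_remote_endpoints() {}  // placeholder
    void do_remove() {}                // placeholder
};
```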

@juanlofer-eprosima (Contributor) commented

That's an interesting approach we could explore further. The only caveat I see at the moment is that remove_remote_participant can also be called from places other than a PDPListener, for example from the events thread when liveliness is lost.

juanlofer-eprosima pushed a commit that referenced this pull request Feb 29, 2024
* Add regression test

Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com>

* Make a copy of the participant proxy data

Signed-off-by: tempate <danieldiaz@eprosima.com>

* Use the copy method to ensure all the attributes are being copied correctly

Signed-off-by: tempate <danieldiaz@eprosima.com>

* Lookup the participant proxy data to pass the discovery-server tests

Signed-off-by: tempate <danieldiaz@eprosima.com>

* Get rid of proxy reference in RemoteServerAttributes

Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com>

* Fix failing test

Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com>

* Uncrustify

Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com>

---------

Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com>
Signed-off-by: tempate <danieldiaz@eprosima.com>
Co-authored-by: Juan Lopez Fernandez <juanlopez@eprosima.com>
(cherry picked from commit 3053832)
juanlofer-eprosima pushed a commit that referenced this pull request Feb 29, 2024
juanlofer-eprosima added a commit that referenced this pull request Feb 29, 2024
EduPonz pushed a commit that referenced this pull request Mar 6, 2024
* Fix data race on PDP (#4220) (cherry picked from commit 3053832)

* Export ParticipantProxyData constructor/destructor
MiguelCompany pushed a commit that referenced this pull request Mar 18, 2024
Mario-DL pushed a commit that referenced this pull request Mar 19, 2024
EduPonz pushed a commit that referenced this pull request Mar 22, 2024
* Fix data race on PDP (#4220)

* Export ParticipantProxyData constructor/destructor
MiguelCompany pushed a commit that referenced this pull request Mar 29, 2024
* Fix data race on PDP (#4220) (cherry picked from commit 3053832)

* Export ParticipantProxyData constructor/destructor
Labels
ready-to-merge: Ready to be merged. CI and changes have been reviewed and approved.