Frequent discovery DATABASE_ERROR during WiFi brownouts and roaming #3544

Open
1 task done
cvilas opened this issue May 28, 2023 · 1 comment
Labels
triage Issue pending classification

Comments

Contributor

cvilas commented May 28, 2023

Is there an already existing issue for this?

  • I have searched the existing issues

Expected behavior

The context:

  • A discovery server is running on a Linux host on wired Ethernet
  • Additional Linux hosts running DDS applications are connected to the same network wirelessly (WiFi 6). These hosts are compute boards on autonomous mobile robots and are configured with static IP addresses.
  • The mobile robots operate in a large area served by many wireless Access Points (APs). As they move through the environment, they routinely switch from one AP to another. The switch-over usually takes a few hundred milliseconds, but can sometimes take a few seconds.

Expected behaviour:

  • Usually we expect to see nothing at all. All endpoints should continue communicating with each other, with only a brief interruption during the switch-over.
  • If the switch-over takes too long, we expect the discovery server to 'drop' the participants and endpoints running on the WiFi hosts, and then re-discover them once the switch-over to the new AP is complete. After this point, the publishers and subscribers should match again and data flow should resume.
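The drop-and-rediscover behaviour expected above can be sketched as a generic lease table. This is only an illustrative model, not Fast DDS code: the `LeaseTable` class, its method names, and the lease value used in the usage note are all assumptions.

```cpp
#include <chrono>
#include <cstddef>
#include <map>
#include <string>

using Clock = std::chrono::steady_clock;

// Hypothetical lease-based liveness table (not a Fast DDS type).
class LeaseTable
{
public:
    explicit LeaseTable(std::chrono::milliseconds lease) : lease_(lease) {}

    // Records an announcement from a participant.
    // Returns true if the participant was not known before this call,
    // i.e. a first discovery or a re-discovery after a drop.
    bool on_announcement(const std::string& prefix, Clock::time_point now)
    {
        auto res = last_seen_.emplace(prefix, now);
        if (!res.second)
        {
            res.first->second = now;  // already known: just renew the lease
        }
        return res.second;
    }

    // Drops every participant whose lease has expired; returns how many.
    std::size_t expire(Clock::time_point now)
    {
        std::size_t dropped = 0;
        for (auto it = last_seen_.begin(); it != last_seen_.end();)
        {
            if (now - it->second > lease_)
            {
                it = last_seen_.erase(it);
                ++dropped;
            }
            else
            {
                ++it;
            }
        }
        return dropped;
    }

    bool is_alive(const std::string& prefix) const
    {
        return last_seen_.count(prefix) != 0;
    }

private:
    std::chrono::milliseconds lease_;
    std::map<std::string, Clock::time_point> last_seen_;
};
```

With an assumed lease of one second, a 300 ms roam leaves the participant in the table, while a 3 s outage drops it and the next announcement counts as a fresh discovery.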

Current behavior

For the most part, the behaviour is as described above. Occasionally, however, after the WiFi hosts rejoin the network, the discovery server logs error messages like these:

2023-05-19 07:20:08.146 [DISCOVERY_DATABASE Error] Reader 01.0f.a5.ee.67.f6.82.da.01.00.00.00|0.0.4.7 has no associated participant. Skipping -> Function create_readers_from_change_
2023-05-19 07:20:08.147 [DISCOVERY_DATABASE Error] Reader 01.0f.a5.ee.67.f6.82.da.01.00.00.00|0.0.5.7 has no associated participant. Skipping -> Function create_readers_from_change_
2023-05-19 07:20:08.147 [DISCOVERY_DATABASE Error] Writer 01.0f.a5.ee.67.f6.82.da.01.00.00.00|0.0.1.2 has no associated participant. Skipping -> Function create_writers_from_change_
2023-05-19 07:20:08.147 [DISCOVERY_DATABASE Error] Writer 01.0f.a5.ee.67.f6.82.da.01.00.00.00|0.0.2.2 has no associated participant. Skipping -> Function create_writers_from_change_

Once these errors appear on the discovery server, discovery is no longer reliable. Some publishers and subscribers may no longer match, and data flow may never recover.
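Each GUID in the log lines above is printed as a participant prefix and an entity ID separated by `|`, and all four orphaned readers and writers share the same prefix, so they belong to one participant. A small helper like the following could be used to extract those fields when triaging longer logs. It is a sketch only: `OrphanedEndpoint` and `parse_orphan_line` are hypothetical names, and the parser assumes exactly the message format shown above.

```cpp
#include <string>

// Hypothetical record for one "has no associated participant" log entry.
struct OrphanedEndpoint
{
    std::string kind;       // "Reader" or "Writer"
    std::string prefix;     // participant GUID prefix, text before '|'
    std::string entity_id;  // entity ID, text after '|'
};

// Parses one log line of the form shown in this issue.
// Returns false (and leaves `out` untouched) if the line does not match.
bool parse_orphan_line(const std::string& line, OrphanedEndpoint& out)
{
    const std::string tag = "[DISCOVERY_DATABASE Error] ";
    const std::string tail = " has no associated participant";

    const std::string::size_type tag_pos = line.find(tag);
    if (tag_pos == std::string::npos)
    {
        return false;
    }
    const std::string::size_type begin = tag_pos + tag.size();
    const std::string::size_type tail_pos = line.find(tail, begin);
    if (tail_pos == std::string::npos)
    {
        return false;
    }

    // body looks like "Reader <prefix>|<entity id>"
    const std::string body = line.substr(begin, tail_pos - begin);
    const std::string::size_type space = body.find(' ');
    const std::string::size_type bar = body.find('|');
    if (space == std::string::npos || bar == std::string::npos || bar < space)
    {
        return false;
    }

    out.kind = body.substr(0, space);
    out.prefix = body.substr(space + 1, bar - space - 1);
    out.entity_id = body.substr(bar + 1);
    return true;
}
```

Grouping the parsed entries by `prefix` shows which participants were missing from the database when their endpoints arrived.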

Steps to reproduce

As this happens only occasionally, it is quite hard to reproduce. One way to reproduce it is with multiple virtual machines (VMs) on a single host:

  • Let each VM run DDS applications - perhaps one VM running a publisher and the other running the corresponding subscriber.
  • Let one of the VMs run a discovery server.
  • Repeatedly turn the network connectivity off and on in one of the VMs.
  • Eventually, after a few tries, DISCOVERY_DATABASE errors are reported on the console running the discovery server.
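The connectivity toggling in the steps above could be automated inside the VM with something like the sketch below. It is an assumption-laden illustration rather than the procedure actually used: the function names, the interface name, and the timings are hypothetical, and it relies on the Linux `ip link set dev <iface> down|up` command, which needs root privileges. The command runner is passed in so the loop can be exercised without touching a real interface.

```cpp
#include <chrono>
#include <cstdlib>
#include <functional>
#include <string>
#include <thread>

// Executes one shell command and returns its exit status.
using CommandRunner = std::function<int(const std::string&)>;

// Takes `iface` down and up `cycles` times, pausing after each change.
// Returns the number of commands that reported a non-zero status.
int flap_interface(const std::string& iface,
                   int cycles,
                   std::chrono::milliseconds down_time,
                   std::chrono::milliseconds up_time,
                   const CommandRunner& run)
{
    int failures = 0;
    for (int i = 0; i < cycles; ++i)
    {
        if (run("ip link set dev " + iface + " down") != 0)
        {
            ++failures;
        }
        std::this_thread::sleep_for(down_time);

        if (run("ip link set dev " + iface + " up") != 0)
        {
            ++failures;
        }
        std::this_thread::sleep_for(up_time);
    }
    return failures;
}

// Runner that really executes the command through the shell (requires root).
int run_with_system(const std::string& cmd)
{
    return std::system(cmd.c_str());
}
```

For example, `flap_interface("eth1", 20, std::chrono::seconds(3), std::chrono::seconds(10), run_with_system)` would mimic twenty multi-second outages on an interface assumed to be called `eth1`; varying the down time helps cover both the short and the long switch-over cases.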

Fast DDS version/commit

Happens on master, and certainly on release 2.10.1.

Platform/Architecture

Ubuntu Focal 20.04 amd64, Ubuntu Focal 20.04 arm64

Transport layer

UDPv4

Additional context

We appear to have solved this by delaying the reconciliation of readers and writers that are reported as having no associated participant. In essence, we push such readers and writers onto a list and, whenever a new participant is discovered, retry the association. This seems to resolve the errors. I will add a pull request demonstrating the solution later.
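The idea can be illustrated with a minimal, self-contained model. This is not the Fast DDS discovery database and not the code from the pull request: `DiscoveryDb`, `Endpoint`, and every method name below are hypothetical, and GUIDs are reduced to plain strings.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified endpoint record (reader or writer).
struct Endpoint
{
    std::string participant_prefix;  // GUID prefix of the owning participant
    std::string entity_id;           // e.g. "0.0.4.7"
};

// Hypothetical database that parks orphaned endpoints instead of skipping them.
class DiscoveryDb
{
public:
    // Returns true if the endpoint was associated immediately, false if it
    // was parked because its participant has not been discovered yet.
    bool add_endpoint(const Endpoint& ep)
    {
        if (participants_.count(ep.participant_prefix) == 0)
        {
            pending_[ep.participant_prefix].push_back(ep);
            return false;
        }
        endpoints_[ep.participant_prefix].push_back(ep);
        return true;
    }

    // Registers a participant and retries the endpoints parked for it.
    // Returns the number of endpoints reconciled by this call.
    std::size_t add_participant(const std::string& prefix)
    {
        participants_.insert(prefix);
        auto it = pending_.find(prefix);
        if (it == pending_.end())
        {
            return 0;
        }
        const std::size_t reconciled = it->second.size();
        std::vector<Endpoint>& dst = endpoints_[prefix];
        dst.insert(dst.end(), it->second.begin(), it->second.end());
        pending_.erase(it);
        return reconciled;
    }

    std::size_t pending_count() const
    {
        std::size_t n = 0;
        for (const auto& entry : pending_)
        {
            n += entry.second.size();
        }
        return n;
    }

    std::size_t endpoint_count(const std::string& prefix) const
    {
        auto it = endpoints_.find(prefix);
        return it == endpoints_.end() ? 0 : it->second.size();
    }

private:
    std::set<std::string> participants_;
    std::map<std::string, std::vector<Endpoint>> endpoints_;
    std::map<std::string, std::vector<Endpoint>> pending_;
};
```

A real implementation would presumably also need to bound the pending list and discard entries whose participant never shows up; the sketch leaves that out.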

XML configuration file

No response

Relevant log output

No response

Network traffic capture

No response

Contributor Author

cvilas commented May 28, 2023

Associated PR showing code changes here: #3545
