Robustify spawner #1501

fmauch · 2024-04-23T10:58:25Z

With the current implementation it can happen that when using multiple controller spawners some of them fail or get stuck, see #1182. This fixes #1182 and #1483.

While writing this, I realize this has great overlap with #1483, but I see no problem combining those two.

As briefly mentioned above, this change addresses two things:

The node discovery mechanism for the controller manager seems to be error-prone and not strictly needed.
It can happen that we don't get a response from the service server and hang in a deadlock.

Note: The test I implemented is a bit hacky, so we might also want to remove it again? Otherwise I think we would have to change the following things:

I installed the test folder in order to access the controller configuration file living in there. We should either install that file by hand rather than the complete test folder or move it somewhere else.
I've added a urdf file in ros_control_test_assets as I thought that might be the most useful place. I might be wrong.
I've added a sleep to my test before checking whether the controller_manager lists all the expected controllers. That worked for me as a first implementation, but since a fixed timeout is very error-prone, it would be better to have a proper waiting mechanism. With the changes I made, the spawners should die eventually and not hang in a deadlock, so we could add event handlers to the launch description, but I'm not sure how to combine the exit events from all the spawners into one trigger.
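
One possible way to combine the exit events of all spawners into a single trigger is a small aggregator that each exit handler reports to. The sketch below is plain Python and does not use the launch API; the class name `ExitBarrier` and its methods are hypothetical. In a launch description, something like one `OnProcessExit` handler per spawner could call `notify_exit`, but that wiring is an assumption and is not shown here.

```python
import threading


class ExitBarrier:
    """Fire a callback once, after every expected process has reported its exit.

    Hypothetical helper for illustration; not part of launch or ros2_control.
    """

    def __init__(self, names, on_all_exited):
        self._pending = set(names)
        self._exit_codes = {}
        self._on_all_exited = on_all_exited
        self._lock = threading.Lock()

    def notify_exit(self, name, exit_code):
        """Record that the process `name` exited with `exit_code`."""
        with self._lock:
            if name not in self._pending:
                # Unknown or already-reported process: ignore it.
                return
            self._pending.discard(name)
            self._exit_codes[name] = exit_code
            done = not self._pending
            codes = dict(self._exit_codes)
        if done:
            # Called outside the lock, exactly once, with all exit codes.
            self._on_all_exited(codes)
```

The callback receives a dictionary of exit codes, so a test could both wait for all spawners and assert that each of them returned 0.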

I implemented and tested this on the Rolling-on-Jammy installation I currently have, but I know the problem definitely also arises for Humble users.

Apparently, service calls can end up in a deadlock where the server says it cannot send the response to the client. In that case, if we call spin_until_future_complete() without a timeout, we will wait forever. Hence, this commit adds a timeout and retry mechanism to the service call abstraction.
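
The timeout-and-retry idea can be illustrated without ROS. The following is a minimal sketch, not the code from this PR: it uses `concurrent.futures.Future` in place of an rclpy future, and the names `call_with_retry` and `send_request` are hypothetical.

```python
import concurrent.futures


def call_with_retry(send_request, call_timeout=10.0, max_attempts=3):
    """Send a request until a response arrives within `call_timeout` seconds.

    `send_request` is a hypothetical callable that issues the request and
    returns a concurrent.futures.Future for the response.
    """
    for attempt in range(1, max_attempts + 1):
        future = send_request()
        try:
            # Bounded wait: a lost response can no longer block forever.
            return future.result(timeout=call_timeout)
        except concurrent.futures.TimeoutError:
            # Give up on this request and issue a fresh one.
            future.cancel()
            print(
                f"No response within {call_timeout}s "
                f"(attempt {attempt} of {max_attempts})."
            )
    raise RuntimeError(f"No response after {max_attempts} attempts.")
```

Bounding each wait and re-sending the request trades a possible duplicate call for the guarantee that the caller eventually returns or fails.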

christophfroehlich

The changes look fine to me.

Couldn't you hardcode the URDF directly in the launch_testing file instead of running xacro? Then you wouldn't need to install the URDF. Otherwise, you could use a relative path from the launch_testing file to the urdf instead of FindPackageShare.
Same for the yaml (haven't done that yet to be honest).
Instead of waiting 30 s, would it make sense to check repeatedly in a loop whether all controllers are already running (with some delay in every iteration and a maximum number of iterations, of course)?

Edit: We could just put URDF+yaml into the test_assets package and leave the launch file as it is.
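
The polling suggestion above could look roughly like the following. This is a sketch only: `wait_until` and `all_controllers_active` are hypothetical helpers, and `list_active` stands in for whatever call the test uses to query the names of the running controllers.

```python
import time


def wait_until(predicate, timeout=30.0, period=0.5):
    """Poll `predicate` every `period` seconds until it is true or `timeout` expires.

    Returns True as soon as the predicate holds, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(period)


def all_controllers_active(expected, list_active):
    """Check that every expected controller name is reported as active.

    `list_active` is a hypothetical callable returning the active names.
    """
    return set(expected) <= set(list_active())
```

Compared with a fixed sleep, the test finishes as soon as the controllers are up and still fails cleanly once the upper bound is reached.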

fmauch · 2024-04-23T20:53:13Z

I'll have to have a look at the tests anyway, since they seem not to work on noble... I'll probably wait to properly address this until Friday when noble is released.

Thank you for the input @christophfroehlich. Moving both files to the test_assets didn't seem right to me, since the controller config file is a bit awkward (I mean, I spawn 50 joint_state_controllers for the same joints). I tried including things in the launch file, but since the parameters aren't for the ros2_control_node, this isn't quite straightforward. But maybe there is a way. However, I am currently not sure whether I have to install the Python test anyway, so it would not do much harm to install the controller file as well, I think.

It looks like #1483 might get merged before this, though, in which case most of the changes here aren't necessarily required anymore and maybe also the test is questionable. We might hold this until #1483 is merged and it is decided how to proceed with humble and iron.

controller_manager/controller_manager/controller_manager_services.py

destogl

I find this arbitrary repetition time a bit "mew"; if we provide this properly, then it would make more sense.

fmauch · 2024-08-14T17:23:51Z

I'll update this, now that #1562 is merged.

codecov · 2024-08-15T08:30:01Z

Codecov Report

Attention: Patch coverage is 80.95238% with 4 lines in your changes missing coverage. Please review.

Project coverage is 86.64%. Comparing base (af4b48f) to head (6d627b5).
Report is 2 commits behind head on master.

Files	Patch %	Lines
.../controller_manager/controller_manager_services.py	55.55%	2 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1501      +/-   ##
==========================================
- Coverage   86.67%   86.64%   -0.03%     
==========================================
  Files         115      115              
  Lines       10528    10544      +16     
  Branches      967      970       +3     
==========================================
+ Hits         9125     9136      +11     
- Misses       1056     1059       +3     
- Partials      347      349       +2

Flag	Coverage Δ
unittests	`86.64% <80.95%> (-0.03%)`	⬇️

Files	Coverage Δ
controller_manager/test/test_spawner_unspawner.cpp	`99.05% <100.00%> (+0.05%)`	⬆️
.../controller_manager/controller_manager_services.py	`66.15% <55.55%> (-1.59%)`	⬇️

... and 1 file with indirect coverage changes

fmauch · 2024-08-15T08:42:24Z

I'm happy with how that currently is; I'm just not sure about the docstring style. I don't have a lot of experience with Python docstrings, but I do prefer using type hints in the signature and reStructuredText in the docstring:

def service_caller(
    node: 'Node',
    service_name: str,
    service_type: str,
    request: SrvTypeRequest,
    service_timeout: Optional[float] = 0.0,
    call_timeout: Optional[float] = 10.0,
    max_attempts: Optional[int] = 3,
) -> SrvTypeResponse:
    """
    Abstraction of a service call.

    Has an optional timeout to find the service and a mechanism
    to retry a call if no response is received.

    :param node: Node object to be associated with
    :param service_name: Service URL
    :param service_type: Service type
    :param request: The request to be sent
    :param service_timeout: Timeout (in seconds) to wait until the service is available. 0 means
        waiting forever, retrying every 10 seconds.
    :param call_timeout: Timeout (in seconds) for getting a response
    :param max_attempts: Number of attempts until a valid response is received. With some
        middlewares it can happen that the service response doesn't reach the client,
        leaving it in a waiting state forever.
    :return: The service response

    """

As there aren't too many python docstrings in this project and I couldn't find any specification on what to use, I would like to ask, whether there is a preference here.

destogl · 2024-08-15T10:22:43Z

As there aren't too many python docstrings in this project and I couldn't find any specification on what to use, I would like to ask, whether there is a preference here.

This is great! Now we have a reference :)

destogl

Great work, just a few minor cosmetic proposals if you like. It can go in also without it.

saikishor

LGTM

…ollers (#1501) --------- Co-authored-by: Dr. Denis <denis@stoglrobotics.de> (cherry picked from commit 80c264f) # Conflicts: # controller_manager/test/test_spawner_unspawner.cpp

* Robustify controller spawner and add integration test with many controllers (#1501) --------- Co-authored-by: Felix Exner (fexner) <exner@fzi.de> Co-authored-by: Dr. Denis <denis@stoglrobotics.de>

tonynajjar · 2024-10-21T15:52:23Z

controller_manager/controller_manager/controller_manager_services.py

+        if future.result() is None:
+            node.get_logger().warning(
+                f"Failed getting a result from calling {service_name} in "
+                f"{service_timeout}. (Attempt {attempt+1} of {max_attempts}.)"


@fmauch should be call_timeout right?

Yes, that's right! Thanks for spotting this
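
For reference, a sketch of the corrected message, with the wording taken from the diff hunk quoted above and `call_timeout` substituted for `service_timeout`. The wrapper function `format_timeout_warning` is hypothetical and exists only to make the string checkable; `attempt` is assumed to be zero-based, as the `attempt+1` in the hunk suggests.

```python
def format_timeout_warning(service_name, call_timeout, attempt, max_attempts):
    """Build the warning text for a service call that produced no result.

    Reports `call_timeout` (the per-call response timeout), not
    `service_timeout` (the time to wait for the service to appear).
    """
    return (
        f"Failed getting a result from calling {service_name} in "
        f"{call_timeout}. (Attempt {attempt+1} of {max_attempts}.)"
    )
```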

tonynajjar · 2024-10-22T07:14:30Z

controller_manager/controller_manager/controller_manager_services.py

+    service_timeout=0.0,
+    call_timeout=10.0,


@fmauch one more question, do you see an easy way to make these configurable by the user? My bringup is quite CPU intensive causing the call_timeout to be reached, however it works if I increase it a bit

If not, any harm in arbitrarily increasing it to e.g. 30?
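
One way to make these values user-configurable would be to expose them as command-line arguments of the spawner and forward them to `service_caller`. The sketch below only shows the argument parsing; the flag names `--service-call-timeout` and `--max-attempts` are hypothetical, and the defaults mirror the values quoted above.

```python
import argparse


def build_parser():
    """Create a parser exposing the service-call timeouts (hypothetical flags)."""
    parser = argparse.ArgumentParser(description="Spawner timeout options (sketch)")
    parser.add_argument(
        "--controller-manager-timeout",
        type=float,
        default=0.0,
        help="Seconds to wait for the first service; 0 waits forever",
    )
    parser.add_argument(
        "--service-call-timeout",
        type=float,
        default=10.0,
        help="Seconds to wait for the response to a single service call",
    )
    parser.add_argument(
        "--max-attempts",
        type=int,
        default=3,
        help="Number of times a call is retried before giving up",
    )
    return parser
```

A CPU-heavy bringup could then pass `--service-call-timeout 30` without changing the default for everyone else.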

fmauch added 4 commits April 23, 2024 12:07

Add integration test with a lot of controller spawners

324a012

With the current implementation I expect this to fail most of the time.

Do not wait for controller_manager node

7ee0fad

This waiting mechanism seems to be very error-prone and is not strictly needed, since we wait for all the services anyway. NOTE: This change removes the ability to specify the controller_manager_timeout.

Re-add controller_manager_timeout

e25ea4f

by setting it as the timeout for the first service call.

github-actions bot requested review from bmagyar, christophfroehlich, DasRoteSkelett, destogl, livanov93, moriarty and progtologist April 23, 2024 10:58

fmauch mentioned this pull request Apr 23, 2024

Handle on waiting #1483

Closed

Add missing controllers file

9b4ecfc

christophfroehlich reviewed Apr 23, 2024

fmauch mentioned this pull request Apr 24, 2024

(ur_control.launch.py) custom description file from custom pkg UniversalRobots/Universal_Robots_ROS2_Driver#947

Closed

1 task

fmauch mentioned this pull request Jun 10, 2024

Update spawner to have a timeout on service call #1566

Closed

6 tasks

Merge branch 'master' into robustify_spawner

f26e536

destogl reviewed Aug 14, 2024

controller_manager/controller_manager/controller_manager_services.py

Update controller_manager/controller_manager/controller_manager_servi…

de424ac

…ces.py

destogl reviewed Aug 14, 2024

controller_manager/controller_manager/controller_manager_services.py

ros2_control_test_assets/urdf/test_description_mock.urdf

fmauch added 8 commits August 14, 2024 20:52

Add call_timeout parameter to service_caller

484cd4a

Add documentation to service_caller parameters

77f7e16

Make base_joint a revolute joint

3e3aded

Use attempt counting with starting from 1

85c593c

Add test_dependency on launch_testing_ament_cmake

a98ca81

Correct error message on failed service call

aa0aa97

Add more test dependencies

2120fd8

Add used controllers to test dependencies

a77c600

fmauch added 5 commits August 15, 2024 00:06

Do not depend on other controllers

14028ea

Reduce installation scope

66ac4e8

Fix test_ros2_control_node.yaml

a117f8c

Use gtest instead of launch_testing for integration test

723a896

Revert sorting package.xml

26262c6

This is not related to this PR

Revert fixing yaml

49137f2

This is not related to this PR

destogl previously approved these changes Aug 15, 2024

destogl added backport-humble backport-iron labels Aug 15, 2024

fmauch dismissed destogl’s stale review via 6d627b5 August 15, 2024 11:17

Apply suggestions from code review

6d627b5

Co-authored-by: Dr. Denis <denis@stoglrobotics.de>

destogl approved these changes Aug 15, 2024

saikishor approved these changes Aug 15, 2024

destogl merged commit 80c264f into ros-controls:master Aug 16, 2024
19 checks passed

This was referenced Aug 16, 2024

Robustify spawner (backport #1501) #1686

Merged

Robustify spawner (backport #1501) #1687

Merged

destogl added a commit that referenced this pull request Aug 16, 2024

Robustify spawner (backport #1501) (#1686)

b6bdce9

--------- Co-authored-by: Felix Exner (fexner) <exner@fzi.de> Co-authored-by: Dr. Denis <denis@stoglrobotics.de>

fmauch deleted the robustify_spawner branch August 28, 2024 09:16

tonynajjar reviewed Oct 21, 2024

fmauch mentioned this pull request Oct 22, 2024

Fix timeout value in std output #1807

Merged

tonynajjar reviewed Oct 22, 2024

This was referenced Oct 25, 2024

Fix timeout value in std output (backport #1807) #1812

Merged

Fix timeout value in std output (backport #1807) #1813

Merged
