Robustify spawner #1501

Merged
merged 22 commits into from
Aug 16, 2024

Conversation

fmauch
Contributor

@fmauch fmauch commented Apr 23, 2024

With the current implementation it can happen that when using multiple controller spawners some of them fail or get stuck, see #1182. This fixes #1182 and #1483.

While writing this, I realize this has great overlap with #1483, but I see no problem combining those two.

As briefly mentioned above, this change addresses two things:

  • The node discovery mechanism for the controller manager seems to be error-prone and not strictly needed.
  • It can happen that we don't get a response from the service server and hang in a deadlock.

Note: The test I implemented is a bit hacky, so we might also want to remove it again? Otherwise I think we would have to change the following things:

  • I installed the test folder in order to access the controller configuration file living in there. We should either install that file by hand rather than the complete test folder or move it somewhere else.
  • I've added a urdf file in ros_control_test_assets as I thought that might be the most useful place. I might be wrong.
  • I've added a sleep to my test before checking whether the controller_manager lists all the expected controllers. That worked for me as a first implementation, but since a fixed timeout is very error-prone, a proper waiting mechanism would be better. With the changes I made, the spawners should eventually exit rather than hang in a deadlock, so we could add event handlers to the launch description, but I'm not sure how to combine the exit events from all the spawners into one trigger.
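
One possible way to combine the exit events of several spawners into a single trigger is a small bookkeeping object that fires once the last expected process has reported its exit. The sketch below is plain Python and not part of this PR; `AllExited`, `process_exited`, and the spawner names are hypothetical, and wiring it to the launch system (for example from one exit handler per spawner) is assumed rather than shown.

```python
from typing import Callable, Iterable, Set


class AllExited:
    """Invoke a callback once every named process has reported its exit."""

    def __init__(self, names: Iterable[str], on_all_exited: Callable[[], None]) -> None:
        self._pending: Set[str] = set(names)  # processes we are still waiting for
        self._on_all_exited = on_all_exited
        self._fired = False

    def process_exited(self, name: str) -> None:
        """Record one exit; fire the callback exactly once when none are pending."""
        self._pending.discard(name)
        if not self._pending and not self._fired:
            self._fired = True
            self._on_all_exited()


# Hypothetical usage: each spawner's exit handler would call process_exited().
tracker = AllExited(["spawner_1", "spawner_2"], lambda: print("all spawners exited"))
tracker.process_exited("spawner_1")
tracker.process_exited("spawner_2")
```

The test could then check the controller list inside the callback instead of after a fixed sleep.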

I implemented and tested things on the Rolling-on-Jammy installation I currently have, but I know the problem definitely also arises for Humble users.

fmauch added 4 commits April 23, 2024 12:07
With the current implementation I expect this to fail most of the time.
This waiting mechanism seems to be very error-prone and is not strictly needed
since we wait for all the services, anyway.

NOTE: This change removes the ability to specify the
controller_manager_timeout.
by setting it as the timeout for the first service call.
Apparently, service calls can end up in a deadlock where the server says
it cannot send the response to the client. In that case, if we spin_until_future_complete()
without a timeout, we will wait forever. Hence, this commit adds a timeout and retry
mechanism to the service call abstraction.
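
The timeout-and-retry idea from this commit message can be sketched without ROS. This is a minimal illustration under stated assumptions, not the code from this PR: `call_with_retry` and `send_request` are hypothetical names, and a `concurrent.futures.Future` stands in for the future that a service client would return.

```python
import concurrent.futures
from typing import Any, Callable


def call_with_retry(
    send_request: Callable[[], concurrent.futures.Future],
    call_timeout: float = 10.0,
    max_attempts: int = 3,
) -> Any:
    """Send a request, waiting at most call_timeout seconds per attempt."""
    for attempt in range(max_attempts):
        future = send_request()  # hypothetical: issues the request anew
        try:
            # Bounded wait, so a lost response cannot block forever.
            return future.result(timeout=call_timeout)
        except concurrent.futures.TimeoutError:
            print(
                f"No response within {call_timeout} s "
                f"(attempt {attempt + 1} of {max_attempts})."
            )
    raise RuntimeError(f"No response after {max_attempts} attempts")
```

The bounded wait is what turns a silent hang into a failure the caller can log and act on.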
Contributor

@christophfroehlich christophfroehlich left a comment

The changes look fine to me.

  • Couldn't you hardcode the URDF directly in the launch_testing file instead of running xacro? Then you wouldn't need to install the URDF. Otherwise, you could use a relative path from the launch_testing file to the urdf instead of FindPackageShare.
  • Same for the yaml (haven't done that yet to be honest).
  • Instead of waiting 30 s, would it make sense to check repeatedly in a loop whether all controllers are already running (plus some delay in every iteration and a maximum number of loops, of course)?

Edit: We could just put URDF+yaml into the test_assets package and leave the launch file as it is.
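
The polling loop suggested in the third point could look roughly like this. It is a sketch only: `wait_until` is a hypothetical helper, and the condition passed to it (for example, a check that the controller_manager reports all expected controllers) is assumed to be supplied by the test.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    delay: float = 1.0,
    max_loops: int = 30,
) -> bool:
    """Poll condition up to max_loops times, sleeping delay seconds in between."""
    for _ in range(max_loops):
        if condition():
            return True  # stop as soon as the condition holds
        time.sleep(delay)
    return False  # gave up after max_loops checks
```

With `delay=1.0` and `max_loops=30` the worst case matches the 30 s sleep, while the common case returns as soon as the controllers are up.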

@fmauch
Contributor Author

fmauch commented Apr 23, 2024

I'll have to have a look at the tests anyway, since they seem not to work on noble... I'll probably wait to properly address this until Friday when noble is released.

Thank you for the input, @christophfroehlich. Moving both files to the test_assets didn't seem right to me, since the controller config file is a bit awkward (I mean, I spawn 50 joint_state_controllers for the same joints). I tried including things in the launch file, but since the parameters aren't for the ros2_control_node, this isn't quite straightforward. But maybe there is a way. However, I am currently not sure whether I have to install the Python test anyway, so installing the controller file as well wouldn't do much harm, I think.

It looks like #1483 might get merged before this, though, in which case most of the changes here aren't necessarily required anymore and maybe also the test is questionable. We might hold this until #1483 is merged and it is decided how to proceed with humble and iron.

Member

@destogl destogl left a comment

I find this arbitrary repetition time a bit "mew". If we provide this properly, then it would make more sense.

@fmauch
Contributor Author

fmauch commented Aug 14, 2024

I'll update this, now that #1562 is merged.


codecov bot commented Aug 15, 2024

Codecov Report

Attention: Patch coverage is 80.95238% with 4 lines in your changes missing coverage. Please review.

Project coverage is 86.64%. Comparing base (af4b48f) to head (6d627b5).
Report is 2 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| .../controller_manager/controller_manager_services.py | 55.55% | 2 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1501      +/-   ##
==========================================
- Coverage   86.67%   86.64%   -0.03%     
==========================================
  Files         115      115              
  Lines       10528    10544      +16     
  Branches      967      970       +3     
==========================================
+ Hits         9125     9136      +11     
- Misses       1056     1059       +3     
- Partials      347      349       +2     
| Flag | Coverage Δ |
|---|---|
| unittests | 86.64% <80.95%> (-0.03%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files | Coverage Δ |
|---|---|
| controller_manager/test/test_spawner_unspawner.cpp | 99.05% <100.00%> (+0.05%) ⬆️ |
| .../controller_manager/controller_manager_services.py | 66.15% <55.55%> (-1.59%) ⬇️ |

... and 1 file with indirect coverage changes

This is not related to this PR
@fmauch
Contributor Author

fmauch commented Aug 15, 2024

I'm happy with how that currently is; I'm just not sure about the docstring style. I don't have a lot of experience with Python docstrings, but I do prefer using type hints in the signature and reStructuredText in the docstring:

def service_caller(
    node: 'Node',
    service_name: str,
    service_type: str,
    request: SrvTypeRequest,
    service_timeout: Optional[float] = 0.0,
    call_timeout: Optional[float] = 10.0,
    max_attempts: Optional[int] = 3,
) -> SrvTypeResponse:
    """
    Abstraction of a service call.

    Has an optional timeout to find the service and a mechanism
    to retry a call if no response is received.

    :param node: Node object to be associated with
    :param service_name: Service URL
    :param request: The request to be sent
    :param service_timeout: Timeout (in seconds) to wait until the service is available. 0 means
        waiting forever, retrying every 10 seconds.
    :param call_timeout: Timeout (in seconds) for getting a response
    :param max_attempts: Number of attempts until a valid response is received. With some
        middlewares it can happen that the service response doesn't reach the client,
        leaving it in a waiting state forever.
    :return: The service response

    """

As there aren't too many python docstrings in this project and I couldn't find any specification on what to use, I would like to ask, whether there is a preference here.

@destogl
Member

destogl commented Aug 15, 2024

> As there aren't too many python docstrings in this project and I couldn't find any specification on what to use, I would like to ask, whether there is a preference here.

This is great! Now we have a reference :)

destogl
destogl previously approved these changes Aug 15, 2024
Member

@destogl destogl left a comment

Great work, just a few minor cosmetic proposals if you like. It can go in also without it.

@destogl destogl added backport-humble This label should be used by maintainers only! Label triggers PR backport to ROS2 humble. backport-iron labels Aug 15, 2024
Co-authored-by: Dr. Denis <denis@stoglrobotics.de>
Member

@saikishor saikishor left a comment

LGTM

@destogl destogl merged commit 80c264f into ros-controls:master Aug 16, 2024
19 checks passed
mergify bot pushed a commit that referenced this pull request Aug 16, 2024
…ollers (#1501)

---------

Co-authored-by: Dr. Denis <denis@stoglrobotics.de>
(cherry picked from commit 80c264f)

# Conflicts:
#	controller_manager/test/test_spawner_unspawner.cpp
destogl added a commit that referenced this pull request Aug 16, 2024
---------

Co-authored-by: Felix Exner (fexner) <exner@fzi.de>
Co-authored-by: Dr. Denis <denis@stoglrobotics.de>
destogl added a commit that referenced this pull request Aug 16, 2024
* Robustify controller spawner and add integration test with many controllers (#1501)

---------

Co-authored-by: Felix Exner (fexner) <exner@fzi.de>
Co-authored-by: Dr. Denis <denis@stoglrobotics.de>
@fmauch fmauch deleted the robustify_spawner branch August 28, 2024 09:16
if future.result() is None:
    node.get_logger().warning(
        f"Failed getting a result from calling {service_name} in "
        f"{service_timeout}. (Attempt {attempt+1} of {max_attempts}.)"
Contributor

@fmauch this should be call_timeout, right?

Contributor Author

Yes, that's right! Thanks for spotting this

Comment on lines +40 to +41
service_timeout=0.0,
call_timeout=10.0,
Contributor

@fmauch one more question: do you see an easy way to make these configurable by the user? My bringup is quite CPU-intensive, which causes the call_timeout to be reached; however, it works if I increase it a bit.

Contributor

If not, any harm in arbitrarily increasing it to e.g. 30?
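
One way the timeouts could be exposed to the user is through command-line options whose values are then handed to the service call. The sketch below uses only `argparse`; the flag names `--service-timeout`, `--call-timeout`, and `--max-attempts` and the function `parse_timeouts` are hypothetical and not taken from the spawner's actual interface. The defaults mirror the values shown in the signature earlier in this conversation.

```python
import argparse
from typing import List, Optional


def parse_timeouts(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse hypothetical timeout options for a spawner-like script."""
    parser = argparse.ArgumentParser(description="Hypothetical spawner timeout options")
    parser.add_argument(
        "--service-timeout", type=float, default=0.0,
        help="Seconds to wait for the service to appear (0 waits forever)",
    )
    parser.add_argument(
        "--call-timeout", type=float, default=10.0,
        help="Seconds to wait for each service response",
    )
    parser.add_argument(
        "--max-attempts", type=int, default=3,
        help="Number of times a call is attempted before giving up",
    )
    return parser.parse_args(argv)


# A CPU-heavy bringup could then raise the per-call timeout to 30 s.
args = parse_timeouts(["--call-timeout", "30"])
print(args.call_timeout, args.service_timeout, args.max_attempts)
```

Keeping the existing values as defaults would leave current launch files unaffected.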

Labels
backport-humble This label should be used by maintainers only! Label triggers PR backport to ROS2 humble.
Development

Successfully merging this pull request may close these issues.

Creating multiple spawners can cause issues finding the CM
5 participants