Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle on waiting #1483

Closed
wants to merge 9 commits into from
Closed

Conversation

Timple
Copy link
Contributor

@Timple Timple commented Apr 16, 2024

That was easier than anticipated, fixes: #1482 and #1200

@Timple
Copy link
Contributor Author

Timple commented Apr 16, 2024

Note, I didn't touch the other except KeyboardInterrupt: statement as it actually does work when controllers are successfully loaded.

@christophfroehlich
Copy link
Contributor

Does it make sense to add this to the unspawner script as well?

@Timple
Copy link
Contributor Author

Timple commented Apr 16, 2024

Good question! That doesn't have the same check in it.

Coming to think of this, since asking for permission instead of forgiveness is an antipattern (and actually breaks stuff: #1200).

I would opt for modifying this PR to align spawner and unspawner to both retry-and-log instead of the more expensive check-and-try.

Copy link
Contributor

@christophfroehlich christophfroehlich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! LGTM and tested with/without running CM.

would you mind adding hardware_spawner also to the helper scripts section in userdoc.rst, as you have fixed the arguments of the other scripts already there?

edit: the tests are failing now, could you also have a look there please? (see RHEL/debian workflows)

@Timple
Copy link
Contributor Author

Timple commented Apr 19, 2024

I need some assistance here. CI is failing on:

    Could not find a package configuration file provided by "test_msgs" with
    any of the following names:
  
      test_msgsConfig.cmake

But I didn't change anything related to test_msgs here?

@christophfroehlich
Copy link
Contributor

But I didn't change anything related to test_msgs here?

This was some issue with the rolling CI due to the transition to Ubuntu noble. The issues related to your changes are

[ RUN      ] TestLoadController.spawner_without_manager_errors
...
[WARN] [1713556932.603576757] [spawner_ctrl_1]: Could not contact service /controller_manager/list_controllers
[WARN] [1713556942.609051609] [spawner_ctrl_1]: Could not contact service /controller_manager/list_controllers
[WARN] [1713556952.614319281] [spawner_ctrl_1]: Could not contact service /controller_manager/list_controllers
[WARN] [1713556962.619689181] [spawner_ctrl_1]: Could not contact service /controller_manager/list_controllers
[WARN] [1713556972.625026280] [spawner_ctrl_1]: Could not contact service /controller_manager/list_controllers

@Timple Timple force-pushed the fix/spawner-interrupt branch from 121a8a5 to 1941427 Compare April 22, 2024 12:54
@fmauch fmauch mentioned this pull request Apr 23, 2024
@fmauch
Copy link
Contributor

fmauch commented Apr 23, 2024

I didn't have a massive amount of time to look into things the last couple of days but I wanted to finally fix #1182 today. My result is in #1501. This PR basically solves the first problem from #1182, as well, so +1 from me. But seeing the

Could not contact service /controller_manager/list_controllers

indicates that there is still something missing.

Edit: Actually, looking at the test it is obvious that this fails since this PR explicitly changes the spawner not to fail if it can't find the CM while the test expects it to fail. If that's the desired behavior that test probably simply has to be removed.

@Timple
Copy link
Contributor Author

Timple commented Apr 23, 2024

I dropped the test. As it was testing if false inputs gave no result. It could be written if a warning was issued, but that would mean parsing stdout or stderr in tests. Which is also a bit frowned upon.

Copy link
Contributor

@fmauch fmauch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good and I manually tested them in my setup.

If the removal of the controller_manager_timeout parameter is acceptable I think this is actually a good solution. However, this is a breaking change, so now would be the best time to merge this.

Porting this back to Iron & Humble isn't that straightforward, we could use my implementation from #1501 for that where I also removed the wait for controller_manager functionality but added the controller_manager_timeout for the first service call. Or we do indeed remove the functionality and print a deprecation warning on those if the timeout is not the default value.

@fmauch
Copy link
Contributor

fmauch commented Apr 24, 2024

Discussing in the working group today, we agreed on the following things:

  • The spawners should allow setting a definite timeout in order to make launchfiles actually fail if there's something wrong. Regarding the initial service wait we could use my proposal from e25ea4f. With that users can still define a timeout of 0 in order to make it wait indefinitely. This way this PR would also be directly backportable to Iron and Humble.
  • We'll first merge this and then I'll adjust Robustify spawner #1501 to only contain the missing changes.

@Timple Would you like to update this PR according to the first point please?

@Timple
Copy link
Contributor Author

Timple commented Apr 24, 2024

Regarding backporting, keeping the original timeout with the option of setting it to indefinitely makes sense!

However, I strongly opt for this to be the default in future releases. We've had flaky simulation and hardware startups because the default timeout was close to the boot time of the nodes. Flaky behavior seems the worst of all possible scenarios.
Would this be alright?

@fmauch
Copy link
Contributor

fmauch commented Apr 24, 2024

I think @bmagyar might have an opinion on that.

In the end that boils down to the default value of the timeout. I agree that a larger default timeout than 10 seconds might make a lot of sense, but in the spirit of making it fail by default and not leave launch files hanging indefinitely it might be good to still have one.

@Timple
Copy link
Contributor Author

Timple commented Apr 24, 2024

Yes, but defaults are very important.
I fail to grasp how a node printing once and failing in the startup noise is a more clear message than an error/warning being repeatedly printed.

Only scenario I can think of is some kind of catch mechanism that restarts/reboots. But very likely the result will be exactly the same, but debugging much harder as everything keeps restarting and printing a lot.

If I'm missing a scenario here, or proper scenarios were discussed at the working group, let me know!

@fmauch
Copy link
Contributor

fmauch commented Apr 25, 2024

One scenario is having launchfiles in an autostart job, e.g. a systemd unit where the shell output will never been seen by the user. In this case a failed unit is much more obvious to find than a unit repeating a log output.

@Timple
Copy link
Contributor Author

Timple commented Apr 25, 2024

That's a valid use case. If the logs of the autostart jobs are evaluated before the ROS logs.

Although I'd rather have that specific case as exception than flaky simulation and hardware startups.

@Timple Timple force-pushed the fix/spawner-interrupt branch from 9368265 to 7c2cdcc Compare April 29, 2024 07:12
@destogl
Copy link
Member

destogl commented May 8, 2024

We should keep timeout at it was and just add new option here.

Copy link
Contributor

mergify bot commented Jun 5, 2024

This pull request is in conflict. Could you fix it @Timple?

@bmagyar bmagyar mentioned this pull request Jun 5, 2024
@bmagyar
Copy link
Member

bmagyar commented Jun 5, 2024

Closing in favour of #1562

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KeyboardInterrupt exception if shutdown before controller manager is found
5 participants