Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not crash Executor when send_response fails due to client failure. #2215

Closed
wants to merge 2 commits into from

Conversation

kghost
Copy link
Contributor

@kghost kghost commented Jun 16, 2023

Related to ros2/ros2#1253

It is not sane that a faulty client can crash our service Executor, as discussed in the referred issue, if the client is not setup properly, send_response may return RCL_RET_TIMEOUT, we should not throw an error in this case.

Copy link
Collaborator

@fujitatomoya fujitatomoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kghost thanks for creating PR on this issue.

rclcpp/include/rclcpp/service.hpp Outdated Show resolved Hide resolved
rclcpp/include/rclcpp/service.hpp Outdated Show resolved Hide resolved
@fujitatomoya
Copy link
Collaborator

@kghost thank you for being patient on this.

I will be checking the following and then start CI all together.

  • underlying tier 1 rmw implementation returns timeout when it detects the timeout? (unless, it will generate the exception)
  • rclpy counterpart PR against this rclcpp.

@fujitatomoya
Copy link
Collaborator

underlying tier 1 rmw implementation returns timeout when it detects the timeout? (unless, it will generate the exception)

see ros2/rcl#1048 (comment)

rclpy counterpart PR against this rclcpp.

see ros2/rclpy#1136

@fujitatomoya
Copy link
Collaborator

CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

Copy link
Contributor

@clalancette clalancette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs the one change inline, and also likely needs to be rebased. Once those are done, I'm happy to approve this.

rclcpp/include/rclcpp/service.hpp Outdated Show resolved Hide resolved
@milidam
Copy link

milidam commented Jul 7, 2023

I am experiencing the same kind of crash using the fastrtps RMW implementation with ros2 humble.
However, the proposed patch does not avoid the crash, as the RMW_RET_TIMEOUT error code returned by the fastrtps' __rmw_send_response is changed to RCL_RET_ERROR in rcl_send_response.
The same is done in rolling : https://github.com/ros2/rcl/blob/rolling/rcl/src/rcl/service.c#L403

Am I missing somthing? or are there any other patches required for it to work?

@clalancette
Copy link
Contributor

Am I missing somthing? or are there any other patches required for it to work?

Yes, you also need ros2/rcl#1048

@fujitatomoya
Copy link
Collaborator

@kghost are you willing to work on this? see #2215 (review)

@kghost
Copy link
Contributor Author

kghost commented Jul 8, 2023

@kghost are you willing to work on this? see #2215 (review)

no problem!

@kghost kghost requested a review from clalancette July 10, 2023 10:03
@fujitatomoya
Copy link
Collaborator

@kghost thank you very much for the effort, can you address DCO error? see https://github.com/ros2/rclcpp/pull/2215/checks?check_run_id=14906339355

@fujitatomoya
Copy link
Collaborator

@kghost this also need to be rebased on rolling, see https://ci.ros2.org/job/ci_linux/19079/console CI failed on building rclcpp.

@kghost
Copy link
Contributor Author

kghost commented Jul 11, 2023

@fujitatomoya Yes, it is already rebased on rolling branch.

@fujitatomoya
Copy link
Collaborator

CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@fujitatomoya
Copy link
Collaborator

CI:

  • Linux Build Status

windows failure is unrelated.

kghost and others added 2 commits July 14, 2023 18:32
Related to ros2/ros2#1253

It is not sane that a faulty client can crash our service Executor, as
discussed in the referred issue, if the client is not setup properly,
send_response may return RCL_RET_TIMEOUT, we should not throw an error
in this case.

Signed-off-by: Zang MingJie <zealot0630@gmail.com>
Co-authored-by: Tomoya Fujita <Tomoya.Fujita@sony.com>
Signed-off-by: Zang MingJie <zealot0630@gmail.com>
@kghost
Copy link
Contributor Author

kghost commented Jul 18, 2023

@clalancette This is ready to merge.

@@ -486,6 +486,14 @@ class Service
{
rcl_ret_t ret = rcl_send_response(get_service_handle().get(), &req_id, &response);

if (ret == RCL_RET_TIMEOUT) {
RCLCPP_WARN(
rclcpp::get_node_logger(node_handle_.get()).get_child("rclcpp"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not using node_logger_ here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that there is access to the node logger in this class right now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a protected member of ServiceBase

rclcpp::Logger node_logger_;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good call. In that case, yeah, we should use that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kghost can you address this one? and then i think we are good to go.

if (ret == RCL_RET_TIMEOUT) {
RCLCPP_WARN(
rclcpp::get_node_logger(node_handle_.get()).get_child("rclcpp"),
"failed to send response (timeout): %s",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to add the service name in the log message

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem with trying to do that here is that not all of the constructors have access to the service name. So this may not be possible to do without changing all of the constructors, or exposing service_name in rcl_service_t. Personally, I think this is nice but I wouldn't hold up this PR for that.

@mauropasse
Copy link
Collaborator

Should changes in rclcpp/include/rclcpp/service.hpp be also applied also here?:

ret = rcl_action_send_goal_response(

ret = rcl_action_send_cancel_response(

rcl_ret_t rcl_ret = rcl_action_send_result_response(

rcl_ret_t ret = rcl_action_send_result_response(

Related to this topic, we're in conversations to set the timeout in DDS from an XML, so we can increase the timeout before returning RCL_RET_TIMEOUT. If the client is discovered before it, it will get the response from the service. Otherwise, it'll hang forever waiting a response (at least, that happens with actions).

Copy link
Contributor

@clalancette clalancette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that @mauropasse makes a good point and we should do the same thing for the action side of things as well. Once that is fixed I'm happy to approve this PR.

if (ret == RCL_RET_TIMEOUT) {
RCLCPP_WARN(
rclcpp::get_node_logger(node_handle_.get()).get_child("rclcpp"),
"failed to send response (timeout): %s",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem with trying to do that here is that not all of the constructors have access to the service name. So this may not be possible to do without changing all of the constructors, or exposing service_name in rcl_service_t. Personally, I think this is nice but I wouldn't hold up this PR for that.

@@ -486,6 +486,14 @@ class Service
{
rcl_ret_t ret = rcl_send_response(get_service_handle().get(), &req_id, &response);

if (ret == RCL_RET_TIMEOUT) {
RCLCPP_WARN(
rclcpp::get_node_logger(node_handle_.get()).get_child("rclcpp"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that there is access to the node logger in this class right now.

@fujitatomoya
Copy link
Collaborator

I think that @mauropasse makes a good point and we should do the same thing for the action side of things as well. Once that is fixed I'm happy to approve this PR.

@clalancette @mauropasse can we manage this with another issue before adding the implementation in this PR? i would like to check if that happens in detail.

@fujitatomoya
Copy link
Collaborator

I think that @mauropasse makes a good point and we should do the same thing for the action side of things as well. Once that is fixed I'm happy to approve this PR.

@clalancette @mauropasse can we manage this with another issue before adding the implementation in this PR? i would like to check if that happens in detail.

ros2/ros2#1462

@fujitatomoya
Copy link
Collaborator

@clalancette @mauropasse friendly ping.

@mauropasse
Copy link
Collaborator

mauropasse commented Jul 27, 2023

@fujitatomoya : i would like to check if that happens in detail.

What do you mean with "that happens in detail"?

I know it happens with actions since that's where we observed this issue first, and come with the temporary solution of increasing the timeout.

With regards of applying the action side of these fixes in a different PR, I'm OK with that. Maybe adding tests for these situations also.

@fujitatomoya
Copy link
Collaborator

@mauropasse i was not sure if this problem can be actually reproducible with actions, thanks for the information.

With regards of applying the action side of these fixes in a different PR, I'm OK with that

👍 @clalancette what do you think?

@clalancette
Copy link
Contributor

👍 @clalancette what do you think?

Yeah, that's fine. Let's keep this one just about the service server, and we can do the follow-up with ros2/ros2#1462 for action servers.

I'll go ahead and re-review this.

@fujitatomoya
Copy link
Collaborator

@kghost since you are not responding, i did create an another PR for this to proceed. I keep the all commits that you made. it would be appreciated if you can review it. see #2276

@fujitatomoya
Copy link
Collaborator

@clalancette @kghost i will go ahead to close this, instead can we proceed with #2276?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants