Not getting service responses reliably when using CycloneDDS #74

hidmic · 2019-11-14T19:08:47Z

Bug report

Required Info:

Operating System:
- Ubuntu 18.04, both AMD64 and ARM64
Installation type:
- Binaries and from source
Version or commit hash:
- b92db52bb14ec097946bfe44d2b337359dddfbab (using 4089fee)
DDS implementation:
- CycloneDDS
Client library (if applicable):
- rclcpp only

Steps to reproduce issue

On one terminal, run:

RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 run demo_nodes_cpp add_two_ints_server

On another terminal, run:

RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 run demo_nodes_cpp add_two_ints_client_async

Expected behavior

A single service call takes place and succeeds i.e. you get logs on each terminal:

[INFO] [add_two_ints_server]: Incoming request
a: 2 b: 3

[INFO] [add_two_ints_client]: Result of add_two_ints: 5

Actual behavior

The client hangs with no output though the server does log so it appears to have received the request.

Additional information

It does not happen on rclpy. It also does not happen if bringing up both nodes via launch e.g.:

from launch import LaunchDescription
from launch_ros.actions import Node


def generate_launch_description():
    return LaunchDescription([
        Node(
            package='demo_nodes_cpp',
            node_executable='add_two_ints_server',
            output='screen'
        ),
        Node(
            package='demo_nodes_cpp',
            node_executable='add_two_ints_client_async',
            output='screen'
        )
    ])

So it's likely a race.

The text was updated successfully, but these errors were encountered:

hidmic · 2019-12-06T12:46:57Z

I can still reproduce on master.

dirk-thomas · 2019-12-06T18:52:36Z

I can confirm the problem.

My first guess is a race that the client discovers the server and sends it request but when the server receives the request and wants to publish the reply it hasn't discovered the client yet.

With a simple hack (adding a sleep of 500ms into the server callback function) the client receives the response reliable.

I guess we might something like a "wait_for_client" in the same way as we already offer a "wait_for_service".

eboasson · 2019-12-06T21:03:17Z

My first guess is a race that the client discovers the server and sends it request but when the server receives the request and wants to publish the reply it hasn't discovered the client yet.

Cyclone implements what the spec calls "stateful" readers and writers, which means the server's reader for the request topic won't accept any data until it has discovered the client's writer. "wait_for_service" waits until the server's reader for the request topic has been discovered, consequently the client's writer will hold on to the request message until the service's reader acknowledges it. It is therefore pretty much impossible that service is unaware of the existence of the client.

Reader and writer discovery do run independently in RTPS and so it is technically possible that the server has discovered the client's request writer and handled the request without having discovered the client's response reader yet, but that does seem unlikely.

I haven't had a chance to try to reproduce it yet (and this weekend it's not likely to happen either because of all kinds of obligations) but if I were to try, I'd probably enable Cyclone's tracing to get some more insight in what is going on. (Not for the faint of heart, allegedly … CYCLONEDDS_URI='finest</>cdds.log.${CYCLONEDDS_PID}' suffices. You have to squint a bit to figure out how that relates to proper XML 🙂.)

hidmic · 2019-12-12T17:49:22Z

Reader and writer discovery do run independently in RTPS and so it is technically possible that the server has discovered the client's request writer and handled the request without having discovered the client's response reader yet, but that does seem unlikely.

I'm no expert here. Having said that, I think that may be the case @eboasson.

The server sets up a response writer and a request reader first:

1576167310.543602 [0] add_two_in: new_writer(guid 8c561001:dc2ce137:5055b4d2:1303, (default).rr/add_two_intsReply/example_interfaces::srv::dds_::AddTwoInts_Response_)
1576167310.543645 [0] add_two_in: ref_participant(8c561001:dc2ce137:5055b4d2:1c1 @ 0x559c5a20f930 <- 8c561001:dc2ce137:5055b4d2:1303 @ 0x559c5a24ac84) user 20 builtin 15
1576167310.543658 [0] add_two_in: WRITER 8c561001:dc2ce137:5055b4d2:1303 QOS={topic=rr/add_two_intsReply,type=example_interfaces::srv::dds_::AddTwoInts_Response_,presentation=0:0:0,partition={},group_data=0<>,topic_data=0<>,durability=0,durability_service=0.000000000:{0:1}:{-1:-1:-1},deadline=9223372036.854775807,latency_budget=0.000000000,liveliness=0:9223372036.854775807,reliability=1:1.000000000,destination_order=0,history=1:-1,resource_limits=-1:-1:-1,transport_priority=0,lifespan=9223372036.854775807,user_data=0<>,ownership=0,ownership_strength=0,writer_data_lifecycle={1},entity_factory=1,ignorelocal=0}
1576167310.543665 [0] add_two_in: write_sample 0:0:0:3c2 #9: ST0 DCPSPublication/org::eclipse::cyclonedds::builtin::DCPSPublication:(blob)
1576167310.543668 [0] add_two_in: reader 8c561001:dc2ce137:5055b4d2:307
1576167310.543673 [0] add_two_in: match_writer_with_proxy_readers(wr 8c561001:dc2ce137:5055b4d2:1303) scanning all proxy readers
1576167310.543680 [0] add_two_in: match_writer_with_readers(wr 8c561001:dc2ce137:5055b4d2:1303) scanning all readers
1576167310.543687 [0] add_two_in: sedp: write for 8c561001:dc2ce137:5055b4d2:1303 via 8c561001:dc2ce137:5055b4d2:3c2
1576167310.543692 [0] add_two_in: write_sample 8c561001:dc2ce137:5055b4d2:3c2 #9: ST0 (null)/(null):(plist)
1576167310.543699 [0] add_two_in: new_reader(guid 8c561001:dc2ce137:5055b4d2:1404, (default).rq/add_two_intsRequest/example_interfaces::srv::dds_::AddTwoInts_Request_)
1576167310.543702 [0] add_two_in: ref_participant(8c561001:dc2ce137:5055b4d2:1c1 @ 0x559c5a20f930 <- 8c561001:dc2ce137:5055b4d2:1404 @ 0x559c5a24bd74) user 21 builtin 15
1576167310.543709 [0] add_two_in: READER 8c561001:dc2ce137:5055b4d2:1404 QOS={topic=rq/add_two_intsRequest,type=example_interfaces::srv::dds_::AddTwoInts_Request_,presentation=0:0:0,partition={},group_data=0<>,topic_data=0<>,durability=0,deadline=9223372036.854775807,latency_budget=0.000000000,liveliness=0:9223372036.854775807,reliability=1:1.000000000,destination_order=0,history=1:-1,resource_limits=-1:-1:-1,transport_priority=0,user_data=0<>,ownership=0,time_based_filter=0.000000000,reader_data_lifecycle=9223372036.854775807:9223372036.854775807,reader_lifespan={0,9223372036.854775807},subscription_keys={0,{}},entity_factory=1,ignorelocal=0}
1576167310.543714 [0] add_two_in: write_sample 0:0:0:4c2 #11: ST0 DCPSSubscription/org::eclipse::cyclonedds::builtin::DCPSSubscription:(blob)
1576167310.543716 [0] add_two_in: reader 8c561001:dc2ce137:5055b4d2:207
1576167310.543719 [0] add_two_in: match_reader_with_proxy_writers(rd 8c561001:dc2ce137:5055b4d2:1404) scanning all proxy writers
1576167310.543721 [0] add_two_in: match_reader_with_writers(rd 8c561001:dc2ce137:5055b4d2:1404) scanning all writers
1576167310.543727 [0] add_two_in: sedp: write for 8c561001:dc2ce137:5055b4d2:1404 via 8c561001:dc2ce137:5055b4d2:4c2
1576167310.543730 [0] add_two_in: write_sample 8c561001:dc2ce137:5055b4d2:4c2 #8: ST0 (null)/(null):(plist)

When the client gets brought up, it discovers the server in full:

1576167362.690709 [0] dq.builtin: write_sample 0:0:0:3c2 #9: ST0 DCPSPublication/org::eclipse::cyclonedds::builtin::DCPSPublication:(blob)
1576167362.690712 [0] dq.builtin: reader 1481001:ddd08e4f:524471cd:307
1576167362.690717 [0] dq.builtin: match_proxy_writer_with_readers(pwr 8c561001:dc2ce137:5055b4d2:1103) scanning all readers
1576167362.690722 [0] add_two_in: new_writer(guid 1481001:ddd08e4f:524471cd:503, (default).rr/add_two_ints_client/get_parametersReply/rcl_interfaces::srv::dds_::GetParameters_Response_)
1576167362.690728 [0] dq.builtin: SEDP ST0 8c561001:dc2ce137:5055b4d2:1303 reliable volatile writer: (default).rr/add_two_intsReply/example_interfaces::srv::dds_::AddTwoInts_Response_ NEW (as udp/239.255.0.1:7401 udp/192.168.9.2:38672 ssm=0) QOS={topic=rr/add_two_intsReply,type=example_interfaces::srv::dds_::AddTwoInts_Response_,presentation=0:0:0,partition={},group_data=0<>,topic_data=0<>,durability=0,durability_service=0.000000000:{0:1}:{-1:-1:-1},deadline=9223372036.854775807,latency_budget=0.000000000,liveliness=0:9223372036.854775807,reliability=1:1.000000000,destination_order=0,history=1:-1,resource_limits=-1:-1:-1,transport_priority=0,lifespan=9223372036.854775807,user_data=0<>,ownership=0,ownership_strength=0,writer_data_lifecycle={1},entity_factory=1,ignorelocal=0}
...
1576167362.690954 [0] dq.builtin: write_sample 0:0:0:4c2 #11: ST0 DCPSSubscription/org::eclipse::cyclonedds::builtin::DCPSSubscription:(blob)
1576167362.690957 [0] dq.builtin: reader 1481001:ddd08e4f:524471cd:207
1576167362.690961 [0] dq.builtin: match_proxy_reader_with_writers(prd 8c561001:dc2ce137:5055b4d2:1204) scanning all writers
1576167362.690974 [0] dq.builtin: SEDP ST0 8c561001:dc2ce137:5055b4d2:1404 reliable volatile reader: (default).rq/add_two_intsRequest/example_interfaces::srv::dds_::AddTwoInts_Request_ NEW (as udp/239.255.0.1:7401 udp/192.168.9.2:38672 ssm=0) QOS={topic=rq/add_two_intsRequest,type=example_interfaces::srv::dds_::AddTwoInts_Request_,presentation=0:0:0,partition={},group_data=0<>,topic_data=0<>,durability=0,deadline=9223372036.854775807,latency_budget=0.000000000,liveliness=0:9223372036.854775807,reliability=1:1.000000000,destination_order=0,history=1:-1,resource_limits=-1:-1:-1,transport_priority=0,user_data=0<>,ownership=0,time_based_filter=0.000000000,reader_data_lifecycle=9223372036.854775807:9223372036.854775807,reader_lifespan={0,9223372036.854775807},subscription_keys={0,{}},entity_factory=1,ignorelocal=0}
1576167362.690981 [0] dq.builtin: write_sample 0:0:0:4c2 #12: ST0 DCPSSubscription/org::eclipse::cyclonedds::builtin::DCPSSubscription:(blob)
1576167362.690983 [0] dq.builtin: reader 1481001:ddd08e4f:524471cd:207
1576167362.690987 [0] dq.builtin: match_proxy_reader_with_writers(prd 8c561001:dc2ce137:5055b4d2:1404) scanning all writers

And then it sets up a response reader and a request writer:

1576167362.694616 [0] add_two_in: new_writer(guid 1481001:ddd08e4f:524471cd:1303, (default).rq/add_two_intsRequest/example_interfaces::srv::dds_::AddTwoInts_Request_)
1576167362.694622 [0] add_two_in: ref_participant(1481001:ddd08e4f:524471cd:1c1 @ 0x5582e7add940 <- 1481001:ddd08e4f:524471cd:1303 @ 0x5582e7b1cad4) user 20 builtin 15
1576167362.694631 [0] add_two_in: WRITER 1481001:ddd08e4f:524471cd:1303 QOS={topic=rq/add_two_intsRequest,type=example_interfaces::srv::dds_::AddTwoInts_Request_,presentation=0:0:0,partition={},group_data=0<>,topic_data=0<>,durability=0,durability_service=0.000000000:{0:1}:{-1:-1:-1},deadline=9223372036.854775807,latency_budget=0.000000000,liveliness=0:9223372036.854775807,reliability=1:1.000000000,destination_order=0,history=1:-1,resource_limits=-1:-1:-1,transport_priority=0,lifespan=9223372036.854775807,user_data=0<>,ownership=0,ownership_strength=0,writer_data_lifecycle={1},entity_factory=1,ignorelocal=0}
1576167362.694635 [0] add_two_in: write_sample 0:0:0:3c2 #18: ST0 DCPSPublication/org::eclipse::cyclonedds::builtin::DCPSPublication:(blob)
1576167362.694637 [0] add_two_in: reader 1481001:ddd08e4f:524471cd:307
1576167362.694641 [0] add_two_in: match_writer_with_proxy_readers(wr 1481001:ddd08e4f:524471cd:1303) scanning all proxy readers
1576167362.694644 [0] add_two_in:   proxy_reader_add_connection(wr 1481001:ddd08e4f:524471cd:1303 prd 8c561001:dc2ce137:5055b4d2:1404)
1576167362.694646 [0] add_two_in:   writer_add_connection(wr 1481001:ddd08e4f:524471cd:1303 prd 8c561001:dc2ce137:5055b4d2:1404) - ack seq 0
1576167362.694650 [0] add_two_in: setcover: all_addrs udp/239.255.0.1:7401 udp/192.168.9.2:38672
1576167362.694653 [0] add_two_in: reduced nlocs=2
1576167362.694657 [0] add_two_in:   loc  0 = udp/192.168.9.2:38672           1 { 0 }
1576167362.694659 [0] add_two_in:   loc  1 = udp/239.255.0.1:7401            1 { 0 }
1576167362.694660 [0] add_two_in:   best = 0
1576167362.694662 [0] add_two_in:   simple udp/192.168.9.2:38672
1576167362.694665 [0] add_two_in: rebuild_writer_addrset(1481001:ddd08e4f:524471cd:1303): udp/192.168.9.2:38672
1576167362.694670 [0] add_two_in: match_writer_with_readers(wr 1481001:ddd08e4f:524471cd:1303) scanning all readers
1576167362.694677 [0] add_two_in: sedp: write for 1481001:ddd08e4f:524471cd:1303 via 1481001:ddd08e4f:524471cd:3c2
1576167362.694680 [0] add_two_in: write_sample 1481001:ddd08e4f:524471cd:3c2 #9: ST0 (null)/(null):(plist)
1576167362.694685 [0] add_two_in: non-timed queue now has 1 items
1576167362.694704 [0] add_two_in: new_reader(guid 1481001:ddd08e4f:524471cd:1404, (default).rr/add_two_intsReply/example_interfaces::srv::dds_::AddTwoInts_Response_)
1576167362.694707 [0] add_two_in: ref_participant(1481001:ddd08e4f:524471cd:1c1 @ 0x5582e7add940 <- 1481001:ddd08e4f:524471cd:1404 @ 0x5582e7b1e6b4) user 21 builtin 15
1576167362.694734 [0] add_two_in: READER 1481001:ddd08e4f:524471cd:1404 QOS={topic=rr/add_two_intsReply,type=example_interfaces::srv::dds_::AddTwoInts_Response_,presentation=0:0:0,partition={},group_data=0<>,topic_data=0<>,durability=0,deadline=9223372036.854775807,latency_budget=0.000000000,liveliness=0:9223372036.854775807,reliability=1:1.000000000,destination_order=0,history=1:-1,resource_limits=-1:-1:-1,transport_priority=0,user_data=0<>,ownership=0,time_based_filter=0.000000000,reader_data_lifecycle=9223372036.854775807:9223372036.854775807,reader_lifespan={0,9223372036.854775807},subscription_keys={0,{}},entity_factory=1,ignorelocal=0}
1576167362.694739 [0]        tev: xpack_addmsg 0x7fea58001340 0x5582e7b1e100 0(data(1481001:ddd08e4f:524471cd:3c2:#9/1)): niov 0 sz 0 => now niov 3 sz 228
1576167362.694744 [0] add_two_in: write_sample 0:0:0:4c2 #19: ST0 DCPSSubscription/org::eclipse::cyclonedds::builtin::DCPSSubscription:(blob)
1576167362.694746 [0] add_two_in: reader 1481001:ddd08e4f:524471cd:207
1576167362.694747 [0]         gc: gc 0x5582e7b1f630: not yet, shortsleep
1576167362.694749 [0] add_two_in: match_reader_with_proxy_writers(rd 1481001:ddd08e4f:524471cd:1404) scanning all proxy writers
1576167362.694753 [0] add_two_in:   reader 1481001:ddd08e4f:524471cd:1404 init_acknack_count = 0
1576167362.694755 [0] add_two_in:   reader_add_connection(pwr 8c561001:dc2ce137:5055b4d2:1303 rd 1481001:ddd08e4f:524471cd:1404)
1576167362.694756 [0]        tev: nn_xpack_send 228: 0x7fea5800134c:20 0x5582e7b1e1c8:36 0x5582e7b1dfbc:172 [ udp/192.168.9.2:38672 ]
1576167362.694758 [0] add_two_in:   proxy_writer_add_connection(pwr 8c561001:dc2ce137:5055b4d2:1303 rd 1481001:ddd08e4f:524471cd:1404) - out-of-sync
1576167362.694758 [0]        tev: traffic-xmit (1) 228
1576167362.694761 [0] add_two_in: match_reader_with_writers(rd 1481001:ddd08e4f:524471cd:1404) scanning all writers
1576167362.694768 [0] add_two_in: sedp: write for 1481001:ddd08e4f:524471cd:1404 via 1481001:ddd08e4f:524471cd:4c2
1576167362.694771 [0] add_two_in: write_sample 1481001:ddd08e4f:524471cd:4c2 #8: ST0 (null)/(null):(plist)

The server discovers the client's request writer first:

1576167362.694778 [0] dq.builtin: SEDP ST0 1481001:ddd08e4f:524471cd:1303 reliable volatile writer: (default).rq/add_two_intsRequest/example_interfaces::srv::dds_::AddTwoInts_Request_ NEW (as udp/239.255.0.1:7401 udp/192.168.9.2:37316 ssm=0) QOS={topic=rq/add_two_intsRequest,type=example_interfaces::srv::dds_::AddTwoInts_Request_,presentation=0:0:0,partition={},group_data=0<>,topic_data=0<>,durability=0,durability_service=0.000000000:{0:1}:{-1:-1:-1},deadline=9223372036.854775807,latency_budget=0.000000000,liveliness=0:9223372036.854775807,reliability=1:1.000000000,destination_order=0,history=1:-1,resource_limits=-1:-1:-1,transport_priority=0,lifespan=9223372036.854775807,user_data=0<>,ownership=0,ownership_strength=0,writer_data_lifecycle={1},entity_factory=1,ignorelocal=0}
1576167362.694798 [0] dq.builtin: write_sample 0:0:0:3c2 #18: ST0 DCPSPublication/org::eclipse::cyclonedds::builtin::DCPSPublication:(blob)
1576167362.694800 [0] dq.builtin: reader 8c561001:dc2ce137:5055b4d2:307
1576167362.694803 [0] dq.builtin: match_proxy_writer_with_readers(pwr 1481001:ddd08e4f:524471cd:1303) scanning all readers
1576167362.694826 [0] dq.builtin:   reader 8c561001:dc2ce137:5055b4d2:1404 init_acknack_count = 0
1576167362.694827 [0] dq.builtin:   reader_add_connection(pwr 1481001:ddd08e4f:524471cd:1303 rd 8c561001:dc2ce137:5055b4d2:1404)

The client then proceeds to send a request:

1576167362.694950 [0] add_two_in: write_sample 1481001:ddd08e4f:524471cd:1303 #1: ST0 rq/add_two_intsRequest/example_interfaces::srv::dds_::AddTwoInts_Request_:{1359775513681701906,1,{2,3}}

And the server receives and replies:

1576167362.695160 [0]     recvUC: INFODST(8c561001:dc2ce137:5055b4d2)
1576167362.695162 [0]     recvUC: INFOTS(1576167362.694938812)
1576167362.695171 [0]     recvUC: DATA(1481001:ddd08e4f:524471cd:1303 -> 8c561001:dc2ce137:5055b4d2:1404 #1 L(:1c1 11198.650423) msr_in_sync(8c561001:dc2ce137:5055b4d2:1404 out-of-sync to tlcatchup)data(application, vendor 1.16): 1481001:ddd08e4f:524471cd:1303 #1: ST0 rq/add_two_intsRequest/example_interfaces::srv::dds_::AddTwoInts_Request_:{1359775513681701906,1,{2,3}} =>8c561001:dc2ce137:5055b4d2:1404
1576167362.695190 [0]     recvUC: )
[INFO] [1576167362.695305532] [add_two_ints_server]: Incoming request
a: 2 b: 3
1576167362.695364 [0] add_two_in: write_sample 8c561001:dc2ce137:5055b4d2:403 #1: ST0 rt/rosout/rcl_interfaces::msg::dds_::Log_:{{1576167362,695305532},20,"add_two_ints_server","Incoming request
a: 2 b: 3","/home/michel/Workspaces/another_ros2_ws/src/ros2/demos/demo_nodes_cpp/src/services/add_two_ints_server.cpp","operator()",43}
1576167362.695472 [0] add_two_in: write_sample 8c561001:dc2ce137:5055b4d2:1303 #1: ST0 rr/add_two_intsReply/example_interfaces::srv::dds_::AddTwoInts_Response_:{1359775513681701906,1,{5}}

But it isn't till shortly after that the server discovers the client's response reader:

1576167362.699695 [0] dq.builtin: SEDP ST0 1481001:ddd08e4f:524471cd:1404 reliable volatile reader: (default).rr/add_two_intsReply/example_interfaces::srv::dds_::AddTwoInts_Response_ NEW (as udp/239.255.0.1:7401 udp/192.168.9.2:37316 ssm=0) QOS={topic=rr/add_two_intsReply,type=example_interfaces::srv::dds_::AddTwoInts_Response_,presentation=0:0:0,partition={},group_data=0<>,topic_data=0<>,durability=0,deadline=9223372036.854775807,latency_budget=0.000000000,liveliness=0:9223372036.854775807,reliability=1:1.000000000,destination_order=0,history=1:-1,resource_limits=-1:-1:-1,transport_priority=0,user_data=0<>,ownership=0,time_based_filter=0.000000000,reader_data_lifecycle=9223372036.854775807:9223372036.854775807,reader_lifespan={0,9223372036.854775807},subscription_keys={0,{}},entity_factory=1,ignorelocal=0}
1576167362.699707 [0] dq.builtin: write_sample 0:0:0:4c2 #19: ST0 DCPSSubscription/org::eclipse::cyclonedds::builtin::DCPSSubscription:(blob)
1576167362.699709 [0] dq.builtin: reader 8c561001:dc2ce137:5055b4d2:207
1576167362.699711 [0] dq.builtin: match_proxy_reader_with_writers(prd 1481001:ddd08e4f:524471cd:1404) scanning all writers
1576167362.699713 [0]         gc: gc 0x7f57b800be60: not yet, shortsleep
1576167362.699715 [0] dq.builtin:   proxy_reader_add_connection(wr 8c561001:dc2ce137:5055b4d2:1303 prd 1481001:ddd08e4f:524471cd:1404)
1576167362.699730 [0] dq.builtin:   writer_add_connection(wr 8c561001:dc2ce137:5055b4d2:1303 prd 1481001:ddd08e4f:524471cd:1404) - ack seq 1
1576167362.699733 [0] dq.builtin: setcover: all_addrs udp/239.255.0.1:7401 udp/192.168.9.2:37316
1576167362.699735 [0] dq.builtin: reduced nlocs=2
1576167362.699739 [0] dq.builtin:   loc  0 = udp/192.168.9.2:37316           1 { 0 }
1576167362.699742 [0] dq.builtin:   loc  1 = udp/239.255.0.1:7401            1 { 0 }
1576167362.699744 [0] dq.builtin:   best = 0
1576167362.699746 [0] dq.builtin:   simple udp/192.168.9.2:37316
1576167362.699749 [0] dq.builtin: rebuild_writer_addrset(8c561001:dc2ce137:5055b4d2:1303): udp/192.168.9.2:37316

I've updated both logs here. Mind to check if my logic is correct? Maybe I'm missing something in the midst of the ack/heartbeat dance.

eboasson · 2019-12-12T20:53:46Z

@hidmic your analysis is entirely correct. I'm a bit surprised that the dance works out this way, but it is definitely what happens.

@dirk-thomas's "wait_for_client" would solve this, but it is a bit of tricky operation in that it must gracefully deal with the client disappearing while the service is waiting. What I would find more elegant is for the client to first wait for the server to have discovered the response reader, and only then send the request. That way the server will never have to wait and is still certain that the response will go out properly.

I think that that is possible in principle: guarantee that the endpoint discovery data doesn't get acknowledged until after it has been processed completely, and then offer a way of doing "wait_for_acknowledgements" on the endpoint discovery writers. But I don't dare predict when that'll be available ...

hidmic · 2019-12-17T19:21:27Z

What I would find more elegant is for the client to first wait for the server to have discovered the response reader, and only then send the request.

IMHO client.wait_for_service() success should imply that both service server and client have discovered each other or, in other words, that a request/response exchange may occur without risking either of the two getting lost. So I agree in principle with your idea @eboasson.

But I don't dare predict when that'll be available ...

Services are a core functionality in ROS 2. Meaning, I personally think we need a fix for this ASAP.

* Block rmw_send_response if response reader unknown The client checks using rmw_service_server_is_available whether the request it sends will be delivered to service, but that does not imply that the (independent, as far as DDS is concerned) response reader of the client has been discovered by the service. Usually that will be the case, but there is no guarantee. Ideally DDS would offer an interface that allows checking the reverse discovery, but that does not yet exist in either the specification or in Cyclone. This commit works around that by delaying publishing the response until the number of request writers matches the number of response readers. Signed-off-by: Erik Boasson <eb@ilities.com> * Change request headers to use rmw_request_id_t on the wire Signed-off-by: Erik Boasson <eb@ilities.com> * Precise check for matched client/service Assign a unique identifier to each client/service on creation, add it to the USER_DATA QoS of the reader and writer and use it for the request ids. This allows: * rmw_service_server_is_available to only return true once it has discovered a reader/writer pair of a single service (rather than a reader from some service and a writer from some service); and * rmw_send_response to block until it has discovered the requesting client's response reader and to abandon the operation when the client has disappeared. The USER_DATA is formatted in the same manner as the participant USER_DATA, this uses the keys "serviceid" and "clientid". This is still but a workaround for having a mechanism in DDS to ensure that the response reader has been discovered prior by the request writer prior to sending the request. Signed-off-by: Erik Boasson <eb@ilities.com> * Address review comments Signed-off-by: Erik Boasson <eb@ilities.com> * Backwards compatibility * Revert commit fb040c5 to retain the old wire representation; * Embed the publication_handle of the request inside rmw_request_id_t, possible because reverting to the old wire representation frees up enough space, and use this in rmw_send_response to check for the presence of the client's reader; * Clients and services without a client/service id in the reader/writer user data are treated as fully matched at all times. * Replace ERROR by FAILURE to because of windows.h Signed-off-by: Erik Boasson <eb@ilities.com> * Timeout rmw_send_response after waiting 100ms for discovery The discovery will eventually result in the client's reader being known or its writer no longer being known, so a timeout is not necessary for correctness. However, if it ever were to block for a longish time (which is possible in the face of network failures), returning a timeout to the caller is expected to result in less confusion. Signed-off-by: Erik Boasson <eb@ilities.com> * Make iterators "const auto &" Signed-off-by: Erik Boasson <eb@ilities.com> * Add TODO for eliminating rmw_send_response blocking Signed-off-by: Erik Boasson <eb@ilities.com>

* Block rmw_send_response if response reader unknown The client checks using rmw_service_server_is_available whether the request it sends will be delivered to service, but that does not imply that the (independent, as far as DDS is concerned) response reader of the client has been discovered by the service. Usually that will be the case, but there is no guarantee. Ideally DDS would offer an interface that allows checking the reverse discovery, but that does not yet exist in either the specification or in Cyclone. This commit works around that by delaying publishing the response until the number of request writers matches the number of response readers. Signed-off-by: Erik Boasson <eb@ilities.com> * Change request headers to use rmw_request_id_t on the wire Signed-off-by: Erik Boasson <eb@ilities.com> * Precise check for matched client/service Assign a unique identifier to each client/service on creation, add it to the USER_DATA QoS of the reader and writer and use it for the request ids. This allows: * rmw_service_server_is_available to only return true once it has discovered a reader/writer pair of a single service (rather than a reader from some service and a writer from some service); and * rmw_send_response to block until it has discovered the requesting client's response reader and to abandon the operation when the client has disappeared. The USER_DATA is formatted in the same manner as the participant USER_DATA, this uses the keys "serviceid" and "clientid". This is still but a workaround for having a mechanism in DDS to ensure that the response reader has been discovered prior by the request writer prior to sending the request. Signed-off-by: Erik Boasson <eb@ilities.com> * Address review comments Signed-off-by: Erik Boasson <eb@ilities.com> * Backwards compatibility * Revert commit fb040c5 to retain the old wire representation; * Embed the publication_handle of the request inside rmw_request_id_t, possible because reverting to the old wire representation frees up enough space, and use this in rmw_send_response to check for the presence of the client's reader; * Clients and services without a client/service id in the reader/writer user data are treated as fully matched at all times. * Replace ERROR by FAILURE to because of windows.h Signed-off-by: Erik Boasson <eb@ilities.com> * Timeout rmw_send_response after waiting 100ms for discovery The discovery will eventually result in the client's reader being known or its writer no longer being known, so a timeout is not necessary for correctness. However, if it ever were to block for a longish time (which is possible in the face of network failures), returning a timeout to the caller is expected to result in less confusion. Signed-off-by: Erik Boasson <eb@ilities.com> * Make iterators "const auto &" Signed-off-by: Erik Boasson <eb@ilities.com> * Add TODO for eliminating rmw_send_response blocking Signed-off-by: Erik Boasson <eb@ilities.com> Co-authored-by: eboasson <eb@ilities.com>

hidmic · 2020-10-21T14:47:59Z

IIRC #187 worked around this problem. @eboasson is there an ETA for a complete solution (e.g. a DDS-RPC like implementation for robust discovery or similar)?

hidmic · 2020-11-20T15:39:03Z

Some re-design discussion took place in ros2/rmw_fastrtps#418.

abhisheklolage · 2024-02-05T23:43:33Z

Hi,
I'm reaching out regarding an issue where I am seeing in a ROS client-service setup - we're actually observing a memory leak where the rate of the leak increases as we increase the frequency of the client-service interactions. Essentially, the value for $(cat /proc/$PID/smaps | grep -i 'Pss:' | awk '{Total+=$2} END {print Total/1000" MB"}') increases over time. Is this related or is there a separate known issue with ROS2 client-service with Cyclone DDS? Are there any workarounds?

System setup:

ROS2 Foxy from ros2/ros2@75b13d3
Cyclone DDS 0.7.0 / rmw-cyclonedds 0.7.11
Build running on arm64

Tagging @dirk-thomas for some insight on memory leak.

Thanks!

wjwwood transferred this issue from ros2/rclcpp Dec 5, 2019

wjwwood assigned hidmic Dec 5, 2019

dirk-thomas added the bug Something isn't working label Dec 6, 2019

wjwwood mentioned this issue Dec 12, 2019

Action client examples not working with CycloneDDS ros2/examples#257

Closed

hidmic mentioned this issue May 1, 2020

test_parameter_server_cpp times out ros2/system_tests#420

Closed

eboasson mentioned this issue May 16, 2020

test_parameter_server_cpp__rmw_cyclonedds_cpp is failing most of the time on linux #183

Closed

hidmic mentioned this issue May 20, 2020

Flaky services performance ros2/ros2#922

Closed

eboasson mentioned this issue May 20, 2020

Lost service responses (#183, #74) #187

Merged

eboasson mentioned this issue May 29, 2020

Restore dashing/eloquent behaviour of "service_is_available" #190

Merged

ivanpauno mentioned this issue Oct 14, 2020

Cyclone DDS fails nightly CI tests ros-navigation/navigation2#2043

Closed

hidmic removed their assignment Nov 20, 2020

MichaelOrlov mentioned this issue Jan 16, 2024

👨‍🌾 Regression in test_play_{timing,services}__rmw_{rmw_vendor} on the buildfarm jobs ros2/rosbag2#862

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Not getting service responses reliably when using CycloneDDS #74

Not getting service responses reliably when using CycloneDDS #74

hidmic commented Nov 14, 2019

hidmic commented Dec 6, 2019

dirk-thomas commented Dec 6, 2019

eboasson commented Dec 6, 2019

hidmic commented Dec 12, 2019

eboasson commented Dec 12, 2019

hidmic commented Dec 17, 2019

hidmic commented Oct 21, 2020

hidmic commented Nov 20, 2020

abhisheklolage commented Feb 5, 2024 •

edited

Loading

Not getting service responses reliably when using CycloneDDS #74

Not getting service responses reliably when using CycloneDDS #74

Comments

hidmic commented Nov 14, 2019

Bug report

Steps to reproduce issue

Expected behavior

Actual behavior

Additional information

hidmic commented Dec 6, 2019

dirk-thomas commented Dec 6, 2019

eboasson commented Dec 6, 2019

hidmic commented Dec 12, 2019

eboasson commented Dec 12, 2019

hidmic commented Dec 17, 2019

hidmic commented Oct 21, 2020

hidmic commented Nov 20, 2020

abhisheklolage commented Feb 5, 2024 • edited Loading

abhisheklolage commented Feb 5, 2024 •

edited

Loading