Lost service responses (#183, #74) #187

eboasson · 2020-05-20T20:19:53Z

This PR addresses the service invocation problems by fixing one silly bug in the rmw_service_server_is_available code (not actually checking the number of matched endpoints), and by blocking in rmw_send_response until there is reasonable evidence that the response reader has been discovered.

The proper solution (as discussed in #74) makes the rmw_service_server_is_available return false until this point has been reached, but as of today, neither the DDS specification nor Cyclone DDS provides the means to do that without the application exchanging information on what has been discovered. It can be done easily enough, but it is a rather significant burden for what is ultimately a rare problem.

Without the workaround (but with the bug fix) it rarely fails. With the workaround added, I have not been able to reproduce it anymore. I've only seen multiple waits in sequence by introducing significant packet loss.

I do expect this workaround to be somewhat controversial ...

hidmic

Thanks for fixing this @eboasson! At first glance, we should be able to backport this for Eloquent and Dashing. Do you agree? Or has anything fundamentally changed within rmw_cyclonedds_cpp since?

@jacobperron I think it'd be great if we can get this in for Foxy.

hidmic · 2020-05-20T22:03:37Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+  //
+  // it is pretty horrid ... but blocking the service is the only option if
+  // the client is unable to determine that it has been fully discovered by
+  // the service.


it is pretty horrid ...

I'm fine with this as long as we agree that this isn't a solution but a workaround for distros up to and including Foxy.

Wait why is this needed? Shouldn't the requester wait for a bidirectional match before sending a request? If, in the meantime, the response channel goes away I think this should just fail, not block... Blocking without an external option for a timeout is not very nice in my opinion. The user may not even know it is blocking or for how long.

The requester is definitely the one that should be doing the waiting. Unfortunately, DDS doesn’t offer a mechanism for waiting for a bidirectional match. Cyclone does guarantee that the writer having matched the reader is sufficient for that reader to receive the data (modulo disconnections and things like that), but that gives you one-way only.

Secondly, the simple counting that is done here (and in other RMW layers) only works if there is but a single server. You really ought to wait until you have a matched pair, but that would additionally require DDS to be aware of these pairs, and it isn’t. We could use, e.g., the USER_DATA QoS to make the mapping clear at the RMW level, but if service interoperability is desirable that needs to be agreed upon by a wider audience.

What has been on my to-do list for Cyclone for some time is to make it possible to wait until all local entities have been discovered by all remote entities, much like “wait for acknowledgements” allows you to do this for a regular writer and its readers. Even then, there are little details, for example, one would have to guarantee that effects of a discovery message node P sent to node Q have occurred prior to P receiving Q’s acknowledgment. That’s not guaranteed anywhere in the spec and also not today in Cyclone because it processes all discovery data asynchronously, but it is an important primitive.

As to this change: I considered a variant (check in rmw_take_request) but the same quality of implementation comes out way more complicated there. A good one would put the request aside and retry/prune based on “publication matched” events. It’s feasible (I should still have a proof-of-concept implementation lying around for something similar as far as the waitset goes) but it is significantly more complex. That’d probably be time better spent on fixing the root cause.

Another mapping would use transient local requests and responses, using the request id as key. It is easy enough in Cyclone, but that’d also be quite a big departure from what has been done until now (by any RMW as far as I know).

The only indefinite blocking this does is when there are clients arriving and leaving all the time. If the current client simply disappears, both counters will end up being the same once the disappearance is detected. But, yeah ...

Actually, on continued consideration, I could probably make it work without this hack and without adding a "wait for discovery to complete" operation, but I do need to think through the consequences of the changes it would require in Cyclone. It also wouldn't solve the problem if there are multiple matching services.

So it is not yet a given that it'll happen, and it certainly won't be available immediately.

I think we can all agree that this isn't the correct fix, but a workaround. And @eboasson has been open about it from the get-go. As I see it, the sole purpose of this patch is to reduce the likelihood of a discovery race impairing service performance.

Considering the current behavior is objectively worse (once we've guaranteed an indefinite wait cannot happen here) and that we're about to release an LTS distro, I'd be very much inclined to land this and document accordingly. I'm sure @jacobperron and @dirk-thomas have their own opinions as well.

IIUC, it seems like a general design flaw with trying to build ROS services on top of DDS, which plagues all of our RMWs.

A workaround sounds okay for now if it's making the situation better; in this particular case, I would want certainty that the logic is not going to cause an indefinite wait. Maybe that means introducing a timeout.

hidmic · 2020-05-20T22:07:05Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

@@ -3135,6 +3147,16 @@ extern "C" rmw_ret_t rmw_send_response(
  cdds_request_header_t header;
  memcpy(&header.guid, request_header->writer_guid, sizeof(header.guid));
  header.seq = request_header->sequence_number;
+  // if the number of writers matching our request reader equals the number
+  // of readers matching our response writer, we have a pretty decent claim
+  // to have discovered this client's response reader.


@eboasson nit: consider moving this comment block closer to the check_for_response_reader() function.

we have a pretty decent claim

👀

Any chance we can tell if remote writer and reader belong to the same server instance? Or at least to the same participant?

I actually started working out how to do that, and it is definitely possible to determine they are from the same participant. But now that all nodes in a process (well, context) share the same participant that seems insufficiently selective. I couldn’t find a way to tie them to the node (which I think is selective enough) short of using USER_DATA. That seemed like it might be unwise. I suppose the graph cache interface could be extended to answer this question.

What we do know here is that we have discovered the client’s writer (else we could never have received its request), and that therefore the client’s writer is accounted for in the “subscription matched” status. Thus, it is pretty likely that the service writer’s match count won’t equal the service’s reader match count unless that client’s reader has been discovered. But there is no guarantee: if another client shows up and its reader happens to be discovered before its writer, you will draw the wrong conclusion.

In the absence of packet loss, that is highly unlikely: the client creates its writer first, so under that assumption the discovery of the writer will precede the discovery of the reader. Indeed, I wouldn’t be surprised if simply swapping the creation of the reader and the writer in the client would make the problem disappear in the test setup.
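To make the counting argument concrete, here is a minimal sketch in plain C++ with no DDS calls. `MatchCounts`, `response_reader_probably_discovered` and `false_positive_scenario` are hypothetical names; the two fields merely stand in for the match counts of the service's request reader and response writer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the two match counts a service can observe.
struct MatchCounts
{
  uint32_t request_writers;   // client writers matched by the service's request reader
  uint32_t response_readers;  // client readers matched by the service's response writer
};

// The heuristic under discussion: equal counts suggest that every client whose
// writer is known also has its reader known. It is a guess, not a proof.
inline bool response_reader_probably_discovered(const MatchCounts & c)
{
  return c.request_writers == c.response_readers;
}

// The failure mode described above, built up step by step.
inline MatchCounts false_positive_scenario()
{
  MatchCounts c{0, 0};
  c.request_writers += 1;   // client A's writer: known, since A's request arrived
  c.response_readers += 1;  // client B's reader: discovered before B's writer
  // A's reader is still unknown, yet the counts are equal.
  assert(c.request_writers == c.response_readers);
  return c;
}
```

In the last scenario the heuristic answers "discovered" even though client A's response reader is not, which is the wrong conclusion mentioned above.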

Indeed, I wouldn’t be surprised if simply swapping the creation of the reader and the writer in the client would make the problem disappear in the test setup.

Interesting. Have you tried it? Whatever we can do to reduce the likelihood of an indefinite wait is worth exploring IMO.

hidmic · 2020-05-20T22:09:44Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+  // the service.
+  while (!check_for_response_reader(info->service.sub->enth, info->service.pub->enth)) {
+    dds_sleepfor(DDS_MSECS(10));
+  }


@eboasson what if the client goes away before sending the response? I'm fine with the busy wait, but it should time out at some point. Unless we can detect that the request writer went away.

The DDS discovery will discover the disappearance of the client, remove all matches with it and decrement the current_count. So I believe that case is covered. (But as I remarked above, there is a problem if you create/delete clients all the time.)

I see. I'm still a bit wary about indefinite waits. Even if it rarely occurs, timing out and failing to reply would give calling code a chance to do something about it.

@eboasson any chance we can leave the loop after some period of time? I'd expect it to be innocuous, unless there's such traffic loss that it's unable to succeed in, say, 100 ms or more. In which case having the service server throw would be better than having either service server or service client hang silently.
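For illustration only, a bounded version of such a polling loop could look like the sketch below. It uses the standard library instead of dds_sleepfor so that it stands alone; `wait_result_t` and `wait_until_ready` are hypothetical names, and the predicate stands in for check_for_response_reader.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

enum class wait_result_t { READY, TIMED_OUT };

// Poll `ready` every `interval` until it returns true or `timeout` has elapsed.
inline wait_result_t wait_until_ready(
  const std::function<bool()> & ready,
  std::chrono::milliseconds timeout,
  std::chrono::milliseconds interval)
{
  assert(interval.count() > 0);
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!ready()) {
    if (std::chrono::steady_clock::now() >= deadline) {
      // Give the caller a chance to react instead of hanging silently.
      return wait_result_t::TIMED_OUT;
    }
    std::this_thread::sleep_for(interval);
  }
  return wait_result_t::READY;
}
```

With the numbers mentioned in this thread, a caller would pass a 100 ms timeout and a 10 ms interval and map TIMED_OUT to an error return of its choosing.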

hidmic · 2020-05-20T22:10:05Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

@@ -3628,7 +3650,6 @@ extern "C" rmw_ret_t rmw_service_server_is_available(
  ret =
    common_context->graph_cache.get_writer_count(sub_topic_name, &number_of_response_publishers);
  if (ret != RMW_RET_OK || 0 == number_of_response_publishers) {
-    // error


@eboasson why this change?

The error case was self-evident and the comment was incorrect for the common case (no response publishers discovered yet). But, yeah, perhaps that ought to be a separate commit.

hidmic · 2020-05-20T22:10:25Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+  if (dds_get_subscription_matched_status(request_reader, &sm) < 0 ||
+    dds_get_publication_matched_status(response_writer, &pm) < 0)
+  {
+    return RMW_RET_ERROR;


@eboasson

Suggested change

return RMW_RET_ERROR;

return false;

?

Oops! Thanks! It’ll probably never get there (the function can’t fail if the reader/writer exist) but that’s no excuse for messing up the type so badly ...

I have to think about the fix; it might be that returning true is better than returning false, because if this fails once, it’ll probably fail the next time, too. Returning true would definitely require a comment, and it’d probably make more sense to return an rmw_ret_t.

+1 to discriminating error conditions from entity absence.

eboasson · 2020-05-22T11:07:01Z

Apologies for the force push ... the original two commits are still there, but I noticed that the subject line mentioned the wrong function and I didn't want to run the risk of that mistake getting merged.

What is new is:

fb040c5 which changes the request headers on the wire to use rmw_request_id_t (so a 128-bit client identifier + sequence number instead of a 64-bit client identifier). The extra room is then used by
4c3b8fa to do a precise check whether a reader/writer pair of a single server has been matched (in rmw_service_server_is_available) and whether the requesting client's response reader has been matched (in rmw_send_response).

The mechanism employed is generating unique client/service identifiers (based on the participant GUID), storing these in the reader/writer USER_DATA QoS (as a key/value pair, using serviceid and clientid as keys) and in the GUID part of the rmw_request_id_t, and then using the various operations to interrogate information on the matched endpoints to see if there is a match.
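As a rough sketch of the key/value idea, assuming a `key=value;` text layout for the USER_DATA bytes: the helpers below are hypothetical and simplified (they work on strings rather than byte vectors) and are not the parser from rmw.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: serialize one identifier as "key=value;".
inline std::string make_user_data(const std::string & key, const std::string & value)
{
  assert(key.find_first_of("=;") == std::string::npos);
  assert(value.find(';') == std::string::npos);
  return key + "=" + value + ";";
}

// Simplified stand-in for a key/value parser over the USER_DATA contents.
inline std::map<std::string, std::string> parse_user_data_string(const std::string & ud)
{
  std::map<std::string, std::string> result;
  std::string::size_type pos = 0;
  while (pos < ud.size()) {
    const auto end = ud.find(';', pos);
    if (end == std::string::npos) {
      break;  // ignore an unterminated trailing entry
    }
    const auto eq = ud.find('=', pos);
    if (eq != std::string::npos && eq < end) {
      result[ud.substr(pos, eq - pos)] = ud.substr(eq + 1, end - eq - 1);
    }
    pos = end + 1;  // continue after the ';'
  }
  return result;
}
```

A client would then advertise something like `make_user_data("clientid", id)` on its reader and writer, and the service would look up the "clientid" key of each matched endpoint to compare it against the identifier carried in the request.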

It adds overhead and complexity to the previous proposal but as far as I can tell, it doesn't suffer from ever having to wait indefinitely (it is by design immune to adding/removing other clients, and it still handles the case where the client has disappeared).

But clearly this is still a workaround.

hidmic

Thanks for putting in the time, @eboasson! I wonder though, is this change backwards compatible (i.e. it can work with a peer participant running without this patch, with the known potential races)? Also, won't this change compromise cross-vendor communication?

hidmic · 2020-05-26T15:08:39Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+  void * ud;
+  size_t udsz;
+  if (!dds_qget_userdata(qos, &ud, &udsz)) {
+    std::map<std::string, std::vector<uint8_t>> emptymap;


@eboasson nit: I'd think any compiler would be smart enough to copy-elide the return value, but

    std::map<std::string, std::vector<uint8_t>> map;
    void * ud;
    size_t udsz;
    if (dds_qget_userdata(qos, &ud, &udsz)) {
      std::vector<uint8_t> udvec(static_cast<uint8_t *>(ud), static_cast<uint8_t *>(ud) + udsz);
      dds_free(ud);
      map = rmw::impl::cpp::parse_key_value(udvec);
    }
    return map;

would make it simpler and slightly clearer.

hidmic · 2020-05-26T15:21:26Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+  dds_entity_t writer, dds_instance_handle_t readerih)
+{
+  std::unique_ptr<dds_builtintopic_endpoint_t, std::function<void(dds_builtintopic_endpoint_t *)>>
+  ep(dds_get_matched_subscription_data(writer, readerih), &free_builtintopic_endpoint);


@eboasson nit:

Suggested change

ep(dds_get_matched_subscription_data(writer, readerih), &free_builtintopic_endpoint);

ep(dds_get_matched_subscription_data(writer, readerih), free_builtintopic_endpoint);

hidmic · 2020-05-26T15:21:56Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+  dds_entity_t reader, dds_instance_handle_t writerih)
+{
+  std::unique_ptr<dds_builtintopic_endpoint_t, std::function<void(dds_builtintopic_endpoint_t *)>>
+  ep(dds_get_matched_publication_data(reader, writerih), &free_builtintopic_endpoint);


@eboasson nit:

Suggested change

ep(dds_get_matched_publication_data(reader, writerih), &free_builtintopic_endpoint);

ep(dds_get_matched_publication_data(reader, writerih), free_builtintopic_endpoint);

hidmic · 2020-05-26T15:23:40Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+{
+  std::ostringstream os;
+  os << std::hex;
+  os << std::setw(2) << static_cast<int>(static_cast<uint8_t>(id.writer_guid[0]));


@eboasson why static_cast<uint8_t>(...) ? Are you expecting to truncate it?

That cast is there to avoid sign extension (id.writer_guid is an array of int8_t ...)
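A small standalone sketch of the effect, using a hypothetical helper `byte_to_hex` that formats one signed byte with and without the intermediate cast:

```cpp
#include <cassert>
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Format one signed byte as hex, optionally going through uint8_t first.
inline std::string byte_to_hex(int8_t b, bool via_uint8)
{
  std::ostringstream os;
  os << std::hex << std::setfill('0') << std::setw(2);
  if (via_uint8) {
    // 0..255, so at most two hex digits
    os << static_cast<int>(static_cast<uint8_t>(b));
  } else {
    // negative bytes are sign-extended to a full int before formatting
    os << static_cast<int>(b);
  }
  return os.str();
}
```

For a byte holding -1, the cast through uint8_t yields "ff", while the direct conversion prints every hex digit of the sign-extended int. Non-negative bytes come out the same either way.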

hidmic · 2020-05-26T15:32:14Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+  for (auto rdih : rds) {
+    auto rd = get_matched_subscription_data(client.pub->enth, rdih);
+    std::string serviceid;
+    if (rd.get() && get_user_data_key(rd->qos, "serviceid", serviceid)) {


@eboasson nit:

Suggested change

if (rd.get() && get_user_data_key(rd->qos, "serviceid", serviceid)) {

if (rd && get_user_data_key(rd->qos, "serviceid", serviceid)) {

hidmic · 2020-05-26T15:33:16Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+  for (auto wrih : wrs) {
+    auto wr = get_matched_publication_data(client.sub->enth, wrih);
+    std::string serviceid;
+    if (wr.get() &&


@eboasson nit:

Suggested change

if (wr.get() &&

if (wr &&

hidmic · 2020-05-26T15:36:06Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+        wr->qos, "serviceid",
+        serviceid) && needles.find(serviceid) != needles.end())
+    {
+      *is_available = true;


@eboasson consider asserting that is_available != nullptr && !*is_available, which seems to be the implicit assumption in this function.

hidmic · 2020-05-26T15:41:40Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+  }
+  const std::string needle = request_id_writer_guid_to_string(reqid);
+  // if we have matched this client's reader, all is well
+  for (auto rdih : rds) {


@eboasson nit:

Suggested change

for (auto rdih : rds) {

for (auto & rdih : rds) {

Given that dds_instance_handle_t is just a 64-bit integer, I don't understand why using a reference would be an improvement.

hidmic · 2020-05-26T15:42:00Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+  }
+  // if not, we should stop waiting if the writer is no longer there,
+  // as that implies the client no longer exists
+  for (auto wrih : wrs) {


@eboasson nit:

Suggested change

for (auto wrih : wrs) {

for (auto & wrih : wrs) {

hidmic · 2020-05-26T15:47:09Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+      // break instead of returns makes gcc happy
+      break;
+    case client_present_t::MAYBE:
+      return RMW_RET_TIMEOUT;


@eboasson how could st ever be client_present_t::MAYBE at this point if that wouldn't allow it to leave the loop in the first place?

True, and it would be better to combine it with the ERROR case. I usually go out of my way to treat enums as enumerated types and cover all the cases in switches over them. But C & C++ semantics don't guarantee that one can assign only defined values to an object of an enumerated type, and what then happens is compiler dependent.

Clang doesn't warn if you cover all cases of an enum, presumably on the assumption that you treat the enum as an enum (or perhaps it proves that no other values ever get assigned to it). Gcc warns, presumably because technically other values are legal too and, again presumably, it doesn't do the analysis to prove that no other values ever get assigned to it. As the comment suggests, there used to be a return in line 3294. With that, the MAYBE case makes a bit more sense. Now, it just looks silly.
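A sketch of the pattern being described, under stated assumptions: only ERROR and MAYBE appear in this thread, so the YES and GONE enumerators, the `status_t` type and the mapping itself are hypothetical.

```cpp
#include <cassert>

// Hypothetical result codes standing in for rmw_ret_t values.
enum class status_t { OK, FAILED, TIMED_OUT };

// ERROR and MAYBE are named in the review; YES and GONE are assumed here.
enum class client_present_t { ERROR, MAYBE, YES, GONE };

inline status_t to_status(client_present_t st)
{
  switch (st) {
    case client_present_t::YES:
      return status_t::OK;
    case client_present_t::GONE:
      return status_t::OK;  // assumption: nobody is left to receive the response
    case client_present_t::MAYBE:
      return status_t::TIMED_OUT;
    case client_present_t::ERROR:
      // break instead of return, so there is a return after the switch
      break;
  }
  // Reached for ERROR and for any value outside the enumerators; having a
  // return here is what keeps gcc from warning about falling off the end.
  return status_t::FAILED;
}
```

Every enumerator is still listed, so a compiler that checks switch coverage can flag a newly added case, while an out-of-range value ends up on the failure path.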

ivanpauno

I've left some comments, but the new approach looks pretty good to me

ivanpauno · 2020-05-26T17:39:03Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

@@ -490,6 +508,33 @@ static void get_entity_gid(dds_entity_t h, rmw_gid_t & gid)
  convert_guid_to_gid(guid, gid);
 }

+static std::map<std::string, std::vector<uint8_t>> parse_user_data(const dds_qos_t * qos)


nit: prefer std::unordered_map

parse_key_value from rmw uses std::map so I think it is better to leave this as-is

ivanpauno · 2020-05-26T18:17:59Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+  //   sizeof((reinterpret_cast<rmw_request_id_t *>(0))->writer_guid)
+  //
+  // is not a constant, and the 16 is a hard-coded magic number in
+  // rmw_request_id_t ...


I don't fully follow the comment, but you can also suppress a cpplint warning by appending // NOLINT to the problematic line of code.

There are actually two comments with bad formatting: one that I put in originally: "strangely, the writer_guid in a request id is smaller than the rmw_gid_t". I find it weird that the writer_guid in a rmw_request_id_t is 16 bytes whereas a rmw_gid_t has 24 bytes.

The second bit is about lint. I didn't know about // NOLINT, but that's a better solution than a hard-coded constant. Thanks!

ivanpauno · 2020-05-26T18:25:51Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+static rmw_ret_t get_matched_endpoints(
+  dds_entity_t h, dds_return_t (* fn)(
+    dds_entity_t h,
+    dds_instance_handle_t * xs, size_t nxs), std::vector<dds_instance_handle_t> & res)


nit: defining the function pointer type beforehand greatly improves readability:

    using get_matched_endpoints_fn_t = dds_return_t (*)(
      dds_entity_t h,
      dds_instance_handle_t * xs, size_t nxs);

    static rmw_ret_t get_matched_endpoints(
      dds_entity_t h, get_matched_endpoints_fn_t fn, std::vector<dds_instance_handle_t> & res)

I must have seen too many C function pointer types ... :) but yes, that would be wise

ivanpauno · 2020-05-26T18:31:53Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+  dds_free(e);
+}
+
+static std::unique_ptr<dds_builtintopic_endpoint_t,


similarly here:

using BuiltinTopicEndpoint = std::unique_ptr<dds_builtintopic_endpoint_t, std::function<void(dds_builtintopic_endpoint_t *)>>;

ivanpauno · 2020-05-26T18:34:57Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+  std::ostringstream os;
+  os << std::hex;
+  os << std::setw(2) << static_cast<int>(static_cast<uint8_t>(id.writer_guid[0]));
+  for (size_t i = 1; i < sizeof(id.writer_guid) - 1; i++) {


should be < sizeof(id.writer_guid)?
it seems that the last byte won't be copied if not.

Yes, absolutely. (I changed it from 0 .. n-1 to 1..n, except I didn't change the upper bound.) There are not enough services/clients in tests to catch this one. This would have been a source of nasty bugs ...
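A sketch of the conversion with the full range covered. `request_id_t` is a hypothetical stand-in for the 16-byte writer_guid field, and the ':' separator is an arbitrary choice for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical stand-in for the 16-byte writer_guid of a request id.
struct request_id_t
{
  int8_t writer_guid[16];
};

inline std::string guid_to_string(const request_id_t & id)
{
  std::ostringstream os;
  os << std::hex << std::setfill('0');
  // The upper bound is sizeof(...), not sizeof(...) - 1, so the last byte is included.
  for (std::size_t i = 0; i < sizeof(id.writer_guid); i++) {
    if (i > 0) {
      os << ':';
    }
    // std::setw only applies to the next insertion, so repeat it per byte.
    os << std::setw(2) << static_cast<int>(static_cast<uint8_t>(id.writer_guid[i]));
  }
  return os.str();
}
```

Two identifiers that differ only in their final byte now map to different strings, which the "- 1" bound would have missed.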

ivanpauno · 2020-05-26T18:36:25Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+  std::ostringstream os;
+  os << std::hex;
+  os << std::setw(2) << static_cast<int>(id.data[0]);
+  for (size_t i = 1; i < sizeof(id.data) - 1; i++) {


same about the - 1, it doesn't sound correct ...

ivanpauno · 2020-05-26T18:52:10Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+  }
+  // first extract all service ids from matched readers
+  std::set<std::string> needles;
+  for (auto rdih : rds) {


nit: const auto &

This still applies.

ivanpauno · 2020-05-26T18:52:27Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+    return RMW_RET_OK;
+  }
+  // then scan the writers to see if there is at least one with a service id in the set
+  for (auto wrih : wrs) {


nit: const auto &

eboasson · 2020-05-26T21:00:52Z

@hidmic and @ivanpauno thanks for reviewing (and catching a few bugs!). I'm fine with all the nitpicks. I'll sort them out tomorrow, but I thought it might make sense to quickly respond to the comments given the time zone differences and the urgency.

@hidmic, regarding your two overall questions:

I wonder though, is this change backwards compatible (i.e. it can work with a peer participant running without this patch, with the known potential races)?

It isn't: firstly, the wire representation of the request id changed from the quick hack of yore to the kind-of sensible rmw_request_id_t. While that change is convenient, it is not strictly necessary: the alternative is to convert the publication_handle of the request in rmw_take_request to the client id, and then associate the rmw_request_id_t with the client id internally. Then rmw_send_response can use the request id to look up the GUID, and so on. It is messy ... and it does assume that all requests result in exactly one response.

Secondly, it now only matches services/clients that have these identifiers. Making it backwards compatible requires treating an "unidentified" reader and writer pair as sufficient in rmw_service_server_is_available and never waiting in rmw_send_response.
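The availability rule with that fallback can be sketched on plain data as follows. `endpoint_t` and `service_available` are hypothetical; an empty serviceid models a peer that does not advertise an identifier.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical view of a matched remote endpoint: the "serviceid" taken from
// its USER_DATA, or an empty string if it advertises none (an older peer).
struct endpoint_t
{
  std::string serviceid;
};

inline bool service_available(
  const std::vector<endpoint_t> & request_readers,
  const std::vector<endpoint_t> & response_writers)
{
  // First collect the service ids of all matched request readers.
  std::set<std::string> needles;
  for (const auto & rd : request_readers) {
    if (!rd.serviceid.empty()) {
      needles.insert(rd.serviceid);
    }
  }
  if (needles.empty()) {
    // No identified service: fall back to requiring any reader and any writer.
    return !request_readers.empty() && !response_writers.empty();
  }
  // Otherwise require a response writer belonging to one of those services.
  for (const auto & wr : response_writers) {
    if (needles.count(wr.serviceid) > 0) {
      return true;
    }
  }
  return false;
}
```

A reader from one service plus a writer from another no longer counts as available, while a pair of unidentified endpoints is accepted as before.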

Also, won't this change compromise cross-vendor communication?

That doesn't work anyway ... different wire representations, use of vendor-specific tricks ...

eboasson · 2020-05-27T11:41:23Z

I believe the first of the two commits addresses all the comments regarding small details (unless noted otherwise — if I am mistaken, I'll be happy to change them after all, as I don't know the C++ idiom very well).

The second addresses backwards compatibility. It also happens to make the code a bit simpler, too. The changes in that commit outside rmw_node.cpp are a simple reverting of fb040c5.

There is still the matter of the desirability of blocking in rmw_send_response. I don't particularly like this workaround, but having it ready at least puts us in a situation where we can decide what to do:

do as in dashing/eloquent (i.e., just do 0cf065d)
do as in dashing/eloquent for matching but do add the client/service identifiers (head, except for the blocking)
use this workaround
if desired, add a timeout and expect services to handle that case (that's easy enough)

Note that I don't think adding a timeout makes sense: the duration would be completely arbitrary and the service implementations would then have to be modified to deal with it.

hidmic

LGTM

Linux
Linux-aarch64
macOS
Windows

@wjwwood @ivanpauno @jacobperron I'd like your approval as well before merging anything. Test failures seem unrelated.

hidmic · 2020-05-29T13:46:02Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+  }
+  // first extract all service ids from matched readers
+  std::set<std::string> needles;
+  for (auto rdih : rds) {


This still applies.

hidmic · 2020-05-29T13:46:37Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+  } else {
+    // scan the writers to see if there is at least one response writer
+    // matching a discovered request reader
+    for (auto wrih : wrs) {


@eboasson nit: const auto & wrih

hidmic · 2020-05-29T13:52:11Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

+  // the service.
+  while (!check_for_response_reader(info->service.sub->enth, info->service.pub->enth)) {
+    dds_sleepfor(DDS_MSECS(10));
+  }


@eboasson any chance we can leave the loop after some period of time? I'd expect it to be innocuous, unless there's such traffic loss that it's unable to succeed in, say, 100 ms or more. In which case having the service server throw would be better than having either service server or service client hang silently.

hidmic · 2020-05-29T14:00:00Z

rmw_cyclonedds_cpp/src/rmw_node.cpp

  header.seq = request_header->sequence_number;
-  return rmw_send_response_request(&info->service, header, ros_response);
+  // Block until the response reader has been matched by the response writer (this is a


@eboasson mind adding a TODO for a proper fix? I've opened #191 to track that work.

wjwwood · 2020-05-29T22:35:42Z

@wjwwood @ivanpauno @jacobperron I'd like your approval as well before merging anything.

Based on the discussions we've had about this and the corresponding Fast-RTPS pull request, I'm good with it. I'll have another look, but the code changes also look good.

hidmic · 2020-06-01T13:16:21Z

@eboasson I think Windows CI compilation issues stem from client_present_t::ERROR enum value clashing with the ERROR macro that comes with windows.h.

jacobperron · 2020-06-01T16:15:56Z

Given that we're scheduled to release this week, I'd be more comfortable holding this PR for the first patch release to give us more time for testing. This means holding until the end of this week.

eboasson · 2020-06-02T11:50:02Z

I think this covers all the remarks. It has a conflict in rmw_node.cpp because the merged #190 uses the exact same code as dashing/eloquent, whereas this one had equivalent code in its place. If everyone agrees it is ok (which is not necessarily the same as it being ok to merge it straightaway, because of the Foxy release), I'll fix the conflict and squash the changes.

hidmic · 2020-06-02T16:00:08Z

@eboasson +1, feel free to rebase.

The client checks using rmw_service_server_is_available whether the request it sends will be delivered to the service, but that does not imply that the (independent, as far as DDS is concerned) response reader of the client has been discovered by the service. Usually that will be the case, but there is no guarantee. Ideally DDS would offer an interface that allows checking the reverse discovery, but that does not yet exist in either the specification or in Cyclone. This commit works around that by delaying publishing the response until the number of request writers matches the number of response readers. Signed-off-by: Erik Boasson <eb@ilities.com>

Signed-off-by: Erik Boasson <eb@ilities.com>

Assign a unique identifier to each client/service on creation, add it to the USER_DATA QoS of the reader and writer and use it for the request ids. This allows:

* rmw_service_server_is_available to only return true once it has discovered a reader/writer pair of a single service (rather than a reader from some service and a writer from some service); and
* rmw_send_response to block until it has discovered the requesting client's response reader and to abandon the operation when the client has disappeared.

The USER_DATA is formatted in the same manner as the participant USER_DATA; this uses the keys "serviceid" and "clientid".

This is still but a workaround for not having a mechanism in DDS to ensure that the response reader has been discovered by the request writer prior to sending the request.

Signed-off-by: Erik Boasson <eb@ilities.com>

Signed-off-by: Erik Boasson <eb@ilities.com>

* Revert commit fb040c5 to retain the old wire representation;
* Embed the publication_handle of the request inside rmw_request_id_t, possible because reverting to the old wire representation frees up enough space, and use this in rmw_send_response to check for the presence of the client's reader;
* Clients and services without a client/service id in the reader/writer user data are treated as fully matched at all times.

Signed-off-by: Erik Boasson <eb@ilities.com>

The discovery will eventually result in the client's reader being known or its writer no longer being known, so a timeout is not necessary for correctness. However, if it ever were to block for a longish time (which is possible in the face of network failures), returning a timeout to the caller is expected to result in less confusion. Signed-off-by: Erik Boasson <eb@ilities.com>

Signed-off-by: Erik Boasson <eb@ilities.com>

hidmic · 2020-06-16T21:38:02Z

Once Windows' CI is back, I think this is ok to go.

hidmic · 2020-06-17T19:09:42Z

Alright, going in.

* Block rmw_send_response if response reader unknown The client checks using rmw_service_server_is_available whether the request it sends will be delivered to service, but that does not imply that the (independent, as far as DDS is concerned) response reader of the client has been discovered by the service. Usually that will be the case, but there is no guarantee. Ideally DDS would offer an interface that allows checking the reverse discovery, but that does not yet exist in either the specification or in Cyclone. This commit works around that by delaying publishing the response until the number of request writers matches the number of response readers. Signed-off-by: Erik Boasson <eb@ilities.com> * Change request headers to use rmw_request_id_t on the wire Signed-off-by: Erik Boasson <eb@ilities.com> * Precise check for matched client/service Assign a unique identifier to each client/service on creation, add it to the USER_DATA QoS of the reader and writer and use it for the request ids. This allows: * rmw_service_server_is_available to only return true once it has discovered a reader/writer pair of a single service (rather than a reader from some service and a writer from some service); and * rmw_send_response to block until it has discovered the requesting client's response reader and to abandon the operation when the client has disappeared. The USER_DATA is formatted in the same manner as the participant USER_DATA, this uses the keys "serviceid" and "clientid". This is still but a workaround for having a mechanism in DDS to ensure that the response reader has been discovered prior by the request writer prior to sending the request. 
Signed-off-by: Erik Boasson <eb@ilities.com> * Address review comments Signed-off-by: Erik Boasson <eb@ilities.com> * Backwards compatibility * Revert commit fb040c5 to retain the old wire representation; * Embed the publication_handle of the request inside rmw_request_id_t, possible because reverting to the old wire representation frees up enough space, and use this in rmw_send_response to check for the presence of the client's reader; * Clients and services without a client/service id in the reader/writer user data are treated as fully matched at all times. * Replace ERROR by FAILURE to because of windows.h Signed-off-by: Erik Boasson <eb@ilities.com> * Timeout rmw_send_response after waiting 100ms for discovery The discovery will eventually result in the client's reader being known or its writer no longer being known, so a timeout is not necessary for correctness. However, if it ever were to block for a longish time (which is possible in the face of network failures), returning a timeout to the caller is expected to result in less confusion. Signed-off-by: Erik Boasson <eb@ilities.com> * Make iterators "const auto &" Signed-off-by: Erik Boasson <eb@ilities.com> * Add TODO for eliminating rmw_send_response blocking Signed-off-by: Erik Boasson <eb@ilities.com>

* Block rmw_send_response if response reader unknown

  The client checks using rmw_service_server_is_available whether the request it sends will be delivered to service, but that does not imply that the (independent, as far as DDS is concerned) response reader of the client has been discovered by the service. Usually that will be the case, but there is no guarantee.

  Ideally DDS would offer an interface that allows checking the reverse discovery, but that does not yet exist in either the specification or in Cyclone. This commit works around that by delaying publishing the response until the number of request writers matches the number of response readers.

  Signed-off-by: Erik Boasson <eb@ilities.com>

* Change request headers to use rmw_request_id_t on the wire

  Signed-off-by: Erik Boasson <eb@ilities.com>

* Precise check for matched client/service

  Assign a unique identifier to each client/service on creation, add it to the USER_DATA QoS of the reader and writer and use it for the request ids. This allows:

  * rmw_service_server_is_available to only return true once it has discovered a reader/writer pair of a single service (rather than a reader from some service and a writer from some service); and
  * rmw_send_response to block until it has discovered the requesting client's response reader and to abandon the operation when the client has disappeared.

  The USER_DATA is formatted in the same manner as the participant USER_DATA, this uses the keys "serviceid" and "clientid".

  This is still but a workaround for having a mechanism in DDS to ensure that the response reader has been discovered prior by the request writer prior to sending the request.

  Signed-off-by: Erik Boasson <eb@ilities.com>

* Address review comments

  Signed-off-by: Erik Boasson <eb@ilities.com>

* Backwards compatibility

  * Revert commit fb040c5 to retain the old wire representation;
  * Embed the publication_handle of the request inside rmw_request_id_t, possible because reverting to the old wire representation frees up enough space, and use this in rmw_send_response to check for the presence of the client's reader;
  * Clients and services without a client/service id in the reader/writer user data are treated as fully matched at all times.

* Replace ERROR by FAILURE to because of windows.h

  Signed-off-by: Erik Boasson <eb@ilities.com>

* Timeout rmw_send_response after waiting 100ms for discovery

  The discovery will eventually result in the client's reader being known or its writer no longer being known, so a timeout is not necessary for correctness. However, if it ever were to block for a longish time (which is possible in the face of network failures), returning a timeout to the caller is expected to result in less confusion.

  Signed-off-by: Erik Boasson <eb@ilities.com>

* Make iterators "const auto &"

  Signed-off-by: Erik Boasson <eb@ilities.com>

* Add TODO for eliminating rmw_send_response blocking

  Signed-off-by: Erik Boasson <eb@ilities.com>

Co-authored-by: eboasson <eb@ilities.com>
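The waiting behaviour described in the commit message (publish once the client's response reader is known, give up if the client has disappeared, return a timeout after roughly 100 ms) can be illustrated with a small polling loop. This is only a sketch and not the code from rmw_node.cpp: `ClientState`, `SendResult`, `send_response_when_discovered`, the two callbacks and the 1 ms poll interval are all hypothetical names and values chosen for illustration. In the real implementation the state check would presumably be backed by the DDS matched-endpoint queries.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical outcome of one discovery check for the requesting client.
enum class ClientState { ReaderKnown, ReaderUnknown, Gone };

// Hypothetical result type; the real rmw_send_response returns an rmw_ret_t.
enum class SendResult { Sent, Timeout, ClientGone };

// Polls check_client until the client's response reader is known, the client
// is gone, or the timeout expires. publish is called at most once, and only
// after the reader has been reported as known.
SendResult send_response_when_discovered(
  const std::function<ClientState()> & check_client,
  const std::function<void()> & publish,
  std::chrono::milliseconds timeout = std::chrono::milliseconds(100),
  std::chrono::milliseconds poll_interval = std::chrono::milliseconds(1))
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  for (;;) {
    switch (check_client()) {
      case ClientState::ReaderKnown:
        publish();
        return SendResult::Sent;
      case ClientState::Gone:
        // The request writer is no longer matched: abandon the response.
        return SendResult::ClientGone;
      case ClientState::ReaderUnknown:
        break;
    }
    if (std::chrono::steady_clock::now() >= deadline) {
      // Not needed for correctness, but less confusing than blocking on.
      return SendResult::Timeout;
    }
    std::this_thread::sleep_for(poll_interval);
  }
}
```

A caller would pass one callable that inspects the current discovery state and another that performs the actual write. Keeping both as callbacks lets the loop be exercised without a DDS runtime.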

ros-discourse · 2020-07-23T19:22:02Z

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/new-packages-for-foxy-fitzroy-2020-07-23/15570/2

ros-discourse · 2020-08-12T16:36:03Z

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/new-packages-and-patch-release-for-ros-2-foxy-fitzroy-2020-08-07/15818/1

hidmic reviewed May 20, 2020

hidmic mentioned this pull request May 20, 2020

Update Foxy release notes. ros2/ros2_documentation#704

Merged

eboasson force-pushed the lost-service-response branch from 74e71e7 to 4c3b8fa Compare May 22, 2020 10:56

hidmic reviewed May 26, 2020

hidmic requested review from jacobperron, wjwwood and ivanpauno May 26, 2020 17:18

ivanpauno reviewed May 26, 2020

jacobperron mentioned this pull request May 26, 2020

Make service wait for response reader ros2/rmw_fastrtps#390

Merged

eboasson mentioned this pull request May 29, 2020

Restore dashing/eloquent behaviour of "service_is_available" #190

Merged

hidmic mentioned this pull request May 29, 2020

Improve service discovery #191

Open

hidmic approved these changes May 29, 2020

wjwwood approved these changes May 29, 2020

eboasson added 8 commits June 3, 2020 08:32

Change request headers to use rmw_request_id_t on the wire

d63290e

Signed-off-by: Erik Boasson <eb@ilities.com>

Address review comments

6971450

Signed-off-by: Erik Boasson <eb@ilities.com>

Replace ERROR by FAILURE to because of windows.h

371e16e

Signed-off-by: Erik Boasson <eb@ilities.com>

Make iterators "const auto &"

3f85be1

Signed-off-by: Erik Boasson <eb@ilities.com>

Add TODO for eliminating rmw_send_response blocking

e17e9ed

Signed-off-by: Erik Boasson <eb@ilities.com>

eboasson force-pushed the lost-service-response branch from e5f96b1 to e17e9ed Compare June 3, 2020 06:38

ivanpauno mentioned this pull request Jun 3, 2020

Fast-RPTS 2.0.x might introduce service performance issue ros2/ros2#931

Closed

hidmic merged commit f95c496 into ros2:master Jun 17, 2020

ivanpauno mentioned this pull request Jun 22, 2020

Add preprocessor logic to preserve compatibility with Foxy in master #197

Closed

eboasson mentioned this pull request Jul 18, 2020

Really Poor Real-time Factor in Gazebo ROS simulation compared to Fast-RTPS #207

Closed

jacobperron mentioned this pull request Jul 21, 2020

[foxy backport] Lost service responses (#183, #74) (#187) #209

Merged

hidmic mentioned this pull request Oct 21, 2020

Not getting service responses reliably when using CycloneDDS #74

Open

eboasson mentioned this pull request Nov 9, 2020

Fast-DDS service discovery redesign ros2/rmw_fastrtps#418

Open

JEnoch mentioned this pull request Jan 2, 2023

[Bug] Random failure on first ROS service request while in forward discovery mode eclipse-zenoh/zenoh-plugin-dds#111

Closed

MichaelOrlov mentioned this pull request Jan 16, 2024

👨‍🌾 Regression in test_play_{timing,services}__rmw_{rmw_vendor} on the buildfarm jobs ros2/rosbag2#862

Open

-	ep(dds_get_matched_subscription_data(writer, readerih), &free_builtintopic_endpoint);
+	ep(dds_get_matched_subscription_data(writer, readerih), free_builtintopic_endpoint);

-	ep(dds_get_matched_publication_data(reader, writerih), &free_builtintopic_endpoint);
+	ep(dds_get_matched_publication_data(reader, writerih), free_builtintopic_endpoint);

-	if (rd.get() && get_user_data_key(rd->qos, "serviceid", serviceid)) {
+	if (rd && get_user_data_key(rd->qos, "serviceid", serviceid)) {
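The lookup of "serviceid" in the snippet above relies on the USER_DATA being formatted like the participant USER_DATA. As a rough illustration, the sketch below assumes that format is a sequence of `key=value;` entries and works on a plain string. `find_user_data_value` is a hypothetical stand-in: the real `get_user_data_key` takes a QoS object, and its exact parsing rules are not shown in this conversation.

```cpp
#include <string>

// Hypothetical helper: searches a "key=value;key=value;" string (assumed
// layout) for an entry whose key matches exactly, and stores its value.
// Returns false if the key is not present.
bool find_user_data_value(
  const std::string & user_data,
  const std::string & key,
  std::string & value)
{
  std::string::size_type pos = 0;
  while (pos < user_data.size()) {
    // Each entry runs up to the next ';' or to the end of the string.
    std::string::size_type end = user_data.find(';', pos);
    if (end == std::string::npos) {
      end = user_data.size();
    }
    const std::string entry = user_data.substr(pos, end - pos);
    const std::string::size_type eq = entry.find('=');
    // Compare the whole key so "xserviceid" does not match "serviceid".
    if (eq != std::string::npos && entry.compare(0, eq, key) == 0) {
      value = entry.substr(eq + 1);
      return true;
    }
    pos = end + 1;
  }
  return false;
}
```

With input such as `clientid=01.02;serviceid=ab.cd;` (made-up identifiers), a lookup of `serviceid` would yield `ab.cd`. Entries without a client/service id simply return false, which matches the rule that such endpoints are treated as fully matched.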
