
Lost service responses (#183, #74) #187

Merged: 9 commits into ros2:master on Jun 17, 2020

Conversation

eboasson (Collaborator)

This PR addresses the service invocation problems by fixing one silly bug in the rmw_service_server_is_available code (not actually checking the number of matched endpoints), and by blocking in rmw_send_response until there is reasonable evidence that the response reader has been discovered.

The proper solution (as discussed in #74) makes the rmw_service_server_is_available return false until this point has been reached, but as of today, neither the DDS specification nor Cyclone DDS provides the means to do that without the application exchanging information on what has been discovered. It can be done easily enough, but it is a rather significant burden for what is ultimately a rare problem.

Without the workaround (but with the bug fix) it rarely fails. With the workaround added, I have not been able to reproduce it anymore. I've only seen multiple waits in sequence by introducing significant packet loss.

I do expect this workaround to be somewhat controversial ...
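
For reference, the core of the workaround is a comparison of two match counts; a sketch (status accessors as in Cyclone DDS, body illustrative):

// Sketch of the match-count heuristic (the real check appears in the
// diff below): if the number of writers matched by the service's
// request reader equals the number of readers matched by its response
// writer, the requesting client's response reader has very likely been
// discovered.
static bool check_for_response_reader(dds_entity_t request_reader, dds_entity_t response_writer)
{
  dds_subscription_matched_status_t sm;
  dds_publication_matched_status_t pm;
  if (dds_get_subscription_matched_status(request_reader, &sm) < 0 ||
    dds_get_publication_matched_status(response_writer, &pm) < 0)
  {
    return false;  // should not happen while both entities exist
  }
  return sm.current_count == pm.current_count;
}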

@hidmic (Contributor) left a comment

Thanks for fixing this @eboasson! At first glance, we should be able to backport this for Eloquent and Dashing. Do you agree? Or has anything fundamentally changed within rmw_cyclonedds_cpp since?

@jacobperron I think it'd be great if we can get this in for Foxy.

//
// it is pretty horrid ... but blocking the service is the only option if
// the client is unable to determine that it has been fully discovered by
// the service.
Contributor

it is pretty horrid ...

I'm fine with this as long as we agree that this isn't a solution but a workaround for distros up to and including Foxy.

Member

Wait, why is this needed? Shouldn't the requester wait for a bidirectional match before sending a request? If, in the meantime, the response channel goes away, I think this should just fail, not block... Blocking without an external option for a timeout is not very nice in my opinion. The user may not even know it is blocking or for how long.

eboasson (Collaborator, Author)

The requester is definitely the one that should be doing the waiting. Unfortunately, DDS doesn’t offer a mechanism for waiting for a bidirectional match. Cyclone does guarantee that the writer having matched the reader is sufficient for that reader to receive the data (modulo disconnections and things like that), but that gives you one-way only.
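
(For concreteness, a sketch of the one-way wait that does exist in Cyclone, using the waitset and match-status calls; error handling omitted:)

// Sketch: block until this writer has matched at least one reader.
// There is no analogous call to wait until the remote side has
// discovered *us*, which is the gap discussed above.
dds_entity_t ws = dds_create_waitset(participant);
(void) dds_set_status_mask(writer, DDS_PUBLICATION_MATCHED_STATUS);
(void) dds_waitset_attach(ws, writer, 0);
dds_publication_matched_status_t pm = {};
while (dds_get_publication_matched_status(writer, &pm) == 0 && pm.current_count == 0) {
  (void) dds_waitset_wait(ws, nullptr, 0, DDS_SECS(1));
}
(void) dds_delete(ws);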

Secondly, the simple counting that is done here (and in other RMW layers) only works if there is but a single server. You really ought to wait until you have a matched pair, but that would additionally require DDS to be aware of these pairs, and it isn’t. We could use, e.g., the USER_DATA QoS to make the mapping clear at the RMW level, but if service interoperability is desirable that needs to be agreed on by a wider audience.

What has been on my to-do list for Cyclone for some time is to make it possible to wait until all local entities have been discovered by all remote entities, much like “wait for acknowledgements” allows you to do this for a regular writer and its readers. Even then, there are little details, for example, one would have to guarantee that the effects of a discovery message that node P sent to node Q have occurred prior to P receiving Q’s acknowledgment. That’s not guaranteed anywhere in the spec and also not today in Cyclone because it processes all discovery data asynchronously, but it is an important primitive.

As to this change: I considered a variant (check in rmw_take_request) but the same quality of implementation comes out way more complicated there. A good one would put the request aside and retry/prune based on “publication matched” events. It’s feasible (I should still have a proof-of-concept implementation lying around for something similar as far as the waitset goes) but it is significantly more complex. That’d probably be time better spent on fixing the root cause.

Another mapping would use transient local requests and responses, using the request id as key. It is easy enough in Cyclone, but that’d also be quite a big departure from what has been done until now (by any RMW as far as I know).

The only indefinite blocking this does is when there are clients arriving and leaving all the time. If the current client simply disappears, both counters will end up being the same once the disappearance is detected. But, yeah ...

eboasson (Collaborator, Author)

Actually, on continued consideration, I could probably make it work without this hack and without adding a "wait for discovery to complete" operation, but I do need to think through the consequences of the changes it would require in Cyclone. It also wouldn't solve the problem if there are multiple matching services.

So it is not yet a given that it'll happen, and it certainly won't be available immediately.

Contributor

I think we can all agree that this isn't the correct fix, but a workaround. And @eboasson has been open about it from the get-go. As I see it, the sole purpose of this patch is to reduce the likelihood that a discovery race impairs service behavior.

Considering the current behavior is objectively worse (once we've guaranteed an indefinite wait cannot happen here) and that we're about to release an LTS distro, I'd be very much inclined to land this and document accordingly. I'm sure @jacobperron and @dirk-thomas have their own opinions as well.

Member

IIUC, it seems like a general design flaw with trying to build ROS services on top of DDS, which plagues all of our RMWs.

A workaround sounds okay for now if it's making the situation better; in this particular case, I would want certainty that the logic is not going to cause an indefinite wait. Maybe that means introducing a timeout.

@@ -3135,6 +3147,16 @@ extern "C" rmw_ret_t rmw_send_response(
cdds_request_header_t header;
memcpy(&header.guid, request_header->writer_guid, sizeof(header.guid));
header.seq = request_header->sequence_number;
// if the number of writers matching our request reader equals the number
// of readers matching our response writer, we have a pretty decent claim
// to have discovered this client's response reader.
Contributor

@eboasson nit: consider moving this comment block closer to the check_for_response_reader() function.

Contributor

we have a pretty decent claim

👀

Contributor

Any chance we can tell if remote writer and reader belong to the same server instance? Or at least to the same participant?

eboasson (Collaborator, Author)

I actually started working out how to do that, and it is definitely possible to determine that they are from the same participant. But now that all nodes in a process (well, context) share the same participant, that seems insufficiently selective. I couldn’t find a way to tie them to the node (which I think is selective enough) short of using USER_DATA. That seemed like it might be unwise. I suppose the graph cache interface could be extended to answer this question.

What we do know here is that we have discovered the client’s writer (else we could never have received its request), and that therefore the client’s writer is accounted for in the “subscription matched” status. Thus, it is pretty likely that the service writer’s match count won’t equal the service’s reader match count unless that client’s reader has been discovered. But there is no guarantee: if another client shows up and its reader happens to be discovered before its writer, you will draw the wrong conclusion.

In the absence of packet loss that is highly unlikely, because the client creates the writer first and so the discovery of the writer will precede the discovery of the reader. Indeed, I wouldn’t be surprised if simply swapping the creation of the reader and the writer in the client would make the problem disappear in the test setup.

@hidmic (Contributor) May 21, 2020

Indeed, I wouldn’t be surprised if simply swapping the creation of the reader and the writer in the client would make the problem disappear in the test setup.

Interesting. Have you tried it? Whatever we can do to reduce the likelihood of an indefinite wait is worth exploring IMO.

// the service.
while (!check_for_response_reader(info->service.sub->enth, info->service.pub->enth)) {
dds_sleepfor(DDS_MSECS(10));
}
Contributor

@eboasson what if the client goes away before the response is sent? I'm fine with the busy wait, but it should time out at some point. Unless we can detect that the request writer went away.

eboasson (Collaborator, Author)

The DDS discovery will discover the disappearance of the client, remove all matches with it and decrement the current_count. So I believe that case is covered. (But as I remarked above, there is a problem if you create/delete clients all the time.)

Contributor

I see. I'm still a bit wary about indefinite waits. Even if it rarely occurs, timing out and failing to reply would give the calling code a chance to do something about it.

Contributor

@eboasson any chance we can leave the loop after some period of time? I'd expect it to be innocuous, unless there's such traffic loss that it's unable to succeed in, say, 100 ms or more. In which case having the service server throw would be better than having either service server or service client hang silently.
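
For illustration, a bounded variant of the loop quoted above, along the lines of the 100 ms timeout commit that eventually landed (control flow is a sketch):

// Same polling loop, but give up after ~100 ms and surface a timeout
// to the caller instead of waiting indefinitely (budget per the later
// "Timeout rmw_send_response" commit; structure illustrative).
const dds_time_t tend = dds_time() + DDS_MSECS(100);
while (!check_for_response_reader(info->service.sub->enth, info->service.pub->enth)) {
  if (dds_time() >= tend) {
    return RMW_RET_TIMEOUT;
  }
  dds_sleepfor(DDS_MSECS(10));
}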

@@ -3628,7 +3650,6 @@ extern "C" rmw_ret_t rmw_service_server_is_available(
ret =
common_context->graph_cache.get_writer_count(sub_topic_name, &number_of_response_publishers);
if (ret != RMW_RET_OK || 0 == number_of_response_publishers) {
// error
Contributor

@eboasson why this change?

eboasson (Collaborator, Author)

The error case was self-evident and the comment was incorrect for the common case (no response publishers discovered yet). But, yeah, perhaps that ought to be a separate commit.

if (dds_get_subscription_matched_status(request_reader, &sm) < 0 ||
dds_get_publication_matched_status(response_writer, &pm) < 0)
{
return RMW_RET_ERROR;
Contributor

@eboasson

Suggested change
return RMW_RET_ERROR;
return false;

?

@eboasson (Collaborator, Author) May 21, 2020

Oops! Thanks! It’ll probably never get there (the function can’t fail if the reader/writer exist) but that’s no excuse for messing up the type so badly ...

I have to think about the fix; it might be that returning true is better than returning false, because if this fails once, it’ll probably fail the next time, too. Returning true would definitely require a comment, and it’d probably make more sense to return an rmw_ret_t.

Contributor

+1 to discriminating error conditions from entity absence.

@eboasson (Collaborator, Author)

Apologies for the force push ... the original two commits are still there, but I noticed that the subject line mentioned the wrong function and I didn't want to run the risk of that mistake getting merged.

What is new is:

  • fb040c5 which changes the request headers on the wire to use rmw_request_id_t (so a 128-bit client identifier + sequence number instead of a 64-bit client identifier). The extra room is then used by
  • 4c3b8fa to do a precise check whether a reader/writer pair of a single server has been matched (in rmw_service_server_is_available) and whether the requesting client's response reader has been matched (in rmw_send_response).

The mechanism employed is generating unique client/service identifiers (based on the participant GUID), storing these in the reader/writer USER_DATA QoS (as a key/value pair, using serviceid and clientid as keys) and in the GUID part of the rmw_request_id_t, and then using the various operations to interrogate information on the matched endpoints to see if there is a match.
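
A sketch of the endpoint tagging described here (the "key=value;" encoding follows the participant USER_DATA convention; the helper shape and id derivation are assumptions):

// Attach the unique service id to the QoS used for the service's
// request reader and response writer ("serviceid=<id>;", with <id>
// derived from the participant GUID plus a counter).
dds_qos_t * qos = dds_create_qos();
const std::string ud = "serviceid=" + service_id + ";";
dds_qset_userdata(qos, ud.c_str(), ud.size());
// ... use qos in dds_create_reader / dds_create_writer, then:
dds_delete_qos(qos);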

It adds overhead and complexity to the previous proposal but as far as I can tell, it doesn't suffer from ever having to wait indefinitely (it is by design immune to adding/removing other clients, and it still handles the case where the client has disappeared).

But clearly this is still a workaround.

@hidmic (Contributor) left a comment

Thanks for putting in the time, @eboasson! I wonder, though: is this change backwards compatible (i.e., can it work with a peer participant running without this patch, with the known potential races)? Also, won't this change compromise cross-vendor communication?

void * ud;
size_t udsz;
if (!dds_qget_userdata(qos, &ud, &udsz)) {
std::map<std::string, std::vector<uint8_t>> emptymap;
Contributor

@eboasson nit: I'd think any compiler would be smart enough to copy-elide the return value, but

std::map<std::string, std::vector<uint8_t>> map;
void * ud;
size_t udsz;
if (dds_qget_userdata(qos, &ud, &udsz)) {
  std::vector<uint8_t> udvec(static_cast<uint8_t *>(ud), static_cast<uint8_t *>(ud) + udsz);
  dds_free(ud);
  map = rmw::impl::cpp::parse_key_value(udvec);
}
return map;

would make it simpler and slightly clearer.

dds_entity_t writer, dds_instance_handle_t readerih)
{
std::unique_ptr<dds_builtintopic_endpoint_t, std::function<void(dds_builtintopic_endpoint_t *)>>
ep(dds_get_matched_subscription_data(writer, readerih), &free_builtintopic_endpoint);
Contributor

@eboasson nit:

Suggested change
ep(dds_get_matched_subscription_data(writer, readerih), &free_builtintopic_endpoint);
ep(dds_get_matched_subscription_data(writer, readerih), free_builtintopic_endpoint);

dds_entity_t reader, dds_instance_handle_t writerih)
{
std::unique_ptr<dds_builtintopic_endpoint_t, std::function<void(dds_builtintopic_endpoint_t *)>>
ep(dds_get_matched_publication_data(reader, writerih), &free_builtintopic_endpoint);
Contributor

@eboasson nit:

Suggested change
ep(dds_get_matched_publication_data(reader, writerih), &free_builtintopic_endpoint);
ep(dds_get_matched_publication_data(reader, writerih), free_builtintopic_endpoint);

{
std::ostringstream os;
os << std::hex;
os << std::setw(2) << static_cast<int>(static_cast<uint8_t>(id.writer_guid[0]));
Contributor

@eboasson why static_cast<uint8_t>(...) ? Are you expecting to truncate it?

eboasson (Collaborator, Author)

That cast is there to avoid sign extension (id.writer_guid is an array of int8_t ...)
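
The effect is easy to see in isolation (a minimal, self-contained example):

#include <cstdint>
#include <iostream>

int main()
{
  const int8_t b = -128;  // bit pattern 0x80, like a raw GUID byte
  std::cout << std::hex
            << static_cast<int>(b) << '\n'                          // ffffff80 (sign-extended)
            << static_cast<int>(static_cast<uint8_t>(b)) << '\n';   // 80
}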

for (auto rdih : rds) {
auto rd = get_matched_subscription_data(client.pub->enth, rdih);
std::string serviceid;
if (rd.get() && get_user_data_key(rd->qos, "serviceid", serviceid)) {
Contributor

@eboasson nit:

Suggested change
if (rd.get() && get_user_data_key(rd->qos, "serviceid", serviceid)) {
if (rd && get_user_data_key(rd->qos, "serviceid", serviceid)) {

for (auto wrih : wrs) {
auto wr = get_matched_publication_data(client.sub->enth, wrih);
std::string serviceid;
if (wr.get() &&
Contributor

@eboasson nit:

Suggested change
if (wr.get() &&
if (wr &&

wr->qos, "serviceid",
serviceid) && needles.find(serviceid) != needles.end())
{
*is_available = true;
Contributor

@eboasson consider asserting that is_available != nullptr && !*is_available, which seems to be the implicit assumption in this function.

}
const std::string needle = request_id_writer_guid_to_string(reqid);
// if we have matched this client's reader, all is well
for (auto rdih : rds) {
Contributor

@eboasson nit:

Suggested change
for (auto rdih : rds) {
for (auto & rdih : rds) {

eboasson (Collaborator, Author)

given that dds_instance_handle_t is just a 64-bit integer, I don't understand why using a reference would be an improvement

}
// if not, we should stop waiting if the writer is no longer there,
// as that implies the client no longer exists
for (auto wrih : wrs) {
Contributor

@eboasson nit:

Suggested change
for (auto wrih : wrs) {
for (auto & wrih : wrs) {

// break instead of returns makes gcc happy
break;
case client_present_t::MAYBE:
return RMW_RET_TIMEOUT;
Contributor

@eboasson how could st ever be client_present_t::MAYBE at this point if that wouldn't allow it to leave the loop in the first place?

eboasson (Collaborator, Author)

True, and it would be better to combine it with the ERROR case. I usually go out of my way to treat enums as enumerated types and cover all the cases in switches over them. But C & C++ semantics don't guarantee that one can assign only defined values to an object of an enumerated type, and what then happens is compiler dependent.

Clang doesn't warn if you cover all cases of an enum, presumably on the assumption that you treat the enum as an enum (or perhaps it proves that no other values ever get assigned to it). Gcc warns, presumably because technically other values are legal too and, again presumably, it doesn't do the analysis to prove that no other values ever get assigned to it. As the comment suggests, there used to be a return in line 3294. With that, the MAYBE case makes a bit more sense. Now, it just looks silly.
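
A minimal reproduction of the difference (enumerator names partly assumed; FAILURE is the rename of ERROR mentioned further down):

enum class client_present_t { FAILURE, MAYBE, YES };  // YES assumed

static int to_ret(client_present_t st)
{
  switch (st) {
    case client_present_t::FAILURE: return -1;
    case client_present_t::MAYBE: return -2;
    case client_present_t::YES: return 0;
  }
  // gcc warns "control reaches end of non-void function" without this
  // line, because values outside the enumerators are representable;
  // clang accepts the exhaustive switch as-is.
  return -1;
}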

@ivanpauno (Member) left a comment

I've left some comments, but the new approach looks pretty good to me

@@ -490,6 +508,33 @@ static void get_entity_gid(dds_entity_t h, rmw_gid_t & gid)
convert_guid_to_gid(guid, gid);
}

static std::map<std::string, std::vector<uint8_t>> parse_user_data(const dds_qos_t * qos)
Member

nit: prefer std::unordered_map

eboasson (Collaborator, Author)

parse_key_value from rmw uses std::map so I think it is better to leave this as-is

// sizeof((reinterpret_cast<rmw_request_id_t *>(0))->writer_guid)
//
// is not a constant, and the 16 is a hard-coded magic number in
// rmw_request_id_t ...
Member

I don't fully follow the comment, but you can also suppress a cpplint warning by appending // NOLINT to the problematic line of code.

eboasson (Collaborator, Author)

There are actually two comments with bad formatting. One I put in originally: "strangely, the writer_guid in a request id is smaller than the rmw_gid_t". I find it weird that the writer_guid in an rmw_request_id_t is 16 bytes whereas a rmw_gid_t has 24 bytes.

The second bit is about lint. I didn't know about // NOLINT, but that's a better solution than a hard-coded constant. Thanks!

Comment on lines 3049 to 3052
static rmw_ret_t get_matched_endpoints(
dds_entity_t h, dds_return_t (* fn)(
dds_entity_t h,
dds_instance_handle_t * xs, size_t nxs), std::vector<dds_instance_handle_t> & res)
Member

nit: defining the function-pointer type beforehand greatly improves readability:

using get_matched_endpoints_fn_t = dds_return_t (*)(
    dds_entity_t h,
    dds_instance_handle_t * xs, size_t nxs);
static rmw_ret_t get_matched_endpoints(
  dds_entity_t h, get_matched_endpoints_fn_t fn, std::vector<dds_instance_handle_t> & res)

eboasson (Collaborator, Author)

I must have seen too many C function pointer types ... :) but yes, that would be wise

dds_free(e);
}

static std::unique_ptr<dds_builtintopic_endpoint_t,
Member

similarly here:

using BuiltinTopicEndpoint = std::unique_ptr<dds_builtintopic_endpoint_t,
  std::function<void(dds_builtintopic_endpoint_t *)>>;

std::ostringstream os;
os << std::hex;
os << std::setw(2) << static_cast<int>(static_cast<uint8_t>(id.writer_guid[0]));
for (size_t i = 1; i < sizeof(id.writer_guid) - 1; i++) {
Member

Should this be < sizeof(id.writer_guid)? It seems that the last byte won't be written otherwise.

eboasson (Collaborator, Author)

Yes, absolutely. (I changed it from 0 .. n-1 to 1..n, except I didn't change the upper bound.) There are not enough services/clients in tests to catch this one. This would have been a source of nasty bugs ...
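
With the bound fixed, the formatting loop covers the final byte too (the loop body and setfill are assumptions, since the diff only shows the first lines):

os << std::hex << std::setfill('0');
os << std::setw(2) << static_cast<int>(static_cast<uint8_t>(id.writer_guid[0]));
for (size_t i = 1; i < sizeof(id.writer_guid); i++) {  // no more "- 1"
  os << std::setw(2) << static_cast<int>(static_cast<uint8_t>(id.writer_guid[i]));
}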

std::ostringstream os;
os << std::hex;
os << std::setw(2) << static_cast<int>(id.data[0]);
for (size_t i = 1; i < sizeof(id.data) - 1; i++) {
Member

same about the - 1, it doesn't sound correct ...

}
// first extract all service ids from matched readers
std::set<std::string> needles;
for (auto rdih : rds) {
Member

nit: const auto &

Contributor

This still applies.

return RMW_RET_OK;
}
// then scan the writers to see if there is at least one with a service id in the set
for (auto wrih : wrs) {
Member

nit: const auto &

@eboasson (Collaborator, Author)

@hidmic and @ivanpauno thanks for reviewing (and catching a few bugs!). I'm fine with all the nitpicks. I'll sort them out tomorrow, but I thought it might make sense to quickly respond to the comments given the time zone differences and the urgency.

@hidmic, regarding your two overall questions:

I wonder though, is this change backwards compatible (i.e. it can work with a peer participant running without this patch, with the known potential races)?

It isn't: firstly, the wire representation of the request id changed from the quick hack of yore to the kind-of sensible rmw_request_id_t. While that change is convenient, it is not strictly necessary: the alternative is to convert the publication_handle of the request in rmw_take_request to the client id, and then associate the rmw_request_id_t with the client id internally. Then rmw_send_response can use the request id to look up the GUID, and so on. It is messy ... and it does assume that all requests result in exactly one response.

Secondly, it now only matches services/clients that have these identifiers. Making it backwards compatible requires treating an "unidentified" reader and writer pair as sufficient in rmw_service_server_is_available and never waiting in rmw_send_response.

Also, won't this change compromise cross-vendor communication?

That doesn't work anyway ... different wire representations, use of vendor-specific tricks ...

@eboasson (Collaborator, Author)

I believe the first of the two commits addresses all the comments regarding small details (unless noted otherwise — if I am mistaken, I'll be happy to change them after all, as I don't know the C++ idiom very well).

The second addresses backwards compatibility. It happens to make the code a bit simpler, too. The changes in that commit outside rmw_node.cpp simply revert fb040c5.

There is still the matter of the desirability of blocking in rmw_send_response. I don't particularly like this workaround, but having it ready at least puts us in a situation where we can decide what to do:

  • do as in dashing/eloquent (i.e., just do 0cf065d)
  • do as in dashing/eloquent for matching but do add the client/service identifiers (head, except for the blocking)
  • use this workaround
  • if desired, add a timeout and expect services to handle that case (that's easy enough)

Note that I don't think adding a timeout makes sense: the duration would be completely arbitrary, and the service implementations would then have to be modified to deal with it.

@hidmic (Contributor) left a comment

LGTM

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@wjwwood @ivanpauno @jacobperron I'd like your approval as well before merging anything. Test failures seem unrelated.

}
// first extract all service ids from matched readers
std::set<std::string> needles;
for (auto rdih : rds) {
Contributor

This still applies.

} else {
// scan the writers to see if there is at least one response writer
// matching a discovered request reader
for (auto wrih : wrs) {
Contributor

@eboasson nit: const auto & wrih

// the service.
while (!check_for_response_reader(info->service.sub->enth, info->service.pub->enth)) {
dds_sleepfor(DDS_MSECS(10));
}
Contributor

@eboasson any chance we can leave the loop after some period of time? I'd expect it to be innocuous, unless there's such traffic loss that it's unable to succeed in, say, 100 ms or more. In which case having the service server throw would be better than having either service server or service client hang silently.

header.seq = request_header->sequence_number;
return rmw_send_response_request(&info->service, header, ros_response);
// Block until the response reader has been matched by the response writer (this is a
Contributor

@eboasson mind adding a TODO for a proper fix? I've opened #191 to track that work.

@wjwwood (Member)

wjwwood commented May 29, 2020

@wjwwood @ivanpauno @jacobperron I'd like your approval as well before merging anything.

Based on the discussions we've had about this and the corresponding Fast-RTPS pull request, I'm good with it. I'll have another look, but the code changes also look good.

@hidmic (Contributor)

hidmic commented Jun 1, 2020

@eboasson I think Windows CI compilation issues stem from client_present_t::ERROR enum value clashing with the ERROR macro that comes with windows.h.
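
(Illustration of the clash: wingdi.h, pulled in via windows.h, defines ERROR as a plain macro, so the enumerator gets macro-expanded:)

// wingdi.h contains:  #define ERROR 0
// so on Windows this declaration expands to "enum class client_present_t { 0, ... }":
//   enum class client_present_t { ERROR, MAYBE, ... };
// renaming the enumerator sidesteps the macro:
enum class client_present_t { FAILURE, MAYBE /* , ... */ };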

@jacobperron (Member)

Given that we're scheduled to release this week, I'd be more comfortable holding this PR for the first patch release to give us more time for testing. This means holding until the end of this week.

@eboasson (Collaborator, Author)

eboasson commented Jun 2, 2020

I think this covers all the remarks. It has a conflict in rmw_node.cpp because the merged #190 uses the exact same code as dashing/eloquent, whereas this one had equivalent code in its place. If everyone agrees it is ok (which is not necessarily the same as it being ok to merge it straightaway, because of the Foxy release), I'll fix the conflict and squash the changes.

@hidmic (Contributor)

hidmic commented Jun 2, 2020

@eboasson +1, feel free to rebase.

The client checks using rmw_service_server_is_available whether the
request it sends will be delivered to the service, but that does not imply
that the (independent, as far as DDS is concerned) response reader of
the client has been discovered by the service.  Usually that will be the
case, but there is no guarantee.

Ideally DDS would offer an interface that allows checking the reverse
discovery, but that does not yet exist in either the specification or in
Cyclone.  This commit works around that by delaying publishing the
response until the number of request writers matches the number of
response readers.

Signed-off-by: Erik Boasson <eb@ilities.com>
Assign a unique identifier to each client/service on creation, add it
to the USER_DATA QoS of the reader and writer and use it for the request
ids.  This allows:

* rmw_service_server_is_available to only return true once it has
  discovered a reader/writer pair of a single service (rather than a
  reader from some service and a writer from some service); and
* rmw_send_response to block until it has discovered the requesting
  client's response reader and to abandon the operation when the client
  has disappeared.

The USER_DATA is formatted in the same manner as the participant
USER_DATA; it uses the keys "serviceid" and "clientid".

This is still but a workaround for the lack of a mechanism in DDS to
ensure that the response reader has been discovered by the request
writer prior to sending the request.

Signed-off-by: Erik Boasson <eb@ilities.com>
* Revert commit fb040c5 to retain the
  old wire representation;

* Embed the publication_handle of the request inside rmw_request_id_t,
  possible because reverting to the old wire representation frees up
  enough space, and use this in rmw_send_response to check for the
  presence of the client's reader;

* Clients and services without a client/service id in the reader/writer
  user data are treated as fully matched at all times.
Signed-off-by: Erik Boasson <eb@ilities.com>
The discovery will eventually result in the client's reader being known
or its writer no longer being known, so a timeout is not necessary for
correctness.  However, if it ever were to block for a longish
time (which is possible in the face of network failures), returning a
timeout to the caller is expected to result in less confusion.

Signed-off-by: Erik Boasson <eb@ilities.com>
@hidmic (Contributor)

hidmic commented Jun 16, 2020

Once Windows' CI is back, I think this is ok to go.

@hidmic (Contributor)

hidmic commented Jun 17, 2020

Alright, going in.

@hidmic hidmic merged commit f95c496 into ros2:master Jun 17, 2020
jacobperron pushed a commit that referenced this pull request Jul 21, 2020
* Block rmw_send_response if response reader unknown
* Change request headers to use rmw_request_id_t on the wire
* Precise check for matched client/service
* Address review comments
* Backwards compatibility (revert fb040c5; embed the publication_handle in rmw_request_id_t; treat unidentified clients/services as fully matched; replace ERROR by FAILURE because of windows.h)
* Timeout rmw_send_response after waiting 100ms for discovery
* Make iterators "const auto &"
* Add TODO for eliminating rmw_send_response blocking
jacobperron added a commit that referenced this pull request Jul 21, 2020
@ros-discourse

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/new-packages-for-foxy-fitzroy-2020-07-23/15570/2

@ros-discourse

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/new-packages-and-patch-release-for-ros-2-foxy-fitzroy-2020-08-07/15818/1
