Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add read/write instrumentation with LTTng #898

Conversation

christophebedard
Copy link
Contributor

@christophebedard christophebedard commented Aug 2, 2021

This adds tracing instrumentation using LTTng for:

  • writer creation
  • reader creation
  • writing
  • reading

When a writer or reader entity is created, we note:

  • the pointer to the underlying writer/reader
  • the topic name
  • its GUID

When we write or read, we note:

  • the pointer to the underlying writer/reader
  • the pointer to the message buffer
  • (write only) the source timestamp

This allows us to:

  • match a writer or reader with an encapsulating/higher-level object in an application layer (above) using the GUID
  • match writer/reader creation events with write/read events using the writer/reader pointers
  • match the write/read message buffer pointers with other write/read/publish/take tracepoints in the application level
  • match the source timestamp between writing and receiving

Also, while it has to be separate from the ROS 2 instrumentation, these tracepoints can easily be enabled by the tracing-related launch/CLI utilities in https://gitlab.com/ros-tracing/ros2_tracing

See the discussion here: ros2/rmw_cyclonedds#324 (comment). As mentioned at the bottom of that PR, LTTng has only partial support for tracing SDT probes, but it can generate them alongside its own tracepoints if it is built from source and configured with --with-sdt.

Note: this PR targets master, but I'd like to get it cherry-picked (if the PR is accepted) to the iceoryx branch so that ROS 2 can use it (since it currently targets the iceoryx branch).

Signed-off-by: Christophe Bedard bedard.christophe@gmail.com

Copy link
Contributor

@eboasson eboasson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @christophebedard thanks for contributing this first attempt at introducing proper tracing into the Cyclone code base. I have a problem with basing everything exclusively on LTTng, as I really want to (also) be able to use dtrace on macOS, for example.

That means probe definitions need to be a bit differently. This in itself is going to be tricky, so I would really like to see a PoC for first, or at least a sketch of one 😁.

Based on what you noted in ros2/rmw_cyclonedds#324 (comment) concerning semaphores, an argument could be made that LTTng tracepoints would necessarily end up in less interesting places and fewer in number and that consequently it wouldn't be worth the effort. In that case, arguably, LTTng tracepoints may not be a good choice for Cyclone to begin with.

That restriction does strike me as odd, as I haven't seen it in the LTTng documentation and instead found a remark that it uses a lock-free procedure for adding to the trace. I don't see how the two fit together.

Some of the comments are perhaps a bit stream-of-consciousness ... my apologies for that.

src/core/ddsc/CMakeLists.txt Outdated Show resolved Hide resolved

#ifdef DDS_HAS_LTTNG_TRACING
dds_guid_t guid;
(void)dds_get_guid(reader, &guid);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dds_get_guid can fail if there is a concurrent call attempting to delete the new reader. Worse, the reader may have been deleted at this point, because dds_entity_init_complete on line 607 makes the new reader accessible via its handle and unpins it as well. That is way too early (I am sure there is a reason, so I need to look into it before making any changes).

However, if you were to try calling dds_get_guid prior to dds_entity_init_complete it would fail because it would still be "pending" ... It probably makes sense to introduce a function that takes an entity pointer and returns the corresponding dds_guid_t (analogous to the inner workings of dds_get_guid) and then refactor dds_get_guid to use that function.

I think it would be better to trace not (just) the address of the reader (rd) but (also) its handle (reader), because the handle is what is used consistently in the API. Tracing the address is primarily interesting for the purpose of analyzing Cyclone itself.

From a ROS 2 perspective, I can see that tracing just the topic name would be sufficient, but for more general DDS usage, the partition names might be rather useful, too. Based on the QoS matching rules one could argue that all QoS are important, but I don't know of anyone who deliberately relies on incompatible QoS's in a system, and I don't think this type of instrumentation is appropriate for debugging incorrect QoS.

For analyzing the reader side of Cyclone itself, having rd->m_rd and rd->m_rhc available would also be useful, as those are involved in handling data at a lower level in the stack (esp. the latter would be useful, it'd allow you to trace data going into and coming out of the reader history cache).

However, adding all that information to the trace brings it perilously close to the problems of the current tracing to a text file. There is always something else to add, and in the end you are tracing far too much, most of it completely useless for the problem at hand. This is incidentally one reason why I very much prefer dtrace/bpftrace-style tooling: then a script can extract all the necessary information from just the reader pointer.


#ifdef DDS_HAS_LTTNG_TRACING
dds_guid_t guid;
(void)dds_get_guid(writer, &guid);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment at dds_create_reader: here the call to dds_entity_init_complete wasn't done too early, but still there is no guarantee that the writer is still accessible nor that wr is still a valid pointer.

(There is also the possibility that in between line 448 (perhaps 452 because deleting a writer does require locking the publisher for removing it from its table of children) and this tracepoint, the writer is deleted and a new one is created that just happens to be at the same address. If you're really out of luck you might end up with an ambiguous trace ...)

@@ -333,6 +334,8 @@ dds_return_t dds_write_impl (dds_writer *wr, const void * data, dds_time_t tstam
if (data == NULL)
return DDS_RETCODE_BAD_PARAMETER;

TRACEPOINT(write, (const void *)wr, data, tstamp);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd trace the writer's handle, not its address (see comment at dds_create_reader).

There's also dds_writecdr_impl which is used by rmw_publish_serialized_message.

@@ -128,6 +129,7 @@ static dds_return_t dds_read_impl (bool take, dds_entity_t reader_or_condition,
}
dds_entity_unpin (entity);
thread_state_asleep (ts1);
TRACEPOINT(read, (const void *)rd, (const void *)*buf);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'd probably want to instrument dds_readcdr_impl as well, it is used by rmw_take_serialized_message.

* SPDX-License-Identifier: EPL-2.0 OR BSD-3-Clause
*/

// Provide fake header guard for cpplint
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, read enough of the docs and header files of LTTng to understand this comment. It would probably be better to explain that LTTng tracepoints expect multiple inclusion of this file and point to https://lttng.org/docs/v2.13/#doc-tpp-header for a bit of a backgrounder.

I also think it would make sense to not disable LLTng API version-0 compatibility and use the LTTNG_UST_ prefixed names instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, read enough of the docs and header files of LTTng to understand this comment. It would probably be better to explain that LTTng tracepoints expect multiple inclusion of this file and point to lttng.org/docs/v2.13/#doc-tpp-header for a bit of a backgrounder.

I don't think Cyclone DDS uses cpplint, so it might not be needed here if linters don't complain. I'll look into it and remove this part if it's not needed.

I also think it would make sense to not disable LLTng API version-0 compatibility and use the LTTNG_UST_ prefixed names instead.

I think it's the other way around, no? Version 0 doesn't use those prefixes. Besides, Ubuntu 20.04 is on LTTng 2.11.

#include <stdint.h>
#include <stdbool.h>

#define DDS_GID_STORAGE_SIZE 16u
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it not be possible to include dds/dds.h and use dds_guid_t and its sizeof directly, rather than defining its size here and passing it as an array of uint8_t? Not that its definition will change 🙂

The other thing is that it would definitely by better to consistently use GUID rather than introduce GID here.

Copy link
Contributor Author

@christophebedard christophebedard Aug 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it not be possible to include dds/dds.h and use dds_guid_t and its sizeof directly, rather than defining its size here and passing it as an array of uint8_t? Not that its definition will change

Sure I'll try.

The other thing is that it would definitely by better to consistently use GUID rather than introduce GID here.

Sure I'll change it. Honestly I couldn't figure out which term was the correct one. It seems like there are many different terms for this 😅

@@ -645,6 +646,12 @@ static dds_entity_t dds_create_reader_int (dds_entity_t participant_or_subscribe
dds_topic_allow_set_qos (tp);
dds_topic_unpin (tp);
dds_subscriber_unlock (sub);

#ifdef DDS_HAS_LTTNG_TRACING
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is going to make it in, we'll have to find a better way to specify trace points and to hide all the conditional stuff inside the TRACEPOINT macro (or whatever it ends being).

Looking at USDT as well, I think an attempt at supporting both is worth a try. For example, the TRACEPOINT use here could be mapped to DTRACE_PROBE3(TRACEPOINT_PROVIDER, (const void *)rd, rd->m_topic->m_name, guid.v). It may not always be this straightforward, of course, and no doubt there'll be cases where one makes sense and the other doesn't.

As usual, Sun did a rather nice job, and as a consequence dtrace has separate probe definitions from which you generate C (and you get type information via CTF that is completely unrelated to that abbreviation's use by LTTng). That does mean the probe definitions currently in tracing_lttng.h will need equivalent tracepoints defined in .d files.

Getting all that to play nicely and without a lot of duplication is going to take some work. Requiring a work of art at this stage is, I think, detrimental to progress, but a proof-of-concept (even a fragile one) supporting both with a single source and some preprocessing (in Python/Perl/Haskell/CLang-derived compiler/whatever) and with the ifdefs pushed into header files is something I would really like to see.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely can't promise I'll do all of that, but I can try to get a proof of concept for LTTng + "native" probes support at some point.

@christophebedard
Copy link
Contributor Author

@eboasson sorry for the late reply!

I have a problem with basing everything exclusively on LTTng, as I really want to (also) be able to use dtrace on macOS, for example.

I think you can use LTTng UST on macOS, so you can build the SDT probes with LTTng and you can use dtrace. Limitations aside, do you just want to avoid having to use LTTng?

That means probe definitions need to be a bit differently. This in itself is going to be tricky, so I would really like to see a PoC for first, or at least a sketch of one .

I can try something.

Based on what you noted in ros2/rmw_cyclonedds#324 (comment) concerning semaphores, an argument could be made that LTTng tracepoints would necessarily end up in less interesting places and fewer in number and that consequently it wouldn't be worth the effort. In that case, arguably, LTTng tracepoints may not be a good choice for Cyclone to begin with.

I'm not sure I understand. The restriction is for SDT probes only. Why would LTTng tracepoint necessarily end up in less interesting places?

That restriction does strike me as odd, as I haven't seen it in the LTTng documentation and instead found a remark that it uses a lock-free procedure for adding to the trace. I don't see how the two fit together.

AFAIU LTTng doesn't support SDT probe arguments simply because it's not currently implemented. It might be the case for semaphore support as well; I'm not sure.

eboasson and others added 26 commits September 8, 2021 18:30
This restores the directories used before the MinGW support was added.
For some reason, in the CI the VS and Xcode builds do store the output
in directories not suffixed with Release or Debug, but this is not
usually the case.  As a consequence, on a typical build with those build
systems, this test failed.

Signed-off-by: Erik Boasson <eb@ilities.com>
This fixes incompatibilities with bison 3.8:

* returning user-defined values from yyparse is not supported, this
  works aruond that by storing a more meaningful return value in the
  parser state object;

* the token name table generated by %token-table is deprecated and has
  been for quite some time, and it seems that now the accompanying
  mapping to internal token numbers is gone as well, this uses a hack to
  work around that.

A follow-up fix is needed to remove the use of the token name table
entirely.  The hack here is attractive only in that it provides a quick
solution for the failing CI runs on macOS.

Signed-off-by: Erik Boasson <eb@ilities.com>
Signed-off-by: Erik Boasson <eb@ilities.com>
The analyzer doesn't do a good job if there are no SAL2 annotations, but
those annotations are only supported by Visual Studio, the documentation
is very incomplete and it is (nearly) impossible to get them right
without actually compiling the code with Visual Studio.  The latter is
what it makes it really impossible to use them.

That makes dealing with false positives from the Visual Studio analzyer
far more trouble than it is worth (especially considering we're also
using the clang and gcc analyzers and Coverity).

Signed-off-by: Erik Boasson <eb@ilities.com>
Signed-off-by: Dennis Potman <dennis.potman@adlinktech.com>
Signed-off-by: Jeroen Koekkoek <jeroen@koekkoek.nl>
Signed-off-by: Jeroen Koekkoek <jeroen@koekkoek.nl>
The length returned by snprintf does not includes the ending
null. So the test must be changed to ensure that if the final length
if sadly 64, the library can be found all the time.

Also add little comments.

Signed-off-by: José Bollo <jose.bollo@iot.bzh>
A message is emitted to stderr when either the plugin can
not be loaded or the symbol 'generate' can't be located
in the loaded plugin.

Logging to sdterr is in wait of some better integration to
some other logging system.

Signed-off-by: José Bollo <jose.bollo@iot.bzh>
Signed-off-by: Erik Boasson <eb@ilities.com>
- The parser was not proceeding correctly in the case of the state
  being IDL_SCAN_ANNOTATION_APPL_SCOPED_NAME

Signed-off-by: Martijn Reicher <martijn.reicher@adlinktech.com>
- The parser was setting the node mask to IDL_ANY in stead of the
  correct mask for bool, char and string literals

Signed-off-by: Martijn Reicher <martijn.reicher@adlinktech.com>
Signed-off-by: Jeroen Koekkoek <jeroen@koekkoek.nl>
Signed-off-by: Dennis Potman <dennis.potman@adlinktech.com>
Case label destructors deleted the const-expr as opposed to
unreferencing it causing a use-after-free if the same enum
was used in more than one union.

Fixes eclipse-cyclonedds#946.

Signed-off-by: Jeroen Koekkoek <jeroen@koekkoek.nl>
Destructors are not executed for untyped midrule actions because Bison
does not know which destructor to run. annotation_appl nodes were not
cleaned up if the annotation_appl could not be successfully finalized.
Remove use of midrule actions in annotation_appl as some platforms still
ship with Bison 3.0.4.

Fixes eclipse-cyclonedds#950.

Signed-off-by: Jeroen Koekkoek <jeroen@koekkoek.nl>
…g 'C' as a language platform-specific macros are irrelevant for idlc, and this is a blocker for compiling mac universal binaries.
Signed-off-by: Erik Boasson <eb@ilities.com>
Signed-off-by: Erik Boasson <eb@ilities.com>
There is a maximum nesting depth for the configuration that is
determined by the internal configuration tables.  Simply aborting the
interpretation of the XML input on the first unrecognized element item
would guarantee never exceeding the maximum nesting depth.  As an error
handling strategy this is less user-friendly than ignoring some of the
input, attempting to interpret the remainder and reporting any further
issues.

This commit makes it abort the interpretation as soon as the maximum
nesting depth is reached, in all other cases it continues and reports as
many issues with the configuration as it can.

Signed-off-by: Erik Boasson <eb@ilities.com>
Signed-off-by: Erik Boasson <eb@ilities.com>
Signed-off-by: Erik Boasson <eb@ilities.com>
reicheratwork and others added 11 commits October 8, 2021 08:12
- This fixes eclipse-cyclonedds#968
- Add a byte swap to the participant GUID prefix

Signed-off-by: Martijn Reicher <martijn.reicher@adlinktech.com>
CMAKE_SOURCE_DIR always points to the top-level CMake project, so if
CycloneDDS was included as a gitmodule or pulled in using CMake
FetchContent, the CMake step would fail with errors like

```
CMake Error at cyclonedds/src/tools/idlc/CMakeLists.txt:52 (file):
file failed to open for reading (No such file or directory):
/home/robin/dds-project/src/ddsrt/include/getopt.h.in

CMake Error: File /home/robin/dds-project/src/ddsrt/src/getopt.c does not exist.
CMake Error at cyclonedds/src/tools/idlc/CMakeLists.txt:61 (configure_file):
configure_file Problem configuring file

CMake Error at cyclonedds/src/tools/idlc/CMakeLists.txt:95 (include):
include could not find load file:
/home/robin/dds-project/cmake/Modules/Generate.cmake
```

This commit replaces CMAKE_SOURCE_DIR with <PROJECT_NAME>_SOURCE_DIR, a
variable that always points to the source dir of the named project.

See:
https://cmake.org/cmake/help/v3.21/variable/CMAKE_SOURCE_DIR.html
https://cmake.org/cmake/help/v3.21/variable/PROJECT-NAME_SOURCE_DIR.html

Signed-off-by: Robin Lindén <_@robinlinden.eu>
Signed-off-by: Erik Boasson <eb@ilities.com>
Signed-off-by: Erik Boasson <eb@ilities.com>
gcc 11 warns that there is a conversion from 'long int'
to 'uint32_t' when assigning the value of PTHREAD_STACK_MIN
to tattr.stackSize.  It's pedantically correct; tattr.stackSize
is a uint32_t, and PTHREAD_STACK_MIN is either an integer (if
it is hard-coded), or the result of calling sysconf(), which
always returns a long.  In either case it is a signed value,
so we should explicitly cast it to uint32_t here.  This quiets
the warning.

Signed-off-by: Chris Lalancette <clalancette@openrobotics.org>
This makes idlc show an error message in case the maximum number
of instructions is reached. Because of 16-bits offset fields in the
instructions, the maximum number of instructions that can be used
is limited to 65k.

Signed-off-by: Dennis Potman <dennis.potman@adlinktech.com>
Signed-off-by: Erik Boasson <eb@ilities.com>
Signed-off-by: Erik Boasson <eb@ilities.com>
This suppresses deprecation warnings from OpenSSL 3.0. It is not a
proper fix, but the OpenSSL documentation on how to deal with the
deprecations is such that it will take some time to fix this.

Signed-off-by: Erik Boasson <eb@ilities.com>
Signed-off-by: Christophe Bedard <bedard.christophe@gmail.com>
Signed-off-by: Christophe Bedard <bedard.christophe@gmail.com>
@thijsmie
Copy link
Contributor

While I would actually like to merge this PR it's been left too long and has become unreviewable with the many conflicts and highly diverged branches. @christophebedard would you like to rebase your changes onto the current master or should I close this and defer LTTng to the TODO backlog?

@thijsmie thijsmie added the enhancement Enhance existing functionality label Jul 20, 2022
@christophebedard
Copy link
Contributor Author

Thanks for the follow up. We can close this for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhance existing functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants