Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide human-readable definitions/alarms for RCL(C) initialisation errors #128

Open
ted-miller opened this issue Aug 16, 2023 · 17 comments
Assignees
Labels
enhancement New feature or request todo Not an issue, just a TODO
Milestone

Comments

@ted-miller
Copy link
Collaborator

@ted-miller: we might want to consider reporting some of these RCL(C) initialisation errors, or defining some subcodes for them.

Especially the NODE, PUBLISHER, SUBSCRIPTION, etc categories seem to be caused by things which could often be solved by users without needing to post on the tracker here.

Originally posted by @gavanderhoorn in #125 (reply in thread)

@gavanderhoorn gavanderhoorn added this to the 0.1.2 milestone Aug 18, 2023
@gavanderhoorn gavanderhoorn added enhancement New feature or request todo Not an issue, just a TODO labels Aug 18, 2023
@gavanderhoorn
Copy link
Collaborator

gavanderhoorn commented Aug 18, 2023

Looking into this a bit: this wouldn't be too much work to implement, but I'm trying to figure out a good place for these kinds of "RCL(C) initialisation errors/alarms" in ErrorHandling.h.

We don't really have a main alarm category for those I believe:

//===================================
//Main Codes
//===================================
typedef enum
{
ALARM_TASK_CREATE_FAIL = 8010,
ALARM_ASSERTION_FAIL,
ALARM_ALLOCATION_FAIL,
ALARM_CONFIGURATION_FAIL,
ALARM_INFORM_JOB_FAIL,
ALARM_DAT_FILE_PARSE_FAIL,
} ALARM_MAIN_CODE;

The issue title already refers to these kinds of problems as "initialisation errors", so it doesn't seem like they would fit under ALARM_CONFIGURATION_FAIL or ALARM_TASK_CREATE_FAIL. And we only use ALARM_ASSERTION_FAIL for actual assertions, which don't get much text on the PP.

do we still have 'space' for a category dedicated to RCLC_INIT errors/alarms?

We could then raise a regular alarm and provide some descriptive text as part of it.


Edit: we could also make it a 'micro-ROS' category.

@gavanderhoorn
Copy link
Collaborator

As to subcodes: initially I thought we could perhaps opt to just 'forward' the various RCL_RET_* defined codes.

However, there is no guarantee those values will never change and that would mean we'd have to keep track of that and update our documentation if that happens.

Two approaches I've come up with:

  1. define a single subcode, reuse one of the alarm categories, and include the RCL(C) error code as part of the message text

  2. define (something like) an ALARM_RCL_RCLC_FAIL category, then (re)define relevant (perhaps all?) RCL_RET_* as subcodes (in a new typedef enum). RCL(C) API failures could then be posted to the PP:

    mpSetAlarm(ALARM_RCL_RCLC_FAIL, "Invalid node name",
        SUBCODE_RCL_RCLC_FAIL_NODE_INVALID_NAME);

the former would be trivial to implement (although we need to figure out which category to use).

The latter would be a little more work, but would make RCL(C) errors more 'explicit' and we could document each of them individually.

I first preferred the second approach, but thinking about it, the former is less work, gets us what we're looking for (a way to document at least some of the RCL(C) initialisation errors/return codes when they could potentially be solved by the user) and wouldn't require adding 30 new subcodes and a new alarm category.

We could still document individual values (ie: 202) in the troubleshooting guide. The entry point would just be a single alarm + subcode, and the text would clarify which RCL(C) error it really is.

@ted-miller
Copy link
Collaborator Author

I'm in favor of the second option. I think that the purpose of raising these new alarms is to make it easier for the user the debug. In that regard, providing an explicit definition of the error (as opposed to some number that must be looked up) is much more user-friendly.

@gavanderhoorn
Copy link
Collaborator

It would mean adding quite a few new subcodes, with a lot of documentation to write as well.

@gavanderhoorn
Copy link
Collaborator

I'm going to postpone this to after 0.2.0, which would focus on updating the underlying infrastructure (newer Humble packages, Iron compatibility).

@gavanderhoorn gavanderhoorn modified the milestones: 0.1.2, untargeted Oct 3, 2023
@gavanderhoorn
Copy link
Collaborator

It would mean adding quite a few new subcodes, with a lot of documentation to write as well.

An additional consideration: IIUC, subcodes are limited to a UCHAR, which would mean a maximum of 255.

I realise there are only so many rcl(c) errors we might want to directly map, but they'd still be taking up valuable 'space'.

@ted-miller ted-miller modified the milestones: untargeted, 0.2.0 Mar 1, 2024
@ted-miller
Copy link
Collaborator Author

I just had another occurrence of this and wasted a good amount of time trying to remember how to track down the rcl error codes.

This shouldn't be difficult to implement, so I retargeted the next milestone.

@gavanderhoorn
Copy link
Collaborator

Should this be 0.2.1 instead?

0.2.0 was really intended to be just about updating libmicroros and adding Iron support.

@ted-miller
Copy link
Collaborator Author

Sure. I just want to get it on the todo list.

@ted-miller ted-miller modified the milestones: 0.2.0, 0.2.1 Mar 1, 2024
@jimmy-mcelwain jimmy-mcelwain self-assigned this Aug 29, 2024
@jimmy-mcelwain
Copy link
Collaborator

I am working on this, but I wanted to point out that the alarm will no longer contain information about where the problem happened that it currently does. For example, right now we have

ret = rclc_service_init_default(&g_serviceQueueTrajPoint, &g_microRosNodeInfo.node, type_support, SERVICE_NAME_QUEUE_TRAJ_POINT);
motoRosAssert((ret == RCL_RET_OK), SUBCODE_FAIL_CREATE_SERVICE_QUEUE_POINT);

The subcode here shows the user where the error happened (initializing the /queue_traj_point service in this case) but it does not tell the user what the problem is.

The proposed solution would describe to the user what the problem is, but it would fail to describe where that problem happened. So if the user receives an alarm with the message "Invalid node name", the user may not know whether which node this applies to.

There is not enough space in the alarm message (32 chars) to fully describe what went wrong. The debug broadcast could be used to give more detail, but the error message won't contain all the information that the user needs to solve it. It doesn't match the principle that @ted-miller describes here

Also, should I include "RCL/RCLC" in every alarm to make it clear, or can we assume that the user will look up the error code and see that the main alarm code is specifically for RCL/RCLC? This isn't as important, but I have been including "RCL/RCLC" in each alarm message, and it would be nice to have an extra few characters to use to write the actual alarm message.

@ted-miller
Copy link
Collaborator Author

As we discussed, you can use multiple concurrent alarms to provide additional data.

I'd suggest ommiting the "RCL/RCLC", as that's half your message. We don't need users to know that the code came from that layer necessarily. They can look up the code/message in our troubleshooting guide. Then we can provide additional detail there.

@gavanderhoorn
Copy link
Collaborator

@jimmy-mcelwain: would you already have pushed your changes somewhere?

RCL(C) provides some ways of retrieving a humand readable error message, so I'm curious what your approach has been so far.

We would really want to avoid defining a set of (shadow) codes ourselves I believe.

As we discussed, you can use multiple concurrent alarms to provide additional data.

I was indeed also going to suggest something like that.

@jimmy-mcelwain
Copy link
Collaborator

I just pushed a branch report_rcl_errors. It is not complete.

I did define a set of codes shadow codes myself. The issue is that the RCL_RET error codes defined in types.h range from 0 to 3001, but our alarm subcodes can only go up to 256.

Similarly, the alarm messages have a character limit of 32, so I have been manually creating them based on documentation rather than using any provided human-readable messages. That being said I can still print the provided string (I believe using rcutils_get_error_string()) to the debug log

And while I can raise multiple alarms, something will have to change. As of right now motoRosAssert and motoRosAssert_withMsg both print to the debug log forever if the assertion is false/the alarms are triggered. If we want to trigger multiple alarms, I am thinking that instead of doing this, each "assert" function should just check the condition and if it needs to alarm, puts it onto a queue of alarms and then there is a RaiseAlarms() function or something similar that gets called after the assert conditions are checked. This function will raise the alarms in the queue and print to the debug log forever. That way, all the alarms and the full group of debug messages for alarms can be printed.

@jimmy-mcelwain
Copy link
Collaborator

There is also a bit of overlap with #268

@gavanderhoorn
Copy link
Collaborator

@jimmy-mcelwain: thanks for looking into this.

Looking at main...report_rcl_errors though, I'm wondering whether this approach 'scales'. You already have 38 new subcodes and this basically sets us up to keep track of what RCL and RCLC are doing and keep our shadow set up-to-date/matching.

We also need to keep Ros_ErrorHandling_RCL_Ret_Populate(..) up-to-date.

I know @ted-miller wrote in #128 (comment) he prefers this approach over the alternative I described in #128 (comment), but that latter approach would have meant one or two additional subcodes and then including the RCL(C) error code in the alarm text.

Something like this for instance:

ALARM 80XX
 RCL(C) API error: ZZ
[YY]

A concurrently posted (fatal) alarm would provide the info on where this happened, and that alarm would have its own unique subcode to aid debugging (just as we now have).

We would document the values of ZZ which are user serviceable (similar to what we do in Alarm: 8011[23 - 54]) and the rest would essentially be treated as they are now: conditions preventing MotoROS2 from successfully initialising/working which would require checking the debug log and/or posting an issue on our support forum (with the debug log attached).

That would seem to be much less work (although we'd still have to update the documentation, but we'd have to do that in any case) and it wouldn't require us to keep anything up-to-date -- except the documentation.

I also know @ted-miller prefers alarm texts which include a hint on how to solve the issue that caused it to be raised, but 32 chars are just not enough for that in these cases, and we already have many alarms which don't do that.

In addition, for many of these RCL(C) issues, users can't do anything anyway, except reach out to us. As an example: SUBCODE_RCL_RCLC_FAIL_WRONG_LEXEME. This has Wrong lexeme as associated message. Fixing that would require information about the context and control flow up to that point, as it's unlikely an error due to incorrect input.

@ted-miller
Copy link
Collaborator Author

A concurrently posted (fatal) alarm would provide the info on where this happened, and that alarm would have its own unique subcode to aid debugging (just as we now have).
We would document the values of ZZ which are user serviceable

That's not the ideal scenario. But you make a valid point regarding maintainability. I could live with this.

@jimmy-mcelwain
Copy link
Collaborator

I pushed another commit to the branch which removed the shadow codes. I created a function motoRos_RCLAssertOK() which sets the alarm and writes debug messages for the RCL(C) failures and as well as the alarm and debug messages for the already existing motoRosAssert() statements. There is some code duplication, but this seems like the easiest way to have the debug broadcast both alarm messages forever rather than just one or the other. I haven't actually tested it yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request todo Not an issue, just a TODO
Projects
None yet
Development

No branches or pull requests

3 participants