manual_control_setpoint split switches into new message manual_control_switches #9712
Conversation
force-pushed from 0a6aa5a to 80b926a
force-pushed from 8312fcb to 0c5089d
A couple of flights on the Pixracer (V4): manual flights going through stabilized, altitude, and position modes; no issues.
Instead of (or in addition to) adding layers/logic on top, which may or may not help, I'd like a more structured approach to resolving this. There's no question that we need absolutely reliable RC. Questions to answer:
Information that should be collected from affected setups:
What I would like to see is a hardware setup that runs 24/7, checks for glitches, and records all raw RC data. I offer to help implement that.
force-pushed from 0c5089d to e571ff7
Crude debounce removed and rebased on master.
@dagar I’ve come to the conclusion that we should deprecate v2 at this point. Do a maintenance release and drop it, along with bootloader update instructions for HW that is v3 but has an old bootloader. Time to move on!
I'm a bit on the fence here. I see the advantages (commander updates only on a mode switch change, and all RC switch changes are logged, though we cannot infer that they were due to single RC glitches), but also disadvantages (generally more system overhead due to splitting into small messages, and Flight Review breakage).
No, we can't infer the switch was due to an RC glitch, but at least we can know it even changed. In crashes like #9595 it's difficult to be completely certain RC is to blame. It's obviously quite likely, but we can't completely rule out other subtle errors in commander because we don't have the data. Taking a step back, I see this as a fundamental split applicable to several areas throughout PX4. We have a mix of continuously changing data (50-100 Hz), and data that rarely changes (perhaps 10 times total). I'd like to deduplicate it as soon as possible to remove the burden from consumers and decrease our logging requirements, while actually improving our ability to review flights. I'm all too familiar with the current uORB overhead at various levels, but there are solutions to that as well.
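To make the kind of split being argued for concrete, here is a minimal standalone sketch (the struct and field names are illustrative, not the actual uORB message definitions): the continuously changing stick data keeps being published every frame, while the rarely changing switch data gets its own topic that is only published when a value changes.

```cpp
// Standalone sketch of the high-rate / low-rate split; names are
// illustrative, not the actual uORB message definitions.
#include <cstdint>
#include <cstdio>

// Continuously changing data: published with every RC frame (50-100 Hz).
struct ManualControlSetpointSketch {
	float roll{0.f};
	float pitch{0.f};
	float yaw{0.f};
	float throttle{0.f};
};

// Rarely changing data: published only when a value actually changes.
struct ManualControlSwitchesSketch {
	uint8_t mode_slot{0};
	uint8_t kill_switch{0};

	bool operator!=(const ManualControlSwitchesSketch &o) const
	{
		return mode_slot != o.mode_slot || kill_switch != o.kill_switch;
	}
};

int main()
{
	ManualControlSwitchesSketch last_published{};
	unsigned setpoint_publications = 0;
	unsigned switch_publications = 0;

	// Simulate 100 RC frames: the sticks move constantly but the mode
	// switch is flipped exactly once.
	for (unsigned frame = 0; frame < 100; frame++) {
		ManualControlSetpointSketch sp;
		sp.roll = 0.01f * frame;
		(void)sp;
		setpoint_publications++; // sticks: publish every frame

		ManualControlSwitchesSketch sw;
		sw.mode_slot = (frame < 50) ? 1 : 2;

		if (sw != last_published) {
			last_published = sw;
			switch_publications++; // switches: publish only on change
		}
	}

	std::printf("setpoint publications: %u, switch publications: %u\n",
		    setpoint_publications, switch_publications);
}
```

In this toy run the stick topic is published 100 times while the switch topic is published only twice, which is the deduplication effect described above.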
@LorenzMeier I have one idea for a relatively small push we could make that would reduce a lot of bloat while also simplifying things. Once we eventually burn through that, I think considering deprecation is reasonable, but at the moment the amount of duplicated code is kind of egregious, regardless of hitting a flash limit.
> In crashes like #9595 it's difficult to be completely certain RC is to blame. It's obviously quite likely, but we can't completely rule out other subtle errors in commander because we don't have the data.
What about adding a 'state switch reason' to one of the commander outputs?
> Taking a step back I see this as a fundamental split applicable to several areas throughout PX4. We have a mix of continuously changing data (50-100 Hz), and data that rarely changes (perhaps 10 times total). I'd like to deduplicate it as soon as possible to remove the burden from consumers and decrease our logging requirements while actually improving our ability to review flights. I'm all too familiar with the current uORB overhead at various levels, but there are solutions to that as well.
That is all good, and I raised it now because I want to avoid running into issues first. The topic size reduction is 6 bytes, which is less than 10%, so the effect on the log is marginal.
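For reference, the 'state switch reason' suggested above could be as small as one extra enum field on an existing commander output. A hedged sketch with placeholder names (these are not existing PX4 message fields):

```cpp
// Standalone sketch of a "state switch reason" field; names are placeholders,
// not existing PX4 message fields.
#include <cstdint>
#include <cstdio>

enum class StateSwitchReason : uint8_t {
	NONE = 0,
	RC_MODE_SWITCH,   // pilot moved the mode switch
	RC_LOSS_FAILSAFE, // RC signal lost
	GCS_COMMAND,      // commanded over the ground station link
	BATTERY_FAILSAFE,
};

struct CommanderStatusSketch {
	uint8_t nav_state{0};
	StateSwitchReason last_switch_reason{StateSwitchReason::NONE};
};

int main()
{
	CommanderStatusSketch status;

	// On every main-state change, record why it happened so that the log
	// shows the trigger even if the RC topic itself wasn't sampled then.
	status.nav_state = 2; // e.g. position control
	status.last_switch_reason = StateSwitchReason::RC_MODE_SWITCH;

	std::printf("nav_state=%u reason=%u\n", (unsigned)status.nav_state,
		    (unsigned)status.last_switch_reason);
}
```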
force-pushed from 4979610 to a233853
That's perhaps something to think about for the state machines in general, and definitely something we still need to improve for rejections.
The point is to be able to capture 100% of the relevant changes (instead of a 5-10% sampling); the tiny reduction in log size is a distant secondary benefit.
I generally agree with that, but taking occasional dropouts into account, the sampling is still better.
Going back to where we started (the #9595 crash), I'm reasonably certain there was a blip in RC that changed the switch interpretations and then immediately changed back, without anything being captured in the log. I'm proposing this change so that we can potentially capture that behaviour in real testing, then consider adding a simple debounce if appropriate. Can we move forward with this?
The change here has too many side effects for me, without adding much benefit. This is why I suggested adding the state switching reason to commander instead. Do you think it's not enough?
Additional commander logging is usually a good thing, but switches are evaluated in multiple different places within commander (main state, arm/disarm, failsafe escape, kill) as well as separately in VTOL for transitions and in MC for landing gear. I get semi-regular reports from people with issues (sometimes fatal) that sound like they might be similar RC glitches, but we don't have the data to know. Reproducing the error on the bench can be out of reach when helping someone remotely or when their vehicle has already been destroyed. Splitting data structures along these lines is easy, allows us to optimize the PX4 architecture later, and makes it significantly more likely we'll already have the necessary data from normal real-world logs when it matters.
The problem with this approach is that it will not allow us to make deeper changes at the RC parsing level. In general, you simply have not convinced me of the changes in this PR. I'll need more/better arguments to be convinced.
How so? Again all this PR is currently doing is deduplicating data. If you're concerned about the potential next step of debouncing at this level it would be easy and cheap to additionally log the original version without debounce. The other thing to note is that this isn't a compromise for the sake of logging and debugging. The manual control setpoint and switch positions are different things used by different parts of the system. We're not losing anything by splitting on these boundaries.
I'm not really sure what you're suggesting; we can't log the original message at full rate by default, and we can't necessarily ask people that have crashed to try again. Please try to keep in mind that we've still done nothing to address the original issue (#9595).
force-pushed from a233853 to 58ca913
Not the PR, but the approach you are taking. All you can do by logging at that level is add things like debouncing, whereas we should aim at resolving it at the RC parsing level (I'm not saying that it will be possible for sure).
As I wrote above, the current PR is worse in terms of log analysis because of potential dropouts. The situation would already be better if the switches were published at least once a second.
Ok, let's look at this:
So only a single module requires just the switches, and several use both. I see the logical reason for splitting the message, but there's simply no bigger benefit, and it creates churn.
It's not about losing anything. If we split up the system into too many small messages, we end up with a more complex and bloated solution. Surely we have/had cases where splitting is the obvious thing to do because messages contained too many different things, but each case needs to be evaluated with care.
Of course not. My suggestion was to add something to the log in case there is a single-frame switch change (or 2 or 3 frames, for that matter), for example a printf warning.
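A rough standalone sketch of that suggestion, i.e. warning when a switch value is held for only a frame or two before reverting; the class name and frame threshold are illustrative assumptions, not PX4 code:

```cpp
// Standalone sketch: warn when a switch value changes and reverts within a
// few frames. The names and the frame threshold are illustrative.
#include <cstdint>
#include <cstdio>

class SwitchGlitchDetector
{
public:
	explicit SwitchGlitchDetector(unsigned max_glitch_frames = 3)
		: _max_glitch_frames(max_glitch_frames) {}

	// Call once per RC frame with the decoded switch value.
	void add_frame(uint8_t switch_value)
	{
		if (switch_value != _current_value) {
			// value changed: check how long the previous value lasted
			if (_frames_at_value > 0 && _frames_at_value <= _max_glitch_frames
			    && switch_value == _previous_value) {
				std::printf("WARN: switch glitch, value %u held for only %u frame(s)\n",
					    (unsigned)_current_value, _frames_at_value);
			}

			_previous_value = _current_value;
			_current_value = switch_value;
			_frames_at_value = 0;
		}

		_frames_at_value++;
	}

private:
	unsigned _max_glitch_frames;
	uint8_t _current_value{0};
	uint8_t _previous_value{0};
	unsigned _frames_at_value{0};
};

int main()
{
	SwitchGlitchDetector det;

	for (int i = 0; i < 10; i++) { det.add_frame(1); } // steady value
	det.add_frame(2); // single-frame blip
	det.add_frame(1); // reverts -> warning printed
}
```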
We can keep going back and forth, spending a disproportionate amount of time on this fairly mundane change, but the bottom line is that we need to set something in motion that will help us get to a real solution later. I'm proposing a small, safe change that will give us some insight to act on when these things occur. If you disagree, please propose an alternative or a concrete path forward. Yesterday I was talking to another user who had the kill switch falsely trigger briefly in the air. No "glitch" was found in input_rc, rc_channels (not logged), or manual_control_setpoint data, and RC otherwise "seemed" fine. They disabled the kill switch, lost a little faith in our platform, and kept going. We can't keep ignoring these problems.
force-pushed from 0861ec7 to 225f8ee
force-pushed from 225f8ee to 3d7949c
force-pushed from 3d7949c to 01d2296
Rebased on master.
This issue has been automatically marked as stale because it has not had recent activity. Thank you for your contributions.
This PR splits the manual_control_setpoint switch positions into a new message, manual_control_switches. With the switches contained in a single message, simple logic was added to require 2 identical consecutive frames before publishing a change (crude debounce). Additionally, the message only publishes on change.

This was originally done in response to a crash back in June (#9595) that the test flight team experienced, where the kill switch was activated briefly in flight. The log didn't contain the actual switch change.
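A minimal standalone sketch of the crude debounce plus publish-on-change logic described above (the Switches struct and publish() are placeholders, not the PR's actual implementation):

```cpp
// Standalone sketch: 2-frame debounce with publish-on-change.
// The Switches struct and publish() are placeholders, not the PR's code.
#include <cstdint>
#include <cstdio>

struct Switches {
	uint8_t mode{0};
	uint8_t kill{0};

	bool operator==(const Switches &o) const { return mode == o.mode && kill == o.kill; }
	bool operator!=(const Switches &o) const { return !(*this == o); }
};

class SwitchDebounce
{
public:
	// Call once per RC frame with the raw decoded switch positions.
	void process(const Switches &raw)
	{
		if (raw == _previous_raw) {
			// two identical consecutive frames: value is accepted
			if (raw != _published) {
				_published = raw;
				publish(_published); // publish only when the accepted value changed
			}
		}

		_previous_raw = raw;
	}

private:
	void publish(const Switches &s)
	{
		std::printf("publish: mode=%u kill=%u\n", (unsigned)s.mode, (unsigned)s.kill);
	}

	Switches _previous_raw{};
	Switches _published{};
};

int main()
{
	SwitchDebounce db;
	db.process({1, 0}); // seen once: not yet confirmed
	db.process({1, 0}); // confirmed by a second identical frame -> published
	db.process({1, 1}); // single-frame kill blip: never confirmed
	db.process({1, 0}); // back to normal; the blip was never published
	db.process({1, 0}); // unchanged -> no new publication
}
```

A single-frame glitch like the kill blip in the example is filtered out, and consumers only see a publication when the debounced value actually changes.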
It's rare, but I still receive reports that can be attributed to blips in RC. Things like mid-air disarming (with RC arming on a switch), uncommanded mode changes, etc.
Future work
Refactor commander to only evaluate RC switches for mode changes and arming when this uORB message updates.
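A hedged sketch of how that refactor could look, using a simplified stand-in for a uORB subscription (the Subscription class and the field names are illustrative, not the real API):

```cpp
// Standalone sketch: commander only re-evaluates RC switches when the
// switches topic actually updates. Subscription is a simplified stand-in
// for a uORB subscription, not the real API.
#include <cstdint>
#include <cstdio>

struct ManualControlSwitchesSketch {
	uint8_t mode_slot{0};
	uint8_t arm_switch{0};
	uint8_t kill_switch{0};
};

class Subscription
{
public:
	// Test hook that simulates an incoming publication.
	void inject(const ManualControlSwitchesSketch &msg) { _data = msg; _updated = true; }

	// Copies the latest data out if there was a new publication.
	bool update(ManualControlSwitchesSketch *dst)
	{
		if (!_updated) { return false; }

		*dst = _data;
		_updated = false;
		return true;
	}

private:
	ManualControlSwitchesSketch _data{};
	bool _updated{false};
};

static void commander_iteration(Subscription &switches_sub)
{
	ManualControlSwitchesSketch sw;

	// Evaluate mode / arming / kill logic only when the switches changed,
	// instead of re-checking them on every stick sample.
	if (switches_sub.update(&sw)) {
		std::printf("switch update: mode_slot=%u arm=%u kill=%u -> evaluate mode/arming\n",
			    (unsigned)sw.mode_slot, (unsigned)sw.arm_switch, (unsigned)sw.kill_switch);
	}
}

int main()
{
	Subscription sub;

	commander_iteration(sub); // no update: nothing to evaluate
	sub.inject({2, 0, 0});
	commander_iteration(sub); // evaluates once for the new switch state
	commander_iteration(sub); // no further evaluation until the next change
}
```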