Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

manual_control_setpoint split switches into new message manual_control_switches #9712

Closed
wants to merge 1 commit into from

Conversation

dagar
Copy link
Member

@dagar dagar commented Jun 18, 2018

This PR splits the manual_control_setpoint switch positions into a new message manual_control_switches. Then with the switches contained to a single message simple logic was added to require 2 identical consecutive frames before publishing a change (crude debounce). Additionally the message only publishes on change.

This was originally done in response to a crash back in June (#9595) the test flight team experienced where the kill switch was activated briefly in flight. The log didn't contain the actual switch change.

It's rare, but I still receive reports that can be attributed to blips in RC. Things like mid-air disarming (with RC arm on switch), uncommanded mode change, etc.

Future work

Refactor commander to only evaluate RC switches for mode changes and arming when this uORB message updates.

@dagar dagar force-pushed the pr-manual_control_setpoint-split branch from 0a6aa5a to 80b926a Compare October 8, 2018 15:59
@dagar dagar changed the title [DO NOT MERGE] manual_control_setpoint split manual_control_setpoint split switches into new message manual_control_switches Oct 8, 2018
@dagar dagar force-pushed the pr-manual_control_setpoint-split branch 2 times, most recently from 8312fcb to 0c5089d Compare October 8, 2018 16:54
@dagar dagar requested a review from a team October 9, 2018 02:32
@dagar dagar requested a review from bkueng October 9, 2018 13:57
@santiago3dr
Copy link

couple flights on the Pixracer (V4)
https://logs.px4.io/plot_app?log=f6cf001b-2bef-496b-bfd0-46cc5586e544
https://logs.px4.io/plot_app?log=0624efc7-4f1c-4f35-a7bb-f2fa543af873

manual flights going through stabilized, altitude and position modes; no issues

@bkueng
Copy link
Member

bkueng commented Oct 10, 2018

Then with the switches contained to a single message simple logic was added to require 2 identical consecutive frames before publishing a change (crude debounce).

Instead of (or in addition of) adding layers/logic on top, which may or may not help, I'd like a more structured approach to resolve this. There's no question that we need absolutely reliable RC.

Questions to answer:

  • are the glitches only a single frame or multiple?
  • and related: by how much do we lower the probability of things going wrong with such a change here?

Information that should be collected from affected setups:

  • What Pixhawk is used?
  • What exact receiver RC model is used, what protocol and preferable which firmware on it?
  • How is the wiring done? A picture helps
  • Does it happen under specific situations?

What I would like to see is a hardware setup that runs 24/7, checks for glitches and records all raw RC data. I offer to help implementing that.

@dagar
Copy link
Member Author

dagar commented Oct 15, 2018

Unfortunately all we have to go on is incomplete logs and anecdata. My hunch is many of the blips can ultimately be explained by less than perfect wiring, but I'd certainly like to know for sure.

We could keep most of this PR other than the line requiring 2 consecutive identical switch messages. Then at least we'd have all of the switch changes logged.
image

@dagar dagar force-pushed the pr-manual_control_setpoint-split branch from 0c5089d to e571ff7 Compare October 15, 2018 16:02
@dagar
Copy link
Member Author

dagar commented Oct 15, 2018

Crude debounce removed and rebased on master.
@bkueng could you review?

@dagar
Copy link
Member Author

dagar commented Oct 15, 2018

Flash limit hit again.
image

@LorenzMeier
Copy link
Member

LorenzMeier commented Oct 15, 2018

@dagar I’ve come to the conclusion to deprecate v2 at this point? Do a maintenance Release and drop it along with bootloader update instructions for HW that is v3 but with an old bootloader. Time to move on!

Copy link
Member

@bkueng bkueng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit on the fence here. I see the advantages (commander update only on mode switch, log all RC switches (but we cannot infer that they were due to single RC glitches)), but also disadvantages (generally more system overhead due to splitting into small messages and flight review breakage).

src/modules/sensors/rc_update.cpp Outdated Show resolved Hide resolved
@dagar
Copy link
Member Author

dagar commented Oct 16, 2018

I'm a bit on the fence here. I see the advantages (commander update only on mode switch, log all RC switches (but we cannot infer that they were due to single RC glitches)), but also disadvantages (generally more system overhead due to splitting into small messages and flight review breakage).

No we can't infer the switch was due to an RC glitch, but at least we can know it even changed. In crashes like #9595 it's difficult to be completely certain RC is to blame. It's obviously quite likely, but we can't completely rule out other subtle errors in commander because we don't have the data.

Taking a step back I see this as a fundamental split applicable to several areas throughout PX4. We have a mix of continuously changing data (50-100 Hz), and data that rarely changes (perhaps 10 times total). I'd like to deduplicate it as soon as possible to remove the burden from consumers and decrease our logging requirements while actually improving our ability to review flights. I'm all too familiar with the current uORB overhead at various levels, but there are solutions to that as well.

@dagar
Copy link
Member Author

dagar commented Oct 16, 2018

@dagar I’ve come to the conclusion to deprecate v2 at this point? Do a maintenance Release and drop it along with bootloader update instructions for HW that is v3 but with an old bootloader. Time to move on!

@LorenzMeier I have one idea for a relatively small push we could make that would reduce a lot of bloat while also simplifying things. Once we eventually burn through that I think considering deprecation is reasonable, but at the moment the amount of duplicated code is kind of egregious regardless of hitting a flash limit.

Copy link
Member

@bkueng bkueng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In crashes like #9595 it's difficult to be completely certain RC is to blame. It's obviously quite likely, but we can't completely rule out other subtle errors in commander because we don't have the data.

What about adding a 'state switch reason' to one of the commander outputs?

Taking a step back I see this as a fundamental split applicable to several areas throughout PX4. We have a mix of continuously changing data (50-100 Hz), and data that rarely changes (perhaps 10 times total). I'd like to deduplicate it as soon as possible to remove the burden from consumers and decrease our logging requirements while actually improving our ability to review flights. I'm all too familiar with the current uORB overhead at various levels, but there are solutions to that as well.

That is all good, and I raised it now because I want to avoid running into issues first. The topic size reduction is 6 bytes, which is less than 10%, so the effect on the log is marginal.

msg/manual_control_switches.msg Outdated Show resolved Hide resolved
src/modules/vtol_att_control/vtol_att_control_main.h Outdated Show resolved Hide resolved
@dagar dagar force-pushed the pr-manual_control_setpoint-split branch from 4979610 to a233853 Compare November 9, 2018 20:25
@dagar
Copy link
Member Author

dagar commented Nov 9, 2018

What about adding a 'state switch reason' to one of the commander outputs?

That's perhaps something to think about for the state machines in general, and definitely something we still need to improve for rejections.

That is all good, and I raised it now because I want to avoid running into issues first. The topic size reduction is 6 bytes, which is less than 10%, so the effect on the log is marginal.

The point is to be able to be able to capture 100% of the relevant changes (instead of a 5-10% sampling), the tiny reduction in log size is a distant secondary benefit.

@bkueng
Copy link
Member

bkueng commented Nov 12, 2018

The point is to be able to be able to capture 100% of the relevant changes (instead of a 5-10% sampling), the tiny reduction in log size is a distant secondary benefit.

I generally agree with that, but taking occational dropouts into account, the sampling is still better.

@dagar
Copy link
Member Author

dagar commented Nov 12, 2018

Going back to where we started (#9595 crash) I'm reasonably certain there was a blip in RC that changed the switch interpretations, then immediately back without anything being captured in the log.

I'm proposing this change so that we can potentially capture that behaviour in real testing, then consider adding a simple debounce if appropriate.

Can we move forward with this?

@bkueng
Copy link
Member

bkueng commented Nov 12, 2018

Going back to where we started (#9595 crash) I'm reasonably certain there was a blip in RC that changed the switch interpretations, then immediately back without anything being captured in the log.

The change here has too many side-effects to me, without adding much benefit. This is why I suggested adding the state switching reason to commander instead. Do you think it's not enough?
I still think the better way to debug the original issue would be on the bench, then we can capture all the data we receive. A while ago I had a similar problem (random RC glitches), and I noticed the RC was not plugged in properly, so I think we can actually reproduce similar behavior.

@dagar
Copy link
Member Author

dagar commented Nov 12, 2018

Additional commander logging is usually a good thing, but switches are evaluated in multiple different places within commander (main state, arm/disarm, failsafe escape, kill) as well as separately in VTOL for transitions and MC for landing gear.

I get semi-regular reports from people with issues (sometimes fatal) that sound like they might be similar RC glitches, but we don't have the data to know. Reproducing the error on the bench can be out of reach when helping someone remotely or when their vehicle has already been destroyed. Splitting data structures along these lines is easy, allows us to optimize px4 architecture later, and makes it significantly more likely we'll already have the necessary data from normal real world logs when it matters.

@bkueng
Copy link
Member

bkueng commented Nov 13, 2018

The problem with this approach is that it will not allow as to do deeper changes on the RC parsing level.
If it's only about gathering data specifically for such cases, I suggest we limit the changes to do exactly that (i.e. leave the existing messages unchanged). This will allow us to do a better decision once we actually have data.

In general you simply have not convinced me for the changes in this PR. I'll need more/better arguments to be convinced.

@dagar
Copy link
Member Author

dagar commented Nov 13, 2018

The problem with this approach is that it will not allow as to do deeper changes on the RC parsing level.

How so? Again all this PR is currently doing is deduplicating data. If you're concerned about the potential next step of debouncing at this level it would be easy and cheap to additionally log the original version without debounce.

The other thing to note is that this isn't a compromise for the sake of logging and debugging. The manual control setpoint and switch positions are different things used by different parts of the system. We're not losing anything by splitting on these boundaries.

If it's only about gathering data specifically for such cases, I suggest we limit the changes to do exactly that (i.e. leave the existing messages unchanged). This will allow us to do a better decision once we actually have data.

I'm not really sure what you're suggesting, we can't log the original message at full rate by default, and we can't necessarily ask people that have crashed to try again. Please try to keep in mind we've still done nothing to address the original issue (#9595).

@dagar dagar force-pushed the pr-manual_control_setpoint-split branch from a233853 to 58ca913 Compare November 13, 2018 15:07
@bkueng
Copy link
Member

bkueng commented Nov 15, 2018

How so? Again all this PR is currently doing is deduplicating data.

Not the PR, but the approach you are taking. All you can do by logging at that level is adding things like debouncing, whereas we should aim at resolving it at the RC parsing level (I'm not saying that it will be possible for sure).

The other thing to note is that this isn't a compromise for the sake of logging and debugging.

As I wrote above, the current PR is worse in terms of log analysis because of potential dropouts. The situation would already be better if the switches were published at least once a second.

The manual control setpoint and switch positions are different things used by different parts of the system.

Ok, let's look at this:

  • Both are used by:
    • mavlink
    • commander
    • mc_pos_control
  • only sticks:
    • vmount
    • mc_acc_control
    • fw_att_control
    • gnd_att_control
    • fw_pos_control
  • only switches:
    • vtol_att_control

So only a single module only requires the switches, and several use both. I see the logical reason for splitting the message, but there's simply no bigger benefit, and it creates churn.

We're not losing anything by splitting on these boundaries.

It's not about losing anything. If we split up the system into too many small messages, we end up with a more complex and bloated solution. Surely we have/had cases where splitting is the obivous thing to do as messages contained too many different things, but each case needs to be evaluated with care.

I'm not really sure what you're suggesting, we can't log the original message at full rate by default, and we can't necessarily ask people that have crashed to try again.

Of course not. My suggestion was to add something to the log in case there is a single-frame switch change (or 2 or 3 for that matter), for example a printf warning.

@dagar
Copy link
Member Author

dagar commented Nov 15, 2018

We can keep going back and forth spending a disproportionate amount of time on this fairly mundane change, but the bottom line is that we need to set something in motion that will help us get to a real solution later. I'm proposing a small, safe change that will give us some insight to act on when these things occur. If you disagree then please open an alternative or propose a concrete path forward.

Yesterday I was talking to another user that had the kill switch falsely trigger in air briefly. No "glitch" found in input_rc, rc_channels (not logged), or manual_control_setpoint data and RC otherwise "seemed" fine. They disabled the kill switch, lost a little faith in our platform, and kept going. We can't keep ignoring these problems.

@dagar dagar force-pushed the pr-manual_control_setpoint-split branch from 3d7949c to 01d2296 Compare January 22, 2020 02:27
@PX4 PX4 deleted a comment from stale bot Jan 22, 2020
@dagar
Copy link
Member Author

dagar commented Jan 22, 2020

Rebased on master.

@stale stale bot added the stale label Apr 23, 2020
@PX4 PX4 deleted a comment from stale bot Apr 23, 2020
@stale stale bot removed the stale label Apr 23, 2020
@stale
Copy link

stale bot commented Jul 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. Thank you for your contributions.

@stale stale bot added the stale label Jul 25, 2020
@dagar dagar closed this Aug 6, 2020
@LorenzMeier LorenzMeier deleted the pr-manual_control_setpoint-split branch January 18, 2021 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants