Secondary manifest not sent in some cases #89

Open
jsrc27 opened this issue Aug 23, 2022 · 8 comments

Comments

@jsrc27 (Collaborator) commented Aug 23, 2022

There was an issue with a customer of ours where the keys for their secondary got zeroed out. When an update was then sent out, it could never complete: without the proper keys, the secondary could not produce a valid signed manifest for itself, so the manifest sent to the server was incomplete. The Director rightfully held the status for this device as "update pending", since it never received a complete manifest marking the update as "done". It took manual intervention to get the device out of this stuck pending state.

Now the issue with keys getting zeroed out was solved here: #82

But this still sparked a discussion on what to do in general if a secondary can't produce a manifest for whatever reason, since a device would then end up in the same state and require manual intervention. There is also an in-code comment suggesting this behavior has been brought up before: https://github.com/uptane/aktualizr/blob/master/src/libaktualizr/primary/sotauptaneclient.cc#L353
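To make the failure mode concrete, here is a minimal sketch (hypothetical types and names, not aktualizr's actual code) of the Primary-side behavior the linked comment is about: a Secondary whose signed manifest can't be verified is simply left out of the device manifest, so the Director only ever sees an incomplete manifest and keeps the update pending.

```cpp
// Hypothetical illustration only; aktualizr's real manifest assembly lives in
// sotauptaneclient.cc and uses different types.
#include <map>
#include <string>
#include <vector>

struct EcuManifest {
  std::string ecu_serial;
  std::string signed_payload;  // version report signed with the ECU's own key
  bool signature_valid;        // stand-in for real signature verification
};

struct DeviceManifest {
  std::vector<EcuManifest> ecu_reports;
  // Note: there is no field that says "ECU X exists but could not sign its
  // report", which is the gap discussed in this issue.
};

DeviceManifest AssembleDeviceManifest(const std::map<std::string, EcuManifest>& secondaries) {
  DeviceManifest manifest;
  for (const auto& entry : secondaries) {
    const EcuManifest& report = entry.second;
    if (!report.signature_valid) {
      // Current behavior: drop the report and carry on. The Director never
      // learns why this ECU is missing from the manifest.
      continue;
    }
    manifest.ecu_reports.push_back(report);
  }
  return manifest;
}
```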

After discussing with @cajun-rat & @tkfu, we decided to bring this here for additional discussion, in the hope that there might be some way to improve this behavior that still aligns with the principles of Uptane.

@pattivacek (Collaborator) commented Aug 24, 2022

> Now the issue with keys getting zeroed out was solved here: #82

Glad y'all fixed that. Sorry, the ManagedSecondary implementation was/is a little less robust than I'd like. aktualizr-secondary (IP/POSIX Secondaries) are better, but they were also originally just a demo, and real-world use revealed some issues with time that were addressed quickly.

> But this still sparked a discussion on what to do in general if a secondary can't produce a manifest for whatever reason, since a device would then end up in the same state and require manual intervention.

Yeah, there are some danger zones where the only really safe option is manual recovery. I also deal with this on occasion. What would your ideal recovery mechanism be? Is it a hammer that works around Uptane? In any case, also consider raising the point more directly with the Uptane brain trust. For now I'll tag @JustinCappos and @trishankkarthik in case they have opinions.

> There is also an in-code comment suggesting this behavior has been brought up before: https://github.com/uptane/aktualizr/blob/master/src/libaktualizr/primary/sotauptaneclient.cc#L353

I no longer have access to the ticket, unfortunately, but I agree: the comment implies we were aware that we needed to inform the Director somehow so that a decision could then be made on what to do. That would be a good first step, but it still doesn't actually address recovery.

@jsrc27 (Collaborator, Author) commented Aug 24, 2022

> What would your ideal recovery mechanism be?

Honestly, we kind of hit a dead end on this in our own discussion before coming here. If a secondary ECU can't send a manifest due to some issue, then it's probably fair to say that the ECU requires recovery outside of aktualizr. What we were more interested in is that the Director never gets any indication or signal of the problem. We went back and forth on whether it is even correct for aktualizr to do anything in this case, or whether this needs to be handled implicitly server-side.

Though @tkfu and @cajun-rat had their own opinions about this, especially @tkfu, who manages the server side of our stack.

@tkfu (Member) commented Aug 25, 2022

Yeah, my thinking on this is that it has to be the back-end's responsibility to deal with it somehow. Fundamentally, we must reckon with the possibility that the primary is lying to us. Our only reliable source of truth for what is installed on a secondary is the signed version report from the secondary itself. Now, managed secondaries aren't "true" secondaries, of course. There's no realistic scenario where aktualizr-primary is compromised, but a managed secondary is intact and trustable. (For anyone from Uptane reading this who might not be familiar with aktualizr: managed secondaries are "fake" or virtual secondaries. They're run entirely on the Linux-based primary.) So in principle, one could imagine a server-side workaround for this issue where you just decide to trust the installation report from the primary, thus closing out the assignment and allowing the managed secondary to re-register with a new key or whatever.

But that just sounds like an awful idea. Even if you could make the argument that it's not a violation of the standard (since managed secondaries aren't part of an Uptane system anyway), it would be a pretty bizarre contortion of the server side just to allow for slightly easier remediation of what ought to be a very rare case.

So I think the approach we'll take on the server side is to clean up our error reporting: if the primary reports success on all secondaries, but can't prove it via a signed version report from each secondary that had an assignment, we should disregard the report from the primary, and warn the repository owner that one of their secondaries isn't reporting in.
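A minimal sketch of what that server-side check could look like (names and types are assumptions, not the actual backend code):

```cpp
// Hypothetical server-side check: a Primary-signed "success" is only believed
// if every ECU that had an assignment also delivered a valid signed version
// report naming the directed target.
#include <map>
#include <string>

struct VersionReport {
  std::string installed_target;
  bool signature_valid;  // verified against the Secondary ECU's registered key
};

// assignments: ECU serial -> target the Director told that ECU to install
// reports:     ECU serial -> signed version report found in the device manifest
bool AssignmentProven(const std::map<std::string, std::string>& assignments,
                      const std::map<std::string, VersionReport>& reports) {
  for (const auto& [serial, target] : assignments) {
    auto it = reports.find(serial);
    if (it == reports.end() || !it->second.signature_valid ||
        it->second.installed_target != target) {
      return false;  // warn the repository owner: this ECU isn't reporting in
    }
  }
  return true;  // safe to close out the update assignment
}
```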

I do have some opinions on how aktualizr's reporting behaviour could be improved, but I'm going to separate those out into another comment.

@tkfu (Member) commented Aug 25, 2022

Aktualizr's behaviour could be better here. We have three information channels relevant to the progress/status of an update assignment:

  1. The manifest, with signed version reports from each secondary, as defined in the Uptane standard. This tells us the current installed software on each ECU.
  2. Custom metadata included in the signed manifest, specifically the installation_report. This gives us additional information, signed by the primary, about the success or failure of an update assignment. This is critical to how our director implementation deals with update assignments (see [1] below).
  3. Events generated during the update process (Download started/complete, installation started/complete, installation applied). These are purely informational, and aren't signed by any ECU's key.

The problem that arose here is that the secondary was able to send (3) even though it didn't have a key, and aktualizr dutifully forwarded those reports to the server. Subsequently, aktualizr also assembled and signed a manifest that included an installation_report indicating success based on those untrusted events.
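For anyone less familiar with these pieces, here is a rough sketch of the shape of the three channels (field names are illustrative, not the exact wire format):

```cpp
#include <string>

// (1) Per-ECU version report, signed with that ECU's key; part of the Uptane
//     device manifest.
struct SignedVersionReport {
  std::string ecu_serial;
  std::string installed_target;
  std::string signature;  // must verify against the ECU's own key
};

// (2) installation_report: custom metadata carried in the manifest and signed
//     only by the Primary.
struct InstallationReport {
  std::string correlation_id;  // ties the report back to a Director assignment
  bool success;
};

// (3) Progress events (download/installation started, completed, applied);
//     purely informational and not signed by any ECU key.
struct UpdateEvent {
  std::string ecu_serial;
  std::string event_type;
};
```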

So, on this basis, the change I'd like to see in aktualizr would be to either:

  • Start signing the events with the ECU key of the ECU that generated them (requires support from secondaries), and drop events that aren't signed, or
  • Do some extra error checking around sending the installation_report--i.e. only send a success if it actually gets a signed version report from the secondary indicating that the directed target was installed.

Both options are complicated in their own way, though, and I'd certainly understand if we just decided it's not a priority to implement either one. If you start signing events, it means both the secondaries and the server also have to change, to start generating/accepting/validating those signed events. If you implement more error checking around the installation_report, there are a bunch of annoying cases to properly think through: for example, if the secondary doesn't respond to a request for its version report, it could be for a number of reasons, including temporary unavailability, so we wouldn't want to report a failure until some kind of timeout occurred or whatever, and that's an opinionated choice that has its own pitfalls.
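As a rough illustration of the second option (hypothetical names; not a proposed patch), the Primary would judge each Secondary's outcome only from a signed version report, and would treat "no report yet" separately so that temporary unavailability isn't immediately reported as a failure:

```cpp
#include <chrono>
#include <optional>
#include <string>

enum class EcuInstallOutcome { kSuccess, kPending, kFailed };

struct SignedVersionReport {
  std::string installed_target;
  bool signature_valid;
};

// Decide what the installation_report should say about one Secondary, based
// only on its signed version report rather than on unsigned events.
EcuInstallOutcome JudgeSecondary(const std::string& directed_target,
                                 const std::optional<SignedVersionReport>& report,
                                 std::chrono::seconds waited,
                                 std::chrono::seconds timeout) {
  if (!report.has_value()) {
    // The Secondary may just be temporarily unreachable; only give up after
    // some (admittedly opinionated) timeout.
    return waited < timeout ? EcuInstallOutcome::kPending : EcuInstallOutcome::kFailed;
  }
  if (!report->signature_valid || report->installed_target != directed_target) {
    return EcuInstallOutcome::kFailed;
  }
  return EcuInstallOutcome::kSuccess;
}
```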

[1] If aktualizr cannot fulfill an update assignment, it sends an installation_report indicating the failure, and it includes a correlation_id that director can use to decide what to do next. Naïvely, we might expect that director should continue to send targets metadata indicating the desired target for all ECUs, but that isn't very useful as a practical matter: we need richer information so that director can decide whether to tell the device to attempt the installation again or not. We don't want the vehicle to get stuck in a loop where it just keeps on trying to install an update, failing every time. You could have all of the retry logic on the client side, and/or attempt to embed a policy engine inside director targets metadata, but both of those options suck in their own way.
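A minimal sketch of the Director-side handling described in [1] (hypothetical types, not the actual implementation): on a failed installation_report, the correlation_id is used to look up the assignment and cancel it, rather than leaving the same targets metadata in place and letting the device retry forever.

```cpp
#include <map>
#include <string>

struct InstallationReport {
  std::string correlation_id;
  bool success;
};

struct Assignment {
  std::string target;
  bool active;
};

void HandleInstallationReport(const InstallationReport& report,
                              std::map<std::string, Assignment>& assignments) {
  auto it = assignments.find(report.correlation_id);
  if (it == assignments.end()) {
    return;  // unknown correlation_id: nothing to do
  }
  if (!report.success) {
    // Cancel rather than retry indefinitely; whether to re-issue the update
    // is a separate, higher-level decision.
    it->second.active = false;
  }
}
```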

@tkfu (Member) commented Aug 25, 2022

And finally, as for how to recover from this, I think it's clear that recovery is not something aktualizr can be responsible for. If an ECU can't sign version reports, it needs to be replaced, and repository owners already have (or at least should already have) a plan for how to replace ECUs, hopefully following our guidance in the deployment best practices.

@pattivacek (Collaborator) commented:

Okay, glad to hear the recovery mechanism is out of scope here, or at least less important. As to the reporting:

> Do some extra error checking around sending the installation_report--i.e. only send a success if it actually gets a signed version report from the secondary indicating that the directed target was installed.

This is my preferred option, but it is not perfect, for the exact reason you specify. I wish the trigger for the installation report came from the Secondary's version report, not the events. Conceptually, what we do now doesn't make sense: the events are not part of Uptane, they're purely informational, and they shouldn't be used for Uptane-related decision-making.

> So I think the approach we'll take on the server side is to clean up our error reporting: if the primary reports success on all secondaries, but can't prove it via a signed version report from each secondary that had an assignment, we should disregard the report from the primary, and warn the repository owner that one of their secondaries isn't reporting in.

This would be great. I wish the server had some sort of mechanism for recognizing these error scenarios and reporting them to the user somehow.

> in principle, one could imagine a server-side workaround for this issue where you just decide to trust the installation report from the primary, thus closing out the assignment and allowing the managed secondary to re-register with a new key or whatever.

> But that just sounds like an awful idea. Even if you could make the argument that it's not a violation of the standard (since managed secondaries aren't part of an Uptane system anyway), it would be a pretty bizarre contortion of the server side just to allow for slightly easier remediation of what ought to be a very rare case.

Yeah, agreed, please don't do that. Ignoring errors is never good, even when they "shouldn't happen".

> We don't want the vehicle to get stuck in a loop where it just keeps on trying to install an update, failing every time.

Yes, this is what currently happens in several cases, and it's really annoying and wasteful.

@tkfu (Member) commented Aug 30, 2022

> Yes, this is what currently happens in several cases, and it's really annoying and wasteful.

Can you elaborate? We don't generally have that problem, precisely because the installation reports with correlationId allow director to cancel the assignment once aktualizr has reported it as a failure. Is it because you're mostly working with devices that use aktualizr-lite, and thus have no director to talk to?

@pattivacek (Collaborator) commented:

> Is it because you're mostly working with devices that use aktualizr-lite, and thus have no director to talk to?

Yes, this is one such situation. However, I seem to recall from the days of using the Director that we'd get situations where certain errors were not sent upstream effectively, and installations would be repeated. Maybe I'm thinking too far back and we'd fixed that in the meantime. I wouldn't trust that every error scenario in the Secondary gets reported correctly, though, as the OP's situation indicates. We have a lot of tests for these things, but I doubt we cover everything.
