
[RSDK-9593] move ota to config monitor, config monitor restart hook #369

Open · wants to merge 14 commits into main

Conversation

mattjperez
Member

@mattjperez mattjperez commented Dec 20, 2024

This PR:

  • moves OTA logic to the config monitor
  • uses a while let Err() loop on ota.update() to ensure retries, blocking other config changes from being applied until the update succeeds
  • uses the periodic nature of the ConfigMonitor (every 10 sec by default) to retry failed OTA attempts (see the sketch after this list)
  • addresses ticket RSDK-9594 in the config monitor; I haven't checked whether this change needs to be made in other parts of the codebase.
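
Below is a minimal standalone sketch of that flow. The `Ota` stub, `fetch_config`, `apply_config`, and the `smol` timer are hypothetical stand-ins; only the `while let Err()` retry on `ota.update()` and the 10-second tick come from this PR.

```rust
use std::time::Duration;

struct Ota;
impl Ota {
    // Stand-in for OtaService: the real update() downloads and writes firmware.
    async fn update(&mut self) -> Result<(), String> {
        Ok(())
    }
}
async fn fetch_config() -> String { String::new() } // hypothetical helper
fn apply_config(_cfg: &str) {}                      // hypothetical helper

async fn config_monitor(mut ota: Ota) {
    loop {
        let cfg = fetch_config().await;
        // Retry a failed OTA before applying anything else; this blocks
        // other config changes until update() succeeds.
        while let Err(e) = ota.update().await {
            log::error!("failed to update firmware: {}", e);
            log::info!("retrying firmware update");
        }
        apply_config(&cfg);
        // The monitor ticks every 10 seconds by default.
        smol::Timer::after(Duration::from_secs(10)).await;
    }
}
```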

@mattjperez mattjperez self-assigned this Dec 20, 2024
@mattjperez mattjperez requested a review from a team as a code owner December 20, 2024 18:35
@@ -51,7 +51,7 @@ use super::server::{IncomingConnectionManager, WebRtcConfiguration};
use crate::common::provisioning::server::AsNetwork;

#[cfg(feature = "ota")]
use crate::common::{credentials_storage::OtaMetadataStorage, ota};
Member Author

I've left this here for now instead of shifting it to config_monitor.rs. I like that it's part of ViamServerStorage, especially if we remove the flag entirely to make it a builtin.

// TODO(RSDK-9464): flush logs to app.viam before restarting
esp_idf_svc::hal::reset::restart();
Member Author

This shifts the responsibility of rebooting to whoever is calling ota::update().

Comment on lines 109 to 112
while let Err(ota_err) = ota.update().await {
log::error!("failed to update firmware: {}", ota_err);
log::info!("retrying firmware update");
}
Member Author

I wasn't sure whether to put a delay here at all, since update() already has one internally for connection establishment.

Member Author

@mattjperez mattjperez Dec 20, 2024

Additionally, if the failure is due to a specific error (ConfigError, InvalidImage, etc.), I'm not sure what the behavior should be.

We can:

a) continue applying the other configs (might be risky if the new configs are only compatible with the new firmware); this means changing my current implementation to add an extra check when the task first starts.

b) continue to block on applying the current update and surface an error; if so, we need to check for new configs between failures.

c) add options down the road for users to decide on the behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a) is the right move here. Consumers of the pre-packaged micro-rdk-server can see why configs that are incompatible with the old firmware failed to function, because the version reported on app will still be the old one. Others can look at the logs (which is admittedly not ideal, but the solution to that is to change/revamp version reporting, as we have previously discussed).
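
A rough sketch of option a) as described, with hypothetical stubs (`Ota`, `apply_config`) standing in for the real types:

```rust
struct Ota;
impl Ota {
    async fn update(&mut self) -> Result<(), String> { Ok(()) }
}
fn apply_config(_cfg: &str) {}

// Option a): attempt the update on each tick, but fall through to config
// application on failure instead of blocking behind a broken update.
async fn tick(ota: &mut Ota, cfg: &str) {
    if let Err(e) = ota.update().await {
        // app still reports the old firmware version, which is the signal
        // for diagnosing configs that require the new firmware.
        log::error!("ota failed, applying config anyway: {}", e);
    }
    apply_config(cfg);
}
```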

Member

@gvaradarajan gvaradarajan left a comment

No logical issues, but I've suggested some shifting of code

self.executor.clone(),
) {
Ok(mut ota) => {
let curr_metadata = ota.stored_metadata().await;
Member

I feel like the logic here could go back to the OTA service and this could be reduced to a function call on the service? You would no longer need the stored_metadata function, and pending_version would no longer need to be pub(crate).
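
A standalone sketch of that suggestion (stub types; the real OtaService is generic over OtaMetadataStorage): fold the version comparison into the service so the monitor makes a single call and pending_version stays private.

```rust
#[derive(Default)]
struct OtaMetadata {
    version: String,
}

struct OtaService {
    pending_version: String, // no longer needs to be pub(crate)
    stored: OtaMetadata,
}

impl OtaService {
    // The config monitor only calls this; metadata stays internal.
    fn needs_update(&self) -> bool {
        self.stored.version != self.pending_version
    }
}
```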

Member Author

You're right. Because this would run every 10 seconds when the config gets checked, I wanted to control how often logging gets propagated; otherwise we'd see "version up to date" all the time.

I've since moved this to a single log line when the viam server starts, so the current firmware is logged once and only the update target is reported when attempting OTA.

}
}

if reboot {
Member Author

Looks cleaner to use a flag instead of fiddling with feature gates.
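
For reference, a minimal standalone sketch of the flag-based flow (stub Ota type; restart() is a hypothetical stand-in for the platform hook):

```rust
struct Ota;
impl Ota {
    async fn needs_update(&self) -> bool { true }
    async fn update(&mut self) -> Result<(), String> { Ok(()) }
}
fn restart() -> ! { std::process::exit(0) } // hypothetical platform hook

async fn reconfigure(mut ota: Ota) {
    let mut reboot = false;
    if ota.needs_update().await {
        while let Err(e) = ota.update().await {
            log::error!("failed to update firmware: {}", e);
        }
        reboot = true;
    }
    // ... apply the rest of the new config here ...
    if reboot {
        // TODO(RSDK-9464): flush logs to app.viam before restarting
        restart();
    }
}
```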

Comment on lines +483 to +489
#[cfg(feature = "ota")]
{
if self.storage.has_ota_metadata() {
let metadata = self.storage.get_ota_metadata().unwrap_or_default();
log::info!("firmware version: {}", metadata.version);
}
}
Member Author

Adding this here avoids complicating the config monitor's OTA logic just to get the right logging behavior.

Member

@gvaradarajan gvaradarajan left a comment

LGTM with optional suggestion

@@ -256,31 +269,22 @@ impl<S: OtaMetadataStorage> OtaService<S> {
pending_version,
max_size,
address,
needs_reboot: false,
Member

I think it's better to have update return a boolean indicating whether an update was actually performed, rather than storing state?
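
A standalone sketch of that shape, with stub types (OtaError and the pending flag are placeholders for the real version comparison):

```rust
#[derive(Debug)]
struct OtaError(String);

struct OtaService {
    pending: bool, // placeholder for the real pending-version check
}

impl OtaService {
    /// Ok(true): new firmware was written and the caller should reboot.
    /// Ok(false): already up to date, nothing to do.
    async fn update(&mut self) -> Result<bool, OtaError> {
        if !self.pending {
            return Ok(false);
        }
        // ... download, verify, and write the new image here ...
        self.pending = false;
        Ok(true)
    }
}
```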

Member

@gvaradarajan gvaradarajan left a comment

LGTM, provided the one logical issue is resolved

let (mut sender, conn) = loop {
num_tries += 1;
Member

Because of where the increment is, this ends up trying a maximum of only twice rather than thrice?

Member Author

Ah, you're right. It should go at the bottom of the loop. Fixed.

{
if self.storage.has_ota_metadata() {
let metadata = self.storage.get_ota_metadata().unwrap_or_default();
log::info!("firmware version: {}", metadata.version);
Member

This will log the empty string if there is no OTA metadata in NVS. I'm not sure that's the best outcome. Can we be more explicit?

Comment on lines +578 to +581
#[cfg(feature = "esp32")]
let hook = || crate::esp32::esp_idf_svc::hal::reset::restart();
#[cfg(not(feature = "esp32"))]
let hook = || std::process::exit(0);
Member

  • Before, we just called std::process::exit, but this goes back to an older style where on esp32 we do hal::reset::restart. What was the motivation for that change? I think I know, but I'd like to be sure.
  • If this is to effect the change I suspect, I'd want to see the same change in the RestartMonitor.
  • A block nearly identical to this exists in grpc::robot_shutdown. Can we DRY this?

Member Author

The motivation was to get rid of the coredump we see every time on config-change restarts, since I'm already modifying the config monitor's behavior.

I mention in the description that this partially addresses RSDK-9594, but just in the config monitor.

I can look into grpc::robot_shutdown, but that depends on whether you want it done across the repo in RSDK-9594 or just here. Which would you prefer?

@@ -172,6 +170,17 @@ pub(crate) struct OtaService<S: OtaMetadataStorage> {
}

impl<S: OtaMetadataStorage> OtaService<S> {
pub(crate) async fn stored_metadata(&self) -> OtaMetadata {
if !self.storage.has_ota_metadata() {
log::info!("no ota metadata currently stored in NVS");
Member

nit: OTA

@@ -172,6 +170,17 @@ pub(crate) struct OtaService<S: OtaMetadataStorage> {
}

impl<S: OtaMetadataStorage> OtaService<S> {
pub(crate) async fn stored_metadata(&self) -> OtaMetadata {
Member

Is this function really async?

Member Author

It isn't, but shouldn't it be? The underlying calls are essentially I/O, getting data from flash storage. But this is maybe more of a "should our storage APIs be async?" question.

I'll remove it here though.
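
Stitched together from the excerpts in this thread, the de-async'd function would look roughly like this (stub storage trait included only so the sketch stands alone):

```rust
trait OtaMetadataStorage {
    fn has_ota_metadata(&self) -> bool;
    fn get_ota_metadata(&self) -> Result<OtaMetadata, String>;
}

#[derive(Default)]
struct OtaMetadata {
    version: String,
}

struct OtaService<S> {
    storage: S,
}

impl<S: OtaMetadataStorage> OtaService<S> {
    // Plain fn: the underlying NVS reads are blocking, nothing is awaited.
    fn stored_metadata(&self) -> OtaMetadata {
        if !self.storage.has_ota_metadata() {
            log::info!("no OTA metadata currently stored in NVS");
        }
        self.storage
            .get_ota_metadata()
            .inspect_err(|e| log::warn!("failed to get OTA metadata from NVS: {}", e))
            .unwrap_or_default()
    }
}
```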

.inspect_err(|e| log::warn!("failed to get ota metadata from nvs: {}", e))
.unwrap_or_default()
};
pub(crate) async fn needs_update(&self) -> bool {
Member

Is this function really async?

@@ -337,6 +342,7 @@ impl<S: OtaMetadataStorage> OtaService<S> {
CONN_RETRY_SECS
);
Timer::after(Duration::from_secs(CONN_RETRY_SECS)).await;
num_tries += 1;
Member

  • Is there a way to frame this loop without an explicitly managed counter?
  • If you really need the counter, I'd recommend incrementing at or near the top so it can't somehow become skipped later.

Member Author

@mattjperez mattjperez Jan 3, 2025

I tried exploring different paths for a day or two:

  • wrapping the connection initialization logic into its own function
    • getting the return type right (even with generics) was a rabbit hole of satisfying the compiler and enabling unstable/nightly features that are used to define a lower-level trait (something with Allocator, if I remember right)
  • declaring the variables outside the loop
    • same issue as above
  • using for i in 0..CONN_RETRY
    • can't break with a value from a for-loop; only loop supports that

I think I tried a couple of other things (like inspect, map_err, ok_and, etc.), but it seemed like too much time spent trying to avoid the messiness while keeping the correct behavior.

Member Author

I can move the increment to the top of the loop and set the range accordingly.
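
A standalone sketch of that version, assuming hypothetical try_connect/Conn stubs and a smol timer (NUM_RETRY_CONN and CONN_RETRY_SECS mirror the constants in the diff):

```rust
use std::time::Duration;

const NUM_RETRY_CONN: usize = 3;
const CONN_RETRY_SECS: u64 = 1;

struct Conn;
async fn try_connect() -> Result<Conn, String> { Ok(Conn) } // hypothetical

async fn connect_with_retries() -> Result<Conn, String> {
    let mut num_tries = 0;
    loop {
        // Incrementing at the top means the count can't be skipped by an
        // early continue added to the body later.
        num_tries += 1;
        match try_connect().await {
            Ok(conn) => return Ok(conn),
            Err(e) if num_tries < NUM_RETRY_CONN => {
                log::warn!("connection attempt {} failed: {}", num_tries, e);
                smol::Timer::after(Duration::from_secs(CONN_RETRY_SECS)).await;
            }
            // Out of retries: surface the error to the caller.
            Err(e) => return Err(e),
        }
    }
}
```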

let (mut sender, conn) = loop {
if num_tries == NUM_RETRY_CONN {
Member

Is this a necessary part of this change? It seems independent of moving the driver of OTA checks to be the ConfigMonitor.

Member Author

If we don't limit the number of connection attempts, it will block further config changes. So the change keeps a limited number of short-lived retry attempts in case there's initial wonkiness with the connection.

If there is any general wonkiness with establishing the connection, there isn't any guarantee that returning immediately and depending on the next invocation of ConfigMonitor (in 10 sec) would resolve it.

Three one-second retries seemed like a decent middle ground that we can iterate on if anything.

self.storage
.get_ota_metadata()
.inspect_err(|e| log::warn!("failed to get ota metadata from nvs: {}", e))
.unwrap_or_default()
Member

Why return a default value rather than an error?
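
The alternative the question implies might look like this (standalone sketch with stub types; the OtaError variant is hypothetical):

```rust
#[derive(Debug)]
enum OtaError {
    Storage(String), // hypothetical variant
}

#[derive(Default)]
struct OtaMetadata;

struct Storage;
impl Storage {
    fn get_ota_metadata(&self) -> Result<OtaMetadata, String> { Ok(OtaMetadata) }
}

struct OtaService {
    storage: Storage,
}

impl OtaService {
    // Bubble the NVS failure up instead of masking it with a default value.
    fn stored_metadata(&self) -> Result<OtaMetadata, OtaError> {
        self.storage.get_ota_metadata().map_err(OtaError::Storage)
    }
}
```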

@@ -27,11 +30,14 @@ where
pub fn new(
curr_config: Box<RobotConfig>,
storage: Storage,
#[cfg(feature = "ota")] executor: Executor,
Member

nit: inconsistent with how we usually do cfg things.

}

if reboot {
// TODO(RSDK-9464): flush logs to app.viam before restarting
Member

This ticket would be another reason to DRY the shutdown handling.

Member Author

As in, have a global shutdown function that handles log flushing?
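
One possible shape, sketching a crate-level helper that the config monitor, RestartMonitor, and grpc::robot_shutdown could all call (the name and location are hypothetical; the feature-gated calls mirror the excerpt above):

```rust
pub(crate) fn shutdown() {
    // TODO(RSDK-9464): flush logs to app.viam before restarting
    #[cfg(feature = "esp32")]
    crate::esp32::esp_idf_svc::hal::reset::restart();
    #[cfg(not(feature = "esp32"))]
    std::process::exit(0);
}
```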
