
[RSDK-9593] move ota to config monitor, config monitor restart hook #369

Open · wants to merge 14 commits into main

Conversation

mattjperez
Member

@mattjperez mattjperez commented Dec 20, 2024

This PR:

  • moves OTA logic to the config monitor
  • uses a while let Err() loop on ota.update() to ensure retries, blocking other config changes from being applied until the update succeeds
  • uses the periodic nature of the ConfigMonitor (every 10 sec by default) to retry failed OTA attempts (see the sketch after this list)
  • addresses ticket RSDK-9594 in the config monitor; I haven't checked whether this change needs to be made in other parts of the codebase.
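
Below is a minimal standalone sketch of that flow. The `Ota` stub, `fetch_config`, `apply_config`, and the `smol` timer are hypothetical stand-ins; only the `while let Err()` retry on `ota.update()` and the 10-second tick come from this PR.

```rust
use std::time::Duration;

struct Ota;
impl Ota {
    // Stand-in for OtaService: the real update() downloads and writes firmware.
    async fn update(&mut self) -> Result<(), String> {
        Ok(())
    }
}
async fn fetch_config() -> String { String::new() } // hypothetical helper
fn apply_config(_cfg: &str) {}                      // hypothetical helper

async fn config_monitor(mut ota: Ota) {
    loop {
        let cfg = fetch_config().await;
        // Retry a failed OTA before applying anything else; this blocks
        // other config changes until update() succeeds.
        while let Err(e) = ota.update().await {
            log::error!("failed to update firmware: {}", e);
            log::info!("retrying firmware update");
        }
        apply_config(&cfg);
        // The monitor ticks every 10 seconds by default.
        smol::Timer::after(Duration::from_secs(10)).await;
    }
}
```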

@mattjperez mattjperez self-assigned this Dec 20, 2024
@mattjperez mattjperez requested a review from a team as a code owner December 20, 2024 18:35
@@ -51,7 +51,7 @@ use super::server::{IncomingConnectionManager, WebRtcConfiguration};
use crate::common::provisioning::server::AsNetwork;

#[cfg(feature = "ota")]
use crate::common::{credentials_storage::OtaMetadataStorage, ota};
Member Author

I've left this here for now instead of shifting it to config_monitor.rs. I like that it's part of ViamServerStorage, especially if we remove the flag entirely to make it a builtin.

// TODO(RSDK-9464): flush logs to app.viam before restarting
esp_idf_svc::hal::reset::restart();
Member Author

This shifts the responsibility of rebooting to whoever is calling ota::update().

Comment on lines 109 to 112
while let Err(ota_err) = ota.update().await {
log::error!("failed to update firmware: {}", ota_err);
log::info!("retrying firmware update");
}
Member Author

I wasn't sure whether to put a delay here at all, since update() already has one internally for connection establishment.

Member Author

@mattjperez mattjperez Dec 20, 2024

Additionally, if the failure is due to a specific error (ConfigError, InvalidImage, etc.), I'm not sure what the behavior should be.

We can:

a) continue applying the other configs (might be risky if the new configs are only compatible with the new firmware); this means changing my current implementation to add an extra check when the task first starts.

b) continue to block on applying the current update and surface an error; if so, we need to check for new configs between failures.

c) add options down the road for users to decide on the behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a) is the right move here. Consumers of the pre-packaged micro-rdk-server can see why configs that are incompatible with the old firmware failed to function, because the version reported on app will still be the old one. Others can look at the logs (which is admittedly not ideal, but the solution to that is to change/revamp version reporting, as we have previously discussed).
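
A rough sketch of option a) as described, with hypothetical stubs (`Ota`, `apply_config`) standing in for the real types:

```rust
struct Ota;
impl Ota {
    async fn update(&mut self) -> Result<(), String> { Ok(()) }
}
fn apply_config(_cfg: &str) {}

// Option a): attempt the update on each tick, but fall through to config
// application on failure instead of blocking behind a broken update.
async fn tick(ota: &mut Ota, cfg: &str) {
    if let Err(e) = ota.update().await {
        // app still reports the old firmware version, which is the signal
        // for diagnosing configs that require the new firmware.
        log::error!("ota failed, applying config anyway: {}", e);
    }
    apply_config(cfg);
}
```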

Member

@gvaradarajan gvaradarajan left a comment

No logical issues, but I've suggested some shifting of code

self.executor.clone(),
) {
Ok(mut ota) => {
let curr_metadata = ota.stored_metadata().await;
Member

I feel like the logic here could go back to the OTA service and this could be reduced to a function call on the service? You would no longer need the stored_metadata function, and pending_version would no longer need to be pub(crate).
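
A standalone sketch of that suggestion (stub types; the real OtaService is generic over OtaMetadataStorage): fold the version comparison into the service so the monitor makes a single call and pending_version stays private.

```rust
#[derive(Default)]
struct OtaMetadata {
    version: String,
}

struct OtaService {
    pending_version: String, // no longer needs to be pub(crate)
    stored: OtaMetadata,
}

impl OtaService {
    // The config monitor only calls this; metadata stays internal.
    fn needs_update(&self) -> bool {
        self.stored.version != self.pending_version
    }
}
```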

Member Author

You're right. Because this would run every 10 seconds when the config gets checked, I wanted to control how often logging gets propagated; otherwise we'd see "version up to date" all the time.

I've since moved this to a single log line when the viam server starts, so the current firmware is logged once and only the update target is reported when attempting OTA.

}
}

if reboot {
Member Author

Looks cleaner to use a flag instead of fiddling with feature gates.
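
For reference, a minimal standalone sketch of the flag-based flow (stub Ota type; restart() is a hypothetical stand-in for the platform hook):

```rust
struct Ota;
impl Ota {
    async fn needs_update(&self) -> bool { true }
    async fn update(&mut self) -> Result<(), String> { Ok(()) }
}
fn restart() -> ! { std::process::exit(0) } // hypothetical platform hook

async fn reconfigure(mut ota: Ota) {
    let mut reboot = false;
    if ota.needs_update().await {
        while let Err(e) = ota.update().await {
            log::error!("failed to update firmware: {}", e);
        }
        reboot = true;
    }
    // ... apply the rest of the new config here ...
    if reboot {
        // TODO(RSDK-9464): flush logs to app.viam before restarting
        restart();
    }
}
```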

Comment on lines +483 to +489
#[cfg(feature = "ota")]
{
if self.storage.has_ota_metadata() {
let metadata = self.storage.get_ota_metadata().unwrap_or_default();
log::info!("firmware version: {}", metadata.version);
}
}
Member Author

Adding this here avoids complicating the config monitor's OTA logic just to get the right logging behavior.

Member

@gvaradarajan gvaradarajan left a comment

LGTM with optional suggestion

@@ -256,31 +269,22 @@ impl<S: OtaMetadataStorage> OtaService<S> {
pending_version,
max_size,
address,
needs_reboot: false,
Member

I think it's better to have update return a boolean indicating whether an update was actually performed, rather than storing state?
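
A standalone sketch of that shape, with stub types (OtaError and the pending flag are placeholders for the real version comparison):

```rust
#[derive(Debug)]
struct OtaError(String);

struct OtaService {
    pending: bool, // placeholder for the real pending-version check
}

impl OtaService {
    /// Ok(true): new firmware was written and the caller should reboot.
    /// Ok(false): already up to date, nothing to do.
    async fn update(&mut self) -> Result<bool, OtaError> {
        if !self.pending {
            return Ok(false);
        }
        // ... download, verify, and write the new image here ...
        self.pending = false;
        Ok(true)
    }
}
```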

Member

@gvaradarajan gvaradarajan left a comment

LGTM, provided the one logical issue is resolved

let (mut sender, conn) = loop {
num_tries += 1;
Member

Because of where the increment is, this ends up trying a maximum of only twice rather than thrice?

Member Author

Ah, you're right. It should go at the bottom of the loop. Fixed.

{
if self.storage.has_ota_metadata() {
let metadata = self.storage.get_ota_metadata().unwrap_or_default();
log::info!("firmware version: {}", metadata.version);
Member

This will log the empty string if there is no OTA metadata in NVS. I'm not sure that's the best outcome. Can we be more explicit?

Comment on lines +578 to +581
#[cfg(feature = "esp32")]
let hook = || crate::esp32::esp_idf_svc::hal::reset::restart();
#[cfg(not(feature = "esp32"))]
let hook = || std::process::exit(0);
Member

  • Before, we just called std::process::exit, but this goes back to an older style where on esp32 we do hal::reset::restart. What was the motivation for that change? I think I know, but I'd like to be sure.
  • If this is to effect the change I suspect, I'd want to see the same change in the RestartMonitor.
  • A block nearly identical to this exists in grpc::robot_shutdown. Can we DRY this?

Member Author

The motivation was to get rid of the coredump we see every time on config-change restarts, since I'm already modifying the config monitor's behavior.

I mention in the description that this partially addresses RSDK-9594, but just in the config monitor.

I can look into grpc::robot_shutdown, but that depends on whether you want it done across the repo in RSDK-9594 or just here. Which would you prefer?

@@ -172,6 +170,17 @@ pub(crate) struct OtaService<S: OtaMetadataStorage> {
}

impl<S: OtaMetadataStorage> OtaService<S> {
pub(crate) async fn stored_metadata(&self) -> OtaMetadata {
if !self.storage.has_ota_metadata() {
log::info!("no ota metadata currently stored in NVS");
Member

nit: OTA

@@ -172,6 +170,17 @@ pub(crate) struct OtaService<S: OtaMetadataStorage> {
}

impl<S: OtaMetadataStorage> OtaService<S> {
pub(crate) async fn stored_metadata(&self) -> OtaMetadata {
Member

Is this function really async?

Member Author

It isn't, but shouldn't it be? The underlying calls are essentially I/O, getting data from flash storage. But this is maybe more of a "should our storage APIs be async?" question.

I'll remove it here though.
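
Stitched together from the excerpts in this thread, the de-async'd function would look roughly like this (stub storage trait included only so the sketch stands alone):

```rust
trait OtaMetadataStorage {
    fn has_ota_metadata(&self) -> bool;
    fn get_ota_metadata(&self) -> Result<OtaMetadata, String>;
}

#[derive(Default)]
struct OtaMetadata {
    version: String,
}

struct OtaService<S> {
    storage: S,
}

impl<S: OtaMetadataStorage> OtaService<S> {
    // Plain fn: the underlying NVS reads are blocking, nothing is awaited.
    fn stored_metadata(&self) -> OtaMetadata {
        if !self.storage.has_ota_metadata() {
            log::info!("no OTA metadata currently stored in NVS");
        }
        self.storage
            .get_ota_metadata()
            .inspect_err(|e| log::warn!("failed to get OTA metadata from NVS: {}", e))
            .unwrap_or_default()
    }
}
```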

.inspect_err(|e| log::warn!("failed to get ota metadata from nvs: {}", e))
.unwrap_or_default()
};
pub(crate) async fn needs_update(&self) -> bool {
Member

Is this function really async?

@@ -337,6 +342,7 @@ impl<S: OtaMetadataStorage> OtaService<S> {
CONN_RETRY_SECS
);
Timer::after(Duration::from_secs(CONN_RETRY_SECS)).await;
num_tries += 1;
Member

  • Is there a way to frame this loop without an explicitly managed counter?
  • If you really need the counter, I'd recommend incrementing at or near the top so it can't somehow become skipped later.

Member Author

@mattjperez mattjperez Jan 3, 2025

I tried exploring different paths for a day or two:

  • wrapping the connection initialization logic into its own function
    • getting the return type right (even with generics) was a rabbit hole of satisfying the compiler and enabling unstable/nightly features that are used to define a lower-level trait (something with Allocator, if I remember right)
  • declaring the variables outside the loop
    • same issue as above
  • using for i in 0..CONN_RETRY
    • can't break with a value from a for-loop; only loop supports that

I think I tried a couple of other things (like inspect, map_err, ok_and, etc.), but it seemed like too much time spent trying to avoid the messiness while keeping the correct behavior.

Member Author

I can move the increment to the top of the loop and set the range accordingly.
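
A standalone sketch of that version, assuming hypothetical try_connect/Conn stubs and a smol timer (NUM_RETRY_CONN and CONN_RETRY_SECS mirror the constants in the diff):

```rust
use std::time::Duration;

const NUM_RETRY_CONN: usize = 3;
const CONN_RETRY_SECS: u64 = 1;

struct Conn;
async fn try_connect() -> Result<Conn, String> { Ok(Conn) } // hypothetical

async fn connect_with_retries() -> Result<Conn, String> {
    let mut num_tries = 0;
    loop {
        // Incrementing at the top means the count can't be skipped by an
        // early continue added to the body later.
        num_tries += 1;
        match try_connect().await {
            Ok(conn) => return Ok(conn),
            Err(e) if num_tries < NUM_RETRY_CONN => {
                log::warn!("connection attempt {} failed: {}", num_tries, e);
                smol::Timer::after(Duration::from_secs(CONN_RETRY_SECS)).await;
            }
            // Out of retries: surface the error to the caller.
            Err(e) => return Err(e),
        }
    }
}
```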

let (mut sender, conn) = loop {
if num_tries == NUM_RETRY_CONN {
Member

Is this a necessary part of this change? It seems independent of moving the driver of OTA checks to be the ConfigMonitor.

Member Author

If we don't limit the number of connection attempts, it will block further config changes. So the change keeps a limited number of short-lived retry attempts in case there's initial wonkiness with the connection.

If there is any general wonkiness with establishing the connection, there isn't any guarantee that returning immediately and depending on the next invocation of ConfigMonitor (in 10 sec) would resolve it.

Three one-second retries seemed like a decent middle ground that we can iterate on if anything.

self.storage
.get_ota_metadata()
.inspect_err(|e| log::warn!("failed to get ota metadata from nvs: {}", e))
.unwrap_or_default()
Member

Why return a default value rather than an error?
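
The alternative the question implies might look like this (standalone sketch with stub types; the OtaError variant is hypothetical):

```rust
#[derive(Debug)]
enum OtaError {
    Storage(String), // hypothetical variant
}

#[derive(Default)]
struct OtaMetadata;

struct Storage;
impl Storage {
    fn get_ota_metadata(&self) -> Result<OtaMetadata, String> { Ok(OtaMetadata) }
}

struct OtaService {
    storage: Storage,
}

impl OtaService {
    // Bubble the NVS failure up instead of masking it with a default value.
    fn stored_metadata(&self) -> Result<OtaMetadata, OtaError> {
        self.storage.get_ota_metadata().map_err(OtaError::Storage)
    }
}
```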

@@ -27,11 +30,14 @@ where
pub fn new(
curr_config: Box<RobotConfig>,
storage: Storage,
#[cfg(feature = "ota")] executor: Executor,
Member

nit: inconsistent with how we usually do cfg things.

}

if reboot {
// TODO(RSDK-9464): flush logs to app.viam before restarting
Member

This ticket would be another reason to DRY the shutdown handling.

Member Author

As in, have a global shutdown function that handles log flushing?
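
One possible shape, sketching a crate-level helper that the config monitor, RestartMonitor, and grpc::robot_shutdown could all call (the name and location are hypothetical; the feature-gated calls mirror the excerpt above):

```rust
pub(crate) fn shutdown() {
    // TODO(RSDK-9464): flush logs to app.viam before restarting
    #[cfg(feature = "esp32")]
    crate::esp32::esp_idf_svc::hal::reset::restart();
    #[cfg(not(feature = "esp32"))]
    std::process::exit(0);
}
```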
