[RSDK-9593] move ota to config monitor, config monitor restart hook #369
base: main
Conversation
@@ -51,7 +51,7 @@ use super::server::{IncomingConnectionManager, WebRtcConfiguration};
use crate::common::provisioning::server::AsNetwork;

#[cfg(feature = "ota")]
use crate::common::{credentials_storage::OtaMetadataStorage, ota};
I've left this here for now instead of shifting it to config_monitor.rs. I like that it's part of ViamServerStorage, especially if we remove the flag entirely to make it a builtin.
// TODO(RSDK-9464): flush logs to app.viam before restarting
esp_idf_svc::hal::reset::restart();
this shifts the responsibility of rebooting to whoever is calling ota::update()
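For illustration, roughly what the caller side looks like once ota::update() no longer restarts the device itself; this is a sketch assembled from the surrounding diff, and the enclosing context (the config monitor's OTA branch) is an assumption:

// caller-side sketch: run the update, then decide about rebooting here
// rather than inside ota::update()
match ota.update().await {
    Ok(_) => {
        // TODO(RSDK-9464): flush logs to app.viam before restarting
        esp_idf_svc::hal::reset::restart();
    }
    Err(ota_err) => log::error!("failed to update firmware: {}", ota_err),
}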
while let Err(ota_err) = ota.update().await {
    log::error!("failed to update firmware: {}", ota_err);
    log::info!("retrying firmware update");
}
wasn't sure if I should put a delay here at all, since there is one internally in update for connection establishment already.
additionally, if the failure is due to specific errors (ConfigError, InvalidImage, etc.), I'm not sure what the behavior should be. we can:
a) continue applying the other configs (might be risky if the new configs are only compatible with new firmware); this means changing my current implementation to add an extra check when the task first starts (see the sketch below)
b) continue to block on applying the current update and surface an error; if so, we need to check for new configs between failures
c) add options down the road for users to decide on the behavior
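A rough sketch of option a), where the is_retryable helper and the tuple-style ConfigError/InvalidImage variants are illustrative assumptions rather than the current API:

// give up on errors that retrying can't fix, so the rest of the config
// still gets applied; keep retrying errors that might be transient
fn is_retryable(err: &ota::OtaError) -> bool {
    !matches!(
        err,
        ota::OtaError::ConfigError(_) | ota::OtaError::InvalidImage(_)
    )
}

while let Err(ota_err) = ota.update().await {
    if !is_retryable(&ota_err) {
        log::error!("abandoning firmware update: {}", ota_err);
        break; // fall through and apply the remaining config
    }
    log::info!("retrying firmware update");
}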
I think a) is the right move here. Consumers of the pre-packaged micro-rdk-server can see why the configs incompatible with the old firmware failed to function appropriately because the version on app will be the old version. Others can look at the logs (which is admittedly not ideal, but the solution to that is to change/revamp version reporting as we have previously discussed)
No logical issues, but I've suggested some shifting of code
    self.executor.clone(),
) {
    Ok(mut ota) => {
        let curr_metadata = ota.stored_metadata().await;
I feel like the logic here could go back to the OTA service and this could be reduced to a function call on the service? You would no longer need the stored_metadata function and the pending_version would no longer need to be pub(crate).
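Roughly along these lines (the exact signature is a guess), so the version comparison lives on the service and callers never touch stored_metadata or pending_version directly:

impl<S: OtaMetadataStorage> OtaService<S> {
    // sketch: compare what is stored in NVS against the version from the config
    pub(crate) async fn needs_update(&self) -> bool {
        self.stored_metadata().await.version != self.pending_version
    }
}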
you're right. because this would run every 10 seconds when the config gets checked, I wanted to control how often logging gets propagated, otherwise we'd see "version up to date" all the time.
I've since moved a single log to when the viam server starts, to log the current firmware once and report only the update target when attempting ota
}
}

if reboot {
looks cleaner to use a flag instead of fiddling with feature gates
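As a sketch of the flag pattern (ota_updated is a stand-in for however the OTA branch decides a restart is needed): the feature-gated block only sets the flag, and the restart handling stays in a single un-gated spot.

let mut reboot = false;

#[cfg(feature = "ota")]
{
    // ... run the OTA check; request a restart only if new firmware was applied
    if ota_updated {
        reboot = true;
    }
}

if reboot {
    // TODO(RSDK-9464): flush logs to app.viam before restarting
    hook(); // platform-appropriate restart, e.g. the cfg-selected hook below
}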
#[cfg(feature = "ota")]
{
    if self.storage.has_ota_metadata() {
        let metadata = self.storage.get_ota_metadata().unwrap_or_default();
        log::info!("firmware version: {}", metadata.version);
    }
}
adding this here avoids complicating config monitor's ota stuff to get the right logging behavior.
LGTM with optional suggestion
micro-rdk/src/common/ota.rs (outdated)
@@ -256,31 +269,22 @@ impl<S: OtaMetadataStorage> OtaService<S> {
    pending_version,
    max_size,
    address,
    needs_reboot: false,
I think it's better to have update return a boolean as to whether an update was actually performed rather than store state?
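i.e. something along these lines (the return type and body are assumptions), which drops the needs_reboot field entirely:

impl<S: OtaMetadataStorage> OtaService<S> {
    /// Ok(true): new firmware was written and the caller should reboot.
    /// Ok(false): already running the pending version, nothing to do.
    pub(crate) async fn update(&mut self) -> Result<bool, OtaError> {
        if !self.needs_update().await {
            return Ok(false);
        }
        // ... download the image, write it to the inactive partition,
        // and persist the new metadata to NVS ...
        Ok(true)
    }
}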
LGTM, provided the one logical issue is resolved
micro-rdk/src/common/ota.rs (outdated)
let (mut sender, conn) = loop {
    num_tries += 1;
because of where the increment is, this ends up only trying a maximum of twice rather than thrice?
ah you're right. should go at the bottom of the loop. fixed
{
    if self.storage.has_ota_metadata() {
        let metadata = self.storage.get_ota_metadata().unwrap_or_default();
        log::info!("firmware version: {}", metadata.version);
This will log the empty string if there is no OTA metadata in NVS. I'm not sure that's the best outcome. Can we be more explicit?
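For example, something more explicit along these lines (log wording is illustrative):

if self.storage.has_ota_metadata() {
    match self.storage.get_ota_metadata() {
        Ok(metadata) => log::info!("firmware version: {}", metadata.version),
        Err(e) => log::warn!("failed to read ota metadata from nvs: {}", e),
    }
} else {
    log::info!("no ota metadata stored in NVS, firmware version unknown");
}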
#[cfg(feature = "esp32")]
let hook = || crate::esp32::esp_idf_svc::hal::reset::restart();
#[cfg(not(feature = "esp32"))]
let hook = || std::process::exit(0);
- Before we just called std::process::exit, but this goes back to an older style where on esp32 we do hal::reset::restart. What was the motivation for that change? I think I know, but I'd like to be sure.
- If this is to effect the change I suspect, I think I'd want to see the same change in the RestartMonitor.
- A block nearly identical to this exists in grpc::robot_shutdown. Can we DRY this?
the motivation was to get rid of the coredump we see every time on config-change restarts, since I'm already modifying the config_monitor's behavior.
I mention it in the description as partially addressing RSDK-9594, but just in config monitor.
I can look into grpc::robot_shutdown, but that depends on whether you want that done across the repo in RSDK-9594 or just here. Which would you prefer?
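One way it could be DRY'd is a shared helper that config monitor, RestartMonitor, and grpc::robot_shutdown all call; a sketch (name and location hypothetical) reusing the hook pattern from this diff:

// hypothetical shared shutdown path, e.g. in a common module
pub(crate) fn restart_or_exit() {
    // TODO(RSDK-9464): flush logs to app.viam before restarting
    #[cfg(feature = "esp32")]
    let hook = || crate::esp32::esp_idf_svc::hal::reset::restart();
    #[cfg(not(feature = "esp32"))]
    let hook = || std::process::exit(0);
    hook();
}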
@@ -172,6 +170,17 @@ pub(crate) struct OtaService<S: OtaMetadataStorage> {
}

impl<S: OtaMetadataStorage> OtaService<S> {
    pub(crate) async fn stored_metadata(&self) -> OtaMetadata {
        if !self.storage.has_ota_metadata() {
            log::info!("no ota metadata currently stored in NVS");
nit: OTA
@@ -172,6 +170,17 @@ pub(crate) struct OtaService<S: OtaMetadataStorage> {
}

impl<S: OtaMetadataStorage> OtaService<S> {
    pub(crate) async fn stored_metadata(&self) -> OtaMetadata {
Is this function really async?
It isn't, but shouldn't it be? The underlying calls are essentially I/O, getting data from flash storage. But this is maybe more of a "should our storage APIs be async?" question.
I'll remove it here though.
        .inspect_err(|e| log::warn!("failed to get ota metadata from nvs: {}", e))
        .unwrap_or_default()
    };
    pub(crate) async fn needs_update(&self) -> bool {
Is this function really async?
@@ -337,6 +342,7 @@ impl<S: OtaMetadataStorage> OtaService<S> {
        CONN_RETRY_SECS
    );
    Timer::after(Duration::from_secs(CONN_RETRY_SECS)).await;
    num_tries += 1;
- Is there a way to frame this loop without an explicitly managed counter?
- If you really need the counter, I'd recommend incrementing at or near the top so it can't somehow become skipped later.
I tried exploring different paths for a day or two:
- wrapping the connection initialization logic into its own function
  - getting the return type right (even with generics) was a rabbit hole of satisfying the compiler and enabling unstable/nightly features that are used to define a lower-level trait (something with Allocator, if I remember right)
- declaring the variables outside the loop
  - same issue as above
- using for i in 0..CONN_RETRY
  - can't break with a value out of a for loop, only out of loop

I think I tried a couple of other things (like inspect, map_err, ok_and, etc.), but it seemed like too much time spent on trying to avoid the messiness and still have the correct behavior.
I can move the increment to the top of the loop and set the range accordingly
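e.g. a sketch with the counter handled at the top, where try_connect is a stand-in for the real connection-establishment code inside update, and the early return assumes the surrounding function's error type:

let mut num_tries = 0;
let (mut sender, conn) = loop {
    // count the attempt up front so it can't be skipped by a later code path
    num_tries += 1;
    match try_connect().await {
        Ok(pair) => break pair,
        Err(e) if num_tries < NUM_RETRY_CONN => {
            log::warn!(
                "connection attempt {} failed: {}; retrying in {}s",
                num_tries, e, CONN_RETRY_SECS
            );
            Timer::after(Duration::from_secs(CONN_RETRY_SECS)).await;
        }
        // out of retries: surface the error and let the next config check try again
        Err(e) => return Err(e),
    }
};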
let (mut sender, conn) = loop {
    if num_tries == NUM_RETRY_CONN {
Is this a necessary part of this change? It seems independent of moving the driver of OTA checks to be the ConfigMonitor.
if we don't limit the number of connection attempts, it will block further config changes. So the change is to still have a limited number of short-lived retry attempts in case there's initial wonkiness with the connection.
If there is any general wonkiness with establishing the connection, there isn't any guarantee that returning immediately and depending on the next invocation of ConfigMonitor (in 10sec) will resolve it.
three one-second retries seemed like a decent middle ground that we can iterate on if anything.
self.storage
    .get_ota_metadata()
    .inspect_err(|e| log::warn!("failed to get ota metadata from nvs: {}", e))
    .unwrap_or_default()
Why return a default value rather than an error?
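For comparison, the error-propagating alternative would look roughly like this (the OtaError variant is hypothetical):

pub(crate) fn stored_metadata(&self) -> Result<OtaMetadata, OtaError> {
    self.storage
        .get_ota_metadata()
        // hypothetical variant; wraps the underlying NVS error instead of
        // silently falling back to OtaMetadata::default()
        .map_err(|e| OtaError::NvsRead(e.to_string()))
}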
@@ -27,11 +30,14 @@ where
    pub fn new(
        curr_config: Box<RobotConfig>,
        storage: Storage,
        #[cfg(feature = "ota")] executor: Executor,
nit: inconsistent with how we usually do cfg things.
}

if reboot {
    // TODO(RSDK-9464): flush logs to app.viam before restarting
This ticket would be another reason to DRY the shutdown handling.
as in, have a global shutdown function that handles log-flushing?
This PR:
- uses a while let Err() loop on ota.update() to ensure retries, blocking other config changes from being applied
- relies on ConfigMonitor (every 10sec by default) to retry failed OTA attempts