[Fleet UI] Forbid Agent from being upgraded if upgrade is currently in progress #168171

ycombinator · 2023-10-06T00:16:43Z

Today, neither the Fleet UI nor the Fleet API will allow the user to initiate the second upgrade if the first one is deemed to be still in progress. The UI grays out the "Upgrade agent" link and the API returns a 400 Bad Request response saying the agent is not upgradeable.

Side bar: The Upgrade Watcher is a process that the old (as in, pre-upgrade) version of Agent spawns right before it starts up the new (as in, post-upgrade) version of Agent. The job of the Upgrade Watcher is to monitor the health of the new Agent and, if necessary, rollback to the old Agent.

I have noticed that Fleet considers an upgrade as no longer in progress even while the Upgrade Watcher from that upgrade is still running. In other words, Fleet today will allow a user to request a second upgrade even while the Upgrade Watcher from the first upgrade is still running.

Going forward, however, once elastic/elastic-agent#3119 is resolved, Agent will start sending Fleet Server details about an ongoing upgrade via the upgrade_details field in check-in API requests. This field will contain, amongst other information, the current state of the upgrade process, with one of the states being UPG_WATCHING. This state means the Upgrade Watcher is running.

As long as the upgrade_details field is present in check-in API requests, Fleet should consider an upgrade to be in progress and should not allow a second upgrade to be requested.

For more background on this enhancement request, see elastic/elastic-agent#2706 (comment)

The text was updated successfully, but these errors were encountered:

elasticmachine · 2023-10-06T00:16:45Z

Pinging @elastic/fleet (Team:Fleet)

juliaElastic · 2023-10-06T09:16:44Z

@ycombinator Fleet UI/API considers upgrade ready when upgrade_started_at field is cleared and upgraded_at field is set. Instead of changing this condition, would it be possible for Agent/Fleet Server to only update these fields when all states completed?

cmacknz · 2023-10-06T13:33:59Z

There is a possible high impact problem here even for older versions of the agent that we can't change.

We start a process called the upgrade watcher that supervises the new version of the agent for 10 minutes after an upgrade. If another upgrade is attempted in this 10 minute window the upgrade watcher will interpret the restart that happens during this second upgrade as a crash and try to roll back the agent version. A roll back that occurs while another upgrade is in progress can have unpredictable results, the worst outcome is a broken agent installation.

The safest thing to do is to for versions of the agent that don't report upgrade states is to forbid another upgrade attempt until 10 minutes have elapsed from the previous upgrade attempt.

I am also starting to believe the default 10 minute upgrade watcher period is probably too long. Anything that is going to go wrong in the upgrade process is likely to happen much sooner than that, but this isn't a problem we can fix with Fleet.

Edit: added note that the worst case outcome without doing this is a broken agent installation.

cmacknz · 2023-10-06T14:04:09Z

Let's add a 10 minute cool down for all versions, and leave this issue as is to optimize for versions that report detailed upgrade states.

I created #168233 for the cool down implementation.

jillguyonnet · 2023-10-31T15:47:11Z

Quick clarification on the requirement:

As long as the upgrade_details field is present in check-in API requests, Fleet should consider an upgrade to be in progress and should not allow a second upgrade to be requested.

Is the upgrade_details field cleaned up in case of failed upgrade, i.e. when the upgrade state goes to UPG_FAILED? In other words can there be a scenario in which this field still exists and an upgrade should be allowed?

ycombinator · 2023-11-01T04:58:23Z

Quick clarification on the requirement:

As long as the upgrade_details field is present in check-in API requests, Fleet should consider an upgrade to be in progress and should not allow a second upgrade to be requested.

Is the upgrade_details field cleaned up in case of failed upgrade, i.e. when the upgrade state goes to UPG_FAILED? In other words can there be a scenario in which this field still exists and an upgrade should be allowed?

It's a good clarification, thank you for raising it.

I had forgotten (from the RFC) that we do want to allow a user to retry a failed upgrade (i.e. when upgrade_details.state == "UPG_FAILED"). So, you are correct: as UPG_FAILED is a terminal state of the upgrade process, we should allow users to (re)request an upgrade from this state.

As of now there are no plans to automatically clean up the upgrade_details field in case of a failed upgrade. This is by design as we would like the user to notice that the upgrade has failed and perhaps investigate which step the upgrade failed from (by looking at the upgrade_details.metadata.failed_state field) and the reason for the failure (by looking at the upgrade_details.metadata.error_msg field).

## Summary Closes #168171 Prevent upgrading an agent if it is already upgrading. Agents that report upgrade details are considered as upgrading when the `upgrade_details` field exists and is not equal to `UPG_FAILED` (cf. [this comment](#168171 (comment))). Agents that do _not_ report upgrade details are considered as upgrading when the `upgrade_started_at` field is set and the `upgraded_at` field is not. NB: this is existing behaviour, this PR does not change this logic. ### Checklist - [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios --------- Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

ycombinator added enhancement New value added to drive a business result Team:Fleet Team label for Observability Data Collection Fleet team labels Oct 6, 2023

ycombinator mentioned this issue Oct 6, 2023

Quick multiple upgrades while agent is still in grace period break agent installation elastic/elastic-agent#2706

Closed

This was referenced Oct 6, 2023

[Fleet] Allow users to force running an upgrade again from the UI #135539

Closed

[Fleet] Enforce a 10 minute cool down for attempts to upgrade an agent #168233

Closed

jillguyonnet self-assigned this Oct 24, 2023

jillguyonnet mentioned this issue Nov 9, 2023

[Fleet] Prevent agent upgrade requests when already upgrading #170963

Merged

2 tasks

jillguyonnet closed this as completed in #170963 Nov 16, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Fleet UI] Forbid Agent from being upgraded if upgrade is currently in progress #168171

[Fleet UI] Forbid Agent from being upgraded if upgrade is currently in progress #168171

ycombinator commented Oct 6, 2023

elasticmachine commented Oct 6, 2023

juliaElastic commented Oct 6, 2023

cmacknz commented Oct 6, 2023 •

edited

Loading

cmacknz commented Oct 6, 2023

jillguyonnet commented Oct 31, 2023

ycombinator commented Nov 1, 2023 •

edited

Loading

[Fleet UI] Forbid Agent from being upgraded if upgrade is currently in progress #168171

[Fleet UI] Forbid Agent from being upgraded if upgrade is currently in progress #168171

Comments

ycombinator commented Oct 6, 2023

elasticmachine commented Oct 6, 2023

juliaElastic commented Oct 6, 2023

cmacknz commented Oct 6, 2023 • edited Loading

cmacknz commented Oct 6, 2023

jillguyonnet commented Oct 31, 2023

ycombinator commented Nov 1, 2023 • edited Loading

cmacknz commented Oct 6, 2023 •

edited

Loading

ycombinator commented Nov 1, 2023 •

edited

Loading