Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fleet UI] Forbid Agent from being upgraded if upgrade is currently in progress #168171

Closed
ycombinator opened this issue Oct 6, 2023 · 6 comments · Fixed by #170963
Closed

[Fleet UI] Forbid Agent from being upgraded if upgrade is currently in progress #168171

ycombinator opened this issue Oct 6, 2023 · 6 comments · Fixed by #170963
Assignees
Labels
enhancement New value added to drive a business result Team:Fleet Team label for Observability Data Collection Fleet team

Comments

@ycombinator
Copy link
Contributor

Today, neither the Fleet UI nor the Fleet API will allow the user to initiate the second upgrade if the first one is deemed to be still in progress. The UI grays out the "Upgrade agent" link and the API returns a 400 Bad Request response saying the agent is not upgradeable.

Side bar: The Upgrade Watcher is a process that the old (as in, pre-upgrade) version of Agent spawns right before it starts up the new (as in, post-upgrade) version of Agent. The job of the Upgrade Watcher is to monitor the health of the new Agent and, if necessary, rollback to the old Agent.

I have noticed that Fleet considers an upgrade as no longer in progress even while the Upgrade Watcher from that upgrade is still running. In other words, Fleet today will allow a user to request a second upgrade even while the Upgrade Watcher from the first upgrade is still running.

Going forward, however, once elastic/elastic-agent#3119 is resolved, Agent will start sending Fleet Server details about an ongoing upgrade via the upgrade_details field in check-in API requests. This field will contain, amongst other information, the current state of the upgrade process, with one of the states being UPG_WATCHING. This state means the Upgrade Watcher is running.

As long as the upgrade_details field is present in check-in API requests, Fleet should consider an upgrade to be in progress and should not allow a second upgrade to be requested.

For more background on this enhancement request, see elastic/elastic-agent#2706 (comment)

@ycombinator ycombinator added enhancement New value added to drive a business result Team:Fleet Team label for Observability Data Collection Fleet team labels Oct 6, 2023
@elasticmachine
Copy link
Contributor

Pinging @elastic/fleet (Team:Fleet)

@juliaElastic
Copy link
Contributor

@ycombinator Fleet UI/API considers upgrade ready when upgrade_started_at field is cleared and upgraded_at field is set. Instead of changing this condition, would it be possible for Agent/Fleet Server to only update these fields when all states completed?

@cmacknz
Copy link
Member

cmacknz commented Oct 6, 2023

There is a possible high impact problem here even for older versions of the agent that we can't change.

We start a process called the upgrade watcher that supervises the new version of the agent for 10 minutes after an upgrade. If another upgrade is attempted in this 10 minute window the upgrade watcher will interpret the restart that happens during this second upgrade as a crash and try to roll back the agent version. A roll back that occurs while another upgrade is in progress can have unpredictable results, the worst outcome is a broken agent installation.

The safest thing to do is to for versions of the agent that don't report upgrade states is to forbid another upgrade attempt until 10 minutes have elapsed from the previous upgrade attempt.

I am also starting to believe the default 10 minute upgrade watcher period is probably too long. Anything that is going to go wrong in the upgrade process is likely to happen much sooner than that, but this isn't a problem we can fix with Fleet.

Edit: added note that the worst case outcome without doing this is a broken agent installation.

@cmacknz
Copy link
Member

cmacknz commented Oct 6, 2023

Let's add a 10 minute cool down for all versions, and leave this issue as is to optimize for versions that report detailed upgrade states.

I created #168233 for the cool down implementation.

@jillguyonnet jillguyonnet self-assigned this Oct 24, 2023
@jillguyonnet
Copy link
Contributor

Quick clarification on the requirement:

As long as the upgrade_details field is present in check-in API requests, Fleet should consider an upgrade to be in progress and should not allow a second upgrade to be requested.

Is the upgrade_details field cleaned up in case of failed upgrade, i.e. when the upgrade state goes to UPG_FAILED? In other words can there be a scenario in which this field still exists and an upgrade should be allowed?

@ycombinator
Copy link
Contributor Author

ycombinator commented Nov 1, 2023

Quick clarification on the requirement:

As long as the upgrade_details field is present in check-in API requests, Fleet should consider an upgrade to be in progress and should not allow a second upgrade to be requested.

Is the upgrade_details field cleaned up in case of failed upgrade, i.e. when the upgrade state goes to UPG_FAILED? In other words can there be a scenario in which this field still exists and an upgrade should be allowed?

It's a good clarification, thank you for raising it.

I had forgotten (from the RFC) that we do want to allow a user to retry a failed upgrade (i.e. when upgrade_details.state == "UPG_FAILED"). So, you are correct: as UPG_FAILED is a terminal state of the upgrade process, we should allow users to (re)request an upgrade from this state.

As of now there are no plans to automatically clean up the upgrade_details field in case of a failed upgrade. This is by design as we would like the user to notice that the upgrade has failed and perhaps investigate which step the upgrade failed from (by looking at the upgrade_details.metadata.failed_state field) and the reason for the failure (by looking at the upgrade_details.metadata.error_msg field).

jillguyonnet added a commit that referenced this issue Nov 16, 2023
## Summary

Closes #168171

Prevent upgrading an agent if it is already upgrading.

Agents that report upgrade details are considered as upgrading when the
`upgrade_details` field exists and is not equal to `UPG_FAILED` (cf.
[this
comment](#168171 (comment))).

Agents that do _not_ report upgrade details are considered as upgrading
when the `upgrade_started_at` field is set and the `upgraded_at` field
is not. NB: this is existing behaviour, this PR does not change this
logic.

### Checklist

- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New value added to drive a business result Team:Fleet Team label for Observability Data Collection Fleet team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants