
[Fleet] Allow users to force running an upgrade again from the UI #135539

Closed
joshdover opened this issue Jun 30, 2022 · 25 comments · Fixed by #166154
Labels: Project:FleetScaling, QA:Validated, Team:Fleet

Comments

@joshdover (Contributor)

Agent upgrades can get stuck in an "updating" status from time to time. While we address this at the root cause level, it could be helpful to have a way to re-trigger an upgrade from the UI. We already support this from the API:

POST kbn:/api/fleet/agents/<agent id>/upgrade
{ "force": true, "version": "8.3.0" }

[Screenshot 2022-06-30 at 13:32:49]

@joshdover added the Team:Fleet and Project:FleetScaling labels on Jun 30, 2022
@elasticmachine (Contributor)

Pinging @elastic/fleet (Team:Fleet)

@joshdover (Contributor, Author)

I think we should consider this for 8.6. Upgrades can fail for a number of reasons and right now there's no way to force a clean state from the UI. We should at least allow users to do this after the original upgrade action has expired.

@joshdover (Contributor, Author) commented Sep 5, 2022

We should short-circuit this a bit. If the agent is already on the requested version (based on the local_metadata field in .fleet-agents) and the force upgrade API is called, we should probably just update the upgraded_at and upgrade_started_at fields instead of running a full upgrade again.

Otherwise there is a risk that the agent will try to install over itself, which can cause problems.
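
A minimal sketch of what that short-circuit could look like at the document level - assuming direct access to the restricted .fleet-agents index (e.g. via the fleet_superuser setup shown in the verification steps later in this thread), with illustrative field values:

```
# Mark the upgrade as complete instead of re-running it: set upgraded_at
# and clear upgrade_started_at when local_metadata already reports the
# requested version.
curl -sk -XPOST --user fleet_superuser:password -H 'content-type:application/json' \
  -H 'x-elastic-product-origin:fleet' \
  http://localhost:9200/.fleet-agents/_update/<agent id> -d '
  {
    "doc": {
      "upgraded_at": "2022-09-05T00:00:00Z",
      "upgrade_started_at": null
    }
  }'
```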

@joshdover (Contributor, Author)

@nimarezainia Could you help us define exactly how the UX should work here?

@nimarezainia (Contributor)

@joshdover (sorry for the delay)

I think we should follow the pattern we already have for the "Unenroll" command, where the user is asked whether they want to force unenroll. I realize the action taken is very different, but in this case we have an "Upgrade" action - an extra dialog could ask whether it should be forced, e.g. an extra checkbox in the following dialog box:

[Image: upgrade dialog box]

@dborodyansky & @kpollich thoughts?

@dborodyansky (Contributor)

Is there a time duration after which we can infer that a problem may have occurred? If so, we may consider displaying an alert and a restart action at an appropriate point in time, as follows. (copy needs review)

[Image: mockup of an alert with a restart action]

Otherwise, we may present a restart action anytime an upgrade is in progress. Might this cause potentially unnecessary restarts?

[Image: mockup of a restart action shown while an upgrade is in progress]

@kpollich (Member)

> Is there a time duration after which we can infer that a problem may have occurred?

This might be something we can infer from telemetry around upgrades, but I'm not sure how to get that data myself.

> Otherwise, we may present a restart action anytime an upgrade is in progress. Might this cause potentially unnecessary restarts?

I'm a little wary about providing the "restart" option at all times, but I don't think we'd have technical issues with back-to-back restarts. Josh's comment above about short-circuiting the upgrade path mitigates the risks around force upgrading, as I understand it.

@nimarezainia (Contributor)

@joshdover @cmacknz @kpollich I wanted to resurface this issue and see if we can go down the path of self-healing - without user intervention.

  1. Once we have more granular telemetry around the upgrades, would we be in a better position to detect a "stuck" upgrade?
  2. Could we then have a retry mechanism in place that would force the upgrade (at least maybe once), to correct without user intervention?

@joshdover (Contributor, Author)

We introduced retries locally on the agent in 8.6.0: elastic/elastic-agent#778

I'm curious how many of the bugs we're seeing now happen after upgrading to 8.6.0 and then attempting to upgrade to a newer version. It's possible this has already solved many of the cases, except those where a retry won't fix the problem (such as no disk space for the binary download).

These are the cases where I think this issue comes into play. But I think we first need to move the agent into a different state that allows the user to diagnose the issue and determine what to do next before retrying the upgrade.

@michaelmagyar

Better update logic would be very welcome. We still have many agents that fail to upgrade and it's hard to diagnose the reason (it's not disk space or network issues).

In the meantime, please do implement a UI option to force upgrades, preferably with the ability to select many agents. We've been asking for that feature for at least a year to address this issue.

I just spent a few hours re-upgrading our fleet after moving from 8.6.2 to 8.7.1. The UI upgrade failed for almost all agents, leaving them stuck as "Upgrading", so I had to grab each stuck agent's ID and call the upgrade API with "force": true to unblock each one. The fun part? Even though I specified "version": "8.7.1", most of the agents didn't actually upgrade and stayed on 8.6.2 (I waited an hour+). At least they went back to healthy. I then retried upgrading the agents from the UI, and again many were unsuccessful and are stuck "Upgrading". I may end up uninstalling all agents and reinstalling them fresh, because that seems to be the only solution.

Managing agents is a consistent pain point. If a force upgrade UI option isn't in the next version, I will spend a few hours building a front-end script to effectively "select all upgrading; force upgrade" (see the sketch below). I may even build scripts to completely uninstall and reinstall the agent. This eats up so much of our time and unnecessarily adds to the TCO.
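
A minimal sketch of that "select all upgrading; force upgrade" script, assuming the existing bulk_upgrade endpoint and that its agents field accepts a KQL query (a local Kibana and dev credentials are also assumed):

```
# Force-upgrade every agent currently in updating state in one call.
curl -sk -XPOST --user elastic:changeme \
  -H 'content-type: application/json' -H 'kbn-xsrf: fleet' \
  http://localhost:5601/api/fleet/agents/bulk_upgrade -d '
  {
    "agents": "status:updating",
    "version": "8.7.1",
    "force": true
  }'
```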

@joshdover (Contributor, Author)

@nimarezainia @zombieFox We need a decision on the UI here before work can begin. There are two suggestions above, and I'm not sure we've decided which to do. I also think we probably need a way to do this in bulk from the agent list view. This also overlaps with the more granular upgrade status reporting.

I prefer this option as it's more obvious, but we need guidance on how the bulk upgrade modal should change too.

[Image]

@nimarezainia (Contributor)

Discussing this in the UX issue. I also agree that selecting agents after a filter is applied (like status:updating) and then performing the action in bulk would be the most common workflow.

@juliaElastic self-assigned this on Sep 6, 2023
@juliaElastic (Contributor) commented Sep 11, 2023

The definition of this feature is not conclusive. What do you think of implementing it this way:

  1. For single agent, add the warning icon and the Restart upgrade callout shown here if the agent started upgrading more than 1 day (?) ago.
  2. For bulk actions, add a new action Restart upgrade that is only enabled if all selected agents are in updating state, regardless of when the update started.
     • The Restart upgrade action could replace the Upgrade agents action in this case.
  3. For consistency, we should add the Restart upgrade action to the single agent action list, in which case 1. might be redundant.
  4. In case of bulk action, when more than one page is selected (actioning all agents that match a query), it's not trivial to determine whether all of them are in Updating state. Is this feature in scope for this issue? I think in most cases there will be a few agents stuck in updating; otherwise there is a workaround to call the API with the force flag manually.
  5. Do we want to change anything in the Upgrade agent modal for Restart upgrade? We can change the title to Restart upgrade agent.

[Image]

@nimarezainia @zombieFox WDYT?

@cmacknz (Member) commented Sep 11, 2023

Just chiming in to raise elastic/elastic-agent#2706 as something we'll either need to implement before this or account for in this implementation. Today, if you implemented this with no changes to the agent, you could only safely attempt a force upgrade once every 10 minutes.

@nimarezainia (Contributor) commented Sep 12, 2023

> The definition of this feature is not conclusive. What do you think of implementing it this way:

> 1. For single agent, add the warning icon and the `Restart upgrade` callout [shown here](https://github.com/elastic/kibana/issues/135539#issuecomment-1687594884) if the agent started upgrading more than 1 day (?) ago.

I think this would be great, although why more than 1 day? There are issues with a quick succession of upgrades, so I guess we do need to throttle this.

> 2. For bulk actions, add a new action `Restart upgrade` that is only enabled if all selected agents are in updating state, regardless of when the update started.

Could it be enabled if there's at least one agent in the set that is in "updating" state, with the action sent only to the agents in Updating (agents in other states don't get the action)? I suspect you can't do this because it's a bulk action.
What would happen if this action was sent to a Healthy agent (or any agent in a state other than Updating) - will the action be ignored?

> • The `Restart upgrade` action could replace the `Upgrade agents` action in this case.

Is this because of screen real estate? I'm happy for it to be greyed out rather than fully removed.

> 3. For consistency, we should add the `Restart upgrade` action to the single agent action list, in which case 1. might be redundant.

Good point. Yes.

> 4. In case of bulk action, when more than one page is selected (actioning all agents that match a query), it's not trivial to determine whether all of them are in `Updating` state. Is this feature in scope for this issue? I think in most cases there will be a few agents stuck in updating; otherwise there is a workaround to call the API with the force flag manually.

Would what I suggested above address this? Before sending the action, are we able to tell what state the agent is in? I suspect not, because this is a bulk action. Perhaps a no-op on the agent if it's already in a state other than Updating?

> 5. Do we want to change anything in the `Upgrade agent` modal for `Restart upgrade`? We can change the title to `Restart upgrade agent`.

I'd say "Restart upgrade" is pretty good.


@juliaElastic (Contributor) commented Sep 12, 2023

> I think this would be great, although why more than 1 day? There are issues with a quick succession of upgrades, so I guess we do need to throttle this.

I took a guess that 1 day would mean the agent is stuck in updating; we can choose any other timeframe if we think it works better. What about 1 hour, or at least 10 minutes?

> Could it be enabled if there's at least one agent in the set that is in "updating" state, with the action sent only to the agents in Updating (agents in other states don't get the action)? I suspect you can't do this because it's a bulk action.
> What would happen if this action was sent to a Healthy agent (or any agent in a state other than Updating) - will the action be ignored?

I think it would be confusing to only send the action to updating agents; in the UI it would look like a bug that not all agents received the action. If we send the action to healthy agents too, it might have unintended consequences, since this is a force upgrade action - e.g. using a lower version on an already upgraded agent might cause a downgrade.

> Would what I suggested above address this? Before sending the action, are we able to tell what state the agent is in? I suspect not, because this is a bulk action. Perhaps a no-op on the agent if it's already in a state other than Updating?

Not really; for the query-based actions, we need another API call to determine whether all agents are in updating state. We can do this, I just wanted to highlight that it increases the effort.

So to summarize, I think we should only allow bulk restart upgrade if all selected agents are in updating status. We can suggest filtering down to Updating status first to trigger the restart.

Also, given the agent limitation, we should check in the bulk action that all agents started upgrading at least X amount of time ago; the action would fail for updating agents that started their upgrade more recently. An illustration of that check follows.
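
For illustration, agents stuck past the window could be found with a range query on upgrade_started_at - a sketch assuming a 2-hour window and direct access to the restricted .fleet-agents index (as in the fleet_superuser setup used for verification later in this thread):

```
# Find active agents whose upgrade started more than 2 hours ago
# (assumes upgrade_started_at is cleared when an upgrade completes).
curl -sk -XPOST --user fleet_superuser:password -H 'content-type:application/json' \
  -H 'x-elastic-product-origin:fleet' \
  http://localhost:9200/.fleet-agents/_search -d '
  {
    "query": {
      "bool": {
        "filter": [
          { "term": { "active": true } },
          { "range": { "upgrade_started_at": { "lt": "now-2h" } } }
        ]
      }
    }
  }'
```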

@juliaElastic (Contributor) commented Sep 12, 2023

This is work in progress; this is how the callout would look:

[Image: callout mockup]

@cmacknz (Member) commented Sep 12, 2023

> What about 1 hour, or at least 10 minutes?

Whatever time interval is picked should be longer than the configured download timeout, which currently defaults to 2 hours.

https://github.com/elastic/elastic-agent/blob/0a19f365f4a6c1014074cbe79752f63bd0312f92/internal/pkg/agent/application/upgrade/artifact/config.go#L158-L161

Any value less than this will result in agents being falsely marked as stuck in updating.

@nimarezainia (Contributor)

> So to summarize, I think we should only allow bulk restart upgrade if all selected agents are in updating status. We can suggest filtering down to Updating status first to trigger the restart.

@juliaElastic OK, thanks - let's do this. But please let's also put plenty of information in the flyout so users are aware. For example, they may have selected a set of agents to apply this restart upgrade action to, but that set includes healthy agents too. Let's be informative and (ideally) let the user know how many agents in the list are healthy and therefore can't have the action applied. I'm happy to help review the copy.

@nimarezainia (Contributor)

> What about 1 hour, or at least 10 minutes?
>
> Whatever time interval is picked should be longer than the configured download timeout, which currently defaults to 2 hours.
>
> https://github.com/elastic/elastic-agent/blob/0a19f365f4a6c1014074cbe79752f63bd0312f92/internal/pkg/agent/application/upgrade/artifact/config.go#L158-L161
>
> Any value less than this will result in agents being falsely marked as stuck in updating.

How are we going to track the download window defined by this timeout? Say the user selects 100 agents to apply this to; their individual upgrade processes started at different times (based on our rolling upgrades), so these agents could have very different lengths of time left on their timeouts.

@cmacknz I'd imagine that at the moment Fleet doesn't know whether an agent is stuck in updating due to a failed download. If it was in another state of the upgrade process, we could perhaps apply this force upgrade action.

Instead of building complexity into Fleet to know which agent should be sent an action and when, wouldn't it be more scalable to have logic in the agent to ignore actions that are not applicable? In this scenario, an agent stuck in upgrading because of download issues would just reject the force update action.

Does this make sense?
Do we need the upgrade telemetry changes before we can do this?

@cmacknz (Member) commented Sep 13, 2023

> How are we going to track the download window defined by this timeout? Say the user selects 100 agents to apply this to; their individual upgrade processes started at different times (based on our rolling upgrades), so these agents could have very different lengths of time left on their timeouts.

Correct. It may not actually be worth trying to track in Fleet whether an agent is stuck in updating. With the changes in https://github.com/elastic/ingest-dev/issues/1621, the agent should just tell Fleet exactly what is happening with the upgrade, and Fleet won't need to infer or calculate anything.

> wouldn't it be more scalable to have logic in the agent to ignore actions that are not applicable?

In general, the most scalable thing to do is to stop doing work as early as possible. In this situation, it would be better for Fleet to realize that an agent in the bulk upgrade is ineligible to be upgraded and not generate the upgrade action at all, removing the work of delivering and processing it from both Fleet Server and the agent. As an analogy: sending this to the agent would be like someone (the agent) making a phone call just to tell you they weren't going to call you back.

@juliaElastic (Contributor) commented Sep 13, 2023

We can check the upgrade_started_at field for all agents in the bulk upgrade logic in the Kibana Fleet API and prevent sending the action to agents that are not stuck in updating (i.e. those that started less than 2 hours ago).

EDIT: after thinking about this more, it's going to be tricky on the API side to decide whether a call is a restart upgrade, as the API just receives a force upgrade call. So unless we want to introduce a new API/field for this scenario (which I'm not keen on), it would be better to determine on the UI side whether agents are stuck updating in a bulk action.

What we could do is enable the Restart upgrade action on any agent selection, and then when the modal opens, query for agents stuck in updating, display that info to the users, and only action those agents. This would make it easier for them, without having to filter on updating first.

Something like this: they selected 20 agents, 2 are stuck in updating, so the modal confirms that 2 will be upgraded:
[Image]

The first working version of this is available in the PR: #166154

> Correct. It may not actually be worth trying to track in Fleet whether an agent is stuck in updating. With the changes in https://github.com/elastic/ingest-dev/issues/1621, the agent should just tell Fleet exactly what is happening with the upgrade, and Fleet won't need to infer or calculate anything.

We can add the Restart upgrade action now with the condition that the upgrade started at least 2 hours ago, and change the condition to the upgrade state being failed once that feature is ready. A sketch of the query the modal could run is below.

Regarding the copy, should we be more specific? Instead of "Agent has been updating for a while", we could say "Agent has been updating for more than 2 hours", and later replace it with "Agent upgrade is stuck in a failed state".
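
A sketch of that modal-time query using the existing agents list API, filtering to updating agents (the 2-hour check on upgrade_started_at is assumed to happen client-side on the returned documents):

```
# List agents currently in updating state; the modal can then keep only
# those whose upgrade_started_at is older than 2 hours.
curl -sk --user elastic:changeme \
  "http://localhost:5601/api/fleet/agents?kuery=status:updating"
```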

juliaElastic added a commit that referenced this issue Sep 18, 2023
## Summary

Ready to review, tests are WIP

Resolves #135539
 
For agents in `Updating` state, adding a `Restart upgrade` action, which
uses the force flag on the API to bypass the updating check. For a single
agent, the action is only added if the agent started upgrading more than
2 hours ago (see discussion in the
[issue](#135539 (comment))).
For bulk selection, the action appears if all selected agents are in
updating state.

To verify:
- Start local es, kibana, fleet server (can be docker)
- Enroll agents with horde
- Update agent docs with the queries below to simulate being stuck in
updating for more than 2 hours (modify the query to target specific agents)
- Bulk select agents and check that the `Restart upgrade` action is visible
and that it triggers another upgrade with the force flag only on the agents
stuck in updating
- When calling the bulk_upgrade API, the UI adds the updating condition
to the query, or includes only the agent IDs that are stuck in updating,
depending on query or manual selection

```
curl -sk -XPOST --user elastic:changeme -H 'content-type:application/json' \
http://localhost:9200/_security/role/fleet_superuser -d '
   {
      "indices": [
            {
               "names": [".fleet*",".kibana*"],
               "privileges": ["all"],
               "allow_restricted_indices": true
            }
      ]
   }'

curl -sk -XPOST --user elastic:changeme -H 'content-type:application/json' \
http://localhost:9200/_security/user/fleet_superuser -d '
   {
      "password": "password",
      "roles": ["superuser", "fleet_superuser"]
   }'

curl -sk -XPOST --user fleet_superuser:password -H 'content-type:application/json' \
-H 'x-elastic-product-origin:fleet' \
http://localhost:9200/.fleet-agents/_update_by_query -d '
   {
      "script": {
        "source": "ctx._source.upgrade_started_at = \"2023-09-13T11:26:23Z\"",
        "lang": "painless"
      },
      "query": {
        "exists": {
          "field": "tags"
        }
      }
   }'
```


Agent details action:

<img width="1440" alt="image"
src="https://github.com/elastic/kibana/assets/90178898/3704e781-337e-4175-b6d2-a99375b6cc24">

Agent list, bulk action:

<img width="1482" alt="image"
src="https://github.com/elastic/kibana/assets/90178898/74f46861-393e-4a86-ab1f-c21e8d325e89">

Agent list, single agent action:

<img width="369" alt="image"
src="https://github.com/elastic/kibana/assets/90178898/2aa087a1-b6e1-4e44-b1db-34592f3df959">

Agent details callout:
<img width="771" alt="image"
src="https://github.com/elastic/kibana/assets/90178898/d1584fe6-0c98-4033-8527-27235812c004">

Select all agents on first page, restart upgrade modal shows only those
that are stuck in upgrading:
<img width="1317" alt="image"
src="https://github.com/elastic/kibana/assets/90178898/fe858815-4393-4007-abe6-e26e4a884bd3">

Select all agents on all pages, restart upgrade modal shows only those
that are stuck upgrading, with an extra api call to get the count:
<img width="2443" alt="image"
src="https://github.com/elastic/kibana/assets/90178898/481b6ab5-fc87-4586-aa9b-432439234ac6">


### Checklist

- [x] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
@cmacknz (Member) commented Oct 6, 2023

Related #168171 (comment)

We likely need to forbid force upgrade attempts more frequently than once every 10 minutes.

@amolnater-qasource

Hi Team,

We have executed 04 test cases under the Feature test run for the 8.11.0 release at the link:

Status:

  • PASS: 04

Build details:
VERSION: 8.11.0 BC4
BUILD: 68044
COMMIT: caa3613
Artifact Link: https://staging.elastic.co/8.11.0-cb971279/summary-8.11.0.html

As the testing is completed on this feature, we are marking this as QA:Validated.

Please let us know if anything else is required from our end.
Thanks

@amolnater-qasource added the QA:Validated label and removed the QA:Needs Validation label on Oct 26, 2023