[Fleet] Allow users to force running an upgrade again from the UI #135539
Comments
Pinging @elastic/fleet (Team:Fleet)
I think we should consider this for 8.6. Upgrades can fail for a number of reasons and right now there's no way to force a clean state from the UI. We should at least allow users to do this after the original upgrade action has expired.
We should short-circuit this a bit: if the agent is already on the requested version, skip the upgrade. Otherwise there is a risk that the agent will try to install over itself, which can cause problems.
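A minimal sketch of that short-circuit, assuming we can compare the version the agent currently reports against the requested version (the field names here are illustrative, not the exact Fleet agent schema):

```ts
// Illustrative only: skip issuing the upgrade action when the agent already
// reports the requested version, so it never tries to install over itself.
interface AgentVersionInfo {
  id: string;
  version: string; // version the agent currently reports
}

function shouldIssueUpgrade(agent: AgentVersionInfo, targetVersion: string): boolean {
  return agent.version !== targetVersion;
}
```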
@nimarezainia Could you help us define exactly how the UX should work here?
@joshdover (sorry for the delay) I think we should follow the pattern we already have for the "Unenroll" command, where the user is asked if they want to FORCE unenroll. I realize that the action taken is very different; however, in this case we have an "Upgrade" action, so perhaps an extra dialog could ask whether it needs to be "FORCED". Perhaps an extra checkbox in the following dialog box: @dborodyansky & @kpollich thoughts?
Is there a time duration at which we can infer that a problem may have occurred? If so, we may consider displaying an alert and a restart action at an appropriate point in time, as follows (copy needs review). Otherwise, we may present a restart action anytime an upgrade is in progress. Might this cause potentially unnecessary restarts?
This might be something we can infer from telemetry around upgrades, but I'm not sure how to get that data myself.
I'm a little wary about providing the "restart" option at all times, but I don't think we'd have technical issues around back-to-back restarts. Josh's comment above about short-circuiting the upgrade path mitigates the risks around force upgrading as I understand it.
@joshdover @cmacknz @kpollich I wanted to resurface this issue and see if we can go down the path of self-healing, without user intervention.
We introduced retries locally on the agent in 8.6.0: elastic/elastic-agent#778. I'm curious how many of the bugs we're seeing now happen after upgrading to 8.6.0 and then attempting to upgrade to a newer version. It's possible this has already solved many of the cases, except those where a retry won't fix the problem (such as no disk space for the binary download). Those are the cases where I think this issue comes into play. But I think we first need to move the agent into a different state that then allows the user to diagnose the issue and determine what to do next before retrying the upgrade.
Better update logic would be very welcome. We still have many agents that fail to upgrade, and it's hard to diagnose the reason (it's not disk space or network issues). In the meantime, please do implement a UI option to force upgrades, preferably with the ability to select many agents. We've been asking for that feature for at least a year to address this issue. I just spent a few hours re-upgrading our fleet after moving from 8.6.2 to 8.7.1. The UI upgrade failed for almost all agents, leaving them all stuck as "Upgrading", so I had to grab each stuck agent's ID and use the upgrade API with the force flag. Managing agents is a consistent pain point. If a force upgrade UI option isn't in the next version, I will spend a few hours building a front-end script to effectively "select all upgrading; force upgrade". I may even build scripts to completely uninstall and reinstall the agent. This eats up so much of our time and unnecessarily adds to the TCO.
@nimarezainia @zombieFox We need a decision on the UI here before work can begin. There are two suggestions above, and I'm not sure we decided which to do. I also think we probably need a way to do this in bulk from the agent list view. This also overlaps with the more granular upgrade status reporting. I prefer this option as it's more obvious, but we need guidance on how the bulk upgrade modal should change too.
Discussing this in the UX issue. I also agree that selecting agents after a filter is applied (like status: updating) and then performing the action in bulk would be the most common workflow.
The definition of this feature is not conclusive. What do you think about implementing it this way:
@nimarezainia @zombieFox WDYT?
Just chiming in to raise elastic/elastic-agent#2706 as something we'll either need to implement before this, or account for in this implementation. Today, if you implemented this with no changes to the agent, you could only attempt a force upgrade once every 10 minutes to be safe.
I think this would be great, although why more than 1 day? There are issues with upgrades in quick succession, so I guess we do need to throttle this.
Can it be enabled if there's at least one agent in the set that is in "updating" state, and we only send the action to the agents in Updating? (Agents in other states don't get the action.) I suspect you can't do this because it's a bulk action.
Is this because of real estate? I'm happy for it to grey out rather than be fully removed.
Good point. Yes.
Would what I suggested above address this? Before sending the action, are we able to tell what state the agent is in? I suspect not, because this is a bulk action. Perhaps a no-operation on the agent if it's already in a state other than Updating?
I'd say restart upgrade is pretty good.
I took a guess that 1 day would mean the agent is stuck in updating; we can choose any other timeframe if we think that works better. What about 1 hour, or at least 10 minutes?
I think it would be confusing to only send the action to updating agents; in the UI it would look like a bug that not all agents received an action. If we send the action to healthy agents too, it might have unintended consequences, as this is a force upgrade action: e.g. trying to use a lower version on an already upgraded agent might cause a downgrade.
Not really; for the query-based actions, we need another API call to determine whether all agents are in updating state. We can do this, I just wanted to highlight that it increases the effort. So to summarize, I think we should only allow bulk restart upgrade if all selected agents are in updating state. Also, given the agent limitation, we should do a check on the bulk action that all agents started upgrading at least X amount of time ago. The action would fail for updating agents that started the upgrade more recently.
Whatever time interval is picked should be longer than the currently configured download timeout, which defaults to 2 hours. Any amount less than this will result in agents falsely being marked as stuck in updating.
@juliaElastic ok thanks, let's do this. But please let's also put plenty of information on the flyout about this so users are aware. For example, they may have selected a set of agents to apply this restart upgrade action to, but that set includes healthy agents too. Let's be informative and (ideally) let the user know how many agents in the list are healthy and therefore can't have the action applied. I'm happy to help review the copy.
How are we going to track the download window defined by this timeout? Say the user selects 100 agents to apply this to; their individual upgrade processes started at different times (based on our rolling upgrades), so these agents could have very different lengths of time left on their timeout. @cmacknz I'd imagine that at the moment Fleet doesn't know whether an agent is stuck in updating due to a failed download. If it was in another state of the upgrade process, we could perhaps apply this force upgrade action. Instead of building complexity into Fleet to know which agent should be sent an action and when, wouldn't it be more scalable to have logic in the agent to ignore actions that are not applicable? So in this scenario, an agent stuck in upgrading because of download issues would just reject the force upgrade action. Does this make sense?
Correct. It may not actually be worth trying to track whether an agent is stuck in updating in Fleet. With the changes in https://github.com/elastic/ingest-dev/issues/1621 the agent should just be telling Fleet exactly what is happening with the upgrade, and Fleet won't need to infer or calculate anything.
In general, the most scalable thing to do is to decide to stop doing work as early as possible. In this situation it would be better for Fleet to realize the agent in the bulk upgrade is ineligible to be upgraded and not generate the upgrade action at all, removing the work of delivering and processing it from Fleet Server and the Agent itself. As an analogy, if we sent this to the agent, it would be like someone (the agent) making a phone call just to tell you they weren't going to call you back.
We can check the …

EDIT: After thinking about this more, it's going to be tricky on the API side to decide whether this is a restart upgrade, as the API just gets a force upgrade call. So unless we want to introduce a new API/field for this scenario (which I'm not keen on), it would be better to determine on the UI side whether agents are stuck updating in a bulk action. What we could do is enable the … Something like this: they selected 20 agents, 2 are stuck in updating, so the modal confirms that 2 will be upgraded.

The first working version of this is available in the PR: #166154
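As a rough sketch of that UI-side check (assuming the agent documents expose a `status` and the `upgrade_started_at` field used in the verification steps of the PR below, and using the 2-hour download timeout discussed earlier as the threshold), the selection could be filtered down to the agents that look stuck so the modal can report an accurate count:

```ts
// Sketch only: filter the user's selection down to agents that appear stuck in
// updating, i.e. status is 'updating' and the upgrade started longer ago than
// the assumed 2-hour threshold.
interface FleetAgentDoc {
  id: string;
  status: string;              // e.g. 'healthy' | 'updating' | 'offline'
  upgrade_started_at?: string; // ISO timestamp set when the upgrade action was issued
}

const STUCK_THRESHOLD_MS = 2 * 60 * 60 * 1000; // assumed: matches the default download timeout

function getStuckUpdatingAgents(selected: FleetAgentDoc[], now = Date.now()): FleetAgentDoc[] {
  return selected.filter(
    (agent) =>
      agent.status === 'updating' &&
      !!agent.upgrade_started_at &&
      now - Date.parse(agent.upgrade_started_at) >= STUCK_THRESHOLD_MS
  );
}
```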
We can add the …

Regarding the copy, should we be more specific? Instead of "Agent has been updating for a while" we could say "Agent has been updating for more than 2 hours", and later replace it with "Agent upgrade is stuck in a failed state".
## Summary

Ready to review, tests are WIP

Resolves #135539

For agents in `Updating` state, adding a `Restart upgrade` action, which uses the force flag on the API to bypass the updating check.

For a single agent, the action is only added if the agent has been upgrading for more than 2 hours (see discussion in [issue](#135539 (comment))).

For bulk selection, the action appears if all selected agents are in updating state.

To verify:
- Start local es, kibana, fleet server (can be docker)
- Enroll agents with horde
- Update agent docs with the queries below to simulate being stuck in updating for more than 2 hours (modify the query to target specific agents)
- Bulk select agents and check that the `Restart upgrade` action is visible, and that it triggers another upgrade with the force flag only on the agents stuck in updating
- When calling the bulk_upgrade API, the UI adds the updating condition to the query, or includes only those agent ids that are stuck in updating, depending on query or manual selection

```
curl -sk -XPOST --user elastic:changeme -H 'content-type:application/json' \
  http://localhost:9200/_security/role/fleet_superuser -d '
{
  "indices": [
    {
      "names": [".fleet*",".kibana*"],
      "privileges": ["all"],
      "allow_restricted_indices": true
    }
  ]
}'

curl -sk -XPOST --user elastic:changeme -H 'content-type:application/json' \
  http://localhost:9200/_security/user/fleet_superuser -d '
{
  "password": "password",
  "roles": ["superuser", "fleet_superuser"]
}'

curl -sk -XPOST --user fleet_superuser:password -H 'content-type:application/json' \
  -H'x-elastic-product-origin:fleet' \
  http://localhost:9200/.fleet-agents/_update_by_query -d '
{
  "script": {
    "source": "ctx._source.upgrade_started_at = \"2023-09-13T11:26:23Z\"",
    "lang": "painless"
  },
  "query": {
    "exists": {
      "field": "tags"
    }
  }
}'
```

Agent details action:
<img width="1440" alt="image" src="https://github.com/elastic/kibana/assets/90178898/3704e781-337e-4175-b6d2-a99375b6cc24">

Agent list, bulk action:
<img width="1482" alt="image" src="https://github.com/elastic/kibana/assets/90178898/74f46861-393e-4a86-ab1f-c21e8d325e89">

Agent list, single agent action:
<img width="369" alt="image" src="https://github.com/elastic/kibana/assets/90178898/2aa087a1-b6e1-4e44-b1db-34592f3df959">

Agent details callout:
<img width="771" alt="image" src="https://github.com/elastic/kibana/assets/90178898/d1584fe6-0c98-4033-8527-27235812c004">

Select all agents on the first page; the restart upgrade modal shows only those that are stuck upgrading:
<img width="1317" alt="image" src="https://github.com/elastic/kibana/assets/90178898/fe858815-4393-4007-abe6-e26e4a884bd3">

Select all agents on all pages; the restart upgrade modal shows only those that are stuck upgrading, with an extra API call to get the count:
<img width="2443" alt="image" src="https://github.com/elastic/kibana/assets/90178898/481b6ab5-fc87-4586-aa9b-432439234ac6">

### Checklist

- [x] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
Related to #135539

- Added doc link
- Styling fix

<img width="753" alt="image" src="https://github.com/elastic/kibana/assets/90178898/a847c882-a2eb-4729-86da-5885f636b94c">
<img width="574" alt="image" src="https://github.com/elastic/kibana/assets/90178898/f642e1b2-97c9-4bc8-9763-f2e224d4b3bd">
Hi Team, We have created 04 test cases in TestRail for this feature under the Fleet test suite at the following links:
Please let us know if any other scenario needs to be added from our end. Thanks!
Related: #168171 (comment). We likely need to forbid force upgrade attempts more frequently than once every 10 minutes.
Hi Team, We have executed 04 test cases under the feature test run for the 8.11.0 release at the link below. Status:
Build details: As testing on this feature is complete, we are marking it as QA:Validated. Please let us know if anything else is required from our end.
Agent upgrades can get stuck in an "updating" status from time to time. While we address this at the root cause level, it could be helpful to have a way to re-trigger an upgrade from the UI. We already support this from the API:
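For reference, a hedged sketch of what that API call looks like, using Kibana's Fleet agent upgrade endpoint with the force flag (the agent id, version, and API key below are placeholders, not values from this issue):

```ts
// Sketch: re-trigger an upgrade for a single agent through Kibana's Fleet API,
// using force: true to bypass the "already updating" check.
async function forceUpgradeAgent(
  kibanaUrl: string,
  agentId: string,
  version: string,
  apiKey: string
): Promise<void> {
  const response = await fetch(`${kibanaUrl}/api/fleet/agents/${agentId}/upgrade`, {
    method: 'POST',
    headers: {
      'kbn-xsrf': 'true', // required header for Kibana HTTP APIs
      'Content-Type': 'application/json',
      Authorization: `ApiKey ${apiKey}`,
    },
    body: JSON.stringify({ version, force: true }),
  });
  if (!response.ok) {
    throw new Error(`Force upgrade failed: ${response.status} ${await response.text()}`);
  }
}
```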