
[Fleet] Allow users to force running an upgrade again from the UI #135539

Closed
joshdover opened this issue Jun 30, 2022 · 25 comments · Fixed by #166154
Labels: Project:FleetScaling, QA:Validated, Team:Fleet

Comments

@joshdover (Contributor)

Agent upgrades can get stuck in an "updating" status from time to time. While we address this at the root cause level, it could be helpful to have a way to re-trigger an upgrade from the UI. We already support this from the API:

POST kbn:/api/fleet/agents/<agent id>/upgrade
{ "force": true, "version": "8.3.0" }

[Screenshot 2022-06-30 at 13:32:49]

@joshdover added the Team:Fleet and Project:FleetScaling labels on Jun 30, 2022
@elasticmachine (Contributor)

Pinging @elastic/fleet (Team:Fleet)

@joshdover (Contributor, Author)

I think we should consider this for 8.6. Upgrades can fail for a number of reasons and right now there's no way to force a clean state from the UI. We should at least allow users to do this after the original upgrade action has expired.

@joshdover (Contributor, Author) commented Sep 5, 2022

We should short-circuit this a bit. If the agent is already on the requested version (based on the local_metadata field in .fleet-agents) and the force upgrade API is called, we should probably just update the upgraded_at and upgrade_started_at fields instead of running a full upgrade again.

Otherwise there is a risk that the agent will try to install over itself, which can cause problems.
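
A minimal sketch of what that short-circuit could look like at the document level - assuming direct access to the restricted .fleet-agents index (e.g. via the fleet_superuser setup shown in the verification steps later in this thread), with illustrative field values:

```
# Mark the upgrade as complete instead of re-running it: set upgraded_at
# and clear upgrade_started_at when local_metadata already reports the
# requested version.
curl -sk -XPOST --user fleet_superuser:password -H 'content-type:application/json' \
  -H 'x-elastic-product-origin:fleet' \
  http://localhost:9200/.fleet-agents/_update/<agent id> -d '
  {
    "doc": {
      "upgraded_at": "2022-09-05T00:00:00Z",
      "upgrade_started_at": null
    }
  }'
```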

@joshdover (Contributor, Author)

@nimarezainia Could you help us define exactly how the UX should work here?

@nimarezainia (Contributor)

@joshdover (sorry for the delay)

I think we should follow the pattern we already have for the "Unenroll" command, where the user is asked whether they want to force unenroll. I realize the action taken is very different, but in this case we have an "Upgrade" action - an extra dialog could ask whether it should be forced, e.g. an extra checkbox in the following dialog box:

[Image: upgrade dialog box]

@dborodyansky & @kpollich thoughts?

@dborodyansky (Contributor)

Is there a time duration after which we can infer that a problem may have occurred? If so, we may consider displaying an alert and a restart action at an appropriate point in time, as follows. (copy needs review)

[Image: mockup of an alert with a restart action]

Otherwise, we may present a restart action anytime an upgrade is in progress. Might this cause potentially unnecessary restarts?

[Image: mockup of a restart action shown while an upgrade is in progress]

@kpollich (Member)

> Is there a time duration after which we can infer that a problem may have occurred?

This might be something we can infer from telemetry around upgrades, but I'm not sure how to get that data myself.

> Otherwise, we may present a restart action anytime an upgrade is in progress. Might this cause potentially unnecessary restarts?

I'm a little wary about providing the "restart" option at all times, but I don't think we'd have technical issues with back-to-back restarts. Josh's comment above about short-circuiting the upgrade path mitigates the risks around force upgrading, as I understand it.

@nimarezainia (Contributor)

@joshdover @cmacknz @kpollich I wanted to resurface this issue and see if we can go down the path of self-healing - without user intervention.

  1. Once we have more granular telemetry around the upgrades, would we be in a better position to detect a "stuck" upgrade?
  2. Could we then have a retry mechanism in place that would force the upgrade (at least maybe once), to correct without user intervention?

@joshdover (Contributor, Author)

We introduced retries locally on the agent in 8.6.0: elastic/elastic-agent#778

I'm curious how many of the bugs we're seeing now happen after upgrading to 8.6.0 and then attempting to upgrade to a newer version. It's possible this has already solved many of the cases, except those where a retry won't fix the problem (such as no disk space for the binary download).

These are the cases where I think this issue comes into play. But I think we first need to move the agent into a different state that allows the user to diagnose the issue and determine what to do next before retrying the upgrade.

@michaelmagyar

Better update logic would be very welcome. We still have many agents that fail to upgrade and it's hard to diagnose the reason (it's not disk space or network issues).

In the meantime, please do implement a UI option to force upgrades, preferably with the ability to select many agents. We've been asking for that feature for at least a year to address this issue.

I just spent a few hours re-upgrading our fleet after moving from 8.6.2 to 8.7.1. The UI upgrade failed for almost all agents, leaving them stuck as "Upgrading", so I had to grab each stuck agent's ID and call the upgrade API with "force": true to unblock each one. The fun part? Even though I specified "version": "8.7.1", most of the agents didn't actually upgrade and stayed on 8.6.2 (I waited an hour+). At least they went back to healthy. I then retried upgrading the agents from the UI, and again many were unsuccessful and are stuck "Upgrading". I may end up uninstalling all agents and reinstalling them fresh, because that seems to be the only solution.

Managing agents is a consistent pain point. If a force upgrade UI option isn't in the next version, I will spend a few hours building a front-end script to effectively "select all upgrading; force upgrade" (see the sketch below). I may even build scripts to completely uninstall and reinstall the agent. This eats up so much of our time and unnecessarily adds to the TCO.
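
A minimal sketch of that "select all upgrading; force upgrade" script, assuming the existing bulk_upgrade endpoint and that its agents field accepts a KQL query (a local Kibana and dev credentials are also assumed):

```
# Force-upgrade every agent currently in updating state in one call.
curl -sk -XPOST --user elastic:changeme \
  -H 'content-type: application/json' -H 'kbn-xsrf: fleet' \
  http://localhost:5601/api/fleet/agents/bulk_upgrade -d '
  {
    "agents": "status:updating",
    "version": "8.7.1",
    "force": true
  }'
```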

@joshdover (Contributor, Author)

@nimarezainia @zombieFox We need a decision on the UI here before work can begin. There are two suggestions above, and I'm not sure we've decided which to do. I also think we probably need a way to do this in bulk from the agent list view. This also overlaps with the more granular upgrade status reporting.

I prefer this option as it's more obvious, but we need guidance on how the bulk upgrade modal should change too.

[Image]

@nimarezainia (Contributor)

Discussing this in the UX issue. I also agree that selecting agents after a filter is applied (like status:updating) and then performing the action in bulk would be the most common workflow.

@juliaElastic self-assigned this on Sep 6, 2023
@juliaElastic (Contributor) commented Sep 11, 2023

The definition of this feature is not conclusive. What do you think of implementing it this way:

  1. For single agent, add the warning icon and the Restart upgrade callout shown here if the agent started upgrading more than 1 day (?) ago.
  2. For bulk actions, add a new action Restart upgrade that is only enabled if all selected agents are in updating state, regardless of when the update started.
     • The Restart upgrade action could replace the Upgrade agents action in this case.
  3. For consistency, we should add the Restart upgrade action to the single agent action list, in which case 1. might be redundant.
  4. In case of bulk action, when more than one page is selected (actioning all agents that match a query), it's not trivial to determine whether all of them are in Updating state. Is this feature in scope for this issue? I think in most cases there will be a few agents stuck in updating; otherwise there is a workaround to call the API with the force flag manually.
  5. Do we want to change anything in the Upgrade agent modal for Restart upgrade? We can change the title to Restart upgrade agent.

[Image]

@nimarezainia @zombieFox WDYT?

@cmacknz (Member) commented Sep 11, 2023

Just chiming in to raise elastic/elastic-agent#2706 as something we'll either need to implement before this or account for in this implementation. Today, if you implemented this with no changes to the agent, you could only safely attempt a force upgrade once every 10 minutes.

@nimarezainia (Contributor) commented Sep 12, 2023

> The definition of this feature is not conclusive. What do you think of implementing it this way:

> 1. For single agent, add the warning icon and the `Restart upgrade` callout [shown here](https://github.com/elastic/kibana/issues/135539#issuecomment-1687594884) if the agent started upgrading more than 1 day (?) ago.

I think this would be great, although why more than 1 day? There are issues with a quick succession of upgrades, so I guess we do need to throttle this.

> 2. For bulk actions, add a new action `Restart upgrade` that is only enabled if all selected agents are in updating state, regardless of when the update started.

Could it be enabled if there's at least one agent in the set that is in "updating" state, with the action sent only to the agents in Updating (agents in other states don't get the action)? I suspect you can't do this because it's a bulk action.
What would happen if this action was sent to a Healthy agent (or any agent in a state other than Updating) - will the action be ignored?

> • The `Restart upgrade` action could replace the `Upgrade agents` action in this case.

Is this because of screen real estate? I'm happy for it to be greyed out rather than fully removed.

> 3. For consistency, we should add the `Restart upgrade` action to the single agent action list, in which case 1. might be redundant.

Good point. Yes.

> 4. In case of bulk action, when more than one page is selected (actioning all agents that match a query), it's not trivial to determine whether all of them are in `Updating` state. Is this feature in scope for this issue? I think in most cases there will be a few agents stuck in updating; otherwise there is a workaround to call the API with the force flag manually.

Would what I suggested above address this? Before sending the action, are we able to tell what state the agent is in? I suspect not, because this is a bulk action. Perhaps a no-op on the agent if it's already in a state other than Updating?

> 5. Do we want to change anything in the `Upgrade agent` modal for `Restart upgrade`? We can change the title to `Restart upgrade agent`.

I'd say "Restart upgrade" is pretty good.


@juliaElastic (Contributor) commented Sep 12, 2023

> I think this would be great, although why more than 1 day? There are issues with a quick succession of upgrades, so I guess we do need to throttle this.

I took a guess that 1 day would mean the agent is stuck in updating; we can choose any other timeframe if we think it works better. What about 1 hour, or at least 10 minutes?

> Could it be enabled if there's at least one agent in the set that is in "updating" state, with the action sent only to the agents in Updating (agents in other states don't get the action)? I suspect you can't do this because it's a bulk action.
> What would happen if this action was sent to a Healthy agent (or any agent in a state other than Updating) - will the action be ignored?

I think it would be confusing to only send the action to updating agents; in the UI it would look like a bug that not all agents received the action. If we send the action to healthy agents too, it might have unintended consequences, since this is a force upgrade action - e.g. using a lower version on an already upgraded agent might cause a downgrade.

> Would what I suggested above address this? Before sending the action, are we able to tell what state the agent is in? I suspect not, because this is a bulk action. Perhaps a no-op on the agent if it's already in a state other than Updating?

Not really; for the query-based actions, we need another API call to determine whether all agents are in updating state. We can do this, I just wanted to highlight that it increases the effort.

So to summarize, I think we should only allow bulk restart upgrade if all selected agents are in updating status. We can suggest filtering down to Updating status first to trigger the restart.

Also, given the agent limitation, we should check in the bulk action that all agents started upgrading at least X amount of time ago; the action would fail for updating agents that started their upgrade more recently. An illustration of that check follows.
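
For illustration, agents stuck past the window could be found with a range query on upgrade_started_at - a sketch assuming a 2-hour window and direct access to the restricted .fleet-agents index (as in the fleet_superuser setup used for verification later in this thread):

```
# Find active agents whose upgrade started more than 2 hours ago
# (assumes upgrade_started_at is cleared when an upgrade completes).
curl -sk -XPOST --user fleet_superuser:password -H 'content-type:application/json' \
  -H 'x-elastic-product-origin:fleet' \
  http://localhost:9200/.fleet-agents/_search -d '
  {
    "query": {
      "bool": {
        "filter": [
          { "term": { "active": true } },
          { "range": { "upgrade_started_at": { "lt": "now-2h" } } }
        ]
      }
    }
  }'
```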

@juliaElastic (Contributor) commented Sep 12, 2023

This is work in progress; this is how the callout would look:

[Image: callout mockup]

@cmacknz (Member) commented Sep 12, 2023

> What about 1 hour, or at least 10 minutes?

Whatever time interval is picked should be longer than the configured download timeout, which currently defaults to 2 hours.

https://github.com/elastic/elastic-agent/blob/0a19f365f4a6c1014074cbe79752f63bd0312f92/internal/pkg/agent/application/upgrade/artifact/config.go#L158-L161

Any value less than this will result in agents being falsely marked as stuck in updating.

@nimarezainia (Contributor)

> So to summarize, I think we should only allow bulk restart upgrade if all selected agents are in updating status. We can suggest filtering down to Updating status first to trigger the restart.

@juliaElastic OK, thanks - let's do this. But please let's also put plenty of information in the flyout so users are aware. For example, they may have selected a set of agents to apply this restart upgrade action to, but that set includes healthy agents too. Let's be informative and (ideally) let the user know how many agents in the list are healthy and therefore can't have the action applied. I'm happy to help review the copy.

@nimarezainia (Contributor)

> What about 1 hour, or at least 10 minutes?
>
> Whatever time interval is picked should be longer than the configured download timeout, which currently defaults to 2 hours.
>
> https://github.com/elastic/elastic-agent/blob/0a19f365f4a6c1014074cbe79752f63bd0312f92/internal/pkg/agent/application/upgrade/artifact/config.go#L158-L161
>
> Any value less than this will result in agents being falsely marked as stuck in updating.

How are we going to track the download window defined by this timeout? Say the user selects 100 agents to apply this to; their individual upgrade processes started at different times (based on our rolling upgrades), so these agents could have very different lengths of time left on their timeouts.

@cmacknz I'd imagine that at the moment Fleet doesn't know whether an agent is stuck in updating due to a failed download. If it was in another state of the upgrade process, we could perhaps apply this force upgrade action.

Instead of building complexity into Fleet to know which agent should be sent an action and when, wouldn't it be more scalable to have logic in the agent to ignore actions that are not applicable? In this scenario, an agent stuck in upgrading because of download issues would just reject the force update action.

Does this make sense?
Do we need the upgrade telemetry changes before we can do this?

@cmacknz (Member) commented Sep 13, 2023

> How are we going to track the download window defined by this timeout? Say the user selects 100 agents to apply this to; their individual upgrade processes started at different times (based on our rolling upgrades), so these agents could have very different lengths of time left on their timeouts.

Correct. It may not actually be worth trying to track in Fleet whether an agent is stuck in updating. With the changes in https://github.com/elastic/ingest-dev/issues/1621, the agent should just tell Fleet exactly what is happening with the upgrade, and Fleet won't need to infer or calculate anything.

> wouldn't it be more scalable to have logic in the agent to ignore actions that are not applicable?

In general, the most scalable thing to do is to stop doing work as early as possible. In this situation, it would be better for Fleet to realize that an agent in the bulk upgrade is ineligible to be upgraded and not generate the upgrade action at all, removing the work of delivering and processing it from both Fleet Server and the agent. As an analogy: sending this to the agent would be like someone (the agent) making a phone call just to tell you they weren't going to call you back.

@juliaElastic (Contributor) commented Sep 13, 2023

We can check the upgrade_started_at field for all agents in the bulk upgrade logic in the Kibana Fleet API and prevent sending the action to agents that are not stuck in updating (i.e. those that started less than 2 hours ago).

EDIT: after thinking about this more, it's going to be tricky on the API side to decide whether a call is a restart upgrade, as the API just receives a force upgrade call. So unless we want to introduce a new API/field for this scenario (which I'm not keen on), it would be better to determine on the UI side whether agents are stuck updating in a bulk action.

What we could do is enable the Restart upgrade action on any agent selection, and then when the modal opens, query for agents stuck in updating, display that info to the users, and only action those agents. This would make it easier for them, without having to filter on updating first.

Something like this: they selected 20 agents, 2 are stuck in updating, so the modal confirms that 2 will be upgraded:
[Image]

The first working version of this is available in the PR: #166154

> Correct. It may not actually be worth trying to track in Fleet whether an agent is stuck in updating. With the changes in https://github.com/elastic/ingest-dev/issues/1621, the agent should just tell Fleet exactly what is happening with the upgrade, and Fleet won't need to infer or calculate anything.

We can add the Restart upgrade action now with the condition that the upgrade started at least 2 hours ago, and change the condition to the upgrade state being failed once that feature is ready. A sketch of the query the modal could run is below.

Regarding the copy, should we be more specific? Instead of "Agent has been updating for a while", we could say "Agent has been updating for more than 2 hours", and later replace it with "Agent upgrade is stuck in a failed state".
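
A sketch of that modal-time query using the existing agents list API, filtering to updating agents (the 2-hour check on upgrade_started_at is assumed to happen client-side on the returned documents):

```
# List agents currently in updating state; the modal can then keep only
# those whose upgrade_started_at is older than 2 hours.
curl -sk --user elastic:changeme \
  "http://localhost:5601/api/fleet/agents?kuery=status:updating"
```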

juliaElastic added a commit that referenced this issue Sep 18, 2023
## Summary

Ready to review, tests are WIP

Resolves #135539
 
For agents in `Updating` state, adding a `Restart upgrade` action, which
uses the force flag on the API to bypass the updating check. For a single
agent, the action is only added if the agent started upgrading more than
2 hours ago (see discussion in the
[issue](#135539 (comment))).
For bulk selection, the action appears if all selected agents are in
updating state.

To verify:
- Start local es, kibana, fleet server (can be docker)
- Enroll agents with horde
- Update agent docs with the queries below to simulate being stuck in
updating for more than 2 hours (modify the query to target specific agents)
- Bulk select agents and check that the `Restart upgrade` action is visible
and that it triggers another upgrade with the force flag only on the agents
stuck in updating
- When calling the bulk_upgrade API, the UI adds the updating condition
to the query, or includes only the agent IDs that are stuck in updating,
depending on query or manual selection

```
curl -sk -XPOST --user elastic:changeme -H 'content-type:application/json' \
http://localhost:9200/_security/role/fleet_superuser -d '
   {
      "indices": [
            {
               "names": [".fleet*",".kibana*"],
               "privileges": ["all"],
               "allow_restricted_indices": true
            }
      ]
   }'

curl -sk -XPOST --user elastic:changeme -H 'content-type:application/json' \
http://localhost:9200/_security/user/fleet_superuser -d '
   {
      "password": "password",
      "roles": ["superuser", "fleet_superuser"]
   }'

curl -sk -XPOST --user fleet_superuser:password -H 'content-type:application/json' \
-H 'x-elastic-product-origin:fleet' \
http://localhost:9200/.fleet-agents/_update_by_query -d '
   {
      "script": {
        "source": "ctx._source.upgrade_started_at = \"2023-09-13T11:26:23Z\"",
        "lang": "painless"
      },
      "query": {
        "exists": {
          "field": "tags"
        }
      }
   }'
```


Agent details action:

<img width="1440" alt="image"
src="https://github.com/elastic/kibana/assets/90178898/3704e781-337e-4175-b6d2-a99375b6cc24">

Agent list, bulk action:

<img width="1482" alt="image"
src="https://github.com/elastic/kibana/assets/90178898/74f46861-393e-4a86-ab1f-c21e8d325e89">

Agent list, single agent action:

<img width="369" alt="image"
src="https://github.com/elastic/kibana/assets/90178898/2aa087a1-b6e1-4e44-b1db-34592f3df959">

Agent details callout:
<img width="771" alt="image"
src="https://github.com/elastic/kibana/assets/90178898/d1584fe6-0c98-4033-8527-27235812c004">

Select all agents on first page, restart upgrade modal shows only those
that are stuck in upgrading:
<img width="1317" alt="image"
src="https://github.com/elastic/kibana/assets/90178898/fe858815-4393-4007-abe6-e26e4a884bd3">

Select all agents on all pages, restart upgrade modal shows only those
that are stuck upgrading, with an extra api call to get the count:
<img width="2443" alt="image"
src="https://github.com/elastic/kibana/assets/90178898/481b6ab5-fc87-4586-aa9b-432439234ac6">


### Checklist

- [x] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
@cmacknz (Member) commented Oct 6, 2023

Related #168171 (comment)

We likely need to forbid force upgrade attempts more frequently than once every 10 minutes.

@amolnater-qasource

Hi Team,

We have executed 04 test cases under the Feature test run for the 8.11.0 release at the link:

Status:

  • PASS: 04

Build details:
VERSION: 8.11.0 BC4
BUILD: 68044
COMMIT: caa3613
Artifact Link: https://staging.elastic.co/8.11.0-cb971279/summary-8.11.0.html

As the testing is completed on this feature, we are marking this as QA:Validated.

Please let us know if anything else is required from our end.
Thanks

@amolnater-qasource added the QA:Validated label and removed the QA:Needs Validation label on Oct 26, 2023