
Multiple quick upgrades while the agent is still in the grace period break the agent installation #2706

Closed
pchila opened this issue May 17, 2023 · 11 comments
Labels
bug (Something isn't working), Team:Elastic-Agent (Label for the Agent team)

Comments

@pchila
Member

pchila commented May 17, 2023


For confirmed bugs, please report:

  • Version: 8.9.0-SNAPSHOT
  • Operating System: Linux
  • Discuss Forum URL:
  • Steps to Reproduce:
    • Upgrade/downgrade agent using CLI
  • Immediately after the upgrade (or before the 10-minute grace period has elapsed), initiate another upgrade using the CLI
  • in my specific case I performed 8.9.0-SNAPSHOT -> 8.8.0-SNAPSHOT -> 8.9.0-SNAPSHOT and ended up with the following (the running agent, 0af676, had been cleaned up by the upgrade process):
    root@pchila-elastic:/opt/Elastic/Agent# ll
    total 1016
    drwxr-x--- 4 root root   4096 May 17 12:25 ./
    drwxr-xr-x 3 root root   4096 May 17 12:13 ../
    -rw-r----- 1 root root     41 May 17 12:13 .build_hash.txt
    drwxr-x--- 4 root root   4096 May 17 12:26 data/
    lrwxrwxrwx 1 root root     58 May 17 12:25 elastic-agent -> /opt/Elastic/Agent/data/elastic-agent-0af676/elastic-agent
    -rw-r----- 1 root root      6 May 17 12:25 .elastic-agent.active.commit
    -rw-r----- 1 root root  10026 May 17 12:13 elastic-agent.reference.yml
    -rw------- 1 root root  10486 May 17 12:13 elastic-agent.yml
    -rw------- 1 root root    274 May 17 12:13 fleet.enc
    -rw------- 1 root root      0 May 17 12:13 fleet.enc.lock
    -rw-r--r-- 1 root root      0 May 17 12:13 .installed
    -rw-r----- 1 root root  13675 May 17 12:13 LICENSE.txt
    -rw-r----- 1 root root 964376 May 17 12:13 NOTICE.txt
    -rw-r----- 1 root root    309 May 17 12:13 README.md
    drwxr-x--- 2 root root   4096 May 17 12:13 vault/
    -rw------- 1 root root      0 May 17 12:16 watcher.lock
    root@pchila-elastic:/opt/Elastic/Agent# ll data/
    total 16
    drwxr-x--- 4 root root 4096 May 17 12:26 ./
    drwxr-x--- 4 root root 4096 May 17 12:25 ../
    -rw------- 1 root root    0 May 17 12:13 agent.lock
    drwxr-xr-x 5 root root 4096 May 17 12:25 elastic-agent-8ecdff/
    drwxr-x--- 2 root root 4096 May 17 12:25 tmp/
    root@pchila-elastic:/opt/Elastic/Agent# ll data/elastic-agent-8ecdff/
    total 49152
    drwxr-xr-x 5 root root     4096 May 17 12:25 ./
    drwxr-x--- 4 root root     4096 May 17 12:26 ../
    drwxr-xr-x 8 root root     4096 May 17 12:16 components/
    -rwxr-xr-x 1 root root 50306176 May 17 12:15 elastic-agent*
    drwx------ 2 root root     4096 May 17 12:25 logs/
    drwxr-x--- 6 root root     4096 May 17 12:16 run/
    root@pchila-elastic:/opt/Elastic/Agent#  data/elastic-agent-8ecdff/elastic-agent version
    WARN: the running daemon of Elastic Agent does not match this version.
    Binary: 8.8.0-SNAPSHOT (build: 8ecdffd297715597b1c2aace8fb7ec039fa2528f at 2023-05-04 12:01:12 +0000 UTC)
    Daemon: 8.9.0-SNAPSHOT (build: 0af676d2c10ff5b0d5e2446270786f4718bc8e19 at 2023-05-16 10:17:47 +0000 UTC)
    root@pchila-elastic:/opt/Elastic/Agent# data/elastic-agent-8ecdff/elastic-agent uninstall
    Elastic Agent is installed but currently broken: service exists but installation path is missing
    Continuing will uninstall the broken Elastic Agent at /opt/Elastic/Agent. Do you want to continue? [Y/n]:Y
    Elastic Agent has been uninstalled.
    

Agent should refuse an upgrade if a previous upgrade is still ongoing (this includes the grace period, during which the agent is still monitored by the watcher)

@pchila added the bug and Team:Elastic-Agent labels May 17, 2023
@jlind23
Contributor

jlind23 commented May 17, 2023

@pchila I agree this is a bug, but I'm not sure we are going to see many consecutive upgrades. What do you think?

@jlind23
Contributor

jlind23 commented May 17, 2023

We discussed this with some folks today, and the best solution appears to be for the Agent to refuse to upgrade until the grace period is over.

@pchila
Member Author

pchila commented May 18, 2023

@jlind23 Yes, that is the easiest and safest fix. I'll update the issue description.

@blakerouse
Contributor

This is somewhat related to #3371, but there are two possible solutions:

@cmacknz
Member

cmacknz commented Sep 11, 2023

Kill the running watcher and spawn a new one (killing the watcher would be the same approach that #3371 must perform)

This is my preference, because the watcher is an implementation detail that users shouldn't need to know about or account for. If they want to upgrade again immediately, they should be able to.

@ycombinator
Contributor

ycombinator commented Sep 13, 2023

Discussed this issue in the weekly meeting. To summarize, here's how we want to solve it:

  • For standalone: return an error if the user requests an upgrade while a previous upgrade is in progress
  • For Fleet-managed: eventually perform the second upgrade without needing user intervention

I will re-work #3399 to implement the above solution.

@ycombinator
Contributor

ycombinator commented Sep 29, 2023

For Fleet-managed: eventually perform the second upgrade without needing user intervention

As things stand, neither the Fleet UI nor the Fleet API will allow the user to initiate the second upgrade if the first one is still deemed to be in progress. The UI grays out the "Upgrade agent" link and the API returns a 400 Bad Request response saying the agent is not upgradeable.

I'm not sure yet how Fleet decides that an upgrade is in progress. However, I have noticed that Fleet considers an upgrade no longer in progress even while the Upgrade Watcher from that upgrade is still running. In other words, Fleet today will allow a user to request a second upgrade even while the Upgrade Watcher from the first upgrade is still running. What happens in this case is that the second upgrade's Upgrade Watcher never runs, because the lock file, watcher.lock, created by the first upgrade's Upgrade Watcher still exists. So, in effect, we end up with the first upgrade's Upgrade Watcher monitoring the second upgrade, which is not ideal.

When the Agent receives the UPGRADE action from Fleet for the second upgrade request, it should check whether the Upgrade Watcher is still running. If it is, it should somehow enqueue this upgrade request and dequeue and process it once the Upgrade Watcher has finished running. Depending on how Fleet decides whether an upgrade is in progress, the user might see the second upgrade as in progress for a while, while the first upgrade's Upgrade Watcher finishes running and the second upgrade request is dequeued and processed within Agent.

This approach certainly achieves the goal of the user not having to intervene to make that second upgrade happen.

One thing I don't like about this approach, however, is that Fleet doesn't consider the Upgrade Watcher step part of the upgrade process today. Going forward, though, Agent will report each upgrade step to Fleet, and one of the steps it will report is UPG_WATCHING, while the Upgrade Watcher is still running. In that future world, would we allow Fleet users to request the second upgrade while Agent is reporting UPG_WATCHING from the first upgrade?

  • If yes, then we can go ahead with the queueing approach mentioned above.
  • If no, then we should leave things as-is until we have implemented reporting each upgrade step to Fleet. And at that time, we should ensure that Fleet does not allow the second upgrade to be requested until the Agent has gone through all the states of the first upgrade (either successfully or unsuccessfully).

Personally, I think we should go with the second option ("if no, ...") because:

  • it's more accurate in terms of what an upgrade cycle looks like and that's reflected completely in Fleet,
  • it will match the behavior of users attempting the second upgrade for standalone agent ([Standalone Agent] Disallow upgrade if upgrade is already in progress #3473) by not allowing that upgrade while the Upgrade Watcher from the first upgrade is still running, and
  • it avoids adding complexity to the Agent: enqueueing the second upgrade request, detecting that the Upgrade Watcher from the first upgrade has finished running, and then dequeuing and processing the second upgrade request.

WDYT @cmacknz?

@blakerouse
Contributor

I agree with @ycombinator on option 2.

@cmacknz
Member

cmacknz commented Oct 3, 2023

Agreed, option 2 is the better path.

@ycombinator create an issue in https://github.com/elastic/kibana/issues for the Fleet team to forbid upgrading based on the agent's reported upgrade states as suggested.

@ycombinator
Contributor

Agreed, option 2 is the better path.

@ycombinator create an issue in https://github.com/elastic/kibana/issues for the Fleet team to forbid upgrading based on the agent's reported upgrade states as suggested.

elastic/kibana#168171

@blakerouse
Contributor

This was fixed in #3473
