
Multiple quick upgrades while the agent is still in the grace period break the agent installation #2706

Closed
pchila opened this issue May 17, 2023 · 11 comments
Labels
bug (Something isn't working), Team:Elastic-Agent (Label for the Agent team)

Comments

@pchila
Member

pchila commented May 17, 2023


For confirmed bugs, please report:

  • Version: 8.9.0-SNAPSHOT
  • Operating System: Linux
  • Discuss Forum URL:
  • Steps to Reproduce:
    • Upgrade/downgrade agent using CLI
  • Immediately after the upgrade (or before the 10-minute grace period has elapsed), initiate another upgrade using the CLI
  • in my specific case I performed 8.9.0-SNAPSHOT -> 8.8.0-SNAPSHOT -> 8.9.0-SNAPSHOT and ended up with the following (the running agent, 0af676, had been cleaned up by the upgrade process):
    root@pchila-elastic:/opt/Elastic/Agent# ll
    total 1016
    drwxr-x--- 4 root root   4096 May 17 12:25 ./
    drwxr-xr-x 3 root root   4096 May 17 12:13 ../
    -rw-r----- 1 root root     41 May 17 12:13 .build_hash.txt
    drwxr-x--- 4 root root   4096 May 17 12:26 data/
    lrwxrwxrwx 1 root root     58 May 17 12:25 elastic-agent -> /opt/Elastic/Agent/data/elastic-agent-0af676/elastic-agent
    -rw-r----- 1 root root      6 May 17 12:25 .elastic-agent.active.commit
    -rw-r----- 1 root root  10026 May 17 12:13 elastic-agent.reference.yml
    -rw------- 1 root root  10486 May 17 12:13 elastic-agent.yml
    -rw------- 1 root root    274 May 17 12:13 fleet.enc
    -rw------- 1 root root      0 May 17 12:13 fleet.enc.lock
    -rw-r--r-- 1 root root      0 May 17 12:13 .installed
    -rw-r----- 1 root root  13675 May 17 12:13 LICENSE.txt
    -rw-r----- 1 root root 964376 May 17 12:13 NOTICE.txt
    -rw-r----- 1 root root    309 May 17 12:13 README.md
    drwxr-x--- 2 root root   4096 May 17 12:13 vault/
    -rw------- 1 root root      0 May 17 12:16 watcher.lock
    root@pchila-elastic:/opt/Elastic/Agent# ll data/
    total 16
    drwxr-x--- 4 root root 4096 May 17 12:26 ./
    drwxr-x--- 4 root root 4096 May 17 12:25 ../
    -rw------- 1 root root    0 May 17 12:13 agent.lock
    drwxr-xr-x 5 root root 4096 May 17 12:25 elastic-agent-8ecdff/
    drwxr-x--- 2 root root 4096 May 17 12:25 tmp/
    root@pchila-elastic:/opt/Elastic/Agent# ll data/elastic-agent-8ecdff/
    total 49152
    drwxr-xr-x 5 root root     4096 May 17 12:25 ./
    drwxr-x--- 4 root root     4096 May 17 12:26 ../
    drwxr-xr-x 8 root root     4096 May 17 12:16 components/
    -rwxr-xr-x 1 root root 50306176 May 17 12:15 elastic-agent*
    drwx------ 2 root root     4096 May 17 12:25 logs/
    drwxr-x--- 6 root root     4096 May 17 12:16 run/
    root@pchila-elastic:/opt/Elastic/Agent#  data/elastic-agent-8ecdff/elastic-agent version
    WARN: the running daemon of Elastic Agent does not match this version.
    Binary: 8.8.0-SNAPSHOT (build: 8ecdffd297715597b1c2aace8fb7ec039fa2528f at 2023-05-04 12:01:12 +0000 UTC)
    Daemon: 8.9.0-SNAPSHOT (build: 0af676d2c10ff5b0d5e2446270786f4718bc8e19 at 2023-05-16 10:17:47 +0000 UTC)
    root@pchila-elastic:/opt/Elastic/Agent# data/elastic-agent-8ecdff/elastic-agent uninstall
    Elastic Agent is installed but currently broken: service exists but installation path is missing
    Continuing will uninstall the broken Elastic Agent at /opt/Elastic/Agent. Do you want to continue? [Y/n]:Y
    Elastic Agent has been uninstalled.
    

Agent should refuse an upgrade if a previous upgrade is still ongoing (this includes the grace period, during which the agent is still monitored by the watcher)

@pchila added the bug and Team:Elastic-Agent labels May 17, 2023
@jlind23
Contributor

jlind23 commented May 17, 2023

@pchila I agree this is a bug, but I'm not sure we are going to see many consecutive upgrades. What do you think?

@jlind23
Contributor

jlind23 commented May 17, 2023

We discussed this with some folks today, and the best solution appears to be for the Agent to refuse to upgrade until the grace period is over.

@pchila
Member Author

pchila commented May 18, 2023

@jlind23 Yes, that is the easiest and safest fix. I'll update the issue description.

@blakerouse
Contributor

This is somewhat related to #3371, but there are two possible solutions:

@cmacknz
Member

cmacknz commented Sep 11, 2023

Kill the running watcher and spawn a new one (killing the watcher would be the same approach that #3371 must perform)

This is my preference, because the watcher is an implementation detail that users shouldn't need to know about or account for. If they want to upgrade again immediately, they should be able to.

@ycombinator
Contributor

ycombinator commented Sep 13, 2023

Discussed this issue in the weekly meeting. To summarize, here's how we want to solve it:

  • For standalone: return an error if the user requests an upgrade while a previous upgrade is in progress
  • For Fleet-managed: eventually perform the second upgrade without needing user intervention

I will re-work #3399 to implement the above solution.

@ycombinator
Contributor

ycombinator commented Sep 29, 2023

For Fleet-managed: eventually perform the second upgrade without needing user intervention

As things stand, neither the Fleet UI nor the Fleet API will allow the user to initiate the second upgrade if the first one is still deemed to be in progress. The UI grays out the "Upgrade agent" link and the API returns a 400 Bad Request response saying the agent is not upgradeable.

I'm not sure yet how Fleet decides that an upgrade is in progress. However, I have noticed that Fleet considers an upgrade no longer in progress even while the Upgrade Watcher from that upgrade is still running. In other words, Fleet today will allow a user to request a second upgrade even while the Upgrade Watcher from the first upgrade is still running. What happens in this case is that the second upgrade's Upgrade Watcher never runs, because the lock file, watcher.lock, created by the first upgrade's Upgrade Watcher still exists. So, in effect, we end up with the first upgrade's Upgrade Watcher monitoring the second upgrade, which is not ideal.

When the Agent receives the UPGRADE action from Fleet for the second upgrade request, it should check whether the Upgrade Watcher is still running. If it is, it should somehow enqueue this upgrade request and dequeue and process it once the Upgrade Watcher has finished running. Depending on how Fleet decides whether an upgrade is in progress, the user might see the second upgrade as in progress for a while, while the first upgrade's Upgrade Watcher finishes running and the second upgrade request is dequeued and processed within Agent.

This approach certainly achieves the goal of the user not having to intervene to make that second upgrade happen.

One thing I don't like about this approach, however, is that Fleet doesn't consider the Upgrade Watcher step part of the upgrade process today. Going forward, though, Agent will report each upgrade step to Fleet, and one of the steps it will report is UPG_WATCHING, while the Upgrade Watcher is still running. In that future world, would we allow Fleet users to request the second upgrade while Agent is reporting UPG_WATCHING from the first upgrade?

  • If yes, then we can go ahead with the queueing approach mentioned above.
  • If no, then we should leave things as-is until we have implemented reporting each upgrade step to Fleet. And at that time, we should ensure that Fleet does not allow the second upgrade to be requested until the Agent has gone through all the states of the first upgrade (either successfully or unsuccessfully).

Personally, I think we should go with the second option ("if no, ...") because:

  • it's more accurate in terms of what an upgrade cycle looks like and that's reflected completely in Fleet,
  • it will match the behavior of users attempting the second upgrade for standalone agent ([Standalone Agent] Disallow upgrade if upgrade is already in progress #3473) by not allowing that upgrade while the Upgrade Watcher from the first upgrade is still running, and
  • it avoids adding complexity to the Agent: enqueueing the second upgrade request, detecting that the Upgrade Watcher from the first upgrade has finished running, and then dequeuing and processing the second upgrade request.

WDYT @cmacknz?

@blakerouse
Contributor

I agree with @ycombinator on option 2.

@cmacknz
Member

cmacknz commented Oct 3, 2023

Agreed, option 2 is the better path.

@ycombinator create an issue in https://github.com/elastic/kibana/issues for the Fleet team to forbid upgrading based on the agent's reported upgrade states as suggested.

@ycombinator
Contributor

Agreed, option 2 is the better path.

@ycombinator create an issue in https://github.com/elastic/kibana/issues for the Fleet team to forbid upgrading based on the agent's reported upgrade states as suggested.

elastic/kibana#168171

@blakerouse
Contributor

This was fixed in #3473
