[Fleet] Enforce a 10 minute cool down for attempts to upgrade an agent #168233

Closed
cmacknz opened this issue Oct 6, 2023 · 6 comments · Fixed by #168606

@cmacknz
Member

cmacknz commented Oct 6, 2023

Summarizing my comment from: #168171 (comment)

As part of an upgrade, the agent starts a process called the upgrade watcher, which supervises the new version of the agent for 10 minutes after the upgrade. If another upgrade is attempted within this 10-minute window, the upgrade watcher will interpret the restart that happens during the second upgrade as a crash and try to roll back the agent version. A rollback that occurs while another upgrade is in progress can have unpredictable results; the worst outcome is a broken agent installation. This has been observed in the agent integration tests when upgrades occur too frequently.
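
To make the failure mode concrete, here is a toy model of the watcher window. It is only a sketch: the real upgrade watcher lives in Elastic Agent and is not shown in this thread, so `WATCH_WINDOW_MS`, `WatcherVerdict`, and `watcherVerdict` are hypothetical names invented for illustration.

```typescript
// Toy model, not the Elastic Agent implementation.
// 10-minute supervision window, as described in the issue text.
const WATCH_WINDOW_MS = 10 * 60 * 1000;

type WatcherVerdict = "ignored" | "rollback";

// Returns how a watcher started at `watchStartMs` would treat an agent
// restart observed at `restartMs` (both in epoch milliseconds).
function watcherVerdict(watchStartMs: number, restartMs: number): WatcherVerdict {
  const elapsed = restartMs - watchStartMs;
  // A restart inside the window looks like a crash of the new version.
  const stillWatching = elapsed >= 0 && elapsed < WATCH_WINDOW_MS;
  return stillWatching ? "rollback" : "ignored";
}
```

Under this model, a second upgrade that restarts the agent 5 minutes after the first one would trigger a rollback, while one that starts after the window has closed would not.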

The safest thing to do is to forbid another upgrade attempt until 10 minutes have elapsed from the previous upgrade attempt.

I don't think we can release #135539 without this change, but the bug exists even without the force upgrade UI. The force upgrade UI just makes it more likely that a user will accidentally instruct an agent to upgrade while the watcher process is still running.

Once the detailed upgrade state reporting is implemented, the cool down can be removed in favour of waiting for the agent to report that it has finished the upgrade monitoring state, as described in #168171.

@cmacknz cmacknz added the Team:Fleet Team label for Observability Data Collection Fleet team label Oct 6, 2023
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@juliaElastic
Contributor

juliaElastic commented Oct 6, 2023

Does this mean in practice that Fleet shouldn't allow another upgrade for 10 mins after the upgraded_at field value?

@cmacknz
Member Author

cmacknz commented Oct 6, 2023

Yes, since that is the point at which the new agent version has checked in with Fleet for the first time (if I am recalling correctly how this works).
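
A minimal sketch of that rule, assuming the agent record exposes `upgraded_at` as an ISO 8601 string. Only the field name comes from this thread; the `AgentLike` shape, `UPGRADE_COOLDOWN_MS`, and `isInUpgradeCooldown` are hypothetical and not taken from the Fleet codebase.

```typescript
// 10-minute cool down, matching the watcher window discussed above.
const UPGRADE_COOLDOWN_MS = 10 * 60 * 1000;

// Hypothetical agent shape; only the `upgraded_at` field name is from the thread.
interface AgentLike {
  upgraded_at?: string | null; // ISO 8601 timestamp of the last upgrade
}

// True when fewer than 10 minutes have elapsed since `upgraded_at`.
function isInUpgradeCooldown(agent: AgentLike, now: Date = new Date()): boolean {
  if (!agent.upgraded_at) return false; // never upgraded: no cool down
  const upgradedAtMs = Date.parse(agent.upgraded_at);
  if (Number.isNaN(upgradedAtMs)) return false; // unparseable timestamp: treated as absent (an assumption)
  return now.getTime() - upgradedAtMs < UPGRADE_COOLDOWN_MS;
}
```

Passing `now` as a parameter keeps the check deterministic in tests; a caller would normally rely on the default.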

Here's the bug on the agent side for context: elastic/elastic-agent#2706

CC @jlind23 we should get this into 8.11

@juliaElastic
Contributor

Do we want to enforce the 10m period even for a force upgrade?

@cmacknz
Member Author

cmacknz commented Oct 6, 2023

Yes, it is not safe to attempt another upgrade until 10 minutes have elapsed from the previous upgrade regardless of the source of the upgrade request.

@jlind23
Contributor

jlind23 commented Oct 6, 2023

Just added it to next sprint as a P0 cc @kpollich

@jillguyonnet jillguyonnet self-assigned this Oct 9, 2023
kpollich added a commit that referenced this issue Oct 18, 2023
## Summary

Closes #168233

This PR adds a check based on the `agent.upgraded_at` field and the time
at which an upgrade request is issued. If the request is issued sooner than
10 minutes after the last upgrade, it is rejected, even if `force: true` is
passed:
- `POST agents/{agentId}/upgrade` will fail with 400
- agents included in `POST agents/bulk_upgrade` will not be upgraded
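
The two behaviours listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: `AgentRecord`, `UpgradeOptions`, `recentlyUpgraded`, `upgradeOne`, and `selectForBulkUpgrade` are hypothetical names, and the real handlers do considerably more work.

```typescript
const COOLDOWN_MS = 10 * 60 * 1000; // 10-minute cool down

// Hypothetical shapes; only `upgraded_at` and `force` appear in the PR summary.
interface AgentRecord {
  id: string;
  upgraded_at?: string | null;
}
interface UpgradeOptions {
  force?: boolean;
}

function recentlyUpgraded(agent: AgentRecord, nowMs: number): boolean {
  if (!agent.upgraded_at) return false;
  const upgradedAtMs = Date.parse(agent.upgraded_at);
  return !Number.isNaN(upgradedAtMs) && nowMs - upgradedAtMs < COOLDOWN_MS;
}

// Single-agent path: a request inside the cool down is rejected with 400.
// `_options.force` is deliberately never consulted, so force cannot bypass it.
function upgradeOne(
  agent: AgentRecord,
  _options: UpgradeOptions,
  nowMs: number
): { status: number; message?: string } {
  if (recentlyUpgraded(agent, nowMs)) {
    return { status: 400, message: `agent ${agent.id} was upgraded less than 10 minutes ago` };
  }
  return { status: 200 };
}

// Bulk path: agents inside the cool down are left out of the upgrade
// rather than failing the whole request.
function selectForBulkUpgrade(
  agents: AgentRecord[],
  _options: UpgradeOptions,
  nowMs: number
): AgentRecord[] {
  return agents.filter((agent) => !recentlyUpgraded(agent, nowMs));
}
```

The difference in shape is the point: the single-agent call has one subject and can fail outright, whereas the bulk call filters and proceeds with the remaining agents.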

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Kyle Pollich <kyle.pollich@elastic.co>
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Oct 18, 2023
(cherry picked from commit 4fffedd)
kibanamachine referenced this issue Oct 18, 2023
…169295)

# Backport

This will backport the following commits from `main` to `8.11`:
- [[Fleet] Enforce 10 min cooldown for agent upgrade
(#168606)](#168606)

<!--- Backport version: 8.9.7 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)


Co-authored-by: Jill Guyonnet <jill.guyonnet@elastic.co>