[Fleet] Enforce a 10 minute cool down for attempts to upgrade an agent #168233
Comments
Pinging @elastic/fleet (Team:Fleet)
Does this mean in practice that Fleet shouldn't allow another upgrade for 10 mins after the
Yes since that is the point at which the new agent version has checked in with Fleet for the first time (if I am recalling how this works correctly). Here's the bug on the agent side for context: elastic/elastic-agent#2706 CC @jlind23 we should get this into 8.11
Do we want to enforce the 10m period even for a force upgrade?
Yes, it is not safe to attempt another upgrade until 10 minutes have elapsed from the previous upgrade regardless of the source of the upgrade request.
Just added it to next sprint as a P0 cc @kpollich
## Summary

Closes #168233

This PR adds a check based on the `agent.upgraded_at` field and the time at which an upgrade request is issued. If the request is issued sooner than 10 minutes after the last upgrade, it is rejected, even if `force: true` is passed:

- `POST agents/{agentId}/upgrade` will fail with 400
- agents included in `POST agents/bulk_upgrade` will not be upgraded

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Kyle Pollich <kyle.pollich@elastic.co>
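To make the check in the PR summary concrete, here is a minimal sketch of how such a cooldown could be evaluated. It is not the code from the PR: `AgentLike`, `isInUpgradeCooldown`, `checkSingleUpgrade` and `filterBulkUpgrade` are hypothetical names, and only the `upgraded_at` field, the 10 minute window, the 400 response and the `force` behavior come from the description above.

```typescript
// Hypothetical sketch of the cooldown check; not Kibana's actual implementation.
const UPGRADE_COOLDOWN_MS = 10 * 60 * 1000; // 10 minutes, per the PR description

// Assumed minimal agent shape: only `upgraded_at` is taken from the PR text.
interface AgentLike {
  id: string;
  upgraded_at?: string | null; // ISO 8601 timestamp of the last upgrade
}

// True when the last upgrade happened less than 10 minutes before `now`.
function isInUpgradeCooldown(agent: AgentLike, now: Date = new Date()): boolean {
  if (!agent.upgraded_at) return false; // never upgraded: no cooldown applies
  const upgradedAt = Date.parse(agent.upgraded_at);
  if (Number.isNaN(upgradedAt)) return false; // unparseable timestamp: assumed to mean no cooldown
  return now.getTime() - upgradedAt < UPGRADE_COOLDOWN_MS;
}

// Single-agent path: reject with 400 during the cooldown, even when force is set.
function checkSingleUpgrade(
  agent: AgentLike,
  options: { force?: boolean },
  now: Date = new Date()
): { status: number; message?: string } {
  // `options.force` is deliberately ignored: the cooldown applies regardless.
  if (isInUpgradeCooldown(agent, now)) {
    return { status: 400, message: `agent ${agent.id} was upgraded less than 10 minutes ago` };
  }
  return { status: 200 };
}

// Bulk path: agents still in cooldown are left out instead of failing the request.
function filterBulkUpgrade(
  agents: AgentLike[],
  now: Date = new Date()
): { upgradable: AgentLike[]; skipped: AgentLike[] } {
  const upgradable: AgentLike[] = [];
  const skipped: AgentLike[] = [];
  for (const agent of agents) {
    (isInUpgradeCooldown(agent, now) ? skipped : upgradable).push(agent);
  }
  return { upgradable, skipped };
}
```

Passing `now` explicitly keeps the functions easy to test. In this sketch a request arriving exactly 10 minutes after `upgraded_at` is allowed, which matches the "sooner than 10 minutes" wording.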
Closes elastic#168233 (cherry picked from commit 4fffedd)
…169295)

# Backport

This will backport the following commits from `main` to `8.11`:
- [[Fleet] Enforce 10 min cooldown for agent upgrade (#168606)](#168606)

### Questions ?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Jill Guyonnet <jill.guyonnet@elastic.co>
Summarizing my comment from: #168171 (comment)
As part of an upgrade, the agent starts a process called the upgrade watcher that supervises the new version of the agent for 10 minutes after the upgrade. If another upgrade is attempted in this 10 minute window, the upgrade watcher will interpret the restart that happens during the second upgrade as a crash and try to roll back the agent version. A rollback that occurs while another upgrade is in progress can have unpredictable results; the worst outcome is a broken agent installation. This has been observed in the agent integration tests when upgrades occur too frequently.
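The failure mode described above can be illustrated with a toy model. This is not the watcher's real logic, which lives in elastic-agent: `classifyRestart`, `RestartVerdict` and `WATCH_WINDOW_MS` are made-up names, and the only facts taken from the description are the 10 minute window and the treatment of a restart inside it as a crash.

```typescript
// Toy model of the upgrade watcher's view of a restart; all names are hypothetical.
const WATCH_WINDOW_MS = 10 * 60 * 1000; // the watcher supervises for 10 minutes

type RestartVerdict = 'treated-as-crash' | 'unwatched';

// A restart inside the watch window looks like a crash to the watcher, even when
// it was caused by a second, intentional upgrade. That triggers a rollback.
function classifyRestart(upgradeStartedAtMs: number, restartAtMs: number): RestartVerdict {
  const elapsed = restartAtMs - upgradeStartedAtMs;
  return elapsed >= 0 && elapsed < WATCH_WINDOW_MS ? 'treated-as-crash' : 'unwatched';
}

// A second upgrade 4 minutes after the first restarts the agent inside the window,
// so the watcher would attempt a rollback while that upgrade is still in progress.
const secondUpgradeVerdict = classifyRestart(0, 4 * 60 * 1000);

// The same restart 11 minutes later falls outside the window and is not supervised.
const lateUpgradeVerdict = classifyRestart(0, 11 * 60 * 1000);
```

The model only shows why the timing matters: inside the window the watcher cannot tell an upgrade-induced restart from a crash.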
The safest thing to do is to forbid another upgrade attempt until 10 minutes have elapsed from the previous upgrade attempt.
I don't think we can release #135539 without this change, but the bug exists even without the force upgrade UI. The force upgrade UI just makes it more likely that a user will accidentally instruct an agent to upgrade while the watcher process is still running.
Once the detailed upgrade state reporting is implemented, the cool down can be removed in favour of waiting for the agent to report that it has finished the upgrade monitoring state, as described in #168171.