Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fleet] Enforce 10 min cooldown for agent upgrade #168606

Merged
merged 15 commits into from
Oct 18, 2023

Conversation

jillguyonnet
Copy link
Contributor

@jillguyonnet jillguyonnet commented Oct 11, 2023

Summary

Closes #168233

This PR adds a check based on the agent.upgraded_at field and the time a request to upgrade the issue. If the request is issued sooner than 10 minutes after the last upgrade, it is rejected, even if force: true is passed:

  • POST agents/{agentId}/upgrade will fail with 400
  • agents included in POST agents/bulk_upgrade will not be upgraded

Checklist

@jillguyonnet jillguyonnet added the Team:Fleet Team label for Observability Data Collection Fleet team label Oct 11, 2023
@apmmachine
Copy link
Contributor

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • /oblt-deploy-serverless : Deploy a serverless Kibana instance using the Observability test environments.
  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@jillguyonnet jillguyonnet self-assigned this Oct 11, 2023
@jillguyonnet
Copy link
Contributor Author

@elasticmachine merge upstream

@jillguyonnet
Copy link
Contributor Author

@elasticmachine merge upstream

@jillguyonnet jillguyonnet added release_note:skip Skip the PR/issue when compiling release notes v8.12.0 backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) labels Oct 12, 2023
@jillguyonnet jillguyonnet marked this pull request as ready for review October 12, 2023 08:01
@jillguyonnet jillguyonnet requested a review from a team as a code owner October 12, 2023 08:01
@elasticmachine
Copy link
Contributor

Pinging @elastic/fleet (Team:Fleet)

@jillguyonnet
Copy link
Contributor Author

Opening this for review as CI failures seem unrelated.

Can someone confirm whether backport:prev-minor is what we want here?

@jlind23
Copy link
Contributor

jlind23 commented Oct 12, 2023

@jillguyonnet Yes we should backport it to 8.11 as it is needed for #135539
cc @cmacknz @kpollich

}

const elaspedSinceUpgradeInMillis = Date.now() - Date.parse(agent.upgraded_at);
const upgradeCooldownInMin = 10;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: could you uppercase the name so it's clear that is a constant?

@@ -78,7 +78,8 @@ export async function upgradeBatch(
agentsToCheckUpgradeable.map(async (agent) => {
// Filter out agents currently unenrolling, unenrolled, or not upgradeable b/c of version check
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you also edit this comment to cover the new case?

Copy link
Contributor

@criamico criamico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚢

Copy link
Member

@kpollich kpollich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

return response.customError({
statusCode: 400,
body: {
message: `agent ${request.params.agentId} was upgraded less than 10 minutes ago`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error message should probably interpolate the UPGRADE_COOLDOWN_IN_MIN constant so we don't forget to update it if that constant ever changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Thanks for reviewing 👍

@jillguyonnet
Copy link
Contributor Author

@elasticmachine merge upstream

return response.customError({
statusCode: 400,
body: {
message: `agent ${request.params.agentId} was upgraded less than ${AGENT_UPGRADE_COOLDOWN_IN_MIN} minutes ago`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
message: `agent ${request.params.agentId} was upgraded less than ${AGENT_UPGRADE_COOLDOWN_IN_MIN} minutes ago`,
message: `agent ${request.params.agentId} was upgraded less than ${AGENT_UPGRADE_COOLDOWN_IN_MIN} minutes ago, confirming the upgrade will not roll back`,

I think we need to explain why we have this cooldown or it will just lead to questions. Possibly the explanation should link directly back to our documentation.

Would it also be possible to have the response contain how long the user needs to wait until they can try again? Otherwise they will just be spamming these requests until one of them goes through.

It could be more correct to return a 429 with a Retry-After header in the response: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After

Copy link
Member

@kpollich kpollich Oct 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some basic text for now to explain the cooldown.

image

I've added elastic/ingest-docs#605 to track the docs for this, and will create a follow-up issue once this PR is merged to capture the need to come back and add a docs link to this text.

I've also updated the API to return a 429 Too Many Requests and to include a time string in the format XXmYYs to indicate how long to wait for a retry as well as a retry-after header with the value in seconds.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also give you a gold medal #168606 (review)

@jillguyonnet
Copy link
Contributor Author

@elasticmachine merge upstream

@jillguyonnet
Copy link
Contributor Author

@elasticmachine merge upstream

@kpollich
Copy link
Member

@elasticmachine merge upstream

@kpollich
Copy link
Member

Looks like there's some FTR failures related to the cooldown change, as we have tests that upgrade an agent quickly. Will take a look soon.

@kpollich
Copy link
Member

@elasticmachine merge upstream

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
fleet 1.2MB 1.2MB +382.0B

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
fleet 147.3KB 147.6KB +265.0B

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @jillguyonnet

Copy link
Member

@kpollich kpollich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving latest changes that I made
image

@kpollich kpollich merged commit 4fffedd into elastic:main Oct 18, 2023
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Oct 18, 2023
## Summary

Closes elastic#168233

This PR adds a check based on the `agent.upgraded_at` field and the time
a request to upgrade the issue. If the request is issued sooner than 10
minutes after the last upgrade, it is rejected, even if `force: true` is
passed:
- `POST agents/{agentId}/upgrade` will fail with 400
- agents included in `POST agents/bulk_upgrade` will not be upgraded

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Kyle Pollich <kyle.pollich@elastic.co>
(cherry picked from commit 4fffedd)
@kibanamachine
Copy link
Contributor

💚 All backports created successfully

Status Branch Result
8.11

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Oct 18, 2023
…169295)

# Backport

This will backport the following commits from `main` to `8.11`:
- [[Fleet] Enforce 10 min cooldown for agent upgrade
(#168606)](#168606)

<!--- Backport version: 8.9.7 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Jill
Guyonnet","email":"jill.guyonnet@elastic.co"},"sourceCommit":{"committedDate":"2023-10-18T18:34:33Z","message":"[Fleet]
Enforce 10 min cooldown for agent upgrade (#168606)\n\n##
Summary\r\n\r\nCloses
https://github.com/elastic/kibana/issues/168233\r\n\r\nThis PR adds a
check based on the `agent.upgraded_at` field and the time\r\na request
to upgrade the issue. If the request is issued sooner than 10\r\nminutes
after the last upgrade, it is rejected, even if `force: true`
is\r\npassed:\r\n- `POST agents/{agentId}/upgrade` will fail with
400\r\n- agents included in `POST agents/bulk_upgrade` will not be
upgraded\r\n\r\n### Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios\r\n\r\n---------\r\n\r\nCo-authored-by: Kibana Machine
<42973632+kibanamachine@users.noreply.github.com>\r\nCo-authored-by:
Kyle Pollich
<kyle.pollich@elastic.co>","sha":"4fffedd4bb46922ef8355241318cf0db80e4c9f5","branchLabelMapping":{"^v8.12.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Team:Fleet","backport:prev-minor","v8.12.0"],"number":168606,"url":"https://github.com/elastic/kibana/pull/168606","mergeCommit":{"message":"[Fleet]
Enforce 10 min cooldown for agent upgrade (#168606)\n\n##
Summary\r\n\r\nCloses
https://github.com/elastic/kibana/issues/168233\r\n\r\nThis PR adds a
check based on the `agent.upgraded_at` field and the time\r\na request
to upgrade the issue. If the request is issued sooner than 10\r\nminutes
after the last upgrade, it is rejected, even if `force: true`
is\r\npassed:\r\n- `POST agents/{agentId}/upgrade` will fail with
400\r\n- agents included in `POST agents/bulk_upgrade` will not be
upgraded\r\n\r\n### Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios\r\n\r\n---------\r\n\r\nCo-authored-by: Kibana Machine
<42973632+kibanamachine@users.noreply.github.com>\r\nCo-authored-by:
Kyle Pollich
<kyle.pollich@elastic.co>","sha":"4fffedd4bb46922ef8355241318cf0db80e4c9f5"}},"sourceBranch":"main","suggestedTargetBranches":[],"targetPullRequestStates":[{"branch":"main","label":"v8.12.0","labelRegex":"^v8.12.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/168606","number":168606,"mergeCommit":{"message":"[Fleet]
Enforce 10 min cooldown for agent upgrade (#168606)\n\n##
Summary\r\n\r\nCloses
https://github.com/elastic/kibana/issues/168233\r\n\r\nThis PR adds a
check based on the `agent.upgraded_at` field and the time\r\na request
to upgrade the issue. If the request is issued sooner than 10\r\nminutes
after the last upgrade, it is rejected, even if `force: true`
is\r\npassed:\r\n- `POST agents/{agentId}/upgrade` will fail with
400\r\n- agents included in `POST agents/bulk_upgrade` will not be
upgraded\r\n\r\n### Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios\r\n\r\n---------\r\n\r\nCo-authored-by: Kibana Machine
<42973632+kibanamachine@users.noreply.github.com>\r\nCo-authored-by:
Kyle Pollich
<kyle.pollich@elastic.co>","sha":"4fffedd4bb46922ef8355241318cf0db80e4c9f5"}}]}]
BACKPORT-->

Co-authored-by: Jill Guyonnet <jill.guyonnet@elastic.co>
@jillguyonnet jillguyonnet deleted the fleet/agent-upgrade-cooldown branch October 23, 2023 10:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) release_note:skip Skip the PR/issue when compiling release notes Team:Fleet Team label for Observability Data Collection Fleet team v8.11.0 v8.12.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Fleet] Enforce a 10 minute cool down for attempts to upgrade an agent
9 participants