[Fleet] Upgrade details telemetry #173356
Pinging @elastic/fleet (Team:Fleet)
},
{
  field: 'upgrade_details.metadata.error_msg.keyword',
  missing: '',
This ensures that documents where `error_msg` is missing are still included in the aggregation, with an empty string used in place of the missing value: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-multi-terms-aggregation.html#_missing_value_3
@@ -135,6 +141,22 @@ export const getAgentData = async (
      ],
    },
  },
  upgrade_details: {
    multi_terms: {
@juliaElastic this will return only the first 10 buckets, is that what we want?
good catch! probably not, I'll increase this.
increased to 1000
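For readers following the thread, the aggregation under review can be sketched roughly as below. This is a minimal sketch, not the code from the PR: `buildUpgradeDetailsAgg` is a hypothetical helper, and the first two field names are assumptions (only `upgrade_details.metadata.error_msg.keyword` appears in the diff). It shows the two points raised in review: an explicit `size`, because `multi_terms` returns 10 buckets by default, and `missing: ''` so that documents without an error message still land in a bucket.

```typescript
// Hypothetical shapes for illustration; not taken from the Kibana source.
interface MultiTermsSource {
  field: string;
  missing?: string;
}

interface UpgradeDetailsAgg {
  upgrade_details: {
    multi_terms: {
      terms: MultiTermsSource[];
      size: number;
    };
  };
}

// Builds the aggregation body. The default size of 1000 mirrors the value
// mentioned in the review thread.
function buildUpgradeDetailsAgg(size: number = 1000): UpgradeDetailsAgg {
  return {
    upgrade_details: {
      multi_terms: {
        terms: [
          // Assumed field names: only the error_msg field is visible in the diff.
          { field: 'upgrade_details.target_version.keyword' },
          { field: 'upgrade_details.state' },
          // `missing` keeps documents that have no error message in the result.
          { field: 'upgrade_details.metadata.error_msg.keyword', missing: '' },
        ],
        size,
      },
    },
  };
}
```

Passing the size as a parameter keeps the bucket limit visible at the call site rather than buried in the query body.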
Yes, I think so, because this would tell us how many Agents of a certain version failed upgrading at a certain step/state because of a certain error, so it will give us a fairly specific idea of where to focus any efforts. But I will defer to @jlind23 @cmacknz @nimarezainia to help decide what's useful in terms of telemetry.
I would say yes, this will help us decide whether or not an issue is worth investigating based on the number of impacted Elastic Agents.
Thanks, added a
code LGTM 🚀
💚 Build Succeeded
💔 All backports failed
Relates elastic#162448

Added upgrade details telemetry, publishing to `fleet-agents index` in telemetry cluster, each bucket as separate documents.

Implemented by doing a `multi_terms` aggregation to group the same `target_version, state, error_msg` values together.

Do we also want to include the agent count in each bucket in the telemetry event? @jlind23 @ycombinator

Note: since this task runs every hour, it will most likely capture the `UPG_FAILED` states, since the other (success) states are temporarily on the agent docs, and removed if the upgrade is successful.

E.g. 2 docs like the below become one telemetry event

```
// .fleet-agents
upgrade_details: {
  target_version: '8.12.0',
  state: 'UPG_FAILED',
  metadata: {
    error_msg: 'Download failed',
  },
},

// telemetry event
{
  target_version: '8.12.0',
  state: 'UPG_FAILED',
  error_msg: 'Download failed',
}
```

To verify:
- start kibana 8.13-SNAPSHOT locally
- set an invalid agent download source in Fleet Settings
- enroll an agent version 8.12-SNAPSHOT
- upgrade to 8.13-SNAPSHOT with the API
```
POST kbn:/api/fleet/agents/<agent_id>/upgrade
{
  "version": "8.13.0-SNAPSHOT",
  "force": true
}
```
- wait 15m so that the upgrade goes to failed state
- wait up to 1h for the telemetry task to run (speed up locally by setting a shorter interval in FleetUsageSender in kibana)
- verify in debug logs:
```
[2023-12-14T14:26:28.832+01:00][DEBUG][plugins.fleet] Agents upgrade details telemetry: [{"target_version":"8.13.0-SNAPSHOT","state":"UPG_FAILED","error_msg":"failed download of agent binary: unable to download package: 3 errors occurred:\n\t* package '/Library/Elastic/Agent/data/elastic-agent-f383c6/downloads/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz' not found: open /Library/Elastic/Agent/data/elastic-agent-f383c6/downloads/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz: no such file or directory\n\t* call to 'https://artifacts.elastic.co/downloads/dummy/beats/elastic-agent/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz' returned unsuccessful status code: 404\n\t* call to 'https://artifacts.elastic.co/downloads/dummy/beats/elastic-agent/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz' returned unsuccessful status code: 404\n\n"}]
```

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
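The "2 docs become one telemetry event" behaviour described in the PR description can be illustrated with a small in-memory sketch. Note that this is not how the PR does it: the PR groups documents server-side with a `multi_terms` aggregation, whereas the code below mimics the same grouping over an array of plain objects. The names `AgentDoc`, `UpgradeDetailsEvent`, and `toUpgradeDetailsEvents` are hypothetical and exist only for this example.

```typescript
// Hypothetical shapes modelled on the example documents in the description.
interface AgentDoc {
  upgrade_details?: {
    target_version: string;
    state: string;
    metadata?: { error_msg?: string };
  };
}

interface UpgradeDetailsEvent {
  target_version: string;
  state: string;
  error_msg: string;
}

// Collapses agent docs that share (target_version, state, error_msg) into a
// single event, in first-seen order. A missing error_msg becomes an empty
// string, matching the `missing: ''` setting discussed in the review.
function toUpgradeDetailsEvents(docs: AgentDoc[]): UpgradeDetailsEvent[] {
  const byKey = new Map<string, UpgradeDetailsEvent>();
  for (const doc of docs) {
    const details = doc.upgrade_details;
    if (!details) continue; // agents without upgrade details produce no event
    const event: UpgradeDetailsEvent = {
      target_version: details.target_version,
      state: details.state,
      error_msg: details.metadata?.error_msg ?? '',
    };
    // JSON-encode the tuple so that separators inside values cannot collide.
    const key = JSON.stringify([event.target_version, event.state, event.error_msg]);
    if (!byKey.has(key)) byKey.set(key, event);
  }
  return Array.from(byKey.values());
}
```

If the agent count per bucket were wanted, as asked in the description, the map value could carry a counter that is incremented on each repeated key.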