[Fleet][Telemetry] Report upgrade details telemetry #162448
Comments
Pinging @elastic/fleet (Team:Fleet)
@ycombinator how is this one different from: https://github.com/elastic/ingest-dev/issues/1937 ?
@jlind23 this is the telemetry piece of the upgrade improvements.
We should backport this to 8.11 if feasible.
The upgrade details are only available in 8.12, so backporting to 8.11 won't do anything :)
When this is completed I assume we are backporting to 8.12, since we were previously going to backport to 8.11?
That is correct.
@juliaElastic Could you please let me know when you start working on this? I'll be happy to start building dashboards based on that.
@jlind23 I'm getting started today.
@ycombinator when the …
BTW, note that we recently added two more optional fields to the upgrade details metadata object. Let me know if you want a separate Kibana issue to surface these in the UI.
Thanks. Is there a way to simulate an upgrade failure with `upgrade_details` now? I'm not sure if it's possible since the `upgrade_details` are available from agent 8.12.
Yes please.
Yes, it should be possible. Essentially you'll need to run an …
Done: #173370
Thanks, I am seeing upgrade_details like this. How long does it take to go to the failed state? It seems to retry for a long time (started it 13m ago).
In case of download failures, since download is a retryable step, upgrade details won't go into the `UPG_FAILED` state while the download is still being retried.

Note that Upgrade Details are communicated to Fleet via the check-in API. The way this works is: Agent sends a check-in API request to Fleet Server, then waits for Fleet Server's response, which could take up to 5 minutes. After receiving that response, Agent waits for 1 second and then sends the next check-in API request, and so on. So, in the worst case, new upgrade details could take 5 minutes + 1 second to be sent from Agent to Fleet.

BTW, it's VERY strange that you are seeing a download attempt number that is so high (3753!).
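For reference, a quick way to inspect the upgrade details Fleet has recorded for an agent is the Fleet agents API (a sketch in the dev-console syntax used elsewhere in this thread; on an 8.12+ stack the returned agent document includes `upgrade_details`):

```
GET kbn:/api/fleet/agents/<agent_id>
```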
## Summary

Relates #162448

Added upgrade details telemetry, publishing to the `fleet-agents` index in the telemetry cluster, each bucket as a separate document. Implemented by doing a `multi_terms` aggregation to group the same `target_version, state, error_msg` values together.

Do we also want to include the agent count in each bucket in the telemetry event? @jlind23 @ycombinator

Note: since this task runs every hour, it will most likely capture the `UPG_FAILED` states, since the other (success) states are only temporarily on the agent docs and are removed if the upgrade is successful.

E.g. 2 docs like the below become one telemetry event:

```
// .fleet-agents
upgrade_details: {
  target_version: '8.12.0',
  state: 'UPG_FAILED',
  metadata: {
    error_msg: 'Download failed',
  },
},

// telemetry event
{
  target_version: '8.12.0',
  state: 'UPG_FAILED',
  error_msg: 'Download failed',
}
```

To verify:

- start kibana 8.13-SNAPSHOT locally
- set an invalid agent download source in Fleet Settings
- enroll an agent version 8.12-SNAPSHOT
- upgrade to 8.13-SNAPSHOT with the API:

```
POST kbn:/api/fleet/agents/<agent_id>/upgrade
{
  "version": "8.13.0-SNAPSHOT",
  "force": true
}
```

- wait 15m so that the upgrade goes to the failed state
- wait up to 1h for the telemetry task to run (speed up locally by setting a shorter interval in FleetUsageSender in kibana)
- verify in debug logs:

```
[2023-12-14T14:26:28.832+01:00][DEBUG][plugins.fleet] Agents upgrade details telemetry: [{"target_version":"8.13.0-SNAPSHOT","state":"UPG_FAILED","error_msg":"failed download of agent binary: unable to download package: 3 errors occurred:\n\t* package '/Library/Elastic/Agent/data/elastic-agent-f383c6/downloads/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz' not found: open /Library/Elastic/Agent/data/elastic-agent-f383c6/downloads/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz: no such file or directory\n\t* call to 'https://artifacts.elastic.co/downloads/dummy/beats/elastic-agent/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz' returned unsuccessful status code: 404\n\t* call to 'https://artifacts.elastic.co/downloads/dummy/beats/elastic-agent/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz' returned unsuccessful status code: 404\n\n"}]
```

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
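For context, the bucketing described above maps onto an Elasticsearch `multi_terms` aggregation along these lines (a minimal sketch against `.fleet-agents`, not the actual Kibana implementation; it assumes the three fields have keyword mappings):

```
GET .fleet-agents/_search
{
  "size": 0,
  "aggs": {
    "upgrade_details": {
      "multi_terms": {
        "terms": [
          { "field": "upgrade_details.target_version" },
          { "field": "upgrade_details.state" },
          { "field": "upgrade_details.metadata.error_msg" }
        ]
      }
    }
  }
}
```

Each resulting bucket carries a `doc_count`, which is the agent count the question above refers to.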
I got a merge conflict on backporting to 8.12, because the presets telemetry was not backported. Is that intentional? #172838
@juliaElastic I really don't think it was intentional, as the presets are included in 8.12.
Describe the feature:

Future versions of Elastic Agent will communicate details about an ongoing upgrade to Fleet Server. Fleet Server will persist these details, as and when it receives them from Elastic Agent, into that Agent's document in the `.fleet-agents` index under a new `upgrade_details` object field. (Related: elastic/elasticsearch#97912 and https://github.com/elastic/ingest-dev/issues/2213.)

We wish to report telemetry on certain information captured in this `upgrade_details` field (see the Details section below for the structure of this field), specifically on:

- the version being upgraded to (`upgrade_details.target_version`),
- the state of the upgrade (`upgrade_details.state`), and
- any error message reported (`upgrade_details.metadata.error_msg`).

Describe a specific use case for the feature:
Understanding exactly where the Agent upgrade process gets stuck or fails and why.
Details
The proposed structure of the upgrade details field is:
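A minimal sketch of that structure, reconstructed from the fields referenced in this issue (not the authoritative schema; per the comments above, the `metadata` object also carries additional optional fields not shown here):

```
// sketch: upgrade_details object on an agent document in .fleet-agents
upgrade_details: {
  target_version: '8.12.0',        // version the agent is upgrading to
  state: 'UPG_FAILED',             // one of the UPG_* values listed below
  metadata: {
    error_msg: 'Download failed',  // set when the upgrade fails
  },
},
```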
Where `upgrade_details.state` is expected to hold one of the following values:

- `UPG_REQUESTED`
- `UPG_SCHEDULED`
- `UPG_DOWNLOADING`
- `UPG_EXTRACTING`
- `UPG_REPLACING`
- `UPG_RESTARTING`
- `UPG_WATCHING`
- `UPG_ROLLBACK`
- `UPG_FAILED`
Here is a fairly comprehensive list of errors (contents of the `upgrade_details.metadata.error_msg` field) that could be reported during an ongoing upgrade:
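For illustration, two `error_msg` values quoted earlier in this thread (the short form used in the example telemetry event, and the beginning of the longer download failure captured in the debug logs):

```
Download failed

failed download of agent binary: unable to download package: 3 errors occurred: ...
```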