[Fleet][Telemetry] Report upgrade details telemetry #162448

ycombinator · 2023-07-25T00:41:01Z

Describe the feature:

Future versions of Elastic Agent will communicate details about an ongoing upgrade to Fleet Server. Fleet Server will persist these details, as and when it receives them from Elastic Agent, into that Agent's document in the .fleet-agents index under a new upgrade_details object field. (Related: elastic/elasticsearch#97912 and https://github.com/elastic/ingest-dev/issues/2213).

We wish to report telemetry on certain information captured in this upgrade_details field (see the Details section below for the structure of this field), specifically on:

The target version that Agent was being upgraded to (upgrade_details.target_version).
The last upgrade state of the Agent (upgrade_details.state).
Any error that occurred during the upgrade process (upgrade_details.metadata.error_msg).

Describe a specific use case for the feature:

Understanding exactly where the Agent upgrade process gets stuck or fails and why.

Details

The proposed structure of the upgrade details field is:

{
  "upgrade_details": { // new field; present when upgrade is in progress
    "target_version": "8.12.0", // version being upgraded to; always present
    "action_id": "xxxxxxxx", // ID of the UPGRADE action
    "state": "UPG_*",
    "metadata": {
      "scheduled_at": "2023-08-09T10:11:12Z", // when state == "UPG_SCHEDULED"
      "download_percent": 16.4, // when state == "UPG_DOWNLOADING"
      "failed_state": "UPG_*", // when state == "UPG_FAILED"
      "error_msg": "" // when state == "UPG_FAILED"
    }
  }
}

Where upgrade_details.state is expected to hold one of the following values:

State	Meaning
`UPG_REQUESTED`	Upgrade requested by user
`UPG_SCHEDULED`	Upgrade scheduled for <date/time>
`UPG_DOWNLOADING`	Downloading new Agent artifact version
`UPG_EXTRACTING`	Extracting new Agent artifact version
`UPG_REPLACING`	Replacing old Agent artifact version with new one
`UPG_RESTARTING`	Starting new Agent version
`UPG_WATCHING`	Monitoring new Agent version
`UPG_ROLLBACK`	Upgrade unsuccessful; rolling back to Agent version
`UPG_FAILED`	Upgrade failed due to error from state
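
For readers consuming this field from TypeScript, the structure and state list above could be modeled roughly as follows. This is a sketch based only on the proposal in this issue: the names `UPGRADE_STATES`, `UpgradeState`, `UpgradeDetails`, and `isUpgradeState` are hypothetical, and treating `metadata` and its members as optional is an assumption rather than something the proposal states.

```typescript
// Hypothetical model of the proposed upgrade_details field (not taken from any Elastic codebase).
const UPGRADE_STATES = [
  'UPG_REQUESTED',
  'UPG_SCHEDULED',
  'UPG_DOWNLOADING',
  'UPG_EXTRACTING',
  'UPG_REPLACING',
  'UPG_RESTARTING',
  'UPG_WATCHING',
  'UPG_ROLLBACK',
  'UPG_FAILED',
] as const;

type UpgradeState = (typeof UPGRADE_STATES)[number];

interface UpgradeDetails {
  target_version: string; // version being upgraded to; always present
  action_id: string; // ID of the UPGRADE action
  state: UpgradeState;
  // Assumption: metadata and each of its members may be absent.
  metadata?: {
    scheduled_at?: string; // when state is UPG_SCHEDULED
    download_percent?: number; // when state is UPG_DOWNLOADING
    failed_state?: UpgradeState; // when state is UPG_FAILED
    error_msg?: string; // when state is UPG_FAILED
  };
}

// Narrows an arbitrary string to one of the known states.
function isUpgradeState(value: string): value is UpgradeState {
  return (UPGRADE_STATES as readonly string[]).indexOf(value) !== -1;
}
```

A guard like `isUpgradeState` lets callers reject unexpected state strings before using them as bucket keys.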

Here is a fairly comprehensive list of errors (contents of the upgrade_details.metadata.error_msg field) that could be reported during an ongoing upgrade:

Agent is not upgradeable
Upgrade is already in progress
Unable to read downloads directory
Unable to clean downloads directory before update
Failed to initiate fetcher
Failed to create downloads directory
Download failed
Failed to initiate verifier
Failed to verify agent binary
Failed to unpack archive
Empty agent hash
Failed to copy action store
Failed to copy run directory
Failed to change symlink
Failed to start watcher
Unable to clean downloads directory after update
Failed to load marker
Failed to acquire lock
Agent error detected
Agent crash detected
Rollback failed
Failed to create installation marker


elasticmachine · 2023-07-25T00:41:03Z

Pinging @elastic/fleet (Team:Fleet)

jlind23 · 2023-07-25T05:59:38Z

@ycombinator how is this one different from: https://github.com/elastic/ingest-dev/issues/1937 ?
cc @juliaElastic

juliaElastic · 2023-07-25T07:45:47Z

@jlind23 this is the telemetry piece of the upgrade improvements

joshdover · 2023-11-02T15:26:42Z

We should backport this to 8.11 if feasible

cmacknz · 2023-11-22T22:21:40Z

> We should backport this to 8.11 if feasible

The upgrade details are only available in 8.12, so backporting to 8.11 won't do anything :)

cmacknz · 2023-12-12T14:41:24Z

When this is completed I assume we are backporting to 8.12 since we were previously going to backport to 8.11?

jlind23 · 2023-12-12T14:42:24Z

> When this is completed I assume we are backporting to 8.12 since we were previously going to backport to 8.11?

That is correct.

jlind23 · 2023-12-13T14:15:24Z

@juliaElastic Could you please let me know when you start working on this? I'll be happy to start building dashboards based on that.

juliaElastic · 2023-12-13T14:24:52Z

@jlind23 I'm getting started today.

juliaElastic · 2023-12-14T11:26:43Z

@ycombinator when the error_msg is not there, is it set to an empty string or completely missing? If missing, I have to adjust the es aggregation.

ycombinator · 2023-12-14T11:31:43Z

> @ycombinator when the error_msg is not there, is it set to an empty string or completely missing? If missing, I have to adjust the es aggregation.

Completely missing.

BTW, note that we recently added two more optional fields to the upgrade details metadata object. Let me know if you want a separate Kibana issue to surface these in the UI.
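
Because `error_msg` is absent rather than empty, an aggregation keyed on it needs a fallback value, since Elasticsearch's `multi_terms` aggregation ignores documents that lack any of the key fields unless a `missing` value is given for that term. The sketch below only builds a request body and sends nothing. The helper name, the aggregation name `upgrade_details`, the placeholder `'none'`, and the bucket `size` are illustrative assumptions, and the field paths assume keyword-like mappings; this is not the aggregation Kibana ships.

```typescript
// Placeholder substituted for documents without error_msg (assumed value).
const MISSING_ERROR_MSG = 'none';

// Builds a search body that groups agent docs by target_version, state, and error_msg.
function buildUpgradeDetailsAggregation(size: number = 10) {
  return {
    size: 0, // only the buckets are needed, not the hits
    aggs: {
      upgrade_details: {
        multi_terms: {
          size,
          terms: [
            { field: 'upgrade_details.target_version' },
            { field: 'upgrade_details.state' },
            // Without `missing`, docs lacking error_msg would be dropped from the buckets.
            { field: 'upgrade_details.metadata.error_msg', missing: MISSING_ERROR_MSG },
          ],
        },
      },
    },
  };
}
```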

juliaElastic · 2023-12-14T11:43:08Z

> Completely missing.

Thanks. Is there a way to simulate an upgrade failure with upgrade_details now? I'm not sure if it's possible since the upgrade_details are available from agent 8.12.

> BTW, note that we recently added two more optional fields to the upgrade details metadata object. Let me know if you want a separate Kibana issue to surface these in the UI.

Yes please.

ycombinator · 2023-12-14T11:58:00Z

> Thanks. Is there a way to simulate an upgrade failure with upgrade_details now? I'm not sure if it's possible since the upgrade_details are available from agent 8.12.

Yes, it should be possible. Essentially you'll need to run an 8.13.0-SNAPSHOT build of Kibana, then enroll an 8.12.0-SNAPSHOT build of Agent (so it has the Upgrade Details logic in it), then set the downloads URL in the Fleet settings to something bogus, then attempt an upgrade of that 8.12.0 agent to 8.13.0-SNAPSHOT.

> > BTW, note that we recently added two more optional fields to the upgrade details metadata object. Let me know if you want a separate Kibana issue to surface these in the UI.
>
> Yes please.

Done: #173370

juliaElastic · 2023-12-14T12:57:54Z

Thanks, I am seeing upgrade_details like this. How long does it take to go to failed state? It seems to retry for a long time (started it 13m ago).


elastic_agent
[elastic_agent][warn] download attempt 3753 failed: unable to download package: 3 errors occurred:
	* package '/Library/Elastic/Agent/data/elastic-agent-f383c6/downloads/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz' not found: open /Library/Elastic/Agent/data/elastic-agent-f383c6/downloads/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz: no such file or directory
	* call to 'https://artifacts.elastic.co/downloads/dummy/beats/elastic-agent/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz' returned unsuccessful status code: 404
	* call to 'https://artifacts.elastic.co/downloads/dummy/beats/elastic-agent/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz' returned unsuccessful status code: 404

; retrying in 0s.
13:56:51.892
elastic_agent
[elastic_agent][info] updated upgrade details
13:56:51.892
elastic_agent
[elastic_agent][info] download attempt 3754

ycombinator · 2023-12-14T13:13:01Z

In case of download failures, since download is a retryable step, upgrade details won't go into the UPG_FAILED state. They will remain in the UPG_DOWNLOADING state, but metadata.retry_error and metadata.retry_until will get set.

Note that Upgrade Details are communicated to Fleet via the check-in API. The way this works is Agent will send a check-in API request to Fleet Server, then wait for Fleet Server's response, which could take up to 5 minutes. After receiving that response, Agent waits for 1 second and then sends the next check-in API request, and so on. So, in the worst case, new upgrade details could take 5 minutes + 1 second to be sent from Agent to Fleet.

BTW, it's VERY strange that you are seeing a download attempt number that is so high (3753)! The "retrying in 0s" also doesn't look good. It seems we are retrying very aggressively, with no backoff at all! Thanks for catching this; I've filed elastic/elastic-agent#3915.
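
For contrast with the zero-delay retries in the log, a retry loop would normally space its attempts out. The following is a generic capped exponential backoff sketch, not the Agent's actual retry logic (Agent is written in Go, and its fix is tracked in the issue filed above). The function name, the 1 second base delay, and the 60 second cap are hypothetical.

```typescript
// Returns the delay before the given 1-based attempt: base, 2x base, 4x base, ... up to a cap.
function backoffDelayMs(attempt: number, baseMs: number = 1000, maxMs: number = 60000): number {
  const exponent = Math.max(0, attempt - 1);
  // Math.pow overflows to Infinity for very large attempts; Math.min still returns the cap.
  return Math.min(maxMs, baseMs * Math.pow(2, exponent));
}
```

With these assumed defaults, the delay reaches the cap by the seventh attempt and stays there, so the attempt count in a 13 minute window would stay in the low tens rather than the thousands.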

@jlind23

## Summary

Relates #162448

Added upgrade details telemetry, publishing to `fleet-agents index` in telemetry cluster, each bucket as separate documents. Implemented by doing a `multi_terms` aggregation to group the same `target_version, state, error_msg` values together.

Do we also want to include the agent count in each bucket in the telemetry event? @jlind23 @ycombinator

Note: since this task runs every hour, it will most likely capture the `UPG_FAILED` states, since the other (success) states are temporarily on the agent docs, and removed if the upgrade is successful.

E.g. 2 docs like the below become one telemetry event

```
// .fleet-agents
upgrade_details: {
  target_version: '8.12.0',
  state: 'UPG_FAILED',
  metadata: {
    error_msg: 'Download failed',
  },
},

// telemetry event
{
  target_version: '8.12.0',
  state: 'UPG_FAILED',
  error_msg: 'Download failed',
}
```

To verify:
- start kibana 8.13-SNAPSHOT locally
- set an invalid agent download source in Fleet Settings
- enroll an agent version 8.12-SNAPSHOT
- upgrade to 8.13-SNAPSHOT with the API

```
POST kbn:/api/fleet/agents/<agent_id>/upgrade
{
  "version": "8.13.0-SNAPSHOT",
  "force": true
}
```

- wait 15m so that the upgrade goes to failed state
- wait up to 1h for the telemetry task to run (speed up locally by setting a shorter interval in FleetUsageSender in kibana)
- verify in debug logs:

```
[2023-12-14T14:26:28.832+01:00][DEBUG][plugins.fleet] Agents upgrade details telemetry: [{"target_version":"8.13.0-SNAPSHOT","state":"UPG_FAILED","error_msg":"failed download of agent binary: unable to download package: 3 errors occurred:\n\t* package '/Library/Elastic/Agent/data/elastic-agent-f383c6/downloads/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz' not found: open /Library/Elastic/Agent/data/elastic-agent-f383c6/downloads/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz: no such file or directory\n\t* call to 'https://artifacts.elastic.co/downloads/dummy/beats/elastic-agent/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz' returned unsuccessful status code: 404\n\t* call to 'https://artifacts.elastic.co/downloads/dummy/beats/elastic-agent/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz' returned unsuccessful status code: 404\n\n"}]
```

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
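
The grouping described in this PR summary can be imitated in memory to make the "2 docs become one telemetry event" behavior concrete. This sketch does not query Elasticsearch and is not the PR's code: `AgentDoc`, `UpgradeTelemetryEvent`, and `toTelemetryEvents` are hypothetical names, and it leaves out the per-bucket agent count that the summary raises as an open question.

```typescript
// Hypothetical shape of the relevant part of a .fleet-agents document.
interface AgentDoc {
  upgrade_details?: {
    target_version: string;
    state: string;
    metadata?: { error_msg?: string };
  };
}

// Hypothetical flattened telemetry event, one per distinct combination of values.
interface UpgradeTelemetryEvent {
  target_version: string;
  state: string;
  error_msg?: string;
}

function toTelemetryEvents(docs: AgentDoc[]): UpgradeTelemetryEvent[] {
  const buckets: Record<string, UpgradeTelemetryEvent> = {};
  for (const doc of docs) {
    const details = doc.upgrade_details;
    if (!details) continue; // no upgrade details recorded on this agent
    const errorMsg = details.metadata ? details.metadata.error_msg : undefined;
    // Composite key over the three grouped values; null stands in for a missing error_msg.
    const key = JSON.stringify([
      details.target_version,
      details.state,
      errorMsg === undefined ? null : errorMsg,
    ]);
    if (!(key in buckets)) {
      const event: UpgradeTelemetryEvent = {
        target_version: details.target_version,
        state: details.state,
      };
      if (errorMsg !== undefined) event.error_msg = errorMsg;
      buckets[key] = event;
    }
  }
  return Object.keys(buckets).map((key) => buckets[key]);
}
```

Two agents that both failed an upgrade to 8.12.0 with "Download failed" produce a single event here, while an agent still in another state produces its own event.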

juliaElastic · 2023-12-18T10:42:17Z

I got a merge conflict on backporting to 8.12, because the presets telemetry was not backported. Is that intentional? #172838


jlind23 · 2023-12-18T17:33:15Z

@juliaElastic I really don't think it was intentional as the presets are included in 8.12

ycombinator added the enhancement, Team:Fleet, and telemetry labels Jul 25, 2023

kpollich assigned juliaElastic Nov 21, 2023

juliaElastic mentioned this issue Dec 14, 2023

[Fleet] Upgrade details telemetry #173356

Merged

1 task

juliaElastic closed this as completed Dec 18, 2023

jlind23 mentioned this issue Jan 8, 2024

[Fleet][Telemetry] Report upgrade details telemetry Step 2 #174436

Open
