[Fleet][Telemetry] Report upgrade details telemetry #162448

Closed
ycombinator opened this issue Jul 25, 2023 · 17 comments
Labels
enhancement New value added to drive a business result Team:Fleet Team label for Observability Data Collection Fleet team telemetry Issues related to the addition of telemetry to a feature

Comments

@ycombinator
Contributor

Describe the feature:

Future versions of Elastic Agent will communicate details about an ongoing upgrade to Fleet Server. Fleet Server will persist these details, as and when it receives them from Elastic Agent, into that Agent's document in the .fleet-agents index under a new upgrade_details object field. (Related: elastic/elasticsearch#97912 and https://github.com/elastic/ingest-dev/issues/2213).

We wish to report telemetry on certain information captured in this upgrade_details field (see the Details section below for the structure of this field), specifically on:

  • The target version that Agent was being upgraded to (upgrade_details.target_version).
  • The last upgrade state of the Agent (upgrade_details.state).
  • Any error that occurred during the upgrade process (upgrade_details.metadata.error_msg).

Describe a specific use case for the feature:

Understanding exactly where the Agent upgrade process gets stuck or fails and why.

Details

The proposed structure of the upgrade details field is:

{
  "upgrade_details": { // new field; present when upgrade is in progress
    "target_version": "8.12.0", // version being upgraded to; always present
    "action_id": "xxxxxxxx", // ID of the UPGRADE action
    "state": "UPG_*",
    "metadata": {
      "scheduled_at": "2023-08-09T10:11:12Z", // when state == "UPG_SCHEDULED"
      "download_percent": 16.4, // when state == "UPG_DOWNLOADING"
      "failed_state": "UPG_*", // when state == "UPG_FAILED"
      "error_msg": "" // when state == "UPG_FAILED"
    }
  }
}
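
As a rough illustration of how the state-dependent metadata fields above might be consumed, here is a minimal TypeScript sketch. The interface and function names (`UpgradeDetails`, `summarizeUpgrade`) are hypothetical and not taken from Kibana or Fleet Server, and treating `action_id` and `metadata` as optional is an assumption.

```typescript
// Hypothetical types mirroring the proposed `upgrade_details` field.
// Optionality of `action_id` and `metadata` is an assumption.
interface UpgradeDetailsMetadata {
  scheduled_at?: string; // expected when state === "UPG_SCHEDULED"
  download_percent?: number; // expected when state === "UPG_DOWNLOADING"
  failed_state?: string; // expected when state === "UPG_FAILED"
  error_msg?: string; // expected when state === "UPG_FAILED"
}

interface UpgradeDetails {
  target_version: string; // always present per the proposal
  action_id?: string;
  state: string;
  metadata?: UpgradeDetailsMetadata;
}

// Builds a short human-readable summary, reading only the metadata
// fields that are meaningful for the current state.
function summarizeUpgrade(details: UpgradeDetails): string {
  const meta: UpgradeDetailsMetadata = details.metadata ?? {};
  switch (details.state) {
    case "UPG_SCHEDULED":
      return `scheduled for ${meta.scheduled_at ?? "unknown time"}`;
    case "UPG_DOWNLOADING":
      return `downloading ${details.target_version} (${meta.download_percent ?? 0}%)`;
    case "UPG_FAILED":
      return `failed in ${meta.failed_state ?? "unknown state"}: ${meta.error_msg ?? "no error message"}`;
    default:
      return `${details.state} (target ${details.target_version})`;
  }
}
```

The fallbacks (`"unknown time"`, `"no error message"`) are only there so the sketch tolerates documents where a state-specific field has not been written yet.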

Where upgrade_details.state is expected to hold one of the following values:

| State | Meaning |
| --- | --- |
| UPG_REQUESTED | Upgrade requested by user |
| UPG_SCHEDULED | Upgrade scheduled for <date/time> |
| UPG_DOWNLOADING | Downloading new Agent artifact version |
| UPG_EXTRACTING | Extracting new Agent artifact version |
| UPG_REPLACING | Replacing old Agent artifact version with new one |
| UPG_RESTARTING | Starting new Agent version |
| UPG_WATCHING | Monitoring new Agent version |
| UPG_ROLLBACK | Upgrade unsuccessful; rolling back to the previous Agent version |
| UPG_FAILED | Upgrade failed due to an error in an earlier state (see metadata.failed_state) |
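
For consumers that want to validate the `state` value before reporting it, the list above could be captured as a closed set. This is only a sketch: the names `UPGRADE_STATES`, `isUpgradeState` and `isUnsuccessful` are hypothetical, and grouping `UPG_ROLLBACK` with `UPG_FAILED` as "unsuccessful" is an interpretation of the table, not something the issue specifies.

```typescript
// The nine states listed in the issue, as a readonly tuple.
const UPGRADE_STATES = [
  "UPG_REQUESTED",
  "UPG_SCHEDULED",
  "UPG_DOWNLOADING",
  "UPG_EXTRACTING",
  "UPG_REPLACING",
  "UPG_RESTARTING",
  "UPG_WATCHING",
  "UPG_ROLLBACK",
  "UPG_FAILED",
] as const;

type UpgradeState = (typeof UPGRADE_STATES)[number];

// Type guard: true only for strings that are one of the known states.
function isUpgradeState(value: unknown): value is UpgradeState {
  return (
    typeof value === "string" &&
    (UPGRADE_STATES as readonly string[]).indexOf(value) !== -1
  );
}

// Assumption: both rollback and failure count as an unsuccessful upgrade.
function isUnsuccessful(state: UpgradeState): boolean {
  return state === "UPG_ROLLBACK" || state === "UPG_FAILED";
}
```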

Here is a fairly comprehensive list of errors (contents of the upgrade_details.metadata.error_msg field) that could be reported during an ongoing upgrade:

  • Agent is not upgradeable
  • Upgrade is already in progress
  • Unable to read downloads directory
  • Unable to clean downloads directory before update
  • Failed to initiate fetcher
  • Failed to create downloads directory
  • Download failed
  • Failed to initiate verifier
  • Failed to verify agent binary
  • Failed to unpack archive
  • Empty agent hash
  • Failed to copy action store
  • Failed to copy run directory
  • Failed to change symlink
  • Failed to start watcher
  • Unable to clean downloads directory after update
  • Failed to load marker
  • Failed to acquire lock
  • Agent error detected
  • Agent crash detected
  • Rollback failed
  • Failed to create installation marker
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@jlind23
Contributor

jlind23 commented Jul 25, 2023

@ycombinator how is this one different from: https://github.com/elastic/ingest-dev/issues/1937 ?
cc @juliaElastic

@juliaElastic
Contributor

@jlind23 this is the telemetry piece of the upgrade improvements

@joshdover
Contributor

We should backport this to 8.11 if feasible

@cmacknz
Member

cmacknz commented Nov 22, 2023

> We should backport this to 8.11 if feasible

The upgrade details are only available in 8.12, so backporting to 8.11 won't do anything :)

@cmacknz
Member

cmacknz commented Dec 12, 2023

When this is completed I assume we are backporting to 8.12 since we were previously going to backport to 8.11?

@jlind23
Contributor

jlind23 commented Dec 12, 2023

> When this is completed I assume we are backporting to 8.12 since we were previously going to backport to 8.11?

That is correct.

@jlind23
Contributor

jlind23 commented Dec 13, 2023

@juliaElastic Could you please let me know when you start working on this? I'll be happy to start building dashboards based on that.

@juliaElastic
Contributor

@jlind23 I'm getting started today.

@juliaElastic
Contributor

@ycombinator when the error_msg is not there, is it set to an empty string or completely missing? If missing, I have to adjust the es aggregation.

@ycombinator
Contributor Author

> @ycombinator when the error_msg is not there, is it set to an empty string or completely missing? If missing, I have to adjust the es aggregation.

Completely missing.

BTW, note that we recently added two more optional fields to the upgrade details metadata object. Let me know if you want a separate Kibana issue to surface these in the UI.

@juliaElastic
Contributor

> Completely missing.

Thanks. Is there a way to simulate an upgrade failure with upgrade_details now? I'm not sure if it's possible since the upgrade_details are available from agent 8.12.

> BTW, note that we recently added two more optional fields to the upgrade details metadata object. Let me know if you want a separate Kibana issue to surface these in the UI.

Yes please.

@ycombinator
Contributor Author

> Thanks. Is there a way to simulate an upgrade failure with upgrade_details now? I'm not sure if it's possible since the upgrade_details are available from agent 8.12.

Yes, it should be possible. Essentially you'll need to run an 8.13.0-SNAPSHOT build of Kibana, then enroll an 8.12.0-SNAPSHOT build of Agent (so it has the Upgrade Details logic in it), then set the downloads URL in the Fleet settings to something bogus, then attempt an upgrade of that 8.12.0 agent to 8.13.0-SNAPSHOT.

> > BTW, note that we recently added two more optional fields to the upgrade details metadata object. Let me know if you want a separate Kibana issue to surface these in the UI.
>
> Yes please.

Done: #173370

@juliaElastic
Contributor

Thanks, I am seeing upgrade_details like this. How long does it take to go to failed state? It seems to retry for a long time (started it 13m ago).


elastic_agent
[elastic_agent][warn] download attempt 3753 failed: unable to download package: 3 errors occurred:
	* package '/Library/Elastic/Agent/data/elastic-agent-f383c6/downloads/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz' not found: open /Library/Elastic/Agent/data/elastic-agent-f383c6/downloads/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz: no such file or directory
	* call to 'https://artifacts.elastic.co/downloads/dummy/beats/elastic-agent/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz' returned unsuccessful status code: 404
	* call to 'https://artifacts.elastic.co/downloads/dummy/beats/elastic-agent/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz' returned unsuccessful status code: 404

; retrying in 0s.
13:56:51.892
elastic_agent
[elastic_agent][info] updated upgrade details
13:56:51.892
elastic_agent
[elastic_agent][info] download attempt 3754

@ycombinator
Contributor Author

ycombinator commented Dec 14, 2023

In case of download failures, since download is a retryable step, upgrade details won't go into the UPG_FAILED state. They will remain in the UPG_DOWNLOADING state, but metadata.retry_error and metadata.retry_until will get set instead.

Note that Upgrade Details are communicated to Fleet via the check-in API. The way this works is Agent will send a check-in API request to Fleet Server, then wait for Fleet Server's response, which could take up to 5 minutes. After receiving that response, Agent waits for 1 second and then sends the next check-in API request, and so on. So, in the worst case, new upgrade details could take 5 minutes + 1 second to be sent from Agent to Fleet.
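
The worst-case arithmetic above can be sketched as follows. This is a simplified model, not Agent or Fleet Server code: the constants come from the numbers in this comment, the function names are hypothetical, and the model assumes new upgrade details can only be carried by a check-in request that is sent at or after the moment the details change.

```typescript
// Values taken from the comment above; not read from any real config.
const MAX_CHECKIN_WAIT_MS = 5 * 60 * 1000; // Fleet Server may hold a check-in this long
const PAUSE_BETWEEN_CHECKINS_MS = 1000; // Agent pause before the next check-in

// Worst case: the details change just after a check-in request was sent.
function worstCaseReportingDelayMs(
  maxWaitMs: number = MAX_CHECKIN_WAIT_MS,
  pauseMs: number = PAUSE_BETWEEN_CHECKINS_MS
): number {
  return maxWaitMs + pauseMs;
}

// Simulates a sequence of check-ins. The first request is sent at t = 0;
// each entry in responseDurationsMs is how long Fleet Server took to answer.
// Returns how long after changeAtMs the next request goes out, or undefined
// if no request in the simulated window is sent at or after the change.
function delayUntilReportedMs(
  changeAtMs: number,
  responseDurationsMs: number[],
  pauseMs: number = PAUSE_BETWEEN_CHECKINS_MS
): number | undefined {
  let sendAt = 0;
  for (const duration of responseDurationsMs) {
    if (sendAt >= changeAtMs) return sendAt - changeAtMs;
    sendAt += duration + pauseMs;
  }
  return sendAt >= changeAtMs ? sendAt - changeAtMs : undefined;
}
```

For example, a change 1 ms after a request that Fleet Server holds for the full 5 minutes would, in this model, wait just under 5 minutes + 1 second before being sent.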

BTW, it's VERY strange that you are seeing a download attempt number that is so high (3753)! The "retrying in 0s" also doesn't look good. It seems we are retrying very aggressively, with no backoff at all! Thanks for catching this; I've filed elastic/elastic-agent#3915.
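
To make "backoff" concrete, here is a generic capped exponential backoff sketch. It is not the Elastic Agent's retry policy (the Agent is not written in TypeScript, and this issue does not describe how elastic/elastic-agent#3915 was resolved); the function name and the base and cap values are hypothetical.

```typescript
// Returns the delay before retry number `attempt` (1-based):
// baseMs, 2*baseMs, 4*baseMs, ... capped at capMs.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 1000,
  capMs: number = 60000
): number {
  const exponential = baseMs * Math.pow(2, Math.max(0, attempt - 1));
  // Math.min also handles the overflow to Infinity for very large attempts.
  return Math.min(capMs, exponential);
}
```

With these assumed values, a retry loop would never log a 0s delay, and even attempt 3753 would wait the full cap rather than retrying immediately.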

juliaElastic added a commit that referenced this issue Dec 18, 2023
## Summary

Relates #162448

Added upgrade details telemetry, publishing to the `fleet-agents` index in the
telemetry cluster, with each bucket sent as a separate document.
Implemented by doing a `multi_terms` aggregation to group the same
`target_version, state, error_msg` values together.
Do we also want to include the agent count in each bucket in the
telemetry event? @jlind23 @ycombinator

Note: since this task runs every hour, it will most likely capture the
`UPG_FAILED` states, since the other (success) states are only temporarily
present on the agent docs and are removed once the upgrade succeeds.
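
A `multi_terms` grouping like the one described above might look roughly like the request body below. This is a sketch, not the code from this commit: the function name, the bucket `size`, and the `(none)` placeholder are hypothetical, and it assumes the three fields are mapped so that they can be aggregated on. The `missing` option is included because `error_msg` is absent (not empty) when there is no error, and `multi_terms` otherwise skips documents that lack one of the key fields.

```typescript
interface TermSource {
  field: string;
  missing?: string; // value to use when the document lacks the field
}

interface UpgradeDetailsAggRequest {
  size: number;
  query: { exists: { field: string } };
  aggs: {
    upgrade_details: { multi_terms: { terms: TermSource[]; size: number } };
  };
}

// Hypothetical placeholder for documents without an error message.
const MISSING_ERROR_MSG = "(none)";

function buildUpgradeDetailsAggregation(bucketCount: number = 10): UpgradeDetailsAggRequest {
  return {
    size: 0, // only buckets are needed, not the matching documents
    // target_version is described as always present in upgrade_details
    query: { exists: { field: "upgrade_details.target_version" } },
    aggs: {
      upgrade_details: {
        multi_terms: {
          terms: [
            { field: "upgrade_details.target_version" },
            { field: "upgrade_details.state" },
            { field: "upgrade_details.metadata.error_msg", missing: MISSING_ERROR_MSG },
          ],
          size: bucketCount,
        },
      },
    },
  };
}
```

Each resulting bucket key would then hold one `target_version, state, error_msg` combination, with the placeholder standing in where no error was recorded.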

E.g. 2 docs like the below become one telemetry event
```
// .fleet-agents
   upgrade_details: {
            target_version: '8.12.0',
            state: 'UPG_FAILED',
            metadata: {
              error_msg: 'Download failed',
            },
          },

// telemetry event
{
      target_version: '8.12.0',
      state: 'UPG_FAILED',
      error_msg: 'Download failed',
    }
```
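
The "two docs become one telemetry event" behaviour can also be illustrated in memory, without Elasticsearch. This is only a sketch of the grouping idea: the type and function names are hypothetical, and the real implementation performs the grouping with an aggregation rather than in application code.

```typescript
interface AgentUpgradeDoc {
  upgrade_details?: {
    target_version: string;
    state: string;
    metadata?: { error_msg?: string };
  };
}

interface UpgradeTelemetryEvent {
  target_version: string;
  state: string;
  error_msg?: string;
}

// Collapses agent docs that share the same (target_version, state, error_msg)
// into a single telemetry event; docs without upgrade_details are skipped.
function groupUpgradeDetails(docs: AgentUpgradeDoc[]): UpgradeTelemetryEvent[] {
  const buckets: Record<string, UpgradeTelemetryEvent> = {};
  for (const doc of docs) {
    const details = doc.upgrade_details;
    if (!details) continue;
    const errorMsg = details.metadata?.error_msg;
    const key = JSON.stringify([details.target_version, details.state, errorMsg ?? null]);
    if (!(key in buckets)) {
      const event: UpgradeTelemetryEvent = {
        target_version: details.target_version,
        state: details.state,
      };
      if (errorMsg !== undefined) event.error_msg = errorMsg;
      buckets[key] = event;
    }
  }
  return Object.keys(buckets).map((key) => buckets[key]);
}
```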

To verify:
- start kibana 8.13-SNAPSHOT locally
- set an invalid agent download source in Fleet Settings
- enroll an agent version 8.12-SNAPSHOT
- upgrade to 8.13-SNAPSHOT with the API
```
POST kbn:/api/fleet/agents/<agent_id>/upgrade
  {
    "version": "8.13.0-SNAPSHOT",
    "force": true
  }
```
- wait 15m so that the upgrade goes to failed state
- wait up to 1h for the telemetry task to run (speed up locally by
setting a shorter interval in FleetUsageSender in kibana)
- verify in debug logs:
```
[2023-12-14T14:26:28.832+01:00][DEBUG][plugins.fleet] Agents upgrade details telemetry: [{"target_version":"8.13.0-SNAPSHOT","state":"UPG_FAILED","error_msg":"failed download of agent binary: unable to download package: 3 errors occurred:\n\t* package '/Library/Elastic/Agent/data/elastic-agent-f383c6/downloads/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz' not found: open /Library/Elastic/Agent/data/elastic-agent-f383c6/downloads/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz: no such file or directory\n\t* call to 'https://artifacts.elastic.co/downloads/dummy/beats/elastic-agent/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz' returned unsuccessful status code: 404\n\t* call to 'https://artifacts.elastic.co/downloads/dummy/beats/elastic-agent/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz' returned unsuccessful status code: 404\n\n"}]
```
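
Outside of Dev Tools, the upgrade call from the steps above could be prepared along these lines. This is a hedged sketch: `buildUpgradeRequest` and its parameters are hypothetical, authentication is deliberately left out, and the base URL is whatever your local Kibana uses. The `kbn-xsrf` header is included because Kibana expects it on non-GET API requests.

```typescript
interface UpgradeRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Builds the URL and request options for POST /api/fleet/agents/<agent_id>/upgrade.
// No network call is made here; pass the result to an HTTP client of your choice.
function buildUpgradeRequest(
  kibanaUrl: string,
  agentId: string,
  version: string,
  force: boolean = true
): UpgradeRequest {
  const base = kibanaUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    url: `${base}/api/fleet/agents/${encodeURIComponent(agentId)}/upgrade`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json", "kbn-xsrf": "true" },
      body: JSON.stringify({ version, force }),
    },
  };
}
```

A caller would add credentials to `init.headers` and then issue the request, for example with `fetch(request.url, request.init)`.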

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
@juliaElastic
Contributor

I got a merge conflict on backporting to 8.12, because the presets telemetry was not backported. Is that intentional? #172838

juliaElastic added a commit to juliaElastic/kibana that referenced this issue Dec 18, 2023
@jlind23
Contributor

jlind23 commented Dec 18, 2023

@juliaElastic I really don't think it was intentional as the presets are included in 8.12
