[Fleet] Upgrade details telemetry #173356

Merged: juliaElastic merged 7 commits into elastic:main on Dec 18, 2023

@juliaElastic juliaElastic commented Dec 14, 2023

Summary

Relates #162448

Added upgrade details telemetry, published to the fleet-agents index in the telemetry cluster, with each aggregation bucket sent as a separate document.
Implemented with a multi_terms aggregation that groups agents sharing the same target_version, state, and error_msg values.
Do we also want to include the agent count of each bucket in the telemetry event? @jlind23 @ycombinator

Note: since this task runs every hour, it will most likely capture only the UPG_FAILED states, because the other (success) states are on the agent docs only temporarily and are removed once the upgrade succeeds.

E.g. two docs like the one below become a single telemetry event:

// .fleet-agents
upgrade_details: {
  target_version: '8.12.0',
  state: 'UPG_FAILED',
  metadata: {
    error_msg: 'Download failed',
  },
},

// telemetry event
{
  target_version: '8.12.0',
  state: 'UPG_FAILED',
  error_msg: 'Download failed',
}
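
The grouping can be pictured with a small in-memory sketch. This is not the code in this PR (which delegates the grouping to Elasticsearch through the multi_terms aggregation); the `AgentDoc` and `UpgradeDetailsKey` types and the `groupUpgradeDetails` function are hypothetical names used only for illustration, and the empty-string default for a missing error_msg is an assumption of the sketch.

```typescript
// Hypothetical types: only the upgrade_details field names come from the example above.
interface AgentDoc {
  upgrade_details?: {
    target_version: string;
    state: string;
    metadata?: { error_msg?: string };
  };
}

interface UpgradeDetailsKey {
  target_version: string;
  state: string;
  error_msg: string;
}

// Collapse docs that share the same (target_version, state, error_msg) tuple
// into one event, which is roughly what a multi_terms aggregation does server-side.
function groupUpgradeDetails(docs: AgentDoc[]): UpgradeDetailsKey[] {
  const seen = new Map<string, UpgradeDetailsKey>();
  for (const doc of docs) {
    const details = doc.upgrade_details;
    if (!details) continue; // docs without upgrade details produce no event
    const event: UpgradeDetailsKey = {
      target_version: details.target_version,
      state: details.state,
      error_msg: details.metadata?.error_msg ?? '', // assumption: default to ''
    };
    // JSON-encode the tuple so values containing separators cannot collide.
    const key = JSON.stringify([event.target_version, event.state, event.error_msg]);
    if (!seen.has(key)) seen.set(key, event);
  }
  return Array.from(seen.values());
}

// Two identical failed docs plus one doc without upgrade details.
const sampleDoc: AgentDoc = {
  upgrade_details: {
    target_version: '8.12.0',
    state: 'UPG_FAILED',
    metadata: { error_msg: 'Download failed' },
  },
};
const collapsed = groupUpgradeDetails([sampleDoc, sampleDoc, {}]);
console.log(collapsed.length); // 1
```
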

To verify:

  • start Kibana 8.13-SNAPSHOT locally
  • set an invalid agent download source in Fleet Settings
  • enroll an agent running version 8.12-SNAPSHOT
  • upgrade it to 8.13-SNAPSHOT with the API
POST kbn:/api/fleet/agents/<agent_id>/upgrade
{
  "version": "8.13.0-SNAPSHOT",
  "force": true
}
  • wait 15m for the upgrade to reach the failed state
  • wait up to 1h for the telemetry task to run (to speed this up locally, set a shorter interval in FleetUsageSender in Kibana)
  • verify in the debug logs:
[2023-12-14T14:26:28.832+01:00][DEBUG][plugins.fleet] Agents upgrade details telemetry: [{"target_version":"8.13.0-SNAPSHOT","state":"UPG_FAILED","error_msg":"failed download of agent binary: unable to download package: 3 errors occurred:\n\t* package '/Library/Elastic/Agent/data/elastic-agent-f383c6/downloads/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz' not found: open /Library/Elastic/Agent/data/elastic-agent-f383c6/downloads/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz: no such file or directory\n\t* call to 'https://artifacts.elastic.co/downloads/dummy/beats/elastic-agent/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz' returned unsuccessful status code: 404\n\t* call to 'https://artifacts.elastic.co/downloads/dummy/beats/elastic-agent/elastic-agent-8.13.0-SNAPSHOT-darwin-aarch64.tar.gz' returned unsuccessful status code: 404\n\n"}]

Checklist

@juliaElastic juliaElastic added the release_note:skip and v8.12.0 labels Dec 14, 2023
@juliaElastic juliaElastic self-assigned this Dec 14, 2023
@juliaElastic juliaElastic requested a review from a team as a code owner December 14, 2023 10:37
@botelastic botelastic bot added the Team:Fleet label Dec 14, 2023
@elasticmachine

Pinging @elastic/fleet (Team:Fleet)

        },
        {
          field: 'upgrade_details.metadata.error_msg.keyword',
          missing: '',
Contributor Author

This ensures that documents with no error_msg are still included in the aggregation, with an empty string used in its place: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-multi-terms-aggregation.html#_missing_value_3

@@ -135,6 +141,22 @@ export const getAgentData = async (
          ],
        },
      },
      upgrade_details: {
        multi_terms: {
Member

@juliaElastic this will return only the first 10 items; is that what we want?

Contributor Author

good catch! probably not, I'll increase this.

Contributor Author

increased to 1000
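
Putting the two review threads together, the aggregation clause might look roughly like the sketch below. Only the error_msg field path, the `missing: ''` setting, the `upgrade_details` / `multi_terms` names, and the size of 1000 come from this conversation; the field paths for target_version and state are assumptions, and the constant name `upgradeDetailsAgg` is hypothetical.

```typescript
// Sketch of a multi_terms aggregation over the three upgrade detail fields.
// multi_terms returns 10 buckets by default, hence the explicit size.
const upgradeDetailsAgg = {
  upgrade_details: {
    multi_terms: {
      terms: [
        { field: 'upgrade_details.target_version.keyword' }, // assumed field path
        { field: 'upgrade_details.state' }, // assumed field path
        // `missing` keeps documents that have no error_msg, bucketing them under ''.
        { field: 'upgrade_details.metadata.error_msg.keyword', missing: '' },
      ],
      size: 1000,
    },
  },
};

console.log(JSON.stringify(upgradeDetailsAgg, null, 2));
```
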

@ycombinator

ycombinator commented Dec 14, 2023

Implemented by doing a multi_terms aggregation to group the same target_version, state, error_msg values together.
Do we also want to include the agent count in each bucket in the telemetry event? @jlind23 @ycombinator

Yes, I think so because this would tell us how many Agents of a certain version failed upgrading at a certain step/state because of a certain error, so it will give us a fairly specific idea of where to focus any efforts. But I will defer to @jlind23 @cmacknz @nimarezainia to help decide what's useful in terms of telemetry.

@jlind23

jlind23 commented Dec 14, 2023

Do we also want to include the agent count in each bucket in the telemetry event?

I would say yes, this will help us decide whether or not an issue is worth investigating based on the number of impacted Elastic Agents.

@juliaElastic

juliaElastic commented Dec 14, 2023

Do we also want to include the agent count in each bucket in the telemetry event?

I would say yes, this will help us decide whether or not an issue is worth investigating based on the number of impacted Elastic Agents.

Thanks, added an agent_count field to capture how many agents have the same upgrade details.
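
As a rough illustration of how agent_count could be derived, each multi_terms bucket in an Elasticsearch response carries a `key` array (one value per configured term, in order) and a `doc_count`. The sketch below maps such buckets to telemetry events; the type names and the `bucketsToEvents` function are hypothetical, and it assumes the terms are ordered target_version, state, error_msg and that all keys are strings.

```typescript
// Minimal shape of one multi_terms bucket as returned by Elasticsearch.
interface MultiTermsBucket {
  key: string[]; // assumed order: [target_version, state, error_msg]
  doc_count: number;
}

interface UpgradeDetailsTelemetryEvent {
  target_version: string;
  state: string;
  error_msg: string;
  agent_count: number;
}

// Turn each bucket into one telemetry event; doc_count is the number of
// agent documents that share the same upgrade details.
function bucketsToEvents(buckets: MultiTermsBucket[]): UpgradeDetailsTelemetryEvent[] {
  return buckets.map(({ key: [target_version, state, error_msg], doc_count }) => ({
    target_version,
    state,
    error_msg,
    agent_count: doc_count,
  }));
}

const telemetryEvents = bucketsToEvents([
  { key: ['8.12.0', 'UPG_FAILED', 'Download failed'], doc_count: 2 },
]);
console.log(telemetryEvents[0].agent_count); // 2
```
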

Member

@nchaulet nchaulet left a comment

code LGTM 🚀

@juliaElastic juliaElastic enabled auto-merge (squash) December 18, 2023 09:57
@kibana-ci

💚 Build Succeeded

@juliaElastic juliaElastic merged commit a61b864 into elastic:main Dec 18, 2023
39 checks passed
@kibanamachine

💔 All backports failed

Branch | Result
8.12 | Backport failed because of merge conflicts

Manual backport

To create the backport manually run:

node scripts/backport --pr 173356

Questions ?

Please refer to the Backport tool documentation

juliaElastic added a commit to juliaElastic/kibana that referenced this pull request Dec 18, 2023
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
juliaElastic added a commit that referenced this pull request Dec 18, 2023
Backport #173356

I got a merge conflict when backporting to 8.12 because the presets
telemetry was not backported. Is that intentional?
#172838
Labels
release_note:skip, Team:Fleet, v8.12.0, v8.13.0