[Fleet] Package policy upgrade telemetry with sender #115180
Conversation
Pinging @elastic/fleet (Team:Fleet)
@elasticmachine merge upstream
The biggest architectural decision I'd like to discuss before moving forward with this is how we want to organize our different events. With the current state of this PR, we'll be adding additional channels for each event type that we implement. Concerns I have with this:
- It requires that we create a new indexer for each channel since I believe indexers can only read from a single channel
- Each channel is stored separately in the GCP storage bucket. Would this make other types of processing that we want to do in the future difficult? For example, using tools like Apache Spark.
Alternatively, we could use a single channel for all Fleet telemetry events and use a well-known field name, such as event.type (to borrow from ECS), to describe the schema of the event's properties. This would allow us to use a single indexer for all of our events, which I assume would involve less boilerplate in creating all the necessary bits to stand up a new indexer. Questions I have with this alternative approach:
- If we use a single channel and single indexer, can we still ingest the documents into separate indices depending on the event.type?
- How practical would it be to use a single index with dynamic mappings instead? We could even define in our collection API which fields will be dynamically mapped and which will not. This would allow us to more rapidly add new event types without having to touch the indexer at all. For instance, we could index events in this type of format:
{
  "event.type": "package_policy_upgrade_result",
  "cluster_uuid": "...",
  "cluster_name": "...",
  "event.properties.mapped": {
    // We could have `dynamic: true` specified for these fields, with the option to explicitly map them in the indexer when the dynamic mappings are not good enough
    // All field names have the event.type prepended to avoid overlapping field names
    "package_policy_upgrade_result.package_name": "system",
    "package_policy_upgrade_result.current_version": "1.1.0",
    ...
  },
  "event.properties.raw": {
    // In the mappings these fields would not be mapped by default, but could be mapped later or used by runtime fields
    "package_policy_upgrade_result.error": "...",
  }
}
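On the mapping side, this could translate to an index mapping along these lines (a sketch, not an actual indexer config): dynamic: true lets new mapped.* fields appear automatically, while enabled: false stores raw.* in _source without indexing it.

{
  "mappings": {
    "dynamic": false,
    "properties": {
      "event.type": { "type": "keyword" },
      "cluster_uuid": { "type": "keyword" },
      "cluster_name": { "type": "keyword" },
      "event.properties.mapped": { "type": "object", "dynamic": true },
      "event.properties.raw": { "type": "object", "enabled": false }
    }
  }
}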
The JavaScript API for queuing such events would be something like:
sender.queueEvent('package_policy_upgrade_result', {
  mapped: {
    package_name: 'system',
    current_version: '1.1.0',
  },
  raw: {
    error: '...', // just an example, we may want to actually map this particular field
  },
});
@@ -146,7 +146,8 @@ export const updatePackagePolicyHandler: RequestHandler<
       esClient,
       request.params.packagePolicyId,
       { ...newData, package: pkg, inputs },
-      { user }
+      { user },
+      packagePolicy.package?.version
Should we switch to named parameters / object on this API? The list is quite long and it's not obvious what each arg is for.
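For illustration, a named-parameters version might look something like this (a sketch only; the argument names are hypothetical, not the actual Fleet API):

// Hypothetical sketch, not the real signature; names are illustrative.
interface UpdatePackagePolicyArgs {
  soClient: unknown; // SavedObjectsClientContract in Kibana
  esClient: unknown; // ElasticsearchClient in Kibana
  packagePolicyId: string;
  update: Record<string, unknown>;
  options?: { user?: unknown }; // AuthenticatedUser in Kibana
  currentVersion?: string; // pre-upgrade version, used for telemetry
}

declare function update(args: UpdatePackagePolicyArgs): Promise<unknown>;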
I'm a bit cautious of changing this; the update method is being called from a few plugins outside of Fleet, like apm and security_solution.
`Package policy successfully upgraded ${JSON.stringify(upgradeMeta)}`,
upgradeMeta

if (packagePolicy.package.version !== currentVersion) {
  appContextService.getLogger().info(`Package policy upgraded successfully`);
nit: let's go ahead and keep the detailed log info as well, but maybe it should be at the debug level instead?
@@ -184,6 +193,8 @@ export class FleetPlugin
     this.kibanaBranch = this.initializerContext.env.packageInfo.branch;
     this.logger = this.initializerContext.logger.get();
     this.configInitialValue = this.initializerContext.config.get();
+    this.telemetryEventsSender = new TelemetryEventsSender(this.logger);
I think we should use a child logger here so the logs are clearly labeled as coming from our telemetry code. Right now this is done inside the TelemetryEventsSender class, but I think it's more clear and explicit if we do it from this level.
-    this.telemetryEventsSender = new TelemetryEventsSender(this.logger);
+    this.telemetryEventsSender = new TelemetryEventsSender(this.logger.get('telemetry_events'));
      return;
    }

    const clusterInfo = await this.receiver?.fetchClusterInfo();
I'm not sure I quite understand the purpose of the receiver class or its naming (what is it receiving?). Should we just move this fetchClusterInfo function into this class?
security_solution had a lot more functionality in its receiver; I can simplify here since we only have fetchClusterInfo.
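For reference, the simplified version could be little more than a thin wrapper over the client (a sketch; assumes Kibana's ElasticsearchClient and the legacy { body } response envelope):

import type { ElasticsearchClient } from 'src/core/server';

// info() returns cluster_name, cluster_uuid and version details, which is
// all the sender needs here.
async function fetchClusterInfo(esClient: ElasticsearchClient) {
  const { body } = await esClient.info();
  return body;
}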
    }
  }

  public queueTelemetryEvents(events: TelemetryEvent[], channel: string) {
I think it would be a good idea to improve the type safety here to ensure that events passed in conform to an expected type for the channel. I also slightly prefer having the channel name as the first argument to better match other popular telemetry APIs. Maybe something like this:
interface FleetTelemetryChannelEvents {
  // channel name => event type
  'fleet-upgrades': PackagePolicyUpgradeUsage;
}

type FleetTelemetryChannel = keyof FleetTelemetryChannelEvents;

public queueTelemetryEvents<T extends FleetTelemetryChannel>(
  channel: T,
  events: Array<FleetTelemetryChannelEvents[T]>
) { ... }
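Usage would then be checked at compile time, for example (the PackagePolicyUpgradeUsage fields shown here are illustrative, not the real interface):

// Illustrative event shape; the actual PackagePolicyUpgradeUsage fields may differ.
sender.queueTelemetryEvents('fleet-upgrades', [
  {
    package_name: 'aws',
    current_version: '0.6.1',
    new_version: '1.0.0',
    status: 'failure',
  },
]);
// A mistyped channel name or a mismatched event shape becomes a compile error.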
  private isOptedIn?: boolean = true; // Assume true until the first check

  constructor(logger: Logger) {
    this.logger = logger.get('telemetry_events');
See my comment in server/plugin.ts. I think we should create the child logger from the calling code rather than do this here.
    // do not add events with same id
    events = events.filter(
      (event) => !this.queue.find((qItem) => qItem.id && event.id && qItem.id === event.id)
    );
Do we need this de-duplication functionality? Can we also support events without ids?
It's worth noting that this de-duplication only works if the duplicate event hasn't been sent yet.
I added this because the dry run might run multiple times when someone is navigating to the Integration details - Settings page or upgrading a policy. I added telemetry at dry run time because that is where the conflicts are checked. Note that this logic does not filter out events without an id.
It's true that this won't filter out duplicates if the events have already been sent. I was thinking of grouping by cluster uuid when building dashboards in telemetry, because in the real world there won't be more than one upgrade to the same package version.
We could remove this logic and see how many duplicates are sent in normal usage.
Got it, makes more sense now. I'm ok leaving this here in that case, as long as we consider it a best-effort de-duplication / 'spam' prevention and don't rely on it on the analysis side.

> because in the real world there won't be more than one upgrade to the same package version

nit: since these failures happen on package policy upgrades, there may actually be more than one per cluster if they have configured the same integration for multiple policies.
Apologies, I'm only seeing this now.
I removed the id filtering logic. Instead, I added an id function on the telemetry side that creates a doc id based on package name, version, status, cluster id, and date, so we won't get multiple docs with the same properties from a cluster on the same day. Even if multiple policies are upgraded, they will return the same results on dry run, so hopefully we don't need those extra events.
https://github.com/elastic/telemetry/blob/main/kpi-engine/src/telemetry/fleet.clj#L38
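The linked implementation is in Clojure; in TypeScript terms the idea is roughly this (a sketch with assumed field names):

import { createHash } from 'crypto';

// Deterministic doc id: identical upgrade results from the same cluster on the
// same day collapse into one document instead of producing duplicates.
function upgradeResultDocId(
  packageName: string,
  newVersion: string,
  status: string,
  clusterUuid: string,
  day: string // e.g. '2021-10-21'
): string {
  return createHash('sha256')
    .update([packageName, newVersion, status, clusterUuid, day].join('|'))
    .digest('hex');
}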
    try {
      this.logger.debug(`Telemetry URL: ${telemetryUrl}`);

      const toSend: TelemetryEvent[] = cloneDeep(events).map((event) => ({
IMO we should do the cloning when the event is queued rather than when it is being sent. That way no modifications to the objects from calling code can cause any problems after they have been queued.
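Roughly, a sketch of that alternative (simplified types; assumes lodash is available, as it is elsewhere in this code):

import { cloneDeep } from 'lodash';

type TelemetryEvent = Record<string, unknown>;

class TelemetryEventsSender {
  private queue: TelemetryEvent[] = [];

  public queueTelemetryEvents(events: TelemetryEvent[]) {
    // Deep-clone on enqueue: callers mutating their original objects can no
    // longer change an event after it has been queued.
    this.queue.push(...cloneDeep(events));
  }
}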
export type SearchTypes = BaseSearchTypes | BaseSearchTypes[] | undefined;

// For getting cluster info. Copied from telemetry_collection/get_cluster_info.ts
export interface ESClusterInfo {
Is there a type available in @elastic/elasticsearch that we can just re-use?
yes, I'll update
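For reference, it could be as simple as aliasing the client's generated types (a sketch; the exact type name, e.g. InfoResponse, may vary between client versions):

import type { estypes } from '@elastic/elasticsearch';

// Re-use the client's own response type instead of a hand-copied interface.
// (Type name is assumed; check the installed client version.)
type ESClusterInfo = estypes.InfoResponse;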
@joshdover Regarding the support of different events:
AFAIK you need multiple indexers to ingest into separate indices; e.g. the indexer config has to define the index/alias to ingest into, though you can have multiple indexers reading from the same channel.
I'm not sure whether we can specify dynamic mappings in the indexer; I'll ask around.
If we go with this unified channel approach, should we rename the channel to something more generic?
@elasticmachine merge upstream
As discussed on Slack, we will stay with this solution for now, and come back to it when we have more use cases for telemetry to decide whether we want to create a generic message type or use different messages and channels.
 * 2.0.
 */

export const TELEMETRY_MAX_BUFFER_SIZE = 100;
(nit) It might be more accurate to name this TELEMETRY_MAX_QUEUE_SIZE
import { loggingSystemMock } from 'src/core/server/mocks';

import axios from 'axios';
(nit) Group imports from installed packages above imports from local modules.
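i.e. in this file:

// Installed packages first...
import axios from 'axios';

// ...then Kibana/local modules, separated by a blank line.
import { loggingSystemMock } from 'src/core/server/mocks';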
@mostlyjason @joshdover Discover
ack: will review tomorrow
@elasticmachine merge upstream
💛 Build succeeded, but was flaky
* draft of upgrade usage collector
* telemetry sender service
* fixed tests and types
* cleanup
* type fix
* removed collector
* made required field message generic, added test
* cleanup
* cleanup
* cleanup
* removed v3-dev as outdated
* removed conditional from telemetry url creation
* supporting multiple channels in sender
* fix types
* refactor
* using json content type
* fix test
* simplified telemetry url
* fixed type
* added back ndjson
* moved telemetry to update, added dryrun
* fix types
* fix prettier
* updated after review
* fix imports
* added error_message field
* review fixes

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
💚 Backport successful
This backport PR will be merged automatically after passing CI.
Summary
Collecting telemetry of package policy upgrades: #109870
See latest commits for the Event Sender implementation (in this case we don't need a collector).
Took it from security_solution and simplified it for our use case.
See sender.ts.
To test:
Install older versions of packages, e.g. apache-0.3.3 (no conflicts) and aws-0.6.1 (conflicts).
Add policies for each and update them.
The aws policy can only be upgraded manually, by resolving conflicts.
For the event sender:
Add the following to kibana.dev.yml to opt in to telemetry and see debug logs. In case of success/failure of upgrades, the debug log should include the events sent to the telemetry v3 endpoint.
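The exact snippet isn't reproduced here, but something along these lines should work (assumed settings; adjust to your Kibana version):

# kibana.dev.yml: opt in to telemetry and enable debug logging for Fleet (assumed settings)
telemetry.optIn: true
logging.loggers:
  - name: plugins.fleet
    level: debug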