
[Telemetry] The caching mechanism also caches failed payloads #123021

Closed
afharo opened this issue Jan 14, 2022 · 6 comments · Fixed by #124253

afharo (Member) commented Jan 14, 2022

The caching mechanism introduced in #117084 caches the full report. This means that if a collector fails during that report generation, the incomplete report will be cached.

This is important because of the scenario detailed by @jportner in this comment: #120422 (comment). A user with limited access could cache an incomplete report, and another user with the right permissions requesting the report would then get the cached incomplete version (and vice versa).
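
To make the failure mode concrete, here is a minimal sketch (not the actual plugin code; the collector names and helpers are made up) of how a single top-level cache over the whole report ends up storing an incomplete payload when one collector throws:

```ts
type Report = Record<string, unknown>;

// Hypothetical collectors: one succeeds, one fails for the current user.
const collectors: Record<string, () => Promise<unknown>> = {
  core: async () => ({ kibanaVersion: '8.x' }),
  security: async () => {
    // e.g. the requesting user lacks the required privileges
    throw new Error('403: insufficient permissions');
  },
};

let cachedReport: Report | undefined;

async function getReport(): Promise<Report> {
  if (cachedReport) {
    return cachedReport; // served to every subsequent caller until it expires
  }
  const report: Report = {};
  for (const [name, collect] of Object.entries(collectors)) {
    try {
      report[name] = await collect();
    } catch {
      // the failed collector is simply missing from the report...
    }
  }
  cachedReport = report; // ...and that incomplete report is what gets cached
  return report;
}
```

Whoever triggers the first collection (e.g. a user with limited access) effectively decides what every later caller gets back from the cache.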

Potential solutions:

  1. Disable the caching mechanism, since we now ensure that we only send one daily report. However, we may still appreciate some caching for retries if there's a connection issue with the Remote Telemetry Service.
  2. Cache every collector's result individually (only when successful). The problem with this is that a collector may succeed but return partial data when the requesting user has limited visibility.
  3. Always bypass the caching mechanism when requesting the unencrypted version. The unencrypted payload is generated with the kibana_system user, so it shouldn't have permissions issues.

I'd say option 3 is the best compromise for now.
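
A rough sketch of what option 3 could look like, assuming a handler that knows whether the caller asked for the unencrypted payload; the fetcher interface and names are illustrative, not the plugin's real API:

```ts
type Report = Record<string, unknown>;

interface TelemetryFetcher {
  collectFresh(): Promise<Report>;       // always re-runs the collectors
  getCachedOrCollect(): Promise<Report>; // may return the shared cached report
}

async function handleUsageRequest(
  fetcher: TelemetryFetcher,
  { unencrypted }: { unencrypted: boolean }
): Promise<Report> {
  if (unencrypted) {
    // Unencrypted payloads are generated with the kibana_system user, so a
    // fresh collection shouldn't hit per-user permission issues, and the
    // shared cache is neither read nor populated for these requests.
    return fetcher.collectFresh();
  }
  // The encrypted daily report keeps using the cache (still handy for retries).
  return fetcher.getCachedOrCollect();
}
```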

afharo added the bug, Team:Core, and Feature:Telemetry labels on Jan 14, 2022
elasticmachine (Contributor) commented:

Pinging @elastic/kibana-core (Team:Core)

Bamieh (Member) commented Jan 17, 2022

We are caching usage for 4 hours, so any failed collectors will not be reported until the next caching cycle. This is not a bug per se but a consequence of how we've implemented the caching logic, which works at the top level of the service.
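
For context, a minimal sketch of what such a top-level, time-based cache looks like (the 4-hour duration comes from the comment above; everything else is illustrative):

```ts
const CACHE_DURATION_MS = 4 * 60 * 60 * 1000; // 4 hours

let cached: { report: Record<string, unknown>; cachedAt: number } | undefined;

async function getUsage(
  collect: () => Promise<Record<string, unknown>>
): Promise<Record<string, unknown>> {
  if (cached && Date.now() - cached.cachedAt < CACHE_DURATION_MS) {
    // Anything missing from the cached report stays missing until expiry.
    return cached.report;
  }
  const report = await collect();
  cached = { report, cachedAt: Date.now() }; // cached even if collectors failed
  return report;
}
```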

Always bypass the caching mechanism when requesting the unencrypted version. The unencrypted payload is generated with the kibana_system user, so it shouldn't have permissions issues.

I'm also +1 on this approach.

Since we're already sending one daily report, the caching is now only useful for retry logic and the like; disabling it would resurface some issues there, so I don't think we should disable it.
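
A hedged sketch of the retry scenario the cache helps with, assuming a send function towards the remote telemetry service (the names and the retry loop are illustrative, not the plugin's actual implementation):

```ts
type Report = Record<string, unknown>;

async function sendDailyReport(
  getCachedOrCollect: () => Promise<Report>,
  sendToTelemetryService: (report: Report) => Promise<void>,
  maxAttempts = 3
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Each retry reuses the cached payload instead of re-running every
    // collector, which is the main benefit of keeping the cache around.
    const report = await getCachedOrCollect();
    try {
      await sendToTelemetryService(report);
      return;
    } catch (err) {
      if (attempt === maxAttempts) throw err; // e.g. persistent connection issue
    }
  }
}
```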

Bamieh self-assigned this on Jan 17, 2022
rudolf (Contributor) commented Jan 17, 2022

3. Always bypass the caching mechanism when requesting the unencrypted version. The unencrypted payload is generated with the kibana_system user, so it shouldn't have permissions issues.

This means metricbeat could still add a lot of usage collection load. Does metricbeat call this API when you enable monitoring for a Kibana cluster, or do users need to specifically configure this? How often will metricbeat hit this API? We should ensure that a common feature like monitoring a Kibana cluster doesn't end up adding a lot of additional load.

Bamieh (Member) commented Jan 17, 2022

@rudolf metricbeat uses a different API (stats API) which is not cached. The caching mechanism we introduced is solely for our telemetry.

We'd need to coordinate with folks consuming this API before we introduce any caching mechanisms to the collectors there (CC @seanstory @yakhinvadim). Adding a caching layer on the stats API might be awkward since users might be expecting more 'real-time' data rather than flat graphs that change on every caching cycle.

We had a discussion last year about dropping the collectors from the stats API completely, which might be worth re-exploring here.

exalate-issue-sync bot added the impact:needs-assessment and loe:small labels on Jan 17, 2022
Bamieh added the impact:high label and removed the impact:needs-assessment label on Jan 17, 2022
afharo (Member, Author) commented Jan 17, 2022

@rudolf Metricbeat should not collect usage anymore starting with 7.11.0: elastic/beats#22732

Prior to that version, it should only collect usage once every 24h. So, even if the user runs an older version of Metricbeat, it shouldn't cause too much of an issue (or we can suggest that they upgrade their Metricbeat agent).

@Bamieh I think that the change needs to apply to GET /api/stats as well: it effectively collects telemetry with unencrypted: true. It doesn't make much sense to fail with 403 on the first request and succeed on the following ones.
EDIT: Scratch that! You are correct! The caching only occurs in the Telemetry plugin 😇

seanstory (Member) commented:

We'd need to coordinate with folks consuming this API before we introduce any caching mechanisms to the collectors there (CC @seanstory @yakhinvadim)

@yakhinvadim and I aren't actually making use of this. Enterprise Search was (but is no longer) using the ?expanded=true parameter, and since we were some of the first with M1 laptops, we noticed first that it wasn't behaving. Our fix in Enterprise Search was to just stop using that parameter. I can't speak for your other users, but we can give you the 👍 to do whatever. :)
