
[Telemetry] The caching mechanism also caches failed payloads #123021

Closed
afharo opened this issue Jan 14, 2022 · 6 comments · Fixed by #124253

afharo (Member) commented Jan 14, 2022

The caching mechanism introduced in #117084 caches the full report. This means that if a collector fails during that report generation, the incomplete report will be cached.

This is important because of the scenario detailed by @jportner in this comment: #120422 (comment). A user with limited access could cache an incomplete report, and another user with the right permissions requesting the report would then get the cached incomplete version (and vice versa).
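
To make the failure mode concrete, here is a minimal sketch (not the actual plugin code; the collector names and helpers are made up) of how a single top-level cache over the whole report ends up storing an incomplete payload when one collector throws:

```ts
type Report = Record<string, unknown>;

// Hypothetical collectors: one succeeds, one fails for the current user.
const collectors: Record<string, () => Promise<unknown>> = {
  core: async () => ({ kibanaVersion: '8.x' }),
  security: async () => {
    // e.g. the requesting user lacks the required privileges
    throw new Error('403: insufficient permissions');
  },
};

let cachedReport: Report | undefined;

async function getReport(): Promise<Report> {
  if (cachedReport) {
    return cachedReport; // served to every subsequent caller until it expires
  }
  const report: Report = {};
  for (const [name, collect] of Object.entries(collectors)) {
    try {
      report[name] = await collect();
    } catch {
      // the failed collector is simply missing from the report...
    }
  }
  cachedReport = report; // ...and that incomplete report is what gets cached
  return report;
}
```

Whoever triggers the first collection (e.g. a user with limited access) effectively decides what every later caller gets back from the cache.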

Potential solutions:

  1. Disable the caching mechanism, since we now ensure that we only send one daily report. However, we may still appreciate some caching for retries if there's a connection issue with the Remote Telemetry Service.
  2. Cache every collector's result individually (only when successful). The problem with this is that a collector may succeed but return partial data when the requesting user has limited visibility.
  3. Always bypass the caching mechanism when requesting the unencrypted version. The unencrypted payload is generated with the kibana_system user, so it shouldn't have permissions issues.

I'd say option 3 is the best compromise for now.
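
A rough sketch of what option 3 could look like, assuming a handler that knows whether the caller asked for the unencrypted payload; the fetcher interface and names are illustrative, not the plugin's real API:

```ts
type Report = Record<string, unknown>;

interface TelemetryFetcher {
  collectFresh(): Promise<Report>;       // always re-runs the collectors
  getCachedOrCollect(): Promise<Report>; // may return the shared cached report
}

async function handleUsageRequest(
  fetcher: TelemetryFetcher,
  { unencrypted }: { unencrypted: boolean }
): Promise<Report> {
  if (unencrypted) {
    // Unencrypted payloads are generated with the kibana_system user, so a
    // fresh collection shouldn't hit per-user permission issues, and the
    // shared cache is neither read nor populated for these requests.
    return fetcher.collectFresh();
  }
  // The encrypted daily report keeps using the cache (still handy for retries).
  return fetcher.getCachedOrCollect();
}
```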

afharo added the bug, Team:Core, and Feature:Telemetry labels on Jan 14, 2022
elasticmachine (Contributor) commented:

Pinging @elastic/kibana-core (Team:Core)

Bamieh (Member) commented Jan 17, 2022

We are caching usage for 4 hours, so any failed collectors will not be reported until the next caching cycle. This is not a bug per se but a consequence of how we've implemented the caching logic, which works at the top level of the service.
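
For context, a minimal sketch of what such a top-level, time-based cache looks like (the 4-hour duration comes from the comment above; everything else is illustrative):

```ts
const CACHE_DURATION_MS = 4 * 60 * 60 * 1000; // 4 hours

let cached: { report: Record<string, unknown>; cachedAt: number } | undefined;

async function getUsage(
  collect: () => Promise<Record<string, unknown>>
): Promise<Record<string, unknown>> {
  if (cached && Date.now() - cached.cachedAt < CACHE_DURATION_MS) {
    // Anything missing from the cached report stays missing until expiry.
    return cached.report;
  }
  const report = await collect();
  cached = { report, cachedAt: Date.now() }; // cached even if collectors failed
  return report;
}
```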

Always bypass the caching mechanism when requesting the unencrypted version. The unencrypted payload is generated with the kibana_system user, so it shouldn't have permissions issues.

I'm also +1 on this approach.

Since we're already sending one daily report, the caching is now only useful for retry logic and the like; disabling it would resurface some issues there, so I don't think we should disable it.
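
A hedged sketch of the retry scenario the cache helps with, assuming a send function towards the remote telemetry service (the names and the retry loop are illustrative, not the plugin's actual implementation):

```ts
type Report = Record<string, unknown>;

async function sendDailyReport(
  getCachedOrCollect: () => Promise<Report>,
  sendToTelemetryService: (report: Report) => Promise<void>,
  maxAttempts = 3
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Each retry reuses the cached payload instead of re-running every
    // collector, which is the main benefit of keeping the cache around.
    const report = await getCachedOrCollect();
    try {
      await sendToTelemetryService(report);
      return;
    } catch (err) {
      if (attempt === maxAttempts) throw err; // e.g. persistent connection issue
    }
  }
}
```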

Bamieh self-assigned this on Jan 17, 2022
rudolf (Contributor) commented Jan 17, 2022

3. Always bypass the caching mechanism when requesting the unencrypted version. The unencrypted payload is generated with the kibana_system user, so it shouldn't have permissions issues.

This means metricbeat could still add a lot of usage collection load. Does metricbeat call this API when you enable monitoring for a Kibana cluster, or do users need to specifically configure this? How often will metricbeat hit this API? We should ensure that a common feature like monitoring a Kibana cluster doesn't end up adding a lot of additional load.

Bamieh (Member) commented Jan 17, 2022

@rudolf metricbeat uses a different API (stats API) which is not cached. The caching mechanism we introduced is solely for our telemetry.

We'd need to coordinate with folks consuming this API before we introduce any caching mechanisms to the collectors there (CC @seanstory @yakhinvadim). Adding a caching layer on the stats API might be awkward since users might be expecting more 'real-time' data rather than flat graphs that change on every caching cycle.

We had a discussion last year about dropping the collectors from the stats API completely, which might be worth re-exploring here.

exalate-issue-sync bot added the impact:needs-assessment and loe:small labels on Jan 17, 2022
Bamieh added the impact:high label and removed the impact:needs-assessment label on Jan 17, 2022
afharo (Member, Author) commented Jan 17, 2022

@rudolf Metricbeat should not collect usage anymore starting with 7.11.0: elastic/beats#22732

Prior to that version, it should only collect usage once every 24h. So, even if the user runs an older version of Metricbeat, it shouldn't cause too much of an issue (or we can suggest that they upgrade their Metricbeat agent).

@Bamieh I think that the change needs to apply to GET /api/stats as well: it effectively collects telemetry with unencrypted: true. It doesn't make much sense to fail with 403 on the first request and succeed on the following ones.
EDIT: Scratch that! You are correct! The caching only occurs in the Telemetry plugin 😇

seanstory (Member) commented:

We'd need to coordinate with folks consuming this API before we introduce any caching mechanisms to the collectors there (CC @seanstory @yakhinvadim)

@yakhinvadim and I aren't actually making use of this. Enterprise Search was (but is no longer) using the ?expanded=true parameter, and since we were some of the first with M1 laptops, we noticed first that it wasn't behaving. Our fix in Enterprise Search was to just stop using that parameter. I can't speak for your other users, but we can give you the 👍 to do whatever. :)
