
Cache usage data to prevent expensive collection queries #117084

Closed
rudolf opened this issue Nov 2, 2021 · 7 comments · Fixed by #119312
Labels
Feature:Telemetry · impact:high · loe:medium · performance · Team:Core

Comments

@rudolf
Contributor

rudolf commented Nov 2, 2021

Every browser session will request usage data every 24 hours. In large clusters, collecting usage data leads to expensive Elasticsearch queries and large response payloads, which affect the performance of both Elasticsearch and Kibana (#93770).

When Elasticsearch or Kibana performance is degraded, these expensive queries take longer to complete, causing timeouts; each browser then retries the request (#115221).

Instead of collecting this usage data anew for every browser session that requests it, we should cache the data server side.

@rudolf rudolf added the Team:Core and Feature:Telemetry labels Nov 2, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@rudolf
Contributor Author

rudolf commented Nov 2, 2021

As a first step, we could use an in-memory-only cache, meaning that if there's more than one Kibana server, each server will repeat the usage collection.
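A per-server in-memory cache with a TTL could look something like the sketch below. The `fetchUsage` callback, the class name, and the TTL value are illustrative assumptions, not the actual Kibana implementation:

```typescript
// Minimal sketch of a per-server in-memory cache for usage data.
// All names and the TTL are hypothetical, for illustration only.
type Fetcher<T> = () => Promise<T>;

class UsageCache<T> {
  private value: T | undefined;
  private fetchedAt = 0;

  constructor(
    private readonly fetchUsage: Fetcher<T>,
    private readonly ttlMs: number
  ) {}

  async get(): Promise<T> {
    const now = Date.now();
    // Serve the cached payload while it is fresh; otherwise collect again.
    if (this.value !== undefined && now - this.fetchedAt < this.ttlMs) {
      return this.value;
    }
    this.value = await this.fetchUsage();
    this.fetchedAt = now;
    return this.value;
  }
}
```

Each Kibana server would hold its own instance, so with N servers the collection still runs up to N times per TTL window, which is the trade-off described above.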

It might be worth creating a usage collection task that runs every X hours and stores its results in a document with a last-updated value. Kibana could then just read whatever is in that document. But this document would be quite large (I've seen > 37 MB), so we would have to exclude it from, e.g., saved object migrations.
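The task-based variant could be sketched roughly as follows. The `SnapshotStore` interface stands in for whatever storage would actually be used (e.g. a saved object excluded from migrations), and all names are hypothetical:

```typescript
// Sketch of a periodic collection task that writes one "usage snapshot"
// document with a last-updated timestamp. Field and interface names are
// assumptions, not the real Kibana types.
interface UsageSnapshot {
  lastUpdated: string; // ISO timestamp of the last successful collection
  payload: Record<string, unknown>;
}

interface SnapshotStore {
  write(doc: UsageSnapshot): Promise<void>;
  read(): Promise<UsageSnapshot | undefined>;
}

// Run by a scheduler every X hours; readers only ever call store.read().
async function runCollectionTask(
  store: SnapshotStore,
  collect: () => Promise<Record<string, unknown>>
): Promise<void> {
  const payload = await collect();
  await store.write({ lastUpdated: new Date().toISOString(), payload });
}
```

With this shape, serving a usage request is a single document read regardless of how many browsers ask, at the cost of the snapshot being up to X hours stale.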

@exalate-issue-sync exalate-issue-sync bot added the impact:high and loe:medium labels and removed the impact:low and loe:small labels Nov 2, 2021
@rudolf
Contributor Author

rudolf commented Nov 2, 2021

Beats might be using this API, so we should check the impact this change could have on that integration.

@rudolf rudolf changed the title Usage data is collected anew for every browser session request Cache usage data to prevent expensive collection queries Nov 2, 2021
@pgayvallet
Contributor

> As a first step, we could use an in-memory only cache, meaning if there's more than one Kibana server each server will repeat the usage collection.

Seems good enough for an initial implementation.

> Every browser session will request usage data every 24 hours

I'm sorry, I tried to find where this is done in the code but couldn't. Is this performed at a fixed time (e.g. the same time every day for each browser session), or does the browser send the data 24h after its initial load? I'm asking because, depending on the answer, caching the data on the server side may be harder, or at least we'd need a lower TTL for the cache.

@rudolf
Contributor Author

rudolf commented Nov 4, 2021

https://github.com/elastic/kibana/blob/master/src/plugins/telemetry/public/services/telemetry_sender.ts#L86

We try to send usage data and, if successful, store the timestamp in localStorage. Every minute that the browser is open, we check whether 24 hours have elapsed since the last success timestamp.
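The check described above amounts to something like this sketch. The storage key name and the helper are assumptions for illustration; see telemetry_sender.ts (linked above) for the real logic:

```typescript
// Sketch of the browser-side check: each minute, compare the stored
// last-success timestamp against a 24h interval. The key name and the
// KeyValueStore abstraction (a localStorage stand-in) are hypothetical.
interface KeyValueStore {
  getItem(key: string): string | null;
}

const LAST_REPORTED_KEY = 'telemetry.lastReported';
const REPORT_INTERVAL_MS = 24 * 60 * 60 * 1000;

function shouldReport(now: number, storage: KeyValueStore): boolean {
  // Missing or unparsable timestamp counts as "never reported".
  const last = Number(storage.getItem(LAST_REPORTED_KEY) ?? 0);
  return now - last >= REPORT_INTERVAL_MS;
}
```

The per-minute polling only decides *whether* to send; the 24h window is anchored to the last successful send, not to a fixed time of day.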

Given the urgency, simple server-side caching is a good first step, but I think we can do more here, like asking browsers to report to the server when they successfully send a payload. If one browser session has sent usage data, we don't need the other browsers to also send it. So once every 24 hours the browser could ask the server "has anyone sent telemetry?". If the answer is no, the browser tries; sometimes multiple browsers will do so in parallel, but that's fine. At least not all of, e.g., 100 users of a cluster would be trying to send the same payload.
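The "has anyone sent telemetry?" coordination could be sketched server-side roughly like this (all names are hypothetical, and a real version would need to survive restarts and multiple Kibana servers):

```typescript
// Sketch of server-side coordination: record when any browser last
// reported successfully, and let other browsers ask before sending.
// In-memory only, so each Kibana server tracks its own window.
class TelemetryCoordinator {
  private lastSentAt = 0;

  constructor(private readonly intervalMs: number) {}

  // Answers "has anyone sent telemetry in the current window?"
  hasRecentReport(now: number): boolean {
    return now - this.lastSentAt < this.intervalMs;
  }

  // Called when a browser confirms a successful send.
  markSent(now: number): void {
    this.lastSentAt = now;
  }
}
```

A browser would call `hasRecentReport` before attempting a send and `markSent` afterwards; parallel senders in the gap are harmless, as noted above, since the goal is only to avoid every session sending.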

@rudolf
Contributor Author

rudolf commented Nov 4, 2021

Given that the browser retries every 60s and that usage collection could theoretically take longer than 60s to complete, we should also consider what happens if the cache isn't primed. In that case we would want to start usage collection only once, regardless of how many browsers are requesting it.
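One common way to start collection only once is to share a single in-flight promise among concurrent callers; a minimal sketch, not the actual implementation:

```typescript
// Sketch of "single-flight" deduplication: while a collection is running,
// every concurrent caller awaits the same promise instead of starting
// another expensive collection.
function singleFlight<T>(fetch: () => Promise<T>): () => Promise<T> {
  let inflight: Promise<T> | undefined;
  return () => {
    if (!inflight) {
      inflight = fetch().finally(() => {
        // Clear after settling so a later cold-cache request fetches again.
        inflight = undefined;
      });
    }
    return inflight;
  };
}
```

Combined with the TTL cache above, a cold cache would trigger exactly one collection even if many browsers hit the endpoint during the slow first request.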

@marxello

marxello commented Nov 4, 2021

> As a first step, we could use an in-memory only cache, meaning if there's more than one Kibana server each server will repeat the usage collection.
>
> It might be worth creating a usage collection task that runs every X hours, and stores its results in a document with a last updated value. Kibana could then just read whatever is in the document. But this document would be quite large (I've seen > 37MB) so we would have to exclude it from e.g. saved object migrations.

Hi Rudolf, I highly recommend making the cache refresh interval configurable, so that everyone can adjust it to their environment.
