[Elasticsearch] Optimize handling of large cluster payloads #33862

miltonhultgren · 2022-11-29T12:40:47Z

Background

In the past, Metricbeat's Elasticsearch module has created issues with performance because it consumes APIs that didn't scale well with the size of the monitored cluster. Thanks to a lot of effort by the Elasticsearch team, these APIs now perform much better.

Despite these improvements, we still see issues in ESS. The problem now seems to be that Metricbeat consumes too much CPU when parsing and processing the large responses that Elasticsearch returns. Generating these responses costs Elasticsearch fairly little, so the CPU usage of Elasticsearch itself is low on the master nodes where this happens. Metricbeat, however, takes up the CPU while processing the response and leaves little for the master node, which causes general instability.

A larger fix for this is outlined here, to make Metricbeat adapt its resource usage based on available CPU so that it does not crowd out the other processes that are running.

We may also want to revisit elastic/kibana#130575 and see if we can get the same data through other APIs that return smaller responses to process.

Short term improvement

We have gotten feedback that the code in the Elasticsearch module could be optimized to reduce the CPU/Memory usage as well as speed up the processing of responses.

The main culprits seem to be an excessive usage of mapstr and schema, as well as unmarshalling too much of the JSON response (more than we need to generate the event documents). We should also see if it's possible for us to reduce the amount of data we send to Elasticsearch since that also takes time when the cluster becomes larger.

Development tips

Metricbeat has cpuprofile and memprofile as flags you can use to enable resource profiling.

AC

Usage of mapstr is eliminated
Usage of schema is replaced with a hard-coded Go struct that is used for JSON parsing and covers only the exact data we need
Documents are trimmed to only send fields that are indexed
A noticeable improvement in CPU usage is measured for large clusters


DaveCTurner · 2022-11-30T09:57:57Z

If Elasticsearch is sending data which is not important to Metricbeat, it would be worth trying to make use of the ?filter_path query option to drop the unimportant bits before they even leave Elasticsearch, which should save CPU on both sides. This won't solve the O(#shards) scaling issues ofc but it might extend the runway a bit.

botelastic · 2023-11-30T10:55:21Z

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1.
Thank you for your contribution!

miltonhultgren added module Team:Infra Monitoring UI - DEPRECATED Infrastructure Monitoring UI team - DEPRECATED - Use Team:Monitoring Feature:Stack Monitoring labels Nov 29, 2022

botelastic bot added the Stalled label Nov 30, 2023

smith added Team:Monitoring Stack Monitoring team and removed Team:Infra Monitoring UI - DEPRECATED Infrastructure Monitoring UI team - DEPRECATED - Use Team:Monitoring labels Jan 26, 2024

botelastic bot removed the Stalled label Jan 26, 2024

miltonhultgren mentioned this issue Nov 14, 2024

Memory Consumption Issue with Elastic Agent on Kubernetes with high number of resources elastic/elastic-agent#5991

Closed

consulthys mentioned this issue Dec 11, 2024

Additional stats fields for Elasticsearch #41944

Merged

8 tasks
