This repository has been archived by the owner on Sep 21, 2023. It is now read-only.

[Meta] Implement shipper performance testing #57

Closed · Tracked by #118
cmacknz opened this issue Jun 15, 2022 · 7 comments
Labels
estimation:Month (Task that represents a month of work.) · Meta · Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team) · v8.6.0

Comments

@cmacknz
Member

cmacknz commented Jun 15, 2022

The Elastic agent data shipper is actively under development and we need a way to benchmark its performance as part of the agent system. Specifically we are interested in benchmarking the achievable throughput of a single agent using the shipper along with its CPU, memory, and disk IOPS overhead. Users care about the performance of the agent and we need a way to measure and improve it.

Design

The proposed solution is to develop a new load generating input for the agent, which can be installed and configured as a standard agent integration. The test scenario can be changed by modifying the integration configuration or agent policy. Metrics will be collected using the existing agent monitoring features. Where the existing agent monitoring is not adequate, it should be enhanced so that all data necessary to diagnose performance issues is also available in the field. For example, all performance data should be available in the existing agent metrics dashboard.

[Diagram: Data Shipper Performance Testing]

The new load generating input should be developed as one of the first non-Beat inputs in the V2 agent input architecture. The load generator should be packaged into an agent load testing integration developed using the existing Elastic package tooling. Any agent can then be load tested by installing the necessary integration.

Automated deployment and provisioning can ideally reuse the same tools used to provision Fleet-managed agents for end-to-end testing, with minimal extra work. When testing against Elasticsearch, the instance used for Fleet and monitoring data should ideally be separate from the instance receiving data from the shipper, to avoid introducing instability into Fleet itself during stress tests.

The performance metrics resulting from each test can be queried out of the agent monitoring indices at the conclusion of each test. Profiles can be periodically collected via agent diagnostics or the /debug/pprof endpoint of the shipper.
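
For illustration, here is a minimal sketch of pulling a CPU profile from the shipper during a test run, assuming the shipper exposes the standard Go net/http/pprof handlers under /debug/pprof; the address and port are placeholders and would come from the shipper's monitoring configuration.

```go
// profile_capture.go: a sketch for saving a 30-second CPU profile from the
// shipper's /debug/pprof endpoint so it can be inspected after a test run.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// Assumed address of the shipper's pprof listener (placeholder).
	const pprofURL = "http://localhost:6060/debug/pprof/profile?seconds=30"

	client := &http.Client{Timeout: 60 * time.Second}
	resp, err := client.Get(pprofURL)
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetching profile:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	out, err := os.Create("shipper-cpu.pprof")
	if err != nil {
		fmt.Fprintln(os.Stderr, "creating output file:", err)
		os.Exit(1)
	}
	defer out.Close()

	// Stream the profile to disk; inspect later with `go tool pprof shipper-cpu.pprof`.
	if _, err := io.Copy(out, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, "writing profile:", err)
		os.Exit(1)
	}
}
```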

The initial version of the agent load testing package will implement only a shipper client, which it will use to write simulated or pre-recorded events at a configurable rate. Multiple tools exist that could be integrated into the load generator input to generate data on demand: stream, the integration corpus generator, spigot, or flog.
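
As a rough sketch of the core of such an input, the loop below emits simulated events at a configurable rate. The publishEvent function is a hypothetical stand-in for the real shipper client (not shown), and the rate, timeout, and payload format are arbitrary placeholders.

```go
// loadgen_sketch.go: a sketch of the load generator's core loop, writing
// simulated events at a configurable rate.
package main

import (
	"context"
	"fmt"
	"time"
)

// publishEvent is a hypothetical placeholder for a call into the shipper
// client; in the real input this would be a gRPC publish request.
func publishEvent(ctx context.Context, event string) error {
	return nil
}

func main() {
	const eventsPerSecond = 1000 // would come from the integration policy

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	ticker := time.NewTicker(time.Second / eventsPerSecond)
	defer ticker.Stop()

	var sent int
	for {
		select {
		case <-ctx.Done():
			fmt.Printf("sent %d events\n", sent)
			return
		case <-ticker.C:
			// Simulated or pre-recorded payloads would be produced here, e.g.
			// by stream, the integration corpus generator, spigot, or flog.
			event := fmt.Sprintf(`{"message":"synthetic event %d"}`, sent)
			if err := publishEvent(ctx, event); err != nil {
				fmt.Println("publish error:", err)
				return
			}
			sent++
		}
	}
}
```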

Future versions of the load testing package can be developed with the load generator input configured to act as the data source for other inputs to pull from. For example, a Filebeat instance could be started and configured to consume data from the load generator using the syslog protocol, enabling tests of the entire agent ingestion system. Stream is already used to test integrations with elastic-package today and could serve as the starting point for this functionality.

Implementation Plan

TBD. Insert a development plan with linked issues, including at least the following high-level tasks:

  1. Develop a load generator agent input, possibly based on https://github.com/elastic/stream and integrating synthetic data generation.
  2. Develop and publish an agent load testing integration. Allow local testing of the load generator input using the elastic-package tool (see https://github.com/elastic/integrations/blob/main/CONTRIBUTING.md).
  3. Allow running performance tests locally, and collecting test results into a report document that can be ingested into Elasticsearch and tracked over time. Use the APM benchmark output format as a reference: Benchmark 2.0 production ready apm-server#7540
  4. Update the existing agent metrics dashboard to include all relevant performance metrics if they are not already present.
  5. Automate running performance tests on a daily basis. The key to integrating performance testing into CI will be creating repeatable hardware conditions, something several teams in Elastic have already solved.
  6. Allow running performance tests on a PR basis, possibly triggered by a dedicated label or as part of the existing E2E test suite.
@cmacknz cmacknz added Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team 8.5-candidate labels Jun 15, 2022
@cmacknz cmacknz changed the title [Meta] Implement shipper performance testing [Meta][Project] Implement shipper performance testing Jun 17, 2022
@cmacknz cmacknz added estimation:Month Task that represents a month of work. and removed estimation:Month Task that represents a month of work. labels Jun 27, 2022
@leehinman leehinman added the estimation:Month Task that represents a month of work. label Jul 19, 2022
@jlind23 jlind23 added v8.6.0 and removed v8.5.0 labels Aug 22, 2022
@cmacknz cmacknz changed the title [Meta][Project] Implement shipper performance testing [Meta] Implement shipper performance testing Sep 20, 2022
@joshdover

I started writing an issue that is quite similar to this one, but the further I got, the less reason I saw to distinguish between the two. One goal I want to ensure is captured here is the ability to compare the current architecture with the shipper architecture. Here was my thinking on this:


As we move to this new architecture, we need to be able to measure the impact on performance at the edge as well as total ingest performance to the output destination. It's important that we do not introduce any significant regressions. Having a simple way to run a benchmark that can surface the performance differences between these architectures is critical to making the shipper GA.

Longer-term we also need to be able to benchmark and optimize our total ingest throughput, which has many variables:

  • The input doing the collection + its configuration
  • The shipper and its configuration
  • Resources available on the host running Agent (CPU, memory, network bandwidth, disk perf)
  • Elasticsearch cluster sizing (number of ingest nodes, CPU, RAM, etc.)
  • Overhead from the ingest pipeline for the destination data stream
  • Shard and ILM configuration for the destination data stream

We will need to be able to run an end-to-end benchmark that encapsulates all of these parameters and produces reliable results. This will likely involve tying together several tools and will require integration work for each input type that Agent supports. This is all out of scope for this issue and will be discussed in further issues. For now we just want something we can use to inform current changes and serve as a starting place for a more complete testing bench.

Goals

  • Establish a baseline for current architecture performance
  • Produce a comparison of the new shipper architecture to the baseline
  • Allow developers to run these benchmarks quickly and easily during development

Implementation

I have some questions & thoughts about the current proposal above, which involves writing a new load generation input:

  1. Will writing a new input be able to use the existing ES output in libbeat so we could compare differences between the old and shipper architecture?
  2. Are there additional queues or buffers in Filebeat or Metricbeat that are changing as part of the integration with the shipper that would not be captured by the load generation input?
  3. I think we should discuss the scope of how we get this running on an ad-hoc basis first, and plan for CI integration later; see my thoughts below.
  4. Would it be faster to skip building a new input and instead use the existing filestream input with a big file?

My basic idea would be to run a test where Agent, with a single filestream input, ingests a large log file (at least 1GB) into a local ES instance, and measure the throughput, total time, CPU consumption, and memory usage. A sketch for generating such a file follows the list below. Details:

  • A docker-compose script that will start Elasticsearch and standalone Agent
  • An index template should be installed for the destination data stream, using the same defaults we use in production (1 shard, 0-1 replicas). These could use the same mappings we use in the default logs index template today (see below)
  • Metrics should be collected from the containers and written to a file on the host's disk. For now this could probably be implemented using Agent's built-in monitoring or a Metricbeat container with the docker module enabled
    • This could easily be changed via config to output to a destination cluster whenever we want to hook this up to CI.
    • I think it may be preferable to use a separate container to do the monitoring to reduce the interference of the monitoring on the test itself.
  • We should probably use CPU and memory limits on both the ES and Agent containers to help improve reproducibility of the test
  • Ability to run the same test with the shipper enabled, instead of the Filebeat ES output.
  • We need to be able to run a custom image for Elastic Agent to do comparisons against main before merging. This doesn't need to be automatic, but some documentation (or a link) about how to build a custom image is necessary for those unfamiliar.
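
As a minimal sketch, the program below generates the large log file mentioned above for the filestream test; the line format, file name, and target size are arbitrary assumptions.

```go
// genlog.go: a sketch for producing a ~1 GiB synthetic log file that the
// filestream input would ingest during the benchmark.
package main

import (
	"bufio"
	"fmt"
	"os"
	"time"
)

func main() {
	const targetBytes = 1 << 30 // ~1 GiB

	f, err := os.Create("benchmark.log")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	// Buffer writes so generation is not dominated by small syscalls.
	w := bufio.NewWriterSize(f, 1<<20)
	defer w.Flush()

	var written int64
	for i := 0; written < targetBytes; i++ {
		line := fmt.Sprintf("%s INFO synthetic-app request_id=%d ip=10.0.0.%d msg=\"benchmark line\"\n",
			time.Now().UTC().Format(time.RFC3339), i, i%255)
		n, err := w.WriteString(line)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		written += int64(n)
	}
}
```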

Logs index template

```json
{
  "template": {
    "settings": {
      "index": {
        "lifecycle": {
          "name": "logs"
        },
        "codec": "best_compression",
        "routing": {
          "allocation": {
            "include": {
              "_tier_preference": "data_hot"
            }
          }
        },
        "query": {
          "default_field": [
            "message"
          ]
        }
      }
    },
    "mappings": {
      "dynamic_templates": [
        {
          "match_ip": {
            "match": "ip",
            "match_mapping_type": "string",
            "mapping": {
              "type": "ip"
            }
          }
        },
        {
          "match_message": {
            "match": "message",
            "match_mapping_type": "string",
            "mapping": {
              "type": "match_only_text"
            }
          }
        },
        {
          "strings_as_keyword": {
            "match_mapping_type": "string",
            "mapping": {
              "ignore_above": 1024,
              "type": "keyword"
            }
          }
        }
      ],
      "date_detection": false,
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "data_stream": {
          "properties": {
            "dataset": {
              "type": "constant_keyword"
            },
            "namespace": {
              "type": "constant_keyword"
            },
            "type": {
              "type": "constant_keyword",
              "value": "logs"
            }
          }
        },
        "ecs": {
          "properties": {
            "version": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        },
        "host": {
          "type": "object"
        }
      }
    },
    "aliases": {}
  }
}
```
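
For completeness, here is a minimal sketch of installing that template into the local benchmark Elasticsearch instance. It assumes the JSON above is saved to logs-template.json and extended with the top-level index_patterns (and data_stream) fields that composable index templates require, and that Elasticsearch listens on localhost:9200 with security disabled; the template name is a placeholder.

```go
// install_template.go: a sketch that PUTs the index template to the
// standard _index_template API of the local benchmark cluster.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"os"
)

func main() {
	body, err := os.ReadFile("logs-template.json")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	req, err := http.NewRequest(http.MethodPut,
		"http://localhost:9200/_index_template/benchmark-logs", bytes.NewReader(body))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```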

@cmacknz
Member Author

cmacknz commented Oct 12, 2022

  1. Will writing a new input be able to use the existing ES output in libbeat so we could compare differences between the old and shipper architecture?
  2. Are there additional queues or buffers in Filebeat or Metricbeat that are changing as part of the integration with the shipper that would not be captured by the load generation input?

The intent was to test the shipper first, and then allow the load generator input to serve as a data source for the Beat or future v2 inputs afterwards.

  1. I think we should discuss the scope of how we get this running on an ad-hoc basis first, and plan for CI integration later, see my thoughts below.
  2. Would it be faster to skip building a new input and instead use the existing filestream input with a big file?

Agreed, setting up something with Filebeat would be faster at this point. The intent of the original design was to allow us to drive the input v2 project in a way that supports the shipper, and also to create a reusable integration that can serve as a test data source for input development.

I still like the original idea and think it has value, but it isn't the fastest path at this point in time. Among other things, the load testing input is blocked on integration with the agent release process (this is the largest blocker), adding shipper support to the agent, and completion of the shipper itself. The skeleton of the load generator input itself already exists.

We can come back to the idea of a load testing integration once we get a basic proof of concept running using pieces that already exist.

@cmacknz
Member Author

cmacknz commented Oct 12, 2022

With a log-file-based test we may end up simply measuring the speed of the underlying disk, and not the true events-per-second limit of the shipper. I don't mind it as a starting point, though.

@cmacknz
Member Author

cmacknz commented Oct 13, 2022

@leehinman has done some similar basic Filebeat throughput tests in the past (see #30 (comment) for one example).

@leehinman is there anything you can add to Josh's comment above on getting a basic test set up to quickly compare the performance of the agent with and without the shipper? I assume you were using https://github.com/leehinman/spigot to generate the data for your tests?

@joshdover

With a log-file-based test we may end up simply measuring the speed of the underlying disk, and not the true events-per-second limit of the shipper. I don't mind it as a starting point, though.

This is my biggest concern with my proposal as well at the higher worker and bulk_max_size configurations. That said, at the default of 1 worker we shouldn't be saturating the disk capacity and could still make some useful comparisons.

Definitely agree on the long-term value of the load generator input. My goal right now is finding a shorter path forward to ensure we're not introducing regressions end-to-end, including the integration code on the (real) input side, aka libbeat.

@leehinman
Contributor

@joshdover we could easily put spigot in a Docker container. Then it can produce the logs (File, S3-bucket, or Syslog TCP/UDP). That way you don't have to have large files hanging around.

I wouldn't worry about measuring the speed of the underlying disk on the input side, it is more likely that you will be measuring the index, ingest pipeline and write speed on the Elasticsearch side.

The elastic-package system tests already do almost everything you are proposing. Even if we don't want to use elastic-package, we should be able to borrow liberally to get this up and running quickly.

@jlind23 jlind23 added the Meta label Jan 3, 2023
@cmacknz
Member Author

cmacknz commented Jan 26, 2023

Closing this. We will use our new agent performance testing framework to perform this testing.

@cmacknz cmacknz closed this as completed Jan 26, 2023