This repository has been archived by the owner on Sep 21, 2023. It is now read-only.

[Meta] Implement shipper performance testing #57

Closed · Tracked by #118
cmacknz opened this issue Jun 15, 2022 · 7 comments
Labels
estimation:Month (Task that represents a month of work.) · Meta · Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team) · v8.6.0

Comments

@cmacknz
Member

cmacknz commented Jun 15, 2022

The Elastic agent data shipper is actively under development and we need a way to benchmark its performance as part of the agent system. Specifically we are interested in benchmarking the achievable throughput of a single agent using the shipper along with its CPU, memory, and disk IOPS overhead. Users care about the performance of the agent and we need a way to measure and improve it.

Design

The proposed solution is to develop a new load generating input for the agent, which can be installed and configured as a standard agent integration. The test scenario can be changed by modifying the integration configuration or agent policy. Metrics will be collected using the existing agent monitoring features. Where the existing agent monitoring is not adequate, it should be enhanced so that all data necessary to diagnose performance issues is also available in the field. For example, all performance data should be available in the existing agent metrics dashboard.

[Diagram: Data Shipper Performance Testing]

The new load generating input should be developed as one of the first non-Beat inputs in the V2 agent input architecture. The load generator should be packaged into an agent load testing integration developed using the existing Elastic package tooling. Any agent can then be load tested by installing the necessary integration.

Automated deployment and provisioning can ideally reuse the same tools used to provision Fleet-managed agents for end-to-end testing, with minimal extra work. When testing against Elasticsearch, the instance used for Fleet and monitoring data should ideally be separate from the instance receiving data from the shipper, to avoid introducing instability into Fleet itself during stress tests.

The performance metrics resulting from each test can be queried out of the agent monitoring indices at the conclusion of each test. Profiles can be periodically collected via agent diagnostics or the /debug/pprof endpoint of the shipper.
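
For illustration, here is a minimal sketch of pulling a CPU profile from the shipper during a test run, assuming the shipper exposes the standard Go net/http/pprof handlers under /debug/pprof; the address and port are placeholders and would come from the shipper's monitoring configuration.

```go
// profile_capture.go: a sketch for saving a 30-second CPU profile from the
// shipper's /debug/pprof endpoint so it can be inspected after a test run.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// Assumed address of the shipper's pprof listener (placeholder).
	const pprofURL = "http://localhost:6060/debug/pprof/profile?seconds=30"

	client := &http.Client{Timeout: 60 * time.Second}
	resp, err := client.Get(pprofURL)
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetching profile:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	out, err := os.Create("shipper-cpu.pprof")
	if err != nil {
		fmt.Fprintln(os.Stderr, "creating output file:", err)
		os.Exit(1)
	}
	defer out.Close()

	// Stream the profile to disk; inspect later with `go tool pprof shipper-cpu.pprof`.
	if _, err := io.Copy(out, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, "writing profile:", err)
		os.Exit(1)
	}
}
```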

The initial version of the agent load testing package will implement only a shipper client, which it will use to write simulated or pre-recorded events at a configurable rate. Multiple tools exist that could be integrated into the load generator input to generate data on demand: stream, the integration corpus generator, spigot, or flog.
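
As a rough sketch of the core of such an input, the loop below emits simulated events at a configurable rate. The publishEvent function is a hypothetical stand-in for the real shipper client (not shown), and the rate, timeout, and payload format are arbitrary placeholders.

```go
// loadgen_sketch.go: a sketch of the load generator's core loop, writing
// simulated events at a configurable rate.
package main

import (
	"context"
	"fmt"
	"time"
)

// publishEvent is a hypothetical placeholder for a call into the shipper
// client; in the real input this would be a gRPC publish request.
func publishEvent(ctx context.Context, event string) error {
	return nil
}

func main() {
	const eventsPerSecond = 1000 // would come from the integration policy

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	ticker := time.NewTicker(time.Second / eventsPerSecond)
	defer ticker.Stop()

	var sent int
	for {
		select {
		case <-ctx.Done():
			fmt.Printf("sent %d events\n", sent)
			return
		case <-ticker.C:
			// Simulated or pre-recorded payloads would be produced here, e.g.
			// by stream, the integration corpus generator, spigot, or flog.
			event := fmt.Sprintf(`{"message":"synthetic event %d"}`, sent)
			if err := publishEvent(ctx, event); err != nil {
				fmt.Println("publish error:", err)
				return
			}
			sent++
		}
	}
}
```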

Future versions of the load testing package can be developed with the load generator input configured to act as the data source for other inputs to pull from. For example, a Filebeat instance could be started and configured to consume data from the load generator using the syslog protocol, enabling tests of the entire agent ingestion system. Stream is already used to test integrations with elastic-package today and could serve as the starting point for this functionality.

Implementation Plan

TBD. Insert a development plan with linked issues, including at least the following high-level tasks:

  1. Develop a load generator agent input, possibly based on https://github.com/elastic/stream and integrating synthetic data generation.
  2. Develop and publish an agent load testing integration. Allow local testing of the load generator input using the elastic-package tool (see https://github.com/elastic/integrations/blob/main/CONTRIBUTING.md).
  3. Allow running performance tests locally, and collecting test results into a report document that can be ingested into Elasticsearch and tracked over time. Use the APM benchmark output format as a reference: Benchmark 2.0 production ready apm-server#7540
  4. Update the existing agent metrics dashboard to include all relevant performance metrics if they are not already present.
  5. Automate running performance tests on a daily basis. The key to integrating performance testing into CI will be creating repeatable hardware conditions, something several teams in Elastic have already solved.
  6. Allow running performance tests on a PR basis, possibly triggered by a dedicated label or as part of the existing E2E test suite.
@cmacknz cmacknz added Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team 8.5-candidate labels Jun 15, 2022
@cmacknz cmacknz changed the title [Meta] Implement shipper performance testing [Meta][Project] Implement shipper performance testing Jun 17, 2022
@cmacknz cmacknz added estimation:Month Task that represents a month of work. and removed estimation:Month Task that represents a month of work. labels Jun 27, 2022
@leehinman leehinman added the estimation:Month Task that represents a month of work. label Jul 19, 2022
@jlind23 jlind23 added v8.6.0 and removed v8.5.0 labels Aug 22, 2022
@cmacknz cmacknz changed the title [Meta][Project] Implement shipper performance testing [Meta] Implement shipper performance testing Sep 20, 2022
@joshdover

I started writing an issue that is quite similar to this one, but the further I got, the less reason I saw to distinguish between the two. One goal I want to ensure is captured here is the ability to compare the current architecture with the shipper architecture. Here was my thinking on this:


As we move to this new architecture, we need to be able to measure the impact on performance at the edge as well as total ingest performance to the output destination. It's important that we do not introduce any significant regressions. Having a simple way to run a benchmark that can surface the performance differences between these architectures is critical to making the shipper GA.

Longer-term we also need to be able to benchmark and optimize our total ingest throughput, which has many variables:

  • The input doing the collection + its configuration
  • The shipper and its configuration
  • Resources available on the host running Agent (CPU, memory, network bandwidth, disk perf)
  • Elasticsearch cluster sizing (number of ingest nodes, CPU, RAM, etc.)
  • Overhead from the ingest pipeline for the destination data stream
  • Shard and ILM configuration for the destination data stream

We will need to be able to run an end-to-end benchmark that encapsulates all of these parameters and produces reliable results. This will likely involve tying together several tools and will require integration work for each input type that Agent supports. This is all out of scope for this issue and will be discussed in further issues. For now we just want something we can use to inform current changes and serve as a starting place for a more complete testing bench.

Goals

  • Establish a baseline for current architecture performance
  • Produce a comparison of the new shipper architecture to the baseline
  • Allow developers to run these benchmarks quickly and easily during development

Implementation

I have some questions & thoughts about the current proposal above, which involves writing a new load generation input:

  1. Will writing a new input be able to use the existing ES output in libbeat so we could compare differences between the old and shipper architecture?
  2. Are there additional queues or buffers in Filebeat or Metricbeat that are changing as part of the integration with the shipper that would not be captured by the load generation input?
  3. I think we should discuss the scope of how we get this running on an ad-hoc basis first, and plan for CI integration later; see my thoughts below.
  4. Would it be faster to skip building a new input and instead use the existing filestream input with a big file?

My basic idea would be to run a test where Agent, with a single filestream input, ingests a large log file (at least 1GB) into a local ES instance, and measure the throughput, total time, CPU consumption, and memory usage. A sketch for generating such a file follows the list below. Details:

  • A docker-compose script that will start Elasticsearch and standalone Agent
  • An index template should be installed for the destination data stream, using the same defaults we use in production (1 shard, 0-1 replicas). These could use the same mappings we use in the default logs index template today (see below)
  • Metrics should be collected from the containers and written to a file on the host's disk. For now this could probably be implemented using Agent's built-in monitoring or a Metricbeat container with the docker module enabled
    • This could easily be changed via config to output to a destination cluster whenever we want to hook this up to CI.
    • I think it may be preferable to use a separate container to do the monitoring to reduce the interference of the monitoring on the test itself.
  • We should probably use CPU and memory limits on both the ES and Agent containers to help improve reproducibility of the test
  • Ability to run the same test with the shipper enabled, instead of the Filebeat ES output.
  • We need to be able to run a custom image for Elastic Agent to do comparisons against main before merging. This doesn't need to be automatic, but some documentation (or a link) about how to build a custom image is necessary for those unfamiliar.
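
As a minimal sketch, the program below generates the large log file mentioned above for the filestream test; the line format, file name, and target size are arbitrary assumptions.

```go
// genlog.go: a sketch for producing a ~1 GiB synthetic log file that the
// filestream input would ingest during the benchmark.
package main

import (
	"bufio"
	"fmt"
	"os"
	"time"
)

func main() {
	const targetBytes = 1 << 30 // ~1 GiB

	f, err := os.Create("benchmark.log")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	// Buffer writes so generation is not dominated by small syscalls.
	w := bufio.NewWriterSize(f, 1<<20)
	defer w.Flush()

	var written int64
	for i := 0; written < targetBytes; i++ {
		line := fmt.Sprintf("%s INFO synthetic-app request_id=%d ip=10.0.0.%d msg=\"benchmark line\"\n",
			time.Now().UTC().Format(time.RFC3339), i, i%255)
		n, err := w.WriteString(line)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		written += int64(n)
	}
}
```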

Logs index template

```json
{
  "template": {
    "settings": {
      "index": {
        "lifecycle": {
          "name": "logs"
        },
        "codec": "best_compression",
        "routing": {
          "allocation": {
            "include": {
              "_tier_preference": "data_hot"
            }
          }
        },
        "query": {
          "default_field": [
            "message"
          ]
        }
      }
    },
    "mappings": {
      "dynamic_templates": [
        {
          "match_ip": {
            "match": "ip",
            "match_mapping_type": "string",
            "mapping": {
              "type": "ip"
            }
          }
        },
        {
          "match_message": {
            "match": "message",
            "match_mapping_type": "string",
            "mapping": {
              "type": "match_only_text"
            }
          }
        },
        {
          "strings_as_keyword": {
            "match_mapping_type": "string",
            "mapping": {
              "ignore_above": 1024,
              "type": "keyword"
            }
          }
        }
      ],
      "date_detection": false,
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "data_stream": {
          "properties": {
            "dataset": {
              "type": "constant_keyword"
            },
            "namespace": {
              "type": "constant_keyword"
            },
            "type": {
              "type": "constant_keyword",
              "value": "logs"
            }
          }
        },
        "ecs": {
          "properties": {
            "version": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        },
        "host": {
          "type": "object"
        }
      }
    },
    "aliases": {}
  }
}
```
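
For completeness, here is a minimal sketch of installing that template into the local benchmark Elasticsearch instance. It assumes the JSON above is saved to logs-template.json and extended with the top-level index_patterns (and data_stream) fields that composable index templates require, and that Elasticsearch listens on localhost:9200 with security disabled; the template name is a placeholder.

```go
// install_template.go: a sketch that PUTs the index template to the
// standard _index_template API of the local benchmark cluster.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"os"
)

func main() {
	body, err := os.ReadFile("logs-template.json")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	req, err := http.NewRequest(http.MethodPut,
		"http://localhost:9200/_index_template/benchmark-logs", bytes.NewReader(body))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```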

@cmacknz
Member Author

cmacknz commented Oct 12, 2022

  1. Will writing a new input be able to use the existing ES output in libbeat so we could compare differences between the old and shipper architecture?
  2. Are there additional queues or buffers in Filebeat or Metricbeat that are changing as part of the integration with the shipper that would not be captured by the load generation input?

The intent was to test the shipper first, and then allow the load generator input to serve as a data source for the Beat or future v2 inputs afterwards.

  1. I think we should discuss the scope of how we get this running on an ad-hoc basis first, and plan for CI integration later, see my thoughts below.
  2. Would it be faster to skip building a new input and instead use the existing filestream input with a big file?

Agreed, setting up something with Filebeat would be faster at this point. The intent of the original design was to allow us to drive the input v2 project in a way that supports the shipper, and also to create a reusable integration that can serve as a test data source for input development.

I still like the original idea and think it has value, but it isn't the fastest path at this point in time. Among other things, the load testing input is blocked on integration with the agent release process (this is the largest blocker), adding shipper support to the agent, and completion of the shipper itself. The skeleton of the load generator input itself already exists.

We can come back to the idea of a load testing integration once we get a basic proof of concept running using pieces that already exist.

@cmacknz
Member Author

cmacknz commented Oct 12, 2022

With a log-file-based test we may end up simply measuring the speed of the underlying disk, and not the true events-per-second limit of the shipper. I don't mind it as a starting point, though.

@cmacknz
Member Author

cmacknz commented Oct 13, 2022

@leehinman has done some similar basic Filebeat throughput tests in the past (see #30 (comment) for one example).

@leehinman is there anything you can add to Josh's comment above on getting a basic test set up to quickly compare the performance of the agent with and without the shipper? I assume you were using https://github.com/leehinman/spigot to generate the data for your tests?

@joshdover

With a log-file-based test we may end up simply measuring the speed of the underlying disk, and not the true events-per-second limit of the shipper. I don't mind it as a starting point, though.

This is my biggest concern with my proposal as well at the higher worker and bulk_max_size configurations. That said, at the default of 1 worker we shouldn't be saturating the disk capacity and could still make some useful comparisons.

Definitely agree on the long-term value of the load generator input. My goal right now is finding a shorter path forward to ensure we're not introducing regressions end-to-end, including the integration code on the (real) input side, aka libbeat.

@leehinman
Contributor

@joshdover we could easily put spigot in a Docker container. Then it can produce the logs (File, S3-bucket, or Syslog TCP/UDP). That way you don't have to have large files hanging around.

I wouldn't worry about measuring the speed of the underlying disk on the input side, it is more likely that you will be measuring the index, ingest pipeline and write speed on the Elasticsearch side.

The elastic-package system tests already do almost everything you are proposing. Even if we don't want to use elastic-package, we should be able to borrow liberally to get this up and running quickly.

@jlind23 jlind23 added the Meta label Jan 3, 2023
@cmacknz
Member Author

cmacknz commented Jan 26, 2023

Closing this. We will use our new agent performance testing framework to perform this testing.

@cmacknz cmacknz closed this as completed Jan 26, 2023