[Meta] Implement shipper performance testing #57
Comments
I started writing an issue that is quite similar to this one, but the further I get, the less reason I see to distinguish between the two. One goal I want to ensure is captured here is the ability to compare the current architecture with the shipper architecture. Here was my thinking on this: As we move to this new architecture, we need to be able to measure the impact on performance at the edge as well as total ingest performance to the output destination. It's important that we do not introduce any significant regressions. Having a simple way to run a benchmark that can surface the performance differences between these architectures is critical to making the shipper GA. Longer-term we also need to be able to benchmark and optimize our total ingest throughput, which has many variables.
We will need to be able to run an end-to-end benchmark that encapsulates all of these variables and produces reliable results. This will likely involve tying together several tools and will require integration work for each input type that Agent supports. This is all out of scope for this issue and will be discussed in follow-up issues. For now we just want something we can use to inform current changes and to serve as a starting place for a more complete testing bench.
Goals
Implementation
I have some questions and thoughts about the current proposal above, which involves writing a new load generation input.
My basic idea would be to run a test that runs Agent with a single log file input, indexing into a data stream that uses the logs index template below.
Logs index template:

```json
{
  "template": {
    "settings": {
      "index": {
        "lifecycle": {
          "name": "logs"
        },
        "codec": "best_compression",
        "routing": {
          "allocation": {
            "include": {
              "_tier_preference": "data_hot"
            }
          }
        },
        "query": {
          "default_field": [
            "message"
          ]
        }
      }
    },
    "mappings": {
      "dynamic_templates": [
        {
          "match_ip": {
            "match": "ip",
            "match_mapping_type": "string",
            "mapping": {
              "type": "ip"
            }
          }
        },
        {
          "match_message": {
            "match": "message",
            "match_mapping_type": "string",
            "mapping": {
              "type": "match_only_text"
            }
          }
        },
        {
          "strings_as_keyword": {
            "match_mapping_type": "string",
            "mapping": {
              "ignore_above": 1024,
              "type": "keyword"
            }
          }
        }
      ],
      "date_detection": false,
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "data_stream": {
          "properties": {
            "dataset": {
              "type": "constant_keyword"
            },
            "namespace": {
              "type": "constant_keyword"
            },
            "type": {
              "type": "constant_keyword",
              "value": "logs"
            }
          }
        },
        "ecs": {
          "properties": {
            "version": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        },
        "host": {
          "type": "object"
        }
      }
    },
    "aliases": {}
  }
}
```
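As a rough illustration of the Filebeat-based comparison discussed in the following comments, a minimal benchmark configuration could look like the sketch below. The host, path, and credential values are placeholders, and the worker and bulk_max_size values shown are the defaults that a comparison run would sweep upward from:

```yaml
filebeat.inputs:
  - type: filestream
    id: bench-input
    paths:
      - /var/tmp/bench/*.log   # pre-generated log files to replay

output.elasticsearch:
  hosts: ["https://bench-es.example.com:9200"]   # placeholder benchmark cluster
  api_key: "${BENCH_API_KEY}"                    # placeholder credentials
  worker: 1            # default; raise to test higher output parallelism
  bulk_max_size: 1600  # default batch size; the other knob under comparison

http.enabled: true     # expose libbeat pipeline metrics during the run
http.port: 5066
```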
The intent was to test the shipper first, and then allow the load generator input to serve as a data source for the Beat or future v2 inputs afterwards.
Agreed, setting up something with Filebeat would be faster at this point. The intent of the original design was to allow us to drive the input v2 project in a way that supports the shipper, and also to create a reusable integration that can serve as a test data source for input development. I still like the original idea and think it has value, but it isn't the fastest path at this point in time. Among other things, the load testing input is blocked on integration with the agent release process (this is the largest blocker), adding shipper support to the agent, and completion of the shipper itself. The skeleton of the load generator input itself already exists. We can come back to the idea of a load testing integration once we get a basic proof of concept running using pieces that already exist.
With a log-file-based test we may end up simply measuring the speed of the underlying disk, and not the true events-per-second limit of the shipper. I don't mind it as a starting point, though.
@leehinman has done some similar basic Filebeat throughput tests in the past (see #30 (comment) for one example). @leehinman is there anything you can add to Josh's comment above on getting a basic test set up to quickly compare the performance of the agent with and without the shipper? I assume you were using https://github.com/leehinman/spigot to generate the data for your tests?
This is my biggest concern with my proposal as well, at the higher worker and bulk_max_size configurations. That said, at the default of 1 worker we shouldn't be saturating the disk capacity and could still make some useful comparisons. Definitely agree on the long-term value of the load generator input. My goal right now is finding a shorter path forward to ensure we're not introducing regressions end-to-end, including the integration code on the (real) input side, aka libbeat.
@joshdover we could easily put spigot in a Docker container. Then it can produce the logs (File, S3-bucket, or Syslog TCP/UDP). That way you don't have to have large files hanging around. I wouldn't worry about measuring the speed of the underlying disk on the input side; it is more likely that you will be measuring the index, ingest pipeline, and write speed on the Elasticsearch side.
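A minimal sketch of that idea, assuming a locally built spigot image (no official image is implied) and a named volume shared between the generator and the agent:

```yaml
# docker-compose sketch; the spigot image tag and the agent version
# are placeholders, not published artifacts.
services:
  generator:
    image: spigot:local              # hypothetical locally built spigot image
    volumes:
      - bench-logs:/var/log/bench    # generator writes rotating log files here
  agent:
    image: docker.elastic.co/beats/elastic-agent:8.5.0
    volumes:
      - bench-logs:/var/log/bench:ro # the log input tails the generated files
volumes:
  bench-logs:
```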
Closing this. We will use our new agent performance testing framework to perform this testing.
The Elastic Agent data shipper is actively under development, and we need a way to benchmark its performance as part of the agent system. Specifically, we are interested in benchmarking the achievable throughput of a single agent using the shipper, along with its CPU, memory, and disk IOPS overhead. Users care about the performance of the agent, and we need a way to measure and improve it.
Design
The proposed solution is to develop a new load generating input for the agent, which can be installed and configured as a standard agent integration. The test scenario can be changed by modifying the integration configuration or agent policy. Metrics will be collected using the existing agent monitoring features. Where the existing agent monitoring is not adequate, it should be enhanced so that all data necessary to diagnose performance issues is also available in the field. For example, all performance data should be available in the existing agent metrics dashboard.
The new load generating input should be developed as one of the first non-beat inputs in the V2 agent input architecture. The load generator should be packaged into an agent load testing integration developed using the existing Elastic package tooling. Any agent is then capable of being load tested via installing the necessary integration.
Automated deployment and provisioning can ideally reuse the same tools used to provision Fleet managed agents for end-to-end testing with minimal extra work. When testing Elasticsearch, ideally the instance used for fleet and monitoring data is separate from the instance receiving data from the shipper to avoid introducing instability into Fleet itself during stress tests.
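In standalone mode this split can be expressed directly in elastic-agent.yml by routing self-monitoring to its own output; a sketch with placeholder hosts and credentials:

```yaml
outputs:
  default:
    type: elasticsearch
    hosts: ["https://bench-es.example.com:9200"]        # receives the load-test data
    api_key: "${BENCH_API_KEY}"
  monitoring:
    type: elasticsearch
    hosts: ["https://monitoring-es.example.com:9200"]   # separate cluster for monitoring data
    api_key: "${MONITORING_API_KEY}"

agent.monitoring:
  enabled: true
  logs: true
  metrics: true
  use_output: monitoring   # keeps monitoring traffic off the cluster under test
```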
The performance metrics resulting from each test can be queried out of the agent monitoring indices at the conclusion of each test. Profiles can be periodically collected via agent diagnostics or the /debug/pprof endpoint of the shipper.
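For example, a Kibana Dev Tools query along these lines could pull throughput counters out of the monitoring data stream at the end of a run; the index pattern and field names here are assumptions that depend on the stack version and policy namespace:

```console
GET metrics-elastic_agent.filebeat-default/_search
{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-30m" } } },
  "aggs": {
    "acked_events":  { "max": { "field": "beat.stats.libbeat.output.events.acked" } },
    "active_events": { "avg": { "field": "beat.stats.libbeat.pipeline.events.active" } }
  }
}
```

Likewise, if the shipper exposes the standard Go pprof handlers, `go tool pprof http://127.0.0.1:6060/debug/pprof/profile` would capture a CPU profile (the port is an assumption).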
The initial version of the agent load testing package will implement only a shipper client, which it will use to write simulated or pre-recorded events at a configurable rate. Multiple tools exist that could be integrated into the load generator input to generate data on demand: stream, the integration corpus generator, spigot, or flog.
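To make the shape of this concrete, a hypothetical policy fragment for such an input might look like the sketch below; the input type name and every setting are illustrative only, since the input does not exist yet:

```yaml
inputs:
  - type: load-generator          # hypothetical input type name
    id: load-generator-bench
    data_stream:
      dataset: load_test.events   # illustrative dataset name
    events_per_second: 10000      # assumed knob: target rate written to the shipper
    event_size_bytes: 512         # assumed knob: size of each simulated event
    duration: 10m                 # assumed knob: stop the run after a fixed interval
```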
Future versions of the load testing package can be developed with the load generator input configured to act as the data source for other inputs to pull from. For example, a Filebeat instance could be started and configured to consume data from the load generator using the syslog protocol, enabling tests of the entire agent ingestion system. Stream is already used to test integrations with elastic-package today and could serve as the starting point for this functionality.
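The Filebeat side of that syslog pairing is ordinary configuration; a minimal sketch with a placeholder port:

```yaml
filebeat.inputs:
  - type: syslog
    protocol.tcp:
      host: "0.0.0.0:9001"   # placeholder port the load generator writes syslog events to
```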
Implementation Plan
TBD. Insert a development plan with linked issues, including at least the following high level tasks:
Create the load testing integration with the elastic-package tool (see https://github.com/elastic/integrations/blob/main/CONTRIBUTING.md).