Should we allow telemetry devices to output high-cardinality data to the results? #1431

Closed
pquentin opened this issue Feb 3, 2022 · 2 comments


pquentin commented Feb 3, 2022

During a race, Rally stores information in multiple Elasticsearch indices:

  • rally-races-YYYY-MM: metadata about the race, including results, in a single doc
  • rally-results-YYYY-MM: individual results docs: typically one hundred per race. Results are written in the textual summary report at the end of each race, where we show up to 15 lines per task (error rate, min/mean/median/max throughput and p50/p90/p99/p99.9/p100 latency/service time). We also have tooling to compare results.
  • rally-metrics-YYYY-MM: individual metrics docs: typically multiple millions (!) for long-running races. Metrics are... less pleasant to work with. You need access to the metrics store and then need to figure out your own queries or visualization. There's also no tool to compare metrics between races.
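
To give a sense of what "figuring out your own queries" means in practice, here is a minimal sketch of a search body that selects one metric for one race. The helper `build_metric_query` is hypothetical, and the field names `race-id` and `name` are assumptions to check against the mapping of your own metrics store.

```python
def build_metric_query(race_id, metric_name, size=100):
    """Build an Elasticsearch search body for one metric of one race.

    Assumption: metrics docs carry "race-id" and "name" keyword fields.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                # Filters do not affect scoring, which is fine for exact matches.
                "filter": [
                    {"term": {"race-id": race_id}},
                    {"term": {"name": metric_name}},
                ]
            }
        },
    }


# Placeholder values, not real identifiers.
body = build_metric_query("hypothetical-race-id", "hypothetical_metric")
```

The resulting dict could be sent to an index pattern such as `rally-metrics-*` with whichever client you already use.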

Given this, when working on #1428, @nik9000 decided to store the collected data of his new telemetry devices in the results. That way, it shows up in the summary report, it's easy to compare and he does not have to worry about the metrics store. In his own words: "i'd love an option I think to get all the info to print. in my normal workflow I don't touch the metric store and really just want to print things."

So, what should we do about it?

  1. Allow non-default telemetry devices to somehow put their metrics in the report?
  2. Add a command to show data for a specific telemetry device in text form? After all, if Rally can write to it, it can read from it.
  3. Special case the disk usage telemetry device and just dump its metrics? @jpountz mentioned this specific device was interesting to him to diagnose our nightly benchmarks.

I'm personally more in favor of option 2.
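
As a rough illustration of option 2, the text rendering could be as simple as the sketch below. It assumes the metric docs have already been fetched and that each one has `name`, `value` and `unit` keys; `render_metrics` and the sample data are hypothetical and not part of Rally.

```python
def render_metrics(docs):
    """Format metric docs (dicts with name, value, unit) as aligned text rows."""
    if not docs:
        return ""
    # Right-align names to the longest one so the columns line up.
    width = max(len(d["name"]) for d in docs)
    lines = []
    for d in docs:
        lines.append(f"{d['name']:>{width}} | {d['value']:>10.1f} | {d['unit']}")
    return "\n".join(lines)


# Made-up sample docs, for illustration only.
sample = [
    {"name": "hypothetical_metric_a", "value": 12.5, "unit": "MB"},
    {"name": "metric_b", "value": 3.0, "unit": "MB"},
]
print(render_metrics(sample))
```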

pquentin added the discuss (Needs further clarification from the team) label Feb 3, 2022
pquentin added this to the 2.x milestone Feb 3, 2022

nik9000 commented Feb 3, 2022

It's important to me at least to be able to compare the results here. Yesterday I put together esrally compare support for the field-disk-usage prototype I'm working on and the output was quite educational:

|     tsdb @timestamp doc values |  | 409.3 MB | 386.3 MB |  -23.0 MB | |  -5.61% |
|         tsdb @timestamp points |  | 423.0 MB | 397.9 MB |  -25.0 MB | |  -5.91% |
|          tsdb @timestamp total |  | 832.2 MB | 784.2 MB |  -48.0 MB | |  -5.77% |
|        tsdb _seq_no doc values |  | 409.3 MB | 386.3 MB |  -23.0 MB | |  -5.61% |
|            tsdb _seq_no points |  | 590.1 MB | 557.5 MB |  -32.6 MB | |  -5.52% |
|             tsdb _seq_no total |  | 999.4 MB | 943.8 MB |  -55.6 MB | |  -5.56% |
| tsdb event.duration doc values |  | 555.4 MB | 548.4 MB |   -7.1 MB | |  -1.27% |
|     tsdb event.duration points |  | 637.4 MB | 629.7 MB |   -7.7 MB | |  -1.21% |
|      tsdb event.duration total |  |   1.2 GB |   1.2 GB |  -14.8 MB | |  -1.24% |
|        tsdb _id inverted index |  | 863.5 MB |   1.0 GB | +176.4 MB | | +20.42% |
|         tsdb _id stored fields |  | 632.4 MB | 610.9 MB |  -21.5 MB | |  -3.40% |
|                 tsdb _id total |  |   1.5 GB |   1.6 GB | +154.9 MB | | +10.35% |
|     tsdb _source stored fields |  |  25.9 GB |  24.1 GB |   -1.8 GB | |  -6.94% |
|             tsdb _source total |  |  25.9 GB |  24.1 GB |   -1.8 GB | |  -6.94% |
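
For readers unfamiliar with this kind of comparison, the last two value columns are presumably the absolute difference and the change relative to the baseline. A minimal sketch of that arithmetic follows; `compare` and `format_row` are hypothetical helpers, not the code behind `esrally compare`.

```python
def compare(baseline, contender):
    """Return (absolute diff, percentage change relative to baseline)."""
    diff = contender - baseline
    # Guard against a zero baseline, where a percentage is undefined.
    pct = (diff / baseline * 100.0) if baseline else float("nan")
    return diff, pct


def format_row(name, baseline, contender, unit):
    """Render one comparison row with signed diff and percentage."""
    diff, pct = compare(baseline, contender)
    return (
        f"| {name} | {baseline:.1f} {unit} | {contender:.1f} {unit} "
        f"| {diff:+.1f} {unit} | {pct:+.2f}% |"
    )


# Made-up numbers, not taken from the table above.
print(format_row("example", 200.0, 150.0, "MB"))
```

Recomputing the percentages from the rounded sizes shown in the table can differ in the last digit, since the real values are most likely computed from unrounded byte counts.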

@pquentin
Member Author

> Allow non-default telemetry devices to somehow put their metrics in the report?

After thinking about it and discussing it with @danielmitterdorfer, this is fine for useful non-default telemetry devices in general and #1428 in particular. As Nik showed, this is really useful when you care about it. For more exotic cases, #1224 would be the way to go.
