Enhance usability of runtime telemetry devices #1248

dliappis · 2021-04-21T11:31:23Z

Problem statement

Currently, Rally's telemetry devices run for the entire duration of a benchmark, and the only thing we can influence is the sampling interval.

This has a number of problems, especially when a user needs to "zoom" into telemetry data for a chosen subset of benchmark tasks:

1. It's difficult to locate the start and end time of one or more benchmark tasks within the entirety of the collected telemetry stats.
2. If a task is short-lived (e.g. a query that finishes in <1s), even the shortest sampling interval (1s) won't help: it's likely that only part (or none) of the telemetry stats are collected during the execution of the task.
3. Benchmarks that include one or more long-running tasks (e.g. recovering a large snapshot) waste metrics store storage by collecting telemetry for a long period of time, whereas we are usually interested only in telemetry during the execution of specific tasks.
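
To make the short-lived task problem concrete, here is a minimal sketch in plain Python. It is not Rally code: the function name and the timestamps are made up for illustration, and it only assumes a sampler that fires at a fixed interval for the whole benchmark.

```python
def samples_within(task_start, task_end, first_sample, interval):
    """Return the sample timestamps (seconds) that fall inside the task window."""
    samples = []
    t = first_sample
    while t <= task_end:
        if t >= task_start:
            samples.append(t)
        t += interval
    return samples


# A 0.4 s query starting at t=10.3 s, with a sampler firing every 1 s from t=0.
short_task = samples_within(10.3, 10.7, first_sample=0.0, interval=1.0)

# A 5 s task later in the same run.
long_task = samples_within(20.0, 25.0, first_sample=0.0, interval=1.0)

print(len(short_task), len(long_task))
```

With these illustrative numbers, the short query falls entirely between two samples and gets no telemetry at all, while the longer task is covered by several samples.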

In this (meta) ticket we'll brainstorm ideas and come up with a list of tasks to solve, or at least mitigate, this problem.

Relates (somewhat): #1224

Task breakdown

TBD

dliappis · 2021-04-21T12:28:50Z

This was discussed in a recent team meeting. I'll summarize the discussion so far, so that we can keep discussing async and hash out the desired approach.

Short term (workarounds)

One possible low-hanging fruit is to ensure that all metrics include a task (and maybe also operation and operation-type) field, similar to the properties pushed for latency, service_time, etc.
This would help with item 1 in the issue description, but doesn't solve items 2 and 3.
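
A minimal sketch of what this tagging could enable, assuming telemetry samples are plain dictionaries. The sample shape, the metric name, and both helper functions are hypothetical; only the task, operation and operation-type field names come from the proposal above.

```python
def tag_sample(sample, task, operation, operation_type):
    """Return a copy of a telemetry sample annotated with task metadata."""
    tagged = dict(sample)
    tagged.update({
        "task": task,
        "operation": operation,
        "operation-type": operation_type,
    })
    return tagged


def samples_for_task(samples, task):
    """Select the samples that were recorded while the given task was running."""
    return [s for s in samples if s.get("task") == task]


# Hypothetical samples; the metric name and values are illustrative only.
raw = [
    {"name": "jvm_heap_used", "value": 512, "timestamp": 100},
    {"name": "jvm_heap_used", "value": 640, "timestamp": 101},
]
tagged = [
    tag_sample(raw[0], "bulk-index", "bulk-index", "bulk"),
    tag_sample(raw[1], "simple-query", "simple-query", "search"),
]
query_samples = samples_for_task(tagged, "simple-query")
```

With the field in place, locating the telemetry for a task becomes a simple filter instead of a manual search for start and end timestamps.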

Open question: what should be recorded for composite operations or when using the parallel element?

Longer term (enhancements)

1. Add a distinction between setup-level and runtime-level telemetry devices. This is required as a foundational block to allow specifying task-level telemetry devices (see Update telemetry device docs #1247 for more details).

2. Add optional task properties to specify which telemetry data to collect during the execution of a task. Rally would start collection just before the task starts and stop it right after the task ends. Metrics collected should include the task/operation/operation-type as mentioned above.

As an example a task could look like:

```
{
  "name": "simple-query",
  "operation": {
    "operation-type": "search",
    "index": "elasticlogs-2021-04-21",
    "body": {
      "query": {
        "term": {
          "nginx.access.remote_ip": "192.168.4.4"
        }
      }
    }
  },
  "telemetry": ["node-stats", "searchable-snapshots-stats"],
  "telemetry-params": {
    "searchable-snapshots-stats-sample-interval": 10,
    "searchable-snapshots-stats-indices": {
      "default": ["elasticlogs*"]
    },
    "node-stats-sample-interval": 10
  }
}
```

Pros:
- Intuitive
- Very explicit, allowing the use of different telemetry params per task

Cons:
- Repetition if >1 tasks require telemetry data
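
A rough sketch of how collection could be scoped to a single task, assuming the "telemetry" and "telemetry-params" task properties proposed above. `RecordingDevice`, `device_params` and `task_telemetry` are hypothetical names, not part of Rally; the stand-in device only records lifecycle calls instead of sampling anything.

```python
from contextlib import contextmanager


class RecordingDevice:
    """Hypothetical stand-in for a telemetry device; it only records lifecycle calls."""

    def __init__(self, name, params):
        self.name = name
        self.params = params
        self.events = []

    def start(self, task_name):
        self.events.append(("start", task_name))

    def stop(self, task_name):
        self.events.append(("stop", task_name))


def device_params(device_name, telemetry_params):
    """Pick the params prefixed with the device name, e.g. 'node-stats-sample-interval'."""
    prefix = device_name + "-"
    return {k[len(prefix):]: v for k, v in telemetry_params.items() if k.startswith(prefix)}


@contextmanager
def task_telemetry(task):
    """Start the task's devices just before the task and stop them right after it."""
    params = task.get("telemetry-params", {})
    devices = [RecordingDevice(n, device_params(n, params)) for n in task.get("telemetry", [])]
    for device in devices:
        device.start(task["name"])
    try:
        yield devices
    finally:
        # Stop in reverse order, even if the task itself raised an error.
        for device in reversed(devices):
            device.stop(task["name"])


task = {
    "name": "simple-query",
    "telemetry": ["node-stats", "searchable-snapshots-stats"],
    "telemetry-params": {
        "searchable-snapshots-stats-sample-interval": 10,
        "node-stats-sample-interval": 10,
    },
}
with task_telemetry(task) as devices:
    pass  # the task's operation would run here
```

Using a context manager keeps the start/stop pairing in one place, which is one possible way to guarantee that collection ends even when a task fails.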

3. An alternative to 2. would be to extend the definition of telemetry-params to allow specifying a list of tasks. Telemetry data would be collected only during the execution of those tasks. Here we'd need to clarify which name to use, i.e. task name or operation name (maybe even operation-type or a tag?), including the specifics of composite operations and parallel elements.

Pros:
- No need to replicate definitions in tasks (but this means that the same telemetry params need to be applied to all listed tasks ...)
Cons:
- Unnecessary room for mistakes (and more effort) due to the need to go back and forth in the track (or use esrally info --track=...) to consult the name to use.
- Ambiguity with task name/operation name/operation type?
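
For illustration, the selection logic for this alternative might look like the sketch below. The "tasks" key inside telemetry-params and the `should_collect` helper are hypothetical, and matching on the task name is only one of the options left open above (operation name, operation-type or tag being the others).

```python
def should_collect(task, telemetry_params):
    """Decide whether telemetry should be collected while the given task runs."""
    selected = telemetry_params.get("tasks")
    if selected is None:
        # No list given: keep the current behavior and collect throughout.
        return True
    # Assumption: entries refer to task names, not operation names or types.
    return task.get("name") in selected


telemetry_params = {
    "node-stats-sample-interval": 10,
    "tasks": ["simple-query-1", "simple-query-2"],  # hypothetical key
}
schedule = [
    {"name": "bulk-index"},
    {"name": "simple-query-1"},
    {"name": "simple-query-2"},
]
collected = [t["name"] for t in schedule if should_collect(t, telemetry_params)]
```

A misspelled entry in the list would silently match nothing in this sketch, which is the kind of mistake the first con refers to.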

4. As an alternative to 2. and 3., we could support two new administrative operations: start-telemetry and stop-telemetry. These would be administrative tasks, used right before and right after the tasks we want to collect telemetry from, respectively.

Example:
```
{
  "operation": {
    "operation-type": "start-telemetry",
    "telemetry-params": { ... }
  }
},
{
  "name": "simple-query-1",
  "operation": {
    ...
  }
},
{
  "name": "simple-query-2",
  "operation": {
    ...
  }
},
{
  "operation": {
    "operation-type": "stop-telemetry"
  }
}
```
Pros:
- Avoids repetition problem of 2.
- Still very intuitive
Cons:
- "Twists" the concept of an operation/runner; it's not an operation executed against Elasticsearch.
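
As a sketch of the semantics, the function below walks a schedule and reports which tasks would run while telemetry is active. It assumes the proposed start-telemetry and stop-telemetry operation types; `tasks_with_telemetry` and the surrounding task names are illustrative and not part of Rally.

```python
def tasks_with_telemetry(schedule):
    """Return the names of tasks that run between start-telemetry and stop-telemetry."""
    active = False
    covered = []
    for task in schedule:
        op_type = task["operation"].get("operation-type")
        if op_type == "start-telemetry":
            active = True
        elif op_type == "stop-telemetry":
            active = False
        elif active:
            covered.append(task["name"])
    return covered


schedule = [
    {"name": "bulk-index", "operation": {"operation-type": "bulk"}},
    {"operation": {"operation-type": "start-telemetry", "telemetry-params": {}}},
    {"name": "simple-query-1", "operation": {"operation-type": "search"}},
    {"name": "simple-query-2", "operation": {"operation-type": "search"}},
    {"operation": {"operation-type": "stop-telemetry"}},
    {"name": "force-merge", "operation": {"operation-type": "force-merge"}},
]
covered = tasks_with_telemetry(schedule)
```

In this sketch a missing stop-telemetry simply leaves collection on until the end of the schedule; whether that should be an error is a design question the proposal does not settle.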

5. Something else?

dliappis · 2021-04-21T12:29:18Z

@elastic/es-perf / @ywelsch for discussion

dliappis added the meta (A high-level issue of a larger topic which requires more fine-grained issues / PRs) and :Telemetry (Telemetry Devices that gather additional metrics) labels Apr 21, 2021

dliappis self-assigned this Apr 21, 2021

dliappis mentioned this issue Apr 21, 2021

Collect stats telemetry at start and end of each operation #1246

Closed

dliappis removed their assignment Oct 26, 2023
