Enhance usability of runtime telemetry devices #1248

Open
dliappis opened this issue Apr 21, 2021 · 2 comments
Labels
meta A high-level issue of a larger topic which requires more fine-grained issues / PRs :Telemetry Telemetry Devices that gather additional metrics

Comments

@dliappis
Contributor

dliappis commented Apr 21, 2021

Problem statement

Currently, Rally's telemetry devices run for the entire duration of a benchmark, and the only thing we can influence is the sampling interval.

This has a number of problems, especially when a user needs to "zoom" into telemetry data for a chosen subset of benchmark tasks:

  1. It's difficult to locate the start and end times of one or more benchmark tasks within the full set of collected telemetry stats.
  2. If a task is short-lived (e.g. a query that finishes in <1s), even the shortest sampling interval (1s) won't help: it's likely that telemetry stats are collected for only part of the task's execution, or not at all.
  3. Benchmarks that include one or more long-running tasks (e.g. recovering a large snapshot) waste metrics store storage by collecting telemetry for a long period of time, whereas we are usually interested only in telemetry gathered during the execution of specific tasks.
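
To make items 2 and 3 concrete, here is a rough sketch (not Rally code) that counts how many fixed-interval samples land inside a task's execution window. The helper name `samples_during` and the assumption that samples are taken at exact multiples of the interval are illustrative only.

```python
import math


def samples_during(task_start: float, task_end: float,
                   interval: float, first_sample: float = 0.0) -> int:
    """Count samples taken at first_sample + k * interval (k >= 0)
    that fall inside the closed window [task_start, task_end].

    All times are in seconds relative to the start of the benchmark.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    if task_end < task_start:
        raise ValueError("task_end must not be before task_start")
    # Index of the first sample at or after task_start.
    k_min = max(0, math.ceil((task_start - first_sample) / interval))
    # Index of the last sample at or before task_end.
    k_max = math.floor((task_end - first_sample) / interval)
    return max(0, k_max - k_min + 1)


# A 0.7s query that runs between two 1s sampling ticks is never observed.
print(samples_during(10.2, 10.9, interval=1.0))   # 0
# A one-hour task sampled every second produces thousands of samples.
print(samples_during(0.0, 3600.0, interval=1.0))  # 3601
```

The first call illustrates item 2 (a short task can fall entirely between two samples), the second illustrates item 3 (a long task dominates what is written to the metrics store).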

In this (meta) ticket we'll brainstorm ideas and come up with a list of tasks to solve, or at least mitigate, this problem.

Relates (somewhat): #1224

Task breakdown

TBD

@dliappis dliappis added meta A high-level issue of a larger topic which requires more fine-grained issues / PRs :Telemetry Telemetry Devices that gather additional metrics labels Apr 21, 2021
@dliappis dliappis self-assigned this Apr 21, 2021
@dliappis
Contributor Author

This was discussed in a recent team meeting. I'll summarize the discussion so far so that we can continue it async and settle on the desired approach.

Short term (workarounds)

One possible low-hanging fruit is to ensure that all telemetry metrics include a task field (and maybe also operation and operation-type fields), similar to the properties pushed for latency / service_time etc.
This would help with item 1 in the issue description but doesn't solve items 2 and 3.
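
A minimal sketch of what this tagging could look like, assuming telemetry samples are plain dictionaries. The helpers `tag_sample` and `samples_for_task` and the sample layout are hypothetical and only illustrate the idea; they are not Rally's metrics store API.

```python
from typing import Any, Dict, Iterable, List, Optional

Sample = Dict[str, Any]


def tag_sample(sample: Sample, task: str,
               operation: Optional[str] = None,
               operation_type: Optional[str] = None) -> Sample:
    """Return a copy of a telemetry sample annotated with the task
    (and optionally the operation and operation-type) that was running
    when the sample was taken."""
    tagged = dict(sample)
    tagged["task"] = task
    if operation is not None:
        tagged["operation"] = operation
    if operation_type is not None:
        tagged["operation-type"] = operation_type
    return tagged


def samples_for_task(samples: Iterable[Sample], task: str) -> List[Sample]:
    """Select only the samples that were recorded while `task` was running."""
    return [s for s in samples if s.get("task") == task]


# Hypothetical samples; the metric name and values are made up for illustration.
raw = {"name": "example_node_stat", "value": 1024}
collected = [
    tag_sample(raw, task="bulk-index", operation_type="bulk"),
    tag_sample(raw, task="simple-query", operation_type="search"),
]
print(samples_for_task(collected, "simple-query"))
```

With such a field in place, locating the telemetry for a task becomes a filter on `task` rather than a search by timestamps.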

Open question: what should be recorded for composite operations or when using the parallel element?

Longer term (enhancements)

  1. Add a distinction between setup-level and runtime-level telemetry devices. This is required as a foundational block to allow specifying task-level telemetry devices (see Update telemetry device docs #1247 for more details).

  2. Add optional task properties to specify what telemetry data to collect during the execution of a task. Rally would start collection just before the task starts and stop it right after the task ends. Metrics collected should include the task/operation/operation-type as mentioned above.

    As an example, a task could look like:

    {
      "name": "simple-query",
      "operation": {
        "operation-type": "search",
        "index": "elasticlogs-2021-04-21",
        "body": {
          "query": {
            "term": {
              "nginx.access.remote_ip": "192.168.4.4"
            }
          }
        }
      },
      "telemetry": ["node-stats", "searchable-snapshots-stats"],
      "telemetry-params": {
        "searchable-snapshots-stats-sample-interval": 10,
        "searchable-snapshots-stats-indices": {
          "default": ["elasticlogs*"]
        },
        "node-stats-sample-interval": 10
      }
    }

    Pros:

    • Intuitive
    • Very explicit, allowing the use of different telemetry params per task

    Cons:

    • Repetition if more than one task requires telemetry data

  3. An alternative to 2. would be to extend the definition of telemetry-params to accept a list of tasks. Telemetry data would be collected only during the execution of those tasks. Here we'd need to clarify which name to use, i.e. task name or operation name (maybe even operation-type? or tag?), including the specifics of composite operations and parallel elements.

    Pros:

    • No need to replicate definitions in tasks (but this means that the same telemetry params have to be applied to all listed tasks ...)

    Cons:

    • Unnecessary room for mistakes (and more effort) due to the need to go back and forth in the track (or use esrally info --track=...) to consult the name to use.
    • Ambiguity with task name/operation name/operation type?

  4. As an alternative to 2. and 3., we could support two new administrative operations: start-telemetry and stop-telemetry. These would be administrative tasks, used right before and right after the tasks we want to collect telemetry from, respectively.

    Example:

    {
      "operation": {
        "operation-type": "start-telemetry",
        "telemetry-params": { ... }
      }
    },
    {
      "name": "simple-query-1",
      "operation": {
        ...
      }
    },
    {
      "name": "simple-query-2",
      "operation": {
        ...
      }
    },
    {
      "operation": {
        "operation-type": "stop-telemetry"
      }
    }

    Pros:

    • Avoids repetition problem of 2.
    • Still very intuitive

    Cons:

    • "Twists" the concept of an operation/runner; it's not an operation executed against Elasticsearch.

  5. Something else?
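
Options 2 and 4 above share the same underlying mechanics: start the requested devices right before a task and stop them right after it. A rough sketch of that lifecycle follows, assuming a task is a dictionary with an optional `telemetry` list as in the example under option 2. `RecordingDevice`, `task_telemetry` and `run_task` are hypothetical names for illustration and do not exist in Rally.

```python
from contextlib import contextmanager
from typing import Any, Callable, Dict, Iterator, List, Tuple


class RecordingDevice:
    """Hypothetical stand-in for a telemetry device. It only records
    lifecycle events so that the start/stop ordering is visible."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.events: List[Tuple[str, str]] = []

    def start(self, task: str) -> None:
        self.events.append(("start", task))

    def stop(self, task: str) -> None:
        self.events.append(("stop", task))


@contextmanager
def task_telemetry(devices: List[RecordingDevice], task: str) -> Iterator[None]:
    """Start all devices before the task body runs and stop the ones
    that were started afterwards, even if the task raises."""
    started: List[RecordingDevice] = []
    try:
        for device in devices:
            device.start(task)
            started.append(device)
        yield
    finally:
        for device in reversed(started):
            device.stop(task)


def run_task(task: Dict[str, Any],
             devices_by_name: Dict[str, RecordingDevice],
             execute: Callable[[Dict[str, Any]], Any]) -> Any:
    """Run a task, collecting telemetry only from the devices it lists."""
    wanted = [devices_by_name[name] for name in task.get("telemetry", [])]
    with task_telemetry(wanted, task["name"]):
        return execute(task)


devices = {"node-stats": RecordingDevice("node-stats")}
task = {"name": "simple-query", "telemetry": ["node-stats"]}
run_task(task, devices, execute=lambda t: None)
print(devices["node-stats"].events)
# [('start', 'simple-query'), ('stop', 'simple-query')]
```

Under option 4, the explicit start-telemetry and stop-telemetry operations would correspond to calling `start` and `stop` directly instead of wrapping a single task. Option 3 would differ only in where the list of devices per task is looked up.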

@dliappis
Contributor Author

@elastic/es-perf / @ywelsch for discussion

@dliappis dliappis removed their assignment Oct 26, 2023