
Add initial experimental .NET CLR runtime metrics #1035

Merged
36 commits merged into open-telemetry:main on Aug 23, 2024

Conversation

stevejgordon
Contributor

@stevejgordon stevejgordon commented May 14, 2024

Fixes #956

Changes

Adds proposed experimental .NET CLR runtime metrics to the semantic conventions. Based on discussions with the .NET runtime team, the implementation plan will be to port the existing metrics from OpenTelemetry.Instrumentation.Runtime as directly as possible into the runtime itself. The names have been modified to align with the runtime environment metrics conventions.
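For context, metric definitions in files like `model/metrics/clr-metrics.yaml` follow the semantic-convention model schema. The sketch below is purely illustrative; the group id, metric name, unit, and attribute reference are hypothetical and are not the actual definitions from this PR:

```yaml
# Hypothetical example of a semantic-convention metric definition.
# The id, metric_name, unit, and attribute ref are illustrative only,
# not the content actually proposed in this PR.
groups:
  - id: metric.dotnet.gc.collections
    type: metric
    metric_name: dotnet.gc.collections
    stability: experimental
    brief: "Number of garbage collections that have occurred since process start."
    instrument: counter
    unit: "{collection}"
    attributes:
      - ref: dotnet.gc.heap.generation
        requirement_level: recommended
```

The generated markdown tables discussed later in this thread are produced from definitions in this shape by the repository's code-generation targets.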

Merge requirement checklist

@stevejgordon stevejgordon requested review from a team May 14, 2024 10:23
@stevejgordon
Contributor Author

I'm not entirely sure why markdownlint fails on the autogenerated tables.

@trisch-me
Contributor

Have you run both attribute-registry-generation and table-generation?

@stevejgordon
Contributor Author

Have you run both attribute-registry-generation and table-generation?

I thought I had done so as the contents were updated, but I can try those again. I'm having to hack around the tooling a bit to get things running on Windows.

@stevejgordon
Contributor Author

@trisch-me I've updated the formatting per your suggestion and rerun both of those make targets. Neither made any changes to the markdown files.

@trisch-me
Contributor

Alternatively, you could wait until #1000 is merged; it has a fix for this bug.

@trisch-me
Contributor

@stevejgordon #1000 is merged, so please re-run the code generation locally and update your files. Thanks!

@stevejgordon
Contributor Author

Thanks, @trisch-me. Looks good!

Member

@gregkalapos gregkalapos left a comment


Happy to see this 🎉

Left some comments below.


linux-foundation-easycla bot commented May 17, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

Contributor

@lmolkova lmolkova left a comment


Thank you for working on this!

The main concerns from my side:

  • It seems we're designing 'proper' CLR metrics based on the information we can get today, but native runtime instrumentation can do much better and provide:
    • GC duration histogram
    • lock contention duration
    • ...
  • I'm no expert in all of the runtime details, but I assume that some metrics are used much more than others (e.g. CPU, GC, heap size, thread pool), while things like JIT metrics may be more advanced and specialized. I wonder if it's possible to start with basic perf analysis (e.g. CPU, memory/GC) and then move on to more specific metrics in follow-up PRs?
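The richer native-runtime instrumentation mentioned above (for example, a GC pause-duration histogram) could be modeled in the same YAML schema. This is a hedged sketch only; the name, unit, and brief are assumptions for illustration, not conventions adopted by this PR:

```yaml
# Hypothetical sketch of a GC pause-duration histogram definition.
# The metric name, unit, and wording are illustrative assumptions.
groups:
  - id: metric.dotnet.gc.pause.duration
    type: metric
    metric_name: dotnet.gc.pause.duration
    stability: experimental
    brief: "Time the runtime has spent paused for garbage collection."
    instrument: histogram
    unit: "s"
```

A histogram instrument like this requires hooks inside the runtime itself, which is why it goes beyond what the existing contrib instrumentation can report.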

@trask
Member

trask commented May 28, 2024

/easycla

@lmolkova
Contributor

lmolkova commented Jun 3, 2024

Looking at the discussions in this PR, I want to reiterate the proposal #1035 (review):

Let's think about the user experience: which metrics do users want to see first? My assumption is CPU, memory (from all sources, ideally in one or a few metrics), GC, and maybe threads. Let's try to come up with a few metrics that don't require users to have deep prior expertise in .NET memory management or to know the subtle differences between different .NET flavors.

If we need some advanced, precise things, they should come as an addition to basic CPU/mem/GC things.
Let's focus on basics first.

@noahfalk
Contributor

noahfalk commented Jun 3, 2024

If we need some advanced, precise things, they should come as an addition to basic CPU/mem/GC things.
Let's focus on basics first.

Are you suggesting we review some things first and other things later? Or do you mean "Let's see if we can avoid including some of these metrics in .NET 9 at all?" If it's just review ordering, I'd say no problem. If the goal is for these conventions not to contain all the metrics in the current OTel .NET runtime instrumentation, I think that leads to trouble. One of the major scenarios will be people migrating from the older metrics to these, and every metric we remove makes that migration harder or discourages them from migrating at all. If we want a simple set of metrics for folks just getting started, I think a better approach would be docs or a pre-made dashboard, not excluding metrics from the underlying instrumentation.

EDIT: Just to add, I know a few of the metrics may feel a bit advanced or niche, but they are there because customer feedback asked for them. We took a bunch of things out during the move from Windows Performance Counters to EventCounters, but customers told us we had cut too deep and asked us to restore some of the metrics that were important to them.


This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Aug 23, 2024
@lmolkova
Contributor

I took the liberty of resolving the last two discussions, applied @noahfalk's suggestion on one of them to match what's documented in the runtime, and regenerated the tables.

With that, I believe this is ready to go.

@lmolkova lmolkova merged commit d04bc0d into open-telemetry:main Aug 23, 2024
13 of 14 checks passed
@stevejgordon stevejgordon deleted the dotnet-runtime-metrics branch August 23, 2024 09:22
stevejgordon added a commit to elastic/kibana that referenced this pull request Nov 4, 2024
## Summary

Create a dedicated "portable dashboard" for OTel .NET.

This uses metrics available in the
[contrib](https://github.com/open-telemetry/opentelemetry-dotnet-contrib)
runtime metrics library. These metrics are opt-in and not enabled by
default in the vanilla SDK. Our Elastic distro brings in the package and
enables them by default. Therefore, the dashboard will only work if a)
the customer uses our distro or b) they enable the metrics themselves
when using the vanilla SDK.

Further, work is ongoing to define [semantic conventions for .NET
runtime
metrics](open-telemetry/semantic-conventions#1035).
Once complete, the metrics will be implemented directly in the .NET
runtime BCL and be available with no additional dependencies. The goal
is to achieve that by .NET 9, which is not guaranteed. At that point,
the metric names will change to align with the semantic conventions.
This is not ideal, but it is our only option if we want to provide some
form of runtime dashboard with the current metrics and OTel distro.

As with #182107, this dashboard uses a table for some of the data, and that table doesn't seem to reflect the correct date filtering. Until there is a solution, this PR will remain in draft, or we can consider dropping the table from the initial dashboard.


![image](https://github.com/elastic/kibana/assets/3669103/0be46495-e09f-4f4e-81e1-5f69361d5781)
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Nov 4, 2024
(cherry picked from commit 0600309)

Successfully merging this pull request may close these issues.

Create CLR metrics semantic convention