Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Initial Metrics #2499

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from
Draft

Add Initial Metrics #2499

wants to merge 5 commits into from

Conversation

eerhardt
Copy link
Contributor

(Note: Leaving in Draft until #2497 is merged as this needs a net6.0 target. You can skip the first commit in the review for now.)

This adds the initial implementation of using System.Diagnostics.Metrics in the StackExchange.Redis client. For this first round, the following metrics are tracked:

  • redis-operation-count
  • redis-completed-asynchronously
  • redis-completed-synchronously
  • redis-failed-asynchronously
  • redis-failed-synchronously
  • redis-non-preferred-endpoint-count

Other metrics in ConnectionCounters need to use UpDownCounter, which is only available in net7.0.

cc @NickCraver @mgravell @JamesNK @noahfalk

@NickCraver
Copy link
Collaborator

Starting early global discussions here:

  • What's the metric convention here on names? Based on other top level for process/rpc/etc. I'd naively expect redis.<name> (rather than redis-<name>), am I way off there?
  • For what's effectively a singleton top to bottom, can we simplify to static meter and properties, eliminating a lot of code and .Instance everywhere? Or is this a stepping stone? (I'd imagine since the receiver is pluggable, e.g. memory OTel collector, this doesn't need to be) If not, I'd like to consolidate this down to just a bunch of static props (rather than constructor) and methods. Depending on area, nested classes would look the same (putting ideas below)
  • Dimensions is really good point...we're going to need to think about this heavily. Some are per-endpoint, but we need to consider aggregation behavior later
  • Should we be incrementing completion latency? (cc @mgravell)

Counter ideas:

  • We should have counters on sent, not just completed (so we can see stalls via metrics)
  • Payload size, if cheap (need to look at this)
    • ...or possibly just bytes written to/read from pipe in general
  • Connection counters:
    • Connected seconds
    • Connections created
    • Connections failures
    • Server errors
    • Internal errors
    • Reconfigurations
  • Things I'd love, but maybe add too much volume (verbosity setting on whether we enable these? Or some flags enum? How do other handle this for mass scale?)
    • Internal metrics of basically everything in an error message that's us: byte buffers, backlogged items, etc.

Tip: GitHub will auto redirect PRs, so you can PR to a PR and when the first PR is merged downstream PRs automatically get redirected to their merge target, e.g. this would change to look at main upon merge, showing only relevant diff the whole time...very handy for these layering cases :)

Happy to hop on a call next week if y'all want to talk though, just getting initial ideas out there. Thoughts?

@JamesNK
Copy link

JamesNK commented Jul 2, 2023

What's the metric convention here on names? Based on other top level for process/rpc/etc. I'd naively expect redis. (rather than redis-), am I way off there?

There isn't a proscribed format from OTel about what format counter names should look like. OTel has some counters they're defining for certain domains and they use dots to separate words. .NET counters use dashes. When exported to Prometheus, they're both turned into underscores.

For what's effectively a singleton top to bottom

Meter + counters don't necessarily have to be a singleton. In ASP.NET Core, the meter is created from IMeterFactory. That allows a different meter per-container and easy unit testing of metrics. We're looking at making HttpClient have an extension point for configuring a meter to allow its values to be isolated.

A singleton is difficult because multiple tests can run in parallel, or work may still be happening from a finished test, which interferes with results.

See https://github.com/JamesNK/MeterFactoryDemo/blob/d0660fb70ea54bfa358606830a4e02f507fc89f6/src/MeterFactoryDemo.Tests/BasicTests.cs#L14-L41 for an example of how easy it can be to test metrics once the meter is injected. I'm guessing the Redis client doesn't have close integration with DI so a more manual approach like what we're doing with HttpClient would be required: dotnet/runtime#86961

Dimensions + other counters

Check out dotnet/aspnetcore#47536 for ideas of the kind of counters that we're adding to ASP.NET Core + dimensions.

@JamesNK
Copy link

JamesNK commented Jul 2, 2023

  • redis-operation-count
  • redis-completed-asynchronously
  • redis-completed-synchronously
  • redis-failed-asynchronously
  • redis-failed-synchronously

You could consider combining these into one histogram counter: redis-operation-duration. Then have a dimension for whether the operation was async/sync and a dimension for the success/failure result.

Comment on lines 51 to 55
if (_operationCount.Enabled)
{
_operationCount.Add(1,
new KeyValuePair<string, object?>("endpoint", endpoint));
}
Copy link

@JamesNK JamesNK Jul 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Counter.Add tests for enabled internally. Having an extra check like this is really only useful if some work happens when building up the tag list.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 - we should have a string version of the endpoint memoized so that this KVP is as cheap as possible if not already, no objections adding that.

@NickCraver
Copy link
Collaborator

@JamesNK great points all around (and I really appreciate the expertise and time!)

I agree with collapsing and using dimensions for types (sync/async) and the testing story.

I think my confusion on instance intent here is the direct usage of .Instance on the call sites - if instead the Multiplexer had a RedisMetrics Metrics property (replaceable, per test) which defaulted to the RedisMetrics.Instance: I mostly get it. I think this would have to be surfaced on ConfigurationOptions because access isn't early enough otherwise. But I'm still fuzzy how for example the memory test collector would differentiate given the namespace. Do we do something like making an internal constructor of RedisMetrics take a namespace so a test can override random GUID style for the Meter or is there a simpler solution to this sans dependency? Thoughts?

Also agree on DI assessment: we don't expose the ConnectionMultiplexer constructor, and overloading the static .Connect(Async) methods creates a real maintenance headache (I have regrets there), so IMeterFactory isn't probably something we'd do given the dependency, but I do want to make this open/testable as much as possible.

@noahfalk
Copy link

noahfalk commented Jul 4, 2023

What's the metric convention here on names? Based on other top level for process/rpc/etc. I'd naively expect redis. (rather than redis-), am I way off there?

There isn't a proscribed format from OTel about what format counter names should look like

Just as a heads up (and not at all specific to Redis) I've been questioning lately whether .NET should continue to use and recommend dashes in these names. I sent some email earlier that touched on it. I can't find an explicit written recommendation for '.' in OTel's guidance, yet '.' is used uniformly in every OTel example, every experimental semantic convention, and every implementation I can find. The fact that the OTel text doesn't explicitly say "we recommend using dot" feels more like an oversight rather than OTel deliberately not having a recommendation/preference. Some other other generalized guidance from OTel on naming is here.

Do we do something like making an internal constructor of RedisMetrics take a namespace so a test can override random GUID style for the Meter or is there a simpler solution to this sans dependency? Thoughts?

If you can create an internal constructor for RedisMetrics and get your code under test to use it then I'd recommend making that internal constructor take a Meter object:

public static readonly RedisMetrics Instance = new RedisMetrics(new Meter("StackExchange.Redis"));

internal RedisMetrics(Meter meter)
{
    _meter = meter;
    _operationCount = _meter.CreateCounter<long>(
            "redis-operation-count",
            description: "The number of operations performed.");
    ...
}

And then use it something like:

// arrange
// the name of this Meter doesn't matter and it doesn't need to be unique across tests
var meter = new Meter("StackExchange.Redis");
// This collector will only collect metrics recorded by this specific Meter object.
// Even if other tests use Meters with the same name, they will not be captured by this collector
using var collector = new MetricCollector(meter, "redis-operation-count");
var redisMetrics = new RedisMetrics(meter);

// act
DoSomeWorkThatEmitsMetrics(redisMetrics);

// assert
Assert.Equal(1, collector.GetMeasurementSnapshot().Count);
...

Its certainly possible to inject strings and use them to create custom Meter names, but I think that winds up being more complex than injecting a Meter object and just using the object identity as the thing that keeps metrics for each test isolated.

The other option if you can't inject anything (or prefer not to), is to create MetricCollector(scope:null, meterName:"StackExchange.Redis", instrumentName:"redis-operation-count") and then don't run these tests in parallel with each other. Binding based on name is going to capture data from every Meter+Instrument in the process that matches this name.

@NickCraver
Copy link
Collaborator

@noahfalk The dash feels weird to me too, everything was namespace delimited with dots when we did Bosun's system design and metrics because that's what we saw everywhere else. Dashes weren't a thing at all, and spaces were underscores, e.g. redis.connection_restores. If we're headed that way overall, I'd definitely like to start here. But above all: I don't want to stand out in a metrics list anywhere as the oddball. From what I see with 1P at least, most people have started with the OTel naming because of using the Runtime packages and such....so the first other addition is the oddball. I agree with your direction thinking dropping the dash, let me know if I can help.

As for testing: great suggestion, I love it. I knew we could narrow in on a custom Meter.Name attachment but hadn't seen the "taking a Meter" overload. Thanks as always for advice <3

@noahfalk
Copy link

noahfalk commented Jul 6, 2023

But above all: I don't want to stand out in a metrics list anywhere as the oddball.

Yeah, I don't want Redis to look like an oddball either. If we go through with this I'm going to adjust names for all the metrics being added to ASP.NET and HttpClient for .NET 8, as well as change the written naming guidance in our public docs. I'm not proposing we change names that already got shipped in past releases as part of EventCounters API, but I expect the usage of those metrics will diminish over time and we'd converge towards a place where all the metrics people care about are using dot.

@eerhardt
Copy link
Contributor Author

Thanks for all the feedback so far! I still need to address the feedback to make this testable by allowing the unit tests to inject a Metrics object into the Multiplexer. I'll work on that next week.

In the meantime, I've updated for better naming to align with OpenTelemetry (no more dash - characters). I've also prefixed things with db.redis following the naming patterns in https://github.com/open-telemetry/semantic-conventions/blob/main/docs/database/redis.md. I also took @JamesNK's excellent advice and collapsed the 4 counters into a single histogram. Thanks!

Counter ideas:

@NickCraver @mgravell - my intention is for this PR to get the infrastructure and an initial set of metrics added to the library. And then add more metrics in separate PRs as we get the need for them. How does that approach sound? Do you have an idea for what a "minimal set" of initial metrics would be?

@NickCraver
Copy link
Collaborator

@eerhardt I get the intent, but all past experience has taught me not to put a few metrics in and come back later. In every case we've done that, when we go from say the first 3 to the first 20 we found out the scheme/name/dimensions/etc. for the first 3 wasn't right and stands out. IMO, we need to figure out a larger set together.

Looking at current, I see db.redis.operation.count and and db.redis.duration, wouldn't the second be db.redis.opration.duration? (I'm also not sure operation is the right word - we do a lot of operations, in Redis these are typically "commands" and I'd expect that here. This could be bias, but connections, reconnectiions, reconfigurations, etc. are all "operations".

I'm not 100% sure operations/commands makes overall sense as an item. For example, we have commands sent and received. I'd guess based on most of our debugging these should be separate, e.g. the first being steady state and the second pausing indicates a server stall. On commands we probably want the command (e.g. SET) as a tag.

Overall, I think we've got the approach figured out, we should now make an overall list of metrics to get a decent API/name/pattern/tag design that's a lot more comprehensive than starting with just a few. Happy to do a call this week if it helps, I just need to get some active blocking release stuff out the door to get more time on SE.Redis. Let me try and get a decent start on the full list here during the Tuesday call.

@eerhardt
Copy link
Contributor Author

For what's effectively a singleton top to bottom

Ok, I've updated the latest code to allow for injection of the Meter object, so tests can be written effectively. I've also added a single test to show how testing works. Let me know what you think of the current structure.

If we like how this is structured, I can move forward with adding/removing/refactoring the metrics we really want to design. And then add more tests for them.

@pantosha
Copy link

pantosha commented Jan 2, 2024

@eerhardt, #2497 is merged. Could you prepare the current PR for the merge? 😊

var now = Stopwatch.GetTimestamp();
var duration = new TimeSpan((long)((now - message.CreatedTimestamp) * s_tickFrequency));

var tags = new TagList

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for upto 3 tags, its faster to pass them directly.
https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/metrics#instruments
When reporting measurements with 3 tags or less, pass the tags directly to the instrument API.

@boika
Copy link

boika commented Oct 2, 2024

It's pity to see it inactive. Are there any plans to include metrics into near future releases?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants