System.Diagnostics.Activity Perf Improvement #40362

Merged Aug 5, 2020 (2 commits)

Contributor

@CodeBlanch CodeBlanch commented Aug 5, 2020

Added a struct enumerator for the LinkedList used internally by Activity. The idea is performance-sensitive callers can use that to avoid allocations when enumerating TagObjects, Events, & Links.
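
As a rough illustration of the idea (not the code from this PR), a linked list can expose a public `GetEnumerator()` that returns a struct, so a `foreach` over the concrete type creates no enumerator object on the heap. `SimpleLinkedList<T>` and all of its members are hypothetical names made up for this sketch.

```csharp
#nullable enable
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical stand-in for an internal linked list; not the real Activity type.
public sealed class SimpleLinkedList<T> : IEnumerable<T>
{
    internal sealed class Node
    {
        public readonly T Value;
        public Node? Next;
        public Node(T value) => Value = value;
    }

    private Node? _first;
    private Node? _last;

    public void Add(T value)
    {
        var node = new Node(value);
        if (_first is null) { _first = _last = node; }
        else { _last!.Next = node; _last = node; }
    }

    // foreach binds to this method when the static type is SimpleLinkedList<T>.
    public Enumerator GetEnumerator() => new Enumerator(_first);

    // Interface callers receive a boxed copy of the struct.
    IEnumerator<T> IEnumerable<T>.GetEnumerator() => GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    public struct Enumerator : IEnumerator<T>
    {
        private Node? _next;
        private T _current;

        internal Enumerator(Node? head)
        {
            _next = head;
            _current = default!;
        }

        public T Current => _current;
        object? IEnumerator.Current => _current;

        public bool MoveNext()
        {
            if (_next is null) return false;
            _current = _next.Value;
            _next = _next.Next;
            return true;
        }

        public void Reset() => throw new NotSupportedException();
        public void Dispose() { }
    }
}
```

With a variable typed as `SimpleLinkedList<int>`, `foreach` uses the struct directly; through `IEnumerable<int>` the same struct is boxed, which is the allocation discussed later in this thread.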

Performance numbers: #40362 (comment)

/cc @tarekgh @noahfalk @cijothomas

@ghost

ghost commented Aug 5, 2020

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

@CodeBlanch
Contributor Author

Some other areas I noticed:

  • CreateAndStart method takes in IEnumerable Tags & Links, which it will enumerate. Those will allocate. If we changed the API to accept IList instead of IEnumerable we could do a more simple for loop which wouldn't allocate?
  • IEnumerable<KeyValuePair<string, string?>> Tags property has some special logic to only return the strings in the LinkedList. I couldn't figure out a way to solve that without an allocation. Probably not a big deal, if caller is worried about the perf they can use TagObjects and just exclude the non-string items?
  • IEnumerable<KeyValuePair<string, string?>> Baggage property is an interesting enumeration. It goes up the Parent chain enumerating over any items it finds along the way. Might be able to craft something for that. For now I left it alone because I don't think in OpenTelemetry we export Baggage meaning there isn't a foreach in the hot path.

@tarekgh
Member

tarekgh commented Aug 5, 2020

@CodeBlanch thanks for submitting this. in general LGTM.
I want to call out one thing which is, this enumeration now will not be thread safe. I think this is fine but we should clarify that in the doc.

CreateAndStart method takes in IEnumerable Tags & Links, which it will enumerate. Those will allocate. If we changed the API to accept IList instead of IEnumerable we could do a more simple for loop which wouldn't allocate?

I don't think using IList is a good idea. It would restrict users of this API to specific collections, and they would not be able to use other collections (e.g. Dictionary<,>).

IEnumerable<KeyValuePair<string, string?>> Tags property has some special logic to only return the strings in the LinkedList. I couldn't figure out a way to solve that without an allocation. Probably not a big deal, if caller is worried about the perf they can use TagObjects and just exclude the non-string items?

I think this can still be done by having Enumerator<>.MoveNext do this logic, no?
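
A minimal sketch of what "doing the logic in MoveNext" could look like, assuming a list-backed tag store. `StringTagView` is a hypothetical type, and the sketch keeps only values that are actually `string`; how the real `Tags` property treats null values is not covered here. Returning such a struct through an `IEnumerable<...>` property would still box it, so this only helps callers that see the concrete type.

```csharp
#nullable enable
using System.Collections.Generic;

// Hypothetical view over object-valued tags that yields only string-valued entries.
public readonly struct StringTagView
{
    private readonly List<KeyValuePair<string, object?>> _tags;

    public StringTagView(List<KeyValuePair<string, object?>> tags) => _tags = tags;

    public Enumerator GetEnumerator() => new Enumerator(_tags);

    public struct Enumerator
    {
        private readonly List<KeyValuePair<string, object?>> _tags;
        private int _index;
        private KeyValuePair<string, string?> _current;

        internal Enumerator(List<KeyValuePair<string, object?>> tags)
        {
            _tags = tags;
            _index = -1;
            _current = default;
        }

        public KeyValuePair<string, string?> Current => _current;

        // Skips every entry whose value is not a string.
        public bool MoveNext()
        {
            while (++_index < _tags.Count)
            {
                KeyValuePair<string, object?> tag = _tags[_index];
                if (tag.Value is string s)
                {
                    _current = new KeyValuePair<string, string?>(tag.Key, s);
                    return true;
                }
            }
            return false;
        }
    }
}
```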

IEnumerable<KeyValuePair<string, string?>> Baggage property is an interesting enumeration. It goes up the Parent chain enumerating over any items it finds along the way. Might be able to craft something for that. For now I left it alone because I don't think in OpenTelemetry we export Baggage meaning there isn't a foreach in the hot path.

agree, we can optimize this later if we need to.

@tarekgh
Member

tarekgh commented Aug 5, 2020

@noahfalk I'll wait for your review before I merge it.

@CodeBlanch
Contributor Author

I want to call out one thing which is, this enumeration now will not be thread safe. I think this is fine but we should clarify that in the doc.

Can you go into a little more detail on that for me? It looks essentially the same as before to me, so I'm just curious what I'm missing.

I don't think using IList is a good idea. It would restrict users of this API to specific collections, and they would not be able to use other collections (e.g. Dictionary<,>).

Good point. It would be nice to have a solution here, but I can't think of anything. Tags you can apply after creation, so you could work around it if you were so inclined. But Links you can only pass on the ctor. If we added Add/Remove Link, we could work around that too.

IEnumerable<KeyValuePair<string, string?>> Tags property has some special logic to only return the strings in the LinkedList. I couldn't figure out a way to solve that without an allocation. Probably not a big deal, if caller is worried about the perf they can use TagObjects and just exclude the non-string items?

I think this can still be done by having Enumerator<>.MoveNext do this logic, no?

It seems like it should be possible, doesn't it? I tried a couple of ways, and just tried again, but I can't get it without an allocation. The method needs to return an IEnumerable which exposes the GetEnumerator. So if I do return new StringEnumerableTagThing(_tags) no problem, but there's an allocation. What I really want to do is add IEnumerable<KeyValuePair<string, string>> on TagsLinkedList but it already has IEnumerable<KeyValuePair<string, object?>> so they collide on the IEnumerable.GetEnumerator method.

@tarekgh
Member

tarekgh commented Aug 5, 2020

Can you go into a little more detail on that for me? It looks essentially the same as before to me, so I'm just curious what I'm missing.

Sorry I was not clear. you are not changing the thread safety issue. I was just calling out the fact that the enumeration is not thread safe even before your change.

For Tags enumeration, it is ok to keep it like that for now even if it is allocating.

One last ask here: I am seeing issue #40366, which needs a small change. Can you add that change with yours here, at least to avoid conflicts? Thanks.

@CodeBlanch
Contributor Author

One last ask here: I am seeing issue #40366, which needs a small change. Can you add that change with yours here, at least to avoid conflicts? Thanks.

Done

@tarekgh
Member

tarekgh commented Aug 5, 2020

@noahfalk I am merging this but let me know if you have anything you want to change so we can still do it.

@CodeBlanch thanks for your help with this issue.

@tarekgh tarekgh merged commit b62f482 into dotnet:master Aug 5, 2020
@CodeBlanch CodeBlanch deleted the activity-enumerators branch August 5, 2020 22:27
@noahfalk
Member

noahfalk commented Aug 6, 2020

Usually when we make a perf change we also have result from BenchmarkDotNet showing the improvement we got from the change. Can we do that here?

Also just to confirm the new code still allocates, but I think it allocates less than the previous did (old code allocated an IEnumerable and an IEnumerator, new code only allocates the IEnumerator). Does that sound accurate?

@CodeBlanch
Contributor Author

@noahfalk Agreed. For most callers the struct Enumerator will be boxed (allocated) because it is accessed through the IEnumerable/IEnumerator interface. The reason I asked for these is we have a hack helper in OTel which will look up the struct GetEnumerator and build a DynamicMethod for it. So, if you are determined enough, these do actually allow you to get there.
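
For readers unfamiliar with the lookup step, here is a small sketch of finding a public, struct-returning `GetEnumerator()` on an object's runtime type via reflection. It is not the OTel helper: `EnumeratorLookup` and its method names are hypothetical, and the `DynamicMethod` generation that would make the call allocation-free is left out.

```csharp
#nullable enable
using System;
using System.Reflection;

public static class EnumeratorLookup
{
    // Returns the public parameterless instance GetEnumerator() on the runtime type, or null.
    // Explicit interface implementations are private, so they are not matched here.
    public static MethodInfo? FindPublicGetEnumerator(object collection) =>
        collection.GetType().GetMethod(
            "GetEnumerator",
            BindingFlags.Public | BindingFlags.Instance,
            binder: null,
            types: Type.EmptyTypes,
            modifiers: null);

    // True when the collection exposes a public GetEnumerator() whose return type is a struct.
    public static bool HasStructEnumerator(object collection)
    {
        MethodInfo? method = FindPublicGetEnumerator(collection);
        return method != null && method.ReturnType.IsValueType;
    }
}
```

For example, `List<int>` exposes a struct enumerator this way, while an `int[]` only exposes the interface-typed `Array.GetEnumerator()`.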

Regarding the perf tests, they won't really look as good without the helper. Do you still want me to do them? Or would you like to see before/after from the OTel perspective, maybe?

@noahfalk
Member

noahfalk commented Aug 6, 2020

Regarding the perf tests, they won't really look as good without the helper. Do you still want me to do them?

Yes please. I want to make sure that when people look at this change and it claims "we made perf better" then there is some data to substantiate the claim. It doesn't have to look amazing and you are welcome to include extra OTel specific cases that have even larger gains if you want to show that off : )

@CodeBlanch
Contributor Author

@noahfalk Working on the perf tests. It looks like the GetEnumerator I added is being removed. I'm guessing because nothing is referencing it in the code base and it is internal?

If I do this, it seems to work:

        [DynamicDependency(DynamicallyAccessedMemberTypes.PublicMethods, typeof(TagsLinkedList))]
        private TagsLinkedList? _tags;
        [DynamicDependency(DynamicallyAccessedMemberTypes.PublicMethods, typeof(LinkedList<ActivityLink>))]
        private LinkedList<ActivityLink>? _links;
        [DynamicDependency(DynamicallyAccessedMemberTypes.PublicMethods, typeof(LinkedList<ActivityEvent>))]
        private LinkedList<ActivityEvent>? _events;

Correct approach? OK to PR this change?

@tarekgh
Member

tarekgh commented Aug 7, 2020

@eerhardt

@CodeBlanch has been optimizing Activity.TagObjects, which returns IEnumerable<...>. He provided a GetEnumerator there; you can look at this PR. He noticed that GetEnumerator is being removed from the code. Is this because of the linker? Note that this is in the System.Diagnostics.DiagnosticSource library.

@CodeBlanch are you using the official build of this library, or are you building your own?

@eerhardt
Member

eerhardt commented Aug 7, 2020

Is this because of the linker?

Yes. You have a non-public, unused method. This is exactly what the linker removes when it is run.

@tarekgh
Member

tarekgh commented Aug 7, 2020

Thanks @eerhardt. what is the best way to force the linker to not remove them?

@eerhardt
Member

eerhardt commented Aug 7, 2020

Using DynamicDependency as listed above will work. But honestly, the use of private reflection here is the issue IMO.

@tarekgh
Member

tarekgh commented Aug 7, 2020

That is why I asked what the best way to do that is. Why don't we have a specific attribute to tell the linker to ignore this type/method/field instead of depending on reflection?

@eerhardt
Member

eerhardt commented Aug 7, 2020

The use of private reflection is coming from OpenTelemetry. Even @CodeBlanch calls it a hack above.

The reason I asked for these is we have a hack helper in OTel which will look up the struct GetEnumerator and build a DynamicMethod for it.

To answer the question:

what is the best way to do that.

The "best" way to do that is to make the method you want exposed publicly - that way callers can take advantage of it without using private reflection.

In this specific situation, if I write the code:

Activity a = ...;
foreach (KeyValuePair<string, object?> tag in a.TagObjects)
{
}

I won't be able to take advantage of the perf improvement added here.


If you just want a way to tell the linker not to remove these methods, then the DynamicDependencyAttribute listed above will work.

@tarekgh
Member

tarekgh commented Aug 7, 2020

Theoretically and practically GetEnumerator code is reachable from the public API.

The TagObjects property is public and returns the _tags object, which is an internal type implementing the IEnumerable<> interface, and part of that implementation is GetEnumerator(). Isn't the linker being too aggressive here?

@CodeBlanch
Contributor Author

Let me make this change locally and put up the performance numbers. I think when you guys see them, you will understand why we are doing the hack. I'm open to doing a more proper fix, which would be to modify the public API to return concrete type(s) that expose the struct GetEnumerator properly (e.g. public TagsLinkedList TagObjects { get; }), but I don't think there will be much appetite for such a change.

@tarekgh
Member

tarekgh commented Aug 7, 2020

I don't see this as a hack. It is a legitimate thing to do. I don't think we can expose any new types for that at the current time.

@CodeBlanch
Contributor Author

Below are the perf numbers. I'm going to open a PR to get DynamicDependency in here.

@noahfalk

  • As we expected, minor improvement for callers doing foreach. Massive improvement for OTel using the struct Enumerator helper. No change to EnumerateActivityTags.
  • Do you want me to PR my perf tests into dotnet/performance?

Without perf improvements (BEFORE):

| Method | NumberOfActivities | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| EnumerateActivityTags | 5000 | 313.6 us | 9.05 us | 10.43 us | 312.3 us | 300.9 us | 334.0 us | 33.1633 | - | - | 273.44 KB |
| EnumerateActivityTagObjects | 5000 | 385.1 us | 7.35 us | 7.55 us | 383.1 us | 373.1 us | 403.1 us | 33.1230 | - | - | 273.44 KB |
| OTelHelperActivityTagObjects | 5000 | 467.0 us | 9.09 us | 9.34 us | 465.0 us | 449.6 us | 488.3 us | - | - | - | 273.49 KB |
| EnumerateActivityLinks | 5000 | 546.1 us | 7.17 us | 6.36 us | 545.1 us | 539.0 us | 558.2 us | 47.4138 | - | - | 390.63 KB |
| OTelHelperActivityLinks | 5000 | 618.9 us | 4.77 us | 4.23 us | 618.5 us | 613.3 us | 625.2 us | 45.6731 | - | - | 390.7 KB |
| EnumerateActivityEvents | 5000 | 442.1 us | 4.00 us | 3.55 us | 442.3 us | 437.2 us | 448.8 us | 41.6667 | - | - | 351.56 KB |
| OTelHelperActivityEvents | 5000 | 500.5 us | 8.67 us | 8.11 us | 500.4 us | 488.6 us | 515.2 us | 41.6667 | - | - | 351.63 KB |

With perf improvements (AFTER):

| Method | NumberOfActivities | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| EnumerateActivityTags (Note: No change to this code path) | 5000 | 334.5 us | 16.09 us | 18.53 us | 328.2 us | 314.3 us | 376.7 us | 32.6705 | - | - | 280000 B |
| EnumerateActivityTagObjects | 5000 | 449.7 us | 7.18 us | 6.37 us | 448.9 us | 440.0 us | 461.9 us | 27.7778 | - | - | 240000 B |
| OTelHelperActivityTagObjects | 5000 | 233.1 us | 11.44 us | 13.17 us | 227.6 us | 219.5 us | 259.3 us | - | - | - | 1 B |
| EnumerateActivityLinks | 5000 | 625.3 us | 17.45 us | 19.39 us | 616.3 us | 605.3 us | 664.4 us | 40.8654 | - | - | 360000 B |
| OTelHelperActivityLinks | 5000 | 258.5 us | 4.89 us | 4.80 us | 257.0 us | 252.7 us | 268.6 us | - | - | - | - |
| EnumerateActivityEvents | 5000 | 473.1 us | 27.85 us | 32.07 us | 461.3 us | 441.2 us | 539.3 us | 37.5000 | - | - | 320000 B |
| OTelHelperActivityEvents | 5000 | 251.7 us | 4.69 us | 4.16 us | 250.9 us | 247.0 us | 262.6 us | - | - | - | - |

@eerhardt
Member

eerhardt commented Aug 7, 2020

Theoretically and practically GetEnumerator code is reachable from the public API.

This is not true.

public Enumerator<KeyValuePair<string, object?>> GetEnumerator() => new Enumerator<KeyValuePair<string, object?>>(_first);

That method has no callers.

The TagObjects property is public and returns the _tags object, which is an internal type implementing the IEnumerable<> interface, and part of that implementation is GetEnumerator(). Isn't the linker being too aggressive here?

TagsLinkedList implements IEnumerable<> explicitly. So the "public Enumerator GetEnumerator()" method doesn't implement that interface.
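
The distinction can be sketched with a small, hypothetical type (`TwoPathList` is not from the PR): when the interface is implemented explicitly, calls made through `IEnumerable<int>` go to the explicit method and never touch the public struct-returning `GetEnumerator()`, so a linker that only sees interface callers finds the public method unused.

```csharp
using System.Collections;
using System.Collections.Generic;

// Hypothetical type that counts which enumeration path was taken.
public sealed class TwoPathList : IEnumerable<int>
{
    private readonly int[] _items = { 1, 2, 3 };

    public int StructPathCalls { get; private set; }
    public int InterfacePathCalls { get; private set; }

    // Public pattern-based method: reached only when the static type is TwoPathList.
    public Enumerator GetEnumerator()
    {
        StructPathCalls++;
        return new Enumerator(_items);
    }

    // Explicit interface implementation: a separate method that does not call the one above.
    IEnumerator<int> IEnumerable<int>.GetEnumerator()
    {
        InterfacePathCalls++;
        return ((IEnumerable<int>)_items).GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator() => ((IEnumerable<int>)this).GetEnumerator();

    public struct Enumerator
    {
        private readonly int[] _items;
        private int _index;

        internal Enumerator(int[] items)
        {
            _items = items;
            _index = -1;
        }

        public int Current => _items[_index];
        public bool MoveNext() => ++_index < _items.Length;
    }
}
```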

@tarekgh
Member

tarekgh commented Aug 7, 2020

thanks @eerhardt for explaining it.

@noahfalk
Member

noahfalk commented Aug 7, 2020

As we expected, minor improvement for callers doing foreach

My read of the numbers is showing that callers doing foreach got 6-16% worse, not better?
313.6 -> 334.5 (6.6% slower)
385.1 -> 449.7 (16.7% slower)
546.1 -> 625.3 (14.5% slower)
442.1 -> 473.1 (7% slower)
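
The percentages above are the relative change of the AFTER mean against the BEFORE mean. A one-line helper reproduces them; `Regression.PercentChange` is a hypothetical name used only for this sketch.

```csharp
public static class Regression
{
    // Relative change of 'after' versus 'before', in percent; a positive result means slower.
    public static double PercentChange(double before, double after) =>
        (after - before) / before * 100.0;
}
```

For instance, `Regression.PercentChange(313.6, 334.5)` evaluates to roughly 6.66.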

The OTel gains are great but I hope not to be in the position where we are making one set of users get worse performance so that only OTel gets better performance. Am I interpreting these numbers correctly, and if so do we understand why the change is making the foreach case slower?

Do you want me to PR my perf tests into dotnet/performance?

Yeah that would be helpful to ensure we can keep measuring this in the future and don't accidentally regress the performance.

@CodeBlanch
Contributor Author

@noahfalk Sorry I was talking specifically about the allocations, not speed. I don't know what's up with the differences in mean. I just ran a new set...

Before:

| Method | NumberOfActivities | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| EnumerateActivityTags | 5000 | 311.6 μs | 13.76 μs | 15.85 μs | 311.8 μs | 291.2 μs | 344.9 μs | 32.4519 | - | - | 280001 B |
| EnumerateActivityTagObjects | 5000 | 413.6 μs | 8.77 μs | 9.75 μs | 414.5 μs | 398.8 μs | 432.0 μs | 31.9865 | - | - | 280003 B |
| OTelHelperActivityTagObjects | 5000 | 491.5 μs | 24.52 μs | 28.23 μs | 485.2 μs | 460.4 μs | 551.7 μs | 31.8352 | - | - | 280057 B |
| EnumerateActivityLinks | 5000 | 595.1 μs | 18.87 μs | 21.73 μs | 597.7 μs | 560.3 μs | 637.7 μs | 46.2963 | - | - | 400000 B |
| OTelHelperActivityLinks | 5000 | 640.8 μs | 19.84 μs | 22.85 μs | 639.6 μs | 611.1 μs | 686.9 μs | 46.1957 | - | - | 400080 B |
| EnumerateActivityEvents | 5000 | 452.7 μs | 7.69 μs | 6.42 μs | 453.0 μs | 440.6 μs | 465.2 μs | 42.2794 | - | - | 360000 B |
| OTelHelperActivityEvents | 5000 | 556.5 μs | 27.76 μs | 29.71 μs | 562.1 μs | 491.9 μs | 601.4 μs | 42.9688 | - | - | 360072 B |
| EnumerateActivityLinkTags | 5000 | 387.8 μs | 12.29 μs | 13.66 μs | 385.4 μs | 368.2 μs | 417.2 μs | 27.9605 | - | - | 240000 B |
| OTelHelperActivityLinkTags | 5000 | 196.9 μs | 4.99 μs | 5.74 μs | 195.6 μs | 188.7 μs | 209.1 μs | - | - | - | - |

After:

| Method | NumberOfActivities | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| EnumerateActivityTags | 5000 | 315.9 μs | 7.30 μs | 8.41 μs | 315.4 μs | 299.6 μs | 332.2 μs | 33.0882 | - | - | 280001 B |
| EnumerateActivityTagObjects | 5000 | 414.6 μs | 4.45 μs | 3.95 μs | 414.8 μs | 406.8 μs | 420.3 μs | 27.0270 | - | - | 240000 B |
| OTelHelperActivityTagObjects | 5000 | 222.6 μs | 16.22 μs | 18.68 μs | 223.2 μs | 191.9 μs | 255.6 μs | - | - | - | 1 B |
| EnumerateActivityLinks | 5000 | 663.1 μs | 9.76 μs | 9.13 μs | 663.7 μs | 647.7 μs | 677.7 μs | 40.7609 | - | - | 360000 B |
| OTelHelperActivityLinks | 5000 | 273.8 μs | 5.85 μs | 6.51 μs | 273.1 μs | 264.3 μs | 289.4 μs | - | - | - | - |
| EnumerateActivityEvents | 5000 | 461.4 μs | 9.92 μs | 11.02 μs | 462.0 μs | 443.7 μs | 484.4 μs | 37.1094 | - | - | 320000 B |
| OTelHelperActivityEvents | 5000 | 254.4 μs | 5.46 μs | 6.28 μs | 254.6 μs | 245.8 μs | 270.3 μs | - | - | - | - |
| EnumerateActivityLinkTags | 5000 | 385.2 μs | 5.47 μs | 4.57 μs | 386.4 μs | 373.4 μs | 390.0 μs | 27.4390 | - | - | 240000 B |
| OTelHelperActivityLinkTags | 5000 | 195.5 μs | 2.28 μs | 1.90 μs | 195.7 μs | 191.9 μs | 199.4 μs | - | - | - | - |

Some of them are more or less identical (311.6 μs vs 315.9 μs), some are a bit off. I don't have an explanation for this currently. Could just be what my system is doing while the benchmarks run, could be a real issue? The change itself is essentially returning an explicit enumerator vs letting the compiler make one, any idea why there would be a measurable difference in that?

I'm not an expert in statistics, but if the variance falls within the error, they should be considered ~equal? 🤷

@tarekgh
Member

tarekgh commented Aug 7, 2020

@CodeBlanch can you share your test code so I may try running on my machine?

@CodeBlanch
Contributor Author

@noahfalk
Member

noahfalk commented Aug 7, 2020

I'm not an expert in statistics, but if the variance falls within the error, they should be considered ~equal? 🤷

I think if we were being good statisticians a t-test does that, but it's admittedly been a good while since my last stats class ; ) In practice, even if there was a statistically significant regression, we still might decide it is small enough that it isn't worth the time to pursue, or that the value of other perf improvements outweighs it.
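
As a rough sketch of the t-test idea, Welch's t-statistic can be computed from the Mean and StdDev columns plus a sample count. The tables in this thread do not show how many iterations each benchmark ran, so the sample counts passed in are an assumption the caller has to supply; `WelchTest` is a hypothetical helper, not part of BenchmarkDotNet.

```csharp
using System;

public static class WelchTest
{
    // Welch's t-statistic for two samples summarized by mean, standard deviation and count.
    // A larger absolute value means the difference in means is less likely to be noise.
    public static double TStatistic(
        double mean1, double stdDev1, int n1,
        double mean2, double stdDev2, int n2)
    {
        double standardError = Math.Sqrt(stdDev1 * stdDev1 / n1 + stdDev2 * stdDev2 / n2);
        return (mean2 - mean1) / standardError;
    }
}
```

Turning the statistic into a p-value additionally needs the Welch-Satterthwaite degrees of freedom and a t-distribution table, which are omitted here.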

any idea why there would be a measurable difference in that?

A good way to get more data is to use the EtwProfiler or EventPipeProfiler attributes on your benchmark. You can then analyze the traces in PerfView or share the traces with us. There is also the Disassembler which is sometimes useful to analyze and compare the code being generated.

Another area to be wary of is sources of background noise in measurements, such as CPUs that throttle performance in response to heat, background activities such as downloads, installations, and virus scans, and other services or VMs. For everything that BenchmarkDotNet does to reduce noise, it is still only effective against noise that varies within the time interval it is measuring. Lower frequency noise, like a background download that runs for 5 minutes, might encompass an entire BDN run. In an ideal world we'd be able to eliminate all those sources of noise, but often it is easier to control for it a bit by running the baseline and new version of the benchmark in alternating order several times in a row. In the easy case the numbers for each run are repeatable. You might also suddenly see that some runs are notably slower/faster, in which case I usually assume the higher performing runs are the ones that had the least interference from background load or hardware throttling, and these are the ones that will be most fairly comparable.
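
The alternating-order advice can be illustrated with a crude Stopwatch harness that interleaves the two versions and keeps the fastest round of each. This is only a sketch of the scheduling idea with hypothetical names (`InterleavedTiming.BestOf`); in practice the alternating runs would be whole BenchmarkDotNet runs rather than single delegate calls.

```csharp
using System;
using System.Diagnostics;

public static class InterleavedTiming
{
    // Runs baseline and candidate in alternating order and returns the fastest time seen for each,
    // on the assumption that the fastest round had the least background interference.
    public static (TimeSpan Baseline, TimeSpan Candidate) BestOf(
        Action baseline, Action candidate, int rounds)
    {
        TimeSpan bestBaseline = TimeSpan.MaxValue;
        TimeSpan bestCandidate = TimeSpan.MaxValue;

        for (int i = 0; i < rounds; i++)
        {
            bestBaseline = Min(bestBaseline, Time(baseline));
            bestCandidate = Min(bestCandidate, Time(candidate));
        }

        return (bestBaseline, bestCandidate);
    }

    private static TimeSpan Time(Action action)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    private static TimeSpan Min(TimeSpan a, TimeSpan b) => a < b ? a : b;
}
```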

Jacksondr5 pushed a commit to Jacksondr5/runtime that referenced this pull request Aug 10, 2020
@karelz karelz added this to the 5.0.0 milestone Aug 18, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020