
🐇 Use a segmented list to avoid LOH allocations in the formatter #43464

Closed
wants to merge 4 commits

Conversation

sharwell
Member

Two passes prior to this change and two passes after (each run found 0 diagnostics):

| Run | Diagnostics pass (ms) | Bytes allocated | Execution time (ms) |
|---|---|---|---|
| Before, pass 1 | 64232 | 78658653664 | 1023796.7926 |
| Before, pass 2 | 65474 | 79440019016 | 943722.7411 |
| After, pass 1 | 56274 | 83220563552 | 883581.9364 |
| After, pass 2 | 59726 | 83103917688 | 779339.9673 |

sharwell requested a review from a team as a code owner April 17, 2020 23:20
sharwell changed the title Use a segmented list to avoid LOH allocations in the formatter 🐇 Use a segmented list to avoid LOH allocations in the formatter Apr 17, 2020
sharwell marked this pull request as draft April 18, 2020 00:36
}

/// <summary>
/// Segmented list implementation, copied from Microsoft.Exchange.Collections.
Member

safe to license as per dotnet/MIT license?

Member Author
sharwell commented Apr 18, 2020

I'm assuming yes since we put it in PerfView already.

/// </summary>
/// <typeparam name="T">The type of the list element.</typeparam>
/// <remarks>
/// This class implement a list which is allocated in segments, to avoid large lists to go into LOH.
Member

"to avoid large lists to go into LOH" is clunky sounding :)

/// Copy to Array
/// </summary>
/// <returns>Array copy</returns>
public T[] UnderlyingArray => ToArray();
Member

why do we need both UnderlyingArray and ToArray? The name is also misleading. Underlying makes it sound like the actual internal array is being returned.
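For context, since the backing store is segmented, a property like this can only ever hand back a flattened copy, roughly along these lines (a minimal sketch; segment layout and names are assumptions, not the PR's actual code):

```csharp
using System;

internal static class SegmentedCopy
{
    // Sketch only: flattens fixed-size segments into a single contiguous array.
    // The segment layout is an assumption, not the actual SegmentedList<T> internals.
    public static T[] Flatten<T>(T[][] segments, int segmentSize, int count)
    {
        var result = new T[count];
        var copied = 0;
        for (var segment = 0; copied < count; segment++)
        {
            var toCopy = Math.Min(count - copied, segmentSize);
            Array.Copy(segments[segment], 0, result, copied, toCopy);
            copied += toCopy;
        }
        return result;
    }
}
```

Note that for a large list the flattened copy itself lands back on the LOH, which is another reason to keep this off the public surface unless a caller really needs it.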


if (newSegmentIndex != oldSegmentIndex)
{
_items[oldSegmentIndex] = null;
Member

pity to lose the array to GC. should it be pooled?

Member Author

It's likely gen 0 by the time it's collected. Pooling may be more expensive with write barriers, plus we don't need to worry about LOH like before.
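For reference, the pooling alternative being suggested would look roughly like the sketch below (using ArrayPool&lt;T&gt;.Shared; illustrative only, and not what this PR does):

```csharp
using System.Buffers;

internal static class SegmentPool
{
    // Sketch of the pooled-segment alternative discussed above; not the approach taken here.
    // Rent may return an array longer than requested, so the logical segment size would
    // need to be tracked separately.
    public static T[] Rent<T>(int segmentSize)
        => ArrayPool<T>.Shared.Rent(segmentSize);

    // Clearing on return matters for reference types so the pool does not keep objects alive,
    // which is part of the extra cost weighed against simply letting gen 0 collect the segment.
    public static void Return<T>(T[] segment)
        => ArrayPool<T>.Shared.Return(segment, clearArray: true);
}
```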

/// <remarks>
/// This class implement a list which is allocated in segments, to avoid large lists to go into LOH.
/// </remarks>
internal sealed class SegmentedList<T> : ICollection<T>, IReadOnlyList<T>
Member

super surprising that this is a mutable list, but doesn't implement IList

Member

i would also be ok with this not implementing any interfaces. it would be nice to ensure that things like iteration are alloc-free and that sort of thing.
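For what it's worth, the usual way to get alloc-free iteration is a public struct enumerator that foreach binds to directly, something like the sketch below (field names and segment layout are assumptions, not this PR's code):

```csharp
// Illustrative sketch only: a struct enumerator avoids the IEnumerator<T> allocation
// that the interface-based foreach path would incur.
internal struct SegmentedEnumerator<T>
{
    private readonly T[][] _segments;
    private readonly int _count;
    private readonly int _segmentSize;
    private int _index;

    public SegmentedEnumerator(T[][] segments, int count, int segmentSize)
    {
        _segments = segments;
        _count = count;
        _segmentSize = segmentSize;
        _index = -1;
    }

    public T Current => _segments[_index / _segmentSize][_index % _segmentSize];

    public bool MoveNext() => ++_index < _count;
}
```

If the list's own GetEnumerator returned a struct like this, foreach would never touch IEnumerable&lt;T&gt;; only the explicit interface implementations would allocate an enumerator object.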

/// <param name="index"></param>
/// <param name="slot"></param>
/// <returns></returns>
public T[] GetSlot(int index, out int slot)
Member

would prefer to make this as private as possible and not expose internals like this unless we def have a consumer for it.
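For readers skimming the diff: GetSlot presumably maps a flat index to a (segment, offset) pair, which with a power-of-two segment size reduces to a shift and a mask. A sketch under that assumption (the constants are illustrative, not the actual implementation):

```csharp
internal static class SegmentMath
{
    // 4096 elements per segment keeps a segment of object references at 32 KB on 64-bit,
    // comfortably under the ~85,000-byte LOH threshold. The exact size is an assumption.
    private const int SegmentShift = 12;
    private const int SegmentSize = 1 << SegmentShift;
    private const int OffsetMask = SegmentSize - 1;

    public static (int Segment, int Offset) Locate(int index)
        => (index >> SegmentShift, index & OffsetMask);
}
```

Keeping the mapping this cheap is what makes it tolerable on the indexer's hot path, but as noted above it doesn't need to be public for that.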

}

/// <summary>
/// Appends a range of elements from anothe list.
Member

Suggested change
- /// Appends a range of elements from anothe list.
+ /// Appends a range of elements from another list.


if (_capacity < minCapacity)
{
EnsureCapacity(minCapacity);
Member

feels like we could just have callers call EnsureCapacity, and EnsureCapacity can early bail if necessary.
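I.e. something along the lines of the sketch below, so the guard lives in one place and call sites just call EnsureCapacity unconditionally (field names and the doubling growth policy are assumptions, not the PR's code):

```csharp
using System;

internal sealed class CapacityExample
{
    private int _capacity;

    public void EnsureCapacity(int minCapacity)
    {
        if (_capacity >= minCapacity)
        {
            return; // early bail: already large enough, so callers need no guard of their own
        }

        // Grow by doubling (illustrative policy), never below the requested minimum.
        _capacity = Math.Max(minCapacity, Math.Max(16, _capacity * 2));
        // ...allocate the additional segments up to the new capacity here...
    }
}
```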

/// <summary>
/// Returns the enumerator.
/// </summary>
IEnumerator<T> IEnumerable<T>.GetEnumerator()
Member

Suggested change
- IEnumerator<T> IEnumerable<T>.GetEnumerator()
+ IEnumerator<T> IEnumerable<T>.GetEnumerator() => GetEnumerator();

/// <param name="item">Element to check.</param>
bool ICollection<T>.Contains(T item)
{
throw new NotImplementedException("This method of ICollection is not implemented");
Member

why?

}

Array.Copy(_items[lastSegment], 0, _items[lastSegment], 1, lastOffset);
_items[lastSegment][0] = save;
Member

so i'm basically glossing over these sections. but i assume they're correct :)
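For anyone else skimming: the quoted lines look like the tail of an insert-style shift, where elements move right by one and the last element of each segment is carried into slot 0 of the next. Roughly (a sketch under an assumed layout, not the PR's code):

```csharp
using System;

internal static class SegmentShift
{
    // Sketch only: shifts elements right by one across fixed-size segments, working
    // backwards from the last occupied segment so nothing is overwritten before it is saved.
    public static void ShiftRight<T>(T[][] segments, int segmentSize, int firstSegment, int lastSegment, int lastOffset)
    {
        for (var segment = lastSegment; segment > firstSegment; segment--)
        {
            // The last element of the previous segment spills into this segment.
            var save = segments[segment - 1][segmentSize - 1];

            // Only the occupied prefix of the final segment needs to move.
            var length = segment == lastSegment ? lastOffset : segmentSize - 1;
            Array.Copy(segments[segment], 0, segments[segment], 1, length);

            segments[segment][0] = save;
        }

        // The first segment is then shifted from the insertion point onward (omitted here).
    }
}
```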

@CyrusNajmabadi
Member

CyrusNajmabadi left a comment

I like it. but i would prefer SegmentedList be as minimal as possible. if we can ifdef out or remove whatever parts of its public surface area we don't need, that would make me happy :)

@mjsabby
Contributor

mjsabby commented Apr 18, 2020

@danmosemsft It'd be great if corefx would just provide something like this (collections that don't allocate on the LOH; a reality of running on .net). This code was copied from perfview which was copied from exchange, which was copied from sharepoint, which was copied from office shared ...

@danmoseley
Member

That's interesting. @stephentoub has this been discussed before?

@sharwell
Member Author

@danmosemsft I discussed this a few times casually with Stephen Toub. The thing that's most interesting to me is the variety of data types which can be ideal for different applications. The two of most interest to me currently are segmented arrays and B+-trees.

@CyrusNajmabadi
Member

> @danmosemsft I discussed this a few times casually with Stephen Toub. The thing that's most interesting to me is the variety of data types which can be ideal for different applications. The two of most interest to me currently are segmented arrays and B+-trees.

I feel like it's important to have this data type, not because it's a particularly important one in its own right, but primarily to deal with the LOH problem. i.e. an array truly is the best representation here for our needs, but this is a workaround for very poor behavior of something out of our control. That seems like a pity, as there are many cases where "large" does not equate to "long lived".

@stephentoub
Member

stephentoub commented Apr 19, 2020

@Maoni0, do you have an opinion on such data structures working around the LOH vs anything we may be able to improve in the LOH?

@mjsabby
Contributor

mjsabby commented Apr 19, 2020

If we're taking data structure requests, I'd like to +1 the B+-trees, and also add the readonly NativeHashTable from R2R images (@jkotas may know more about that), and build-time generation of perfect hash functions for use in read-only hash tables.

@sharwell
Member Author

@mjsabby One of my side projects is work on B+-trees. tunnelvisionlabs/dotnet-trees

@jkotas
Member

jkotas commented Apr 20, 2020

Is the improvement caused by avoiding LOH, or is the improvement actually coming from reduced copying? I suspect that it is the latter.

The threshold for LOH is configurable using COMPlus_GCLOHThreshold. You can set this to above the maximum array byte size in your benchmark to see the impact of LOH policies alone. I am curious what it is going to tell us.
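(For anyone reproducing this, that means setting the variable before launching the benchmark; the value is in bytes, and the number below is only an illustration that should exceed the largest array the run allocates:)

```cmd
:: COMPlus_GCLOHThreshold is read at process start; the value is in bytes.
:: 100000000 is illustrative only; pick something above the largest array in the run.
set COMPlus_GCLOHThreshold=100000000
```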

@jkotas
Member

jkotas commented Apr 20, 2020

> reduced copying

Note that copying of large arrays that contain object references has several costs:

  1. Copying the memory alone
  2. Write barrier (setting the cards for the copied memory)
  3. Gen0/1 GC having to visit the cards.

In certain situations, the cost of 3. can be more than the cost of 1. + 2. It is not easy to see the true cost of 3. because it is hidden in the GC pause time, which is hard to attribute back to lines of code.

We may consider shifting some of the cost from 3. to 2. to make it easier to attribute, by having more precise bulk write barriers (e.g. under a config switch). It is probably not a good tradeoff for a typical app, but it may be useful for performance analysis or in environments that are willing to trade raw throughput for smaller GC pause times.

@jkotas
Member

jkotas commented Apr 20, 2020

Also, it may be useful to do more than 2 iterations. Benchmarks that are sensitive to GC behavior tend to be very noisy.

@sharwell
Member Author

sharwell commented Apr 20, 2020

> Is the improvement caused by avoiding LOH, or is the improvement actually coming from reduced copying? I suspect that it is the latter.

The LOH improvements tend to impact downstream scenarios more than localized testing. For example, LOH allocations inside Visual Studio (which tends to run with less than 500M free VM) often result in frequent Gen 2 GC with observable pauses, while SOH allocations are better able to avoid that.

> Also, it may be useful to do more than 2 iterations. Benchmarks that are sensitive to GC behavior tend to be very noisy.

Each iteration runs the formatter on ~35000 files. The allocation numbers tend to be more predictable than the CPU numbers though.

@jkotas
Member

jkotas commented Apr 20, 2020

> Each iteration runs the formatter on ~35000 files

So why is the execution time for the second iteration significantly lower? I would expect the two numbers to be much closer to each other if they are averages.

I assume that you are running this on .NET Framework. It may be interesting to run it on .NET Core 3.1 too.

@CyrusNajmabadi
Member

> but it may be useful for performance analysis or in environments that are willing to trade raw throughput for smaller GC pause times.

Our primary environment is Visual Studio. This is a user-facing app where low latency matters far more than throughput for keeping the experience responsive. GC pauses (esp LOH ones) are particularly devastating to the experience. This has been one of the main reasons we've been moving things out of proc over the years. The GC times in situ are just so problematic. By moving to another process, we can have independent GCs where the ones in our OOP server don't cause pauses in the VS proc.

@CyrusNajmabadi
Member

Note: having a latency-tuned GC (like what Go has) would be absolutely fantastic for part of our domain.

@mjsabby
Contributor

mjsabby commented Apr 20, 2020

LOH tuning is not always possible, so having these data structures standardized would be helpful.

sharwell closed this Feb 18, 2021