Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SequenceReader<T>: Round 2 #69

Merged
merged 4 commits into from
Oct 23, 2018
Merged
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
51 changes: 48 additions & 3 deletions 2018/System.Buffers.SequenceReader/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,9 @@ Status: **Needs more work** |
[Issue](https://github.com/dotnet/corefx/issues/32588) |
[Video](https://www.youtube.com/watch?v=0WN1OXKBMl8)

## Notes
## Session 1

### Notes

* Two types are being proposed here:
- `SequenceReader<T>`: wraps a `ReadOnlySequence<T>`
Expand All @@ -27,7 +29,7 @@ Status: **Needs more work** |
* None of this is accessible to languages that do not support `ref structs`,
such as Visual Basic.

## API feedback
### API feedback

* `Rewind` seems a bit weird
* Can we unify `Advance` and `Rewind` into `Seek()` where the offset can be
Expand All @@ -38,7 +40,7 @@ Status: **Needs more work** |
cost of zeroing out the passed in scratch buffer as it's rarely used. Counter
argument from @stephentoub:
> If `Peek` returned the number of `T`s actually peeked, why would the span
> need to be zero'd out beforehand? (It looks like it's currently defined to
> need to be zeroed out beforehand? (It looks like it's currently defined to
> return a `ReadOnlySpan<T>`, and presumably that logically incorporates that
> by being of the correct length?)
* `IsNext` is a bit weird. Maybe we should we use `StartsWith` and
Expand All @@ -49,3 +51,46 @@ Status: **Needs more work** |
- We shouldn't name them `Advance` as they could conflict with the existing
`Advance(long)` method when `T` is `long`.

## Session 2

### SequenceReader\<T>

* Should we remove the `unamanged` constrained and replace with a `.ctor` check?
- Done for pinning and being able to take a pointer?
* Can we have some sample code that uses a `ReadOnlySpan<T>`-based reader vs. a
`ReadOnlySequence<T>`-based reader?
- We can then compare the per gains
- We can then compare the usability gains
terrajobst marked this conversation as resolved.
Show resolved Hide resolved
* `AtLastSegment`. The logic outlined in the comment is wrong, you need to honor
position, and not check whether there is next segment.
* `Allocator` seems odd. If we need it, should we promote the delegate or move
it elsewhere? We can't use a `Func` due to the generic instantiation issue
with `Span<T>`.
terrajobst marked this conversation as resolved.
Show resolved Hide resolved
- Should the allocator be passed in as argument? Storing it as a field might
cause additional memory barrier
terrajobst marked this conversation as resolved.
Show resolved Hide resolved
* `AreAnyNext` -> `IsNextAny`
* Suggestion from last time was to rename `IsNext` to `StartsWith` but the
latter isn't clear whether it's relative to the current position vs. the
buffer's start.
* We may want to make `advancePastDelimiter` different methods to avoid the perf
issue due to the extra branches.
* The struct has to be passed by `ref` or you have bugs. Also, since it's large,
passing it by value would also tank perf.

### SequenceReaderExtensions

* All methods have a limit of reading 128 bytes maximum
terrajobst marked this conversation as resolved.
Show resolved Hide resolved
- This allows stack allocated buffers to be used and seemed rationale
- Might be too small for specialized scenarios, such as many leading zeros
- We cannot really expose APIs with sizes as this would likely destroy perf
* No APIs can produce strings, we should probably add that as dealing with
things going across buffers is hard and we don't want folks to go through an
intermediary array.

## BufferExtensions

* Can't we make this an instance method?
Copy link
Member

@benaadams benaadams Oct 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why force a defensive copy span allocation if its not required?

// returns false if data is more than one span
public bool TryAsSpan<T>(out ReadOnlySpan<T> span);
// copyBuffer should be >= .Length
public ReadOnlySpan<T> AsSpan<T>(Span<T> copyBuffer);

Copy link
Member

@benaadams benaadams Oct 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also should there be a return flag to indicate if the copyBuffer was used?

Otherwise there is ambiguity. Is there is a writable version (or copy) in the copyBuffer? Can the copyBuffer be reused prior to consuming the ReadOnlySpan (i.e. is it part of the ReadOnlySpan or was it unused).

Does it throw an exception if the copyBuffer is too small (assume so); but what if the copy buffer is not needed because the data is a single span, but the copy buffer is too small; does that still throw an exception?

Copy link
Member

@benaadams benaadams Oct 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To resolve the ambiguity; especially around exceptions, maybe it should be:

// returns false if data is more than one span
public bool TryAsSpan<T>(out ReadOnlySpan<T> span);
// destination should be >= .Length or exception 
public void CopyTo(Span<T> destination);

Then the pattern would be something like:

T[] scratch = null;

if (!seq.TryAsSpan(out ReadOnlySpan<T> roSpan)
{
    scratch = ArrayPool<T>.Shared.Rent(seq.Length)
    seq.CopyTo(scratch);
    roSpan = scratch;
}

// Do something with roSpan

if (scratch != null) ArrayPool<T>.Shared.Return(scratch);

Copy link
Member

@stephentoub stephentoub Oct 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also should there be a return flag to indicate if the copyBuffer was used? Otherwise there is ambiguity

It seems really rare that you'd care, though, right? Presumably you'd just assume you can't use the supplied scratch buffer again until you're done with the returned span. And in the rare case where you actually cared, you could use Span.Overlaps to determine whether the returned span is part of the scratch buffer or not.

That said, I agree in general that this pattern is confusing, and it'd be better not to force the caller to allocate until they know they have to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Problem is if you want to modify the returned data (mask/unmask etc) then with

public ReadOnlySpan<T> AsSpan<T>(Span<T> copyBuffer);

You need one scratch buffer for the call (which may or may not be used); then a second scratch buffer to copy the returned span into; whereas it would be easier just to jump straight to .CopyTo

For just reading; then the scratch buffer is unnecessary for the fast-path; but you need a .Length sized buffer allocated just in case and if the whole ReadOnlySequence is being read then the scratch buffer is unlikely to be used for more than a single read (e.g. returned to pool prior to awaiting for more data).

Also question on what happens if .Length is greater than 2Bn elements.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Branchless made it worse:

BEFORE
           TwoStage_Array | 17.87 ns | 0.3825 ns | 0.6994 ns | 18.21 ns |  

AFTER
           TwoStage_Array | 18.73 ns | 0.3999 ns | 0.7894 ns | 19.18 ns |

Not terribly surprising- ripping the span out being costly is the only reason SequenceReader is ref after all. :) Flipping to return a span made it slightly better, but still slower than TryAsSpan and is pretty awkard imho:

           TwoStage_Array | 17.03 ns | 0.3595 ns | 0.5597 ns | 17.24 ns |
            ReadOnlySpan<byte> span = _arraySequence.TryAsSpan(out bool success);
            if (!success)
            {

why would we even need SliceOrCopy

I'm not sure what you're asking here? Having the SliceOrCopy methods period is due to the performance improvement in the normal single segment case. Having one without a delegate is again about performance- as crossing segments is so unlikely, new T[] is a good-enough solution for many. Given that it is expected to be common and is faster it makes sense to have an "even-faster" overload for "default" behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are the results I get copying and pasting the benchmarks you shared:

            Method |     Mean |     Error |    StdDev |   Median |
------------------ |---------:|----------:|----------:|---------:|
 SliceOrCopy_Array | 18.25 ns | 0.3942 ns | 0.6695 ns | 18.04 ns |
    TwoStage_Array | 17.58 ns | 0.1277 ns | 0.1132 ns | 17.57 ns |

I ran it multiple times, the results always come out relatively the same, with the TwoStage_Array being faster than SliceOrCopy_Array. Note that that's with:

private static ReadOnlySequence<byte> _arraySequence = new ReadOnlySequence<byte>(new byte[10_000]);

since it wasn't clear to me from what you wrote what _arraySequence was. If instead change it to be an instance, the versions end up being almost the same perf-wise for me:

            Method |     Mean |     Error |    StdDev |
------------------ |---------:|----------:|----------:|
 SliceOrCopy_Array | 17.47 ns | 0.3707 ns | 0.3468 ns |
    TwoStage_Array | 17.31 ns | 0.3631 ns | 0.4036 ns |

We're talking nanosecond differences here, in either direction. Maybe it's just minor differences on the machine. Maybe it's the processor involved. Maybe it's something else. But let's be very careful not to add lots of APIs because we found a microbenchmark showed an improvement on one machine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you're asking here?

I mean, why have these helpers at all? They're one-liners, using public APIs that anyone can use. When I run the same test with:

[Benchmark]
public ReadOnlySpan<byte> OpenCoded()
{
    ref ReadOnlySequence<byte> r = ref _arraySequence;
    return r.IsSingleSegment ? r.First.Span : r.ToArray();
}

for example, I get:

            Method |     Mean |     Error |    StdDev |
------------------ |---------:|----------:|----------:|
 SliceOrCopy_Array | 17.25 ns | 0.3703 ns | 0.6770 ns |
         OpenCoded | 16.41 ns | 0.1735 ns | 0.1538 ns |

So, if we're going for utmost in throughput, worried about developers concerned about this level of microoptimization, and if these numbers are to be believed, then why have the method at all? Let the developer use the ReadOnlySequence APIs however they need, rather than trying to come up with the various "helper" APIs on top of them that are all effectively one-liners.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the complete code:

https://gist.github.com/JeremyKuhne/2eb803b80058fa8b108c81352bc31ba0

I ran it multiple times, the results always come out relatively the same,

Ditto:

                     Method |     Mean |     Error |    StdDev |   Median |
--------------------------- |---------:|----------:|----------:|---------:|
          SliceOrCopy_Array | 14.87 ns | 0.3224 ns | 0.7343 ns | 15.21 ns |
 SliceOrCopy_AllocatorArray | 16.30 ns | 0.3458 ns | 0.4247 ns | 16.41 ns |
             TwoStage_Array | 17.82 ns | 0.3756 ns | 0.5736 ns | 18.05 ns |

But let's be very careful not to add lots of APIs because we found a microbenchmark showed an improvement on one machine.

Well, sure, let's not be blind. But lets also not throw this out because some machines don't get a boost. For the record, my run is on a 6 core i8700K with a 2.1 .NET Core console app and the current BenchmarkDotNet (CTRL-F5ed of course) on Windows.

They're one-liners, using public APIs that anyone can use.

They are simple because these are meant to validate the overhead involved. The whole point of the Allocator is to allow plugging in whatever you want- which may be significantly more complicated. The ArrayPool tracker I mentioned above is one such example. I plan to move onto

The simple SliceOrCopy_Array with no arguments as it makes writing the typical case easy. It also helps discoverability of the fact that .ToArray() is potentially inefficient. I wouldn't toss it simply because someone can easily rewrite the code.

For the record, here is what I get plugging in your OpenCoded:

                     Method |     Mean |     Error |    StdDev |   Median |
--------------------------- |---------:|----------:|----------:|---------:|
                  OpenCoded | 14.52 ns | 0.3108 ns | 0.5018 ns | 14.72 ns |
          SliceOrCopy_Array | 14.95 ns | 0.3185 ns | 0.5904 ns | 15.21 ns |

Not as much as an improvement for me.

Having these methods does not preclude you from writing your own logic. Having them, however, allows easy piping of the Allocator approach (e.g. I can take one as input to SequenceReader), facilitates reusability of solutions, and makes things significantly faster on some machines. All of that makes it pretty compelling to me.

If you could try to run again with the code in the gist and share your configuration perhaps we might find out what it is that makes your run slower than mine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having them, however

And is slower on some machines apparently, and clutters the surface area (which can actually make it harder for a consumer of the API to do the "right thing'"), and adds surface area that may encourage developers to allocate when they don't actually need to, and a developer who's really concerned about eeking out the best possible throughout arguably won't use them, anyway, and for everyone else a nanosecond difference doesn't matter, and they're barely shorter than just writing out the ternary manually. As for "piping the allocator", that could still be used by ToArray if it's desirable, so that doesn't seem to actually change anything.

I will try the shared code again this weekend on a few machines. But at the moment I'm not seeing the benefit here (for ToArray, yes, for SliceOrCopy, no). I would rather we teach developers about the few core APIs that are there and how to best use thm efficiently, saving additional APIs we add for when they're truly meaningful (like the general reader APIs you've been working on).


## UnsafeGetPositionWithinSegment

* Move to `SequenceMarshal`