-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add 'split' support for ReadOnlySpan<char> similar to string #934
Comments
Presumed usage would be like this? var splitter = span.Split(',', StringSplitOptions.RemoveEmptyEntries);
while (splitter.TryMoveNext(out var slice))
{
//....
} Yeah, looks good. |
Would it be possible to implement the enumerable pattern, even if it's not possible to implement I think the interface would look something like: public static SpanSplitEnumerable Split(this ReadOnlySpan<char> span, char seperator);
public ref struct SpanSplitEnumerable
{
public SpanSplitEnumerator GetEnumerator();
}
public ref struct SpanSplitEnumerator
{
public bool MoveNext();
public ReadOnlySpan<char> Current { get; }
} That way, the usage could be: foreach (var slice in span.Split(','))
{
…
} If indexing is important, something like |
Sorry, didn't see this issue before posting https://github.com/dotnet/corefx/issues/21395#issuecomment-359802015. As mentioned, I'd love to see something like Google Guava's |
It would be great if the split worked also with Span and returned ReadOnlyMemory<ReadOnlyMemory>, ReadOnlySpan<ReadOnlyMemory> |
Clarifying the comment so the generic types show up: |
Simple implementation for https://gist.github.com/LordJZ/92b7decebe52178a445a0b82f63e585a The sentinel is a "clever trick" to avoid having a boolean field, can be replaced with a backing field. |
We needed features like this on top of ASP.NET's It would be great to point customers to something in the BCL instead of (say) adding /cc @davidfowl |
@ahsonkhan if this proposal makes sense to you, perhaps you could help shepherd it to review, as it seems there would be volunteers to implement it? |
cc @JeremyKuhne who has been doing thinking about low allocation string operations. |
StringBuilder added an enumerator for the next version that we should consider when deciding on a pattern for enumerating spans. https://github.com/dotnet/coreclr/blob/master/src/System.Private.CoreLib/shared/System/Text/StringBuilder.cs#L587 |
Let me make sure the API shape is clear. Incorporating the recent feedback (to get foreach support, make it generic), here is the API: public static SpanSplitEnumerable<T> Split(this ReadOnlySpan<T> span, T seperator,
StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}
public ref struct SpanSplitEnumerable<T> where T : IEquatable<T>
{
public SpanSplitEnumerator<T> GetEnumerator();
}
public ref struct SpanSplitEnumerator<T> where T : IEquatable<T>
{
public bool MoveNext();
public ReadOnlySpan<T> Current { get; }
} |
@KrzysztofCwalina this is the enumerating spans issue we discussed yesterday. |
Does |
I suppose a |
Should the enumerator and the enumerable be combined? i.e. what's the reason to have two types? |
@KrzysztofCwalina I think it makes the API cleaner and easier to understand. The It there any advantage in combining them, apart from decreasing the size of the API surface? |
We don't think we want these APIs:
If you care about allocations, then you want a different API. And if you don't care about allocations, well, then you can just use So what API would we like to see? It could be a struct-based enumerator that allows the consumer to |
I readily concur with this. If I don't care about allocations, I'll simply be using Suggest If I want to use |
Can you use |
|
@terrajobst, the API that @svick proposed above does not seem to allocate. Am I missing something? Or are you saying that we want both convenience and no allocations? @svick, I wonder how important it is for this enumerator to be easy to understand. The classic enumerator pattern has two types for many reasons that are less and less applicable once you deal with by ref structs that cannot implement enumerator interfaces and get copied when passed around. I do agree we should discuss pros and cons and maybe indeed it's better to have two types, but I wanted to open the discussion as it seems such an overkill to add two by ref types just so we can split. |
@KrzysztofCwalina I agree that my argument of "this version of the API is slightly easier to understand" is fairly weak in this case. But in my opinion, the argument of "this version of the API is slightly smaller" is even weaker, which is why I prefer to have two types. But ultimately it doesn't make that much of a difference, as long as I'll be able to have |
@svick I'm suspect a normal And what is the purpose of this again? That's to prevent allocations in parsing code in tight loops, where all we're doing is manipulating streams of text without ever changing them. |
@schungx There's nothing really new to come up with. For example, |
The APIs, as suggested here, don't allocate and are essentially what was recommended. Maybe it was missed since it wasn't at the top. Will copy it over to the top post. Making it into a single type: public static SpanSplitEnumerator<T> Split(this ReadOnlySpan<T> span, T seperator,
StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}
public ref struct SpanSplitEnumerator<T> where T : IEquatable<T>
{
public SpanSplitEnumerator GetEnumerator() { return this; }
public bool MoveNext();
public ReadOnlySpan<T> Current { get; }
} |
|
Are we going to have an overload that takes ReadOnlySpan as the delimiter as well? |
I tried implementing the APIs above, and additionally a two more split methods in my personal repository. The types and methods themselves can be found in public static ReadOnlySpanSplitEnumerator<T> Split<T>(this ReadOnlySpan<T> span, T delimiter,
StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}
public static ReadOnlySpanSplitBySequenceEnumerator<T> Split<T>(this ReadOnlySpan<T> span,
ReadOnlySpan<T> delimiter, StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}
public static ReadOnlySpanSplitByAnyEnumerator<T> SplitByAny<T>(this ReadOnlySpan<T> span,
ReadOnlySpan<T> delimiters, StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {} The first one replaces It's not a great code perhaps, but I'd love to open a PR for this issue (with the code above). Although the last two methods are not approved, at least I can open one for the regular |
A slightly different take on this. Considering that foreach (var range in span.Split(','))
{
var x = span.Slice(range);
} Perhaps that's too abstract for a public api. |
Did some experimenting with this as well. |
@jeffhandley, this has been languishing and I'd like for us to ship a solution in .NET 9. Can you help? |
Adding some comments here that I mentioned to folks offline. If we want to use the same enumerator type as a return value from all the different overloads, then we're essentially inventing a polymorphic system. (The data storage and lookup algorithm will depend on what overload is called, and we're abstracting all of those away behind a single projection.) Polymorphism in .NET is typically implemented using a base class and derived types, but since people in this thread have said that non-allocation is vital, we can't use the typical .NET mechanisms here where state is captured by derived types which contain the concrete implementations. So there are a few options:
There are pros and cons to all of these. My previous holdup was based on a concern that whatever enumerator shape we choose for v1 of this API will have consequences for overloads and capabilities we may wish to add in the future. That is, if we decide to ship a "minimal" v1 API and we ship whatever API shape happens to fall out of that, that may unintentionally bind our ability to improve this area in v2. So I want to be sure that when the v1 API is added, the choice of shape is deliberate and is done with an eye toward future proofing. As long as that is done, I'm happy. :) |
The most common use case I've seen here is to want to take an existing use of string.Split and be able to iterate through the results in a non-allocating manner. I think we should expose the APIs to do that in .NET 9, following the same structure as the Split span-based APIs we added in .NET 8, plus handling generics (in particular in support of public static SpanSplitEnumerator<T> Split<T>(this ReadOnlySpan<T> source, T separator);
public static SpanSplitEnumerator<T> Split<T>(this ReadOnlySpan<T> source, ReadOnlySpan<T> separator);
public static SpanSplitEnumerator<T> SplitAny<T>(this ReadOnlySpan<T> source, params ReadOnlySpan<T> separators);
public ref struct StringSplitEnumerator<T>
{
StringSplitEnumerator<T> GetEnumerator();
bool MoveNext();
Range Current { get; }
} To fully optimize that to the max we would likely need several different return types, and I don't believe that's warranted. The size of the returned enumerator isn't particularly important, as it's generally going to be at most one copy and it won't be put on the heap because it's a ref struct. The relevant overheads then will likely be for extra branches on each MoveNext to determine which path to take, and those should be few. Given all the work associated with split, I'm not particularly concerned about those costs. So I think we should keep this API simple and just have the single return type for all the overloads. We've been punting on this for years. We should just do it now. (Note the one use case the above generic overloads lack is support for splitting chars based on multiple strings... if we want to support that, we could also have a non-generic |
Any chance I could pick this up for implementation once the API is reviewed? I have a lot of free time to crack this out pretty quickly, especially given I'd just base in what was already (temporarily) merged. Spent a lot of time on this over the years :') |
Yes, thanks :) |
Is there a reason this isn’t called “EnumerateSplits”? Having “Split” return an enumerator, while it matches what many do with the result, doesn’t match the existing API of returning a complete result. |
I'd be fine calling it EnumerateSplits, matching what we exposed on Regex. But I don't have a good answer for what the corresponding SplitAny would be called. |
EnumerateAllSplits, EnumerateAnySplits... Although it looks too long. |
EnumerateSplitResult(s) ? |
public static class MemoryExtensions
{
public static SpanSplitEnumerator<T> Split<T>(this ReadOnlySpan<T> source, T separator);
public static SpanSplitEnumerator<T> Split<T>(this ReadOnlySpan<T> source, ReadOnlySpan<T> separator);
public static SpanSplitEnumerator<T> SplitAny<T>(this ReadOnlySpan<T> source, params ReadOnlySpan<T> separators);
public static SpanSplitEnumerator<T> SplitAny<T>(this ReadOnlySpan<T> source, SearchValues<T> separators);
public ref struct SpanSplitEnumerator<T>
{
public SpanSplitEnumerator<T> GetEnumerator();
public bool MoveNext();
public Range Current { get; }
}
} |
I'll get on it 🙂 |
Should I would like to use it like T[] Collect<T, TEnum>(TEnum e) where TEnum : IEnumerator<T>
{
var list = new List<T>();
while (e.MoveNext()) list.Add(e.Current);
return list.ToArray();
} |
It can't because it's a |
It's not being removed. It will be shipped as a preview feature instead, and several APIs in BCL have already adopted this feature. |
We have not implemented interfaces on any public ref structs, and we won't on non-experimental public ref structs as long as the language feature is in preview. We have no way to mark just the interface inheritance as experimental / preview, which means if the language feature were to change or disappear, we could be left with non-preview surface area we're unable to maintain. |
+1 on this overall proposal, I’m a fan of making splits into enumerators, and (hopefully) allocation-free. I’ve implemented similar here: https://github.com/clipperhouse/uax29.net |
Having watched the design review above, on Though I am not sure that satisfies the requirements mentioned in the discussion. IIUC, the goal was to a) allow (I appreciate that you likely feel the API has been sufficiently debated and thanks for indulging.) |
Fixed by #104534 |
Edited by @stephentoub on 6/26/2024:
Older proposals:
Previously approved API Proposal
Split off from https://github.com/dotnet/corefx/issues/21395#issuecomment-359342832
From @Joe4evr on January 21, 2018 23:16
From @ahsonkhan on January 22, 2018 01:36
From @Joe4evr on January 22, 2018 08:35
From @stephentoub on January 22, 2018 08:41
cc @KrzysztofCwalina, @stephentoub, @Joe4evr
The text was updated successfully, but these errors were encountered: