New UTF8String APIs #1751
Comments
Thanks. Some questions/comments:
|
I think Utf8String should be as close as possible to String. Strings are units of data: you want to store/compose them, use them as keys in hashtables and so on. It would also be useful to have Utf8Extensions that work with ReadOnlySpan and Span, for in-place mutating operations like Replace. Maybe add similar Utf16Extensions over spans of Char for completeness. |
@KrzysztofCwalina Answers below:
public Utf8String Remove(int startIndex);
public Utf8String Remove(int startIndex, int count);
public Utf8String Replace(Utf8String oldValue, Utf8String newValue);
The suggestions in 4. and 7. could bloat `Utf8Helper` a bit. Currently it handles code point en- and decoding, but also has an `IsWhiteSpace` method. Would it make sense to spin off methods dealing with operations on an existing `Utf8String` into a separate class? |
I wrote my own version precisely to work around that current limitation: https://github.com/migueldeicaza/NStack/blob/master/NStack/strings/ustring.cs The above implementation (ustring) is to a large extent a port of the Go string, and contains some useful features. Docs here: https://migueldeicaza.github.io/NStack/api/NStack/NStack.ustring.html
The |
From: https://blog.golang.org/strings
That is an interesting term for code point. cc @tarekgh |
Code point is the phrase in the standards. Just because someone else made something up is not a good reason to use it. Especially as there is an actual runic block. |
Some of us lack taste, but we try to learn from the masters. The same people that came up with "rune" came up with the encoding that revolutionized the Unicode world by introducing the UTF encoding. This is a great -and completely unrelated- read: |
I had a somewhat similar argument about big-endian network encoding when almost everything is little-endian. The only good reason seems to be "because a million years ago someone said it would be so". And now we have to flip everything that comes off the network... |
So, what do people think about the following:
// Ideally, the representation would be just like System.String,
// but this cannot be done without runtime support, and so would not work
// on existing runtimes.
// This representation is close. The main difference is that it can box.
public struct Utf8String : IEquatable<Utf8String>
{
byte[] _codingPoints;
public Utf8Span Slice(...) ...
}
public ref struct Utf8Span {
ReadOnlySpan<byte> _codingPoints;
}
// Existing type.
public class String {
public Utf16Span Slice(...)
}
public ref struct Utf16Span {
ReadOnlySpan<char> _codingPoints;
} |
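To make the two-type split above concrete, here is a minimal sketch of how a heap-friendly `Utf8String` and a stack-only `Utf8Span` could fit together. Everything beyond the names `Utf8String`, `Utf8Span` and `Slice` (the constructor taking a `string`, `Length`, `ToString`, the hashing strategy) is an assumption made for illustration, not the corefxlab design.

```csharp
using System;
using System.Text;

// Hypothetical sketch only: members are illustrative, not the proposed API.
public readonly struct Utf8String : IEquatable<Utf8String>
{
    private readonly byte[] _bytes;

    // Assumed convenience constructor: transcodes UTF-16 text to UTF-8 bytes.
    public Utf8String(string text) => _bytes = Encoding.UTF8.GetBytes(text);

    // Length in bytes (UTF-8 code units), not in code points.
    public int Length => _bytes?.Length ?? 0;

    // Returns a stack-only view over the same bytes: no copy, no allocation.
    // A real implementation would have to decide what to do when the range
    // cuts a multi-byte sequence in half; this sketch does not check.
    public Utf8Span Slice(int start, int length) =>
        new Utf8Span(new ReadOnlySpan<byte>(_bytes, start, length));

    // Ordinal, byte-wise equality.
    public bool Equals(Utf8String other) =>
        new ReadOnlySpan<byte>(_bytes).SequenceEqual(new ReadOnlySpan<byte>(other._bytes));

    public override bool Equals(object obj) => obj is Utf8String other && Equals(other);

    public override int GetHashCode()
    {
        var hash = new HashCode();
        foreach (byte b in _bytes ?? Array.Empty<byte>()) hash.Add(b);
        return hash.ToHashCode();
    }
}

// Stack-only view; being a ref struct, it cannot be boxed or stored on the heap.
public readonly ref struct Utf8Span
{
    private readonly ReadOnlySpan<byte> _bytes;

    public Utf8Span(ReadOnlySpan<byte> bytes) => _bytes = bytes;

    public int Length => _bytes.Length;

    public override string ToString() => Encoding.UTF8.GetString(_bytes);
}
```

With this shape, `new Utf8String("héllo").Slice(0, 1)` is a view over the first byte of the original array, and only `ToString()` allocates.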
If I follow the proposal, is the idea that we would have two data types, one heap-friendly, one stack-only? I like that plan. I would like to steal the idea that was used in So the idea would be to make Utf8String a class and runtime optimize it so:
|
Yes.
What would be the representation of the naive representation? An array inside a class? It would mean two GC objects per naive string. |
Yes, for the native representation it would cause an extra allocation. |
The difference between fast and slow span is very small. The difference between these two string representations would be very significant. I worry that people would just not want to use it. |
Is mapping an array inside a struct wrapper to a class too much of a stretch? e.g.
public struct Utf8String : IEquatable<Utf8String>
{
byte[] _codingPoints;
public Utf8Span Slice(...) ...
}
Newer
public class Utf8String : IEquatable<Utf8String>
{
byte _firstCodingPoint;
public Utf8Span Slice(...) ...
} |
@benaadams, I am not sure I follow. |
Type forwarding from struct to class; whether it would make things too weird |
@KrzysztofCwalina we have a high-performance version, it just does not go into the heap ;-) In a moment of insanity, but mostly because I am going to sleep now and what better end of the day than throwing a grenade and leaving, what if we had "interface IUtf8String" and have "struct SUtf8String : IUtf8String" and "Utf8String : IUtf8String" /me runs from @benaadams |
Need a |
I would prefer Utf8String to be a class if at all possible. ImmutableArray could not be a no-indirection class without VM support, since its elements may contain references that must be reported to the GC. Utf8String does not have this problem, so perhaps with some unsafe trickery it can be done without direct VM help. I am thinking of something like:
[StructLayout(LayoutKind.Sequential)]
public sealed class Utf8String : IEquatable<Utf8String> //, other interfaces
{
// same field layout as in String
private int m_stringLength;
private byte m_firstChar;
public static readonly Utf8String Empty = new Utf8String();
// never call publicly
private Utf8String() { }
// factory allocates System.String instance with the right data,
// but then unsafely casts it to Utf8String and patches its VTable.
public static Utf8String(byte[] data)
{
string stringInst;
fixed (char* charPtr = &data[0])
{
stringInst = new String(charPtr, 0, data.Length / 2);
}
Utf8String inst = Unsafe.Cast<Utf8String>(stringInst);
inst.m_stringLength = data.Length;
//unsafely replace VTable of inst with one from Utf8String.Empty
fixed (void** p1 = &inst.m_stringLength)
{
fixed (void** p2 = &Empty.m_stringLength)
{
// no idea if this is the right offset
p1[-1] = p2[-1];
}
}
return inst;
}
//... the rest of implementation assumes that "chars" follow m_firstChar contiguously.
}
@jkotas - can something like the above work? |
This is what @migueldeicaza suggested above - modern runtimes could create Utf8Strings that have the string data inlined in the body of the object. It can be done, but it requires runtime support. I do not think we would implement it by patching the vtable - it is dirty and it would not actually work as written above.
I think it would be about the same as difference between slow span and fast span overall. Depends on microbenchmarks you pick to make the point. |
E.g. The difference on .NET Core x64 is: The fast version allocates |
Of course the new runtimes can implement it natively. The suggestion above is for "slow" Utf8String on legacy runtimes. |
See SixLabors/ImageSharp#327 for an example how some people dislike small differences in performance |
The GC needs to be able to compute the size of an object. Variable-sized objects have to have a special vtable that marks them as variable sized.
Real-world portable libraries tend to have workarounds for both functional and performance issues in older runtimes. There is nothing to like about those, but it is the reality. |
Yeah, I am not saying that two object string is a no-go, just pointing out that some people are very sensitive about it. |
Not naive at all, I probably worded it too strong. I was thinking that if people are willing to live with the limitations of a stack only string, they probably would not be happy with concat allocating like crazy. |
I did not know that variable-sized objects have special vtable to indicate size. I thought GC knows the size and only needs info about layout to know what objects are reachable from current. |
Another option I thought about for slow string: an abstract class with multiple implementations, as follows:
public abstract class Utf8String {
// visible to the subclasses so they can size the span they hand out
protected int _length;
protected abstract Span<byte> Bytes { get; }
public static Utf8String Create(byte[] utf8){
if(utf8.Length == 0) return Utf8String.Empty;
if(utf8.Length == 1) return new Utf8String1(utf8[0]);
if(utf8.Length == 2) return new Utf8String2(utf8[0], utf8[1]);
...
else return new Utf8StringArray((byte[])utf8.Clone());
}
}
// short strings keep their bytes inline in fields, so they are a single GC object
internal class Utf8String1 : Utf8String {
byte _b0;
protected override Span<byte> Bytes => new Span<byte>(ref _b0, _length);
}
internal class Utf8String2 : Utf8String {
byte _b0;
byte _b1;
protected override Span<byte> Bytes => new Span<byte>(ref _b0, _length);
}
internal class Utf8String4 : Utf8String {
byte _b0;
byte _b1;
byte _b2;
byte _b3;
protected override Span<byte> Bytes => new Span<byte>(ref _b0, _length);
}
///...
// longer strings fall back to a separate array (two GC objects)
internal class Utf8StringArray : Utf8String {
byte[] _bytes;
protected override Span<byte> Bytes => _bytes;
}
But it will make calls on the fast string virtual |
Anyway, I think we all agree that we want Utf8String to be heapable. @A-And, if you'd like to keep running with this project, please:
|
@KrzysztofCwalina Great! I'll get on it. Also glad this caused so much discussion. |
As a reference, I have a proposal to support declaring UTF8String constants using data declarations here: dotnet/csharplang#909 I believe the only requirement would be that |
I am so happy to hear about the move for Utf8String to become heapable @KrzysztofCwalina! |
I am so happy you are pushing for UTF8 string in general @migueldeicaza! :-) |
Why are we using |
I agree. We should consider Memory. |
Memory would add 8 more bytes to the instance footprint and it would make primitive operations like indexing significantly slower... |
We need to decide if we copy the data in the ctor or not. If we don't copy, Utf8String will not be immutable. If we do copy, we can take Memory<byte>, or even Span<byte>. If we don't copy, I agree with @jkotas that the overhead would be too high. |
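The copy-or-not trade-off can be shown with a small sketch. The `Utf8Text` type and its `Wrap`/`Copy` factories are hypothetical names invented for this illustration; they are not part of the proposal.

```csharp
using System;
using System.Text;

// Hypothetical wrapper used only to illustrate the trade-off.
public sealed class Utf8Text
{
    private readonly byte[] _bytes;

    private Utf8Text(byte[] bytes) => _bytes = bytes;

    // No copy: no extra allocation, but the caller still holds the array
    // and can change the contents of the "string" after construction.
    public static Utf8Text Wrap(byte[] bytes) => new Utf8Text(bytes);

    // Copy: one extra allocation, but later writes to the source are not
    // observed. Because it copies, it can accept any span of bytes.
    public static Utf8Text Copy(ReadOnlySpan<byte> bytes) => new Utf8Text(bytes.ToArray());

    public override string ToString() => Encoding.UTF8.GetString(_bytes);
}

public static class CopyDemo
{
    // Mutates the source buffer after both instances were created and
    // returns what each instance reports afterwards.
    public static (string Wrapped, string Copied) Run()
    {
        byte[] buffer = Encoding.UTF8.GetBytes("abc");
        Utf8Text wrapped = Utf8Text.Wrap(buffer);
        Utf8Text copied = Utf8Text.Copy(buffer);

        buffer[0] = (byte)'X';

        return (wrapped.ToString(), copied.ToString());
    }
}
```

`CopyDemo.Run()` returns `("Xbc", "abc")`: the wrapped instance sees the mutation, the copied one does not.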
Copying leads to user behavior side effects. For example with string, creating N size and mutating in place with Shady non-(additionally)-allocating approach (data written 1 time):
Alternative non-(additionally)-allocating approach (data including zeroinit written 3 times, read 2 times):
Copy in char* string .ctor - isn't a great sell; especially if you have to go unsafe anyway to call the method (safe stackalloc -> |
we could solve the problem with something like: https://github.com/dotnet/corefx/issues/21472
var text = Utf8String.Create(20, (utf8Buffer)=> {
for(int i=0; i<utf8Buffer.Length; i++) utf8Buffer[i] = (byte)(65+i);
}); |
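A fuller sketch of the callback idea is below. It assumes a shape modelled on `string.Create<TState>`, with an explicit state argument so the lambda does not need to capture anything. The type name `Utf8Buffer` and this `Create` signature are illustrative assumptions, not an agreed API.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical type for illustration; Create mirrors the shape of string.Create.
public sealed class Utf8Buffer
{
    private readonly byte[] _bytes;

    private Utf8Buffer(byte[] bytes) => _bytes = bytes;

    public static Utf8Buffer Create<TState>(int length, TState state, SpanAction<byte, TState> fill)
    {
        var bytes = new byte[length];
        // The callback writes straight into the final storage, so the data is
        // written once and never copied. Safe code cannot store the span, so
        // the instance is effectively immutable once Create returns.
        fill(bytes, state);
        return new Utf8Buffer(bytes);
    }

    public override string ToString() => Encoding.UTF8.GetString(_bytes);
}

public static class CreateDemo
{
    // Fills five bytes with consecutive ASCII letters starting at 'A'.
    public static string Run()
    {
        Utf8Buffer text = Utf8Buffer.Create(5, (byte)'A', (span, first) =>
        {
            for (int i = 0; i < span.Length; i++) span[i] = (byte)(first + i);
        });
        return text.ToString();
    }
}
```

`CreateDemo.Run()` returns `"ABCDE"`. Passing the state explicitly keeps the lambda free of captures, which lets the compiler cache the delegate instead of allocating a closure per call.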
If we only had function pointers, i.e. allocation free delegates :-) |
@KrzysztofCwalina, there is a championed proposal that would add such functionality: dotnet/csharplang#302 I just don't know where it is in the priority queue for language features |
My take is that we should not copy, this is what I did for ustring. A convenience method could be used by those that desire a copy of the data to take place. That design decision has the side effect of making |
I think forced copying would also provide a perf hit for UTF8 string literals (dotnet/csharplang#909), if they (or some variation) get championed/approved in the future. I will say, no matter which is the default (copying or not), there should be an API that explicitly allows the other to be possible. |
cc @GrabYourPitchforks, fyi. |
Rationale
As it stands UTF8String lacks full feature parity with System.String. To help alleviate this and taking Issue #358 as a guideline, below are a number of proposed methods to be added to UTF8String.
API Proposal
Discussion
In the original proposal @krwq mentioned that the `+` operator as well as the `Remove` and `Replace` methods may cause issues with views of strings. However, functionality that deals with substrings currently creates new Utf8String instances instead of referring to the original. It's entirely possible I have overlooked something, but as it stands the methods add a lot of convenience without an obvious detriment.