New UTF8String APIs #1751

A-And · 2017-09-11T20:59:40Z

Rationale

As it stands UTF8String lacks full feature parity with System.String. To help alleviate this and taking Issue #358 as a guideline, below are a number of proposed methods to be added to UTF8String.

API Proposal

public partial ref struct Utf8String {

    public static Utf8String operator+(Utf8String s1, Utf8String s2);

    public bool IsNullOrWhiteSpace();

    public Utf8String Remove(int startIndex);
    public Utf8String Remove(int startIndex, int count);

    public Utf8String Replace(Utf8String oldValue, Utf8String newValue);

}

partial static class Utf8Helper {

    public static bool IsDigit(uint codePoint);

}

Discussion

In the original proposal @krwq mentioned that + operator as well as Remove and Replace methods may cause issues with views of strings. However, currently functionality referring to substrings create new Utf8String instances instead of referring to the original. It's entirely possible I may have overlooked something, but as it stands the methods add a lot of convenience without an obvious clear detriment.

The text was updated successfully, but these errors were encountered:

A-And · 2017-09-14T09:34:01Z

@KrzysztofCwalina @ahsonkhan @shiftylogic

KrzysztofCwalina · 2017-09-14T15:22:42Z

Thanks. Some questions/comments:

System.String is immutable. Do we want Utf8String to be immutable too?
System.String can be put on the heap. Utf8String currently cannot. Should Utf8String be heapable?
- BTW, I think concatenation only makes sense for heapable strings.
Do we have APIs to do ToUpper/ToLower?
We should consider adding something like String.Split. It would be great if the API was non allocating, i.e. it would return struct collection like type.
Do you think Utf8Helper would have more members? If not, it might not be worth having a type with one member.
Would IsDigit be culture-sensitive, or invariant? I think most of the existing Utf8String APIs are invariant. If yes, how will we add support for culture-dependent operations?
The proposal above shows Utf8String as a class. Is that intentional?

@migueldeicaza, @jaredpar, @VSadov

VSadov · 2017-09-14T16:34:34Z

I think Utf8String should be as close as possible to String. Strings are units of data - want to store/compose them, use as keys in hashtables and so on.

Also it would be useful to have Utf8Extensions that work with ReadOnlySpan and Span, if In-place mutating like Replace.

Maybe add similar Utf16Extensions over spans of Char for completeness.

A-And · 2017-09-14T19:38:20Z

@KrzysztofCwalina Answers below:

It probably should be. I think deviating behavior between Utf8String and System.String is unnecessary and confusing. To that end it would be worth it to change Remove and Replace to fit with the System.String API, i.e.

    public Utf8String Remove(int startIndex);
    public Utf8String Remove(int startIndex, int count);

    public Utf8String  Replace(Utf8String oldValue, Utf8String newValue);

Heap allocation for Utf8String would definitely fit its performance focus, but might be outside of the scope of the issue. With the risk of sounding naïve - what would be the risks of concatenation on stack allocated Utf8Strings?
Not yet, but I think they would be a good addition to this issue.
Useful, but that would quite possibly either make Remove and Replace functionality unfeasible or requiring a redesign to fit within memory constraints. Working off of @VSadov 's suggestion - implementations of Remove and Replace could be added in Utf8Helper that work in-place on Span<T> and ReadOnlySpan<T>. I'm a bit unsure about maintaining separate implementations, particularly because of degrading usability - the two would need to be clearly distinguishable.
IsDigit would be invariant. Culture-sensitivity could be implemented similarly to System.Char, i.e. use Unicode Categories for classification of characters. I don't think culture-sensitive implementation would necessarily make sense in the Utf8String, but Utf8Helper could be a good place.
& 7. Both are oversights and now fixed above. IsDigit should be correctly marked as static and Utf8String as a partial struct.

The suggestions in 4. and 7. could bloat `Utf8Helper` a bit. Currently it handles code point en- and decoding, but also has a `IsWhiteSpace` method. Would it make sense to spin off methods dealing with operations on an existing `Utf8String` into a separate class?

migueldeicaza · 2017-09-14T20:39:24Z

Utf8String needs to be usable on the heap, otherwise its usefulness greatly decreases

I wrote my own version precisely to work around that current limitation:

https://github.com/migueldeicaza/NStack/blob/master/NStack/strings/ustring.cs

The above implementation (ustring) is to a large extent a port of the Go string, and contains some useful features. Docs here:

https://migueldeicaza.github.io/NStack/api/NStack/NStack.ustring.html

Use the naming "rune" instead of codePoint, as I think the Go guys make a compelling case for this.
Supports Utf8 strings from C# strings, blobs of memory, byte arrays
Has nice Range API that returns a subset of the string
Does not enforce that the contents of Utf8 string are valid - key for dealing later with Unix filenames. This permeates the surface of the API.
Rune-ification of a utf8string

The IsDigit implementation in Go and NStack are Unicode aware, and they are not limited to the ascii values.

ahsonkhan · 2017-09-14T20:50:35Z

Use the naming "rune" instead of codePoint, as I think the Go guys make a compelling case for this.

From: https://blog.golang.org/strings

"Code point" is a bit of a mouthful, so Go introduces a shorter term for the concept: rune. The term appears in the libraries and source code, and means exactly the same as "code point", with one interesting addition.
The Go language defines the word rune as an alias for the type int32, so programs can be clear when an integer value represents a code point. Moreover, what you might think of as a character constant is called a rune constant in Go. The type and value of the expression
'⌘'
is rune with integer value 0x2318.

That is an interesting term for code point. cc @tarekgh

blowdart · 2017-09-14T20:53:36Z

Code point is the phrase in the standards. Just because someone else made something up is not a good reason to use it. Especially as there is an actual runic block.

migueldeicaza · 2017-09-16T00:01:37Z

Some of us lack taste, but we try to learn from the masters.

The same people that came up with "rune" came up with the encoding that revolutionized the Unicode world by introducing the UTF encoding. This is a great -and completely unrelated- read:

http://doc.cat-v.org/bell_labs/utf-8_history

Drawaes · 2017-09-16T00:29:43Z

I had a somewhat similar argument about bigendian network encoding when almost everything is litte endian. The only good reason seems to be "because a million years ago someone said it would be so". And now we have to flip everything that comes off the network......

KrzysztofCwalina · 2017-09-16T01:18:34Z

So, what do people think about the following:

// Ideally, the representation would be just like System.String,
// but this cannot be done without runtime support, and so would not work
// on existing runtimes.
// This representation is close. The main difference is that it can box.
public struct Utf8String : IEquatable<Utf8String>
{
    byte[] _codingPoints;

    public Utf8Span Slice(...) ...
}
public ref struct Utf8Span {
    ReadOnlySpan<byte> _codingPoints;
}

// Existing type.
public class String {
    public Utf16Span Slice(...)
}

public ref struct Utf16Span {
    ReadOnlySpan<char> _codingPoints;
}

migueldeicaza · 2017-09-16T01:34:14Z

If I follow the proposal, is the idea that we would have two data types, one heap-friendly, one stack-only? I like that plan.

I would like to steal the idea that was used in Span recently where a naive implementation was provided for older runtimes, but modern runtimes were made aware of the type and a better implementation was available.

So the idea would be to make Utf8String a class and runtime optimize it so:

Modern runtimes could create Utf8Strings that have the string data inlined in the body of the object instance, like System.String.
We could provide alternative storage systems (current byte buffer, no-copy byte buffer, wrap an existing unmanaged blob).

KrzysztofCwalina · 2017-09-16T01:51:33Z

If I follow the proposal, is the idea that we would have two data types, one heap-friendly, one stack-only?

Yes.

a naive implementation was provided for older runtimes, but modern runtimes were made aware of the type and a better implementation was available.

What would be the representation of the naive representation? An array inside a class? It would mean two GC objects per naive string.

migueldeicaza · 2017-09-16T02:06:33Z

Yes, for the native representation it would cause an extra allocation.

KrzysztofCwalina · 2017-09-16T02:21:30Z

The difference between fast and slow span is very small. The difference between these two string representations would be very significant. I worry that people would just not want to use it.

benaadams · 2017-09-16T02:27:00Z

Is mapping an array inside a struct wrapper to a class too much of a stretch?

e.g.
Older

public struct Utf8String : IEquatable<Utf8String>
{
    byte[] _codingPoints;

    public Utf8Span Slice(...) ...
}

Newer

public class Utf8String : IEquatable<Utf8String>
{
    byte _firstCodingPoint;

    public Utf8Span Slice(...) ...
}

KrzysztofCwalina · 2017-09-16T02:34:53Z

@benaadams, I am not sure I follow.

benaadams · 2017-09-16T02:36:18Z

Type forwarding from struct to class; whether it would make things too weird

migueldeicaza · 2017-09-16T02:42:45Z

@KrzysztofCwalina we have a high-performance version, it just does not go into the heap ;-)

In a moment of insanity, but mostly because I am going to sleep now and what better end of the day than throwing a grenade and leaving, what if we had "interface IUtf8String" and have "struct SUtf8String : IUtf8String" and "Utf8String : IUtf8String"

/me runs from @benaadams

benaadams · 2017-09-16T03:00:42Z

Need a [CanOnlyBeUsedAsGenericConstraint] attribute for the interface

VSadov · 2017-09-16T16:53:27Z

I would prefer Utf8String to be a class if at all possible.
Experience with ImmutableArray being struct shows that while perf is good, there is a number of unfortunate scenarios around boxing/interfaces/IntrlockedExchange/nullable and so on.

ImmutableArray could not be no-indirection class without VM support since elements may contain references that must be reported to GC. Utf8String does not have this problem so, perhaps with some unsafe trickery it can be done without direct VM help.

I am thinking of something like:

[StructLayout(LayoutKind.Sequential)]
public sealed class Utf8String : IEquatable<Utf8String> //, other interfaces
{
    // same field layout as in String
    private int m_stringLength;
    private byte m_firstChar;

    public static readonly Utf8String Empty = new Utf8String();

    // never call publicly
    private Utf8String() { }

    // factory allocates System.String instance with the right data, 
    // but then unsafely casts it to Utf8String and patches its VTable.
    public static Utf8String(byte[] data)
    {
        string stringInst;
        fixed (char* charPtr = &data[0])
        {
            stringInst = new String(charPtr, 0, data.Length / 2);
        }

        Utf8String inst = Unsafe.Cast<Utf8String>(stringInst);
        inst.m_stringLength = data.Length;

        //unsafely replace VTable of inst with one from Utf8String.Empty
        fixed (void** p1 = &inst.m_stringLength)
        {
            fixed (void** p2 = &Empty.m_stringLength)
            {
                // no idea if this is the right offset
                p1[-1] = p2[-1];
            }
        }

        return inst;
    }

    //... the rest of implementation assumes that "chars" follow m_firstChar contiguously.
}

@jkotas - can something like the above work?

jkotas · 2017-09-16T17:39:11Z

This is what @migueldeicaza suggested above - modern runtimes could create Utf8Strings that have the string data inlined in the body of the object. It can be done, but it requires runtime support. I do not think we would implement it by patching the vtable - it is dirty and it would not actually work as written above.

The difference between these two string representations would be very significant.

I think it would be about the same as difference between slow span and fast span overall. Depends on microbenchmarks you pick to make the point.

jkotas · 2017-09-16T17:46:56Z

E.g. The difference on .NET Core x64 is: The fast version allocates AlignUp(21 + length, 8) bytes on GC heap and the slow version allocates AlignUp(49 + length, 8) on GC heap.

VSadov · 2017-09-16T17:52:00Z

Of course the new runtimes can mplement it natively.

The suggestion above is for "slow" Utf8String on legacy runtimes.
I wonder why VTable patching would not work. Other that it is "dirty" on which i agree.

KrzysztofCwalina · 2017-09-16T18:09:41Z

See SixLabors/ImageSharp#327 for an example how some people dislike small differences in performance

jkotas · 2017-09-16T18:19:29Z

I wonder why VTable patching would not work

The GC needs to be able to compute size of an object. Variable sized objects have to have special vtable that make them variable sized.

See SixLabors/ImageSharp#327 for an example how some people dislike small differences in performance

Real-world portable libraries tend to have workarounds for both functional and performance issues in older runtimes. There is nothing to like about those, but it is the reality.

KrzysztofCwalina · 2017-09-16T18:25:40Z

Yeah, I am not saying that two object string is a no-go, just pointing out that some people are very sensitive about it.

KrzysztofCwalina · 2017-09-16T18:30:03Z

With the risk of sounding naïve - what would be the risks of concatenation on stack allocated Utf8Strings?

Not naive at all, I probably worded it too strong. I was thinking that if people are willing to live with the limitations of a stack only string, they probably would not be happy with concat allocating like crazy.

VSadov · 2017-09-16T18:47:28Z

I did not know that variable-sized objects have special vtable to indicate size. I thought GC knows the size and only needs info about layout to know what objects are reachable from current.
Oh well...

KrzysztofCwalina · 2017-09-18T16:17:14Z

Another option I thought about for slow string: abstract class with multiple implementations like follows:

public abstract Utf8String { 
    int _length;

    protected abstract Span<byte> Bytes;

    public static Utf8String Create(bytes[] utf8){
        if(utf8.Length == 0) return Utf8String.Empty;
        if(utf8.Length == 1) return new Utf8String1(utf8[0]);
        if(utf8.Length == 2) return new Utf8String2(utf8[0], utf8[1]);
        ...
        else return new Utf8StringArray(utf8.Clone()); 
    }
}
internal class Utf8String1 : Utf8String {
    byte _b0;
    
    protected override Span<byte> Bytes => new Span(ref _b0, _length);
}
internal class Utf8String2 : Utf8String {
    byte _b0;
    byte _b1;

    protected override Span<byte> Bytes => new Span(ref _b0, _length);
}
internal class Utf8String4 : Utf8String {
    byte _b0;
    byte _b1;
    byte _b2;
    byte _b3;

    protected override Span<byte> Bytes => new Span(ref _b0, _length);
}
///...
internal class Utf8StringArray : Utf8String {
    byte[] _bytes;

    protected override Span<byte> Bytes => _bytes;
}

But it will make calls on the fast string virtual

KrzysztofCwalina · 2017-09-18T21:35:14Z

Anyway, I think we all agree that we want Utf8String to be heapable.

@A-And, if you'd like to keep running with this project, please:

Rename Utf8String to Utf8Span
Add a class called Utf8String that for now just stores an array of bytes (UTF8 coding units). We can optimize it later
I think the class class should have roughly the same APIs like the Utf8Span type, in fact, the Utf8Span's implementation (for most members) could be reused.

A-And · 2017-09-18T22:38:21Z

@KrzysztofCwalina Great! I'll get on it. Also glad this caused so much discussion.

tannergooding · 2017-09-22T17:17:49Z

As a reference, I have a proposal to support declaring UTF8String constants using data declarations here: dotnet/csharplang#909

I believe the only requirement would be that Utf8String be allowed to map directly to a some format that contains a length and a byte*

migueldeicaza · 2017-09-28T16:19:39Z

I am so happy to hear about the move for Utf8String to become heapable @KrzysztofCwalina!

KrzysztofCwalina · 2017-09-28T18:37:13Z

I am so happy you are pushing for UTF8 string in general @migueldeicaza! :-)

jaredpar · 2017-10-02T15:10:25Z

Why are we using byte[] as the storage here instead of Memory<byte>.

KrzysztofCwalina · 2017-10-02T15:38:05Z

I agree. We should consider Memory.

jkotas · 2017-10-02T15:51:27Z

Memory would add 8 more bytes to the instance footprint and it would make primitive operations like indexing significantly slower...

KrzysztofCwalina · 2017-10-02T16:24:25Z

We need to decide if we copy the data in the ctor or not. If we don't copy, Utf8String will not be immutable. If we do copy, we can take Memory<byte>, or even Span<byte>. If we don't copy, I agree with @jkotas that the overhead would be too high.

benaadams · 2017-10-02T17:30:31Z

We need to decide if we copy the data in the ctor or not.

Copying leads to user behavior side effects. For example with string, creating N size and mutating in place with fixed at creation 😔

Shady non-(additionally)-allocating approach (data written 1 time):

Fast allocate empty string size N (new string('\0', N)) - Pre 0-init
Fix string
(Write) Write N chars

Alternative non-(additionally)-allocating approach (data including zeroinit written 3 times, read 2 times):

stackalloc N size char*
(Write) zeroinit N
(Write) write N chars
(Read) string .ctor - scan char* to determine length (WszMultiByteToWideChar)
string .ctor - Fast allocate empty string size N - Pre 0-init
(Read+Write) string .ctor - copy char* input (WszMultiByteToWideChar)

Copy in char* string .ctor - isn't a great sell; especially if you have to go unsafe anyway to call the method (safe stackalloc -> Span<char> overload + ReadOnlySpan<char> -> string.ctor may reduce this a little)

KrzysztofCwalina · 2017-10-02T17:39:55Z

we could solve the problem with something like: https://github.com/dotnet/corefx/issues/21472

var text = Utf8String.Create(20, (utf8Buffer)=> {
    for(int i=0; i<data.Length; i++) utf8Buffer[i] = 65+i;
});

KrzysztofCwalina · 2017-10-02T17:41:35Z

If we only had function pointers, i.e. allocation free delegates :-)

tannergooding · 2017-10-02T17:44:34Z

If we only had function pointers, i.e. allocation free delegates :-)

@KrzysztofCwalina, there is a championed proposal that would add such functionality: dotnet/csharplang#302

I just don't know where it is in the priority queue for language features

migueldeicaza · 2017-10-03T00:07:45Z

My take is that we should not copy, this is what I did for ustring. A convenience method could be used by those that desire a copy of the data to take place.

That design decision has the side effect of making Utf8String a proxy for interpreting data encoded in Utf8 format. This also means that like NStack.ustring it should also allow for invalid UTF8 data to be specified, and provide the APIs necessary to cope with it, this allows the UTF8String to be used as a transport correctly, even when the underlying encoding might be wrong (see, every Unix file system in the world, and also one of the reasons that Go supports this).

tannergooding · 2017-10-03T02:08:08Z

My take is that we should not copy

I think forced copying would also provide a perf hit for UTF8 string literals (dotnet/csharplang#909), if they (or some variation) get championed/approved in the future.

I will say, no matter which is the default (copying or not), there should be an API that explicitly allows the other to be possible.

ahsonkhan · 2018-05-31T02:12:11Z

cc @GrabYourPitchforks, fyi.

ahsonkhan added the area-System.Text.Utf8String label Sep 16, 2017

A-And mentioned this issue Oct 3, 2017

Heapable Utf8String Implementattion #1810

Merged

A-And closed this as completed Nov 22, 2017

migueldeicaza mentioned this issue Jan 31, 2020

Introducing System.Rune dotnet/runtime#23578

Closed

4 tasks

New UTF8String APIs #1751

New UTF8String APIs #1751

Comments

A-And commented Sep 11, 2017 • edited Loading

Rationale

API Proposal

Discussion

A-And commented Sep 14, 2017

KrzysztofCwalina commented Sep 14, 2017 • edited Loading

VSadov commented Sep 14, 2017

A-And commented Sep 14, 2017

migueldeicaza commented Sep 14, 2017

ahsonkhan commented Sep 14, 2017

blowdart commented Sep 14, 2017

migueldeicaza commented Sep 16, 2017

Drawaes commented Sep 16, 2017

KrzysztofCwalina commented Sep 16, 2017

migueldeicaza commented Sep 16, 2017

KrzysztofCwalina commented Sep 16, 2017 • edited Loading

migueldeicaza commented Sep 16, 2017

KrzysztofCwalina commented Sep 16, 2017

benaadams commented Sep 16, 2017

KrzysztofCwalina commented Sep 16, 2017

benaadams commented Sep 16, 2017

migueldeicaza commented Sep 16, 2017

benaadams commented Sep 16, 2017

VSadov commented Sep 16, 2017 • edited Loading

jkotas commented Sep 16, 2017

jkotas commented Sep 16, 2017

VSadov commented Sep 16, 2017 • edited Loading

KrzysztofCwalina commented Sep 16, 2017

jkotas commented Sep 16, 2017 • edited Loading

KrzysztofCwalina commented Sep 16, 2017 • edited Loading

KrzysztofCwalina commented Sep 16, 2017

VSadov commented Sep 16, 2017 • edited Loading

KrzysztofCwalina commented Sep 18, 2017

KrzysztofCwalina commented Sep 18, 2017

A-And commented Sep 18, 2017 • edited Loading

tannergooding commented Sep 22, 2017

migueldeicaza commented Sep 28, 2017

KrzysztofCwalina commented Sep 28, 2017

jaredpar commented Oct 2, 2017

KrzysztofCwalina commented Oct 2, 2017

jkotas commented Oct 2, 2017

KrzysztofCwalina commented Oct 2, 2017 • edited Loading

benaadams commented Oct 2, 2017

KrzysztofCwalina commented Oct 2, 2017 • edited Loading

KrzysztofCwalina commented Oct 2, 2017

tannergooding commented Oct 2, 2017

migueldeicaza commented Oct 3, 2017

tannergooding commented Oct 3, 2017

ahsonkhan commented May 31, 2018

A-And commented Sep 11, 2017 •

edited

Loading

KrzysztofCwalina commented Sep 14, 2017 •

edited

Loading

KrzysztofCwalina commented Sep 16, 2017 •

edited

Loading

VSadov commented Sep 16, 2017 •

edited

Loading

VSadov commented Sep 16, 2017 •

edited

Loading

jkotas commented Sep 16, 2017 •

edited

Loading

KrzysztofCwalina commented Sep 16, 2017 •

edited

Loading

VSadov commented Sep 16, 2017 •

edited

Loading

A-And commented Sep 18, 2017 •

edited

Loading

KrzysztofCwalina commented Oct 2, 2017 •

edited

Loading

KrzysztofCwalina commented Oct 2, 2017 •

edited

Loading