Champion "utf8 string literals" #184

gafter · 2017-02-26T18:13:46Z

Proposal: https://github.com/dotnet/csharplang/blob/main/proposals/csharp-11.0/utf8-string-literals.md
Old draft proposal: #2911

Design Review

https://github.com/dotnet/csharplang/blob/main/meetings/2021/LDM-2021-10-27.md#utf-8-string-literals
https://github.com/dotnet/csharplang/blob/main/meetings/2022/LDM-2022-01-26.md#open-questions-in-utf-8-string-literals
https://github.com/dotnet/csharplang/blob/main/meetings/2022/LDM-2022-06-29.md#utf-8-literal-concatenation-operator

MovGP0 · 2017-02-27T10:04:20Z

UTF-8 encodings come with a lot of issues for non-English languages and developers. So this feature might only be a good thing for English-speaking developers and a bad idea for everybody else.

It might be useful for interop with non-Unicode applications, but then I would prefer an explicit encoding conversion using System.Text.Encoding.

0xd4d · 2017-02-27T23:32:42Z

@MovGP0 I think this is related to https://github.com/dotnet/corefxlab/blob/master/docs/specs/parsing.md . UTF8 strings are very common and if you don't have to convert to UTF16 (== .NET strings) and back again you save memory and CPU.

wanton7 · 2017-06-02T08:49:34Z

When UTF-8 string literals are added, it would be nice to have a UTF-8 version of StringBuilder as well.

ufcpp · 2017-06-02T15:08:46Z

@wanton7
This may be it:
https://github.com/dotnet/corefxlab/tree/master/src/System.Text.Formatting/System/Text/Formatting/Formatters

sumtec · 2018-11-27T07:08:18Z

I am quite curious about how it is going to handle the index operator.
utf8string utf8 = "©α中文𨙧: some regular Chinese and special characters";
utf8char utf8c = utf8[5];

Does it mean the class needs to enumerate and decode bytes into utf8 characters internally until it finds the sixth character? Or are you disallowing the index operator on the UTF8String?

Here are some similar questions:

Will there be any function like Substring, Trim, EndsWith?
Will there be any optimization when we have reversed looping, like this?
for (var i= utf8.Length - 1; i>=0; i--) { /* do something */ }

If there is no support or optimization for this, I would say we probably just need syntactic sugar for converting a string to a byte array with UTF-8 encoding.
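To make the cost behind the indexing question concrete, here is a minimal sketch, assuming .NET Core 3.0 or later for System.Text.Rune. It is not the API of any proposed utf8string type: `RuneAt` is a hypothetical helper that finds the code point at a given index by decoding the UTF-8 bytes from the start, so the work grows with the index.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical helper (not part of any proposal): returns the code point at a
// zero-based code point index by decoding from the first byte onward.
static Rune RuneAt(ReadOnlySpan<byte> utf8, int index)
{
    int seen = 0;
    while (!utf8.IsEmpty)
    {
        OperationStatus status = Rune.DecodeFromUtf8(utf8, out Rune rune, out int consumed);
        if (status != OperationStatus.Done)
        {
            throw new FormatException("Invalid UTF-8 data.");
        }
        if (seen == index)
        {
            return rune;
        }
        seen++;
        utf8 = utf8.Slice(consumed); // skip the bytes of the code point just decoded
    }
    throw new ArgumentOutOfRangeException(nameof(index));
}

byte[] bytes = Encoding.UTF8.GetBytes("©α中文𨙧: text");
Rune sixth = RuneAt(bytes, 5); // the sixth code point; five others are decoded first
Console.WriteLine(sixth);
```

A type that exposed an indexer over code points would have to do something like this on each call unless it kept extra bookkeeping, which is the trade-off the question is getting at.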

yaakov-h · 2018-11-27T07:16:21Z

Those questions are better suited for corefx.

miloush · 2019-05-06T12:59:53Z

@sumtec and @MovGP0 .NET Micro Framework always had UTF-8 string implementation only, transparent to the developer. It does support trimming and substrings and indexing, although the reverse looping is not optimized (source). It saved memory. You could, however, have the same arguments around "normal" strings with surrogate pairs.

gafter · 2019-10-25T22:52:13Z

See #2911 for a minimal specification for this feature.

orthoxerox · 2019-10-26T08:32:04Z

Will there be a type that represents potentially invalid UTF-8 strings, like Linux file paths?

CyrusNajmabadi · 2019-10-26T18:44:52Z

I'm not a fan of this approach as it treats utf8 strings as something "other" that then needs to be brought in through a side-channel.

It seems this fundamentally could not be picked up by a library author. i.e. if i have a library and i'm already using System.String (highly highly likely), i can't switch to utf8 strings because it will break all my consumers.

And, if i don't use utf8 strings, similarly my consumers will be less likely to as well since they would not want the costs marshalling to/from all libs.

--

I talked to @jcouv about this and the approach that feels like it would be most likely to succeed would be to provide a way to switch the .net runtime to/from utf8 mode (on a process boundary most likely). The benefits here are:

users can switch over everything entirely to utf8 when it is acceptable for their domain.
most apps would immediately get a near 50% reduction in memory for all their strings (iirc measurements showed that 90%+ of all strings are simple ascii).
apps/libraries get switched over all at once based on the needs of the final consumers.

There is a downside in this that often gets brought up. Namely that utf8 strings do have different perf behavior for some ops over strings (namely indexing). However, this doesn't actually seem like a critical problem to me. First, remember that what i'm proposing involves a switch (either opt-in or opt-out) to use utf8 across the board. As such, if someone is in a domain where they index heavily and get a perf hit, they can hold off on utf8 until they address that problem. Second, i think the problem seems somewhat overblown in terms of how bad it is. We can likely break string indexing up into two domains:

1. people streaming through a string with monotonically increasing indexes. This can be addressed by:
1.1 pushing those people (with analyzers) to use iterators instead.
1.2 having the runtime be slightly smarter with string indexing. Like many utf8 systems out there, it could store additional information in the runtime about the last index operation that happened on the last few strings. If the user passes in str[i] and then str[i + 1], the information about the location collected in the first op can be used to make the second fast.
2. people randomly accessing string indices. This seems like it would be a very small subset of users. And, if that space was truly important, they:
2.1 could opt out of utf8 strings
2.2 could use some new type that guaranteed constant-time random access for a string, maybe a new Utf16String, or just a char[] or ImmutableArray<char>

Basically, it feels like there is a path that can get us to a future where almost everyone (final consumers and libraries alike) are on utf8 and the entire ecosystem gets the massive memory savings. It comes at the complexity of having opt-in/out and potentially needing some analyzers/classes for the people using strings in uncommon ways today. However, it seems much better to me than introducing a new utf8 string type that is highly unlikely to be picked up.
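The "remember the last index operation" idea from 1.2 can be sketched roughly as follows. This is only an illustration under stated assumptions: `CreateCachedIndexer` is a hypothetical helper, it caches a single position for a single string, and it is not thread-safe. Nothing here reflects how a runtime would actually implement it.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical helper: wraps UTF-8 bytes in an indexer that remembers where the
// previous lookup stopped, so str[i] followed by str[i + 1] does not rescan.
static Func<int, Rune> CreateCachedIndexer(byte[] utf8)
{
    int cachedIndex = 0;   // code point index of the cached position
    int cachedOffset = 0;  // byte offset at which that code point starts

    return index =>
    {
        if (index < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(index));
        }
        if (index < cachedIndex)
        {
            // Backward access: this simple sketch just restarts from the beginning.
            cachedIndex = 0;
            cachedOffset = 0;
        }
        while (cachedOffset < utf8.Length)
        {
            OperationStatus status = Rune.DecodeFromUtf8(
                utf8.AsSpan(cachedOffset), out Rune rune, out int consumed);
            if (status != OperationStatus.Done)
            {
                throw new FormatException("Invalid UTF-8 data.");
            }
            if (cachedIndex == index)
            {
                return rune; // the cache now points at this code point
            }
            cachedIndex++;
            cachedOffset += consumed;
        }
        throw new ArgumentOutOfRangeException(nameof(index));
    };
}

Func<int, Rune> at = CreateCachedIndexer(Encoding.UTF8.GetBytes("a中b"));
Console.WriteLine(at(1)); // scans from the start
Console.WriteLine(at(2)); // resumes from the cached byte offset
```

With monotonically increasing indexes each lookup decodes only the code points between the cached position and the requested one, so a forward loop stays linear overall; random or reverse access falls back to rescanning.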

CyrusNajmabadi · 2019-10-26T18:51:14Z

As an example of how we have a problem, take a look at Roslyn itself, including the entire Roslyn API we ship.

it is massively System.String based everywhere.
It uses a huge amount of memory internally with strings. IIRC measurements have shown it's >50% of our memory usage in compiler and IDE.

How could Roslyn itself possibly get the benefits of utf8 strings?

We could try switching to it internally, but our marshalling points between the internal and public layers would kill us. For example, every time the IDE accessed a string-property exposed by the compiler, we would take a marshalling hit. And we access those string-properties continuously.
We could try to expose both types of strings somehow? allowing consumers to move to utf8 when possible, while still having the System.String property. But how would this look? .Name and .Name8? How would memory not explode in such a world?
We could switch wholesale over to utf8 strings for our entire surface area. But that would break 100% of the ecosystem out there.

Effectively, afaict, a project like Roslyn could never move to utf8. And we're one of the projects that would benefit the most here. We likely would save gigabytes of memory on real projects on user boxes.

So, as mentioned in the start, this overall approach seems highly limited and constraining. it will only help projects that are isolated and can completely switch over without having to worry about dependencies. The overall ecosystem will find it nearly impossible to switch.

Conversely, the approach I outlined gives a path forward that allows big savings immediately across the board, with appropriate mechanisms for people to deal with rare problems if they arise. Then, if problems do occur in some places, they can be fixed up without holding the rest of the ecosystem back.

davidroth · 2019-10-30T12:52:19Z

@CyrusNajmabadi Is there an ongoing discussion on your "side-channel" proposal without introducing a new UTF8String type?

I share your concerns about the fragmentation problem a new UTF8String type would bring.
However, I could only find the discussion around the design of the new utf8 types (dotnet/corefxlab#2350) and the older compact string proposal: https://github.com/dotnet/coreclr/issues/7083

CyrusNajmabadi · 2019-10-30T19:01:52Z

> @CyrusNajmabadi Is there an ongoing discussion on your "side-channel" proposal without introducing a new UTF8String type?

No clue. @jcouv @gafter is there any hope of this being not a side-channel type? note: personally, i think this is an appropriate hill to die on. It is that important.

gafter · 2019-10-30T19:39:51Z

@CyrusNajmabadi That is a question for corefxlab, coreclr, and corefx, possibly focused at dotnet/corefxlab#2350. This proposal isn't going anywhere without that team making a decision about what they want to do to support UTF-8. If the answer is a new type, this proposal applies.

@orthoxerox Re "Will there be a type that represents potentially invalid UTF-8 strings, like Linux file paths?". Are you asking about ReadOnlySpan<byte>?

orthoxerox · 2019-10-30T19:54:06Z

@gafter will System.IO classes use ReadOnlySpan<byte>?

Like, IEnumerable<ReadOnlySpan<byte>> System.IO.Directory.EnumerateFiles(ReadOnlySpan<byte> path)?

gafter · 2019-10-30T21:35:19Z

@orthoxerox You would have to ask the folks designing those APIs.

airbreather · 2019-11-16T00:00:59Z

I keep going back and forth on this... @CyrusNajmabadi's concerns are absolutely what I have felt to be the biggest downside, and I share the opinion that indexing into the string is not a major concern: I don't see indexing UTF-16 code units as being much different from indexing UTF-8 code units, as both encodings are variable-length, and so one code unit does not always represent one code point (never mind that one code point does not always represent one character, depending on what the developer has in mind when they talk about "the third character in this string").
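As a small illustration of the point that both encodings are variable-length, the sketch below (assuming .NET Core 3.0 or later for `EnumerateRunes`; the sample string is my own) counts UTF-16 code units, UTF-8 code units, and code points for the same text.

```csharp
using System;
using System.Text;

// 'a', 'é' (U+00E9), and U+1F600, which lies outside the Basic Multilingual Plane.
string s = "a\u00E9\U0001F600";

int utf16CodeUnits = s.Length;                     // what string indexing counts
int utf8CodeUnits = Encoding.UTF8.GetByteCount(s); // what a byte indexer would count
int codePoints = 0;
foreach (Rune rune in s.EnumerateRunes())
{
    codePoints++;
}

Console.WriteLine($"UTF-16 code units: {utf16CodeUnits}");
Console.WriteLine($"UTF-8 code units:  {utf8CodeUnits}");
Console.WriteLine($"Code points:       {codePoints}");

// Indexing a System.String can already land in the middle of a code point.
Console.WriteLine($"s[3] is a low surrogate: {char.IsLowSurrogate(s[3])}");
```

Neither count of code units equals the number of code points here, so `s[i]` on a UTF-16 string is no more "the i-th character" than a byte index into UTF-8 would be.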

I mean, if you were to ask me, "hey @airbreather, if you were designing C# / .NET from scratch, what encoding would you use to store character data in string?", then I would say "UTF-8" without a hint of hesitation (I feel more strongly about this point than I do about array covariance being a mistake). But there's so much momentum behind UTF-16 strings that I can't unequivocally support this proposal: introducing UTF-8 companion types to today's UTF-16 string / char has a very real risk of harming performance, as the majority of the users of the UTF-8 stuff would wind up marshaling anyway to interop with third-party code that uses UTF-16 (edit: at least in the short-term until adoption picks up).

I'm also not terribly optimistic that this will really bear fruit without also investing significantly in CoreFX to add comprehensive first-class support, like what was done for Span<T> / ReadOnlySpan<T> / Memory<T> / ReadOnlyMemory<T>, and I can definitely imagine that major established third-party libraries would not share my enthusiasm for adding parallels in their public API surface.

Ultimately, however, I've settled on a 👍 for this. I personally have a phobia about wasting CPU cycles and virtual memory bytes, so if LDT thinks that, in spite of the concerns raised here, this is something that has a realistic chance of making UTF-8 more of a first-class member of our ecosystem, then I'd be delighted to see this next important step towards breaking the chicken-and-egg feedback loop of:

The language and out-of-the-box APIs make it much easier to use UTF-16 than UTF-8, so practically every library and application uses UTF-16 for their strings, and
Practically everybody uses UTF-16, so most investments in the language and out-of-the-box APIs tend to go towards making things easier for people who have UTF-16 strings

> We could try to expose both types of strings somehow? allowing consumers to move to utf8 when possible, while still having the System.String property. But how would this look? .Name and .Name8? How would memory not explode in such a world?

@CyrusNajmabadi in this example, would it be viable for Roslyn to use .Name8 as the actual storage, but keep the existing .Name properties around with accessors that marshal to/from UTF-16 on demand?

Performance-sensitive code paths in the IDE (and elsewhere) would be highly encouraged to switch to .Name8, and there could be Roslyn-specific analyzers that help identify these.
You get the performance benefits of the "hard break" proposal, without behavior changes.

Admittedly, the prospect of ~doubling the public API surface alone may be enough to kill this idea...

CyrusNajmabadi · 2019-11-16T02:58:23Z

> Admittedly, the prospect of ~doubling the public API surface alone may be enough to kill this idea...

Yes. It seems like it would just be awful :-/

tmat · 2022-04-15T17:54:28Z

DataTips are more interesting - when you hover you might want to immediately see the string for u8 literals, not the bytes. Since data tips rely on presence of source code I think we should be able to analyze the relevant source and infer that a string should be displayed.

GrabYourPitchforks · 2022-04-15T20:32:40Z

> That said, I think it's totally reasonable to push that burden onto the decompiler: whenever it detects a byte sequence that's valid UTF-8 and could be represented as a UTF-8 string literal, it chooses that decompilation over any alternatives. UTF-8 is a sparse encoding, and so the risk of a false detection is kind-of low... not to mention that it could use heuristics.

I think you're going to find that this strategy leads to numerous false positives. Just within dotnet/runtime, this strategy would result in false positives in:

src\coreclr\tools\Common\TypeSystem\IL\
src\coreclr\tools\aot\ILCompiler.Compiler\Compiler\
src\libraries\Common\src\System\Security\Cryptography\Asn1\
src\libraries\System.Data.Common\src\System\Data\SQLTypes\
src\libraries\System.IO.Compression\src\System\IO\Compression\DeflateManaged\
src\libraries\System.Net.Primitives\src\System\Net\IPAddress.cs
src\libraries\System.Net.Security\src\System\Net\Security\TlsFrameHelper.cs
src\libraries\System.Security.Cryptography.Pkcs\src\System\Security\Cryptography\Pkcs\
src\libraries\System.Text.RegularExpressions\src\System\Text\RegularExpressions\Symbolic\
src\mono\mono\mini\

The use of heuristics to solve this problem is interesting, but that's subject to the whims of the disassembler author and would necessarily treat some languages (Chinese/Japanese/Korean being the likely candidates) as more user-hostile than English, which is a bad experience. My experience has been that heuristics aren't a good substitute for an agent (the compiler) embedding correct information at build time.
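To illustrate how a "valid UTF-8 means it was a string" rule can misfire, here is a minimal sketch. The byte arrays are made-up examples in the spirit of the areas listed above (an IP address, a TLS record header), not data taken from those files, and `IsValidUtf8` is a hypothetical helper built on a strict `UTF8Encoding`.

```csharp
using System;
using System.Text;

// Hypothetical helper: true when the bytes decode as UTF-8 without errors.
static bool IsValidUtf8(byte[] data)
{
    var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
    try
    {
        strict.GetString(data);
        return true;
    }
    catch (DecoderFallbackException)
    {
        return false;
    }
}

byte[] ipv4Loopback = { 127, 0, 0, 1 };                    // binary data, all bytes below 0x80
byte[] tlsRecordHeader = { 0x16, 0x03, 0x03, 0x00, 0x2A }; // binary data, all bytes below 0x80
byte[] actualText = Encoding.UTF8.GetBytes("héllo");       // genuinely text
byte[] notUtf8 = { 0xFF, 0xFE, 0x00 };                     // 0xFF never appears in UTF-8

Console.WriteLine($"IPv4 loopback:     {IsValidUtf8(ipv4Loopback)}");
Console.WriteLine($"TLS record header: {IsValidUtf8(tlsRecordHeader)}");
Console.WriteLine($"Actual text:       {IsValidUtf8(actualText)}");
Console.WriteLine($"Not UTF-8:         {IsValidUtf8(notUtf8)}");
```

Any byte sequence made only of values below 0x80 is valid UTF-8, so short binary tables pass the validity check just as real text does; a decompiler would need further heuristics to tell them apart.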

CyrusNajmabadi · 2022-04-15T21:02:53Z

> My experience has been that heuristics aren't a good substitute for an agent (the compiler) embedding correct information at build time.

I guess i don't understand the concept of 'correct'. They're equally correct to me. What it sounds like is more around what the author wrote. But that's not relevant to me as the consumer (for all the other reasons that style is not relevant). If i care about those details, i'll just look at the original source. If i'm working with an arbitrary compiled down representation, then my preference is to follow the style I want, not what the original code was. Indeed, that's more important to me in case a user doesn't use this feature, but i would still like the clarity in the decompiled code for my own readability purposes.

svick · 2022-04-16T08:21:59Z

My point of view is that it is not a goal of IL to accurately represent the original source code, there are other tools for that (like PDBs and Source Link). Also, I can't think of another case where the compiler chooses IL representation based on the needs of decompilers and I don't see a reason why it should start here.

zlatanov · 2022-04-19T17:07:37Z

Aren't the implicit operators introducing a breaking change? Take for example the following code, which with C# 10 compiles fine, but will fail to compile if C# 11 is used.

using System;

var helper = new Helper();
helper.Add( "key", "value" );

class Helper
{
    public void Add(string key, ReadOnlySpan<char> data) { }
    public void Add(string key, ReadOnlySpan<byte> data) { }
}

BreyerW · 2022-04-20T08:07:51Z

@zlatanov there shouldn't be, since plain strings should remain utf-16; for the byte overload to be considered, you would have to add the utf8 suffix to the last string, or whatever the final suffix will be

zlatanov · 2022-04-20T08:11:10Z

> @zlatanov there shouldn't be, since plain strings should remain utf-16; for the byte overload to be considered, you would have to add the utf8 suffix to the last string, or whatever the final suffix will be

As I am writing this, it seems not to be the case. Check this out: https://sharplab.io/#v2:EYLgtghglgdgPgAQEwEYCwAoTA3CAnAAgAsBTAGwAcTCBeAmEgdwIAlyq8AKASgG5NSlagDoAggBNxnAgCIA1iQCeMgDSzcZAK4kZBPpkzJW7apgDemAlYIIAzDYAsBCVIQoADAQWK1AJRIQ4gDyMGSKAMoUEDAAPADGRPgAfATiEAAuENwEZgQAvpbWdo7Okpxunt5+AcGhEVGxwIrpJClpmdm5BRh5QA==

It fails to compile and gives an error CS0121: The call is ambiguous between the following methods or properties: 'Helper.Add(string, ReadOnlySpan<char>)' and 'Helper.Add(string, ReadOnlySpan<byte>)'.

BreyerW · 2022-04-20T08:17:50Z

@zlatanov probably a preview bug, since the utf8 suffix also isn't implemented in this branch (it gives a syntax error), so I wouldn't worry too much; they still have about half a year before the final release

zlatanov · 2022-04-20T08:19:34Z

@BreyerW This compiles just fine:

using System;

var helper = new Helper();
helper.Add( "key", "value"u8 );

class Helper
{
    public void Add(string key, ReadOnlySpan<char> data) { }
    public void Add(string key, ReadOnlySpan<byte> data) { }
}

BreyerW · 2022-04-20T08:23:16Z

@zlatanov ah, I was mistakenly using utf8 as the suffix. Then it is worth checking, but I still think it is just a preview glitch that will get smoothed out before the final release

bernd5 · 2022-04-20T08:39:03Z

The problem is that string now has an implicit conversion operator to ReadOnlySpan<char>. In addition, with the new UTF8 proposal the compiler added an implicit conversion from string literals to ReadOnlySpan<byte> and byte[]. This conversion should have a lower priority.

zlatanov · 2022-04-20T08:43:15Z

> The problem is that string now has an implicit conversion operator to ReadOnlySpan<char>. In addition, with the new UTF8 proposal the compiler added an implicit conversion from string literals to ReadOnlySpan<byte> and byte[]. This conversion should have a lower priority.

Implicit operator to ReadOnlySpan<char> is not new, it has been there for a while now. The implicit operator to ReadOnlySpan<byte> is the new one.

bernd5 · 2022-04-20T09:09:50Z

This issue seems to be fixed in the spec / proposal: https://github.com/dotnet/csharplang/blob/main/proposals/utf8-string-literals.md#resolved-overload-resolution-breaks

The fix is just not implemented yet:

using System;
using static App;

M1("");

class App{
    public static void M1(ReadOnlySpan<char> charArray) => Console.WriteLine(charArray.Length);
    public static void M1(byte[] byteArray) => Console.WriteLine(byteArray.Length);
}

WeihanLi · 2022-05-29T03:43:00Z

I ran into a problem after updating the SDK: it breaks my existing code.

Error message:

Error CS8652: The feature 'Utf8 String Literals' is currently in Preview and unsupported. To use Preview features, use the 'preview' language version.

Failed CI:

Sample code: https://github.com/WeihanLi/SamplesInPractice/tree/63795cd961dc64ff3b92473a09eb90a59792db19/MiniAspNetCore

orthoxerox · 2022-06-30T13:05:49Z

Will UTF-8 string literals support raw string literals?

var s2 = $"""
hello
"""u8;                // Okay and type is ReadOnlySpan<byte>

HaloFour · 2022-06-30T13:19:36Z

@orthoxerox

Regular raw strings already work on SharpLab, verbatim strings too. But it doesn't appear that interpolated strings are supported, in any flavor.

FaustVX · 2022-06-30T13:33:07Z

Will interpolated UTF-8 strings work with interpolated string handlers?

333fred · 2022-06-30T15:00:07Z

As HaloFour said, there are no interpolated UTF-8 strings.
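For reference, a short sketch of the literal forms discussed in this exchange, assuming a C# 11 or later compiler. The variable names are illustrative, and the last lines show one possible workaround for interpolation (encoding an ordinary interpolated string at run time), not a language feature.

```csharp
using System;
using System.Text;

// Regular and raw string literals accept the u8 suffix; the type is ReadOnlySpan<byte>.
ReadOnlySpan<byte> regular = "hello"u8;
ReadOnlySpan<byte> raw = """
hello
"""u8;

// Two u8 literals can be concatenated with +.
ReadOnlySpan<byte> joined = "hel"u8 + "lo"u8;

Console.WriteLine(regular.SequenceEqual(raw));
Console.WriteLine(regular.SequenceEqual(joined));

// There is no $"..."u8 form, so interpolated text has to be encoded at run time.
int id = 42;
byte[] formatted = Encoding.UTF8.GetBytes($"id={id}");
Console.WriteLine(formatted.Length);
```

The run-time encoding step allocates a string and a byte array, so it gives up the main benefit of the literal form; it is shown only to mark where the feature stops.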

FaustVX · 2022-06-30T15:11:27Z

@333fred
Ok, @HaloFour edited his response after I wrote mine 😄

HaloFour · 2022-06-30T15:13:19Z

@FaustVX

Yeah, sorry, I missed the $ in the example code in the original question. 😄

gafter added the Proposal champion label Feb 26, 2017

gafter assigned MadsTorgersen Feb 26, 2017

gafter added this to the 7.2 candidate milestone Feb 26, 2017

gafter added the Proposal label Apr 10, 2017

MadsTorgersen modified the milestones: 8.0 candidate, 7.2 candidate Aug 18, 2017

gafter modified the milestones: 8.0 candidate, 9.0 candidate Apr 29, 2019

gafter mentioned this issue Oct 25, 2019

UTF8 String Literals - Draft Specification #2911

Closed

CyrusNajmabadi mentioned this issue Oct 30, 2019

Utf8String design proposal dotnet/corefxlab#2350

Closed

Sergio0694 mentioned this issue Jun 24, 2022

[Proposal]: UTF8 literal support for nameof() expressions #6235

Closed

4 tasks

333fred removed the Blocked Waiting for a dependency label Jun 29, 2022

ufcpp mentioned this issue Jul 19, 2022

Proposal: list pattern #3435

Open

jcouv mentioned this issue Sep 26, 2022

Test plan for Utf8StringLiterals feature. dotnet/roslyn#58848

Closed

53 tasks

jcouv modified the milestones: Backlog, 11.0 Sep 26, 2022

333fred added the Implemented Needs ECMA Spec This feature has been implemented in C#, but still needs to be merged into the ECMA specification label Jan 9, 2023

This was referenced Feb 11, 2023

C# Features adrianoc/cecilifier#9

Open

Add support for Utf8 string literals adrianoc/cecilifier#221

Open

dotnet locked as resolved and limited conversation to collaborators Dec 12, 2024
