Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add UTF8 byte representation concatenation operator to utf8-string-literals.md #6221

Merged
merged 2 commits into from
Jun 21, 2022
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
28 changes: 28 additions & 0 deletions proposals/utf8-string-literals.md
Original file line number Diff line number Diff line change
Expand Up @@ -63,6 +63,22 @@ When the input text for the literal is a malformed UTF16 string, then the langua
var bytes = "hello \uD801\uD802"u8; // Error: the input string is not valid UTF16
```

### Addition operator

A new bullet point will be added to [§11.9.5 Addition operator](https://github.com/dotnet/csharpstandard/blob/draft-v7/standard/expressions.md#1195-addition-operator) as follows.

- UTF8 byte representation concatenation:

```csharp
ReadOnlySpan<byte> operator +(ReadOnlySpan<byte> x, ReadOnlySpan<byte> y);
```

This binary `+` operator performs byte sequences concatenation and is applicable if and only if both operands are semantically UTF8 byte representations.
An operand is semantically a UTF8 byte representation when it is eiher a value of a `u8` literal, or a value produced by the UTF8 byte representation concatenation operator.

The result of the UTF8 byte representation concatenation is a ```ReadOnlySpan<byte>``` that consists of the bytes of the left operand followed by the bytes of the right operand. A null terminator is placed beyond the last byte in memory (and outside the length of the ```ReadOnlySpan<byte>```) in order to handle some
interop scenarios where the call expects null terminated strings.

### Lowering

The language will lower the UTF8 encoded strings exactly as if the developer had typed the resulting `byte[]` literal in code. For example:
Expand All @@ -78,6 +94,17 @@ ReadOnlySpan<byte> span = new ReadOnlySpan<byte>(new byte[] { 0x68, 0x65, 0x6c,

That means all optimizations that apply to the `new byte[] { ... }` form will apply to utf8 literals as well. This means the call site will be allocation free as C# will optimize this be stored in the `.data` section of the PE file.

Multiple consecutive applications of UTF8 byte representation concatenation operators are collapsed into a single creation of `ReadOnlySpan<byte>` with byte array containing the final byte sequence.

```c#
ReadOnlySpan<byte> span = "h"u8 + "el"u8 + "lo"u8;

// Equivalent to

ReadOnlySpan<byte> span = new ReadOnlySpan<byte>(new byte[] { 0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x00 }).
Slice(0,5); // The `Slice` call will be optimized away by the compiler.
```

## Drawbacks
### Relying on core APIs
The compiler implementation will use `UTF8Encoding` for both invalid string detection as well as translation to `byte[]`. The exact APIs will possibly depend on which target framework the compiler is using. But `UTF8Encoding` will be the workhorse of the implementation.
Expand Down Expand Up @@ -407,3 +434,4 @@ Examples where we leave perf on the table

https://github.com/dotnet/csharplang/blob/main/meetings/2022/LDM-2022-01-26.md
https://github.com/dotnet/csharplang/blob/main/meetings/2022/LDM-2022-04-18.md
https://github.com/dotnet/csharplang/blob/main/meetings/2022/LDM-2022-06-06.md