Add UTF8 byte representation concatenation operator to utf8-string-literals.md (#6221)
AlekseyTs authored Jun 21, 2022
1 parent b86623f commit d1d0093
Showing 1 changed file with 28 additions and 0 deletions.
28 changes: 28 additions & 0 deletions proposals/utf8-string-literals.md
@@ -63,6 +63,22 @@ When the input text for the literal is a malformed UTF16 string, then the langua
var bytes = "hello \uD801\uD802"u8; // Error: the input string is not valid UTF16
```

### Addition operator

A new bullet point will be added to [§11.9.5 Addition operator](https://github.com/dotnet/csharpstandard/blob/draft-v7/standard/expressions.md#1195-addition-operator) as follows.

- UTF8 byte representation concatenation:

```csharp
ReadOnlySpan<byte> operator +(ReadOnlySpan<byte> x, ReadOnlySpan<byte> y);
```

This binary `+` operator performs byte sequence concatenation and is applicable if and only if both operands are semantically UTF8 byte representations.
An operand is semantically a UTF8 byte representation when it is either the value of a `u8` literal or a value produced by the UTF8 byte representation concatenation operator.

The result of the UTF8 byte representation concatenation is a `ReadOnlySpan<byte>` that consists of the bytes of the left operand followed by the bytes of the right operand. A null terminator is placed beyond the last byte in memory (and outside the length of the `ReadOnlySpan<byte>`) in order to handle some interop scenarios where the call expects null-terminated strings.
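
As an illustration of the applicability rule above, the following is a minimal sketch, assuming a compiler with `u8` literal support (C# 11 or later). The class and method names (`Utf8ConcatDemo`, `Greeting`, `MatchesEncoder`) are hypothetical and exist only for this example. It concatenates three literals and compares the result with the bytes produced by `Encoding.UTF8` at run time.

```csharp
using System;
using System.Text;

static class Utf8ConcatDemo
{
    // Every operand is either a u8 literal or the result of an earlier
    // u8 concatenation, so the operator is applicable at each step.
    public static byte[] Greeting()
    {
        ReadOnlySpan<byte> greeting = "hello"u8 + " "u8 + "world"u8;
        return greeting.ToArray();
    }

    // Compares the concatenated literal with the framework encoder's output.
    public static bool MatchesEncoder()
    {
        ReadOnlySpan<byte> greeting = "hello"u8 + " "u8 + "world"u8;
        return greeting.SequenceEqual(Encoding.UTF8.GetBytes("hello world"));
    }

    // Under the rule above this would not be applicable: `prefix` is a local
    // variable, not a u8 literal or a concatenation result, so the second
    // line is expected to be rejected by the compiler.
    // ReadOnlySpan<byte> prefix = "hello"u8;
    // ReadOnlySpan<byte> bad = prefix + " world"u8;
}
```

The null terminator is not observable through the span itself: the length reported here covers only the concatenated bytes.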

### Lowering

The language will lower the UTF8 encoded strings exactly as if the developer had typed the resulting `byte[]` literal in code. For example:
@@ -78,6 +94,17 @@ ReadOnlySpan<byte> span = new ReadOnlySpan<byte>(new byte[] { 0x68, 0x65, 0x6c,

That means all optimizations that apply to the `new byte[] { ... }` form will apply to utf8 literals as well. In particular, the call site will be allocation free, because C# will optimize the data to be stored in the `.data` section of the PE file.

Multiple consecutive applications of UTF8 byte representation concatenation operators are collapsed into a single creation of `ReadOnlySpan<byte>` with a byte array containing the final byte sequence.

```c#
ReadOnlySpan<byte> span = "h"u8 + "el"u8 + "lo"u8;

// Equivalent to
ReadOnlySpan<byte> span = new ReadOnlySpan<byte>(new byte[] { 0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x00 }).
Slice(0,5); // The `Slice` call will be optimized away by the compiler.
```
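
The equivalence above can be checked at run time. This is a minimal sketch, assuming a compiler with `u8` literal support (C# 11 or later). The names `Utf8LoweringDemo`, `FromLiterals`, `FromExplicitArray`, and `AreEquivalent` are hypothetical. It only compares observable bytes; it does not inspect what the compiler actually emits.

```csharp
using System;

static class Utf8LoweringDemo
{
    // Source form: consecutive concatenations of u8 literals.
    public static byte[] FromLiterals()
    {
        ReadOnlySpan<byte> span = "h"u8 + "el"u8 + "lo"u8;
        return span.ToArray();
    }

    // Hand-written form of the lowering described above: one array holding
    // the final bytes plus a trailing 0x00, sliced so that the terminator
    // stays outside the span's length.
    public static byte[] FromExplicitArray()
    {
        ReadOnlySpan<byte> span =
            new ReadOnlySpan<byte>(new byte[] { 0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x00 })
                .Slice(0, 5);
        return span.ToArray();
    }

    // True when both forms expose the same five bytes.
    public static bool AreEquivalent() =>
        new ReadOnlySpan<byte>(FromLiterals()).SequenceEqual(FromExplicitArray());
}
```

Note that the hand-written form allocates an array when executed as ordinary code; the allocation-free behavior described in this section is a property of the compiler's treatment of the literal form.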

## Drawbacks
### Relying on core APIs
The compiler implementation will use `UTF8Encoding` for both invalid string detection and translation to `byte[]`. The exact APIs will possibly depend on which target framework the compiler is using. But `UTF8Encoding` will be the workhorse of the implementation.
Expand Down Expand Up @@ -407,3 +434,4 @@ Examples where we leave perf on the table

https://github.com/dotnet/csharplang/blob/main/meetings/2022/LDM-2022-01-26.md
https://github.com/dotnet/csharplang/blob/main/meetings/2022/LDM-2022-04-18.md
https://github.com/dotnet/csharplang/blob/main/meetings/2022/LDM-2022-06-06.md
