Reference-Typed Strings

A minimum viable proposal to add reference-typed strings to WebAssembly (the stringref proposal).

Champions

Andy Wingo <wingo@igalia.com>

Goals

  1. Enable programs compiled to WebAssembly to efficiently create and consume JavaScript strings
  2. Provide a good string implementation that many languages implemented on top of the GC proposal would find useful

These goals are sometimes in tension! The operative words to help us find good compromises are "minimal" and "viable".

Requirements

  1. Zero-copy passing of strings from JavaScript to WebAssembly & back
  2. No new string implementations on the web: allow re-use of JS engine's strings
  3. Allow WebAssembly implementations to efficiently represent strings internally in either WTF-8 or WTF-16 encodings
  4. Allow access to WTF-16 code units for Java, Dart, Kotlin and similar languages
  5. Allow string literals as constant expressions

Definitions

  • codepoint: An integer in the range [0,0x10FFFF].
  • surrogate: A codepoint in the range [0xD800,0xDFFF].
  • unicode scalar value: A codepoint that is not a surrogate.
  • character: An imprecise concept that we try to avoid in this document.
  • code unit: An indivisible unit of an encoded unicode scalar value. For UTF-8 encodings, an integer in the range [0,0xFF] (a byte); for UTF-16 encodings, an integer in the range [0,0xFFFF]; for UTF-32, the unicode scalar value itself.
  • high surrogate: A surrogate in the range [0xD800,0xDBFF].
  • low surrogate: A surrogate which is not a high surrogate.
  • surrogate pair: A sequence of a high surrogate followed by a low surrogate, used by UTF-16 to encode a codepoint in the range [0x10000,0x10FFFF].
  • isolated surrogate: Any surrogate which is not part of a surrogate pair.

Design

What's a string?

Good question! It sure would be nice to say that a string is a sequence of unicode scalar values. However, to satisfy the goal of being a good compilation target for a wide range of programming languages, as well as the goal of good JavaScript interoperability, things are a little more complicated.

Some languages present no problem to this idea that strings are composed of unicode scalar values. Python and Rust are in this category.

Other languages, notably JavaScript and Java, define their strings to be sequences of 16-bit code units, for historical reasons. These code unit sequences aren't quite UTF-16, because they can contain isolated surrogates, which are prohibited by standard Unicode encoding forms. The facility we end up building should be able to represent all Java strings, as a compilation target, and also represent all JavaScript strings, for good web interoperability.

Therefore we define a string to be a sequence of unicode scalar values and isolated surrogates. The code units of a Java or JavaScript string can be interpreted to encode such a sequence, in the WTF-16 encoding form.
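As an illustration of this definition, here is a minimal Python sketch that interprets a sequence of 16-bit code units as WTF-16: surrogate pairs combine into one codepoint, and isolated surrogates pass through unchanged. The function names are hypothetical and are not part of the proposal.

```python
def is_high_surrogate(cp):
    return 0xD800 <= cp <= 0xDBFF

def is_low_surrogate(cp):
    return 0xDC00 <= cp <= 0xDFFF

def is_surrogate(cp):
    return 0xD800 <= cp <= 0xDFFF

def codepoints_from_wtf16(units):
    """Interpret 16-bit code units as WTF-16 (hypothetical helper).

    The result is a list of unicode scalar values and isolated surrogates.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if (is_high_surrogate(u) and i + 1 < len(units)
                and is_low_surrogate(units[i + 1])):
            # A surrogate pair encodes one codepoint in [0x10000, 0x10FFFF].
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # A BMP code unit or an isolated surrogate stands for itself.
            out.append(u)
            i += 1
    return out

def is_usv_sequence(codepoints):
    """True if no isolated surrogate remains after pairing."""
    return not any(is_surrogate(cp) for cp in codepoints)
```

Under this model the code units 0xD83D 0xDE00 yield the single codepoint U+1F600, whereas a lone 0xD800 stays in the string as an isolated surrogate.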

This proposal does not require a WebAssembly implementation to use WTF-16 to represent its strings internally. It only requires that the implementation be able to represent all codepoint sequences that can be encoded with WTF-16, notably codepoint sequences containing isolated surrogates.

Encodings and views

There is an impedance-matching problem between the way that WebAssembly implementations want to represent strings, and the way that source languages want to access strings. The stringref API has to do the best it can to smooth over these differences.

On the implementation side, some WebAssembly implementations will want to represent string contents in the WTF-16 encoding, notably implementations embedded on the web because of JavaScript's usage of WTF-16. Other implementations without legacy requirements will want to use WTF-8, as it is generally more space-efficient and closer to the common UTF-8 interchange format.

From the source language side, some source languages such as Java will also want to consider strings as WTF-16 code unit sequences. Other source languages will want to think of strings as WTF-8 byte sequences, and others will want to think of strings as codepoint sequences.

On the most basic level, when a source language wants to access string contents, it can encode the whole string to memory (or eventually to a GC-managed array) as UTF-8, WTF-8, or WTF-16, depending on the source language's needs. For example Java usually wants to treat strings as WTF-16 code unit sequences, so Java would encode to WTF-16. This proposal provides facilities for measuring how many bytes an encoding would take, and for actually doing the encoding.

This proposal also includes the ability to get a WTF-8 or WTF-16 "view" on a string, which should provide near-constant-time random access to the bytes of a WTF-8 encoding of the string, or to the 16-bit code units of a WTF-16 encoding of a string. We also provide a view that allows an iterator interface over the codepoints in a string.

An implementation using WTF-8 will have some costs for source languages that want WTF-16, and vice versa.

Getting a view for a WebAssembly implementation's "native" string encoding is likely a constant-time operation, and possibly even free. For example, getting a WTF-8 view for an implementation that uses WTF-8 to represent strings will be free. Getting a WTF-16 view on a WTF-8 implementation could imply a copy, or possibly the computation of a breadcrumbs table.

This proposal defines positions in strings only with respect to a specific string view type. Any interface that needs to refer to a position in a string should take a string view and an offset whose meaning depends on the string view type: a byte offset for a WTF-8 string view, a code unit offset for a WTF-16 view, or a codepoint offset for a codepoint view.

Handling invalid UTF-8

WTF-8 and WTF-16 are not interchange formats: they should not be used when communicating data over a network, for example. If a program goes to encode a string to UTF-8, and the string contains an isolated surrogate, the program can trap, or it can replace the codepoint with U+FFFD (the replacement character). The stringref facility provides an efficient mechanism for detecting strings which are not valid USV sequences. When encoding data as UTF-8, it allows the programmer to specify whether to replace any isolated surrogate or whether to trap.
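The two policies for isolated surrogates can be sketched in Python as follows. This is a model of the behavior described above, not the proposal's API: `Trap`, `encode_codepoint`, `encode_utf8` and `encode_wtf8` are hypothetical names, and a string is modeled as a list of codepoints.

```python
class Trap(Exception):
    """Stands in for a WebAssembly trap in this model."""

def encode_codepoint(cp):
    """Generalized UTF-8 encoding of one codepoint, surrogates included."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def encode_wtf8(codepoints):
    # WTF-8 keeps isolated surrogates as three-byte sequences.
    return b"".join(encode_codepoint(cp) for cp in codepoints)

def encode_utf8(codepoints, lossy=False):
    out = bytearray()
    for cp in codepoints:
        if 0xD800 <= cp <= 0xDFFF:
            if not lossy:
                raise Trap("isolated surrogate U+%04X" % cp)
            cp = 0xFFFD  # the replacement character
        out += encode_codepoint(cp)
    return bytes(out)
```

Because U+FFFD and any isolated surrogate both occupy three bytes, the lossy UTF-8 output always has the same length as the WTF-8 output.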

For some source languages, defining strings to be sequences of unicode scalar values and isolated surrogates is an antifeature: those source languages do not require the ability to represent isolated surrogates and would prefer to not be given strings with isolated surrogates. We sympathize. On a boundary where such a program might receive a string containing isolated surrogates, the program can check for such strings and trap using facilities defined in this proposal.

Prior discussions

Proposal

This proposal consists of a basic stringref facility and a (possibly post-MVP) stringview facility.

The stringref facility defines a new reference type, stringref. Literal stringref values can be embedded in a WebAssembly module, with their contents taken from a new section. WebAssembly programs can also create stringref values from data encoded in memory or GC arrays in the WTF-8 or WTF-16 encodings, and can likewise write stringref contents to memory in these encodings. There is an instruction to concatenate stringref values. Finally, stringref values can be compared for equality.

The stringview facility allows WebAssembly to obtain a "view" on the contents of a stringref, treating it as a sequence of values in the WTF-8 and WTF-16 encodings, as well as treating a string as a sequence of codepoints. WebAssembly programs can use stringviews to encode parts of strings to memory, access string contents by index, and to take substrings.

The stringref facility

One new reference type: stringref. Opaque, like externref and funcref.

When reading or writing encoded bytes, the address in memory at which to read or write the bytes depends on the memory model of the WebAssembly module.

address ::= i32 | i64

Such instructions also take the memory to which to read or write as an immediate.

Although stringref is a nullable type, trap if a null stringref value reaches any instruction in this proposal. The one exception is string.eq.

Creating strings

(string.new_utf8 $memory ptr:address bytes:i32)
  -> str:stringref
(string.new_lossy_utf8 $memory ptr:address bytes:i32)
  -> str:stringref
(string.new_wtf8 $memory ptr:address bytes:i32)
  -> str:stringref

Create a new string from the bytes bytes in memory at ptr. Out-of-bounds access will trap. The maximum value for bytes is 2^31-1; passing a higher value traps.

These three instructions decode the bytes in three different ways:

  • string.new_utf8 decodes using a strict UTF-8 decoder. If the bytes are not valid UTF-8, trap.

  • string.new_lossy_utf8 decodes using a sloppy UTF-8 decoder: all maximal subparts of an invalid subsequence are decoded as if they were U+FFFD (the replacement character) instead. This instruction will never trap due to a decoding error. See the section entitled "U+FFFD Substitution of Maximal Subparts" in the Unicode standard, version 14.0.0, page 126.

  • string.new_wtf8 decodes using a strict WTF-8 decoder, which is like UTF-8 but also allows isolated surrogates. If the bytes are not valid WTF-8, trap.
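The difference between the three decoders can be approximated with Python's built-in codecs. This is only a sketch: the function names and `Trap` are hypothetical, Python's `errors="replace"` handler is used as a stand-in for the maximal-subpart substitution (an assumption about CPython's behavior, not something this proposal specifies), and the `surrogatepass` handler is tightened by hand because on its own it would also accept a surrogate pair encoded as two three-byte sequences, which WTF-8 does not allow.

```python
class Trap(Exception):
    """Stands in for a WebAssembly trap in this model."""

def new_utf8(data):
    # Strict UTF-8: any invalid byte sequence, including an encoded
    # surrogate, is a decoding error.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as err:
        raise Trap(str(err))

def new_lossy_utf8(data):
    # Never fails: invalid subsequences become U+FFFD.
    return data.decode("utf-8", errors="replace")

def new_wtf8(data):
    # Like strict UTF-8, but isolated surrogates are allowed.
    try:
        s = data.decode("utf-8", errors="surrogatepass")
    except UnicodeDecodeError as err:
        raise Trap(str(err))
    # A high surrogate directly followed by a low surrogate can only come
    # from an encoded surrogate pair, which is not valid WTF-8.
    for a, b in zip(s, s[1:]):
        if 0xD800 <= ord(a) <= 0xDBFF and 0xDC00 <= ord(b) <= 0xDFFF:
            raise Trap("surrogate pair encoded as two sequences")
    return s
```

For example, the bytes ED A0 80 (an encoded U+D800) are rejected by `new_utf8`, accepted by `new_wtf8`, and replaced by `new_lossy_utf8`.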

(string.new_wtf16 $memory ptr:address codeunits:i32)
  -> str:stringref

Create a new string from the codeunits code units encoded in memory at ptr. Out-of-bounds access will trap. ptr must be two-byte aligned, and will trap otherwise. The maximum value for codeunits is 2^30-1; passing a higher value traps. Each code unit is read from memory as if with i32.load16_u, and is therefore decoded using little-endian byte order.
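The checks and the byte order described for string.new_wtf16 can be modeled in a few lines of Python. This is a sketch under stated assumptions: linear memory is modeled as a `bytes` object, `Trap` and `read_wtf16_units` are hypothetical names, and the function returns the raw code units rather than a string.

```python
import struct

class Trap(Exception):
    """Stands in for a WebAssembly trap in this model."""

MAX_CODE_UNITS = 2**30 - 1

def read_wtf16_units(memory, ptr, codeunits):
    """Read `codeunits` little-endian 16-bit code units starting at `ptr`."""
    if ptr % 2 != 0:
        raise Trap("ptr is not two-byte aligned")
    if codeunits > MAX_CODE_UNITS:
        raise Trap("too many code units")
    if ptr + 2 * codeunits > len(memory):
        raise Trap("out-of-bounds access")
    # "<" selects little-endian, "H" is an unsigned 16-bit integer.
    return list(struct.unpack_from("<%dH" % codeunits, memory, ptr))
```

With memory bytes `48 00 3D D8 00 DE` at an even address, the model yields the code units 0x0048, 0xD83D, 0xDE00.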

string.new size limits

Creating a string is a form of dynamic allocation and can fail. The same implementation running on different machines can have different behaviors. The specification can only say that byte/code-unit sizes above a certain limit must fail; but for sizes within the limits, the allocations may fail. If an allocation fails, the implementation must trap. Fallible string.new is a possible future extension.

String literals

(string.const contents:i32)
  -> str:stringref

Create a new string from the literal string contents, as in (string.const "Hello, World!"). This instruction is constant and can be used in global variable initializers.

String literal section

The string.const instruction indicates the literal as an i32 index into a new regular section: a string table, encoded as a vec(vec(u8)) of valid WTF-8 strings. Because literal strings can contain codepoint 0, strings in the string table do not use NUL as a terminator. The string table section must immediately precede the global section, or where the global section would be, in the binary.
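To make the layout concrete, here is a small Python sketch that serializes a string table, assuming the usual WebAssembly conventions that a section is an id byte followed by a u32 byte size, and that a vec is a u32 count followed by its elements. The section id 14 and the leading 0x00 placeholder are taken from the binary encoding given later in this document; the function names are hypothetical.

```python
def uleb128(n):
    """Unsigned LEB128 encoding, as used for u32 values in WebAssembly."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_string_section(literals):
    """Serialize a list of WTF-8 byte strings as the string literal section."""
    body = bytearray([0x00])              # placeholder for future expansion
    body += uleb128(len(literals))        # outer vec: number of strings
    for lit in literals:
        body += uleb128(len(lit)) + lit   # inner vec(u8): length, then bytes
    return bytes([14]) + uleb128(len(body)) + bytes(body)
```

Each literal carries an explicit length, so a literal such as `b"a\x00b"` is stored as-is with no terminator.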

Though it is useful for string literals used in constant instructions to appear early in the module binary, it may be advantageous to defer string literals that are only used at run-time to later in the module binary. This can get bulky string data off the hot path, allowing a WebAssembly implementation to start compiling functions as soon as possible. The encoding of the string literals section is preceded by a placeholder 0x00 value, allowing for the possibility of a deferred string literal section as a future extension.

string.const size limits

The maximum size for the WTF-8 encoding of an individual string literal is 2^31-1 bytes. Embeddings may impose their own, more restrictive limits. As with string.new_wtf8, instantiating a module with string literals may fail due to lack of memory resources, even if the string size is formally within the limits. However, string.const itself never traps when passed a valid literal offset.

Accessing string contents

All parameters and return values measuring a number of codepoints or a number of code units represent these sizes as unsigned values.

(string.measure_utf8 str:stringref)
  -> codeunits:i32

Measure the number of code units (bytes) that would be required to encode the contents of the string str to UTF-8. If the string contains an isolated surrogate, return -1.

The maximum number of code units returned by string.measure_utf8 is 2^31-1. If an encoding would require more code units than the limit, the result is -1.

(string.measure_wtf8 str:stringref)
  -> codeunits:i32

Measure the number of code units (bytes) that would be required to encode the codepoints of the string str to WTF-8.

Note that this instruction also serves to measure an encoding length for UTF-8 when isolated surrogates are replaced with U+FFFD ("lossy UTF-8"); the same number of bytes is required to encode U+FFFD as would be required to encode an isolated surrogate to WTF-8.

The maximum number of code units returned by string.measure_wtf8 is 2^31-1. If an encoding would require more code units than the limit, the result is -1.

(string.measure_wtf16 str:stringref)
  -> codeunits:i32

Measure the number of code units that would be required to encode the contents of the string str to WTF-16.

The maximum number of code units returned by string.measure_wtf16 is 2^30-1. If an encoding would require more code units than the limit, the result is -1.
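The three measurements can be modeled in Python as below. This is a sketch, not the proposal's API: a string is modeled as a Python `str` whose elements are codepoints (and which is assumed never to hold a high surrogate directly followed by a low surrogate), the function names mirror the instructions only for readability, and the size limits are left out.

```python
def _is_surrogate(cp):
    return 0xD800 <= cp <= 0xDFFF

def _wtf8_len(cp):
    # Bytes needed for one codepoint in WTF-8 (same as UTF-8 for scalars).
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4

def measure_wtf8(s):
    return sum(_wtf8_len(ord(c)) for c in s)

def measure_utf8(s):
    # An isolated surrogate cannot be encoded to UTF-8: report -1.
    if any(_is_surrogate(ord(c)) for c in s):
        return -1
    return measure_wtf8(s)

def measure_wtf16(s):
    # Codepoints above the BMP need a surrogate pair (two code units).
    return sum(2 if ord(c) >= 0x10000 else 1 for c in s)
```

The model also shows why measure_wtf8 doubles as the lossy UTF-8 measurement: an isolated surrogate and U+FFFD both count for three bytes.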

(string.encode_utf8 $memory str:stringref ptr:address)
  -> codeunits:i32

Encode the contents of the string str as UTF-8 to memory at ptr. If an isolated surrogate is seen, trap. Return the number of code units written, which will be the same as returned by the corresponding string.measure_utf8.

The maximum number of bytes that can be encoded at once by string.encode_utf8 is 2^31-1. If an encoding would require more bytes, it is as if the codepoints can't be encoded (a trap).

(string.encode_lossy_utf8 $memory str:stringref ptr:address)
  -> codeunits:i32

Encode the contents of the string str as UTF-8 to memory at ptr. If an isolated surrogate is seen, encode U+FFFD (the replacement character) instead. Return the number of code units written, which will be the same as returned by the corresponding string.measure_wtf8.

The maximum number of bytes that can be encoded at once by string.encode_lossy_utf8 is 2^31-1. If an encoding would require more bytes, it is as if the codepoints can't be encoded (a trap).

(string.encode_wtf8 $memory str:stringref ptr:address)
  -> codeunits:i32

Encode the contents of the string str as WTF-8 to memory at ptr. Return the number of code units written, which will be the same as returned by the corresponding string.measure_wtf8.

The maximum number of bytes that can be encoded at once by string.encode_wtf8 is 2^31-1. If an encoding would require more bytes, it is as if the codepoints can't be encoded (a trap).

(string.encode_wtf16 $memory str:stringref ptr:address)
  -> codeunits:i32

Encode the contents of the string str as WTF-16 to memory at ptr. Return the number of code units written, which will be the same as returned by the corresponding string.measure_wtf16.

Each code unit is written to memory as if stored by i32.store16, so WTF-16 code units are in little-endian byte order.

The maximum number of bytes that can be encoded at once by string.encode_wtf16 is 2^31-1. If an encoding would require more bytes, it is as if the codepoints can't be encoded (a trap).

Concatenation

(string.concat a:stringref b:stringref) -> stringref

Return a new stringref containing the codepoints from a followed by the codepoints from b.

Note that the implementation should take care when, at any future time, treating the resulting string as a sequence of codepoints. If a's last codepoint is a high surrogate and b's first codepoint is a low surrogate, these two codepoints combine into one, as if they were the two code units of a UTF-16-encoded unicode scalar value.

Concatenating two strings is a form of dynamic allocation and can fail. If an allocation fails, the implementation must trap. Fallible string.concat is a possible future extension.
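The surrogate-joining rule can be illustrated with a short Python model. It is a sketch under the assumption that a string is a Python `str` of codepoints that never holds a high surrogate directly followed by a low surrogate; `concat` is a hypothetical name for the model, not the instruction itself.

```python
def concat(a, b):
    """Concatenate two modeled strings, joining a trailing high surrogate
    of `a` with a leading low surrogate of `b` into one codepoint."""
    if a and b:
        hi, lo = ord(a[-1]), ord(b[0])
        if 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF:
            joined = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
            return a[:-1] + chr(joined) + b[1:]
    return a + b
```

A consequence is that the codepoint length of the result can be one less than the sum of the operands' codepoint lengths, while the WTF-16 code unit length is always the exact sum.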

Predicates

(string.eq a:stringref b:stringref) -> i32

If both a and b are null, return 1. If only one of them is null, return 0. Otherwise return 1 if the strings a and b contain the same codepoint sequence, or 0 otherwise.

(string.is_usv_sequence str:stringref)
  -> bool:i32

Return 1 if the string str is a sequence of unicode scalar values, and 0 otherwise. A 0 result indicates that str contains isolated surrogates.

The stringview facility

Three new reference types: stringview_wtf8, stringview_wtf16, and stringview_iter. Opaque, like externref and funcref.

stringview_wtf8

(string.as_wtf8 str:stringref)
  -> view:stringview_wtf8

Obtain a view on a string's contents as WTF-8. The stringview can then be used to interpret the string as a byte sequence in the WTF-8 encoding.

(stringview_wtf8.advance view:stringview_wtf8 pos:i32 bytes:i32)
  -> next_pos:i32

Starting at offset pos into the WTF-8 encoding of view, return the highest codepoint-boundary offset that is not greater than pos + bytes.

If pos is greater than the WTF-8 byte length of view, it is as if it were instead given as the byte length. If pos is less than the byte length but does not indicate the byte offset of the start of a codepoint, it is advanced to the next codepoint (or the end of the string, for the last codepoint). Collectively these transformations are the "WTF-8 position treatment".

If the mathematical value of next_pos would be greater than 2^31, trap. (Future extensions of the stringref proposal along the lines of the memory64 proposal may allow for 64-bit variants of the position-using instructions, which could relax this restriction.)
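One way to read the "WTF-8 position treatment" and stringview_wtf8.advance is sketched below in Python, operating directly on the WTF-8 bytes. The helper names are hypothetical, the result of `advance` is interpreted as being rounded down to a codepoint boundary, and the trap on overly large results is omitted.

```python
def is_boundary(data, pos):
    """True at the end of the data or at the first byte of a codepoint."""
    return pos == len(data) or (data[pos] & 0xC0) != 0x80

def treat_position(data, pos):
    """Clamp to the byte length, then move forward to a codepoint start."""
    pos = min(pos, len(data))
    while not is_boundary(data, pos):
        pos += 1
    return pos

def advance(data, pos, nbytes):
    """Highest codepoint boundary not greater than pos + nbytes."""
    pos = treat_position(data, pos)
    limit = min(pos + nbytes, len(data))
    # pos is itself a boundary, so this loop stops at pos at the latest.
    while not is_boundary(data, limit):
        limit -= 1
    return limit
```

For the seven bytes of "aé😀" (61 C3 A9 F0 9F 98 80), position 2 falls inside "é" and is treated as 3, and advancing from 0 by the maximum unsigned value returns 7, the byte length.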

(stringview_wtf8.encode_utf8 $memory view:stringview_wtf8 ptr:address pos:i32 bytes:i32)
  -> next_pos:i32, bytes:i32
(stringview_wtf8.encode_lossy_utf8 $memory view:stringview_wtf8 ptr:address pos:i32 bytes:i32)
  -> next_pos:i32, bytes:i32
(stringview_wtf8.encode_wtf8 $memory view:stringview_wtf8 ptr:address pos:i32 bytes:i32)
  -> next_pos:i32, bytes:i32

Write a subsequence of the WTF-8 encoding of view to memory at ptr, starting at the WTF-8 offset pos, writing no more than bytes bytes. No NUL byte is written. Return the WTF-8 offset of the next codepoint to encode, along with the number of bytes written.

pos receives the "WTF-8 position treatment", as for stringview_wtf8.advance.

If the mathematical value of next_pos would be greater than 2^31, trap. (Future extensions of the stringref proposal along the lines of the memory64 proposal may allow for 64-bit variants of the position-using instructions, which could relax this restriction.)

If an isolated surrogate is seen, the behavior depends on the instruction:

  • stringview_wtf8.encode_utf8 will trap.
  • stringview_wtf8.encode_lossy_utf8 will encode U+FFFD.
  • stringview_wtf8.encode_wtf8 will encode the isolated surrogate.

(stringview_wtf8.slice view:stringview_wtf8 start:i32 end:i32)
  -> str:stringref

Return a substring of view, for the WTF-8 bytes starting at offset start and continuing to but not including end. start and end receive the "WTF-8 position treatment", as for stringview_wtf8.advance.

stringview_wtf16

(string.as_wtf16 str:stringref)
  -> view:stringview_wtf16

Obtain a view on a string's contents as WTF-16. The stringview can then be used to interpret the string as a sequence of 16-bit code units in the WTF-16 encoding.

(stringview_wtf16.length view:stringview_wtf16)
  -> length:i32

Return the total number of 16-bit code units necessary to represent view in the WTF-16 encoding.

(stringview_wtf16.get_codeunit view:stringview_wtf16 pos:i32)
  -> codeunit:i32

Return the 16-bit code unit at offset pos in the WTF-16 encoding of view. If pos is greater than or equal to the WTF-16 length of view, trap.

(stringview_wtf16.encode $memory view:stringview_wtf16 ptr:address pos:i32 len:i32)
  -> codeunits:i32

Write a subsequence of the WTF-16 encoding of view to memory at ptr, starting at the WTF-16 offset pos, writing no more than len 16-bit code units. If ptr is not two-byte aligned, trap. Return the number of code units written.

If pos is greater than the number of WTF-16 code units in view, it is as if it were instead given as the code unit length. This transformation is the "WTF-16 position treatment".

(stringview_wtf16.slice view:stringview_wtf16 start:i32 end:i32)
  -> str:stringref

Return a substring of view, for the WTF-16 code units starting at offset start and continuing to but not including end. start and end receive the "WTF-16 position treatment", as for stringview_wtf16.encode.
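The WTF-16 view operations can be modeled on a plain list of code units, as in the Python sketch below. All names are hypothetical; in particular, returning an empty slice when start exceeds end is an assumption of this model rather than something stated above.

```python
class Trap(Exception):
    """Stands in for a WebAssembly trap in this model."""

def to_wtf16(s):
    """WTF-16 code units of a modeled string (a str of codepoints)."""
    units = []
    for c in s:
        cp = ord(c)
        if cp >= 0x10000:
            cp -= 0x10000
            units += [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]
        else:
            units.append(cp)
    return units

def get_codeunit(units, pos):
    if pos >= len(units):
        raise Trap("position out of range")
    return units[pos]

def slice_units(units, start, end):
    # "WTF-16 position treatment": clamp both offsets to the length.
    start = min(start, len(units))
    end = min(end, len(units))
    return units[start:end]

def ends_with(units, suffix):
    # Unsigned 32-bit wraparound is harmless: a wrapped start is clamped.
    start = (len(units) - len(suffix)) & 0xFFFFFFFF
    return slice_units(units, start, 0xFFFFFFFF) == suffix
```

Note that slicing in code units can split a surrogate pair, in which case the resulting string contains an isolated surrogate.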

stringview_iter

(string.as_iter str:stringref)
  -> view:stringview_iter

Obtain a view on a string's contents as an iterator over the codepoints in str, initially positioned at the beginning of the string. The stringview can then be used to iterate over the codepoints of the string.

(stringview_iter.next view:stringview_iter)
  -> codepoint:i32

If view is already at the end of the string, return -1. Otherwise return the codepoint currently pointed to by the iterator, and advance the iterator's position by one codepoint.

(stringview_iter.advance view:stringview_iter codepoints:i32)
  -> codepoints:i32

Advance the iterator view by up to codepoints codepoints. Return the number of codepoints that were actually consumed.

(stringview_iter.rewind view:stringview_iter codepoints:i32)
  -> codepoints:i32

Rewind the iterator view by up to codepoints codepoints. Return the number of codepoints that were actually consumed.

(stringview_iter.slice view:stringview_iter codepoints:i32)
  -> str:stringref

Return a substring of view, starting at the current position of view and continuing for at most codepoints codepoints.
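The iterator semantics can be summarized with a small Python class. This is a model only: `StringIter` is a hypothetical name, counts are treated as unsigned, and `slice` is assumed to leave the iterator position unchanged, since the description above does not say that it advances.

```python
class StringIter:
    """Model of a stringview_iter over a str of codepoints."""

    def __init__(self, s):
        self.codepoints = [ord(c) for c in s]
        self.pos = 0  # initially positioned at the beginning

    def next(self):
        if self.pos >= len(self.codepoints):
            return -1
        cp = self.codepoints[self.pos]
        self.pos += 1
        return cp

    def advance(self, count):
        step = min(count, len(self.codepoints) - self.pos)
        self.pos += step
        return step  # codepoints actually consumed

    def rewind(self, count):
        step = min(count, self.pos)
        self.pos -= step
        return step  # codepoints actually rewound

    def slice(self, count):
        chunk = self.codepoints[self.pos:self.pos + count]
        return "".join(chr(cp) for cp in chunk)
```

Advancing by the maximum unsigned value therefore moves to the end and returns the number of codepoints in the remainder of the string.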

GC integration

Though this proposal does not have a dependency on the GC proposal, compiler authors that target GC will likely want to be able to encode the contents of a stringref to a GC array, and vice versa.

The primary use cases are:

  1. String-builder interfaces, which will likely use a WTF-8 or WTF-16 array as intermediate storage, depending on the language being compiled. We will need to be able to create strings from arrays. When the string contents are ready, we will almost always decode from array offset 0 and continue to some offset before the end of the array. We'll also need to be able to append a string's contents to an array at a given offset.
  2. Communicating strings with another process, possibly over the network. Here, UTF-8 and WTF-8 are the important encodings, and we need to be able to read and write to arbitrary slices of arrays.

The instructions below shall be available in WebAssembly implementations that support both GC and stringrefs.

(string.new_utf8_array codeunits:$t start:i32 end:i32)
  if expand($t) => array i8
  -> str:stringref
(string.new_lossy_utf8_array codeunits:$t start:i32 end:i32)
  if expand($t) => array i8
  -> str:stringref
(string.new_wtf8_array codeunits:$t start:i32 end:i32)
  if expand($t) => array i8
  -> str:stringref

Create a new string from a subsequence of the codeunits bytes in a GC-managed array, starting from offset start and continuing to but not including end. If end is less than start or is greater than the array length, trap. The bytes are decoded in the same way as string.new_utf8, string.new_lossy_utf8, and string.new_wtf8, respectively. The maximum value for end - start is 2^31-1; passing a higher value traps.

(string.new_wtf16_array codeunits:$t start:i32 end:i32)
  if expand($t) => array i16
  -> str:stringref

Create a new string from a subsequence of the codeunits WTF-16 code units in a GC-managed array, starting from offset start and continuing to but not including end. If end is less than start or is greater than the array length, trap. The maximum value for end - start is 2^30-1; passing a higher value traps.

(string.encode_utf8_array str:stringref array:$t start:i32)
  if expand($t) => array (mut i8)
  -> codeunits:i32
(string.encode_lossy_utf8_array str:stringref array:$t start:i32)
  if expand($t) => array (mut i8)
  -> codeunits:i32
(string.encode_wtf8_array str:stringref array:$t start:i32)
  if expand($t) => array (mut i8)
  -> codeunits:i32
(string.encode_wtf16_array str:stringref array:$t start:i32)
  if expand($t) => array (mut i16)
  -> codeunits:i32

Encode the contents of the string str as WTF-8 or WTF-16, respectively, to the GC-managed array array, starting at offset start. Return the number of code units written, which will be the same as the result of the corresponding string.measure_wtf8 or string.measure_wtf16, respectively. If there is not space for the code units in the array, trap. Note that no NUL terminator is ever written.

For string.encode_utf8_array, trap if an isolated surrogate is seen. For string.encode_lossy_utf8_array, replace isolated surrogates with U+FFFD.

Binary encoding

reftype ::= ...
         |  0x64 ⇒ stringref         ; SLEB128(-0x1c)
         |  0x63 ⇒ stringview_wtf8   ; SLEB128(-0x1d)
         |  0x62 ⇒ stringview_wtf16  ; SLEB128(-0x1e)
         |  0x61 ⇒ stringview_iter   ; SLEB128(-0x1f)

instr ::= ...
       |  0xfb 0x80:u32 $mem:u32       ⇒ string.new_utf8 $mem
       |  0xfb 0x81:u32 $mem:u32       ⇒ string.new_wtf16 $mem
       |  0xfb 0x82:u32 $idx:u32       ⇒ string.const $idx
       |  0xfb 0x83:u32                ⇒ string.measure_utf8
       |  0xfb 0x84:u32                ⇒ string.measure_wtf8
       |  0xfb 0x85:u32                ⇒ string.measure_wtf16
       |  0xfb 0x86:u32 $mem:u32       ⇒ string.encode_utf8 $mem
       |  0xfb 0x87:u32 $mem:u32       ⇒ string.encode_wtf16 $mem
       |  0xfb 0x88:u32                ⇒ string.concat
       |  0xfb 0x89:u32                ⇒ string.eq
       |  0xfb 0x8a:u32                ⇒ string.is_usv_sequence
       |  0xfb 0x8b:u32 $mem:u32       ⇒ string.new_lossy_utf8 $mem
       |  0xfb 0x8c:u32 $mem:u32       ⇒ string.new_wtf8 $mem
       |  0xfb 0x8d:u32 $mem:u32       ⇒ string.encode_lossy_utf8 $mem
       |  0xfb 0x8e:u32 $mem:u32       ⇒ string.encode_wtf8 $mem
       |  0xfb 0x90:u32                ⇒ string.as_wtf8
       |  0xfb 0x91:u32                ⇒ stringview_wtf8.advance
       |  0xfb 0x92:u32 $mem:u32       ⇒ stringview_wtf8.encode_utf8 $mem
       |  0xfb 0x93:u32                ⇒ stringview_wtf8.slice
       |  0xfb 0x94:u32 $mem:u32       ⇒ stringview_wtf8.encode_lossy_utf8 $mem
       |  0xfb 0x95:u32 $mem:u32       ⇒ stringview_wtf8.encode_wtf8 $mem
       |  0xfb 0x98:u32                ⇒ string.as_wtf16
       |  0xfb 0x99:u32                ⇒ stringview_wtf16.length
       |  0xfb 0x9a:u32                ⇒ stringview_wtf16.get_codeunit
       |  0xfb 0x9b:u32 $mem:u32       ⇒ stringview_wtf16.encode $mem
       |  0xfb 0x9c:u32                ⇒ stringview_wtf16.slice
       |  0xfb 0xa0:u32                ⇒ string.as_iter
       |  0xfb 0xa1:u32                ⇒ stringview_iter.next
       |  0xfb 0xa2:u32                ⇒ stringview_iter.advance
       |  0xfb 0xa3:u32                ⇒ stringview_iter.rewind
       |  0xfb 0xa4:u32                ⇒ stringview_iter.slice
       |  0xfb 0xb0:u32           [gc] ⇒ string.new_utf8_array
       |  0xfb 0xb1:u32           [gc] ⇒ string.new_wtf16_array
       |  0xfb 0xb2:u32           [gc] ⇒ string.encode_utf8_array
       |  0xfb 0xb3:u32           [gc] ⇒ string.encode_wtf16_array
       |  0xfb 0xb4:u32           [gc] ⇒ string.new_lossy_utf8_array
       |  0xfb 0xb5:u32           [gc] ⇒ string.new_wtf8_array
       |  0xfb 0xb6:u32           [gc] ⇒ string.encode_lossy_utf8_array
       |  0xfb 0xb7:u32           [gc] ⇒ string.encode_wtf8_array

;; New section.  If present, must be present only once, and right before
;; the globals section (or where the globals section would be).  Each
;; vec(u8) must be valid WTF-8.  The 0x00 is a placeholder for future
;; expansion.  One possible expansion would be to replace the 0x00 with
;; a u32 indicating a count of supplementary string literals that are in
;; a section that appears later in the binary, after the code section.
stringrefs ::= section_14(0x00 vec(vec(u8)))

Note that the u32 (uleb) encoding for the opcode after the 0xfb prefix takes two bytes, for opcode values between 0x80 and 0x3fff.
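The byte values above can be checked with a short Python sketch of the two LEB128 encoders. The function names are hypothetical; the encodings themselves follow the standard LEB128 scheme of seven payload bits per byte.

```python
def uleb128(n):
    """Unsigned LEB128: low 7 bits first, high bit set on all but the last."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def sleb128(n):
    """Signed LEB128: stop once the rest is pure sign extension."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7  # arithmetic shift, keeps the sign
        sign_bit = byte & 0x40
        done = (n == 0 and not sign_bit) or (n == -1 and sign_bit)
        out.append(byte if done else byte | 0x80)
        if done:
            return bytes(out)

def encode_opcode(opcode):
    """0xfb prefix followed by the opcode as a u32."""
    return bytes([0xFB]) + uleb128(opcode)
```

For instance, string.new_utf8 (opcode 0x80) is encoded as the three bytes FB 80 01, and the stringref type byte 0x64 is the one-byte SLEB128 encoding of -0x1c.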

Examples

We assume that the textual syntax for instructions that take a memory operand allows you to elide the memory, in which case it defaults to 0.

Make string from NUL-terminated UTF-8 in memory

(func $string-from-utf8 (param $ptr i32) (result stringref)
  local.get $ptr
  local.get $ptr
  call $strlen
  string.new_utf8)

If the bytes being decoded aren't actually valid UTF-8, this function will trap. Use string.new_lossy_utf8 in contexts where replacing invalid data with U+FFFD is a better strategy than trapping.

Make string from an array of WTF-8 code units in memory

(func $string-from-wtf8n (param $ptr i32) (param $len i32) (result stringref)
  local.get $ptr
  local.get $len
  string.new_wtf8)

Note that string.new_wtf8 (and string.new_wtf8_array) are always strict decoders: if the bytes are not valid WTF-8, the instruction traps.

Make string from UTF-16 in memory

(func $string-from-utf16 (param $ptr i32) (param $units i32) (result stringref)
  local.get $ptr
  local.get $units
  string.new_wtf16)

This proposal doesn't distinguish between UTF-16 and WTF-16 at all; rather it just deals in WTF-16, as most source languages that expose 16-bit code units to users actually expose WTF-16 strings.

Number of codepoints in string

(func $codepoint-length (param $str stringref) (result i32)
  local.get $str
  string.as_iter      ;; Get iterator view
  i32.const -1        ;; advance by all codepoints
  stringview_iter.advance) ;; return number of codepoints advanced

String literals

(global $hey stringref (string.const "Hey"))

(func $howdy (result stringref)
  (string.const "Howdy"))

(func $is-cowboy (param $str stringref) (result i32)
  local.get $str
  call $howdy
  string.eq)

Return the first codepoints of a string, as a stringref

(func $prefix (param $str stringref) (param $codepoints i32)
              (result stringref)
  local.get $str
  string.as_iter
  local.get $codepoints
  stringview_iter.slice)

Return a slice of WTF-16 code units of a string, as a stringref

(func $slice (param $str stringref)
             (param $offset i32) (param $codeunits i32)
             (result stringref)
  local.get $str
  string.as_wtf16
  local.get $offset
  local.get $offset
  local.get $codeunits
  i32.add
  stringview_wtf16.slice)

Suffix, prefix comparisons

There are a few ways to compare against a substring, but the easiest is probably to slice the string, which is something you can do only with respect to a particular encoding and view. Given that we're comparing against known strings, we know how long of a slice to take.

(func $starts-with-hey? (param $str stringref) (result i32)
  local.get $str
  string.as_wtf8
  i32.const 0
  i32.const 3
  stringview_wtf8.slice
  global.get $hey
  string.eq)

(func $ends-with-howdy?/wtf8 (param $str stringref) (result i32)
  (local $wtf8 stringview_wtf8)

  local.get $str
  string.as_wtf8
  local.set $wtf8

  local.get $wtf8
  local.get $wtf8
  ;; Get wtf-8 offset of end
  i32.const 0
  i32.const -1
  stringview_wtf8.advance
  ;; Subtract 5.  Given WTF-8 position treatment, OK to wrap or 
  ;; not be on codepoint boundary.
  i32.const 5
  i32.sub
  ;; Slice until end.  If string ends with "howdy", these will be
  ;; 5 1-byte codepoints.
  i32.const -1
  stringview_wtf8.slice

  string.const "Howdy"
  string.eq)

;; WTF-16 flavor is similar.
(func $ends-with-howdy?/wtf16 (param $str stringref) (result i32)
  (local $wtf16 stringview_wtf16)

  local.get $str
  string.as_wtf16
  local.set $wtf16

  ;; Slice last 5 code units.
  local.get $wtf16
  local.get $wtf16
  stringview_wtf16.length
  i32.const 5
  i32.sub
  i32.const -1
  stringview_wtf16.slice

  string.const "Howdy"
  string.eq)

;; Finally, a version with the iterator API.
(func $ends-with-howdy?/iter (param $str stringref) (result i32)
  (local $iter stringview_iter)

  local.get $str
  string.as_iter
  local.set $iter

  ;; Advance to end.
  local.get $iter
  i32.const -1
  stringview_iter.advance
  drop                      ;; discard count of codepoints advanced

  ;; Rewind by 5.
  local.get $iter
  i32.const 5
  stringview_iter.rewind
  drop                      ;; discard count of codepoints rewound

  ;; Slice.
  local.get $iter
  i32.const 5
  stringview_iter.slice

  ;; Compare.
  string.const "Howdy"
  string.eq)

Which version of ends-with-howdy? will a source language produce? They are essentially equivalent in this use case of comparing against a static string. In the general case, a source language that processes strings in terms of codepoints would probably use the iterator; languages that treat strings as UTF-8 sequences would produce the WTF-8 version; and those that process strings in terms of 16-bit code units would compile to the WTF-16 version.

One could instead compare the strings one code unit (or one codepoint) at a time, to avoid creating the slice.
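As a sketch of that alternative, the following function compares WTF-16 code units directly, using only instructions that already appear in the examples above. The name $ends-with?/wtf16 and its $suffix parameter are ours, chosen for illustration; this is one possible shape, not a recommended lowering.

```wat
;; Hypothetical helper: does $str end with $suffix?  Compares WTF-16
;; code units in place, so no slice is allocated.
(func $ends-with?/wtf16 (param $str stringref) (param $suffix stringref)
                        (result i32)
  (local $a stringview_wtf16)
  (local $b stringview_wtf16)
  (local $alen i32)
  (local $blen i32)
  (local $i i32)                   ;; initial value of 0 is start

  local.get $str
  string.as_wtf16
  local.set $a
  local.get $a
  stringview_wtf16.length
  local.set $alen

  local.get $suffix
  string.as_wtf16
  local.set $b
  local.get $b
  stringview_wtf16.length
  local.set $blen

  ;; A suffix longer than the string cannot match.
  local.get $blen
  local.get $alen
  i32.gt_u
  if
    i32.const 0
    return
  end

  block $mismatch
    loop $next
      ;; All code units of the suffix compared equal: match.
      local.get $i
      local.get $blen
      i32.ge_u
      if
        i32.const 1
        return
      end

      ;; Code unit of $str at offset ($alen - $blen + $i).
      local.get $a
      local.get $alen
      local.get $blen
      i32.sub
      local.get $i
      i32.add
      stringview_wtf16.get_codeunit

      ;; Code unit of $suffix at offset $i.
      local.get $b
      local.get $i
      stringview_wtf16.get_codeunit

      i32.ne
      br_if $mismatch

      local.get $i
      i32.const 1
      i32.add
      local.set $i
      br $next
    end
  end
  i32.const 0)
```

Calling this with string.const "Howdy" as $suffix should give the same answer as the WTF-16 flavor of ends-with-howdy? above, at the cost of one stringview_wtf16.get_codeunit call per compared code unit.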

Stepping back a bit, prefix and suffix checks are examples of operations for which the stringref proposal should facilitate high-performance implementations. The primary strategy of the stringref proposal is to allow any such operation to be built in terms of its primitives. However, if there are important compound operations (e.g. prefix/suffix checks) that can be sped up with a dedicated instruction, we should be open to adding more instructions.

Store a stringref without copying

(table $strings 100 stringref)
(global $next-handle (mut i32) (i32.const 0))

(func $intern-string (param $str stringref) (result i32)
  (local $handle i32)
  global.get $next-handle
  local.tee $handle
  local.get $str
  table.set $strings
  local.get $handle
  i32.const 1
  i32.add
  global.set $next-handle
  local.get $handle)
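To get the string back from its handle, a module could pair $intern-string with a lookup function along these lines. $lookup-string is a hypothetical name; table.get is the standard reference-types instruction, which traps if the handle is out of bounds and yields a null reference for a slot that was never set.

```wat
;; Hypothetical companion to $intern-string: map a handle back to
;; the stringref stored in the $strings table.
(func $lookup-string (param $handle i32) (result stringref)
  local.get $handle
  table.get $strings)
```

Code that works in linear memory can then carry the i32 handle around in place of the string itself. Note that the table above has only 100 slots, so a real program would also need to grow the table or recycle handles.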

Copy string contents to application-managed memory

(func $malloc (param i32) (result i32))
(func $utf8-contents (param $str stringref) (result i32)
  (local $cur i32)
  (local $len i32)
  (local $ptr i32)
  local.get $str
  string.measure_utf8
  local.set $len

  block $valid
    local.get $len
    i32.const -1
    i32.ne
    br_if $valid
    unreachable                    ;; trap on error
  end

  local.get $len
  i32.const 1
  i32.add
  call $malloc                     ;; reserve space for bytes and NUL
  local.set $ptr

  local.get $str
  local.get $ptr
  string.encode_utf8        ;; push bytes written, same as $len

  local.get $ptr
  i32.add
  i32.const 0
  i32.store8                       ;; write NUL

  local.get $ptr
  return)

Using string.measure_utf8 ensures that the encoded string is a valid unicode scalar value sequence. How to handle invalid UTF-8 is up to the user; instead of unreachable we could throw an exception.

Note that in this case, the subsequent string.encode_utf8 could just as well have been string.encode_lossy_utf8 or string.encode_wtf8, as these instructions are all the same for strings that do not contain isolated surrogates, and we checked that there were none.

If we meant to handle isolated surrogates, we could use string.measure_wtf8 instead.
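A minimal sketch of that variant follows, reusing $malloc from above. The name $wtf8-contents is ours, and we assume, as we read the proposal, that string.measure_wtf8 has no error result, so the validity check disappears.

```wat
;; Hypothetical WTF-8 variant of $utf8-contents: isolated surrogates
;; are encoded rather than rejected.
(func $wtf8-contents (param $str stringref) (result i32)
  (local $ptr i32)

  local.get $str
  string.measure_wtf8              ;; assumed to have no error result
  i32.const 1
  i32.add
  call $malloc                     ;; reserve space for bytes and NUL
  local.set $ptr

  local.get $str
  local.get $ptr
  string.encode_wtf8               ;; push bytes written

  local.get $ptr
  i32.add
  i32.const 0
  i32.store8                       ;; write NUL

  local.get $ptr)
```

The resulting bytes are only guaranteed to be WTF-8, so downstream code that requires strict UTF-8 would still have to validate them.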

Stream over contents of string

Assume you have a 1024-byte array of memory at $buf. This function will encode isolated surrogates as WTF-8.

(global $buf i32)
(func $process-wtf8 (param $ptr i32) (param $len i32))

(func $process-string (param $str stringref)
  (local $wtf8 stringview_wtf8)
  (local $cursor i32)                ;; initial value of 0 is start
  (local $bytes i32)

  local.get $str
  string.as_wtf8
  local.set $wtf8

  loop
    local.get $wtf8
    global.get $buf
    local.get $cursor
    i32.const 1024
    stringview_wtf8.encode_wtf8      ;; push next position, bytes written
    local.set $bytes
    local.set $cursor                ;; resume from here next time
    local.get $bytes
    i32.eqz
    if                               ;; if no bytes encoded, done
      return
    end

    global.get $buf
    local.get $bytes
    call $process-wtf8
    br 0                             ;; go around for the next chunk
  end)

Stream over UTF-16 code units of string, handling isolated surrogates

This function is probably slower than encoding chunks of the string to WTF-16 in linear memory, for longer strings.

(func $have-code-unit (param $codeunit i32))

(func $process-string (param $str stringref)
  (local $wtf16 stringview_wtf16)
  (local $cur i32)
  (local $len i32)

  local.get $str
  string.as_wtf16
  local.set $wtf16

  local.get $wtf16
  stringview_wtf16.length
  local.set $len

  block $done
    loop $loop
      local.get $cur
      local.get $len
      i32.ge_u
      br_if $done

      local.get $wtf16
      local.get $cur
      stringview_wtf16.get_codeunit
      call $have-code-unit

      i32.const 1
      local.get $cur
      i32.add
      local.set $cur
      br $loop
    end
  end)

Stream over codepoints of string, handling isolated surrogates

This function is probably slower than encoding chunks of the string to WTF-8 in memory, for longer strings.

(func $have-codepoint (param $codepoint i32))

(func $process-string (param $str stringref)
  (local $iter stringview_iter)
  (local $ch i32)

  local.get $str
  string.as_iter
  local.set $iter

  block $done
    loop $loop
      local.get $iter
      stringview_iter.next
      local.tee $ch

      i32.const -1
      i32.eq
      br_if $done

      local.get $ch
      call $have-codepoint
      br $loop
    end
  end)

Concatenate two strings

(func $append (param $a stringref) (param $b stringref)
              (result stringref)
  local.get $a
  local.get $b
  string.concat)

FAQ

What does Emscripten currently do for strings and how could WebAssembly strings help?

Generally speaking, Emscripten eagerly converts JavaScript strings to NUL-terminated UTF-8, allocating space for the UTF-8 encoding in linear memory using stack allocation. The stringToUTF8 function is written in JavaScript and handles surrogate pairs. However, for isolated surrogates, Emscripten's decoder appears to produce garbage.

For C functions that return strings, Emscripten parses NUL-terminated UTF-8 from memory, either using TextDecoder or via hand-rolled JavaScript. Presumably TextDecoder is significantly faster as it doesn't have to build rope strings.

Memory management is an issue, of course; the memory for a returned string value may or may not be owned by the caller.

This proposal avoids memory ownership issues entirely, via automatic memory management (implemented either via GC or reference counting). It also avoids eager string encoding onto the stack and the need for NUL termination, allowing string contents to be written to memory exactly where they are needed.

Why should WebAssembly strings be able to represent isolated surrogates?

The main motivation is to support source languages with WTF-16 strings (e.g. Java, Kotlin, C#). JVM-based and CLR-based languages treat strings as sequences of 16-bit code units. Sometimes programs written in e.g. Java will decode these sequences into codepoints encoded as WTF-16, but not always. Many common algorithms can be performed directly on the code units, for example prefix matching. Therefore to efficiently support Java and friends when compiled to WebAssembly, we need to support this view of strings as sequences of any 16-bit code units, without any validity constraints that enforce that surrogates always be properly paired. This is the main reason to support WTF-16 rather than just UTF-16.

An important secondary reason is interoperability with JavaScript hosts. For zero-copy interoperation with JavaScript and DOM facilities, it would be good for stringref to have the same semantics as a JavaScript string, which like Java is an arbitrary sequence of 16-bit code units.

Isolated surrogates are rare in JavaScript, but can occur via:

  • Reading invalid UTF-16 from external sources. However this is not common, as most services prefer UTF-8 over UTF-16 as an interchange format.
  • JavaScript code that creates strings whose code units are not valid UTF-16.
  • JavaScript code that processes strings in chunks and happens to split a chunk on a surrogate boundary.
    • This happens most often in JavaScript code that processes strings one code unit at a time.
  • JavaScript / DOM keyboard input event handlers (though this may be just a bug; [1], [2]).

Therefore we define a stringref as an arbitrary sequence of not just unicode scalar values, but also isolated surrogates. Note that this definition excludes codepoint sequences containing proper surrogate pairs. This restriction is enforced by construction for the WTF-8 and WTF-16 encoding schemes.

Are stringrefs mutable?

No. We don't need mutable strings when compiling Java, C#, or Python, and we don't need them when interoperating with JavaScript hosts. Immutable strings have the benefit that you can hand them to an untrusted interface without copying, and you know that interface won't be able to use the string to affect any of your own state.

It is not a goal for stringref to be the main string representation for programming languages that need mutable strings. Fortunately there are fewer and fewer of these languages as time goes on.

What do existing C++ APIs to JavaScript strings look like?

While developing this proposal, we realized that we might already have a design oracle as regards JavaScript integration: v8.h. Perhaps for languages that tend to work on strings in linear memory (C++, Rust), we can use the C++ interface to a JS engine as an indication of what interfaces we might need.

  • We can assume that v8.h has all the interfaces that Chromium needs, so we expect that the interfaces in v8.h are sufficient.
  • V8 wants to minimize API surface and historically has removed API, so we expect V8's interface is close to minimal.
  • C++ interfaces to different JS engines are similar. We can look at v8.h and draw conclusions for any engine.

The V8 C++ String API includes the following procedural interfaces:

  • Create a string from encoded bytes in memory
    • Supported encodings: one-byte, utf-8, utf-16
  • Get length of string when encoded as one-byte, utf-8, utf-16
    • Does not include unicode scalar value count
  • Predicate on string to identify strings represented using one byte per character (a cheap check) and strings that can be represented using one byte per character (possibly a linear search)
  • Write encoded bytes to memory
    • Supported encodings: one-byte, utf-8, utf-16
    • Options: hint that ropes should be flattened, include NUL terminator or not, whether to preserve NUL codepoints, whether to replace isolated surrogates with the replacement character or to trap
  • Support for strings whose characters are in linear memory and which shouldn't be copied ("external strings"); probably not appropriate for WebAssembly
  • Equality predicate
  • Concatenate two strings. Interestingly, v8.h has no interface to make a substring (slice).

We used this set of interfaces as a starting point for the stringref design. The need to support WebAssembly implementations that use WTF-8 to represent strings internally did cause us, however, to separate out some functionality into stringview.

What is the expected implementation on non-browser run-times?

Assuming that the non-browser implementation uses WTF-8 as the native string representation, then a stringref could be just a pointer, a length, and a reference count. Some implementations may also want to keep a flag indicating whether a string is valid UTF-8.

Generally speaking, WebAssembly doesn't specify the time or space complexity of its operations. In that regard, an implementation is free to implement e.g. string.concat via an eager copy. In practice however we expect the same dynamics that lead JavaScript implementations to natively support ropes and slices would hold with non-browser run-times. These implementations would also have their own heuristics for when to flatten strings.

When creating a stringview_wtf16 from a stringref on a system that represents stringref as WTF-8, we expect that some implementations will eagerly copy the string to a WTF-16 encoding. Others will want to implement a map from WTF-16 position to WTF-8 position via breadcrumbs.

What's the expected implementation in web browsers?

We expect that web browsers use JS strings as stringref.

We expect also that web browsers use JS strings directly as their stringview_wtf16 implementation, given that current web browsers represent strings internally as WTF-16 (with some optimizations for latin-1 strings).

For stringview_wtf8, we expect either an eager copy or breadcrumbs, as in the non-browser runtime case. For some small strings, an implementation may skip the eager copy and instead re-encode on the fly.

There is a possibility that some web browsers may eventually switch from the one-byte/two-byte representation to WTF-8 with breadcrumbs, which would make those web browsers use the same strategy as the non-browser case.

How should stringviews be represented in JavaScript hosts?

It's possible for a WebAssembly module to define an exported function that returns a stringview_iter. This proposal leaves the question of the JS API for stringviews to a post-MVP proposal. We expect that until such a proposal lands, attempting to pass a stringview across the WebAssembly/JS boundary will throw an exception, as was the case for i64 values before the BigInt proposal landed.

How do we expect Rust to compile to stringref?

Generally speaking, for Rust we expect eager copies to UTF-8 data when Rust receives a stringref.

Rust represents strings natively as well-formed UTF-8. Rust string processing routines can therefore assume that a UTF-8 string is valid. stringref strings are WTF-8, though. So we can expect that for a Rust interface that exports a function that takes a stringref parameter, wasm-bindgen would then use a WasmString type, which could be transformed to an Option<String> (which could fail or replace with U+FFFD if there are isolated surrogates). This will remove the need for TextDecoder/TextEncoder.

As an optimization for Rust modules that are designed to work with WebAssembly, WasmString may expose some methods to avoid an eager copy.

How do we expect JVM and CLR languages to compile to stringref?

We expect Java to use stringref directly to represent string values.

Java deals with strings as immutable sequences of 16-bit code units. Access to individual code units would use stringview_wtf16.

Alternately, a Java compiler might instead choose to use stringview_wtf16, eagerly obtaining WTF-16 views when it receives a stringref from the outside world.
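As a rough illustration of the first strategy, a Java compiler might lower String.length() and String.charAt(int) along the following lines. The function names are hypothetical. A real compiler would also need a bounds check before the access in order to throw StringIndexOutOfBoundsException, which this sketch omits.

```wat
;; Hypothetical lowering of Java's String.length().
(func $String.length (param $this stringref) (result i32)
  local.get $this
  string.as_wtf16
  stringview_wtf16.length)

;; Hypothetical lowering of Java's String.charAt(int), without the
;; bounds check that Java semantics require.
(func $String.charAt (param $this stringref) (param $index i32)
                     (result i32)
  local.get $this
  string.as_wtf16
  local.get $index
  stringview_wtf16.get_codeunit)
```

Under the second strategy, the compiler would call string.as_wtf16 once, when the string enters Java code, and pass the resulting stringview_wtf16 around instead.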

How do we expect Python to compile to stringref?

We expect CPython to provide a wrapper around stringref for strings that come from "outside". We expect PyPy to use stringref directly for all strings.

Python strings are immutable sequences of Unicode code points. This may include surrogates.

CPython's string support is abstract: all codepoint access goes through an accessor API. Therefore when CPython receives a stringref on a public interface, CPython could store that stringref in a table and then forward any indexed codepoint access to that stringref.

PyPy would instead use stringref directly to implement its strings. The PyPy maintainer notes that most strings in Python aren't accessed using indexed accessors, so probably PyPy would only obtain a view as needed.

How do we expect C++ to compile to stringref?

We expect that LLVM will be extended with an additional reference type, stringref, like the existing externref and funcref support, along with a number of builtins to expose the basic stringref operations. LLVM will be able to directly expose C++ functions to WebAssembly that take stringref parameters, removing the need for much Emscripten-side code. However as reference-typed values aren't storable to main memory, we expect that unless a C++ program is carefully built to integrate reference types, that most stringref values will be eagerly converted to WTF-8 on the WebAssembly boundary.

Is the stringref type nullable?

Oh God I guess so. ref.null string it is I guess!! 😭 😭 😭

What kinds of performance gains can we expect?

  • WebAssembly can receive encoded content of JS strings exactly where it is wanted: no need to stack-allocate then copy.
  • WebAssembly can process long strings in chunks rather than having to reserve space for the whole string.
  • WebAssembly can cheaply check incoming strings against literals, treating them as symbols.
  • Avoid JIT warmup for JS-implemented UTF-8 encode and decode.
  • Avoid allocation of subarrays when decoding, e.g. as used by Emscripten
  • Cheap prefix/suffix tests without reading whole string
  • WebAssembly can cheaply pass string literals to JS without decoding or copying

What's the security implication?

Right now, working with strings fundamentally means communicating UTF-8 via memory. To grant someone access to a string, you have to grant them access to all of your memory. This violates the principle of least privilege. Having reference-typed strings would limit the capability to just the immutable codepoint sequence in question, and not all of memory.

Additionally, interfacing between memory lifetimes in C/C++ and JavaScript is bug-prone. Using stringref would eliminate questions of memory ownership, reducing the risk of use-after-free, data corruption, write overruns, and privileged data leakage.

Why might you want to eagerly copy string contents to and from linear memory?

Some programming languages will be happy to deal with string contents via the stringview APIs, avoiding copies of string contents to linear or GC-managed memory. Some others will prefer to copy out a WTF-8 encoding to main memory, because that's how they are used to dealing with strings. This copying has an overhead but for algorithms that touch many code units it can be advantageous, as you get to inline the per-code-unit processing work rather than calling out to stringview interfaces.

As the stringview interfaces may exhibit polymorphism, they may have some per-operation overheads. For example, a stringview_wtf16 will be cheap to create in JavaScript, but accessing the code units still has to dispatch over whether the string is a rope or a slice, whether the codepoints are one-byte or two-byte, and so on. Even in non-browser WTF-8 implementations there will still be ropes and slices.
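To make the trade-off concrete, here is a sketch of the kind of inner loop that the copy-out approach enables. It assumes the string's WTF-8 bytes have already been written to linear memory, for example by a function like $utf8-contents above; $count-spaces is a hypothetical name.

```wat
;; Hypothetical: count ASCII spaces (0x20) in $len bytes of WTF-8
;; starting at $ptr in linear memory.
(func $count-spaces (param $ptr i32) (param $len i32) (result i32)
  (local $end i32)
  (local $count i32)

  local.get $ptr
  local.get $len
  i32.add
  local.set $end

  block $done
    loop $next
      local.get $ptr
      local.get $end
      i32.ge_u
      br_if $done

      ;; A plain memory load; no stringview call per code unit.
      local.get $ptr
      i32.load8_u
      i32.const 0x20
      i32.eq                       ;; 1 if this byte is a space, else 0
      local.get $count
      i32.add
      local.set $count

      local.get $ptr
      i32.const 1
      i32.add
      local.set $ptr
      br $next
    end
  end
  local.get $count)
```

Each iteration is a load, a compare and two adds, with no dispatch on the string's representation. Whether that outweighs the cost of the initial copy depends on how many code units the algorithm touches.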

Why not just use externref and imported functions?

The instruction set could be implemented with imported functions, replacing the stringref type with externref. So why bother adding it to WebAssembly itself? Three reasons: platform expressivity, performance, and security.

On the first point: if the strings feature required some capability from the host, then it would be clearly best as a library. For example, WebGL access falls in this category. But reference-typed strings are a more fundamental feature common to all languages that use automatic memory management. In that way they are closer to the GC proposal; although you could implement structs and arrays via externref and imports, if you did that you might as well compile to JavaScript instead of WebAssembly. It should be possible to make a WebAssembly program that uses reference-typed strings (because almost all such programs would have strings) without relying on any JavaScript at all.

Also, the evolutionary endpoint of an externref-and-imports strategy is a JavaScript-specific string interface. Without any broader WebAssembly platform concern, strings-using WebAssembly code would find itself relying on details of JavaScript's string representation, for example having the only interface be to process strings one code unit at a time instead of one codepoint at a time. This is not a good platform outcome.

Finally, though the WebAssembly platform should be able to stand alone, it should also interoperate smoothly with hosts, especially JavaScript on the web. This rules out any implementation of strings in terms of reference-typed structs and arrays: not only would such an implementation be likely slower than the host's strings, it would also be incompatible. On the web, WebAssembly and JavaScript should use the same string implementation.

On the performance side, we expect that stringref will be faster than externref+imports:

  1. Whereas an externref might need to be a tagged union, a stringref can be an unpacked pointer.
  2. WebAssembly instructions are likely faster and less of an optimization barrier than callouts to imports.
  3. Run-time helper code for WebAssembly instructions is probably implemented in C++/Rust/etc more directly, resulting in more predictable performance than e.g. an encoder implemented in JS (for web embeddings).
  4. Reading string contents, either via string.encode_wtf8-then-process-inline or via stringview_wtf16, is likely faster than calling out to JavaScript to read code units one at a time. WebAssembly-to-JavaScript calls are cheap but not free.

On the other hand, it's true that JS run-time routines can use adaptive JIT techniques to possibly inline representation-specific accessors. This is of limited use though for run-time routines with many different call sites.

On the reliability and security side, adding stringref to WebAssembly removes a significant user of extra-module access to memory. Because the WebAssembly code can pick apart the string itself, that's one fewer reason for the WebAssembly module to have to expose its memory.

How does this relate to the component model and interface types?

The component model is a vision of how to compose systems out of shared-nothing parts implemented in WebAssembly. The boundaries between these components are mediated by interface types, which specify how to communicate data from one component to another in the most efficient way possible.

At one level, reference-typed strings don't appear to have anything to do with the component model. Because components are specified to not share anything, even GC-managed data, zero-copy communication of reference-typed strings between components is strictly out of scope (though this operation may be zero-copy in practice; see below).

From the perspective of the component model, reference-typed strings are rather an intra-component concern. A component may be composed internally of a number of WebAssembly modules, as well as possible host facilities such as JavaScript. The zero-copy properties provided by stringref are only assured on the inter-module, intra-component boundaries of a program.

That said, strings in the abstract are an important data type, and relate to interface types (a WebAssembly proposal based on the component model). Obviously you will want to be able to use stringref with interface types. The shared-nothing design choice of the component model then implies that stringref contents should be copied when they cross a component boundary.

Incidentally, for inter-component interfaces that deal in strings, the component model specifies that abstractly, strings are sequences of unicode scalar values. This implies that some JavaScript strings can't traverse a component boundary, because of the potential for isolated surrogates, and also implies an eager check that a stringref is a valid USV sequence, for an interface-typed call. In practice this is not a problem because the stringref contents are being copied anyway and so can be validated at the same time.

Interface types are used to specify a WebAssembly function's signature in an abstract way. This signature should then be compiled down to a concrete adapter function specialized to the data representations used by the caller and the callee. The instruction set in this proposal can be used to implement the adapter function for passing a stringref as a string; assuming that the adapter function is generated in such a way that it has access to the target memory, string.encode_utf8 can implement the copy and validation at the same time. string.new_wtf8 would be the implementation of getting a stringref from an interface-typed string value, again assuming UTF-8 encoding for these values.

Of course, because a stringref is immutable, whether it is copied or not on a component boundary or during a call to an interface-typed function is an implementation detail. Some implementations of the component model may wish to copy in all cases, for memory usage accounting reasons. Others will apply a zero-copy strategy when possible, for example when both the caller and the callee of an interface are implemented with stringref. In the zero-copy case, however, hosts have to eagerly verify that the string is a valid USV sequence. For this they would use string.is_usv_sequence.
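A zero-copy adapter might then look something like the following sketch. $check-usv is a hypothetical name, and we assume that string.is_usv_sequence pushes a nonzero i32 when the string contains no isolated surrogates.

```wat
;; Hypothetical zero-copy adapter step: pass the stringref through
;; unchanged, but trap unless it is a valid USV sequence.
(func $check-usv (param $str stringref) (result stringref)
  block $valid
    local.get $str
    string.is_usv_sequence         ;; assumed: nonzero if valid
    br_if $valid
    unreachable                    ;; trap on isolated surrogates
  end
  local.get $str)
```

As with $utf8-contents above, trapping is only one policy; an adapter could instead throw an exception.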