String literal expressions #1452

Merged 4 commits on Jan 27, 2024

217 changes: 212 additions & 5 deletions src/expressions/literal-expr.md
@@ -26,29 +26,227 @@ Each of the lexical [literal][literal tokens] forms described earlier can make u
5; // integer type
```

In the descriptions below, the _string representation_ of a token is the sequence of characters from the input which matched the token's production in a *Lexer* grammar snippet.

> **Note**: this string representation never includes a character `U+000D` (CR) immediately followed by `U+000A` (LF): this pair would have been previously transformed into a single `U+000A` (LF).

## Escapes

The descriptions of textual literal expressions below make use of several forms of _escape_.

Each form of escape is characterised by:
* an _escape sequence_: a sequence of characters, which always begins with `U+005C` (`\`)
* an _escaped value_: either a single character or an empty sequence of characters

In the definitions of escapes below:
* An _octal digit_ is any of the characters in the range \[`0`-`7`].
* A _hexadecimal digit_ is any of the characters in the ranges \[`0`-`9`], \[`a`-`f`], or \[`A`-`F`].

### Simple escapes

Each sequence of characters occurring in the first column of the following table is an escape sequence.

In each case, the escaped value is the character given in the corresponding entry in the second column.

| Escape sequence | Escaped value |
|-----------------|--------------------------|
| `\0` | U+0000 (NUL) |
| `\t` | U+0009 (HT) |
| `\n` | U+000A (LF) |
| `\r` | U+000D (CR) |
| `\"` | U+0022 (QUOTATION MARK) |
| `\'` | U+0027 (APOSTROPHE) |
| `\\` | U+005C (REVERSE SOLIDUS) |
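
The table above can be read as a lookup from the character following the backslash to the escaped value. The following is a minimal sketch of that mapping; `simple_escape_value` is a hypothetical helper introduced for illustration, not a function from the compiler or the standard library.

```rust
/// Hypothetical helper: maps the character that follows `\` in a simple
/// escape to the escaped value listed in the table.
fn simple_escape_value(c: char) -> Option<char> {
    match c {
        '0' => Some('\u{0000}'),  // NUL
        't' => Some('\u{0009}'),  // HT
        'n' => Some('\u{000A}'),  // LF
        'r' => Some('\u{000D}'),  // CR
        '"' => Some('\u{0022}'),  // QUOTATION MARK
        '\'' => Some('\u{0027}'), // APOSTROPHE
        '\\' => Some('\u{005C}'), // REVERSE SOLIDUS
        _ => None,                // not a simple escape
    }
}

fn main() {
    // The literal escapes in Rust source agree with the table.
    assert_eq!(simple_escape_value('n'), Some('\n'));
    assert_eq!(simple_escape_value('0'), Some('\0'));
    assert_eq!(simple_escape_value('x'), None);
}
```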

### 8-bit escapes

The escape sequence consists of `\x` followed by two hexadecimal digits.

The escaped value is the character whose [Unicode scalar value] is the result of interpreting the final two characters in the escape sequence as a hexadecimal integer, as if by [`u8::from_str_radix`] with radix 16.

> **Note**: the escaped value therefore has a [Unicode scalar value] in the range of [`u8`][numeric types].
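
As an illustration of the rule above, the sketch below decodes an 8-bit escape with `u8::from_str_radix`. The helper name `eight_bit_escape_value` is an assumption made for this example. The explicit digit check is there because `from_str_radix` also accepts a leading `+`, which the escape grammar does not.

```rust
/// Hypothetical helper: returns the escaped value of an 8-bit escape such as
/// `\x52`, or `None` if the input does not have that form.
fn eight_bit_escape_value(escape: &str) -> Option<u8> {
    let digits = escape.strip_prefix("\\x")?;
    // Exactly two hexadecimal digits are required.
    if digits.len() != 2 || !digits.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    u8::from_str_radix(digits, 16).ok()
}

fn main() {
    assert_eq!(eight_bit_escape_value(r"\x52"), Some(b'R'));
    assert_eq!(eight_bit_escape_value(r"\xA0"), Some(160));
    assert_eq!(eight_bit_escape_value(r"\x+5"), None);
}
```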

### 7-bit escapes

The escape sequence consists of `\x` followed by an octal digit then a hexadecimal digit.

The escaped value is the character whose [Unicode scalar value] is the result of interpreting the final two characters in the escape sequence as a hexadecimal integer, as if by [`u8::from_str_radix`] with radix 16.
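
A 7-bit escape differs from an 8-bit escape only in that the first digit must be octal, which caps the value at `0x7F`. The following sketch shows one way to express that; `seven_bit_escape_value` is a hypothetical name used only here.

```rust
/// Hypothetical helper: returns the escaped value of a 7-bit escape such as
/// `\x52`, or `None` if the input does not have that form.
fn seven_bit_escape_value(escape: &str) -> Option<char> {
    let digits = escape.strip_prefix("\\x")?;
    let mut chars = digits.chars();
    let (hi, lo) = (chars.next()?, chars.next()?);
    // First digit octal, second digit hexadecimal, nothing after them.
    if chars.next().is_some() || !('0'..='7').contains(&hi) || !lo.is_ascii_hexdigit() {
        return None;
    }
    u8::from_str_radix(digits, 16).ok().map(char::from)
}

fn main() {
    assert_eq!(seven_bit_escape_value(r"\x52"), Some('R'));
    assert_eq!(seven_bit_escape_value(r"\x7F"), Some('\x7F'));
    // `8` is not an octal digit, so this is not a 7-bit escape.
    assert_eq!(seven_bit_escape_value(r"\x80"), None);
}
```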

### Unicode escapes

The escape sequence consists of `\u{`, followed by a sequence of characters each of which is a hexadecimal digit or `_`, followed by `}`.

The escaped value is the character whose [Unicode scalar value] is the result of interpreting the hexadecimal digits contained in the escape sequence as a hexadecimal integer, as if by [`u32::from_str_radix`] with radix 16.

> **Note**: the permitted forms of a [CHAR_LITERAL] or [STRING_LITERAL] token ensure that there is such a character.
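
The sketch below illustrates the rule for Unicode escapes. It drops the `_` separators, parses the remaining digits with `u32::from_str_radix` (a 32-bit integer is needed because scalar values go up to `0x10FFFF`), and converts the result with `char::from_u32`. The helper name `unicode_escape_value` is an assumption, and the sketch does not reproduce the lexer's own restrictions on the token, such as a limit on the number of digits.

```rust
/// Hypothetical helper: returns the escaped value of a Unicode escape such as
/// `\u{00E6}`, or `None` if no such character exists.
fn unicode_escape_value(escape: &str) -> Option<char> {
    let inner = escape.strip_prefix("\\u{")?.strip_suffix('}')?;
    // Only the hexadecimal digits contribute to the value; `_` is skipped.
    let digits: String = inner.chars().filter(|&c| c != '_').collect();
    if digits.is_empty() || !digits.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    let scalar = u32::from_str_radix(&digits, 16).ok()?;
    // Rejects surrogates and values above U+10FFFF.
    char::from_u32(scalar)
}

fn main() {
    assert_eq!(unicode_escape_value(r"\u{00E6}"), Some('\u{00E6}'));
    assert_eq!(unicode_escape_value(r"\u{1_F600}"), Some('\u{1F600}'));
    assert_eq!(unicode_escape_value(r"\u{D800}"), None);
}
```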

### String continuation escapes

The escape sequence consists of `\` followed immediately by `U+000A` (LF), and all following whitespace characters before the next non-whitespace character.
For this purpose, the whitespace characters are `U+0009` (HT), `U+000A` (LF), `U+000D` (CR), and `U+0020` (SPACE).

The escaped value is an empty sequence of characters.

> **Note**: The effect of this form of escape is that a string continuation skips following whitespace, including additional newlines.
> Thus `a`, `b` and `c` are equal:
> ```rust
> let a = "foobar";
> let b = "foo\
> bar";
> let c = "foo\
>
> bar";
>
> assert_eq!(a, b);
> assert_eq!(b, c);
> ```
>
> Skipping additional newlines (as in example c) is potentially confusing and unexpected.
> This behavior may be adjusted in the future.
> Until a decision is made, it is recommended to avoid relying on skipping multiple newlines with line continuations.
> See [this issue](https://github.com/rust-lang/reference/pull/1042) for more information.
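
To make the skipping rule concrete, here is a minimal sketch that removes string continuation escapes from literal content. It assumes CRLF pairs have already been turned into LF, and it copies every other escape through unchanged so that an escaped backslash followed by a newline is not misread as a continuation. `strip_continuations` is a hypothetical helper, not the compiler's implementation.

```rust
/// Hypothetical helper: replaces each string continuation escape in `content`
/// with the empty sequence and leaves all other characters untouched.
fn strip_continuations(content: &str) -> String {
    let mut out = String::new();
    let mut chars = content.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.peek().copied() {
            Some('\n') => {
                // Skip the LF and all following HT, LF, CR, and SPACE characters.
                while matches!(chars.peek().copied(), Some('\t' | '\n' | '\r' | ' ')) {
                    chars.next();
                }
            }
            Some(next) => {
                // Some other escape: keep the backslash and the character after it.
                out.push(c);
                out.push(next);
                chars.next();
            }
            None => out.push(c),
        }
    }
    out
}

fn main() {
    assert_eq!(strip_continuations("foo\\\n    bar"), "foobar");
    assert_eq!(strip_continuations("foo\\\n\n    bar"), "foobar");
}
```

Processing escapes strictly left to right is what keeps `\\` followed by a newline from being treated as a continuation: the first backslash consumes the second.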

## Character literal expressions

A character literal expression consists of a single [CHAR_LITERAL] token.

The expression's type is the primitive [`char`][textual types] type.

The token must not have a suffix.

The token's _literal content_ is the sequence of characters following the first `U+0027` (`'`) and preceding the last `U+0027` (`'`) in the string representation of the token.

The literal expression's _represented character_ is derived from the literal content as follows:

* If the literal content is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:
    * [Simple escapes]
    * [7-bit escapes]
    * [Unicode escapes]

* Otherwise the represented character is the single character that makes up the literal content.

The expression's value is the [`char`][textual types] corresponding to the represented character's [Unicode scalar value].

> **Note**: the permitted forms of a [CHAR_LITERAL] token ensure that these rules always produce a single character.

Examples of character literal expressions:

```rust
'R'; // R
'\''; // '
'\x52'; // R
'\u{00E6}'; // LATIN SMALL LETTER AE (U+00E6)
```

## String literal expressions

A string literal expression consists of a single [STRING_LITERAL] or [RAW_STRING_LITERAL] token.

The expression's type is a shared reference (with `static` lifetime) to the primitive [`str`][textual types] type.
That is, the type is `&'static str`.

The token must not have a suffix.

The token's _literal content_ is the sequence of characters following the first `U+0022` (`"`) and preceding the last `U+0022` (`"`) in the string representation of the token.

The literal expression's _represented string_ is a sequence of characters derived from the literal content as follows:

* If the token is a [STRING_LITERAL], each escape sequence of any of the following forms occurring in the literal content is replaced by the escape sequence's escaped value.
    * [Simple escapes]
    * [7-bit escapes]
    * [Unicode escapes]
    * [String continuation escapes]

  These replacements take place in left-to-right order.
  For example, the token `"\\x41"` is converted to the characters `\` `x` `4` `1`.

* If the token is a [RAW_STRING_LITERAL], the represented string is identical to the literal content.

The expression's value is a reference to a statically allocated [`str`][textual types] containing the UTF-8 encoding of the represented string.

Examples of string literal expressions:

```rust
"foo"; r"foo"; // foo
"\"foo\""; r#""foo""#; // "foo"

"foo #\"# bar";
r##"foo #"# bar"##; // foo #"# bar

"\x52"; "R"; r"R"; // R
"\\x52"; r"\x52"; // \x52
```
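
The following sketch checks a few of the statements above as runnable assertions: the `&'static str` type, the left-to-right replacement of escapes, and the UTF-8 encoding of the represented string. The function name `escaped_backslash_chars` is introduced only for this example.

```rust
/// Returns the characters of the represented string of the token `"\\x41"`.
fn escaped_backslash_chars() -> Vec<char> {
    "\\x41".chars().collect()
}

fn main() {
    // The type of a string literal expression is `&'static str`.
    let s: &'static str = "foo";
    assert_eq!(s, r"foo");

    // `\\` is replaced first, so the `x41` after it is not read as an escape.
    assert_eq!(escaped_backslash_chars(), vec!['\\', 'x', '4', '1']);

    // A 7-bit escape is replaced; a raw string keeps its content verbatim.
    assert_eq!("\x52", "R");
    assert_eq!(r"\x52".len(), 4);

    // The value holds the UTF-8 encoding of the represented string.
    assert_eq!("\u{00E6}".as_bytes(), &[0xC3u8, 0xA6][..]);
}
```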

## Byte literal expressions

A byte literal expression consists of a single [BYTE_LITERAL] token.

The expression's type is the primitive [`u8`][numeric types] type.

The token must not have a suffix.

The token's _literal content_ is the sequence of characters following the first `U+0027` (`'`) and preceding the last `U+0027` (`'`) in the string representation of the token.

The literal expression's _represented character_ is derived from the literal content as follows:

* If the literal content is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:
    * [Simple escapes]
    * [8-bit escapes]

* Otherwise the represented character is the single character that makes up the literal content.

The expression's value is the represented character's [Unicode scalar value].

> **Note**: the permitted forms of a [BYTE_LITERAL] token ensure that these rules always produce a single character, whose Unicode scalar value is in the range of [`u8`][numeric types].

Examples of byte literal expressions:

```rust
b'R'; // 82
b'\''; // 39
b'\x52'; // 82
b'\xA0'; // 160
```

## Byte string literal expressions

A byte string literal expression consists of a single [BYTE_STRING_LITERAL] or [RAW_BYTE_STRING_LITERAL] token.

The expression's type is a shared reference (with `static` lifetime) to an array whose element type is [`u8`][numeric types].
That is, the type is `&'static [u8; N]`, where `N` is the number of bytes in the represented string described below.

The token must not have a suffix.

The token's _literal content_ is the sequence of characters following the first `U+0022` (`"`) and preceding the last `U+0022` (`"`) in the string representation of the token.

The literal expression's _represented string_ is a sequence of characters derived from the literal content as follows:

* If the token is a [BYTE_STRING_LITERAL], each escape sequence of any of the following forms occurring in the literal content is replaced by the escape sequence's escaped value.
    * [Simple escapes]
    * [8-bit escapes]
    * [String continuation escapes]

  These replacements take place in left-to-right order.
  For example, the token `b"\\x41"` is converted to the characters `\` `x` `4` `1`.

* If the token is a [RAW_BYTE_STRING_LITERAL], the represented string is identical to the literal content.

The expression's value is a reference to a statically allocated array containing the [Unicode scalar values] of the characters in the represented string, in the same order.

> **Note**: the permitted forms of [BYTE_STRING_LITERAL] and [RAW_BYTE_STRING_LITERAL] tokens ensure that these rules always produce array element values in the range of [`u8`][numeric types].

Examples of byte string literal expressions:

```rust
b"foo"; br"foo"; // foo
b"\"foo\""; br#""foo""#; // "foo"

b"foo #\"# bar";
br##"foo #"# bar"##; // foo #"# bar

b"\x52"; b"R"; br"R"; // R
b"\\x52"; br"\x52"; // \x52
```
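
As a small check of the type and value rules above, the sketch below binds byte string literals to explicitly typed array references and compares them with the expected element values. The function name `byte_string_value` is introduced only for this example.

```rust
/// Returns the value of the token `b"\\x41"`: four bytes, not one.
fn byte_string_value() -> &'static [u8; 4] {
    b"\\x41"
}

fn main() {
    // The type is `&'static [u8; N]`, with N the number of bytes represented.
    let foo: &'static [u8; 3] = b"foo";
    assert_eq!(foo, &[0x66u8, 0x6F, 0x6F]);

    // `\\` is replaced first, leaving the characters `\`, `x`, `4`, `1`.
    assert_eq!(byte_string_value(), &[0x5Cu8, 0x78, 0x34, 0x31]);

    // An 8-bit escape may produce a value above 0x7F.
    assert_eq!(b"\xA0", &[160u8]);

    // A raw byte string keeps its literal content verbatim.
    let raw: &'static [u8; 4] = br"\x52";
    assert_eq!(raw, b"\\x52");
}
```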

## C string literal expressions

@@ -167,6 +365,11 @@ The expression's type is the primitive [boolean type], and its value is:
* false if the keyword is `false`


[Simple escapes]: #simple-escapes
[8-bit escapes]: #8-bit-escapes
[7-bit escapes]: #7-bit-escapes
[Unicode escapes]: #unicode-escapes
[String continuation escapes]: #string-continuation-escapes
[boolean type]: ../types/boolean.md
[constant expression]: ../const_eval.md#constant-expressions
[floating-point types]: ../types/numeric.md#floating-point-types
@@ -177,12 +380,16 @@ The expression's type is the primitive [boolean type], and its value is:
[suffix]: ../tokens.md#suffixes
[negation operator]: operator-expr.md#negation-operators
[overflow]: operator-expr.md#overflow
[textual types]: ../types/textual.md
[Unicode scalar value]: http://www.unicode.org/glossary/#unicode_scalar_value
[Unicode scalar values]: http://www.unicode.org/glossary/#unicode_scalar_value
[`f32::from_str`]: ../../core/primitive.f32.md#method.from_str
[`f32::INFINITY`]: ../../core/primitive.f32.md#associatedconstant.INFINITY
[`f32::NAN`]: ../../core/primitive.f32.md#associatedconstant.NAN
[`f64::from_str`]: ../../core/primitive.f64.md#method.from_str
[`f64::INFINITY`]: ../../core/primitive.f64.md#associatedconstant.INFINITY
[`f64::NAN`]: ../../core/primitive.f64.md#associatedconstant.NAN
[`u8::from_str_radix`]: ../../core/primitive.u8.md#method.from_str_radix
[`u32::from_str_radix`]: ../../core/primitive.u32.md#method.from_str_radix
[`u128::from_str_radix`]: ../../core/primitive.u128.md#method.from_str_radix
[CHAR_LITERAL]: ../tokens.md#character-literals
[STRING_LITERAL]: ../tokens.md#string-literals
36 changes: 11 additions & 25 deletions src/tokens.md
@@ -156,30 +156,13 @@ A _string literal_ is a sequence of any Unicode characters enclosed within two
`U+0022` (double-quote) characters, with the exception of `U+0022` itself,
which must be _escaped_ by a preceding `U+005C` character (`\`).

Line-breaks are allowed in string literals.
A line-break is either a newline (`U+000A`) or a pair of carriage return and newline (`U+000D`, `U+000A`).
Both byte sequences are translated to `U+000A`.
Comment on lines +159 to +161
Contributor

I think this is fine to keep. Just a note that my preference longer-term would be to have some global explanation of the CRLF→LF conversion in a single place so it doesn't need to be repeated in multiple places. For example, this is relevant for all string types (raw strings, byte strings, c-strings, etc.), but they don't currently mention this in the same way.

If there is a section somewhere in the lexing chapter that talks about this translation, I think these sentences could be removed.

Contributor Author

I'm thinking of looking at documenting CRLF conversion properly next (rather than trying to fight all of #626 in one go).


When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
See [String continuation escapes] for details.

#### Character escapes

@@ -274,7 +257,7 @@ preceded by the characters `U+0062` (`b`) and `U+0022` (double-quote), and
followed by the character `U+0022`. If the character `U+0022` is present within
the literal, it must be _escaped_ by a preceding `U+005C` (`\`) character.
Alternatively, a byte string literal can be a _raw byte string literal_, defined
below.

Some additional _escapes_ are available in either byte or non-raw byte string
literals. An escape starts with a `U+005C` (`\`) and continues with one of the
@@ -479,7 +462,7 @@ An _integer literal_ has one of four forms:

Like any literal, an integer literal may be followed (immediately, without any spaces) by a suffix as described above.
The suffix may not begin with `e` or `E`, as that would be interpreted as the exponent of a floating-point literal.
See [Integer literal expressions] for the effect of these suffixes.

Examples of integer literals which are accepted as literal expressions:

@@ -576,7 +559,7 @@ A _floating-point literal_ has one of two forms:
Like integer literals, a floating-point literal may be followed by a
suffix, so long as the pre-suffix part does not end with `U+002E` (`.`).
The suffix may not begin with `e` or `E` if the literal does not include an exponent.
See [Floating-point literal expressions] for the effect of these suffixes.

Examples of floating-point literals which are accepted as literal expressions:

@@ -784,12 +767,14 @@ Similarly the `r`, `b`, `br`, `c`, and `cr` prefixes used in raw string literals
[extern crates]: items/extern-crates.md
[extern]: items/external-blocks.md
[field]: expressions/field-expr.md
[Floating-point literal expressions]: expressions/literal-expr.md#floating-point-literal-expressions
[floating-point types]: types/numeric.md#floating-point-types
[function pointer type]: types/function-pointer.md
[functions]: items/functions.md
[generics]: items/generics.md
[identifier]: identifiers.md
[if let]: expressions/if-expr.md#if-let-expressions
[Integer literal expressions]: expressions/literal-expr.md#integer-literal-expressions
[keywords]: keywords.md
[lazy-bool]: expressions/operator-expr.md#lazy-boolean-operators
[literal expressions]: expressions/literal-expr.md
@@ -808,6 +793,7 @@ Similarly the `r`, `b`, `br`, `c`, and `cr` prefixes used in raw string literals
[raw pointers]: types/pointer.md#raw-pointers-const-and-mut
[references]: types/pointer.md
[sized]: trait-bounds.md#sized
[String continuation escapes]: expressions/literal-expr.md#string-continuation-escapes
[struct expressions]: expressions/struct-expr.md
[trait bounds]: trait-bounds.md
[tuple index]: expressions/tuple-expr.md#tuple-indexing-expressions