Document C string literal tokens.
jmillikin committed Nov 1, 2023
1 parent 8947db0 commit 5d19507
Showing 2 changed files with 90 additions and 8 deletions.
10 changes: 10 additions & 0 deletions src/expressions/literal-expr.md
@@ -8,6 +8,8 @@
>    | [BYTE_LITERAL]\
>    | [BYTE_STRING_LITERAL]\
>    | [RAW_BYTE_STRING_LITERAL]\
>    | [C_STRING_LITERAL]\
>    | [RAW_C_STRING_LITERAL]\
>    | [INTEGER_LITERAL]\
>    | [FLOAT_LITERAL]\
>    | `true` | `false`
@@ -48,6 +50,12 @@ A string literal expression consists of a single [BYTE_STRING_LITERAL] or [RAW_B

> **Note**: This section is incomplete.
## C string literal expressions

A C string literal expression consists of a single [C_STRING_LITERAL] or [RAW_C_STRING_LITERAL] token.

> **Note**: This section is incomplete.
## Integer literal expressions

An integer literal expression consists of a single [INTEGER_LITERAL] token.
@@ -182,5 +190,7 @@ The expression's type is the primitive [boolean type], and its value is:
[BYTE_LITERAL]: ../tokens.md#byte-literals
[BYTE_STRING_LITERAL]: ../tokens.md#byte-string-literals
[RAW_BYTE_STRING_LITERAL]: ../tokens.md#raw-byte-string-literals
[C_STRING_LITERAL]: ../tokens.md#c-string-literals
[RAW_C_STRING_LITERAL]: ../tokens.md#raw-c-string-literals
[INTEGER_LITERAL]: ../tokens.md#integer-literals
[FLOAT_LITERAL]: ../tokens.md#floating-point-literals
88 changes: 80 additions & 8 deletions src/tokens.md
@@ -24,14 +24,16 @@ Literals are tokens used in [literal expressions].

#### Characters and strings

| | Example | `#` sets\* | Characters | Escapes |
|----------------------------------------------|-----------------|------------|-------------|---------------------|
| [Character](#character-literals) | `'H'` | 0 | All Unicode | [Quote](#quote-escapes) & [ASCII](#ascii-escapes) & [Unicode](#unicode-escapes) |
| [String](#string-literals) | `"hello"` | 0 | All Unicode | [Quote](#quote-escapes) & [ASCII](#ascii-escapes) & [Unicode](#unicode-escapes) |
| [Raw string](#raw-string-literals) | `r#"hello"#` | <256 | All Unicode | `N/A` |
| [Byte](#byte-literals) | `b'H'` | 0 | All ASCII | [Quote](#quote-escapes) & [Byte](#byte-escapes) |
| [Byte string](#byte-string-literals) | `b"hello"` | 0 | All ASCII | [Quote](#quote-escapes) & [Byte](#byte-escapes) |
| [Raw byte string](#raw-byte-string-literals) | `br#"hello"#` | <256 | All ASCII | `N/A` |
| | Example | `#` sets\* | Characters | Escapes |
|----------------------------------------------|-----------------|------------|-----------------|---------------------|
| [Character](#character-literals) | `'H'` | 0 | All Unicode | [Quote](#quote-escapes) & [ASCII](#ascii-escapes) & [Unicode](#unicode-escapes) |
| [String](#string-literals) | `"hello"` | 0 | All Unicode | [Quote](#quote-escapes) & [ASCII](#ascii-escapes) & [Unicode](#unicode-escapes) |
| [Raw string](#raw-string-literals) | `r#"hello"#` | <256 | All Unicode | `N/A` |
| [Byte](#byte-literals) | `b'H'` | 0 | All ASCII | [Quote](#quote-escapes) & [Byte](#byte-escapes) |
| [Byte string](#byte-string-literals) | `b"hello"` | 0 | All ASCII | [Quote](#quote-escapes) & [Byte](#byte-escapes) |
| [Raw byte string](#raw-byte-string-literals) | `br#"hello"#` | <256 | All ASCII | `N/A` |
| [C string](#c-string-literals) | `c"hello"` | 0 | non-`NUL` ASCII | [Quote](#quote-escapes) & [Byte](#byte-escapes) |
| [Raw C string](#raw-c-string-literals) | `cr#"hello"#` | <256 | non-`NUL` ASCII | `N/A` |

\* The number of `#`s on each side of the same literal must be equivalent.

@@ -328,6 +330,76 @@ b"\x52"; b"R"; br"R"; // R
b"\\x52"; br"\x52"; // \x52
```

### C string and raw C string literals

#### C string literals

> **<sup>Lexer</sup>**\
> C_STRING_LITERAL :\
> &nbsp;&nbsp; `c"` ( ASCII_FOR_C_STRING | BYTE_ESCAPE | STRING_CONTINUE )<sup>\*</sup> `"` SUFFIX<sup>?</sup>
>
> ASCII_FOR_C_STRING :\
> &nbsp;&nbsp; _any non-NUL ASCII (i.e. 0x01 to 0x7F), except_ `"`, `\` _and IsolatedCR_

A non-raw _C string literal_ is a sequence of ASCII characters and _escapes_,
preceded by the characters `U+0063` (`c`) and `U+0022` (double-quote), and
followed by the character `U+0022`. If the character `U+0022` is present within
the literal, it must be _escaped_ by a preceding `U+005C` (`\`) character.
Alternatively, a C string literal can be a _raw C string literal_, defined
below. The type of a C string literal is `&core::ffi::CStr`.
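
As a minimal sketch of the typing rule above, the following assumes a toolchain that accepts C string literals (Rust 1.77 or later, 2021 edition). The helper `bytes_with_nul` is a hypothetical name used only for illustration; `CStr::to_bytes` and `CStr::to_bytes_with_nul` are the standard accessors.

```rust
use core::ffi::CStr;

/// Hypothetical helper: the bytes of a C string, including the trailing NUL.
fn bytes_with_nul(s: &CStr) -> &[u8] {
    s.to_bytes_with_nul()
}

fn main() {
    // The literal already has type `&CStr`; no conversion is needed.
    let greeting: &CStr = c"hello";

    // The contents are the five ASCII bytes written in the source.
    assert_eq!(greeting.to_bytes(), b"hello");

    // A NUL terminator is stored after the contents, which is why the
    // literal itself may not contain a NUL.
    assert_eq!(bytes_with_nul(greeting), b"hello\0");
}
```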

Some additional _escapes_ are available in non-raw C string literals. An escape
starts with a `U+005C` (`\`) and continues with one of the following forms:

* A _byte escape_ starts with `U+0078` (`x`) and is followed by exactly
two _hex digits_. It denotes the byte equal to the provided hex value. The
byte escape sequence `\x00` is forbidden, as C strings may not contain `NUL`.
* A _whitespace escape_ is one of the characters `U+006E` (`n`), `U+0072`
(`r`), or `U+0074` (`t`), denoting the byte values `0x0A` (ASCII LF),
`0x0D` (ASCII CR) or `0x09` (ASCII HT) respectively.
* The _backslash escape_ is the character `U+005C` (`\`) which must be
escaped in order to denote its ASCII encoding `0x5C`.
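
The escape forms listed above can be checked against the bytes they denote. This is a sketch only, assuming a toolchain that accepts C string literals (Rust 1.77 or later, 2021 edition); `contents` is a hypothetical helper introduced for the example.

```rust
use core::ffi::CStr;

/// Hypothetical helper: the contents of a C string without the NUL terminator.
fn contents(s: &CStr) -> &[u8] {
    s.to_bytes()
}

fn main() {
    // Byte escape: `\x52` denotes the single byte 0x52, which is ASCII `R`.
    assert_eq!(contents(c"\x52"), [0x52]);
    assert_eq!(contents(c"\x52"), b"R");

    // Whitespace escapes: `\n`, `\r`, `\t` denote LF, CR and HT.
    assert_eq!(contents(c"\n\r\t"), [0x0A, 0x0D, 0x09]);

    // Backslash escape: `\\` denotes one byte, 0x5C.
    assert_eq!(contents(c"\\"), [0x5C]);

    // Quote escape: `\"` denotes one byte, 0x22.
    assert_eq!(contents(c"\""), [0x22]);

    // `c"\x00"` is not shown because it is rejected at compile time.
}
```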

#### Raw C string literals

> **<sup>Lexer</sup>**\
> RAW_C_STRING_LITERAL :\
> &nbsp;&nbsp; `cr` RAW_C_STRING_CONTENT SUFFIX<sup>?</sup>
>
> RAW_C_STRING_CONTENT :\
> &nbsp;&nbsp; &nbsp;&nbsp; `"` ASCII_EXCEPT_NUL<sup>* (non-greedy)</sup> `"`\
> &nbsp;&nbsp; | `#` RAW_C_STRING_CONTENT `#`
>
> ASCII_EXCEPT_NUL :\
> &nbsp;&nbsp; _any non-NUL ASCII (i.e. 0x01 to 0x7F)_

Raw C string literals do not process any escapes. They start with the
character `U+0063` (`c`), followed by `U+0072` (`r`), followed by fewer than 256
of the character `U+0023` (`#`), and a `U+0022` (double-quote) character. The
_raw string body_ can contain any sequence of non-`NUL` ASCII characters and is terminated
only by another `U+0022` (double-quote) character, followed by the same number of
`U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote)
character. A raw C string literal cannot contain any non-ASCII byte.

All characters contained in the raw string body represent their ASCII encoding;
the characters `U+0022` (double-quote) (except when followed by at least as
many `U+0023` (`#`) characters as were used to start the raw string literal) or
`U+005C` (`\`) do not have any special meaning.
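
To illustrate that raw C string literals skip escape processing, the sketch below pairs each raw literal with a non-raw literal that should denote the same bytes. It assumes a toolchain that accepts C string literals (Rust 1.77 or later, 2021 edition); `equivalent_pairs` is a hypothetical helper, not part of any library.

```rust
use core::ffi::CStr;

/// Hypothetical helper: pairs of (raw literal, equivalent non-raw literal).
fn equivalent_pairs() -> [(&'static CStr, &'static CStr); 3] {
    [
        // The backslash is an ordinary byte in the raw form.
        (cr"\x52", c"\\x52"),
        // One `#` on each side lets the body contain bare double quotes.
        (cr#""foo""#, c"\"foo\""),
        // With two `#`s, a `"#` inside the body does not end the literal.
        (cr##"foo #"# bar"##, c"foo #\"# bar"),
    ]
}

fn main() {
    for (raw, escaped) in equivalent_pairs() {
        assert_eq!(raw, escaped);
    }

    // `cr"\x52"` holds four bytes: backslash, `x`, `5`, `2`.
    assert_eq!(cr"\x52".to_bytes().len(), 4);
}
```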

Examples for C string literals:

```rust
c"foo"; cr"foo"; // foo
c"\"foo\""; cr#""foo""#; // "foo"

c"foo #\"# bar";
cr##"foo #"# bar"##; // foo #"# bar

c"\x52"; c"R"; cr"R"; // R
c"\\x52"; cr"\x52"; // \x52
```

### Number literals

A _number literal_ is either an _integer literal_ or a _floating-point