Showing 1 changed file with 130 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,130 @@ | ||
# Character literals | ||
|
||
<!-- | ||
Part of the Carbon Language project, under the Apache License v2.0 with LLVM | ||
Exceptions. See /LICENSE for license information. | ||
SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception | ||
--> | ||
|
||
[Pull request](https://github.com/carbon-language/carbon-lang/pull/1964) | ||
|
||
<!-- toc --> | ||
|
||
## Table of contents | ||
|
||
- [Problem](#problem) | ||
- [Background](#background) | ||
- [Proposal](#proposal) | ||
- [Details](#details) | ||
- [Encoding](#encoding) | ||
- [Rationale](#rationale) | ||
- [Alternatives considered](#alternatives-considered) | ||
|
||
<!-- tocstop --> | ||
|
||
## Problem | ||
|
||
This proposal specifies lexical rules for constant characters in Carbon. | ||
|
||
## Background | ||
|
||
We wish to provide a distinct lexical syntax for character literals versus | ||
string literals. | ||
|
||
In theory we could just reuse string literals for the purpose of character | ||
literals. However, it could benefit the readability of our code if we had a | ||
distinct lexical syntax for character literals versus string literals. | ||
|
||
## Proposal | ||
|
||
The idea is to create and manage a character literal the same way we would a | ||
string literal, but using the single quote delimiter (`'`) instead of the string | ||
double quote (`"`). | ||
|
||
As with string literals, each character literal would have a different type. | ||
|
||
`var w: ch8 = 'w';` | ||
|
||
We will not support: | ||
|
||
- Multi-line literals | ||
- "raw" literals (using `#'x'#`) | ||
- Empty character literals (`''`) | ||
|
||
## Details | ||
|
||
A character literal is a sequence of characters enclosed in single quote | ||
delimiters (`'`), excluding: | ||
|
||
- New line | ||
- Single quote (`'`) | ||
- Back-slash (`\`) | ||
- Escape sequences | ||
|
||
The type of a character literal will depend on its contents, so that `'c'` | ||
and `u'b'` would have different types (as would `'b'` and `"b"`). However, | ||
`'\n'` and `'\u{A}'` would have the same type, because both encode the same | ||
Unicode code point (U+000A). | ||
|
||
These different types should resemble the different C++ character literal types: | ||
|
||
Ordinary (UTF-8) character literals: | ||
|
||
- C++ `char`: `char c = 'c';` | ||
- Carbon `ch8`: `var c: ch8 = 'c';` | ||
|
||
UTF-16 character literals: | ||
|
||
- C++ `char16_t`: `char16_t c = u'c';` | ||
- Carbon `ch16`: `var c: ch16 = u'c';` | ||
|
||
UTF-32 character literals: | ||
|
||
- C++ `char32_t`: `char32_t c = U'c';` | ||
- Carbon `ch32`: `var c: ch32 = U'c';` | ||
|
||
Wide-character literals: | ||
|
||
- C++ `wchar_t`: `wchar_t c = L'c';` | ||
- Carbon `wch`: `var c: wch = L'c';` | ||
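
The C++ side of this mapping can be checked at compile time. The following is a minimal sketch, assuming a C++17 compiler and an ASCII-compatible execution character set; the constant names `kNewline` and `kHexNewline` are hypothetical and only serve the illustration. It uses the C++ escape `'\x0A'` in place of Carbon's `'\u{A}'`.

```cpp
#include <type_traits>

// Each prefix selects a distinct C++ character type.
static_assert(std::is_same_v<decltype('c'), char>);
static_assert(std::is_same_v<decltype(u'c'), char16_t>);
static_assert(std::is_same_v<decltype(U'c'), char32_t>);
static_assert(std::is_same_v<decltype(L'c'), wchar_t>);

// A string literal is an array of characters, not a character.
static_assert(!std::is_same_v<decltype("b"), decltype('b')>);

// Two spellings of the line feed character: same type, same value
// (the value check assumes an ASCII-compatible execution character set).
constexpr char kNewline = '\n';
constexpr char kHexNewline = '\x0A';
static_assert(std::is_same_v<decltype('\n'), decltype('\x0A')>);
static_assert(kNewline == kHexNewline);
```

The proposed Carbon types `ch8`, `ch16`, `ch32`, and `wch` would play the roles that `char`, `char16_t`, `char32_t`, and `wchar_t` play here.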
|
||
### Encoding | ||
|
||
The type of a character literal and the way it is encoded should depend | ||
directly on the type being initialized by the literal: | ||
|
||
- Ordinary (UTF-8) character literals should use a single UTF-8 code unit. | ||
- Wide-character literals should use a single Unicode code point. | ||
- UTF-16 character literals should use a single Unicode code point. | ||
- UTF-32 character literals should use a single Unicode code point. | ||
- Glyph character literals should use a base character (a single Unicode code | ||
point) plus a sequence of combining characters. | ||
|
||
This is experimental, and should be revisited if we find motivation for | ||
expressing character literals in other encodings. | ||
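
To make the difference between a code unit and a code point concrete, here is a rough sketch in C++ (C++14 or later) that counts how many code units a given code point needs in UTF-8 and UTF-16. The helper names `IsScalarValue`, `Utf8CodeUnits`, and `Utf16CodeUnits` are hypothetical and are not part of this proposal or of any library.

```cpp
#include <cstdint>

// True for Unicode scalar values: code points up to U+10FFFF,
// excluding the surrogate range U+D800..U+DFFF.
constexpr bool IsScalarValue(std::uint32_t cp) {
  return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Number of UTF-8 code units needed for `cp`, or 0 if `cp` is not a
// Unicode scalar value.
constexpr int Utf8CodeUnits(std::uint32_t cp) {
  if (!IsScalarValue(cp)) return 0;
  if (cp < 0x80) return 1;
  if (cp < 0x800) return 2;
  if (cp < 0x10000) return 3;
  return 4;
}

// Number of UTF-16 code units needed for `cp`, or 0 if `cp` is not a
// Unicode scalar value.
constexpr int Utf16CodeUnits(std::uint32_t cp) {
  if (!IsScalarValue(cp)) return 0;
  return cp < 0x10000 ? 1 : 2;
}

// 'c' (U+0063) fits in a single UTF-8 code unit.
static_assert(Utf8CodeUnits(U'c') == 1, "ASCII is one UTF-8 code unit");
// U+00E9 needs two UTF-8 code units but one UTF-16 code unit.
static_assert(Utf8CodeUnits(0xE9) == 2, "U+00E9 is two UTF-8 code units");
static_assert(Utf16CodeUnits(0xE9) == 1, "U+00E9 is one UTF-16 code unit");
```

Under the rule for ordinary literals above, only code points for which `Utf8CodeUnits` returns 1 would be representable in a single UTF-8 code unit.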
|
||
## Rationale | ||
|
||
This proposal supports the goals of making Carbon code | ||
[easy to read, understand, and write](/docs/project/goals.md#code-that-is-easy-to-read-understand-and-write) | ||
and of | ||
[interoperability with and migration from existing C++ code](/docs/project/goals.md#interoperability-with-and-migration-from-existing-c-code) | ||
by ensuring that every kind of character literal that exists in C++ can be | ||
represented as a Carbon character literal. This is done in a way that is natural | ||
to adopt and easy to read and understand, by mapping explicit character types to | ||
the C++ character types and their associated encodings. | ||
|
||
## Alternatives considered | ||
|
||
- No explicit wide-character literal type. `wchar_t` is primarily used on | ||
Windows systems, where it is encoded as UTF-16, whereas other systems use | ||
UTF-32. In terms of C++ interop, we would need to import `wchar_t` as the | ||
correct Carbon type based on the encoding used by the target system, | ||
leading to further complexity. | ||
|
||
- No distinct character literal. In principle a character literal can be | ||
represented by reusing string literals. However, in terms of readability, a | ||
distinct lexical syntax for character literals versus string literals is | ||
more in line with Carbon's language design goals of self-documenting code | ||
that is easy to read, understand, and write, and of C++ | ||
interoperability. |