Character Literals

carbon-language · Aug 10, 2022 · 5b22640
1 parent 21d39ef
commit 5b22640
Showing 1 changed file with 130 additions and 0 deletions.
diff --git a/proposals/p1964.md b/proposals/p1964.md
@@ -0,0 +1,130 @@
+# Character literals
+
+<!--
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+Exceptions. See /LICENSE for license information.
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+-->
+
+[Pull request](https://github.com/carbon-language/carbon-lang/pull/1964)
+
+<!-- toc -->
+
+## Table of contents
+
+-   [Problem](#problem)
+-   [Background](#background)
+-   [Proposal](#proposal)
+-   [Details](#details)
+    -   [Encoding](#encoding)
+-   [Rationale](#rationale)
+-   [Alternatives considered](#alternatives-considered)
+
+<!-- tocstop -->
+
+## Problem
+
+This proposal specifies lexical rules for constant characters in Carbon.
+
+## Background
+
+We wish to provide a distinct lexical syntax for character literals versus
+string literals.
+
+In theory, we could reuse string literals for the purpose of character
+literals. However, a distinct lexical syntax for character literals would
+improve the readability of our code.
+
+## Proposal
+
+The idea is to create and manage a character literal the same way we would a
+string literal, but using the single quote delimiter (`'`) instead of the
+string double quote (`"`).
+
+As with string literals, each character literal would have a different type.
+
+    var w: ch8 = 'w';
+
+We will not support:
+
+-   Multi-line literals
+-   "raw" literals (using `#'x'#`)
+-   Empty character literals (`''`)
+
+## Details
+
+A character literal is a sequence enclosed in single quote delimiters (`'`),
+excluding:
+
+-   New line
+-   Single quote (`'`)
+-   Back-slash (`\`)
+-   Escape sequences
+
+The type of a character literal will depend on the contents, so that `'c'`
+and `u'b'` would have different types (as would `'b'` and `"b"`). However,
+`'\n'` and `'\u{A}'` would be of the same type, because once encoded they are
+the same Unicode entity (`%0A`, that is, U+000A).
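
As an illustration outside the proposal text, the same property can be observed in C++ today. The sketch below is a C++ analogue only: the hex escape `'\x0A'` stands in for Carbon's `'\u{A}'`, the helper `SameCharacter` is hypothetical, and an ASCII-compatible execution character set and C++14 or later are assumed.

```cpp
#include <type_traits>

// C++ analogue only: Carbon's '\u{A}' spelling is not C++ syntax, so the
// hex escape '\x0A' stands in for it here.

// Both spellings are character literals of the same type...
static_assert(std::is_same<decltype('\n'), decltype('\x0A')>::value,
              "escape spelling does not change the literal's type");

// ...while a string literal with the same contents is an array of char.
static_assert(!std::is_same<decltype('b'), decltype("b")>::value,
              "character and string literals have different types");

// Hypothetical helper: true when two character values are the same entity.
constexpr bool SameCharacter(char a, char b) { return a == b; }

// Assumes an ASCII-compatible execution character set, where '\n' is 0x0A.
static_assert(SameCharacter('\n', '\x0A'), "both spell line feed (0x0A)");
```

The checks run at compile time, so a translation unit containing this block compiles only if all three properties hold.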
+
+These different types should resemble the different C++ character literal types:
+
+Ordinary (UTF-8) character literals:
+
+-   C++ `char`: `char c = 'c';`
+-   Carbon `ch8`: `var c: ch8 = 'c';`
+
+UTF-16 character literals:
+
+-   C++ `char16_t`: `char16_t c = u'c';`
+-   Carbon `ch16`: `var c: ch16 = u'c';`
+
+UTF-32 character literals:
+
+-   C++ `char32_t`: `char32_t c = U'c';`
+-   Carbon `ch32`: `var c: ch32 = U'c';`
+
+Wide-character literals:
+
+-   C++ `wchar_t`: `wchar_t c = L'c';`
+-   Carbon `wch`: `var c: wch = L'c';`
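
The C++ half of this mapping can be checked directly. The following sketch is not part of the proposal: the enum `CarbonCharType` and the overload set `CarbonTypeOf` are hypothetical names introduced only to pair each C++ literal prefix with the Carbon type name listed above. It assumes C++14 or later.

```cpp
// Hypothetical tags for the Carbon character types named in the proposal.
enum class CarbonCharType { kCh8, kCh16, kCh32, kWch };

// One overload per C++ character type; overload resolution on the literal's
// type picks the Carbon type the proposal pairs with it.
constexpr CarbonCharType CarbonTypeOf(char) { return CarbonCharType::kCh8; }
constexpr CarbonCharType CarbonTypeOf(char16_t) { return CarbonCharType::kCh16; }
constexpr CarbonCharType CarbonTypeOf(char32_t) { return CarbonCharType::kCh32; }
constexpr CarbonCharType CarbonTypeOf(wchar_t) { return CarbonCharType::kWch; }

// Each prefix yields a literal of a distinct C++ type.
static_assert(CarbonTypeOf('c') == CarbonCharType::kCh8, "no prefix: char");
static_assert(CarbonTypeOf(u'c') == CarbonCharType::kCh16, "u prefix: char16_t");
static_assert(CarbonTypeOf(U'c') == CarbonCharType::kCh32, "U prefix: char32_t");
static_assert(CarbonTypeOf(L'c') == CarbonCharType::kWch, "L prefix: wchar_t");
```

Because the four C++ character types are distinct types, the overloads never collide, which is the property the proposed Carbon types would need to preserve for interop.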
+
+### Encoding
+
+The type of a character literal and the way it is encoded should directly
+correlate, i.e. depend on the type being initialized by the literal:
+
+-   Ordinary (UTF-8) character literals should use a single UTF-8 code unit.
+-   Wide-character literals should use a single Unicode code point.
+-   UTF-16 character literals should use a single Unicode code point.
+-   UTF-32 character literals should use a single Unicode code point.
+-   Glyph character literals should use a base character (a single Unicode
+    code point) plus a sequence of combining characters.
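
The list above distinguishes a code unit from a code point. As a sketch outside the proposal text, the C++ helpers below (hypothetical names, C++14 or later assumed) count how many code units a code point needs in UTF-8 and UTF-16, which shows which characters could satisfy the "single UTF-8 code unit" rule for ordinary literals.

```cpp
// Returns true for Unicode scalar values: code points up to U+10FFFF that are
// not surrogates (U+D800..U+DFFF).
constexpr bool IsScalarValue(char32_t cp) {
  return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Number of UTF-8 code units (bytes) needed for `cp`, or 0 if it is invalid.
constexpr int Utf8CodeUnits(char32_t cp) {
  if (!IsScalarValue(cp)) return 0;
  if (cp < 0x80) return 1;
  if (cp < 0x800) return 2;
  if (cp < 0x10000) return 3;
  return 4;
}

// Number of UTF-16 code units needed for `cp`, or 0 if it is invalid.
constexpr int Utf16CodeUnits(char32_t cp) {
  if (!IsScalarValue(cp)) return 0;
  return cp < 0x10000 ? 1 : 2;
}

// Hypothetical check for the ordinary-literal rule: one UTF-8 code unit.
constexpr bool FitsSingleUtf8CodeUnit(char32_t cp) {
  return Utf8CodeUnits(cp) == 1;
}

static_assert(FitsSingleUtf8CodeUnit(U'w'), "ASCII fits in one code unit");
static_assert(!FitsSingleUtf8CodeUnit(0x00E9), "U+00E9 needs two UTF-8 code units");
```

Under this reading, only code points below U+0080 qualify as ordinary character literals, while every scalar value is a single code point for the other literal kinds.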
+
+This is experimental, and should be revisited if we find motivation for
+expressing character literals in other encodings.
+
+## Rationale
+
+This proposal supports the goals of making Carbon code
+[easy to read, understand, and write](/docs/project/goals.md#code-that-is-easy-to-read-understand-and-write)
+and of
+[interoperability with and migration from existing C++ code](/docs/project/goals.md#interoperability-with-and-migration-from-existing-c-code)
+by ensuring that every kind of character literal that exists in C++ can be
+represented in a Carbon character literal. This is done in a way that is natural
+to adopt, understand, and read, by having explicit character types mapped to
+the C++ character types and the correct associated encoding.
+
+## Alternatives considered
+
+-   No explicit wide-character literal type, as wide characters are primarily
+    used on Windows systems, where they are encoded as UTF-16, whereas other
+    systems use UTF-32. For C++ interop, we would then need to import
+    `wchar_t` as the correct Carbon type based solely on the encoding or system
+    in use, leading to further complexity.
+
+-   No distinct character literal. In principle, a character literal can be
+    represented by reusing string literals. However, in terms of readability, a
+    distinct lexical syntax for character literals versus string literals is
+    more in line with Carbon's language design goals of self-documenting code
+    that is easy to read, understand, and write, and of C++ interoperability.