Carbon: Lexical conventions #17
docs/proposals/p0016.md (Outdated)
This is not a comment.

If the character after the comment introducer is an exclamation mark, the …
In other languages, the choice of `///` for documentation comments is vastly more popular than `//!`.

Regarding `/*!` vs. `/**`, I'm not sure there is a specific majority, but `/**` is just ever so slightly easier to type than `/*!`.
`/**` is frequently used to start a `/*********`-style banner comment. As such, I don't think we should use it for documentation -- we could treat `/` followed by exactly 2 `*`s as a special case, but that feels ugly to me. Obviously there's no concern here if we choose not to have block comments.

I prefer `//!` over `///`, because `///` is easy to mistake for `//` and vice versa. I've seen a lot of code use one when they meant the other, and the mistake went unnoticed by the reviewer. I think this is especially important as the difference would affect program validity, and could even theoretically change how we parse things (though I'd want our grammar to avoid that).

That said, I do find `///` more aesthetically appealing for long documentation comments than `//!`. We could also pick something new -- we don't need to follow Doxygen convention here.
> we could treat `/` followed by exactly 2 `*`s as a special case, but that feels ugly to me.

Specifically, `/**` followed by a newline would be my suggestion for a doc comment.

> I've seen a lot of code use one when they meant the other, and the mistake went unnoticed by the reviewer.

Me too, but I don't think a different comment marker would help. I think this type of issue is best addressed by a linter (warn on a regular comment that appears in a doc comment position, and if the user really meant a plain comment, suggest adding an extra newline after it, or a "nondoc" marker within the comment).
I'd prefer `///` over `//!` for ergonomics. Odd comment, I know, but `//!` is an odd series of characters to type in a row: `//<shift>` (with the right pinky) on both QWERTY and Dvorak, and people would type it repeatedly. `///` is better on this axis because you're just tapping the same key repeatedly.

Although, as a different choice, could we dictate the opposite? I.e., any regular line comment (`//`) in a place where a doc comment is allowed is a doc comment. Require something like `//!` or `// end-doc` to end the doc comments, if people want to write comments there that should be treated as whitespace (which I'd expect to be relatively rare).
I've listed `///` versus `//!` as an open question. I've not yet formed an opinion on @jonmeow's alternative choice. It would be challenging from a parsing perspective to have `//` mean both doc and non-doc comments (Clang can do that, but it's a pain), so if we go that way, I'd prefer we find another introducer for non-doc comments.
Maybe `//#` if you want a clear difference? `#` is a traditional comment char, and feels slightly easier to type... (the current `#` thread made me think of this)
docs/proposals/p0016.md (Outdated)
A real number can be followed by an `e`, an optional `+` or `-` (defaulting to
`+`), and a decimal integer *N*; the effect is to multiply the given value by
10<sup>*N*</sup>.
We'd also need hex floats at some point.
I'd welcome that!
Do we like the C and C++ notation for hex floats, or would we prefer something else?
It's reasonably widespread across other languages -- are there languages that have hex floats but use a different syntax? So I wouldn't get too creative here. (Consider also formatted I/O and interop: it would be a bit odd to output one format but have source code use a different one.)
I don't have an opinion about hex float specifics, but given that they are a niche, expert feature (they will be used rarely, by people who know what they are doing and are likely familiar with the concept from other languages), and given the likelihood of hex floats being copied from code in other languages into Carbon code, I think we should have really good reasons to deviate from the de facto standard here.
more additional decimal digits. | ||
|
||
Integers in other bases are written as a `0` followed by a base specifier | ||
character, followed by a sequence of digits in the corresponding base. The |
Any thoughts on digit separators?
Two thoughts:

- If we allow them, what symbol do we use? `,` seems problematic (perhaps lexable, but undesirable). I personally prefer C++'s `'` over the `_` used by other languages.
- Where do we permit them to appear? I would be inclined to require them to appear at "natural" positions within the number -- evenly spaced, groups of 3 for decimal and groups of 4 or 8 for binary and hexadecimal -- but that's a very Anglocentric perspective. (I think that could actually be OK, though: our keywords, use of `.` rather than `,` as a decimal point, use of `"` as quotation marks, and so on are also very Anglocentric.)
re: placement, I've personally used digit separators in binary numbers in Swift to mark bitfield boundaries: https://github.com/apple/swift/blob/39397860a57bf64c45d49be08ba401b94d07be5e/stdlib/public/core/UTF16.swift#L339
Interesting, and doubled digit separators at that. (That would not be valid in C++, where digit separators are required to actually separate digits.) In most cases, I'd think it's a language design issue if literals are being written for bit-fields, but UTF-8 encoding (and things of its ilk) are perhaps a different case.

On balance, I think the benefit of requiring the digit separators to be properly placed (i.e., rejecting mistakes like `0xffff'00000'0000`) is worth disallowing the more nuanced cases such as your example. But I don't feel strongly about it.
I think requiring digit separators to be regularly placed in decimal and hexadecimal numbers is the right choice. But binary numbers are different, because serialization/deserialization code that has to deal with bitfields is common.
The C++ situation has always struck me as odd. We have prefix-based notations built in to the language, and we have suffix-based notations for libraries. So the language gives me `19`, `0xFA`, `026`, and `0b1010110`, but if I want to define a trinary literal it's `20021_t`. That inconsistency has always bothered me. The document mentions above that we could potentially allow something like user-defined literals, which would bring back this inconsistency.
My initial feeling is that I'm weakly against enforcing a particular separation style in integers. It's fairly easy to change the rule in either direction, though, since it's easy to write a migration tool.
docs/proposals/p0016.md (Outdated)
#### Characters

A *character literal* is formed of any single character other than a backslash …
Need a more precise definition of what a "character" is -- a Unicode scalar, an extended grapheme cluster, etc.
We do, but I'm not sure this is the right place to consider that. I imagine we'll have a later proposal on character and string types. Perhaps the best thing to do in the context of this document is to model character literals the same as simple string literals, and let the proposal that deals with character and string types worry about what happens if the value is unrepresentable as a "single character" for whatever data model it's using.
I don't think it's possible to say what a character literal is without saying what a single character is. We can't defer to the Unicode standard on that question, because, if I understand correctly, "character" just isn't a concept in Unicode. There are code points and extended grapheme clusters and such. Thinking of code points as characters doesn't quite work: I'd probably want to think of both 'õ' and 'x̤' as single characters, but one of them is a single code point and the other is two code points.
Unicode has a notion of "character" -- those are the things to which code points are assigned. (See https://www.unicode.org/versions/Unicode13.0.0/ch01.pdf, which uses the word "character" extensively.) That's generally what I mean whenever I say "character" in this document (except perhaps -- ironically -- for the term "character literal", which is certainly underspecified).
I think that this is the wrong document to be specifying how character and string types work and are represented, and the semantics of literals of those types. I only want to cover the syntax here. From that perspective, I think the answer is relatively easy: if we allow character literals at all (which I now list as an open question), then they have the same morphology as simple string literals, other than having different delimiters. It's then up to the semantic interpretation of them to determine whether a character literal is valid.
I could imagine we might want multiple different kinds of character type, representing ASCII characters, Unicode characters, Unicode grapheme clusters, and a number of other things, and each of them might want to interpret and validate the contents of a character literal in a different way. So I think it would not be appropriate to specify anything here other than a lexical convention. Maybe we will choose to not have character literals at all. That'd be nice; we could free up `'` for other purposes, as @gribozavr pointed out elsewhere. But for now I'd like to reserve it for character literals -- to give them "first dibs", as it were.
> Unicode has a notion of "character" -- those are the things to which code points are assigned.

I just scanned it, and I think chapter 1 uses the term "character" informally -- it is an introductory chapter, after all. I could not find a definition of it. You can find a definition of, for example, an "encoded character" -- which is a term of art, as indicated by the italics in the text.

I think it is fair to punt the specifics to future proposals, as long as we're explicit about not making any particular commitment in this proposal.
docs/proposals/p0016.md (Outdated)
A *character literal* is formed of any single character other than a backslash
(`\\`) or single quotation mark, enclosed in a pair of single quotation marks
(`'`), or an escape sequence enclosed in a pair of single quotation marks.
Do we need a separate character literal syntax? In Swift, for example, string literals that use the double quotation mark are polymorphic: when a string literal contains only one Unicode character, the string literal is also a character literal. (This is a test for `Character`, which is an extended grapheme cluster; `UnicodeScalar` is a single Unicode scalar and works similarly.)

Not using a single quote for character literals frees up one special character so that we could use it for something else.
Hm. I'm not a huge fan of this kind of punning between different types: a character and a string are ontologically different things, and I think the programmer should be expressing which one they want. (Python's full-scale unification of characters and strings is not a good thing. The Swift approach seems a lot better, but I'm still concerned.) I'm not completely opposed, mind you -- freeing up `'` for other uses might be nice -- but I'm certainly not sold on this idea.
Is it that different from integer literals getting their own unique type that then converts to sized integer types?
Personally, I think the Swift approach is pretty reasonable. An ASCII string is an array of characters, but a UTF-8 string is not -- UTF-8 characters are variable-length, so to handle them efficiently, you need to treat them like substrings rather than like elements of an array. (And that's doubly true if you're talking about extended grapheme clusters rather than individual code points.) Efficient code that iterates through the characters of a Unicode string looks very similar to code that iterates through the words of an ASCII string.
As a result, Unicode characters feel like a special case of strings to me, rather than a fundamentally different data type.
I've added this as an open question. I think this depends on the design of things we will consider later (string and character types), but for now I'd like to at least reserve `'` for character literals, so we don't use it for anything else until/unless we decide we don't want a separate character literal syntax.
docs/proposals/p0016.md (Outdated)
… an identifier, there should never be an optional keyword preceding the
identifier, and nor should the identifier be optional if it can be followed by …
> there should never be an optional keyword preceding the identifier,

Why not? As long as that optional keyword is included in the language version beyond the "compatibility horizon", it should be feasible to recognize it.
I would like to use the same consistent set of rules for all keywords throughout the entire language, rather than giving different behavior to recently introduced keywords compared to older keywords. I think we should aim for our intended evolutionary path to not introduce scar tissue -- places where you can see that something used to be different and changed. And I think that means that all keywords should behave as if they're new keywords.
I suppose another perspective on this is: while we could have such an optional keyword now, we could never add another one as a point change. We would need to first add it, then wait for the compatibility horizon to expire (which could potentially be years), and then start using it. That would put a lot of pressure on us to reuse an existing keyword, which we explicitly do not want. If we disallow such changes forever, then the pressure to reuse keywords is gone, because reusing a keyword doesn't help solve the problem.
Characters with the Unicode property `White_Space` but not
`Pattern_White_Space` are invalid outside comments and literals. Code …
For those of us who aren't as familiar with these Unicode properties, which of the characters above are we talking about here (as of Unicode version 13)?
They're these ones:
00A0 ; White_Space # Zs NO-BREAK SPACE
1680 ; White_Space # Zs OGHAM SPACE MARK
2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE
202F ; White_Space # Zs NARROW NO-BREAK SPACE
205F ; White_Space # Zs MEDIUM MATHEMATICAL SPACE
3000 ; White_Space # Zs IDEOGRAPHIC SPACE
For the U+2000..U+200A range, see https://www.compart.com/en/unicode/block/U+2000
I, too, had a hard time understanding the significance of this rule. Is it possible to explain it in a way that makes the underlying intent clearer?
docs/proposals/p0016.md (Outdated)
A *comment* in Carbon is either:

* A *line comment*, beginning with `//` and running to the end of the line, or
* A *block comment*, beginning with `/*` and running to the matching `*/`.
I wonder if we might get away with just line comments? Block comments open up questions about nesting, and can allow some underhanded code -- where someone tries to fool you that code is safe by tricking you into thinking something is commented out, but there is actually something after the comment start on the same line that is live code. This attack is specifically discussed in https://www.ida.org/-/media/feature/publications/i/in/initial-analysis-of-underhanded-source-code/d-13166.ashx
I'm going to leave this thread unresolved for a bit in the hope of attracting more opinions, but I'm tentatively in favor of adopting this suggestion.
From @gribozavr:

> Regardless of specific syntax, I'm not sure that we need multiline comments that are not code comments.
Josh, the comment-related attack I see in the paper is using `//` in Python (which uses `#` for comments; `//` is an operator there). I don't see block comments or `/*` mentioned for attacks. Am I missing the attack you're referring to?

I think block comments have a lot of utility in commenting out large sections of code, e.g. while debugging. We could make them more readable by allowing a block comment only as the sole thing on a line. That kind of approach would probably still beg an alternative for:

`Bang(foo, /*bar=*/baz);`

(Also, good syntax highlighting, based on easy parsing, should help mitigate attacks.)
For now I've removed `/* ... */` comments. Will bring them back if we establish consensus that we want them.

I'm optimistic that we can address the need for `Bang(foo, /*bar=*/baz)` comments a different way (e.g., with designators for parameters: `Bang(foo, .bar=baz)`).
@jonmeow Comment-related attacks are a whole category, described as:

> Many samples worked by confusing humans about comments (e.g., misleading humans about where the comments started or having active code embedded in a comment).

He elsewhere talks about "active code is hiding within a comment" or "non-comments hidden in comments", which is more descriptive of what you can do with `/* ... */` than that specific Python exploit.

It is much easier to confuse humans about where a comment ends with `/* ... */` than with something that terminates at a newline (in fact, one mitigation in the doc is to reformat so comments are on their own lines). As an example, what is `/*/*/*/*/*/` equivalent to? It looks a bit like a fancy separator comment, but it gets parsed as `*`.
I found it, page B-10: "Misformatted comment (early termination due to an embedded */)...."
The example `/*/*/*/*/*/` would be addressed by requiring block comments to be the only thing on the line -- while it's an issue in C++, I don't see why it should be a barrier for Carbon.

@zygoloid If you're removing block comments, can you please explicitly address that in an "Alternatives considered" section or otherwise clarify the disposition?
Ping -- this proposal is now in RFC; I would've expected this to be addressed. I don't see `/*` mentioned at all in the proposal, even though it's an obvious alternative.
… left-to-right scan of the source file, using a "max munch" rule: the longest
possible next token is formed at each step.
## Rationale
Moving all the rationales to a separate section was not a win for my reading of this document. I would have preferred them as subsections next to the thing they are the rationale for. In the one case of the "encoding rationale", I probably would have been happier to read it later instead of when I found the [why?] link to it. However, I didn't know that when I saw the [why?] link, so I ended up jumping back and forth anyway.
Perhaps the ideal would have been a collapsible section, but I don't have any idea if those are available in the markdown variants we are using.
I'm attempting to follow https://github.com/carbon-language/carbon-lang/blob/master/docs/project/evolution.md#make-a-proposal -- the BLUF / Inverted Pyramid style seems to encourage putting all of the "what" before any of the "why". I was also taking the perspective of imagining that the non-rationale section (somewhat formalized) would eventually become part of the specification, with the rationale separated out into a distinct document.
I'm not sure this is the right balance. I'm much happier with this structure than that of the "Operators" document (which ended up pretty mangled because it started as an exploration of what we could do about precedence and then got hit by major scope creep, and needs some fundamental restructuring as a result).
We can introduce collapsible sections in GitHub-flavoured markdown via inline HTML -- see for example the "Digression" section in https://github.com/zygoloid/carbon-proposals/blob/operators/operators/operators.md#unary-negation -- and I could try switching to those if people would generally prefer that.
FWIW, I'm in agreement with Josh. My thought is that there should be a brief overview for the "what". All the details here are really the "why".
Collapsible or not seems minor -- this is in a proposal doc, and so will typically be seen regardless.
The document structure you have is:
CompleteSpec1
CompleteSpec2
...
Rationale1
Rationale2
...
An alternative that I think would serve the "BLUF" goal better would be:
Summary1
Summary2
...
CompleteSpec1
Rationale1
CompleteSpec2
Rationale2
...
The summary would be very brief: one or two sentences of explanation and a short code snippet example, and maybe a link to the full spec and rationale sections below. E.g.:

Literals
- Integer literals are written like: `42`, `0x3A` (hex), `0o777` (octal), `0b01011010` (binary); not `01`, `0x3a`, `0X3A` (upper vs. lower case matters).
- Real literals are written like: `1.2`, `1.0e5`, `2.0e-3`, `3.1e+1`; not `1.` (at least one digit after the `.`), `1.0e05` (exponent can't start with `0`), `1.0E5` (`e` must be lowercase).
- String literals are written like: `"abc"`, `"xyz\n123"` (`\` introduces escapes), and may contain UTF-8.

...

See the complete literals spec below.

I think this is the approach used by the language overview, and I think it would be enough for someone to read the summary and be able to parse example code. The detail level of the spec is not needed for a general understanding of the intended syntax, and is better paired with the explanations that justify those details.
I strongly concur with Josh's suggestion that the proposed rules need to be closer to the associated rationale, but I think it needs to involve more than just moving blocks of text around. The main issue I'm having with this doc is that, although the proposed rules are clearly stated, in at least some cases I can't usefully evaluate the rules without more background information, and/or more explicit discussion of their consequences (I've pointed out one instance in more detail above, namely the discussion of directionality and indentation). Rather than structuring this as a series of rules, followed by rationales for those rules, I'd recommend thinking of it more like a series of questions/problems, followed by the proposed answers/solutions.
@geoffromer I tried something like that with the operators document, and it didn't work well. In particular, the big problem with the operators doc was that it wasn't clear exactly what was being proposed, precisely because the background and rationale and exploration of alternatives was included in the same running text as the proposal itself.
I'll try Josh's approach, and we can see how that works out. That approach doesn't address the problem that rationale and specification are not in 1<->1 correspondence, but I think that can be handled on a case-by-case basis.
docs/proposals/p0016.md (Outdated)
If the character after the `/*` introducing a block comment is `{`, the comment
is a *code comment*. In a code comment, the following text is tokenized until a
matching `}*/` token is formed; such a token terminates the comment. (In
particular, such a token is not recognized if it is nested within another
comment or a literal.) Otherwise, the comment ends at the first matching `*/`
character sequence.
[[why?]](#nested-comments-rationale)
I did not find this compelling. I wonder if we might start with a simple story for comments (only //), and rely on IDEs to support adding them to a whole block of code?
I use `#if 0` to temporarily comment out code quite a lot. I'd be a little unhappy if we didn't have a comparable mechanism. Having a different way to say "this is commented-out code" versus "this is random text" is, I think, useful, because you can still syntax-highlight and format inside such commented-out regions. (Imagine you indent a region containing commented-out text and need to reflow it.)

But I don't think it's an absolute must-have. I would welcome more input on this, so I know whether I need to reinforce the rationale or change the proposal :)
+1 I use `#if 0` heavily during debugging.
Rearranged block comment support based on discord discussion. I think the question of whether we could get away with only `//` is one we should explicitly consider, though. I'll add that as an open question.
> current operator set. This requires parentheses in code that would apply
> multiple prefix or postfix operators in a row, such as `-*p`, but gives us the
This is a little concerning, but I don't know how common such things are in existing C++ code. Would a space here be allowed instead of parens?
My inclination is to say no, on the basis that a `-` with a space on the right should be a postfix or infix operator. Swift follows this same rule, and doesn't permit a space to be used to split the token in two.
@gribozavr Do you have any data on whether and to what extent this is a problem in Swift? Any user feedback you can point us at? (And if this is fine in Swift, do you think the reduced emphasis on pointers contributes to that?)
Swift has only a few prefix operators: Policy.swift
Swift does not have the issue that you're describing in practice (having to disambiguate by adding parentheses) because I don't think there is any way to string these prefix operators together in an expression that is useful in practice. Sure you can theoretically combine prefix negation and prefix bitwise complement, but is that realistic? In C++, on the other hand, we have prefix increment and pointer dereference operators that can compose in real world programs with other prefix operators.
Furthermore, if someone does fall into this trap, the compiler provides a custom error message: test/decl/func/operator.swift (implementation)
docs/proposals/p0016.md
Outdated
> Decimal integers are written as a non-zero decimal digit followed by zero or
> more additional decimal digits.
Leading `+`/`-` are a question I have too.
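As quoted, the rule corresponds to a pattern like `[1-9][0-9]*` with no sign (presumably `0` itself and any leading `+`/`-` would be handled by separate rules); a quick sketch of that reading:

```python
import re

# Hypothetical pattern for the quoted rule: a non-zero decimal digit
# followed by zero or more additional decimal digits. No sign, no leading zeros.
DECIMAL_INT = re.compile(r"[1-9][0-9]*")

def is_decimal_integer(text: str) -> bool:
    """Return True if `text` matches the quoted decimal-integer rule."""
    return DECIMAL_INT.fullmatch(text) is not None

print(is_decimal_integer("123"))   # True
print(is_decimal_integer("0123"))  # False: leading zero
print(is_decimal_integer("-5"))    # False: the sign is not part of the token
```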
docs/proposals/p0016.md
Outdated
> An *identifier* is a maximal sequence of characters beginning with a character
> with Unicode property `XID_Start`, followed by zero or more characters with
> property `XID_Continue`.
Do we want to forbid non-ASCII identifiers?
I think it's important to allow non-English identifiers.
I worry quite a bit about adversarial code in cases like this.
Let's discuss this on #19
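For reference, Python's `str.isidentifier()` implements essentially the quoted rule (`XID_Start` followed by `XID_Continue` characters, with `_` additionally allowed as a start character), so it can be used to sanity-check examples, including non-ASCII ones:

```python
# Python identifiers follow roughly the same XID_Start/XID_Continue rule
# quoted above, so isidentifier() illustrates which strings would qualify.
print("café".isidentifier())   # True: non-ASCII XID_Continue characters
print("変数".isidentifier())   # True: CJK characters have XID_Start
print("1abc".isidentifier())   # False: digits lack XID_Start
print("a-b".isidentifier())    # False: `-` lacks XID_Continue
```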
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. ℹ️ Googlers: Go here for more info. |
CLAs look good, thanks! ℹ️ Googlers: Go here for more info. |
docs/proposals/p0016.md
Outdated
> * U+2028 LINE SEPARATOR
> * U+2029 PARAGRAPH SEPARATOR
>
> Space, horizontal tab, and the LTR and RTL mark are *horizontal whitespace*
Based on #17 (comment), should we forbid RTL marks outside of strings, or at least have some restrictions?
docs/proposals/p0016.md
Outdated
> is empty.
>
> The *indentation* of a line is the sequence of horizontal whitespace characters
> at the start of the line. A line *A* has more indentation than a line *B* the
(I marvel at the realization that "indentation" is a partial ordering!)
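The quoted rule is cut off at the diff boundary above, but the partial ordering being marveled at presumably works by prefix comparison: *A* has more indentation than *B* when *B*'s indentation is a proper prefix of *A*'s, leaving mixed-tab-and-space lines incomparable. A sketch under that assumption:

```python
def indentation(line: str) -> str:
    """The leading horizontal-whitespace characters of a line."""
    stripped = line.lstrip(" \t")
    return line[: len(line) - len(stripped)]

def more_indented(a: str, b: str) -> bool:
    """True if line `a` has strictly more indentation than line `b`.

    Assumed rule (hypothetical): b's indentation is a proper prefix of a's.
    """
    ia, ib = indentation(a), indentation(b)
    return ia.startswith(ib) and len(ia) > len(ib)

print(more_indented("    x", "  y"))  # True: "  " is a proper prefix of "    "
print(more_indented("\t x", "  y"))   # False: tab vs. spaces are incomparable
```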
Small note: it'd probably be good to break this PR up into child proposals for readability/compactness, maybe corresponding to separate files (possibly in the same directory?) in the design dir. Also, given the unicorns I keep seeing trying to load this PR, splitting may be kinder to GitHub. ;) |
Agreed; I've split a couple of pieces out and will continue to do so. Closing this on the basis that I have no intention of ever taking the "big picture" proposal to a decision. |
(Looks like I forgot to submit these old comments.)
> using a "max munch" rule: the longest possible next lexical element is formed
> at each step.
>
> After division into these components, whitespace and text and block comments
True, it would be good to avoid referencing terms I've not yet defined. However, the suggested change doesn't match the intent: documentation comments are not discarded at this stage.
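The "max munch" rule quoted above can be sketched in a few lines: at each position, the longest symbol that matches wins. (The symbol set here is hypothetical, just for illustration.)

```python
# A minimal "max munch" tokenizer over a made-up symbol set:
# at each step the longest matching symbol is taken.
SYMBOLS = ["<<=", "<<", "<=", "<", "="]  # ordered longest-first

def max_munch(source: str) -> list[str]:
    tokens = []
    pos = 0
    while pos < len(source):
        for sym in SYMBOLS:  # first hit is the longest match at this position
            if source.startswith(sym, pos):
                tokens.append(sym)
                pos += len(sym)
                break
        else:
            raise ValueError(f"no token at position {pos}")
    return tokens

print(max_munch("<<<="))  # ['<<', '<=']: '<<' is munched first, then '<='
```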
> Carbon source files are Unicode text files encoded in UTF-8. An initial UTF-8
> BOM is permitted and ignored.
> [[why?]](#encoding-rationale)
I have been intentionally keeping the [why?] links on their own line to improve the readability and maintainability of the Markdown source. Do you think the source would be improved by moving this onto the previous line? (I read the instruction that Prettier "should" be used as permitting me to use a different style if there is justification.)
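Regarding the quoted "initial UTF-8 BOM is permitted and ignored" rule: Python's `utf-8-sig` codec shows the intended behavior, stripping the BOM on decode while a plain `utf-8` decode keeps it as U+FEFF. (The `fn main() {}` source text is just a placeholder.)

```python
# A UTF-8 BOM (EF BB BF) at the start of the file is ignored;
# the 'utf-8-sig' codec strips it during decoding.
raw = b"\xef\xbb\xbffn main() {}"
print(raw.decode("utf-8-sig"))       # fn main() {}
print(repr(raw.decode("utf-8")))     # '\ufefffn main() {}' -- BOM survives
```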
> Carbon is currently based on Unicode 13.0, and will adopt new Unicode versions
> as they are published.
>
> **Open question:** Should we require source text to be in NFC, as C++ plans to
The document says to use NFC, but there have been questions raised as to whether that's the right thing, so I want an explicit discussion and decision on this question. I'm expecting that the outcome from the review will be that revisions are necessary -- in particular, if we choose to normalize identifiers ourselves, this has ripple effects throughout the document that will require revision in various places.
If you'd like this presented in a different way, let me know, but I'm reluctant to remove the wording describing the consequences from this decision unless there's some indication that we want a different outcome than the one I suggest. I'm not really sure yet how the proposal process should work when we have open questions that will need to be answered before the proposal can be considered complete.
Perhaps instead of describing this as an open question, I could describe it as a known point of potential dissent from the proposal?
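For concreteness, the NFC question is about inputs like a decomposed `é` (`e` plus a combining acute accent) versus the single precomposed code point; `unicodedata.normalize` shows the difference an implementation would otherwise have to paper over:

```python
import unicodedata

decomposed = "e\u0301"   # 'e' + U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # 'é' as a single code point
print(decomposed == precomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```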
> appearance of the source code (as determined by the Unicode Bidirectional
> Algorithm) matches the token order as interpreted by the Carbon implementation?
That changes the binding of "as determined [...]" from "appearance of the source code" to "Carbon implementation". Would replacing the parentheses with commas help?
> on what we decide for [directionality](#directionality), perhaps LTR marks),
> which would lead to a substantially simpler indentation rule.
>
> #### Directionality
A previous revision went into more depth here, and required Carbon source to have proper directionality in order to be valid, but I ended up deciding that the details are far too involved and messy for it to be reasonable to define them here.
@mconst previously suggested a stricter rule: the directionality for all characters outside identifiers and the contents of string literals and comments is required to be left-to-right. I think that'd be somewhat easier to specify and something that we could make mandatory.
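One way such a stricter rule could be approximated is by rejecting characters whose `Bidi_Class` is right-to-left outside identifiers, string literals, and comments; `unicodedata.bidirectional` exposes that property. A hedged sketch (the set of rejected classes is an assumption, not part of the proposal):

```python
import unicodedata

# Bidi_Class values that indicate right-to-left behavior (illustrative set).
RTL_CLASSES = {"R", "AL", "AN", "RLE", "RLO", "RLI"}

def has_rtl(text: str) -> bool:
    """True if any character in `text` has a right-to-left Bidi_Class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)

print(has_rtl("abc = 1;"))     # False: all characters are LTR or neutral
print(has_rtl("x\u200f= 1;"))  # True: U+200F RIGHT-TO-LEFT MARK has class R
```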
> left-to-right scan of the source file, using a "max munch" rule: the longest
> possible next token is formed at each step.
>
> ## Rationale
@geoffromer I tried something like that with the operators document, and it didn't work well. In particular, the big problem with the operators doc was that it wasn't clear exactly what was being proposed, precisely because the background and rationale and exploration of alternatives was included in the same running text as the proposal itself.
I'll try Josh's approach, and we can see how that works out. That approach doesn't address the problem that rationale and specification are not in 1<->1 correspondence, but I think that can be handled on a case-by-case basis.
> ## Rationale
>
> ### Encoding rationale
You're right. Do you have a better word than "encoding" in mind? "character sets" maybe?
> order from how they would be interpreted by a Carbon implementation.
>
> If we allow explicit left-to-right marks in the source code and treat them as
> whitespace, such issues can be fixed by the Carbon formatting tool.
That's a good point. I don't think we can simply remove them before tokenization, because we want to retain them in string literals. I suppose we could either remove them before tokenization and then put them back within string literals, or we could reject programs where two tokens are separated only by zero-width characters and would be tokenized differently if those characters were removed.
(This choice seems to be a little at odds with UAX31-R3 "To meet this requirement, an implementation shall use Pattern_White_Space characters as all and only those characters interpreted as whitespace in parsing". That rule alternatively allows an implementation to use a profile to determine a set of whitespace characters, but doesn't seem to have a provision for requiring at least one non-zero-width character in any run of whitespace.)
> ### Comment introducers rationale
>
> We anticipate the possibility of adding additional kinds of comment in the
Interesting. To me, either phrasing seems correct, but the current phrasing reads better. I'm thinking of this in the context of: "We have three kinds of comment right now, and might introduce additional kinds of comment in the future." For me, the use of the singular "comment" rather than the plural "comments" brings to mind the abstract notion of comments, rather than about a particular set of extant comments being divided into kinds.
https://ell.stackexchange.com/a/1276/7958 has a similar take, but is citing "The Cambridge Guide to English Usage". Maybe this is a regional difference? Do you consider the current form to be wrong, or just unusual?
> ### Block comment alternatives
>
> We considered various different options for block comments. Our primary goal
> was to permit commenting out a large body of Carbon code, which may or may not
Do you have a concrete alternative in mind?
The only change here is to update the fuzzer build extension path. The main original commit message:

> Add an initial lexer. (#17)
>
> The specific logic here hasn't been updated to track the latest discussed changes, much less implement many aspects of things like Unicode support.
>
> However, this should lay out a reasonable framework and set of APIs. It gives an idea of the overall lexer architecture being proposed. The actual lexing algorithm is a relatively boring and naive hand written loop. It may make sense to replace this with something generated or other more advanced approach in the future, getting the implementation right was not the primary goal here. Instead, the focus was entirely on the architecture, encapsulation, APIs, and the testing infrastructure.
>
> The architecture of the lexer differs from "classical" high performance lexers in compilers. A high level summary:
>
> - It is eager rather than lazy, lexing an entire file.
> - Tokens intrinsically know their source location.
> - Grouping lexical symbols are tracked within the lexer.
> - Indentation is tracked within the lexer.
>
> Tracking of grouping and indentation is intended to simplify the strategies used for recovery of mismatched grouping tokens, and eventually use indentation.
>
> Folding source location into the token itself simplifies the data structures significantly, and doesn't lose any fidelity due to the absence of a preprocessor with token pasting.
>
> The fact that this is an eager lexer instead of a lazy lexer is designed to simplify the implementation and testing of the lexer (and subsequent components). There is no reason to expect Carbon to lex so many tokens that there are significant locality advantages of lazy lexing. Moreover, if we want comparable performance benefits, I think pipelining is a much more promising architecture than laziness. For now, the simplicity is a huge win.
>
> Being eager also makes it easy for us to use extremely dense memory encodings for the information about lexed tokens. Everything is created in a dense array, and small indices are used to identify each token within the array.
>
> There is a fuzzer included here that we have run extensively over the code, but currently toolchain bugs and Bazel limitations prevent it from easily building. I'm hoping myself or someone else can push on this soon and enable the fuzzer to at least build if not run fuzz tests automatically. We have a significant fuzzing corpus that I'll add in a subsequent commit as well.

This also includes the fuzzer whose commit message was:

> Add fuzz testing infrastructure and the lexer's fuzzer. (#21)
>
> This adds a fairly simple `cc_fuzz_test` macro that is specialized for working with LLVM's LibFuzzer. In addition to building the fuzzer binary with the toolchain's `fuzzer` feature, it also sets up the test execution to pass the corpus as file arguments which is a simple mechanism to enable regression testing against the fuzz corpus.
>
> I've included an initial fuzzer corpus as well. To run the fuzzer in an open ended fashion, and build up a larger corpus:
>
> ```shell
> mkdir /tmp/new_corpus
> cp lexer/fuzzer_corpus/* /tmp/new_corpus
> ./bazel-bin/lexer/tokenized_buffer_fuzzer /tmp/new_corpus
> ```
>
> You can parallelize the fuzzer by adding `-jobs=N` for N threads. For more details about running fuzzers, see the documentation: http://llvm.org/docs/LibFuzzer.html
>
> To minimize and merge any interesting new inputs:
>
> ```shell
> ./bazel-bin/lexer/tokenized_buffer_fuzzer -merge=1 \
>     lexer/fuzzer_corpus /tmp/new_corpus
> ```

Co-authored-by: Jon Meow <46229924+jonmeow@users.noreply.github.com>
We want to support legacy identifiers that overlap with new keywords (for example, `base`). This is being called "raw identifier syntax" using `r#<identifier>`, and is based on [Rust](https://doc.rust-lang.org/reference/identifiers.html). Note this proposal is derived from [Proposal #17: Lexical conventions](#17). Co-authored-by: zygoloid <richard@metafoo.co.uk> --------- Co-authored-by: Carbon Infra Bot <carbon-external-infra@google.com>
Possible set of lexical conventions for Carbon. Early draft circulated for initial feedback.
The primary principles leading to this approach are to make language evolution (adding keywords, operators, brackets, new kinds of comments) as easy as possible, and to make lexing and parsing as straightforward and efficient as we reasonably can.
RFC: https://forums.carbon-lang.dev/t/rfc-lexical-conventions/67
Fixes #16.