Zig source encoding #663
Comments
Isn't forbidding the CRLF line-ending style too restrictive for common use? It will be an oddity in the Windows world, and even if you conscientiously use LF in your editor, it may still cause problems with some kinds of automated source processing. I know of no other popular language with such a restriction.
See the discussion in #544.
Is this done? What are the action items to resolve this issue?
This is done in the self-hosted compiler. I think that's good enough.
Maybe CRLF should be allowed in
Non-ASCII identifiers are a very important feature to me. For code which isn't meant to be published for an English-speaking audience, I regularly use identifiers which can't be represented in ASCII. The current “solution” of enforcing only ASCII in identifiers is very anglocentric.
I pointed a fuzzer at the tokenizer and it crashed immediately. Upon inspection, I was dissatisfied with the implementation. This commit removes several mechanisms:

* Removes the "invalid byte" compile error note.
* Dramatically simplifies tokenizer recovery by making recovery always occur at newlines, and never otherwise.
* Removes UTF-8 validation.
* Moves some character validation logic to `std.zig.parseCharLiteral`.

Removing UTF-8 validation is a regression of #663, however, the existing implementation was already buggy. When adding this functionality back, it must be fuzz-tested while checking the property that it matches an independent Unicode validation implementation on the same file. While we're at it, fuzzing should check the other properties of that proposal, such as no ASCII control characters existing inside the source code.

Other changes included in this commit:

* Deprecate `std.unicode.utf8Decode` and its WTF-8 counterpart. This function has an awkward API that is too easy to misuse.
* Make `utf8Decode2` and friends use arrays as parameters, eliminating a runtime assertion in favor of using the type system.

After this commit, the crash found by fuzzing, which was "\x07\xd5\x80\xc3=o\xda|a\xfc{\x9a\xec\x91\xdf\x0f\\\x1a^\xbe;\x8c\xbf\xee\xea", no longer causes a crash. However, I did not feel the need to add this test case because the simplified logic eradicates most crashes of this nature.
This issue exists to document the rationale for Zig's source encoding. The rules below will be added to the docs, but the rationale discussion will be linked from the docs to here.
Discussion
Goals
We want some kind of unicode support
We want to support unicode in some contexts in Zig, such as string literals:
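The code snippet from the original issue isn't reproduced above; the following is only an illustrative sketch of the kind of thing meant (it uses present-day `std.debug.print`, not the author's original example): a non-ascii string literal passed to a print function.

```zig
const std = @import("std");

pub fn main() void {
    // The string literal contains non-ascii characters, so the source
    // file itself must hold multi-byte utf8 sequences.
    std.debug.print("héllo, мир, 世界\n", .{});
}
```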
So we don't want to force all bytes of a zig source file to be ascii.
Each rule in Zig's grammar is either defined with a character whitelist accepting only specific ascii characters (e.g. `[0-9A-Za-z_]` used in identifiers) or with a character blacklist accepting any character except for the terminator/escape characters (e.g. `//.*?\n` for comments). Here are the contexts where any character is allowed (using `#` as a placeholder for the characters):

* `'#'` (character literals)
* `"#"` and `c"#"` (string literals)
* `\\#` and `c\\#` (multiline string literals)
* `//#` (comments)
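As a rough sketch (not from the original issue; it uses today's syntax, and the `c"#"`/`c\\#` forms from the era of this proposal are omitted), these contexts look like:

```zig
// Comments may contain non-ascii text: привет!
const cp = 'й'; // character literal
const s = "日本語のテキスト"; // string literal
const m = // multiline string literal
    \\non-ascii is fine here too: héllo
;
```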
It's tempting to simply allow any byte value in those contexts while searching for the terminator. This allows utf8 in string literals and is easy to support in the compiler. This isn't as robust as providing a unicode string type, but it works well enough for some usecases, like the `print` example above.

The problem with turning a blind eye to Unicode
If we want an editor to display the `print` example above the intended way, then editors really need to be interpreting the zig source file as utf8. Additionally, it's very natural in many programming environments (e.g. Python 3, Node JavaScript) to read a file as a string rather than as bytes, and the obvious encoding to reach for is utf8.

If the zig compiler simply tolerates any bytes where utf8 might be, then it's possible to have "correct" zig code with invalid utf8 sequences. This corner case will have undesirable consequences for naive consumers of the zig source, such as throwing an exception or crashing when simply trying to read the file as a string. If invalid utf8 sequences are valid zig, then zig really isn't utf8 compatible, which is an awkward situation for bytes-to-string conversion; there'd be no correct way to convert zig source bytes to a string.
We want valid zig source to be easy to consume, and we want to support unicode in some way, therefore zig source code shall be encoded in utf8.
Zig source is UTF-8 encoded
It is a compile error for zig source to contain invalid utf8 byte sequences. There are plenty of examples of these: "//\xff", "//\x80", "//\xc2", "//\xc0\x80", "//\xc2\x00", etc.
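As a sketch of what enforcing this looks like for a tool (assuming `std.unicode.utf8ValidateSlice`, which recent Zig standard libraries provide; the helper name is made up for illustration):

```zig
const std = @import("std");

/// Returns an error if `source` is not valid utf8.
fn checkUtf8(source: []const u8) error{InvalidUtf8}!void {
    if (!std.unicode.utf8ValidateSlice(source)) return error.InvalidUtf8;
}

test "invalid utf8 byte sequences are rejected" {
    try checkUtf8("// fine: héllo\n");
    try std.testing.expectError(error.InvalidUtf8, checkUtf8("//\xff"));
    try std.testing.expectError(error.InvalidUtf8, checkUtf8("//\xc0\x80"));
}
```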
Although zig source code is technically in unicode, this does not mean that zig's grammar allows non-ascii unicode outside the "blacklist character" contexts outlined above. You cannot have identifiers in Russian, nor can you use NBSP to format your code. Outside of string literals and comments, it's always an error to have a byte value greater than `0x7f` (this is discussed more below).

Line endings are important
Comments and multiline string tokens are terminated by the end of the line. Knowing where lines end is critical to understanding zig source code.
In zig, all lines are terminated by an LF character, `"\n"`. It is an error for zig source to contain CR characters. This suits the goal of making valid zig source easy to consume. You can either look simply for `"\n"`, or you can use a general-purpose regex like `\r\n?|\n`. Either one will work, because all the complex alternatives to `"\n"` are guaranteed to not be present in the source.
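For example, a consumer written in Zig can split source into lines with nothing more than the single byte `'\n'` (a sketch, assuming `std.mem.splitScalar` from recent Zig standard libraries):

```zig
const std = @import("std");

fn countLines(source: []const u8) usize {
    // Valid zig source contains no CR, so the single byte '\n' is the
    // only line terminator we ever need to look for.
    var count: usize = 0;
    var it = std.mem.splitScalar(u8, source, '\n');
    while (it.next()) |_| count += 1;
    return count;
}

test "count lines by splitting on LF only" {
    try std.testing.expectEqual(@as(usize, 3), countLines("a\nb\nc"));
}
```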
But we can't stop there. Visual Studio recognizes even more variations on line ending style, including the unicode line breaks NEL, LS, and PS. If NEL, LS, and PS are allowed to show up in zig comments without terminating the comment, then we've got a weird corner case for anyone making a Visual Studio plugin for Zig syntax.
Therefore, we impose an additional restriction on valid zig source: it must not contain the NEL, LS, or PS unicode code points. These characters are encoded in multiple bytes, so this adds complexity to zig source validators. However, this complexity is justified, because it makes valid zig source easier to consume (remember the goals above).
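A sketch of that extra check (the byte sequences are the standard utf8 encodings of U+0085, U+2028, and U+2029; the function name is made up for illustration, and it assumes the source has already passed utf8 validation):

```zig
const std = @import("std");

/// Rejects source containing NEL (U+0085), LS (U+2028), or PS (U+2029).
fn checkNoUnicodeLineBreaks(source: []const u8) error{ForbiddenLineBreak}!void {
    const forbidden = [_][]const u8{
        "\xc2\x85", // NEL
        "\xe2\x80\xa8", // LS
        "\xe2\x80\xa9", // PS
    };
    for (forbidden) |seq| {
        if (std.mem.indexOf(u8, source, seq) != null) return error.ForbiddenLineBreak;
    }
}
```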
Ascii control characters are mostly no good
Control characters `'\x00'` through `'\x1f'` and `'\x7f'` are mostly useless. The only control character zig recognizes is `'\x0a'`, a.k.a. `'\n'`, which is always and only the line terminator. All the other control characters have either superfluous (CR), confusing (BS), inconsistent (VT), or otherwise obsolete (ENQ) behavior, and they are all banned everywhere in zig source code. (For the debate on windows line endings and hard tabs in zig, see #544.)
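A sketch of the corresponding check (a hypothetical helper, not compiler code): every byte below 0x20 other than LF, plus the byte 0x7f, is rejected.

```zig
/// Rejects every ascii control character except '\n' (0x0a).
fn checkControlCharacters(source: []const u8) error{BadControlCharacter}!void {
    for (source) |byte| {
        const is_control = byte < 0x20 or byte == 0x7f;
        if (is_control and byte != '\n') return error.BadControlCharacter;
    }
}
```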
Other crazy unicode stuff isn't as important
There's a huge amount of weird stuff you can do with unicode, like right-to-left text, zero-width characters, and the poop emoji. Although Zig does want to be a readable language, there's a limit to how much we can enforce when it comes to obscure unicode craziness. You're going to be able to make pretty obfuscated unicode string literals if you try, and zig isn't going to try to stop that. The important thing is that the unicode doesn't interfere with the interpretation of zig's grammar.
If some unicode craziness is found that zig allows that confuses naive editors or analysis tools, then we should consider imposing additional restrictions for the sake of keeping zig easy to consume.
The rules
* Zig source must be valid utf8; invalid utf8 byte sequences are a compile error.
* The only control character allowed is LF (`'\x0a'`, a.k.a. `'\n'`); CR and every other ascii control character (`'\x00'` through `'\x1f'`, plus `'\x7f'`) are errors.
* The unicode code points NEL, LS, and PS are errors.
* Outside of string literals, character literals, and comments, any byte value greater than `0x7f` is an error; non-ascii characters (e.g. `'й'`) are only allowed inside those contexts.

Note: From the above rules, and from the zig grammar, further guarantees follow for consumers of zig source; they are spelled out below.
Implications for consumers
If you have zig source that you know is valid, you can trust that:

* It is valid utf8, so reading it as a string cannot fail.
* Every line is terminated by `"\n"` and nothing else, so splitting on that single byte is sufficient.
* Outside of string literals and comments, the only whitespace characters are `" "` and `"\n"`. You can look for exactly `" "` and `"\n"`, or you can use a generic whitespace scanner that checks for `"\r"`, `"\t"`, and the 25 unicode whitespace characters; either works, because the alternatives are guaranteed not to be present (see the sketch below).
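For instance, a formatter or analyzer built on these guarantees only ever has to trim the space character within a line (a sketch, ignoring string-literal and comment contents):

```zig
const std = @import("std");

/// Since " " and "\n" are the only whitespace in valid zig source
/// (outside of string literals and comments), trimming a line only
/// needs to consider the space character; "\t" and "\r" cannot occur.
fn trimLine(line: []const u8) []const u8 {
    return std.mem.trim(u8, line, " ");
}
```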