consider allowing non-ascii identifiers #3947
Comments
It would need to follow Unicode TR31 or possibly even TR46. Note that the Unicode tables required are non-trivial in size.
I think that Rust's approach is a very good one. It determines what is and isn't allowed in identifiers based on whether or not characters have the XID_Start and XID_Continue Unicode properties. It also normalizes all identifiers using NFC before comparing them, so differences in normalization between source files can't lead to identifiers failing to match. Finally, it forbids any unassigned code points (at the time of the release of the current version of the compiler) from being used in identifiers, since their properties are unknown.
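As a rough illustration of that rule, here is a sketch in Python (chosen only because Python's own identifier definition in PEP 3131 is based on the same XID_Start/XID_Continue properties; this is not Zig or Rust code):

```python
import unicodedata

def canonical_identifier(name: str):
    """Return the NFC-normalized form of `name` if it passes a Rust-style
    identifier check, otherwise None."""
    # Reject unassigned code points outright; their properties are unknown.
    if any(unicodedata.category(ch) == "Cn" for ch in name):
        return None
    # Normalize first so that differently-composed spellings compare equal.
    canonical = unicodedata.normalize("NFC", name)
    # str.isidentifier() implements PEP 3131, which is defined in terms of
    # XID_Start / XID_Continue, so it stands in for the property lookup here.
    return canonical if canonical.isidentifier() else None

assert canonical_identifier("сумма") == "сумма"
assert canonical_identifier("1abc") is None   # digits can't start an identifier
assert canonical_identifier("a-b") is None    # '-' is not XID_Continue
```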
@daurnimator It's true that any reasonable solution (other than just a free-for-all) would require including some Unicode property tables with the compiler. We're probably looking at maybe 50 KiB of data for this. If this is truly a size concern, making it an optional component could be possible. However, I suspect that these tables will quickly get dwarfed in size by other components of the toolchain.
Finally, I want to address the backtick proposal. Personally, this doesn't feel like it would be that pleasant to use. It seems more like a way to interface with existing identifiers that you absolutely must use than a way to deal with identifiers day-to-day. In comparison, simply checking Unicode properties isn't really that hard to implement. The largest concern I think would be the size of the tables, not the code that needs to look characters up in those tables. Rust is taking so long on this issue because before they stabilize it, they want to implement dozens of lints to warn users about similar-looking characters, mixed-script identifiers, and so on. Depending on whether or not those things are seen as a priority, this could be anywhere from a simple fix to a huge project. Personally, I lean more towards not caring about confusable identifiers, unlike the Rust team. If your team members are screwing with you by replacing random letter As in your identifiers with Cyrillic, you need to find a better team.
I think using characters other than basic Latin will make it harder to reuse code.
@Rocknest This is something that I've seen come up again and again when programming languages start discussing how to handle identifiers. However, it has never been a convincing argument to me. Now, that's not to say that having identifiers in, say, the Greek alphabet, won't make it harder for people to use some library you've written. That's essentially a given. Rather, what doesn't convince me is that this is a sufficient reason to disallow such identifiers. A programming language, to put it simply, is not your mom. It's up to you who the target audience for your code is, and how you want to present it to them. But let's say that you don't agree with that. Let's say you want to encourage everyone to use English identifiers in their code to improve global code reuse, so you design your language to enforce only ASCII identifiers, since that will encourage people to use English. Well, in my experience, this simply doesn't work. When someone wants to use identifiers in a certain (human) language, they do, no matter what characters the (programming) language lets them use. I've heard from a Japanese developer friend of mine (and I have seen for myself in codebases that I have inspected) that when people are forced to use ASCII identifiers, what they end up doing is writing identifiers in their preferred language, but filtered through the most inconsistent, ad-hoc, ugliest romanization schemes that you have ever seen. Ultimately, this does more to hurt the readability of code than it does to help it.
```zig
var сумма: i32 = 0;
for (массив) |икс| {
    сумма += икс;
    if (икс < 3) {
        continue;
    }
    break;
}
```
Is this going to affect the speed of zig fmt? Currently zig fmt feels laggy compared to go fmt, but I don't know how many inefficiencies there are at the moment, and this seems to me a more universally relevant point than whether a relatively small group of people should be forced to use English identifiers or not. What engineering implications come with this proposal?
Go supports Unicode identifiers and
This shouldn't affect the speed of formatting at all in any noticeable way. It still just has to search for brace and whitespace characters, which will be encoded exactly the same way as before. The parser already handles multibyte UTF-8 sequences for the sake of comments and string literals, so this wouldn't slow that down either. Also, I strongly disagree with your conclusion about only a small number of people being “forced to use English identifiers”. As I mentioned earlier, there are vast swaths of the world where English is not widely spoken by programmers, which includes much of East Asia. In those places this isn't just a matter of personal preference, unlike in most of Europe, where English is required for most programming jobs.
@Serentty I've seen and worked with Java code with Cyrillic identifiers. And it's AWFUL. You have to switch keyboard layouts every few seconds or do copy-paste madness, and that reduces coding speed to the point of being impractical outside educational purposes. And ad-hoc romanization arises naturally, making Unicode identifiers obsolete.
The whole problem with ad-hoc romanization is that it is by its nature inconsistent. But I think ultimately, arguments about whether each of us would rather work with such identifiers or not are a distraction from the actual problem at hand. I've never agreed with the thinking that if you don't like something, you should stop other people from doing it.
@Serentty I said that this feature is devoid of practical use. And I think that it would not solve anything, so why add something useless to the language? By the way, in Zig hard tabs are not allowed, so your last argument probably does not apply here either.
Binary size of the unicode data is not a factor here. 50K installation size is nothing compared to the rest of the payload.
You yourself said that you have seen people using Cyrillic identifiers. So clearly there is a use for this. You might not like it, sure, but it is indeed something that many people use.
I'm not a fan of that decision either, but regardless, enforcing a certain indentation style is nowhere near as serious of an issue as the cultural implications of enforcing the English subset of the Latin script.
I support non-ascii identifiers without backticks.
How about a keyword, e.g.:
Normalization sounds like a job for `zig fmt`. I think it's possible to allow non-ascii identifiers while still achieving my goals in #663 of making valid zig source code easy to process by naive tools. If identifiers are guaranteed to be normalized and validated in order for your code to compile, then a naive parser could consider any non-ascii bytes outside comments and string literals to be identifier characters. Maybe normalization isn't necessary, but it seems like a nice feature to include in the proposal. Does normalization require the full Unicode tables?
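To make the "naive tools" point concrete, here is a rough Python sketch (purely illustrative, not part of any Zig tooling) of how such a tool could find identifier-looking runs in a line it already knows contains no string literals or comments, by treating every byte at or above 0x80 as an identifier byte:

```python
import re

def naive_identifier_spans(line: bytes):
    """ASCII word bytes plus any byte >= 0x80 count as identifier bytes;
    everything else is treated as punctuation or whitespace. This never
    splits a multibyte UTF-8 sequence, because its continuation bytes are
    all >= 0x80 as well."""
    return [m.group().decode("utf-8")
            for m in re.finditer(rb"[A-Za-z0-9_\x80-\xff]+", line)]

print(naive_identifier_spans("сумма += икс;".encode("utf-8")))
# ['сумма', 'икс']
```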
I'm from Italy, I've worked for a while in SEA, and I've found that people there, at least in tech circles, can deal with English much better than the average Italian developer (there = Singapore, Malaysia, Thailand). My point is that I think it's an exaggeration to say that allowing non-ASCII identifiers is going to have that big of an impact because they will still have to deal with ASCII identifiers, like today, from all the libraries that they will use, so it's not like they're going to have the option of not having to learn to read/write English symbol names. This is only going to add freedom to the identifiers they will create themselves, which is nice, but not that huge of a change in the daily life of the average SEA developer. On the other hand, the arguments about lowering code reusability are totally moot in my mind. People that don't want to worry about others reusing their code will find a way to make it hard to understand anyway, like you pointed out, and conversely I think a library that uses Japanese identifiers internally, but provides an English interface and documentation, should still count as reasonably universal code. So in conclusion I think the whole "programmer's freedom" vs "code universality" dichotomy doesn't really expose all the important, practical questions that should be explored first. A few examples:
And for each question: If so, by how much? Additionally:
For example, referring back to Rust's strategy, people might want to map HashMaps from databases to structs, and they might feel confused and disappointed when they discover that their seemingly normal identifiers sometimes don't map correctly because of different normalization choices between the various tools involved (zig, database, other clients). To me this example doesn't even seem all that hypothetical, because I wanted to cover the mapping use case in my Redis client. With the current setup, when somebody wants to map any non-trivial identifier, they can do so with `@"..."`.

A good property of the latin alphabet is that it's simple, much simpler than all the alternatives, and that results in some nice properties, more often than not. If the downsides are minor, I see no reason to prevent people from using the symbols they like most, but we should really start with a thorough and practical investigation of that first, and leave other concerns for later. IMO.
Also, I almost forgot, people in Asia are going to use Zen anyway, no? 😆
Normalisation is the compiler's job. The symbol table needs normalised identifiers to properly match them.
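A small Python demonstration of why, just to illustrate the point: the two spellings of "é" below are different code-point sequences, so a symbol table keyed on raw, unnormalised text would treat them as unrelated identifiers.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as a single code point (NFC form)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute (NFD form)

assert composed != decomposed       # raw comparison: different keys
assert unicodedata.normalize("NFC", composed) == \
       unicodedata.normalize("NFC", decomposed)

# Without normalisation, a lookup using the "other" spelling silently misses.
symbols = {composed: "defined here"}
assert decomposed not in symbols
```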
Personally, I think that switching to backticks is the best solution here. Such a syntax (presently
I don't mind introducing this shorter syntax for escaped identifiers, but I don't see it as being a solution to the problem here. Such identifiers are essentially second-class.
This is just patently untrue. While the Latin alphabet is nowhere near the most complicated, it's not even close to being the simplest. It has many properties, such as having two versions of each letter for a case distinction, that most writing systems eschew. I'd rather not deal in rationalizations here. The (English subset of the) Latin alphabet is privileged in computing because of the historical influence and dominance of American and British computer companies, not because of any inherent properties that make it better for computers. This is, in my opinion, entirely a legacy issue.
@daurnimator I was specifically stating that only the syntax should change, not the semantics.
Another option would be to have all identifiers in backticks, but that just trades one problem (treating non-ASCII identifiers as second class) for another (all identifiers are a PITA to type). Upon further reflection, I think this is worth it, with the caveat that if there's a performance impact in code that would work without Unicode support, I'd prefer an option to assert that all identifiers are ASCII and avoid the checks (maybe a compile-time option - CMake setting for stage1 - instead of runtime).
@pixelherodev I've never seen any correlation between compile times and support for non-ASCII identifiers. I would be greatly surprised if there's a noticeable impact at all on codebases of non-trivial size. Either way, this is something that can't really be known until after support for such identifiers is implemented. If an all-ASCII pragma speeds things up, it can always be added.
No, I don't think I am. The subject is about allowing non-ascii in free identifiers. The argument for it is "allow people to write in their native language", but I have seen little evidence that non-ascii is then used in actual code for names. In fact it adds pitfalls. I also dislike the idea of using Unicode symbols like "≤" to shorten operators or create new ones. The reason is that I have not seen any proof that this provides a tangible improvement. It can obviously be added anyway, but to avoid the kitchen sink problem it is important to skip features that do not bring any proven benefit. This feature is easy to add but impossible to remove without breaking things. A programming language should not be a repository of possibly interesting features that could fit – unless one takes C++ as the epitome of good design.
@lerno I know that non-ASCII operators and non-ASCII identifiers aren't the same thing, but this seems like a bit of a contradiction to me. You say you haven't seen evidence of use, but then mention non-ASCII operators (which are very popular in certain languages, such as Haskell and Agda), only to dismiss them on the basis of not liking them.
@Serentty Now we're shifting the subject somewhat, to that of non-ASCII operators. I agree that they look attractive, but what I don't like is that I now need to find a way to actually enter them on a keyboard. While I can set up my particular environment to type them with certain keystrokes or sequences of keys, it creates an implicit dependency on the setup that doesn't transfer to me typing code on a colleague's computer (often a different OS!), or when using text editing tools that are not 100% UTF-8 compatible. But I want to underline that this is a different problem from non-ASCII identifiers. As for use: arguing that Agda's and Haskell's use of non-ASCII operators "proves their usefulness" can be refuted. Neither is a big mainstream language. We do have one big mainstream language with non-ASCII operators though: Swift. At its inception people went wild with non-ASCII operators. It was then more or less agreed by the community that using them was mostly a bad idea and should be avoided in most cases, as far as I know. It certainly created its fair share of bugs and problems in Swift itself. This is not to say that Zig shouldn't have them. It's just that I always like proof of usefulness beyond "it might be useful", because by such a criterion almost anything conceivable could go into a language, and I firmly believe that making a language larger than it absolutely needs to be is a bad idea. I cannot prove this for a fact, but seeing how everyone is using "a subset of C++" indicates that for a sufficiently large language people start using subsets, which in turn means that there is actually more than one way (sometimes many more than one) to do a thing. And from what I understand this seems to be something that Zig tries to avoid?
Well, in the case of Haskell, I don't think it really makes the language bigger, because operators are themselves identifiers there, so this is just a natural extension of the fact that Haskell doesn't limit identifiers to ASCII. I do think you're playing down Haskell a bit, though. Yes, Agda is a fringe research language, but Haskell is definitely an established programming language. I don't think defining fancy mathematical operator aliases for all the standard operators is necessarily a good idea, as that does add to the size of the language. I just see a difference between language size and language restrictions. In the same light, I don't think allowing non-ASCII identifiers makes the language any bigger. I don't think standards that the language relies on should be counted as part of the size of the language, unless you also want to count POSIX as part of the Zig standard just because the standard library uses it.
@Serentty Well, I can grant you that allowing a wider range of, say, available identifiers does not necessarily make the language bigger, but that depends: does the language make guarantees when it comes to identifiers, or is it just "a conforming compiler must allow these characters and may allow these other ones"? In the latter case the language is not getting bigger, but if a conforming compiler must support Unicode, then there have to be strong constraints on what Unicode to allow, how it is encoded in a portable way, etc. And this makes the language bigger. I'm thinking about things like Swift running into problems like security issues due to unprintable Unicode characters being allowed in operators and literals. So these things must be considered when opening up for Unicode.
Unicode provides a standard annex already for what programming languages should allow in identifiers. So it's not really a huge amount of work in my opinion. The only reason it has taken Rust so long is that they want to have handy compiler hints that tell you when you might accidentally confuse yourself with identifiers. The actual specification of what is allowed is all done already. While in some painfully theoretical sense maybe you could consider the list of character properties as being part of the “size” of the language, keeping track of which characters have properties like XID_Start and XID_Continue isn't something that a human should need to think about, even when maintaining a compiler for the language, so I don't consider this equivalent to the kind of “size” that you get when you add duplicate features and unnecessary complexity that is directly relevant to the syntax of the language.
This has been discussed to death in every discussion like this around programming languages and identifiers, but it doesn't seem like a realistic concern to me. Homoglyph attacks are something to worry about in things like domain names and email addresses, where they can fool someone as to the identity of an attacker, so it's understandable that great measures are taken to detect when this might be happening. But for code, I just can't see something like this posing a real security threat. Most programming languages have allowed Unicode for identifiers for decades at this point, with many of them not even following Unicode's recommendations about what NOT to allow (for example, Unicode does not recommend allowing emoji in identifiers, and yet many languages allow this), and yet I haven't heard of this ever being a problem in the real world. The only case where I can imagine this being used for an attack is someone sending in a pull request which uses similar-looking characters to obfuscate intentional security vulnerabilities, with the intention that reviewers won't see them. However, this would be a very risky strategy because if caught it instantly discredits the programmer and proves that the vulnerability was intentional. There are much better ways to obfuscate something like that, with much better plausible deniability.
I am aware of the Unicode annex for identifiers. But that does not refute my argument that this increases language complexity, in particular identifier parsing and validation. As for the last vulnerability: you can obfuscate such an attack by targeting a third-party OS library, which you then include, and which is outside of internal auditing.
So you're proposing trying to get a homoglyph attack merged into an OS library? Again, this seems like a ridiculously far-fetched scenario to me. This is a possibility that has existed for decades, and I've never heard of it causing any major issues. In domain names and email addresses, sure, but that's not the same scenario at all. Frankly, I imagine that an attacker trying to get malicious code merged is going to try harder than this.
Who talked about OS libraries? No, I was thinking about normal midsized businesses. At one time I worked at an online poker company, and there was at least one attempt by an employee (a programmer) to make money by using access to the internal systems. I don't know the details, but that's the scenario: a programmer finds a way to insert vulnerabilities by including a seemingly harmless 3rd party dependency (there were a lot of them in some systems where I worked, enough that there was no detailed oversight).
I'm a bit confused by what you mean. So, the scenario is for the attacker to write a dependency themselves, use homoglyphs to obfuscate what it does, and then add it as a dependency? I still stand by my statement that I would require evidence of this ever happening on any significant scale and causing real security issues for me to worry about this.
Usually "on a significant scale" is not how one treats security vulnerabilities. My point is only:
It's then totally up to others to determine whether the benefits outweigh the cost. I just want to raise these concerns together with noting that the alleged benefits of Unicode identifiers are usually postulated rather than demonstrated with factual support.
I'm sorry, I can't see this as anything other than FUD when the vast majority of programming languages have had no issue with this for decades.
The benefit is concrete. It means I don't have to figure out substitutions and respellings for identifiers based on ad-hoc romanization schemes that will likely be impenetrable to people reading the code. It means that the language isn't enforcing anglocentric constraints on me based on some decades-old American standard that was never even enough to cover the official languages in my country (Canada). What I find in discussions like this is that the benefits are very clearly stated and backed up with real facts, but then dismissed with: “Well, you shouldn't be doing that anyway.”
I have already provided the example of Swift as a mainstream language with Unicode safety concerns. I am merely providing facts. Dismissing it as FUD lacks substance. It could be useful, for example, to look at the Rust discussion here: rust-lang/rust#28979 One of the things they add is compiler detection of homoglyph attacks by building this list into the compiler itself: http://www.unicode.org/Public/security/revision-06/confusables.txt This solidly refutes both the idea that Unicode-based attacks are not something necessary to consider, and the idea that Unicode support does not add any complexity.
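For a sense of what that detection involves, here is a hedged Python sketch of a confusable-identifier lint. The mapping below is a tiny, hand-picked subset standing in for confusables.txt (the real file maps thousands of characters to skeleton forms), so treat it as an illustration of the technique rather than a faithful implementation of UTS #39:

```python
# Illustrative subset of Unicode's confusables data, mapping look-alike
# characters to a common "skeleton" character.
CONFUSABLE = {
    "\u0410": "A",  # CYRILLIC CAPITAL A  -> LATIN A
    "\u0430": "a",  # CYRILLIC SMALL A    -> LATIN a
    "\u0421": "C",  # CYRILLIC CAPITAL ES -> LATIN C
    "\u043e": "o",  # CYRILLIC SMALL O    -> LATIN o
    "\u0456": "i",  # CYRILLIC SMALL BYELORUSSIAN-UKRAINIAN I -> LATIN i
}

def skeleton(ident: str) -> str:
    return "".join(CONFUSABLE.get(ch, ch) for ch in ident)

def confusable_pairs(identifiers: list) -> list:
    """Report identifiers that are distinct but share a skeleton form."""
    seen = {}
    pairs = []
    for ident in identifiers:
        key = skeleton(ident)
        if key in seen and seen[key] != ident:
            pairs.append((seen[key], ident))
        else:
            seen.setdefault(key, ident)
    return pairs

print(confusable_pairs(["Active", "\u0410ctive", "total"]))
# [('Active', 'Аctive')]  -- the second one starts with Cyrillic А
```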
No, that is dependent on you actually using the Unicode in practice. You are unlikely to use Unicode in libraries intended for a wider audience because of the difficulty of typing said characters. That leaves internal projects and teaching as potential use. So what I would like to see is whether they're actually used that way in practice and if there is any proven gain. Like: "group 1 was taught using variable names with ASCII and group 2 was taught using variable names in unicode using the native alphabet". It might a priori seem like it would be helpful, but I have not seen any evidence that it actually has a tangible benefit. And you are unwilling to provide it. But as this issue is closed anyway, I shouldn't waste more time arguing this, as I have no real vested interest in the outcome.
Earlier you talked about people running wild with operators, not anything to do with safety.
I was a big part of that discussion and have been following the progress there for years. The conclusion was that this feature is worthwhile, especially when lints can recognize similar-looking characters.
The fact that action was taken to prevent a possibility does not demonstrate that the possibility is likely. If people are worried about it, it must be a big problem? This is completely backwards reasoning. If something is a big problem, you should be worried about it.
I most definitely would.
I don't know what kind of evidence it's even possible to provide for this. Some sort of study showing that it makes the code better? That people are more productive with it? It's an ergonomics feature that makes the language less annoying to use. You're not going to find some sort of proof that it improves any sort of statistic by 20%.
I'd like to bring up the zen of Zig with regard to this issue.
Check iology's example too. You don't even have to speak Chinese to tell the status quo is harder to read. Here is a trivial snippet in Japanese.

Status quo:

```zig
var @"今日" = "土曜日";
var @"平日ですか" = @"平日を確認"(@"今日");
try stdout.print("今日は平日{}", .{@"平日ですか".@"ブール"});
```

With Unicode support:

```zig
var 今日 = "土曜日";
var 平日ですか = 平日を確認(今日);
try stdout.print("今日は平日{}", .{平日ですか.ブール});
```
Maybe there is some technical limitation that makes this issue not feasible, I don't know. But I think the only reason this issue isn't more of a priority is just because there aren't any non-English speakers who use Zig right now. If you don't ever plan on changing that, then by all means continue with the status quo.
ISO C has supported Unicode identifiers since C99 (though the only compiler I know of that implemented them at the time is Plan9's kencc; e.g. GCC didn't support them until last year, and I suspect MSVC still does not). How "low-level" a language is should have no impact on this; you can meaningfully use Unicode labels in assembly language.
If people who primarily speak other languages aren't interested in using Zig without being able to conveniently use their native language, this is not a chicken-and-egg problem. Unicode support would necessarily come first. This may not be the case, and I'm not saying it is the case; assuming we know for sure why people do or do not use the language is bound to result in bad decisions.
The two obvious ones, for anything math-related, that I do in C: Instead of e.g.
Whenever people argue that they want to add math symbols to their code, I would like to point out that the way I get math symbols is by opening the "emoji & symbol" popup and searching for them there. To me that's hardly an optimal experience. I don't even know where to start if I were typing on a phone or an iPad. Yes, I could do a special binding in my IDE for that symbol. But I am probably going to curse the dev who made me have to do that just in order to use their code. In comparison, "using their own language" seems a much better argument – in that case you can usually just switch to a suitable keyboard layout. The math argument is very poor.
Math symbols (or any symbol) being a pain to type is an input method problem. I don't see why it should be harder (in theory) to type "Δx" than it is to type "deltax". In practice input methods suck, and most English speakers don't have one configured anyway. But you're right, math symbols are a weak argument because giving up the (questionable) convenience of math symbols in code is a rather small cost. The main argument is, as you said "using their own language".
I think that in practice most users of languages written with Latin letters will not be using custom input methods. It's not limited to "English speakers" unless by that you mean "anyone who can speak English, even as a second or third language". Certainly users from Middle Eastern and East Asian countries will have been forced to set up some way to write at least a-z, but that mapping is often very standard – as opposed to math symbols, which add yet another set of symbols they would need to map. A further advantage of the status quo (although it could be improved with the proposed backticks) is that there is less risk of malicious attacks using сharacters that lооk lіke ordinary latin lеtters but in fact use this to obscure shadowing etc.
(For fun, figure out where I used Cyrillic characters in the above text)
You've mentioned using similar-looking characters for malicious attacks before, but I still can't figure out a broad sense for how one would actually pull that off. Could you give an example, or cite some research where these kinds of attacks are discussed? Say you make a library and (maliciously 😈) use Cyrillic "А" in "Аctive", and then I go to use your library and type "Active" when using your API, and then !? ... It doesn't compile. Having, for example, both "Аctive" and "Active" in your code would certainly make it harder to audit. But due to the ubiquitous support of Unicode in modern programming languages, auditors should already be taking measures to distinguish these for the code they review (for any language). You still need to watch out for obfuscations even in ASCII-only code (though not as many are possible): swimmer vs swimner, light vs Iight, etc. Take a look at https://www.ioccc.org/ ; the first few I looked at don't use Unicode. I think the possibility of making confusing or obfuscated code sounds like an attack vector just because, hey, if it can be confusing then maybe someone could trick you with it. In practice, however, avoiding this problem just boils down to the same general practice as with many other issues: don't use code you don't trust.
Pretty sure C99 supports unicode identifiers. It's not just "modern" languages.
I'm not sure if it's been suggested before, but homograph attacks via Unicode identifiers could be pretty much eliminated by making Unicode names opt-in via a compiler flag. The majority of projects could then simply side-step the issue without impinging on the creative freedom of non-English-centric programmers. It should also be kept in mind that homograph attacks via string literals and include paths are already possible, and not easily eliminated without special tooling. The additional risk from Unicode identifier names has to be seen in relation to that.
There are two approaches - LaTeX-like syntax in the editor, to substitute
There are some input methods which make it easy enough, though I've yet to find one I like on a mainstream system.
These are security advisories. In other words they are saying: this could maybe cause problems, so you might as well patch it. "Homoglyph attacks" are not a reason to reject Unicode support; they are an implementation detail for it. Blocking these attacks is trivial: add a few rules to the compiler's tokenizer. All of the examples given are trivial snippets showing how one could make control flow confusing -- forgetting the fact that changing a single identifier isn't enough to actually do anything. To actually do something malicious, it would take a lot more than changing a few variable names. "HaHaHa! My pull request uses a homoglyph to inconspicuously change the program's control flow to carry out my Evil Plan! They'll never notice the 500 lines of other code I added in order to actually do the Evil Plan... surely!" Presenting "homoglyph attacks" as a reason to reject Unicode is just FUD. No one could actually pull off an attack in real life with this. If they could, it would have happened a long long time ago with C, C++, Java, Python, JavaScript, etc., since pretty much every other language out there supports Unicode. In my brief search, the only real cases of homoglyph attacks working are in URLs with phishing. If anyone can find even a single report of a successful real-world attack in source code using this strategy, I'd be very interested to read it. Again, this "attack" just boils down to: don't use code you don't trust.
As a reminder, this issue is closed. I'm not really paying attention to it. |
@Serentty writes in #663 (comment):
I'm opening this issue to be a discussion area about possibly allowing non-ascii identifiers in Zig.
In Go, identifiers can be made of Unicode code points classified as "Letter" or "Number, decimal digit". I don't know how difficult that would be to program into Zig and specify in Zig's grammar spec. Java and JavaScript have similar identifier specifications. Would Zig having rules like that be valuable?
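For a rough sense of what that rule looks like, here is a small Python sketch (illustrative only, not the Go or Zig implementation) of the classification the Go spec describes: a letter (any Unicode "Letter" category character, or underscore) to start, with letters and decimal digits (category Nd) allowed afterwards:

```python
import unicodedata

def is_go_style_identifier(name: str) -> bool:
    """Approximation of Go's rule: identifier = letter { letter | digit },
    where a letter is any Unicode 'L*' category character or '_', and a
    digit is any Unicode category 'Nd' character."""
    def is_letter(ch: str) -> bool:
        return ch == "_" or unicodedata.category(ch).startswith("L")

    def is_digit(ch: str) -> bool:
        return unicodedata.category(ch) == "Nd"

    return (
        len(name) > 0
        and is_letter(name[0])
        and all(is_letter(ch) or is_digit(ch) for ch in name[1:])
    )

assert is_go_style_identifier("平日を確認")
assert is_go_style_identifier("сумма2")
assert not is_go_style_identifier("2сумма")   # can't start with a digit
assert not is_go_style_identifier("x-y")      # '-' is punctuation, not a letter or Nd
```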
In Zig, you can make any sequence of bytes an identifier if you use an extra 3 characters for each identifier, e.g. `var @"你好" = @"世界"();`. This is pretty painful if you're doing this for literally every identifier, but it's at least something.

Backticks are not used in Zig's grammar today, so perhaps we could shorten the 3 characters to 2 like so: `` var `你好` = `世界`(); ``

This looks a bit nicer, mimics SQL identifier escaping, and is much simpler to implement than anything to do with unicode character properties. Would this be a meaningful improvement over `@"你好"`? (This proposal has some details to iron out, but I'd like to get a sense for if this would even be helpful.)