-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Various unicode warnings [zero-width-space, unicode_canon, ascii_only] #85
Comments
In literals, or in general? Catching these in general seems a bit hard to do for clippy. |
I just tested, and they're not allowed in identifier names, so that only leaves literals, right? |
Oh, I thought they may count as regular spaces as far as separating tokens goes. That sounds good. |
Could this be generalized to also check for non-canonical Unicode representations? Or higher ranged Unicode glyphs in general (perhaps Allow by default)? |
I was thinking the same thing, really. Though UTF8 files mean that it's okay to do this anyway. |
I'd rather zero-width spaces be a specific lint, since it'd be useful to say "Hey, there's invisible characters here." An |
@Havvy Full ack. Since the default behaviour should be different anyway, it makes sense to implement multiple lints for the different cases. This also enables users to set So:
If multiple lints would generate warnings for a piece of code, the lint with the higher level wins, then the more specific. |
I just thought about another possible unicode warning: Detect right-to-left strings (or at least mixed left-to-right/right-to-left strings). We should check that on the whole crate, including comments. Note that this trick has been used e.g. in shell scripts to hide malicious code, so the check probably even belongs in the compiler. I'll see if I can whip up a test case. |
I added the non-ascii lint in the linked PR. BTW, in the currently inactive test case the combining accent is at the wrong position, and applies to the opening quote :) |
What might be interesting is to recognize "problematic" characters that look basically the same as ASCII characters, e.g. some Greek and Cyrillic ones. (These are often used to impersonate legitimate file names and formerly DNS names.) But I don't have a list of those handy. |
We could use the same criteria punycode uses. |
But I'm not sure if we really need to do that. |
@birkenfeld yeah I got that one wrong, fixed it. Anyway, the idea of |
Why is NFC to be preferred? |
Because NFD would standardize on |
Unicode lints, second attempt: Lint whole strings, help with replacement This fixes #85
This program should warn that zero-width spaces are being used directly and should be replaced with their unicode escape code (
\u{200B}
).See Unicode spaces for more information on them.
The text was updated successfully, but these errors were encountered: