Tracking issue for non-ASCII identifiers (feature "non_ascii_idents") #28979

DemiMarie · 2015-10-12T01:50:52Z

Non-ASCII identifiers are currently feature gated. Handling of them should be fixed and the feature gate removed.

steveklabnik · 2015-10-29T20:55:48Z

/cc @rust-lang/lang

pnkfelix · 2015-10-29T22:10:49Z

nominating

nrc · 2015-10-30T14:08:57Z

cc @SimonSapin

Apparently we implement this: http://www.unicode.org/reports/tr31/ or something like it.

I would like to see this stabilised, but it will take some work to persuade ourselves that we are doing the right thing.

SimonSapin · 2015-10-30T16:08:10Z

I have no idea what the right thing is here. In addition to Unicode recommendations, we might want to look at what other languages actually do, and what related bug reports or criticism they get. Or was this already done when the feature was first introduced?

petrochenkov · 2015-10-30T16:23:09Z

@SimonSapin
C and C++ use http://unicode.org/reports/tr31/#Alternative_Identifier_Syntax (with some minor restrictions) and I haven't seen any complaints about it on isocpp forums or issue lists :)
Overview of the problem: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1518.htm
Implementation in Clang: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Lex/UnicodeCharSets.h?view=markup
cc #4928

There's also a problem with normalization of identifiers and mapping unicode mod names to the filesystem names (on OS X, IIRC), ~~but I can't find the relevant link~~ here it is: #2253. (In the worst case non-inline mods and extern crates can be forced to be ASCII)
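
To make the normalization concern concrete, here is a minimal Rust sketch using only the standard library. It assumes a hypothetical module named "café"; the names `utf8_bytes` and `same_without_normalization` are made up for illustration. The same visible name can be spelled precomposed (U+00E9) or decomposed (e followed by U+0301), and the two spellings are different byte strings, so a byte-wise file name lookup could miss a file whose name was stored in the other form.

```rust
/// Returns the UTF-8 bytes of a name, as a byte-for-byte comparison sees them.
fn utf8_bytes(name: &str) -> Vec<u8> {
    name.as_bytes().to_vec()
}

/// Hypothetical module name, precomposed: "caf" + U+00E9.
const PRECOMPOSED: &str = "caf\u{00E9}";
/// Same visible name, decomposed: "cafe" + U+0301 (combining acute accent).
const DECOMPOSED: &str = "cafe\u{0301}";

/// True when two names are byte-for-byte equal, i.e. with no normalization applied.
fn same_without_normalization(a: &str, b: &str) -> bool {
    utf8_bytes(a) == utf8_bytes(b)
}
```

Comparing `PRECOMPOSED` with `DECOMPOSED` this way reports them as different names, even though they render identically.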

pnkfelix · 2015-11-01T09:00:19Z

Yes, #2253 is the big issue I know of that makes me worry about premature stabilization of non-ASCII identifiers.

(The discussion there is more broad and arguably could be forked off into two threads; e.g. we could take one normalization path for identifiers and another for string literal contents.)

pnkfelix · 2015-11-01T09:05:00Z

We may want to migrate this discussion to the RFCs repo, e.g. at rust-lang/rfcs#802.

bstrie · 2015-11-04T21:29:24Z

I agree that this is a feature that deserves to be put through the RFC process.

aturon · 2015-11-04T22:51:48Z

I've repurposed this issue to track stabilization (or deprecation, etc) of the non_ascii_idents feature gate.

nikomatsakis · 2015-11-05T22:23:36Z

After discussion in the lang team meeting, we decided that yes, an RFC would be the proper way forward here. We need something that collects the solutions from other languages, analyzes their pros/cons, and suggests the appropriate choice for Rust. This is controversial and complex enough that it should be brought to the community at large -- especially as many of us hacking on Rust on a daily basis don't have a lot of experience with non-ASCII anyhow.

nikomatsakis · 2015-11-05T22:24:09Z

triage: P-low

Marking as low as there is no RFC at present and hence no actionable content.

huonw · 2015-11-05T22:24:26Z

cc #7539

kberov · 2017-01-08T00:38:11Z

In JavaScript, Perl 5 and Perl 6 this feature is available.
JavaScript (Firefox 50)

function Слово(стойност) {
  this.стойност = стойност;
}
var здрасти = new Слово("Здравей, свят");
console.log(здрасти.стойност) //Здравей, свят

Perl >=5.12

use utf8;
use feature 'say'; # say is not enabled by default
{
  package Слово;
  sub new {
    my $self = bless {}, shift;
    $self->{стойност} = shift;
    $self
  }
};
my $здрасти = Слово->new("здравей, свят");
say ucfirst($здрасти->{стойност}); #Здравей, свят

Perl 6 (this is not just the next version of Perl; it is a new language)

class Слово {
  has $.стойност;
}

my $здрасти = Слово.new(стойност => 'здравей, свят');
say $здрасти.стойност.tc; #Здравей, свят

I would be happy to see it in Rust too.

SimonSapin · 2017-01-08T02:58:14Z

For what it’s worth identifiers in ECMAScript 2015 are based on the Default Identifier Syntax from Unicode Standard Annex #31.

Perl with use utf8; uses the regexp below, with XID_Start and XID_Continue presumably also from UAX #31.

/ (?[ ( \p{Word} & \p{XID_Start} ) + [_] ])
        (?[ ( \p{Word} & \p{XID_Continue} ) ]) *    /x

kberov · 2017-01-08T11:38:17Z

Yes! Thanks @SimonSapin!

SimonSapin · 2017-01-08T12:16:54Z

For Python it’s <XID_Start> <XID_Continue>*.

So it looks like many programming languages that allow non-ASCII identifiers are based on the same standard, but in the details they each do something slightly different…
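
The shared `<XID_Start> <XID_Continue>*` shape can be sketched in Rust as follows. This is only an approximation: the standard library does not expose the XID_Start and XID_Continue properties, so the sketch substitutes `char::is_alphabetic` and `char::is_alphanumeric`, which accept a somewhat different set of characters. The name `looks_like_identifier` is hypothetical, and accepting a leading underscore is an assumption borrowed from the Perl pattern above.

```rust
/// Approximates the `<XID_Start> <XID_Continue>*` shape.
/// NOTE: `is_alphabetic` / `is_alphanumeric` are stand-ins (an assumption of
/// this sketch); they are not the real XID_Start / XID_Continue properties.
fn looks_like_identifier(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        // First character: a letter-like character or an underscore.
        Some(first) if first.is_alphabetic() || first == '_' => {
            // Remaining characters: letter-like, numeric, or underscore.
            chars.all(|c| c.is_alphanumeric() || c == '_')
        }
        // Empty string, or a first character that cannot start an identifier.
        _ => false,
    }
}
```

A production lexer would consult the actual Unicode property tables instead of these stand-ins.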

mjbshaw · 2017-02-08T05:53:15Z

I would personally love to see support for math-related identifiers. For example, ∅ (and set operators, like ∩ and ∪). Translating equations from research papers/specifications into code is often a terrible process resulting in verbose and difficult to read code. Being able to use the same identifiers in the code that are in the paper's math equations would simplify implementation and would make the code easier to check and compare against the paper's equations.

DoumanAsh · 2017-03-17T22:06:34Z

What's the point of this feature exactly? Aside from adding the possibility of creating a truly ugly mix of different languages in your code (English is the only truly international language), it brings no benefit to the language functionality-wise. Or is it support for Unicode for the sake of supporting Unicode?

steveklabnik · 2017-11-09T12:42:16Z

I'd like to cross-link this comment: #4928 (comment)

gnzlbg · 2018-01-17T12:11:04Z

I haven't seen the possibility of homoglyph-based attacks mentioned here (if somebody has mentioned them, please ignore the noise), but I just filed a clippy issue to request a lint that warns on code like this:

#![feature(non_ascii_idents)]
fn main() {
    let a = 2;
    let а = 3;
    assert_eq!(a, 2);  // OK
    assert_eq!(а, 3);  // OK
}

In a nutshell, those two "a"s are different Unicode characters, so the second let binding does not shadow the first one, and both asserts pass (the playground doesn't seem to support Unicode identifiers, though, so the only way to try this is locally; works for me).
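
The difference is easy to check with the standard library alone. The sketch below uses escapes rather than literal identifiers so that the two characters are unambiguous. `non_ascii_positions` is a hypothetical helper that reports where an identifier leaves ASCII, which is one crude way to spot the substitution.

```rust
/// Latin small letter a (U+0061).
const LATIN_A: char = '\u{0061}';
/// Cyrillic small letter a (U+0430), near-identical to the Latin one in most fonts.
const CYRILLIC_A: char = '\u{0430}';

/// Returns (character index, character) for every non-ASCII character in `ident`.
fn non_ascii_positions(ident: &str) -> Vec<(usize, char)> {
    ident
        .chars()
        .enumerate()
        .filter(|(_, c)| !c.is_ascii())
        .collect()
}
```

Flagging every non-ASCII character is far too blunt for a real lint, since it would also flag legitimate non-English identifiers; it only shows that the two characters are distinguishable mechanically.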

This "feature" can be used to introduce exploits in Rust programs that are harder to detect, in particular given that shadowing let bindings is considered idiomatic Rust by many, myself included.

P.S.: this "feature" might be useful in underhanded Rust contests, although that #![feature(non_ascii_idents)] should raise some eyebrows :)

ketsuban · 2018-01-17T19:12:03Z

@gnzlbg I believe there's already some support for confusables detection to stop people swapping out your semicolons for Greek question marks and such, but I don't know if it applies to identifiers. If it does, then that solves that problem; if it doesn't, at least we have the tooling to do it ready to go.

I'm a little concerned that this is a candidate for being closed and the code removed from the compiler because it's not had significant movement for a while and requires an RFC. I care a fair amount about Rust being a language of the 21st century, which means Unicode, and about Rust being friendly to non-English-speaking programmers. What I lack is the ability to actually write an RFC.

gnzlbg · 2018-01-18T09:43:14Z

@ketsuban

> I believe there's already some support for confusables detection to stop people swapping out your semicolons for Greek question marks and such, but I don't know if it applies to identifiers.

Yes, I think that if, as suggested by @oli-obk in the clippy issue, the Rust implementation just used the latest official confusables list:

http://www.unicode.org/Public/security/revision-06/confusables.txt

then homoglyph-based attacks can be prevented. This list would need to be kept in sync, though that is something that can be automated as part of the build system.
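
A confusables check along these lines could be sketched as below. The sketch assumes a tiny hand-picked table standing in for the real confusables.txt data, and it skips the normalization steps a full implementation would need. `sample_confusables`, `skeleton` and `are_confusable` are hypothetical names. Each character is mapped to a prototype, and two distinct identifiers whose mapped forms coincide are flagged.

```rust
use std::collections::HashMap;

/// A handful of confusable pairs, hand-picked for illustration only.
/// A real check would load the full table from confusables.txt.
fn sample_confusables() -> HashMap<char, char> {
    HashMap::from([
        ('\u{0430}', 'a'), // Cyrillic small a
        ('\u{0435}', 'e'), // Cyrillic small ie
        ('\u{043E}', 'o'), // Cyrillic small o
        ('\u{03BF}', 'o'), // Greek small omicron
        ('\u{0440}', 'p'), // Cyrillic small er
        ('\u{0441}', 'c'), // Cyrillic small es
    ])
}

/// Maps every character to its prototype, leaving unknown characters as-is.
fn skeleton(ident: &str, table: &HashMap<char, char>) -> String {
    ident
        .chars()
        .map(|c| *table.get(&c).unwrap_or(&c))
        .collect()
}

/// Two identifiers are flagged when they differ but share a skeleton.
fn are_confusable(a: &str, b: &str, table: &HashMap<char, char>) -> bool {
    a != b && skeleton(a, table) == skeleton(b, table)
}
```

With this table, the Latin and Cyrillic "a" from the let-binding example are reported as confusable, while two genuinely different names are not.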

gnzlbg · 2018-01-18T09:45:55Z

@ketsuban

If you care about this, there are other languages that support unicode in their identifiers, and these languages have processes similar to the RFC process. You could start by checking those. Who knows, maybe you can just merge them together with the feedback in this issue, and get a pre-RFC in the internals forum going? From that point on, it is just about incorporating/arguing feedback with others, and before you know it you will have an RFC ready.

Commit message of "Fix grammar documentation wrt Unicode identifiers", which references this issue:

The grammar defines identifiers in terms of XID_start and XID_continue, but this is referring to the unstable non_ascii_idents feature. The documentation implies that non_ascii_idents is forthcoming, but this is left over from pre-1.0 documentation; in reality, non_ascii_idents has been without even an RFC for several years now, and will not be stabilized anytime soon. Furthermore, according to the tracking issue at rust-lang#28979, it's highly questionable whether or not this feature will use XID_start or XID_continue even when or if non_ascii_idents is stabilized. This commit fixes this by respecifying identifiers as the usual [a-zA-Z_][a-zA-Z0-9_]*
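
For reference, the ASCII-only rule `[a-zA-Z_][a-zA-Z0-9_]*` from the commit message above can be checked without a regex crate. This is a sketch only: `is_ascii_identifier` is a hypothetical name, and the function checks just the character pattern, not whether the name is a keyword.

```rust
/// Checks `s` against the pattern `[a-zA-Z_][a-zA-Z0-9_]*`.
fn is_ascii_identifier(s: &str) -> bool {
    let mut bytes = s.bytes();
    match bytes.next() {
        // First byte: ASCII letter or underscore.
        Some(b) if b.is_ascii_alphabetic() || b == b'_' => {
            // Remaining bytes: ASCII letter, digit, or underscore.
            bytes.all(|b| b.is_ascii_alphanumeric() || b == b'_')
        }
        // Empty string, or a first byte that cannot start an identifier.
        _ => false,
    }
}
```

Working on bytes is enough here because every byte of a multi-byte UTF-8 sequence is non-ASCII and is therefore rejected.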

mitsuhiko · 2018-07-21T17:54:22Z

In a way I hope we stick with ASCII identifiers forever. Handling Unicode identifiers is such a massive interoperability pain. One of the more bizarre consequences of NFKC mapping is that things like these map to the same identifier:

>>> ℌ = 1
>>> H
1
>>> Ⅸ = 42
>>> IX
42
>>> ℕ = 23
>>> N
23
>>> import math
>>> ℯ = math.e
>>> e
2.718281828459045
>>> ℨ = 2
>>> Z
2
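
Python normalizes identifiers to NFKC while parsing, which is why each pair above names the same variable. The Rust standard library has no normalization support, so the sketch below uses a hand-written table covering only the five characters from this session (an assumption made for illustration; a real implementation would use the full Unicode data). `compat_mapping` and `fold_compat` are hypothetical names.

```rust
/// Compatibility foldings for the five characters in the session above.
/// Hand-written for illustration; not the full Unicode mapping.
fn compat_mapping(c: char) -> Option<&'static str> {
    match c {
        '\u{210C}' => Some("H"),  // BLACK-LETTER CAPITAL H
        '\u{2168}' => Some("IX"), // ROMAN NUMERAL NINE
        '\u{2115}' => Some("N"),  // DOUBLE-STRUCK CAPITAL N
        '\u{212F}' => Some("e"),  // SCRIPT SMALL E
        '\u{2128}' => Some("Z"),  // BLACK-LETTER CAPITAL Z
        _ => None,
    }
}

/// Replaces each listed character with its folded form; others pass through.
fn fold_compat(ident: &str) -> String {
    let mut out = String::new();
    for c in ident.chars() {
        match compat_mapping(c) {
            Some(folded) => out.push_str(folded),
            None => out.push(c),
        }
    }
    out
}
```

Under NFC, by contrast, none of these five characters would change, because NFC does not apply compatibility mappings.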

Serentty · 2018-07-24T15:23:11Z

@mitsuhiko The real world has that kind of pain. We can't just ignore this problem because it's hard to deal with and involves a feature that you personally have no use for.

Ixrec · 2018-07-28T16:19:45Z

Also, the current RFC explicitly proposes NFC over NFKC, after a lot of discussion about examples very similar to those.

Centril · 2018-10-29T10:44:11Z

Closing in favor of #55467.

Issue rust-lang#28979 was closed with a link to rust-lang#55467.

arcturusannamalai · 2021-01-17T07:32:50Z

It's 2021 and we should be more inclusive in language design and in what's allowed in identifiers. Coming from the Python world, Python 3's support for Unicode identifiers, function names, and module names is truly great progress, and I wish this for the Rust community as well.

steveklabnik · 2021-01-20T15:21:05Z

@arcturusannamalai please see the final comment here, this work is still ongoing, and including that is the plan.

sanmai-NL · 2023-05-09T11:39:20Z

> It's 2021 and we should be more inclusive in language design and in what's allowed in identifiers. Coming from the Python world, Python 3's support for Unicode identifiers, function names, and module names is truly great progress, and I wish this for the Rust community as well.

CPython 3's support for non-ASCII identifiers is pretty spotty, variable and hard to determine. See https://tjol.eu/blog/unicode-identifiers.html

One interesting problem is that with Unicode changes between versions, valid identifiers in Python source code can become invalid.

arcturusannamalai · 2023-05-09T13:43:29Z

I'm actually comfortable programming in 1s and 0s too; I just happen to want non-ASCII... but English suffices for systems-level work.

steveklabnik · 2023-05-09T15:46:15Z

Just to be clear, this feature landed in stable Rust in 1.53.0, almost two years ago: #83799

steveklabnik added the A-lang label Oct 29, 2015

pnkfelix added the T-lang label (Relevant to the language team, which will review and decide on the PR/issue.) Oct 29, 2015

pnkfelix added the I-nominated label Oct 29, 2015

aturon changed the title ~~Fix non-ASCII identifiers~~ Tracking issue for non-ASCII identifiers (feature "non_ascii_idents") Nov 4, 2015

aturon added the B-unstable label (Blocker: Implemented in the nightly compiler and unstable.) Nov 4, 2015

rust-highfive added the P-low (Low priority) label and removed the I-nominated label Nov 5, 2015

ogham mentioned this issue Mar 18, 2016: Implement Color alias to Colour ogham/rust-ansi-term#11 (Merged)

8573 mentioned this issue Oct 24, 2016: What about named identifiers in local language? rust-lang/rfcs#1776 (Closed)

dstu mentioned this issue Jan 18, 2017: Support for Postgres enums diesel-rs/diesel#580 (Closed)

SimonSapin mentioned this issue Feb 1, 2017: Tracking issue for RFC 1566: Procedural macros #38356 (Closed)

SimonSapin mentioned this issue Feb 19, 2017: Tracking issue for 1.0.0 tracking issues #39954 (Closed)

zackmdavis mentioned this issue Feb 4, 2018: RFC: Rust 2018 Roadmap rust-lang/rfcs#2314 (Merged)

bstrie mentioned this issue May 16, 2018: Fix grammar documentation wrt Unicode identifiers #50790 (Merged)

zbraniecki mentioned this issue May 21, 2018: Relax variant-name grammar projectfluent/fluent#90 (Open)

SimonSapin mentioned this issue Jun 2, 2018: Allow non-ASCII identifiers rust-lang/rfcs#2455 (Closed)

pyfisch mentioned this issue Jun 3, 2018: Allow non-ASCII identifiers rust-lang/rfcs#2457 (Merged)

nibags mentioned this issue Jun 25, 2018: Add unicode escapes, allow non-ASCII identifiers & others improvements zargony/atom-language-rust#136 (Open)

fbstj mentioned this issue Jul 16, 2018: update all dates in state-of-rust features table rust-lang/rust-forge#156 (Closed)

Centril closed this as completed Oct 29, 2018

ids1024 mentioned this issue Dec 28, 2018: Update references to closed issue #57159 (Merged)

kennytm pushed a commit to kennytm/rust that referenced this issue Dec 29, 2018: Update references to closed issue (0c58eec)

kennytm added a commit to kennytm/rust that referenced this issue Dec 29, 2018: Rollup merge of rust-lang#57159 - ids1024:closed-issue, r=Centril (291d51c)

JohnHeitmann pushed a commit to JohnHeitmann/rust that referenced this issue Jan 5, 2019: Update references to closed issue (efd19d5)

lerno mentioned this issue Mar 26, 2021: consider allowing non-ascii identifiers ziglang/zig#3947 (Closed)
