Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tokenizer: capture non-ASCII identifiers #3748

Closed
wants to merge 1 commit into from

Conversation

smola
Copy link
Contributor

@smola smola commented Jul 31, 2017

\w captures only ASCII letters and numbers. Changed to [[:alnum:]]
to capture any Unicode letter and digit. This makes the tokenizer
work properly on non-English based langiages (e.g. 1C Enterprise).

`\w` captures only ASCII letters and numbers. Changed to [[:alnum:]]
to capture any Unicode letter and digit. This makes the tokenizer
work properly on non-English based langiages (e.g. 1C Enterprise).
@Alhadis
Copy link
Collaborator

Alhadis commented Jul 31, 2017

\w captures only ASCII letters and numbers.

That's dependent on the regex engine in question. Ruby uses Oniguruma, which is Unicode-aware by default, IIRC. Did you test to see if \w doesn't match word-characters in other alphabets?

@smola
Copy link
Contributor Author

smola commented Jul 31, 2017

@Alhadis If I run the test I added without the fix, this is its output:

  1) Failure:
TestTokenizer#test_utf8_tokens [test/test_tokenizer.rb:118]:
Expected: ["Функция", "الكون"]
  Actual: []

@pchaigno
Copy link
Contributor

@Alhadis [[:alnum:]] is safer in any case, no?

@Alhadis
Copy link
Collaborator

Alhadis commented Jul 31, 2017

Possibly; it seems like Oniguruma only defaults to Unicode in Atom, but not Ruby... never mind then. :)

@smola
Copy link
Contributor Author

smola commented Dec 2, 2017

Closing. I guess #3846 makes this obsolete.

@smola smola closed this Dec 2, 2017
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants