Improve punctuation tokenization and word split #5060

Merged
merged 3 commits into from
Jan 25, 2021

Conversation

smola
Contributor

@smola smola commented Oct 22, 2020

Description

The tokenizer previously ignored most (but not all) punctuation characters. Now all ASCII printable characters are included in tokens.

These characters are often very salient features for detecting the programming language, in some cases even more so than identifiers.

A sequence of consecutive non-alphanumeric characters will be recognized as a single token (e.g. == != >= <==>) with the following exceptions:

  • Comment tokenization behavior is preserved, so /* or <!-- are still ignored as part of comments. As a side-effect, some edge cases are now fixed (e.g. <!---->, <!--Comment-->).

  • SGML tokens are now more restrictive, so <a>, <?xml>, <!DOCTYPE> are still recognized as SGML, but <!-- or <-- are not. This also simplifies the corresponding lexer action.

  • Parentheses (()), square brackets ([]) and curly brackets ({}) are tokenized independently.
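
A rough Ruby sketch of the punctuation rules above, for illustration only. It is not the generated lexer from this PR: `sketch_tokenize` is a hypothetical helper, the character classes are assumptions, and comment and SGML handling are deliberately left out.

```ruby
require "strscan"

# Hypothetical helper, not Linguist's lexer. Comments and SGML tags are
# not handled here; only the punctuation grouping rules are illustrated.
WORD      = /[A-Za-z0-9_]+/
BRACKET   = /[()\[\]{}]/
# ASCII punctuation other than brackets and underscore (assumed class).
PUNCT_RUN = /[!-'*-\/:-@\\^`|~]+/

def sketch_tokenize(text)
  scanner = StringScanner.new(text)
  tokens = []
  until scanner.eos?
    next if scanner.skip(/\s+/)

    # Brackets are matched one at a time; other punctuation as a whole run.
    token = scanner.scan(WORD) || scanner.scan(BRACKET) || scanner.scan(PUNCT_RUN)
    if token
      tokens << token
    else
      scanner.getch # skip anything outside printable ASCII
    end
  end
  tokens
end

p sketch_tokenize("if (a != b) { x[0] >= 1 }")
# => ["if", "(", "a", "!=", "b", ")", "{", "x", "[", "0", "]", ">=", "1", "}"]
```

Matching brackets one character at a time is what keeps `(){}` from collapsing into a single token, while `!=` and `<==>` stay whole.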

Word splitting is also altered by this change:

  • The characters [.@#/*] are no longer considered part of an identifier. However, a single character from [.@#] is recognized as part of a word if it is the first one. So .globl in Assembly is still a single token, but console.log is now split into console and .log. In Objective-C, @interface will still be a single token, but mail@example.com will be 3 tokens. This change leads to fewer unique tokens in the vocabulary, which helps classification too.
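
The word-splitting rule above can be approximated with a single regular expression. This is a sketch under assumptions: `split_words` is a hypothetical name, the word class `[A-Za-z0-9_]` is a guess, and the real lexer also emits punctuation tokens that this snippet ignores.

```ruby
# Hypothetical helper: a word may start with one of . @ # but cannot
# contain them afterwards. Other punctuation is ignored in this sketch.
WORD_WITH_PREFIX = /[.@#]?[A-Za-z0-9_]+/

def split_words(text)
  text.scan(WORD_WITH_PREFIX)
end

p split_words(".globl")           # => [".globl"]
p split_words("console.log")      # => ["console", ".log"]
p split_words("@interface")       # => ["@interface"]
p split_words("mail@example.com") # => ["mail", "@example", ".com"]
```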

Checklist:

Checklist does not apply.

@smola
Contributor Author

smola commented Oct 22, 2020

Don't pay much attention to the generated lexer files. Note that they currently include changes from #5006 but I'll rebase when that's merged.

In a simple cross-validation run, the error count drops from 58 on master to 45 with this change. The effect is particularly positive for some languages, including C and C++.

diff report
--- master.list	2020-10-22 21:59:07.650376934 +0200
+++ tokenizer-symbols3.list	2020-10-22 21:58:34.013900257 +0200
@@ -6 +6 @@
-AGS Script/KeyboardMovement_102.asc GOOD
+AGS Script/KeyboardMovement_102.asc BAD(Public Key)
@@ -16,2 +16,2 @@
-AsciiDoc/list.asc BAD(Public Key)
-Assembly/3D_PRG.I GOOD
+AsciiDoc/list.asc GOOD
+Assembly/3D_PRG.I BAD(Motorola 68K Assembly)
@@ -21,2 +21,2 @@
-Assembly/fp_sqr32_160_comba.inc BAD(C++)
-Assembly/lib.inc BAD(Motorola 68K Assembly)
+Assembly/fp_sqr32_160_comba.inc GOOD
+Assembly/lib.inc GOOD
@@ -24,2 +24,2 @@
-BitBake/gstreamer-libav.bb BAD(BlitzBasic)
-BitBake/qtbase-native.bb BAD(BlitzBasic)
+BitBake/gstreamer-libav.bb GOOD
+BitBake/qtbase-native.bb GOOD
@@ -34 +34 @@
-C++/16F88.h GOOD
+C++/16F88.h BAD(C)
@@ -36 +36 @@
-C/array.h BAD(Objective-C)
+C/array.h GOOD
@@ -44 +44 @@
-C/bootstrap.h BAD(Objective-C)
+C/bootstrap.h GOOD
@@ -48 +48 @@
-C++/ClasspathVMSystemProperties.inc BAD(Pawn)
+C++/ClasspathVMSystemProperties.inc GOOD
@@ -75 +75 @@
-C++/libcanister.h GOOD
+C++/libcanister.h BAD(C)
@@ -77 +77 @@
-C++/metrics.h BAD(C)
+C++/metrics.h GOOD
@@ -80,2 +80,2 @@
-C/Nightmare.h GOOD
-C/ntru_encrypt.h BAD(C++)
+C/Nightmare.h BAD(C++)
+C/ntru_encrypt.h GOOD
@@ -114 +114 @@
-C++/rpc.h BAD(C)
+C++/rpc.h GOOD
@@ -118 +118 @@
-C/syscalldefs.h BAD(Objective-C)
+C/syscalldefs.h GOOD
@@ -166 +166 @@
-Formatted/long_seq.for BAD(Forth)
+Formatted/long_seq.for GOOD
@@ -172 +172 @@
-Fortran/bug-185631.f GOOD
+Fortran/bug-185631.f BAD(Forth)
@@ -197 +197 @@
-GLSL/recurse1.fs GOOD
+GLSL/recurse1.fs BAD(Filterscript)
@@ -253 +253 @@
-INI/defaults.properties BAD(Java Properties)
+INI/defaults.properties GOOD
@@ -266,2 +266,2 @@
-Java Properties/libraries.properties GOOD
-Java Properties/sounds.properties GOOD
+Java Properties/libraries.properties BAD(INI)
+Java Properties/sounds.properties BAD(INI)
@@ -294,2 +294,2 @@
-Mathematica/HeyexImport.m BAD(Objective-C)
-Mathematica/Init.m GOOD
+Mathematica/HeyexImport.m BAD(MATLAB)
+Mathematica/Init.m BAD(Objective-C)
@@ -298 +298 @@
-Mathematica/PacletInfo.m BAD(Mercury)
+Mathematica/PacletInfo.m GOOD
@@ -385 +385 @@
-M/ZDIOUT1.m BAD(Mercury)
+M/ZDIOUT1.m GOOD
@@ -408 +408 @@
-NewLisp/log-to-database.lisp BAD(Common Lisp)
+NewLisp/log-to-database.lisp GOOD
@@ -429 +429 @@
-Objective-C/Siesta.h BAD(C)
+Objective-C/Siesta.h BAD(C++)
@@ -459 +459 @@
-Pawn/Check.inc BAD(PHP)
+Pawn/Check.inc GOOD
@@ -466 +466 @@
-Perl/exception_handler.pl BAD(Raku)
+Perl/exception_handler.pl GOOD
@@ -505 +505 @@
-Proguard/proguard-rules2.pro GOOD
+Proguard/proguard-rules2.pro BAD(IDL)
@@ -532 +532 @@
-Raku/01-dash-uppercase-i.t BAD(Perl)
+Raku/01-dash-uppercase-i.t GOOD
@@ -536 +536 @@
-Raku/A.pm BAD(Perl)
+Raku/A.pm GOOD
@@ -539,2 +539,2 @@
-Raku/calendar.t BAD(Perl)
-Raku/ContainsUnicode.pm BAD(Perl)
+Raku/calendar.t GOOD
+Raku/ContainsUnicode.pm GOOD
@@ -558 +558 @@
-Rebol/booters.r GOOD
+Rebol/booters.r BAD(R)
@@ -561 +561 @@
-R/hello-r.R BAD(Rebol)
+R/hello-r.R GOOD
@@ -579 +579 @@
-SaltStack/eval.sls BAD(Scheme)
+SaltStack/eval.sls GOOD
@@ -599 +599 @@
-Smalltalk/scriptWithPragma.st BAD(HTML)
+Smalltalk/scriptWithPragma.st GOOD
@@ -623 +623 @@
-Text/LIDARLite.ncl GOOD
+Text/LIDARLite.ncl BAD(NCL)

The last error (LIDARLite.ncl) is not relevant. See #5054.
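
For reference, error totals like the ones quoted above can be tallied from such a list with a few lines of Ruby. This sketch assumes each line has the form `path GOOD` or `path BAD(Predicted Language)`, as in the report; `count_errors` and `errors_by_prediction` are hypothetical helpers, not part of the PR.

```ruby
# Hypothetical helpers for reports with lines such as
#   "C/array.h GOOD" or "C/array.h BAD(Objective-C)".
BAD_LINE = / BAD\((.+)\)\s*\z/

def count_errors(report)
  report.each_line.count { |line| line.match?(BAD_LINE) }
end

# Groups misclassifications by the language that was wrongly predicted.
def errors_by_prediction(report)
  report.each_line.each_with_object(Hash.new(0)) do |line, tally|
    match = line.match(BAD_LINE)
    tally[match[1]] += 1 if match
  end
end

# A few entries taken from the master side of the diff above.
report = <<~LIST
  C/array.h BAD(Objective-C)
  C/bootstrap.h BAD(Objective-C)
  C++/metrics.h BAD(C)
  Assembly/3D_PRG.I GOOD
LIST

p count_errors(report)                        # => 3
p errors_by_prediction(report)["Objective-C"] # => 2
```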

@stale

stale bot commented Dec 25, 2020

This pull request has been automatically marked as stale because it has not had recent activity, and will be closed if no further activity occurs. If this pull request was overlooked, forgotten, or should remain open for any other reason, please reply here to call attention to it and remove the stale status. Thank you for your contributions.

@stale stale bot added the Stale label Dec 25, 2020
@lildude lildude removed the Stale label Jan 4, 2021
@lildude lildude requested a review from a team as a code owner January 6, 2021 11:44
@lildude
Member

lildude commented Jan 8, 2021

@smola it looks like the changes here alter the detection of one of the samples, which causes the only failing test blocking this PR from being merged:

  1) Failure:
TestClassifier#test_classify_ambiguous_languages [/home/runner/work/linguist/linguist/test/test_classifier.rb:56]:
/home/runner/work/linguist/linguist/samples/AGS Script/KeyboardMovement_102.asc
[["Public Key", -7.575176404971187], ["AsciiDoc", -10.933410495077375], ["AGS Script", -13.908242416753797]].
Expected: "AGS Script"
  Actual: "Public Key"

Could you please look into why this is? 🙇
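
To read the failure output: the list appears to hold ranked candidate languages with scores, best first. Assuming the scores are log-probability-like values where a higher (less negative) score wins, which the output suggests but the thread does not state, the reported result can be reproduced like this. `best_guess` is a hypothetical helper.

```ruby
# Scores copied from the failing test output above.
scores = [
  ["Public Key", -7.575176404971187],
  ["AsciiDoc", -10.933410495077375],
  ["AGS Script", -13.908242416753797]
]

# Hypothetical helper: picks the language with the highest score.
def best_guess(scores)
  scores.max_by { |_language, score| score }.first
end

p best_guess(scores) # => "Public Key"
```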

@smola
Contributor Author

smola commented Jan 14, 2021

@lildude There were a couple of problems that caused incorrect eating of all tokens for some files (bad C comment detection). Those are now fixed. Remaining errors were on things like SQL, which is quite reasonable. I added the test change to skip files with extensions that have a catch-all rule, like .sql. I was going to introduce that in #5061 anyway.

So I think this is ready.
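
The test change mentioned above, skipping samples whose extension has a catch-all rule, might look roughly like the following. This is a sketch, not the code from the PR: the constant, the helper name, and the sample paths are hypothetical, and `.sql` is the only extension the comment names.

```ruby
# Hypothetical list; only ".sql" is named in the discussion.
CATCH_ALL_EXTENSIONS = [".sql"].freeze

# Returns true when a sample should still go through the classifier test.
def classifier_testable?(path)
  !CATCH_ALL_EXTENSIONS.include?(File.extname(path).downcase)
end

# Hypothetical sample paths, for illustration only.
samples = ["samples/C/array.h", "samples/SQL/example.sql"]
p samples.select { |path| classifier_testable?(path) } # => ["samples/C/array.h"]
```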

assert_equal %w(foo <!--Comment-->), tokenize("foo <!--Comment-->")
assert_equal %w(foo <!--Comment-->), tokenize("foo<!--Comment-->")
assert_equal %w(foo <!---->), tokenize("foo<!---->")
assert_equal %w(foo), tokenize("foo <!--Comment-->")
Contributor Author

Fixing this was an unintended effect of this PR 😉

Member

@lildude lildude left a comment

LGTM

@lildude lildude merged commit 3448237 into github-linguist:master Jan 25, 2021
@smola smola deleted the tokenizer-symbols3 branch January 25, 2021 12:08
@github-linguist github-linguist deleted a comment Aug 2, 2021
@github-linguist github-linguist deleted a comment Aug 2, 2021
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024