Improve punctuation tokenization and word split #5060

smola · 2020-10-22T20:08:17Z

Description

The tokenizer previously ignored most (but not all) punctuation characters. Now all ASCII printable characters are included in tokens.

These characters are often very salient features when detecting the programming language, in some cases even more so than identifiers.

A sequence of consecutive non-alphanumeric characters will be recognized as a single token (e.g. == != >= <==>) with the following exceptions:

- Comment tokenization behavior is preserved, so comment delimiters such as /* are still handled by the comment rules.
- SGML tokens are now more restrictive: <a>, <?xml> and <!DOCTYPE> are still recognized as SGML, but <!-- and <-- are not. This also simplifies the corresponding lexer action.
- Parentheses (()), square brackets ([]) and curly brackets ({}) are tokenized independently.

Word splitting is also altered by this change:

- [.@#/*] are no longer considered part of an identifier. However, a single character from [.@#] is recognized as part of a word if it is the first character. So .globl in Assembly is still a single token, but console.log is now split into console and .log. In Objective-C, @interface is still a single token, but mail@example.com becomes 3 tokens. This change leads to fewer unique tokens in the vocabulary, which helps classification too.
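
For illustration only, the punctuation and word-splitting rules described above can be approximated with a few regular expressions. This is a rough sketch, not the lexer from this PR: `SKETCH_TOKEN` and `sketch_tokenize` are hypothetical names, comment and SGML handling are left out entirely, and treating `_` as a word character is an assumption.

```ruby
# Hypothetical approximation of the rules in the description.
# NOT Linguist's real tokenizer; comments and SGML tags are not handled.
SKETCH_TOKEN = Regexp.union(
  /[@.#]?[A-Za-z0-9_]+/,       # a word, optionally led by ONE of . @ #
  /[()\[\]{}]/,                # each bracket is a token of its own
  /[^A-Za-z0-9_\s()\[\]{}]+/   # any other run of punctuation is one token
)

# Returns the list of tokens found in +text+, skipping whitespace.
def sketch_tokenize(text)
  text.scan(SKETCH_TOKEN)
end

p sketch_tokenize("console.log")       # word split after the first word
p sketch_tokenize("mail@example.com")  # three tokens
p sketch_tokenize("f(x[0]) <==> y")    # brackets alone, <==> as one token
```

The alternation order matters in this sketch: the word pattern is tried first, so a leading `.`, `@` or `#` attaches to the word that follows it instead of being swallowed by a punctuation run.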

Checklist:

Checklist does not apply.

smola · 2020-10-22T20:14:54Z

Don't pay much attention to the generated lexer files. Note that they currently include changes from #5006 but I'll rebase when that's merged.

In a simple cross-validation run, the error count drops from 58 on master to 45 with this change. The effect is particularly positive for some languages, including C and C++.

diff report

--- master.list	2020-10-22 21:59:07.650376934 +0200
+++ tokenizer-symbols3.list	2020-10-22 21:58:34.013900257 +0200
@@ -6 +6 @@
-AGS Script/KeyboardMovement_102.asc GOOD
+AGS Script/KeyboardMovement_102.asc BAD(Public Key)
@@ -16,2 +16,2 @@
-AsciiDoc/list.asc BAD(Public Key)
-Assembly/3D_PRG.I GOOD
+AsciiDoc/list.asc GOOD
+Assembly/3D_PRG.I BAD(Motorola 68K Assembly)
@@ -21,2 +21,2 @@
-Assembly/fp_sqr32_160_comba.inc BAD(C++)
-Assembly/lib.inc BAD(Motorola 68K Assembly)
+Assembly/fp_sqr32_160_comba.inc GOOD
+Assembly/lib.inc GOOD
@@ -24,2 +24,2 @@
-BitBake/gstreamer-libav.bb BAD(BlitzBasic)
-BitBake/qtbase-native.bb BAD(BlitzBasic)
+BitBake/gstreamer-libav.bb GOOD
+BitBake/qtbase-native.bb GOOD
@@ -34 +34 @@
-C++/16F88.h GOOD
+C++/16F88.h BAD(C)
@@ -36 +36 @@
-C/array.h BAD(Objective-C)
+C/array.h GOOD
@@ -44 +44 @@
-C/bootstrap.h BAD(Objective-C)
+C/bootstrap.h GOOD
@@ -48 +48 @@
-C++/ClasspathVMSystemProperties.inc BAD(Pawn)
+C++/ClasspathVMSystemProperties.inc GOOD
@@ -75 +75 @@
-C++/libcanister.h GOOD
+C++/libcanister.h BAD(C)
@@ -77 +77 @@
-C++/metrics.h BAD(C)
+C++/metrics.h GOOD
@@ -80,2 +80,2 @@
-C/Nightmare.h GOOD
-C/ntru_encrypt.h BAD(C++)
+C/Nightmare.h BAD(C++)
+C/ntru_encrypt.h GOOD
@@ -114 +114 @@
-C++/rpc.h BAD(C)
+C++/rpc.h GOOD
@@ -118 +118 @@
-C/syscalldefs.h BAD(Objective-C)
+C/syscalldefs.h GOOD
@@ -166 +166 @@
-Formatted/long_seq.for BAD(Forth)
+Formatted/long_seq.for GOOD
@@ -172 +172 @@
-Fortran/bug-185631.f GOOD
+Fortran/bug-185631.f BAD(Forth)
@@ -197 +197 @@
-GLSL/recurse1.fs GOOD
+GLSL/recurse1.fs BAD(Filterscript)
@@ -253 +253 @@
-INI/defaults.properties BAD(Java Properties)
+INI/defaults.properties GOOD
@@ -266,2 +266,2 @@
-Java Properties/libraries.properties GOOD
-Java Properties/sounds.properties GOOD
+Java Properties/libraries.properties BAD(INI)
+Java Properties/sounds.properties BAD(INI)
@@ -294,2 +294,2 @@
-Mathematica/HeyexImport.m BAD(Objective-C)
-Mathematica/Init.m GOOD
+Mathematica/HeyexImport.m BAD(MATLAB)
+Mathematica/Init.m BAD(Objective-C)
@@ -298 +298 @@
-Mathematica/PacletInfo.m BAD(Mercury)
+Mathematica/PacletInfo.m GOOD
@@ -385 +385 @@
-M/ZDIOUT1.m BAD(Mercury)
+M/ZDIOUT1.m GOOD
@@ -408 +408 @@
-NewLisp/log-to-database.lisp BAD(Common Lisp)
+NewLisp/log-to-database.lisp GOOD
@@ -429 +429 @@
-Objective-C/Siesta.h BAD(C)
+Objective-C/Siesta.h BAD(C++)
@@ -459 +459 @@
-Pawn/Check.inc BAD(PHP)
+Pawn/Check.inc GOOD
@@ -466 +466 @@
-Perl/exception_handler.pl BAD(Raku)
+Perl/exception_handler.pl GOOD
@@ -505 +505 @@
-Proguard/proguard-rules2.pro GOOD
+Proguard/proguard-rules2.pro BAD(IDL)
@@ -532 +532 @@
-Raku/01-dash-uppercase-i.t BAD(Perl)
+Raku/01-dash-uppercase-i.t GOOD
@@ -536 +536 @@
-Raku/A.pm BAD(Perl)
+Raku/A.pm GOOD
@@ -539,2 +539,2 @@
-Raku/calendar.t BAD(Perl)
-Raku/ContainsUnicode.pm BAD(Perl)
+Raku/calendar.t GOOD
+Raku/ContainsUnicode.pm GOOD
@@ -558 +558 @@
-Rebol/booters.r GOOD
+Rebol/booters.r BAD(R)
@@ -561 +561 @@
-R/hello-r.R BAD(Rebol)
+R/hello-r.R GOOD
@@ -579 +579 @@
-SaltStack/eval.sls BAD(Scheme)
+SaltStack/eval.sls GOOD
@@ -599 +599 @@
-Smalltalk/scriptWithPragma.st BAD(HTML)
+Smalltalk/scriptWithPragma.st GOOD
@@ -623 +623 @@
-Text/LIDARLite.ncl GOOD
+Text/LIDARLite.ncl BAD(NCL)

The last error (LIDARLite.ncl) is not relevant. See #5054.
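
For anyone who wants to tally such a report, a minimal sketch follows. It assumes each line of the `.list` files has the form `path GOOD` or `path BAD(Language)`, as in the diff above. `RESULT_LINE` and `misclassified` are hypothetical names, and the script that produced the lists is not shown in this thread.

```ruby
# Hypothetical helper for lines like "C/array.h BAD(Objective-C)".
# Paths may contain spaces ("AGS Script/..."), so the path is matched greedily.
RESULT_LINE = /\A(?<path>.+) (?:GOOD|BAD\((?<guess>.+)\))\z/

# Returns a hash of { sample path => wrongly guessed language }.
def misclassified(lines)
  lines.each_with_object({}) do |line, errors|
    m = RESULT_LINE.match(line.chomp)
    next unless m && m[:guess]   # skip GOOD lines and unparseable lines
    errors[m[:path]] = m[:guess]
  end
end

before = ["AGS Script/KeyboardMovement_102.asc GOOD", "C/array.h BAD(Objective-C)"]
after  = ["AGS Script/KeyboardMovement_102.asc BAD(Public Key)", "C/array.h GOOD"]

fixed     = misclassified(before).keys - misclassified(after).keys
regressed = misclassified(after).keys - misclassified(before).keys
puts "fixed: #{fixed.inspect}, regressed: #{regressed.inspect}"
```

Comparing the two key sets separates samples that were fixed from samples that regressed, which the raw error totals alone do not show.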

stale · 2020-12-25T13:45:34Z

This pull request has been automatically marked as stale because it has not had recent activity, and will be closed if no further activity occurs. If this pull request was overlooked, forgotten, or should remain open for any other reason, please reply here to call attention to it and remove the stale status. Thank you for your contributions.

lildude · 2021-01-08T09:36:46Z

@smola it looks like the changes here alter the detection of one of the samples, which causes the only failing test blocking this PR from being merged:

  1) Failure:
TestClassifier#test_classify_ambiguous_languages [/home/runner/work/linguist/linguist/test/test_classifier.rb:56]:
/home/runner/work/linguist/linguist/samples/AGS Script/KeyboardMovement_102.asc
[["Public Key", -7.575176404971187], ["AsciiDoc", -10.933410495077375], ["AGS Script", -13.908242416753797]].
Expected: "AGS Script"
  Actual: "Public Key"

Could you please look into why this is? 🙇
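
To read the failure output: the candidates appear to be listed with a score each, and the test compares the expected language against the best-scoring one. A minimal sketch of that selection, assuming higher (less negative) scores are better; `FAILING_SCORES` and `best_guess` are hypothetical names, and only the numbers come from the output above.

```ruby
# Scores copied from the failing test output; the names here are illustrative.
FAILING_SCORES = [
  ["Public Key", -7.575176404971187],
  ["AsciiDoc", -10.933410495077375],
  ["AGS Script", -13.908242416753797]
].freeze

# Picks the language with the highest score (assumed to mean "most likely").
def best_guess(scores)
  scores.max_by { |_language, score| score }.first
end

puts best_guess(FAILING_SCORES)
```

Under that assumption, "AGS Script" loses to both "Public Key" and "AsciiDoc" for this sample, not just narrowly to one of them.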

smola · 2021-01-14T18:04:39Z

@lildude There were a couple of problems (bad C comment detection) that caused all tokens to be consumed incorrectly for some files. Those are now fixed. The remaining errors were on things like SQL, which is quite reasonable. I added the test change to skip files with extensions that have a catch-all rule, like .sql. I was going to introduce that in #5061 anyway.

So I think this is ready.

smola · 2021-01-14T18:05:05Z

test/test_tokenizer.rb

-    assert_equal %w(foo <!--Comment-->), tokenize("foo <!--Comment-->")
-    assert_equal %w(foo <!--Comment-->), tokenize("foo<!--Comment-->")
-    assert_equal %w(foo <!---->), tokenize("foo<!---->")
+    assert_equal %w(foo), tokenize("foo <!--Comment-->")


Fixing this was an unintended effect of this PR 😉

lildude

LGTM

smola mentioned this pull request Oct 22, 2020

Tokenizer comments #5061

Merged

smola force-pushed the tokenizer-symbols3 branch from c7756c5 to 53af02d Compare October 23, 2020 15:49

smola force-pushed the tokenizer-symbols3 branch from 53af02d to 8a43472 Compare November 14, 2020 14:28

smola mentioned this pull request Dec 5, 2020

New Centroid-based Classifier #5103

Merged

stale bot added the Stale label Dec 25, 2020

lildude removed the Stale label Jan 4, 2021

lildude requested a review from a team as a code owner January 6, 2021 11:44

smola force-pushed the tokenizer-symbols3 branch from 5f47086 to 2643445 Compare January 14, 2021 18:00

lildude added 2 commits January 22, 2021 17:43

Merge branch 'master' into tokenizer-symbols3

9f3d15e

Merge branch 'master' into tokenizer-symbols3

a8626e5

lildude approved these changes Jan 25, 2021

lildude merged commit 3448237 into github-linguist:master Jan 25, 2021

smola deleted the tokenizer-symbols3 branch January 25, 2021 12:08

github-linguist deleted a comment Aug 2, 2021

lildude mentioned this pull request Jun 16, 2022

tokenizer.rb:17: [BUG] Segmentation fault #5938

Closed

lildude mentioned this pull request Jun 30, 2022

Fix tokenizer crash when passed very long strings #5956

Closed

github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
