Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tokenizer comments #5061

Merged
merged 1 commit into from
Feb 4, 2021
Merged

Conversation

smola
Copy link
Contributor

@smola smola commented Oct 22, 2020

Description

Comment style is an important syntactic feature of programming languages. This commit changes the tokenizer to output a special token for each comment. A comment such as /* Comment */ will generate the token COMMENT/*.

This already improves detection of some sames. For example, if Mathematica/PacletInfo.m is excluded from training, it will be classified as Mercury instead of Mathematica (on test_classify_ambiguous_languages), but after this commit, it will be classified correctly, because the COMMENT(* tokens provide enough information to the classifier to make the correct decision.

Tests are adjusted to:

  • Ignore classifier tests for extensions where we have a catch-all rule. This includes .sql, where the classifier output is mostly useless.
  • Allow samples/C++/rpc.h to be classified as C. This file is fully valid C without any single C++-specific syntax or convention.

Checklist:

Checklist does not apply.

@smola
Copy link
Contributor Author

smola commented Oct 22, 2020

This PR is currently stacked on top of #5006 and #5060. Cross validation errors are down to 36 (from 58 in master and 45 in #5060).

diff report
--- tokenizer-symbols3.list	2020-10-22 21:58:34.013900257 +0200
+++ tokenizer-comments2.list	2020-10-23 00:31:04.483269226 +0200
@@ -23 +23 @@
-Assembly/macros.inc BAD(NASL)
+Assembly/macros.inc GOOD
@@ -39 +39 @@
-C#/AssemblyInfo.cs BAD(Smalltalk)
+C#/AssemblyInfo.cs GOOD
@@ -75 +75 @@
-C++/libcanister.h BAD(C)
+C++/libcanister.h GOOD
@@ -114 +114 @@
-C++/rpc.h GOOD
+C++/rpc.h BAD(C)
@@ -197 +197 @@
-GLSL/recurse1.fs BAD(Filterscript)
+GLSL/recurse1.fs GOOD
@@ -256 +256 @@
-INI/ms.properties BAD(Java Properties)
+INI/ms.properties GOOD
@@ -295 +295 @@
-Mathematica/Init.m BAD(Objective-C)
+Mathematica/Init.m GOOD
@@ -429 +429 @@
-Objective-C/Siesta.h BAD(C++)
+Objective-C/Siesta.h BAD(C)
@@ -505 +505 @@
-Proguard/proguard-rules2.pro BAD(IDL)
+Proguard/proguard-rules2.pro GOOD
@@ -558 +558 @@
-Rebol/booters.r BAD(R)
+Rebol/booters.r GOOD
@@ -624,2 +624,2 @@
-Text/min-help.ncl BAD(NCL)
-Text/rmMonAnnCycLLT-help.ncl BAD(NCL)
+Text/min-help.ncl GOOD
+Text/rmMonAnnCycLLT-help.ncl GOOD

@smola
Copy link
Contributor Author

smola commented Oct 22, 2020

After these changes, files discussed in #5015 are correctly classified as C without any further sample.

@Alhadis
Copy link
Collaborator

Alhadis commented Oct 22, 2020

Good job! Not familiar with this part of the codebase, but here's how Roff comments are matched:

(?:^|\n)[.'][ \t]*\\" (?# Line comment)
|
(?:^|\n)\.ig\n        (?# Block comment, begin)
(?:^|\n)\.\.\n        (?# Block comment, end)

(These expressions are simplified, but they'll match the most obvious-looking cases)

@smola
Copy link
Contributor Author

smola commented Oct 23, 2020

@Alhadis Added tokenization for Roff comments. You may want to check if I got it right in the new test cases.

@Alhadis
Copy link
Collaborator

Alhadis commented Oct 23, 2020

👍 Yup, nailed it.

@smola
Copy link
Contributor Author

smola commented Nov 14, 2020

#5060 does not pass tests standalone, so I could create a new PR with both changes from #5060 and #5061. In principle they are logically different changes, but in terms of accuracy improvement and tests, they work best together.

@smola smola marked this pull request as ready for review January 5, 2021 16:22
@smola smola requested a review from a team as a code owner January 5, 2021 16:22
Copy link
Member

@lildude lildude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Allow samples/C++/rpc.h to be classified as C. This file is fully valid C without any single C++-specific syntax or convention.

If there is nothing C++-specific in this file, I think we should move it to C and adjust the tests accordingly. Given the ongoing problems with the header files in the C-family of languages, it might be prudent to start erring on the side of C by default unless otherwise told or detected. This could be enhanced further if @Alhadis's idea at #1626 (comment) is feasible (I don't know enough about C languages to be 💯 sure).

Other than that, a quick look at the code doesn't show any glaring issues, but updating the samples.json does introduce some new warnings that need addressing:

$ rm -rf tmp/x86_64-darwin19/
$ bundle exec rake samples
mkdir -p tmp/x86_64-darwin19/linguist/2.7.1
cd tmp/x86_64-darwin19/linguist/2.7.1
/Users/lildude/.rbenv/versions/2.7.1/bin/ruby -I. ../../../../ext/linguist/extconf.rb
creating Makefile
cd -
cd tmp/x86_64-darwin19/linguist/2.7.1
/usr/bin/make
compiling ../../../../ext/linguist/lex.linguist_yy.c
lex.linguist_yy.c:1264:13: warning: misleading indentation; statement is not part of the previous 'if' [-Wmisleading-indentation]
            if ( ! yyg->yy_state_buf )
            ^
lex.linguist_yy.c:1262:9: note: previous statement is here
        if ( ! yyg->yy_state_buf )
        ^
1 warning generated.
compiling ../../../../ext/linguist/linguist.c
linking shared-object linguist/linguist.bundle
cd -
mkdir -p tmp/x86_64-darwin19/stage/lib/linguist
[...]

@smola
Copy link
Contributor Author

smola commented Jan 6, 2021

@lildude What's your compiler? Do you use custom CFLAGS? As far as I can tell, these misleading indentations are in flex-generated where we have little control.

@lildude
Copy link
Member

lildude commented Jan 6, 2021

Xcode 12. I don't believe I've deliberately set any custom CFLAGs in a looong time. I've uploaded a copy of the generated Makefile to https://gist.github.com/lildude/eafe81e2339b6c5e1009b56e157fa72d if you want to take a look.

@lildude
Copy link
Member

lildude commented Jan 8, 2021

@smola I've just noticed this doesn't happen with your staging branch so I suspect something across the three PRs "fixes" this.

Copy link
Member

@lildude lildude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two little nits, but otherwise looking good 😍

test/test_classifier.rb Outdated Show resolved Hide resolved
test/test_file_blob.rb Outdated Show resolved Hide resolved
Comment style is an important syntactic feature of programming
languages. This commit changes the tokenizer to output a special token
for each comment.

A comment such as "/* Comment */" will generate the token "COMMENT/*".

This already improves detection of some sames. For example, if
Mathematica/PacletInfo.m is excluded from training, it will be
classified as Mercury instead of Mathematica (on
test_classify_ambiguous_languages), but after this commit, it will be
classified correctly, because the "COMMENT(*" tokens provide enough
information to the classifier to make the correct decision.

Also adds Roff-style comments.

Tests are adjusted to:

- Ignore classifier tests for extensions where we have a catch-all rule.
This includes .sql, where the classifier output is mostly useless.

- Allow `samples/C++/rpc.h` to be classified as C. This file is fully
valid C without any single C++-specific syntax or convention.
Copy link
Member

@lildude lildude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@lildude lildude merged commit 9f2dc96 into github-linguist:master Feb 4, 2021
@smola smola deleted the tokenizer-comments2 branch February 4, 2021 12:27
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants