Old-style CR terminators can affect classification #2988

Alhadis · 2016-05-06T08:28:18Z

This was discovered in #2971, where one file was still being misclassified because of its unusual line-endings.

With support for WoW-ToC files added, ChatKeys.toc should've been identified as addon metadata. However, Linguist still interpreted it as TeX until the line endings were converted from old-style CR to proper modern LF. Once converted, it was identified as WoW-ToC again.

Could choice of line-ending be potentially skewing the classification?

Converting the file to LF endings wasn't possible in Atom due to a bug (which was recently addressed), so I had to use Perl. I've attached the file with both line-ending types to spare you the trouble.
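For anyone wanting to reproduce the conversion without Perl, here is a minimal Python sketch. It is not the command used above; the function name `cr_to_lf` and the sample bytes are made up for illustration and are not the real contents of ChatKeys.toc.

```python
def cr_to_lf(data: bytes) -> bytes:
    """Normalise CRLF and bare CR line endings to LF."""
    # Replace CRLF first so a CRLF pair doesn't turn into two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Hypothetical CR-terminated sample, standing in for the attached file.
sample = b"## Interface: 60200\r## Title: ChatKeys\rChatKeys.lua\r"
converted = cr_to_lf(sample)
```

Working on bytes rather than text sidesteps any newline translation the file API might otherwise perform when reading or writing.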

stale · 2018-11-06T07:14:34Z

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

Alhadis · 2018-11-06T08:54:41Z

.

lildude · 2018-11-06T09:37:24Z

.

Removing the Stale label is also enough to indicate activity and stalebot will ignore the issue/PR for another 30 days.

Alhadis · 2018-11-06T09:39:27Z

Oh. 😢 Sorry.

stale · 2018-12-06T10:01:58Z

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

stale · 2019-02-04T11:46:31Z

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

Alhadis · 2019-04-02T18:41:42Z

Alright, so it's not the CR line-terminators that are screwing up classification. If I rerun github-linguist on an edited copy of ChatKeys-CR.toc with all carriage-returns replaced with nothing, it gives me the same result (TeX). It might be line-length that's the factor: because CR hasn't been supported as a line-terminator since the "classic Macintosh" died, the \r sequence becomes an invisible, zero-width indicator of absolutely nothing in text. That is, programs act as though the CRs simply aren't there, leaving CR-delimited runs of text to flop into one huge-ass paragraph, which is probably what's gotten our bot confused by the likeness to TeX.
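The line-length theory can be illustrated with a small Python sketch. It assumes a tool that splits on LF only; the helper `lf_lines` and the sample text are hypothetical, and this is not how Linguist itself tokenises files.

```python
# Hypothetical CR-terminated sample, not the real ChatKeys.toc contents.
cr_text = "## Interface: 60200\r## Title: ChatKeys\rChatKeys.lua\r"
lf_text = cr_text.replace("\r", "\n")


def lf_lines(text):
    # Split on LF only, the way an LF-only tool would see the file,
    # and drop the empty trailing piece.
    return [line for line in text.split("\n") if line]


cr_lines = lf_lines(cr_text)  # one long "line" containing embedded CRs
lf_only = lf_lines(lf_text)   # three ordinary lines

longest_cr = max(len(line) for line in cr_lines)
longest_lf = max(len(line) for line in lf_only)
```

With CR endings the whole file collapses into a single long line, which is the kind of difference that could plausibly skew anything sensitive to line structure.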

In any case, a few new samples and a carefully-written heuristic should put this matter to rest. 👍

EDIT: If it's worth mentioning, I did scour the TeX samples searching for any stray carriage-returns (which weren't locked into a never-ending partnership with a \n character, I mean 😀). The search yielded nothing, convincing me it wasn't skewing the classifier after all.
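A search like the one described can be sketched in Python with a negative lookahead. The names `BARE_CR` and `count_bare_crs` are made up for this example, and this is not the actual command that was run against the TeX samples.

```python
import re

# A CR that is not immediately followed by LF, i.e. not half of a CRLF pair.
BARE_CR = re.compile(rb"\r(?!\n)")


def count_bare_crs(data: bytes) -> int:
    """Count carriage returns that aren't part of a CRLF sequence."""
    return len(BARE_CR.findall(data))


crlf_only = count_bare_crs(b"\\section{A}\r\n\\section{B}\r\n")
mixed = count_bare_crs(b"one\rtwo\r\nthree\r")
```

Reading each sample in binary mode and passing its bytes to `count_bare_crs` would flag any file with a non-zero count.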

Alhadis · 2019-04-02T20:28:57Z

Also, it's kind of strange that this offending fixture has been sitting in Linguist's samples directory for quite some time now, and the analysis is blatantly incorrect:

index a71f827..ec9ba7a 100644
--- /samples/World of Warcraft Addon Data/ChatKeys-CR.toc
+++ /samples/World of Warcraft Addon Data/ChatKeys-LF.toc
@@ -1,16 +1,3 @@
-  ChatKeys-CR.toc                                      
-     10 lines (9 sloc)                                 
-     type:      Text                                   
-     mime type: text/plain                             
-     language:  TeX <~~~~~~~~ Wrong!                   
-               ^^^^^                                   
@@ -67,13 +54,15 @@
+  ChatKeys-LF.toc                                      
+     (10 lines, 9 sloc)                                
+     type:      Text                                   
+     mime type: text/plain                             
+     language:  World of Warcraft Addon Data           
+               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^          
+                        Correct!

Adding them both and writing a test to verify they pass as WoWToC data correctly. Will have a look at giving that TeX confusion a few heuristics...

Alhadis · 2019-04-05T06:55:36Z

Seems as though the heuristics weren't working because ^ doesn't match SOL if the preceding line-break was a carriage return (nor should it). If I tweak WoW's heuristic to use (?:\A|[\r\n])##, the CR-infected file gets identified as WoW-Addon data, as expected.
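The anchor behaviour is easy to reproduce. Linguist's heuristics are Ruby regexes, so the sketch below is only an analogy: it uses Python's `re` module, where ^ in MULTILINE mode likewise matches only at the start of the string and after \n. The sample text is hypothetical.

```python
import re

# Hypothetical CR-terminated sample whose first line doesn't start with "##".
cr_text = "# ChatKeys addon\r## Interface: 60200\r## Title: ChatKeys\r"

caret = re.compile(r"^##", re.MULTILINE)   # line start means "after \n" only
tweaked = re.compile(r"(?:\A|[\r\n])##")   # also accepts a preceding CR

caret_hits = len(caret.findall(cr_text))      # 0: no line ever "starts" after a CR
tweaked_hits = len(tweaked.findall(cr_text))  # 2: both "##" lines are found
```

Note that the tweaked pattern consumes the preceding line-break character, which is harmless for a yes/no heuristic but matters if the match position is used.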

However, if I remove the heuristic altogether and rerun Linguist, I get the same results as before anyway, indicating that CR terminators are affecting both Linguist's classifier and its heuristics.

I'm not sure if this issue is even worth considering, actually. I mean, the only system that used \r was the classic Mac OS (pre-2000 Apple), and users these days should either be using CRLF or LF...

Alhadis · 2021-01-08T09:37:55Z

Closing this issue as our upcoming new classifier, courtesy of @smola's heroic efforts, indirectly fixes this issue.

Alhadis mentioned this issue May 6, 2016

Add support for World of Warcraft .toc files #2971

Merged

Alhadis mentioned this issue Aug 17, 2018

Lightshow giving misleading results based on input method #3130

Closed

stale bot added the Stale label Nov 6, 2018

stale bot removed the Stale label Nov 6, 2018

stale bot added the Stale label Dec 6, 2018

Alhadis removed the Stale label Dec 6, 2018

stale bot added the Stale label Feb 4, 2019

Alhadis removed the Stale label Feb 5, 2019

Alhadis self-assigned this Feb 5, 2019

Alhadis mentioned this issue Apr 3, 2019

Force UTF-8 for filenames in breakdown analysis #4465

Merged

Alhadis mentioned this issue Jul 12, 2019

ActionScript detected as AngelScript #4580

Closed

Alhadis mentioned this issue Jan 8, 2021

New Centroid-based Classifier #5103

Merged

Alhadis closed this as completed Jan 8, 2021

github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
