Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Old-style CR terminators can affect classification #2988

Closed
Alhadis opened this issue May 6, 2016 · 10 comments
Closed

Old-style CR terminators can affect classification #2988

Alhadis opened this issue May 6, 2016 · 10 comments
Assignees

Comments

@Alhadis
Copy link
Collaborator

Alhadis commented May 6, 2016

This was discovered in #2971, where one file was still being misclassified because of its unusual line-endings.

With support for WoW-ToC files added, ChatKeys.toc should've been identified as addon metadata. However, Linguist still interpreted it as TeX until the line endings were converted from old-style CR to proper modern LF. Suddenly, it identified as WoW-ToC again.

Could choice of line-ending be potentially skewing the classification?

Converting the file to LF endings wasn't possible in Atom due to a bug (which was recently addressed), so I had to use Perl. I've attached the file with both line-ending types to spare you the trouble.

@stale
Copy link

stale bot commented Nov 6, 2018

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

@stale stale bot added the Stale label Nov 6, 2018
@Alhadis
Copy link
Collaborator Author

Alhadis commented Nov 6, 2018

.

@stale stale bot removed the Stale label Nov 6, 2018
@lildude
Copy link
Member

lildude commented Nov 6, 2018

.

Removing the Stale label is also enough to indicate activity and stalebot will ignore the issue/PR for another 30 days.

@Alhadis
Copy link
Collaborator Author

Alhadis commented Nov 6, 2018

Oh. 😢 Sorry.

@stale
Copy link

stale bot commented Dec 6, 2018

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

@stale stale bot added the Stale label Dec 6, 2018
@Alhadis Alhadis removed the Stale label Dec 6, 2018
@stale
Copy link

stale bot commented Feb 4, 2019

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

@stale stale bot added the Stale label Feb 4, 2019
@Alhadis Alhadis removed the Stale label Feb 5, 2019
@Alhadis Alhadis self-assigned this Feb 5, 2019
@Alhadis
Copy link
Collaborator Author

Alhadis commented Apr 2, 2019

Alright, so it's not the CR line-terminators that's screwing up classification. If I rerun github-linguist with an edited copy of ChatKeys-CR.toc with all carriage-returns replaced with nothing, it gives me the same result (TeX). It might be line-length that's the factor: because CR hasn't been supported as a line-terminator since the "classic Macintosh" died,

In text, the \r sequence becomes an invisible, zero-width indicator of absolutely nothing... that is, programs act as though they simply aren't there leaving CR-delimited runs of text to flop into one huge ass paragraph, which is probably what's gotten our bot confused with the likeness.

In any case, a few new samples and a carefully-written heuristic should put this matter to rest. 👍

EDIT: If it's worth mentioning, I did scour the TeX samples searching for any stray carriage-returns (which weren't locked into a never-ending partnership with a \n character, I mean 😀). The search yielded nothing, convincing me it wasn't skewing the classifier after all.

@Alhadis
Copy link
Collaborator Author

Alhadis commented Apr 2, 2019

Also, it's kind of strange that this offending fixture has been sitting in Linguist's samples directory for quite some time now, and the analysis is blatantly incorrect:

index a71f827..ec9ba7a 100644
--- /samples/World of Warcraft Addon Data/ChatKeys-CR.toc
+++ /samples/World of Warcraft Addon Data/ChatKeys-LF.toc
@@ -1,16 +1,3 @@
-  ChatKeys-CR.toc                                      
-     10 lines (9 sloc)                                 
-     type:      Text                                   
-     mime type: text/plain                             
-     language:  TeX <~~~~~~~~ Wrong!                   
-               ^^^^^                                   
@@ -67,13 +54,15 @@
+  ChatKeys-LF.toc                                      
+     (10 lines, 9 sloc)                                
+     type:      Text                                   
+     mime type: text/plain                             
+     language:  World of Warcraft Addon Data           
+               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^          
+                        Correct!                       

Adding them both and writing a test to verify they pass as WoWToC data correctly. Will have a look at giving that TeX confusion a few heuristics...

@Alhadis
Copy link
Collaborator Author

Alhadis commented Apr 5, 2019

Seems as though the heuristics weren't working because ^ doesn't match SOL if the preceding line-break was a carriage return (nor should it). If I tweaked WoW's heuristic to use (?:\A|[\r\n])##, the CR-infected file gets identified as WoW-Addon data, as expected.

However, if I remove the heuristic altogether and rerun Linguist, I get the same results as before anyway, indicating that CR terminators are affecting both Linguist's classifier and its heuristics.

I'm not sure if this issue is even worth considering, actually. I mean, the only system that used \r was the classic Mac OS (pre-2000 Apple), and users these days should either be using CRLF or LF...

@Alhadis
Copy link
Collaborator Author

Alhadis commented Jan 8, 2021

Closing this issue as our upcoming new classifier, courtesy of @smola's heroic efforts, indirectly fixes this issue.

@Alhadis Alhadis closed this as completed Jan 8, 2021
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants