Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Misclassified C as C++ in data.table project #5028

Closed
jangorecki opened this issue Sep 29, 2020 · 12 comments
Closed

Misclassified C as C++ in data.table project #5028

jangorecki opened this issue Sep 29, 2020 · 12 comments

Comments

@jangorecki
Copy link

This repository does not use any C++ code but some header files are classified as such:
https://github.com/Rdatatable/data.table
It is strange that h files are not considered as C unless C++ syntax is detected inside, I think that should be the default option.
I would appreciate fixing it.

@smola
Copy link
Contributor

smola commented Oct 1, 2020

@jangorecki This is a known problem: #1626

You can workaround it by overriding the language of *.h files in the repository with .gitattributes: https://github.com/github/linguist#overrides

@jangorecki
Copy link
Author

jangorecki commented Oct 1, 2020

I think it should work other way around. *.h files classified as C but actually being C++, those should require workarounds.

@smola
Copy link
Contributor

smola commented Oct 2, 2020

@jangorecki I agree. But there's some technical limitations that need to be solved before that's possible. There's a lengthy discussion about it at #1626. Short-term, overriding with .gitattributes is the only solution.

@stale
Copy link

stale bot commented Dec 25, 2020

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

@stale stale bot added the Stale label Dec 25, 2020
@jangorecki
Copy link
Author

Issue is still relevant and mentioned technical limitations doesn't seem to be addressed.

@stale stale bot removed the Stale label Dec 26, 2020
@Starwort
Copy link

Starwort commented Jan 6, 2021

Also occurring in my repository:
https://github.com/Starwort/NEA/search?l=c%2B%2B
This seems very common...

@jangorecki
Copy link
Author

Yes, h files should be recognized as C till the proper fix will be implemented.

@smola
Copy link
Contributor

smola commented Jan 14, 2021

(EDITED): I was writing at the wrong issue.

@0hana
Copy link

0hana commented Mar 2, 2021

I also have this issue
https://github.com/0hana/C/test.graph.h

@lildude
Copy link
Member

lildude commented May 6, 2021

I'm closing this issue as it's essentially a duplicate of #1626 which I have now pinned to try and reduce the duplicates.

@lildude lildude closed this as completed May 6, 2021
@jangorecki
Copy link
Author

jangorecki commented May 6, 2021

@lildude so till the proper solution is in place maybe we can just change classification of *.h files to be C rather than C++? that would resolve this particular issue which is not exactly the same as #1626.

@lildude
Copy link
Member

lildude commented May 6, 2021

It's not that simple. .h is already classified as C:

https://github.com/github/linguist/blob/8011321f1394b5c3fc1f8ef242775af640df8f7d/lib/linguist/languages.yml#L614-L620

However, it is also classified as C++:

https://github.com/github/linguist/blob/8011321f1394b5c3fc1f8ef242775af640df8f7d/lib/linguist/languages.yml#L646-L661

... and Objective-C:

https://github.com/github/linguist/blob/8011321f1394b5c3fc1f8ef242775af640df8f7d/lib/linguist/languages.yml#L3847-L3857

... because it's valid in all three. Simply stripping from all but C will cause even more misclassification than currently seen with the current heuristics and classifier.

Where things go wrong is when a file doesn't match any of the heuristics:

https://github.com/github/linguist/blob/aad49acc0624c70d654a8dce447887dbbc713c7a/lib/linguist/heuristics.yml#L226-L231

So it falls through to the classifier which then has to make a guess based on the content of the samples it has. This is where things go wrong most of the time, especially with very small files.

What we could do, which may be the best solution, is to completely remove the classifier step for .h files, which is effectively what #1626 (comment) is suggesting.

I've started a PR at #5357 to implement this.

@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants