
Add a C sample defining a large initialized array #5015

Closed
wants to merge 1 commit into from

Conversation

@ib commented Sep 20, 2020

This fixes GitHub issue #5012.

  • I am fixing a misclassified language
    • I have included a new sample for the misclassified language:
      • Sample source(s): C/foo.h (see the sketch after this list)
        • [URL to each sample source, if applicable]
      • Sample license(s): file created by me, any license you want
    • I have included a change to the heuristics to distinguish my language from others using the same extension.
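For reference, here is a minimal sketch of the kind of header such a sample would be: a plain C header whose bulk is a large initialized array. This is illustrative only, not the contents of the actual C/foo.h; the struct and array names are hypothetical.

```c
/* foo.h — hypothetical sketch of a C header defining a large
 * initialized array, the pattern this PR's sample represents. */
#ifndef FOO_H
#define FOO_H

typedef struct {
    unsigned int codepoint;  /* Unicode code point */
    const char *name;        /* character name */
} NameEntry;

static const NameEntry name_table[] = {
    { 0x0020, "SPACE" },
    { 0x0021, "EXCLAMATION MARK" },
    { 0x0022, "QUOTATION MARK" },
    /* ...a real generated table continues for thousands of rows... */
};

#endif /* FOO_H */
```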

@lildude (Member) commented Sep 21, 2020

🤔 I'm not sure this solves the real issue here. The classifier uses real-world samples to guess a file's language based on how the language is actually used. This contrived sample is being added deliberately to sway the classifier, and in turn it may cause legitimate Objective-C or C++ header files (I have no idea if there is such a thing) to be classified as C, which would only add to what is already a tricky problem, as detailed in the ongoing discussion in #1626.

@ib (Author) commented Sep 21, 2020

The sample is based on real-world headers: the unicode-*.h files in https://github.com/ib/gucharmap/tree/master/gucharmap.

@lildude (Member) commented Sep 21, 2020

> The sample is based on real-world headers: the unicode-*.h files in https://github.com/ib/gucharmap/tree/master/gucharmap.

Why not use one of the real-world files?

@smola (Contributor) commented Sep 21, 2020

The main problem here is that token frequency within documents leads to weird results in classification. For example, if a few more braces appear in C++ files, the classifier will be biased towards C++ when classifying pure C files that have many braces.

I think the right solution is addressing the root problem: absolute frequency of tokens is suboptimal, and we should probably move to document frequency. I've been testing this approach and I think it should work decently (and be faster), but it still needs more verification.
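To make the bias concrete, here is a small numeric sketch (not linguist's actual classifier code; the counts are made up): under absolute token frequency, one unusually brace-heavy file inflates a language's statistics, while under document frequency each file contributes at most once.

```c
/* Hypothetical illustration of absolute token frequency vs. document
 * frequency. One brace-heavy outlier file dominates the mean token
 * count, but under document frequency it counts only once. */
#include <stdio.h>

int main(void) {
    /* Made-up counts of '{' in five C++ sample files; the last file
     * is a single unusually brace-heavy one. */
    const int brace_counts[] = { 12, 9, 15, 11, 400 };
    const int n = sizeof(brace_counts) / sizeof(brace_counts[0]);

    int total = 0, docs_with_token = 0;
    for (int i = 0; i < n; i++) {
        total += brace_counts[i];
        if (brace_counts[i] > 0)
            docs_with_token++;
    }

    /* Absolute frequency: the outlier drags the mean to ~89 per file. */
    printf("mean '{' per file (absolute frequency): %.1f\n",
           (double)total / n);
    /* Document frequency: capped at 1 per file, so the outlier
     * cannot skew the estimate. */
    printf("files containing '{' (document frequency): %d of %d\n",
           docs_with_token, n);
    return 0;
}
```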

@smola mentioned this pull request Oct 22, 2020
stale bot commented Dec 25, 2020

This pull request has been automatically marked as stale because it has not had recent activity, and will be closed if no further activity occurs. If this pull request was overlooked, forgotten, or should remain open for any other reason, please reply here to call attention to it and remove the stale status. Thank you for your contributions.

stale bot added the Stale label Dec 25, 2020
@lildude (Member) commented Mar 31, 2021

Closing as this isn't a real-world sample and is only being added to improperly influence the classifier. Recent and pending improvements to the classifier will also help address the original issue.

@lildude closed this Mar 31, 2021
@github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024