
Add a C sample defining a large initialized array #5015

Closed
wants to merge 1 commit into from

Conversation

@ib commented Sep 20, 2020

This fixes GitHub issue #5012.

  • I am fixing a misclassified language
    • I have included a new sample for the misclassified language:
      • Sample source(s): C/foo.h (see the sketch after this list)
        • [URL to each sample source, if applicable]
      • Sample license(s): file created by me, any license you want
    • I have included a change to the heuristics to distinguish my language from others using the same extension.
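For reference, here is a minimal sketch of the kind of header such a sample would be: a plain C header whose bulk is a large initialized array. This is illustrative only, not the contents of the actual C/foo.h; the struct and array names are hypothetical.

```c
/* foo.h — hypothetical sketch of a C header defining a large
 * initialized array, the pattern this PR's sample represents. */
#ifndef FOO_H
#define FOO_H

typedef struct {
    unsigned int codepoint;  /* Unicode code point */
    const char *name;        /* character name */
} NameEntry;

static const NameEntry name_table[] = {
    { 0x0020, "SPACE" },
    { 0x0021, "EXCLAMATION MARK" },
    { 0x0022, "QUOTATION MARK" },
    /* ...a real generated table continues for thousands of rows... */
};

#endif /* FOO_H */
```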

@lildude (Member) commented Sep 21, 2020

🤔 I'm not sure this solves the real issue here. The classifier uses real-world samples to guess a file's language based on how the language is actually used. This contrived sample is being added deliberately to sway the classifier, and in turn it may cause legitimate Objective-C or C++ header files (I have no idea if there is such a thing) to be classified as C, which would only add to what is already a tricky problem, as detailed in the ongoing discussion in #1626.

@ib (Author) commented Sep 21, 2020

The sample is based on real-world headers: the unicode-*.h files in https://github.com/ib/gucharmap/tree/master/gucharmap.

@lildude (Member) commented Sep 21, 2020

> The sample is based on real-world headers: the unicode-*.h files in https://github.com/ib/gucharmap/tree/master/gucharmap.

Why not use one of the real-world files?

@smola (Contributor) commented Sep 21, 2020

The main problem here is that token frequency within documents leads to weird results in classification. For example, if a few more braces appear in C++ files, the classifier will be biased towards C++ when classifying pure C files that have many braces.

I think the right solution is addressing the root problem: absolute frequency of tokens is suboptimal, and we should probably move to document frequency. I've been testing this approach and I think it should work decently (and be faster), but it still needs more verification.
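To make the bias concrete, here is a small numeric sketch (not linguist's actual classifier code; the counts are made up): under absolute token frequency, one unusually brace-heavy file inflates a language's statistics, while under document frequency each file contributes at most once.

```c
/* Hypothetical illustration of absolute token frequency vs. document
 * frequency. One brace-heavy outlier file dominates the mean token
 * count, but under document frequency it counts only once. */
#include <stdio.h>

int main(void) {
    /* Made-up counts of '{' in five C++ sample files; the last file
     * is a single unusually brace-heavy one. */
    const int brace_counts[] = { 12, 9, 15, 11, 400 };
    const int n = sizeof(brace_counts) / sizeof(brace_counts[0]);

    int total = 0, docs_with_token = 0;
    for (int i = 0; i < n; i++) {
        total += brace_counts[i];
        if (brace_counts[i] > 0)
            docs_with_token++;
    }

    /* Absolute frequency: the outlier drags the mean to ~89 per file. */
    printf("mean '{' per file (absolute frequency): %.1f\n",
           (double)total / n);
    /* Document frequency: capped at 1 per file, so the outlier
     * cannot skew the estimate. */
    printf("files containing '{' (document frequency): %d of %d\n",
           docs_with_token, n);
    return 0;
}
```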

@smola mentioned this pull request Oct 22, 2020
stale bot commented Dec 25, 2020

This pull request has been automatically marked as stale because it has not had recent activity, and will be closed if no further activity occurs. If this pull request was overlooked, forgotten, or should remain open for any other reason, please reply here to call attention to it and remove the stale status. Thank you for your contributions.

stale bot added the Stale label Dec 25, 2020
@lildude (Member) commented Mar 31, 2021

Closing as this isn't a real-world sample and is only being added to improperly influence the classifier. Recent and pending improvements to the classifier will also help address the original issue.

@lildude closed this Mar 31, 2021
@github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024