Better code examples for the Genero language #6628
Comments
Why? We don't change samples unless they're blatantly wrong and have never been correct. Why? Because if they've been valid once before, there is almost certainly going to be code on GitHub using that old format/approach/whatever. Update: if the language has evolved, new samples should be added for the new usage but this is only really needed if the classifier is having problems identifying files.
@lildude:
https://github.com/github-linguist/linguist/blob/master/docs/how-linguist-works.md details how Linguist works, but essentially it works like a funnel, with each strategy attempting to whittle the list of languages for a specific file down to a single language. The classifier is the last step: it is a Bayesian classifier that serves as a "last resort" and is the least reliable strategy. It's important to point out that Linguist considers files in isolation (think gists); the rest of the content of a repo has no impact on the language detection. The classifier is trained using the samples, which are broken up into tokens, and these tokens are then used to analyze files individually. The samples are not used for anything else and aren't shipped with the gem (only the resulting token file), hence it's the only place we allow GPL-licensed files. As we tokenise the samples, things like case, function names, styling, formatting, etc. are not relevant and not considered by the classifier, so changing samples to meet "best practices" doesn't have much impact, if any; hence we rarely need to change samples.
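To make the "samples become tokens, tokens drive the classifier" idea concrete, here is a minimal sketch of a token-based naive Bayes classifier. It is not Linguist's implementation: the `tokenize`, `train`, and `classify` names, the regex tokenizer, and the add-one smoothing are all assumptions chosen for illustration, and the sketch does not reproduce whatever normalization Linguist's own tokenizer applies.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical tokenizer (an assumption, not Linguist's): identifiers,
    # runs of digits, and single punctuation characters become tokens.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", text)


def train(samples):
    # samples maps a language name to a list of sample source strings.
    # Only the resulting token counts are kept, not the samples themselves.
    counts = defaultdict(Counter)
    for language, texts in samples.items():
        for text in texts:
            counts[language].update(tokenize(text))
    return counts


def classify(text, counts, candidates):
    # Score each candidate by the summed log-probability of the file's
    # tokens, using add-one smoothing so unseen tokens do not zero a score.
    vocabulary = set()
    for token_counts in counts.values():
        vocabulary.update(token_counts)
    tokens = tokenize(text)
    scores = {}
    for language in candidates:
        total = sum(counts[language].values())
        scores[language] = sum(
            math.log((counts[language][token] + 1) / (total + len(vocabulary)))
            for token in tokens
        )
    return max(scores, key=scores.get)


samples = {
    "Genero": ['MAIN\n  DISPLAY "hello"\nEND MAIN'],
    "Python": ["def main():\n    print('hello')"],
}
counts = train(samples)
guess = classify("MAIN DISPLAY x END MAIN", counts, ["Genero", "Python"])
```

Note that the file being classified is scored entirely on its own tokens; nothing about neighbouring files or the repository enters the calculation, which mirrors the "files in isolation" point above.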
The classifier isn't directly based on the extensions. As detailed above, it's only the tokenized content that is relevant to the classifier, and as mentioned before, this is considered in isolation. So with that in mind, there are a few issues:
Specifically regarding:
Linguist doesn't care about this. As I mentioned before, we tokenize the samples and analyse in isolation. And after all of that, you probably don't need to do anything, given linguist/lib/linguist/languages.yml lines 2274 to 2281 in c0da81e
... and linguist/lib/linguist/languages.yml lines 2282 to 2289 in c0da81e
... so the classifier is never used for these files. It'll only come into play if another language is added with this extension in future, and even then we try to encourage the use of heuristics for more accurate classification and only leave it to the classifier if it's not that simple.
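The reason the classifier never runs for these files can be sketched as a funnel that stops as soon as one strategy leaves a single candidate. This is only an illustration under stated assumptions: the `EXTENSIONS` table, the strategy functions, and `detect` are hypothetical names, the `.4gl` and `.per` entries follow the sample layout mentioned in this thread rather than being copied from languages.yml, and the `.h` entry is just an example of a shared extension.

```python
import os

# Hypothetical extension table (an assumption for illustration only).
EXTENSIONS = {
    ".4gl": ["Genero"],
    ".per": ["Genero Forms"],
    ".h": ["C", "C++", "Objective-C"],
}

classifier_calls = []  # records which files actually reached the classifier


def by_extension(filename, content, candidates):
    # Keep only the candidates registered for this file's extension.
    extension = os.path.splitext(filename)[1].lower()
    return [lang for lang in EXTENSIONS.get(extension, []) if lang in candidates]


def by_classifier(filename, content, candidates):
    # Stand-in for the last-resort classifier: here it just picks the first
    # remaining candidate so the control flow can be observed.
    classifier_calls.append(filename)
    return candidates[:1]


def detect(filename, content, strategies, all_languages):
    candidates = list(all_languages)
    for strategy in strategies:
        narrowed = strategy(filename, content, candidates)
        if len(narrowed) == 1:
            return narrowed[0]  # a single language is left: stop here
        if narrowed:
            candidates = narrowed  # still ambiguous: pass the shorter list on
    return None


LANGUAGES = ["C", "C++", "Objective-C", "Genero", "Genero Forms"]
STRATEGIES = [by_extension, by_classifier]

genero_result = detect("main.4gl", "MAIN\nEND MAIN", STRATEGIES, LANGUAGES)
calls_after_genero = list(classifier_calls)
header_result = detect("util.h", "int x;", STRATEGIES, LANGUAGES)
```

In this sketch `main.4gl` is resolved by the extension step alone, while `util.h` falls through to the classifier because its extension is shared by several languages.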
Thanks @lildude for the details.
I would expect that the "TextMate files" are THE reference to identify what a language syntax is, and I do not expect other tools to find out what syntax/keywords correspond to a specific file extension...
These aren't used by linguist at all. They're purely collected for the highlighting engine, which is a completely independent internal service, and are only used once linguist has identified the language of the file. It's more efficient to collect these grammars when the language is added than to attempt to maintain the things in two separate places. It is, however, not very efficient to use them for language detection.
@lildude: I have reviewed my new "state of the art" samples, having only .4gl and .per files, respectively in TOP/samples/Genero and "TOP/samples/Genero Forms" directories.
Maybe. You'll need to clearly explain why you're deleting the current samples. As I mentioned above, we don't remove or change samples unless they're blatantly wrong and not reflective of the language at all.
Oh, and you can update the samples and grammar in the same PR... we're not fussy about this sort of thing. We also squash commits on merge, so we don't care about the commit history, so go wild.
Changes have shipped. Closing.
The existing code samples for the Genero language should be reviewed.
We can take care of this and make a pull request from the https://github.com/sebflaesch/linguist-genero fork.