Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better code examples for the Genero language #6628

Closed
sebflaesch opened this issue Dec 7, 2023 · 10 comments
Closed

Better code examples for the Genero language #6628

sebflaesch opened this issue Dec 7, 2023 · 10 comments

Comments

@sebflaesch
Copy link
Contributor

The existing code samples for the Genero language should be reviewed.
We can take care of this and make a pull request from the https://github.com/sebflaesch/linguist-genero fork.

@lildude
Copy link
Member

lildude commented Dec 7, 2023

The existing code samples for the Genero language should be reviewed.

Why? We don't change samples unless they're blatantly wrong and have never been correct. Why? Because if they've been valid once before, there is almost certainly going to be code on GitHub using that old format/approach/whatever.

Update: if the language has evolved, new samples should be added for the new usage but this is only really needed if the classifier is having problems identifying files.

@sebflaesch sebflaesch changed the title New for better code examples for the Genero language Better code examples for the Genero language Dec 7, 2023
@sebflaesch
Copy link
Contributor Author

@lildude:
Why replacing sample files?
The Linguist doc recommendations is to avoid "Hello World" samples.
We (FourJs) would like to provide state of the art samples here. The original samples were (kindly) provided by one of our customers back in 2020, but we agreed that there are not perfect, contain invalid programming patterns, and that can be replaced by much better samples, following the coding standards of the Genero language (for example, using UPPERCASE for language keywords)
I am new here and do not know what is the "classifier".
If the "classifier" is based on file extensions: There are *.4gl and *.per file changes/replacements, and one new library.sch file, which is a generated file that can be ignored by the Linguist parsers, but is mandatory to compile the provided .4gl / .per sample files with Genero fglcomp/fglform compilers. Do you see any issue with this?

@lildude
Copy link
Member

lildude commented Dec 7, 2023

I am new here and do not know what is the "classifier".

https://github.com/github-linguist/linguist/blob/master/docs/how-linguist-works.md details how Linguist works, but essentially it works like a funnel with each strategy attempting to whittle the list of languages for a specific file down to a single language. The classifier is the last step and is a bayesian classifier that is a "last resort" and the least reliable strategy. It's important to point out that Linguist considers files in isolation (think gists); the rest of the content of a repo has no impact on the language detection.

The classifier is trained using the samples which are broken up in to tokens and these tokens are then used to analyze files individually. The samples are not used for anything else and aren't shipped with the gem (only the resulting token file), hence it's the only place we allow GPL licensed files. As we tokenise the samples, things like case, function names, styling, formatting etc are not relevant and not considered by the classifier so changing samples to meet "best practices" don't have much impact, if any, hence we rarely need to change samples.

If the "classifier" is based on file extensions: There are *.4gl and *.per file changes/replacements, and one new library.sch file, which is a generated file that can be ignored by the Linguist parsers, but is mandatory to compile the provided .4gl / .per sample files with Genero fglcomp/fglform compilers. Do you see any issue with this?

The classifier isn't directly based on the extensions as detailed above, it's only the tokenized content that is relevant to the classifier, and as mentioned before, this is considered in isolation.

So with that in mind, there are a few issues:

  1. We don't need samples for generated files. Generated files should be added to https://github.com/github-linguist/linguist/blob/master/lib/linguist/generated.rb and that's only really to ensure they're suppressed in diffs on github.com.
  2. .sch is not currently associated with Genero so will need to meet usage requirements as with any new extension being added for a language, and in this case, as it is already associated with another language, we'll need a heuristic (ie a regex that uniquely identifies these files) too.

Specifically regarding:

but is mandatory to compile the provided .4gl / .per sample files

Linguist doesn't care about this. As I mentioned before, we tokenize the samples and analyse in isolation.

And after all of that, you probably don't need to do anything as .4gl is unique to "Genero":

Genero:
type: programming
color: "#63408e"
extensions:
- ".4gl"
tm_scope: source.genero
ace_mode: text
language_id: 986054050

... and .per is unique to "Genero Forms":

Genero Forms:
type: markup
color: "#d8df39"
extensions:
- ".per"
tm_scope: source.genero-forms
ace_mode: text
language_id: 902995658

... so the classifier is never used for these files. It'll only come into play if another language is added with this extension in future, and even then we try and encourage the use of heuristics for more accurate classification and only leave it to the classifier if it's not that simple.

@sebflaesch
Copy link
Contributor Author

sebflaesch commented Dec 7, 2023

Thanks @lildude for the details.
All of this is quite new to me and need some time to digest.
Just a thought about the classifier (and maybe I am missing something):

"we tokenize the samples and analyse in isolation"

I would expect that the "TextMate files" are THE reference to identify what a language syntax is, and I do not expect other tools to find out what syntax/keywords corresponds to a specific file extension...

@lildude
Copy link
Member

lildude commented Dec 7, 2023

I would expect that the "TextMate files" are THE reference to identify what a language syntax is, and I do not expect other tools to find out what syntax/keywords corresponds to a specific file extension...

These aren't used by linguist at all. They're purely collected for the highlighting engine which is a completely independent internal service and are only used once linguist has identified the language of the file. It's more efficient to collect these grammars when the language is added than attempting to maintain the things in two separate places. It's however not very efficient to use them for language detection.

@sebflaesch
Copy link
Contributor Author

@lildude: I have reviewed my new "state of the art" samples, having only .4gl and .per files, respectively in TOP/samples/Genero and "TOP/samples/Genero Forms" directories.
Would that change be accepted by a pull request?

@lildude
Copy link
Member

lildude commented Dec 8, 2023

Would that change be accepted by a pull request?

Maybe. You'll need to clearly explain why you're deleting the current samples. As I mentioned above, we don't remove or change samples unless they're blatantly wrong and not reflective of the language at all.

@lildude
Copy link
Member

lildude commented Dec 8, 2023

Oh, and you can update the samples and grammar in the same PR... we're not fussy about this sort of thing. We also squash commits on merge so don't care about the commit history so go wild.

@sebflaesch
Copy link
Contributor Author

Thank you @lildude !
Finally I think you can close this issue as dup for #6629, we want to provide a single PR.

@lildude
Copy link
Member

lildude commented Dec 14, 2023

Changes have shipped. Closing.

@lildude lildude closed this as completed Dec 14, 2023
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants