Better code examples for the Genero language #6628

sebflaesch · 2023-12-07T15:56:38Z

The existing code samples for the Genero language should be reviewed.
We can take care of this and make a pull request from the https://github.com/sebflaesch/linguist-genero fork.

lildude · 2023-12-07T15:58:41Z

The existing code samples for the Genero language should be reviewed.

Why? We don't change samples unless they're blatantly wrong and have never been correct. Why? Because if they've been valid once before, there is almost certainly going to be code on GitHub using that old format/approach/whatever.

Update: if the language has evolved, new samples should be added for the new usage but this is only really needed if the classifier is having problems identifying files.

sebflaesch · 2023-12-07T17:07:43Z

@lildude:
Why replacing sample files?
The Linguist doc recommendations is to avoid "Hello World" samples.
We (FourJs) would like to provide state of the art samples here. The original samples were (kindly) provided by one of our customers back in 2020, but we agreed that there are not perfect, contain invalid programming patterns, and that can be replaced by much better samples, following the coding standards of the Genero language (for example, using UPPERCASE for language keywords)
I am new here and do not know what is the "classifier".
If the "classifier" is based on file extensions: There are *.4gl and *.per file changes/replacements, and one new library.sch file, which is a generated file that can be ignored by the Linguist parsers, but is mandatory to compile the provided .4gl / .per sample files with Genero fglcomp/fglform compilers. Do you see any issue with this?

lildude · 2023-12-07T17:46:29Z

I am new here and do not know what is the "classifier".

https://github.com/github-linguist/linguist/blob/master/docs/how-linguist-works.md details how Linguist works, but essentially it works like a funnel with each strategy attempting to whittle the list of languages for a specific file down to a single language. The classifier is the last step and is a bayesian classifier that is a "last resort" and the least reliable strategy. It's important to point out that Linguist considers files in isolation (think gists); the rest of the content of a repo has no impact on the language detection.

The classifier is trained using the samples which are broken up in to tokens and these tokens are then used to analyze files individually. The samples are not used for anything else and aren't shipped with the gem (only the resulting token file), hence it's the only place we allow GPL licensed files. As we tokenise the samples, things like case, function names, styling, formatting etc are not relevant and not considered by the classifier so changing samples to meet "best practices" don't have much impact, if any, hence we rarely need to change samples.

If the "classifier" is based on file extensions: There are *.4gl and *.per file changes/replacements, and one new library.sch file, which is a generated file that can be ignored by the Linguist parsers, but is mandatory to compile the provided .4gl / .per sample files with Genero fglcomp/fglform compilers. Do you see any issue with this?

The classifier isn't directly based on the extensions as detailed above, it's only the tokenized content that is relevant to the classifier, and as mentioned before, this is considered in isolation.

So with that in mind, there are a few issues:

We don't need samples for generated files. Generated files should be added to https://github.com/github-linguist/linguist/blob/master/lib/linguist/generated.rb and that's only really to ensure they're suppressed in diffs on github.com.
.sch is not currently associated with Genero so will need to meet usage requirements as with any new extension being added for a language, and in this case, as it is already associated with another language, we'll need a heuristic (ie a regex that uniquely identifies these files) too.

Specifically regarding:

but is mandatory to compile the provided .4gl / .per sample files

Linguist doesn't care about this. As I mentioned before, we tokenize the samples and analyse in isolation.

And after all of that, you probably don't need to do anything as .4gl is unique to "Genero":

linguist/lib/linguist/languages.yml

Lines 2274 to 2281 in c0da81e

    
           Genero: 
        
             type: programming 
        
             color: "#63408e" 
        
             extensions: 
        
             - ".4gl" 
        
             tm_scope: source.genero 
        
             ace_mode: text 
        
             language_id: 986054050

... and .per is unique to "Genero Forms":

linguist/lib/linguist/languages.yml

Lines 2282 to 2289 in c0da81e

    
           Genero Forms: 
        
             type: markup 
        
             color: "#d8df39" 
        
             extensions: 
        
             - ".per" 
        
             tm_scope: source.genero-forms 
        
             ace_mode: text 
        
             language_id: 902995658

... so the classifier is never used for these files. It'll only come into play if another language is added with this extension in future, and even then we try and encourage the use of heuristics for more accurate classification and only leave it to the classifier if it's not that simple.

sebflaesch · 2023-12-07T18:49:20Z

Thanks @lildude for the details.
All of this is quite new to me and need some time to digest.
Just a thought about the classifier (and maybe I am missing something):

"we tokenize the samples and analyse in isolation"

I would expect that the "TextMate files" are THE reference to identify what a language syntax is, and I do not expect other tools to find out what syntax/keywords corresponds to a specific file extension...

lildude · 2023-12-07T19:16:29Z

I would expect that the "TextMate files" are THE reference to identify what a language syntax is, and I do not expect other tools to find out what syntax/keywords corresponds to a specific file extension...

These aren't used by linguist at all. They're purely collected for the highlighting engine which is a completely independent internal service and are only used once linguist has identified the language of the file. It's more efficient to collect these grammars when the language is added than attempting to maintain the things in two separate places. It's however not very efficient to use them for language detection.

sebflaesch · 2023-12-08T10:09:32Z

@lildude: I have reviewed my new "state of the art" samples, having only .4gl and .per files, respectively in TOP/samples/Genero and "TOP/samples/Genero Forms" directories.
Would that change be accepted by a pull request?

lildude · 2023-12-08T15:11:59Z

Would that change be accepted by a pull request?

Maybe. You'll need to clearly explain why you're deleting the current samples. As I mentioned above, we don't remove or change samples unless they're blatantly wrong and not reflective of the language at all.

lildude · 2023-12-08T15:13:06Z

Oh, and you can update the samples and grammar in the same PR... we're not fussy about this sort of thing. We also squash commits on merge so don't care about the commit history so go wild.

sebflaesch · 2023-12-08T16:03:33Z

Thank you @lildude !
Finally I think you can close this issue as dup for #6629, we want to provide a single PR.

lildude · 2023-12-14T07:54:21Z

Changes have shipped. Closing.

sebflaesch added the Improvement label Dec 7, 2023

lildude mentioned this issue Dec 7, 2023

Replace Genero language syntax files with official files #6629

Closed

sebflaesch changed the title ~~New for better code examples for the Genero language~~ Better code examples for the Genero language Dec 7, 2023

lildude mentioned this issue Dec 11, 2023

Change grammar source for Genero language and rename #6632

Merged

3 tasks

lildude closed this as completed Dec 14, 2023

github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better code examples for the Genero language #6628

Better code examples for the Genero language #6628

sebflaesch commented Dec 7, 2023

lildude commented Dec 7, 2023 •

edited

Loading

sebflaesch commented Dec 7, 2023

lildude commented Dec 7, 2023

sebflaesch commented Dec 7, 2023 •

edited

Loading

lildude commented Dec 7, 2023

sebflaesch commented Dec 8, 2023

lildude commented Dec 8, 2023

lildude commented Dec 8, 2023

sebflaesch commented Dec 8, 2023

lildude commented Dec 14, 2023

Better code examples for the Genero language #6628

Better code examples for the Genero language #6628

Comments

sebflaesch commented Dec 7, 2023

lildude commented Dec 7, 2023 • edited Loading

sebflaesch commented Dec 7, 2023

lildude commented Dec 7, 2023

sebflaesch commented Dec 7, 2023 • edited Loading

lildude commented Dec 7, 2023

sebflaesch commented Dec 8, 2023

lildude commented Dec 8, 2023

lildude commented Dec 8, 2023

sebflaesch commented Dec 8, 2023

lildude commented Dec 14, 2023

lildude commented Dec 7, 2023 •

edited

Loading

sebflaesch commented Dec 7, 2023 •

edited

Loading