Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify disambiguation rules in YAML #4087

Merged
merged 3 commits into from
Sep 3, 2018

Conversation

smola
Copy link
Contributor

@smola smola commented Apr 2, 2018

Specify disambiguation rules in a YAML file. See rationale in #3746.

Description

All disambiguation rules are moved from heuristics.rb into heuristics.yml.

Checklist:

  • I am adding new or changing current functionality
    • I have added or updated the tests for the new or changed functionality.

@smola smola force-pushed the yaml-disambiguation branch 2 times, most recently from 956f735 to 53d3134 Compare July 9, 2018 15:13
@smola smola changed the title [WIP] Specify disambiguation rules in YAML Specify disambiguation rules in YAML Jul 9, 2018
@smola
Copy link
Contributor Author

smola commented Jul 9, 2018

This is now ready for review, although it could use some more testing and cleanup of heuristics.rb.

@vmg
Copy link
Contributor

vmg commented Jul 9, 2018

@smola: Looks like the heuristics got a bit messed up during the translation, as two language identification tests are failing. Can you look at them?

@smola smola force-pushed the yaml-disambiguation branch from 53d3134 to 6cdad03 Compare July 23, 2018 09:14
@smola
Copy link
Contributor Author

smola commented Jul 23, 2018

Fixed negative matches.
Also rebased this on top of #4189 to improve verification of the change. Ideally #4189 would be merged first.

@smola smola force-pushed the yaml-disambiguation branch from 6cdad03 to 42c69e3 Compare July 23, 2018 09:32
@vmg
Copy link
Contributor

vmg commented Jul 23, 2018

Waiting on #4189 to land. 👌

end

# Internal: Array of defined heuristics
@heuristics = []

# Internal
def initialize(exts_and_langs, &heuristic)
def initialize(exts_and_langs, rules)
@exts_and_langs, @candidates = exts_and_langs.partition {|e| e =~ /\A\./}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the YAML file doesn't support candidates (which is fine since we don't use that anymore). Thus, the above line can be simplified, as well as the additional logic in matches? to handle candidates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll perform this change next time I get to work on this.

# key.
# named_patterns - Key-value map of reusable named patterns.
#
# Please keep this list alphabetized.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, there's a test for that in test_pedantic.rb, test_heuristics_are_sorted; could you update it to check the YAML file instead? We poor humans have a hard time keeping things alphabetized (speaking from experience :p).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the test.

perl5: '\buse\s+(?:strict\b|v?5\.)'
perl6: '^\s*(?:use\s+v6\b|\bmodule\b|\b(?:my\s+)?class\b)'
#TODO: add tests for missing heuristics
#TODO: add support to return multiple languages
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you remove these? I'll open an issue for the first one so we don't forget and I don't think we supported the second one before anyway. We're very likely to forget TODOs at the end of the file, even if we later fix them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I forgot about these.

Actually, there is currently a disambiguation rule that returns two languages: [Language["Linux Kernel Module"], Language["AMPL"]] for .mod.

@smola smola force-pushed the yaml-disambiguation branch 2 times, most recently from 7efff9d to 2171f29 Compare July 24, 2018 08:11
@smola
Copy link
Contributor Author

smola commented Aug 6, 2018

There is still one case I couldn't convert to a plain regular expression:

https://github.com/github/linguist/blob/ed44c0b86d317b753c607742b137dce9594bed97/lib/linguist/heuristics.rb#L326

However, I found no file that is misclassified with something simpler such as:

(?<!\S)\.(include|globa?l)\s|(?<!\/\*)(\A|\n)\s*\.[A-Za-z][_A-Za-z0-9]*:

According to @Alhadis here 06c049b this is the rationale:

This commit modifies the heuristic to assume the following flow:

  1. If a line contains ".include" or ".global"/".globl" which doesn't
    follow a non-whitespace character, assume GAS.
  2. Otherwise, if the line starts with a command like ".LG7E0" with a
    possible string of whitespace before it, assume it's also GAS.

UNLESS either of the following conditions are true:
2a. The token is enclosed by a string or /* multiline comment */
2b. The previous line ends with a backslash to denote a statement
broken between lines, with possible whitespace and/or comment
sequences between the backslash and the actual newline.

  1. If neither of the above are met, assume the file is MAXScript.

It seems usage of multiline comments with /* */ syntax are rare in Unix Assembly. Supported by HLA Syntax but I still didn't see any file with .ms extension using them.

If we assumed this kind of comments are not there, we can just simplify the expression to look for a label without any preceding /* (in multi-line mode). What do you think?

@smola smola force-pushed the yaml-disambiguation branch from 2171f29 to 26526e7 Compare August 6, 2018 09:53
@Alhadis
Copy link
Collaborator

Alhadis commented Aug 6, 2018

If we assumed this kind of comments are not there, we can just simplify the expression to look for a label without any preceding /* (in multi-line mode). What do you think?

That would work. I was probably being overly cautious with accuracy considering the PR was my first real contribution to Linguist (and the open-ended fallback to MAXScript convinced me that keeping matches airtight was a wise thing to do).

I don't remember seeing any GAS files with multi-line /* */ comments, however.

@smola
Copy link
Contributor Author

smola commented Aug 6, 2018

@Alhadis 👍 Created a PR simplifying it: #4224

I'll change this PR to support combining negative and positive patterns at the same time. And also supporting multiple patterns with && instead of ||. The later also affects QMake, which was originally data.include?("HEADERS") && data.include?("SOURCES") but was effectively changed to data.include?("HEADERS") || data.include?("SOURCES")

@smola smola changed the title Specify disambiguation rules in YAML [WIP] Specify disambiguation rules in YAML Aug 6, 2018
@smola smola mentioned this pull request Aug 27, 2018
4 tasks
@smola smola force-pushed the yaml-disambiguation branch from 26526e7 to 71ab440 Compare September 3, 2018 08:53
@smola smola changed the title [WIP] Specify disambiguation rules in YAML Specify disambiguation rules in YAML Sep 3, 2018
@smola
Copy link
Contributor Author

smola commented Sep 3, 2018

Now with the support of and: blocks in heuristic.yml, functionality matches heuristics.rb behavior.
Extensions for QMake and .ms were the remaining ones, fixed in this iteration.

Copy link
Contributor

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @smola!

@vmg
Copy link
Contributor

vmg commented Sep 3, 2018

This looks great to me. I'm not even sure who has the final say on these PRs these days, but if nobody is opposed we can go ahead and merge. :)

@pchaigno
Copy link
Contributor

pchaigno commented Sep 3, 2018

I'm not even sure who has the final say on these PRs these days

You and I.

@pchaigno pchaigno merged commit 8bf9efa into github-linguist:master Sep 3, 2018
@pchaigno
Copy link
Contributor

pchaigno commented Sep 3, 2018

And, just like that, I broke @Alhadis' pull request (cf. #4258). Sorry 😛

@Alhadis
Copy link
Collaborator

Alhadis commented Sep 3, 2018

Lovely. Now what am I gonna do? 😕 Change the YAML parsing logic to accept a filename property?

@pchaigno
Copy link
Contributor

pchaigno commented Sep 3, 2018

@Alhadis I was thinking we could add something along the lines of filename_pattern:?

@smola smola deleted the yaml-disambiguation branch September 3, 2018 22:07
@Alhadis
Copy link
Collaborator

Alhadis commented Sep 7, 2018

Alright, I'll add it. But this really needs to be more object-oriented, IMHO.

For example, rather than specifying filename_pattern or negative_pattern as separate fields, we should have them as properties of a single Pattern object:

pattern:
- source: '/^foo/'
  target: data
- source: '/^bar'
  target: name
- source: '^qux'
  negated: true

Strings could still be used; simply treated as shorthand for an object with only the source field defined:

pattern: '/^foo/'

# Which is shorthand for:
pattern:
- source: '/^foo/'
  negated: false
  target: data

That way, we could adopt a cleaner implementation in heuristics.rb by subclassing Regexp and storing rule-specific properties on instances (such as whether to test against a file's data or its name).

This would also solve the problem of not being able to specify modifier flags for the regular expression, such as /i or /x (the latter would be useful if very lengthy expressions need to be defined in the YML file.

@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants