Sync to linguist 7.2.0: heuristics.yml support #189
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
8999d31 to 73e84fd
Includes only the generated code. Re-generated all `./data/*` heuristics matchers using GitHub Linguist commit [e761f9b013e5b61161481fcb898b59721ee40e3d](https://github.com/github/linguist/tree/e761f9b013e5b61161481fcb898b59721ee40e3d):
- many new languages
- better vendoring detection

Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Includes:
- update to content heuristic generator
- generated code in data/content.go

to keep commits atomic.

Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Includes generated code, to keep commits atomic. Consists of:
- code generator for aliases produces a new API
- retrofitting all clients to the new API
- generated code data/aliases.go

Signed-off-by: Alexander Bezzubov <bzz@apache.org>
73e84fd to df7844e
Got back from vacation and kept debugging the failing Bayesian classifier for content. It seems to come down to a difference in how linguist and enry tokenize the content.

Linguist, after adding this test:

    def test_classify_sql
      results = Classifier.classify(Samples.cache, fixture("SQL/drop_stuff.sql"), ["PLpgSQL", "SQL", "PLSQL", "SQLPL"])
      assert_equal "SQL", results.first[0]
    end

Enry

As seen above, the Bayesian classifier token weights for SQL are very different for the same language disambiguation case.

Resolution: fixing this is tracked under #194.
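To illustrate why a tokenization difference alone can change the outcome of a Bayesian classifier, here is a minimal Go sketch. Everything in it is an assumption made for illustration: the two tokenizers (`wordTokens`, `punctTokens`), the `toyModel` counts, and the add-one smoothing are hypothetical and are not the actual linguist or enry implementations or weights. The counts were picked so that the winner flips once punctuation becomes a token.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
	"unicode"
)

// sample is a hypothetical input, not the real SQL/drop_stuff.sql fixture.
const sample = "DROP TABLE foo;"

// toyModel maps language -> token -> training count (made-up numbers).
var toyModel = map[string]map[string]float64{
	"SQL":     {"drop": 4, "table": 4},
	"PLpgSQL": {"drop": 3, "table": 3, ";": 6},
}

func isWord(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsDigit(r) || r == '_'
}

// wordTokens keeps only word-like tokens and drops punctuation.
func wordTokens(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool { return !isWord(r) })
}

// punctTokens keeps words and also emits each punctuation rune as a token.
func punctTokens(s string) []string {
	var tokens []string
	var word strings.Builder
	flush := func() {
		if word.Len() > 0 {
			tokens = append(tokens, word.String())
			word.Reset()
		}
	}
	for _, r := range strings.ToLower(s) {
		switch {
		case isWord(r):
			word.WriteRune(r)
		case unicode.IsSpace(r):
			flush()
		default:
			flush()
			tokens = append(tokens, string(r))
		}
	}
	flush()
	return tokens
}

// classify returns the language with the highest sum of log token
// probabilities, using add-one smoothing and equal priors.
func classify(model map[string]map[string]float64, tokens []string) string {
	vocab := map[string]bool{}
	langs := make([]string, 0, len(model))
	for lang, counts := range model {
		langs = append(langs, lang)
		for t := range counts {
			vocab[t] = true
		}
	}
	sort.Strings(langs) // deterministic iteration order

	best, bestScore := "", math.Inf(-1)
	for _, lang := range langs {
		total := 0.0
		for _, c := range model[lang] {
			total += c
		}
		score := 0.0
		for _, t := range tokens {
			score += math.Log((model[lang][t] + 1) / (total + float64(len(vocab))))
		}
		if score > bestScore {
			best, bestScore = lang, score
		}
	}
	return best
}

func main() {
	fmt.Println(wordTokens(sample), classify(toyModel, wordTokens(sample)))
	fmt.Println(punctTokens(sample), classify(toyModel, punctTokens(sample)))
}
```

With these toy counts, the same input is classified differently depending only on whether `;` survives tokenization, which is the kind of divergence described above.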
CI passes, although there clearly are failing tests :/
Scope in the PR description updated.
Co-Authored-By: bzz <bzz@users.noreply.github.com>
@creachadair @juanjux thank you for taking a look while it's still WIP - all initial feedback addressed in 5fbadc8
What is super annoying is that
with a different output than on CI. Although it clearly should fail on CI as well, as the tests do not pass.
f7228d3 to ef9311e
@creachadair @juanjux all feedback addressed, tests pass, ready to be merged. Sorry for such a long set of changes, but the scope of this PR was already limited to a single part of the original #152 (see its description for the updated full scope of the github<->linguist sync).
@creachadair feedback addressed in c57bc4a and c4f3dbe
Thank you for the prompt reviews @juanjux, @creachadair 🚀 Also thanks for the kind explanations and for raising the concerns about public API structure, @creachadair!
ec00f1d to 97ab29a
All feedback addressed, @creachadair it's ready for another round 🙏
@creachadair thank you for your kind and useful feedback, I believe it all has been addressed and is ready for another pass.
Linguist v7.2.0 has been released. It includes a fix for one of the issues that affected us, so I will bump to e4560984058b4726010ca4b8f03ed9d0f8f464db
Sync to https://github.com/github/linguist/releases/tag/v7.2.0 and update instructions for test generation.

Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Sync with GitHub Linguist v7.2.0

Includes a new way of handling `heuristics.yml` and all `./data/*` re-generated using the GitHub Linguist [v7.2.0](https://github.com/github/linguist/releases/tag/v7.2.0) release tag.
- many new languages
- better vendoring detection
- update doc on update & known issues
Fixes part of #155 - generate heuristics from `heuristics.yml` instead of parsing `heuristics.rb`.

Major code changes include:
- `./internal/code-generator/heuristics.go` to consume `heuristics.yml` instead of `heuristics.rb` and produce a matchable rule tree
- `./data/`
- `./internal/code-generator/test_files/*.gold`

TODOs:
- `heuristics.yml`
- fix new Classifier strategy failures -> moved to "Bayesian classifier can't distinguish "SQL" vs "PLpgSQL"" #194
- `.gold` test fixtures
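
For readers unfamiliar with the idea of a "matchable rule tree", the following is a minimal Go sketch of what such a structure could look like. It is an illustration only: the type names (`Matcher`, `Pattern`, `Not`, `And`, `Always`, `Rule`, `Heuristic`), the regular expressions, and the `headerRules` example are all hypothetical and are not the API or generated data of this PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// Matcher is a hypothetical node of the rule tree: it answers whether
// the given file content satisfies a condition.
type Matcher interface {
	Matches(content []byte) bool
}

// Pattern is a leaf that matches when its regular expression is found.
type Pattern struct{ Re *regexp.Regexp }

func (p Pattern) Matches(content []byte) bool { return p.Re.Match(content) }

// Not inverts its inner matcher.
type Not struct{ Inner Matcher }

func (n Not) Matches(content []byte) bool { return !n.Inner.Matches(content) }

// And matches only when all of its children match.
type And []Matcher

func (a And) Matches(content []byte) bool {
	for _, m := range a {
		if !m.Matches(content) {
			return false
		}
	}
	return true
}

// Always matches any content; useful as a fallback rule.
type Always struct{}

func (Always) Matches([]byte) bool { return true }

// Rule pairs candidate languages with the condition that selects them.
type Rule struct {
	Languages []string
	When      Matcher
}

// Heuristic is an ordered list of rules; the first matching rule wins.
type Heuristic []Rule

func (h Heuristic) Match(content []byte) []string {
	for _, r := range h {
		if r.When.Matches(content) {
			return r.Languages
		}
	}
	return nil
}

// Illustrative patterns only, not taken from heuristics.yml.
var (
	objcRe = regexp.MustCompile(`(?m)^\s*@(interface|protocol|end)\b`)
	cppRe  = regexp.MustCompile(`(?m)^\s*(template\s*<|namespace\s+\w+)`)
)

// headerRules is a made-up heuristic for an ambiguous header file.
var headerRules = Heuristic{
	{Languages: []string{"Objective-C"}, When: Pattern{objcRe}},
	{Languages: []string{"C++"}, When: And{Pattern{cppRe}, Not{Pattern{objcRe}}}},
	{Languages: []string{"C"}, When: Always{}},
}

func main() {
	samples := []string{
		"@interface Foo\n@end\n",
		"namespace foo {\n}\n",
		"int main(void) { return 0; }\n",
	}
	for _, s := range samples {
		fmt.Printf("%q -> %v\n", s, headerRules.Match([]byte(s)))
	}
}
```

Representing rules as data that is walked at match time, rather than as hand-written control flow, is one way a generator can translate a declarative file into Go; whether the generated code in this PR takes this exact shape is not stated here.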