Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync to linguist 7.2.0: heuristics.yml support #189

Merged
merged 27 commits into from
Feb 14, 2019

Conversation

bzz
Copy link
Contributor

@bzz bzz commented Jan 9, 2019

Fixes part of the #155 - generate heuristics from heuristics.yml instead of parsing heuristics.rb.

Major code changes include:

  • update code generator for heuristics strategy in ./internal/code-generator/heuristics.go to consume heuristics.yml instead of heuristics.rb and produce matchable rule tree
  • re-generated all datastructures in ./data/
  • re-generated gold test results in ./internal/code-generator/test_files/*.gold

TODOs:

bzz added 2 commits January 9, 2019 21:19
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
@bzz bzz mentioned this pull request Jan 9, 2019
4 tasks
@bzz bzz force-pushed the maintenance/sync-to-linguist-7.1.3 branch from 8999d31 to 73e84fd Compare January 9, 2019 22:28
bzz added 7 commits January 9, 2019 23:40
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Includes only the generated code.
Re-generated all `./data/*` heuristics matchers using Github Linguist
[e761f9b013e5b61161481fcb898b59721ee40e3d](https://github.com/github/linguist/tree/e761f9b013e5b61161481fcb898b59721ee40e3d)
commit

 - many new languages
 - better vendoring detection

Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Includes:
 - update to content heuristic generator
 - generated code in data/content.go

to keep commits atomic.

Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Includes generated code, to keep commits atomic.

Consits of:
 - code generator for alias produces new API
 - retrofiting all clients to a new API
 - generated code data/aliases.go

Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
@bzz bzz force-pushed the maintenance/sync-to-linguist-7.1.3 branch from 73e84fd to df7844e Compare January 9, 2019 22:40
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
@bzz
Copy link
Contributor Author

bzz commented Jan 25, 2019

Got back from vacation and keep debugging the case of failing Bayesian classifier for content on samples/SQL/drop_stuff.sql - SQL (linguist) vs PLpgSQL (enry).

Seems like this has to do with difference in how linguist and enry tokenize the content.

Linguist

Adding to test/test_classifier.rb

  def test_classify_sql
    results = Classifier.classify(Samples.cache, fixture("SQL/drop_stuff.sql"), ["PLpgSQL", "SQL", "PLSQL", "SQLPL"])
    assert_equal "SQL", results.first[0]
  end
$ LINGUIST_DEBUG=2 bundle exec rake test TEST=test/test_classifier.rb TESTOPTS="--name=test_classify_sql -v"

                    #     PLSQL   PLpgSQL       SQL     SQLPL
               ;   10     19.914    22.974         -    21.265
            body    1      7.934         -     5.638         -
         cascade    1          -         -     5.638     8.044
            drop   10          -   102.321    79.408         -
        function    1      9.032    10.969     5.638         -
   functionname1    1          -         -     5.638         -
linguist_package    2          -         -    12.663         -
         package    2     17.254         -    12.663         -
       procedure    1      8.850         -     5.638     9.431
           table    2     16.678    15.632    14.049         -
            type    2      3.569     1.810         -     1.593
       typename1    1          -         -     5.638         -
       typename2    1          -         -     5.638         -
            view    2          -         -    13.474         -
       viewname1    1          -         -     5.638         -
       viewname2    1          -         -     5.638         -
   who_called_me    1      7.934         -     5.638         -
               x    1          -         -     6.331         -
               y    1          -         -     6.331         -

   PLpgSQL =   -330.972 +  -5.697 =   -336.668
       SQL =   -283.377 +  -5.158 =   -288.535
     PLSQL =   -393.512 +  -5.563 =   -399.075
     SQLPL =   -444.345 +  -5.851 =   -450.196

Enry

$ ENRY_DEBUG=1 ENRY_TEST_REPO="$PWD/.linguist" go test -run '^TestRegexpEdgeCases$'

                    #     PLSQL   PLpgSQL       SQL     SQLPL
               ;   10     34.270    37.049         -    35.420
            body    1      7.990         -     4.405         -
         cascade    1          -         -     4.405     8.217
            drop   10          -   103.600    67.077         -
        function    1      9.936    11.097     4.405         -
    functionname    1          -         -     4.405         -
linguist_package    2          -         -    10.197         -
         package    2     17.366         -    10.197         -
       procedure    1      8.906         -     4.405     9.603
           table    2     17.366    15.888    11.583         -
            type    2      7.169     5.554         -     5.426
        typename    2          -         -    10.197         -
            view    2          -         -    11.008         -
        viewname    2          -         -    10.197         -
   who_called_me    1      7.990         -     4.405         -
               x    1          -         -     5.098         -
               y    1          -         -     5.098         -

   PLpgSQL =   -334.671 +  -5.693 =   -340.364
       SQL =   -340.778 +  -5.154 =   -345.932
     SQLPL =   -449.195 +  -5.847 =   -455.042
     PLSQL =   -396.868 +  -5.559 =   -402.427

As seen above, Bayesian classifier token weights for SQL are very different for the same language disambiguation case.

Resolution: fixing this is tacked under #194

@bzz bzz changed the title WIP Sync to linguist 7.1.3 WIP Sync to linguist 7.1.3: heuristics.yml support Jan 28, 2019
@bzz bzz self-assigned this Jan 28, 2019
bzz added 3 commits January 28, 2019 14:10
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
@bzz
Copy link
Contributor Author

bzz commented Jan 28, 2019

CI passes, although there clearly are failing tests :/

--- FAIL: TestRegexpEdgeCases (0.00s)
	require.go:794: 
			Error Trace:	common_test.go:45
			Error:      	Received unexpected error:
			            	open .linguist/samples/ActionScript/FooBar.as: no such file or directory
			Test:       	TestRegexpEdgeCases
=== RUN   Test_EnryTestSuite
Cloning Linguist repo to '/tmp/linguist-735567248' as ENRY_TEST_REPO was not set

scope in PR description updated.

common.go Outdated Show resolved Hide resolved
@bzz bzz requested review from dennwc, juanjux and creachadair January 28, 2019 18:15
internal/code-generator/generator/generator.go Outdated Show resolved Hide resolved
internal/code-generator/generator/generator_test.go Outdated Show resolved Hide resolved
internal/code-generator/generator/generator_test.go Outdated Show resolved Hide resolved
internal/code-generator/generator/heuristics_test.go Outdated Show resolved Hide resolved
internal/code-generator/generator/heuristics_test.go Outdated Show resolved Hide resolved
internal/code-generator/generator/heuristics_test.go Outdated Show resolved Hide resolved
internal/code-generator/generator/heuristics_test.go Outdated Show resolved Hide resolved
creachadair and others added 2 commits January 28, 2019 19:49
Co-Authored-By: bzz <bzz@users.noreply.github.com>
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
@bzz
Copy link
Contributor Author

bzz commented Jan 28, 2019

@creachadair @juanjux thank you for taking a look while it's still WIP - all initial feedback addressed in 5fbadc8

@bzz
Copy link
Contributor Author

bzz commented Jan 28, 2019

What is super annoying is that make test does fail on my local env but not on CI

2019/01/28 20:23:37 exit status 128
FAIL	gopkg.in/src-d/enry.v1	0.062s

with a different output, then on CI.

Allthough it clearly should fail on CI as well, as test do not pass

=== RUN   TestRegexpEdgeCases
--- FAIL: TestRegexpEdgeCases (0.00s)
	require.go:794: 
			Error Trace:	common_test.go:45
			Error:      	Received unexpected error:
			            	open .linguist/samples/ActionScript/FooBar.as: no such file or directory
			Test:       	TestRegexpEdgeCases

Signed-off-by: Alexander Bezzubov <bzz@apache.org>
@bzz bzz changed the title WIP Sync to linguist 7.1.3: heuristics.yml support Sync to linguist 7.1.3: heuristics.yml support Jan 29, 2019
@bzz bzz force-pushed the maintenance/sync-to-linguist-7.1.3 branch from f7228d3 to ef9311e Compare January 29, 2019 17:02
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
@bzz
Copy link
Contributor Author

bzz commented Jan 29, 2019

@creachadair @juanjux all feedback addressed, tests pass, ready to be merged.

Sorry for such a long set of changes but scope of this PR was already limited to only a single part of the original #152 (see it's description for updated full scope of the github<->linguist sync)

CONTRIBUTING.md Show resolved Hide resolved
README.md Show resolved Hide resolved
common_test.go Outdated Show resolved Hide resolved
data/heuristics.go Show resolved Hide resolved
data/heuristics.go Outdated Show resolved Hide resolved
data/heuristics.go Outdated Show resolved Hide resolved
data/heuristics.go Outdated Show resolved Hide resolved
internal/code-generator/assets/alias.go.tmpl Show resolved Hide resolved
bzz added 2 commits January 30, 2019 00:34
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
@bzz
Copy link
Contributor Author

bzz commented Jan 29, 2019

@creachadair feedback addressed in c57bc4a and c4f3dbe

@bzz
Copy link
Contributor Author

bzz commented Jan 30, 2019

Thank you for prompt reviews @juanjux, @creachadair 🚀

Also thanks for kind explanations and rising the concerns about public API structure, @creachadair !
Will address those first thing tomorrow and let you know.

Signed-off-by: Alexander Bezzubov <bzz@apache.org>
@bzz bzz force-pushed the maintenance/sync-to-linguist-7.1.3 branch from ec00f1d to 97ab29a Compare February 5, 2019 22:01
@bzz
Copy link
Contributor Author

bzz commented Feb 5, 2019

All feedback addressed, @creachadair it's ready for another round 🙏

data/rule/rule.go Show resolved Hide resolved
data/rule/rule.go Show resolved Hide resolved
data/rule/rule.go Outdated Show resolved Hide resolved
data/rule/rule.go Outdated Show resolved Hide resolved
data/rule/rule.go Outdated Show resolved Hide resolved
data/rule/rule.go Outdated Show resolved Hide resolved
data/rule/rule.go Show resolved Hide resolved
bzz added 4 commits February 13, 2019 22:23
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
@bzz
Copy link
Contributor Author

bzz commented Feb 13, 2019

@creachadair thank you for your kind and useful feedback, I belive it all has been addressed and is ready for another pass.

Signed-off-by: Alexander Bezzubov <bzz@apache.org>
@bzz bzz changed the title Sync to linguist 7.1.3: heuristics.yml support Sync to linguist 7.2.0: heuristics.yml support Feb 14, 2019
@bzz
Copy link
Contributor Author

bzz commented Feb 14, 2019

New v7.2.0 of linguist has been released, it includes a fix to one of the issues that affected us so will bump to e4560984058b4726010ca4b8f03ed9d0f8f464db

Sync to https://github.com/github/linguist/releases/tag/v7.2.0
and update instructions for test generation.

Signed-off-by: Alexander Bezzubov <bzz@apache.org>
@bzz bzz merged commit 3499750 into src-d:master Feb 14, 2019
@bzz bzz deleted the maintenance/sync-to-linguist-7.1.3 branch February 14, 2019 11:47
bzz added a commit to bzz/enry that referenced this pull request Feb 20, 2019
Sync \w Github Linguist v7.2.0

Includes new way of handling `heuristics.yml` and
all `./data/*` re-generated using Github Linguist [v7.2.0](https://github.com/github/linguist/releases/tag/v7.2.0)
release tag.

 - many new languages
 - better vendoring detection
 - update doc on update&known issues.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants