Sync to linguist 7.2.0: heuristics.yml support (#189)
Sync with GitHub Linguist v7.2.0

Includes a new way of handling `heuristics.yml`, and all `./data/*` files
re-generated using the GitHub Linguist [v7.2.0](https://github.com/github/linguist/releases/tag/v7.2.0)
release tag.

 - many new languages
 - better vendoring detection
 - update docs on updating and known issues
bzz authored Feb 14, 2019
1 parent 13d3d66 commit 3499750
Showing 45 changed files with 114,957 additions and 84,118 deletions.
61 changes: 61 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,61 @@
# source{d} Contributing Guidelines

source{d} projects accept contributions via GitHub pull requests.
This document outlines some of the
conventions on development workflow, commit message formatting, contact points,
and other resources to make it easier to get your contribution accepted.

## Certificate of Origin

By contributing to this project, you agree to the [Developer Certificate of
Origin (DCO)](DCO). This document was created by the Linux Kernel community and is a
simple statement that you, as a contributor, have the legal right to make the
contribution.

To show your agreement with the DCO, include the following line at the end of the commit message,
using your real name: `Signed-off-by: John Doe <john.doe@example.com>`.

This can be done easily using the [`-s`](https://github.com/git/git/blob/b2c150d3aa82f6583b9aadfecc5f8fa1c74aca09/Documentation/git-commit.txt#L154-L161) flag of `git commit`.

If you find that you have pushed a few commits without `Signed-off-by`, you can still add it afterwards. We wrote a manual that can help: [fix-DCO.md](https://github.com/src-d/guide/blob/master/developer-community/fix-DCO.md).

## Support Channels

The official support channels, for both users and contributors, are:

- GitHub issues: each repository has its own list of issues.
- Slack: join the [source{d} Slack](https://join.slack.com/t/sourced-community/shared_invite/enQtMjc4Njk5MzEyNzM2LTFjNzY4NjEwZGEwMzRiNTM4MzRlMzQ4MmIzZjkwZmZlM2NjODUxZmJjNDI1OTcxNDAyMmZlNmFjODZlNTg0YWM) community.

Before opening a new issue or submitting a new pull request, it's helpful to
search the project: it's likely that another user has already reported the
issue you're facing, or that it's a known issue we're already aware of.


## How to Contribute

Pull Requests (PRs) are the only way to contribute code to source{d} projects.
To be accepted, a PR must meet the following requirements:

- The contribution must be clearly explained in natural language and include a minimal working example that reproduces it.
- All PRs must be written idiomatically:
  - for Go: formatted according to [gofmt](https://golang.org/cmd/gofmt/), and without any warnings from [go lint](https://github.com/golang/lint) or [go vet](https://golang.org/cmd/vet/)
  - for other languages, similar constraints apply.
- They should in general include tests, and those tests must pass.
  - If the PR is a bug fix, it has to include a new unit test that fails before the patch is merged.
  - If the PR is a new feature, it has to come with a suite of unit tests that test the new functionality.
- In any case, all PRs have to pass the personal evaluation of at least one of the [maintainers](MAINTAINERS) of the project.


### Format of the commit message

Every commit message should describe what was changed, under which context and, if applicable, the GitHub issue it relates to:

```
plumbing: packp, Skip argument validations for unknown capabilities. Fixes #623
```

The format can be described more formally as follows:

```
<package>: <subpackage>, <what changed>. [Fixes #<issue-number>]
```
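
As an illustration of the format above, a commit subject could be checked with a few lines of Go. This is only a sketch: the regular expression, the `CommitMessage` type and `ParseCommitMessage` are hypothetical names, not part of any source{d} tooling, and the pattern assumes that both the subpackage and the trailing `Fixes #<issue-number>` are optional.

```go
package main

import (
	"fmt"
	"regexp"
)

// commitMsgRe is a hypothetical pattern for
// "<package>: <subpackage>, <what changed>. [Fixes #<issue-number>]".
// The subpackage and the "Fixes #N" suffix are treated as optional.
var commitMsgRe = regexp.MustCompile(`^([\w./-]+): (?:([\w./-]+), )?(.+?)(?:\. Fixes #(\d+))?$`)

// CommitMessage holds the parts extracted from a commit subject line.
type CommitMessage struct {
	Package    string
	Subpackage string
	Summary    string
	Issue      string
}

// ParseCommitMessage splits a subject line into its parts and reports
// whether the line follows the assumed format.
func ParseCommitMessage(line string) (CommitMessage, bool) {
	m := commitMsgRe.FindStringSubmatch(line)
	if m == nil {
		return CommitMessage{}, false
	}
	return CommitMessage{Package: m[1], Subpackage: m[2], Summary: m[3], Issue: m[4]}, true
}

func main() {
	msg, ok := ParseCommitMessage("plumbing: packp, Skip argument validations for unknown capabilities. Fixes #623")
	fmt.Printf("%v %+v\n", ok, msg)
}
```

A check like this could run in a commit hook or CI job, but the exact rules are a matter of project policy rather than something this document prescribes.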
5 changes: 5 additions & 0 deletions Makefile
@@ -46,6 +46,11 @@ clean: clean-linguist clean-shared
code-generate: $(LINGUIST_PATH)
mkdir -p data && \
go run internal/code-generator/main.go
ENRY_TEST_REPO="$${PWD}/.linguist" go test -v \
-run Test_GeneratorTestSuite \
./internal/code-generator/generator \
-testify.m TestUpdateGeneratorTestSuiteGold \
-update_gold

benchmarks: $(LINGUIST_PATH)
go test -run=NONE -bench=. && \
21 changes: 9 additions & 12 deletions README.md
@@ -154,14 +154,17 @@ Generated Java bindings using a C-shared library and JNI are located under [`jav
Development
------------

*enry* re-uses parts of original [linguist](https://github.com/github/linguist) to generate internal data structures. In order to update to the latest upstream and generate the necessary code you must run:
*enry* re-uses parts of original [linguist](https://github.com/github/linguist) to generate internal data structures. In order to update to the latest upstream and generate all the necessary code you must run:

git clone https://github.com/github/linguist.git .linguist
# update commit in generator_test.go (to re-generate .gold fixtures)
# https://github.com/src-d/enry/blob/13d3d66d37a87f23a013246a1b0678c9ee3d524b/internal/code-generator/generator/generator_test.go#L18
go generate

We update enry when changes are done in linguist's master branch on the following files:

* [languages.yml](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml)
* [heuristics.rb](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb)
* [heuristics.yml](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.yml)
* [vendor.yml](https://github.com/github/linguist/blob/master/lib/linguist/vendor.yml)
* [documentation.yml](https://github.com/github/linguist/blob/master/lib/linguist/documentation.yml)

@@ -183,17 +186,11 @@ Divergences from linguist
Using [linguist/samples](https://github.com/github/linguist/tree/master/samples)
as a set for the tests, the following issues were found:

* With [hello.ms](https://github.com/github/linguist/blob/master/samples/Unix%20Assembly/hello.ms) we can't detect the language (Unix Assembly) because we don't have a matcher in contentMatchers (content.go) for Unix Assembly. Linguist uses this [regexp](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb#L300) in its code,
* [Heuristics for ".es" extension](https://github.com/github/linguist/blob/e761f9b013e5b61161481fcb898b59721ee40e3d/lib/linguist/heuristics.yml#L103) in JavaScript could not be parsed, because the RE2 regexp engine does not support backreferences.

`elsif /(?<!\S)\.(include|globa?l)\s/.match(data) || /(?<!\/\*)(\A|\n)\s*\.[A-Za-z][_A-Za-z0-9]*:/.match(data.gsub(/"([^\\"]|\\.)*"|'([^\\']|\\.)*'|\\\s*(?:--.*)?\n/, ""))`
* As of [Linguist v5.3.2](https://github.com/github/linguist/releases/tag/v5.3.2), Linguist uses a [flex-based scanner in C for tokenization](https://github.com/github/linguist/pull/3846). Enry still uses the regex-based [extract_token](https://github.com/github/linguist/pull/3846/files#diff-d5179df0b71620e3fac4535cd1368d15L60) algorithm. Tracked under https://github.com/src-d/enry/issues/193

which we can't port.

* All files for the SQL language fall to the classifier because we don't parse
this [disambiguator
expression](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb#L433)
for `*.sql` files right. This expression doesn't comply with the pattern for the
rest in [heuristics.rb](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb).
* The Bayesian classifier can't distinguish "SQL" from "PLpgSQL". Tracked under https://github.com/src-d/enry/issues/194
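
The backreference limitation mentioned in the list above can be reproduced with Go's standard `regexp` package, which implements RE2 syntax. The sketch below is illustrative only: the patterns are simplified examples rather than the exact expressions from `heuristics.yml`, and `compiles` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether Go's regexp package (RE2 syntax) accepts the pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	patterns := []string{
		`^\s*#include`,                 // plain pattern: accepted
		`("|')use strict\1`,            // backreference \1: rejected
		`(?<!\S)\.(include|globa?l)\s`, // negative lookbehind: rejected
	}
	for _, p := range patterns {
		fmt.Printf("%-5v %s\n", compiles(p), p)
	}
}
```

Rules whose patterns fail to compile like this have to be skipped or rewritten by hand when porting Ruby heuristics to Go.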

`enry` [CLI tool](#cli) does not require a full Git repository to be present in filesystem in order to report languages.

@@ -232,7 +229,7 @@ As benchmarks depend on Ruby and Github-Linguist gem make sure you have:
If you want to reproduce the same benchmarks as reported above:
- Make sure all [dependencies](#benchmark-dependencies) are installed
- Install [gnuplot](http://gnuplot.info) (in order to plot the histogram)
- Run `ENRY_TEST_REPO=.linguist benchmarks/run.sh` (takes ~15h)
- Run `ENRY_TEST_REPO="$PWD/.linguist" benchmarks/run.sh` (takes ~15h)

It will run the benchmarks for enry and linguist, parse the output, create csv files and plot the histogram. This takes some time.

5 changes: 1 addition & 4 deletions benchmark_test.go
@@ -28,9 +28,6 @@ var (
)

func TestMain(m *testing.M) {
var exitCode int
defer os.Exit(exitCode)

flag.BoolVar(&slow, "slow", false, "run benchmarks per sample for strategies too")
flag.Parse()

@@ -47,7 +44,7 @@ func TestMain(m *testing.M) {
log.Fatal(err)
}

exitCode = m.Run()
os.Exit(m.Run())
}
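
The `TestMain` change above matters because Go evaluates the arguments of a deferred call when the `defer` statement executes, not when the surrounding function returns. With `defer os.Exit(exitCode)`, the exit code was therefore fixed at 0 before `m.Run()` was called. A small sketch of the effect, using a hypothetical `record` function in place of `os.Exit` so that the program keeps running:

```go
package main

import "fmt"

// record stands in for os.Exit so the result can be observed.
func record(dst *int, code int) { *dst = code }

// deferredArg mimics the removed pattern: the argument is captured
// when the defer statement runs, while exitCode is still 0.
func deferredArg() int {
	seen := -1
	var exitCode int
	func() {
		defer record(&seen, exitCode) // exitCode is evaluated here: 0
		exitCode = 1                  // too late, the deferred call already holds 0
	}()
	return seen
}

// directArg mimics the new pattern: the value is passed after it is computed.
func directArg() int {
	seen := -1
	var exitCode int
	func() {
		exitCode = 1
		record(&seen, exitCode)
	}()
	return seen
}

func main() {
	fmt.Println(deferredArg(), directArg()) // prints "0 1"
}
```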

func cloneLinguist(linguistURL string) error {
11 changes: 5 additions & 6 deletions common.go
@@ -16,7 +16,7 @@ const OtherLanguage = ""
// Strategy type fixes the signature for the functions that can be used as a strategy.
type Strategy func(filename string, content []byte, candidates []string) (languages []string)

// DefaultStrategies is the strategies' sequence GetLanguage uses to detect languages.
// DefaultStrategies is a sequence of strategies used by GetLanguage to detect languages.
var DefaultStrategies = []Strategy{
GetLanguagesByModeline,
GetLanguagesByFilename,
@@ -397,12 +397,13 @@ func GetLanguagesByContent(filename string, content []byte, _ []string) []string
}

ext := strings.ToLower(filepath.Ext(filename))
fnMatcher, ok := data.ContentMatchers[ext]

heuristic, ok := data.ContentHeuristics[ext]
if !ok {
return nil
}

return fnMatcher(content)
return heuristic.Match(content)
}
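
To make the switch from `fnMatcher(content)` to `heuristic.Match(content)` more concrete, here is one possible shape for a table of per-extension heuristics. Everything in it is an assumption made for illustration: the `rule` and `heuristic` types, the `contentHeuristics` map and the two patterns are hypothetical and are not the generated `data.ContentHeuristics`.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule is a hypothetical heuristic rule: when pattern matches the
// content, the file is assumed to be written in one of languages.
type rule struct {
	pattern   *regexp.Regexp
	languages []string
}

// heuristic is an ordered list of rules for one file extension.
type heuristic []rule

// Match returns the languages of the first rule whose pattern matches.
func (h heuristic) Match(content []byte) []string {
	for _, r := range h {
		if r.pattern.Match(content) {
			return r.languages
		}
	}
	return nil
}

// contentHeuristics maps an extension to its heuristic (illustrative patterns only).
var contentHeuristics = map[string]heuristic{
	".h": {
		{regexp.MustCompile(`(?m)^\s*@(interface|implementation)\b`), []string{"Objective-C"}},
		{regexp.MustCompile(`(?m)^\s*(template\s*<|class\s+\w+|namespace\s+\w+)`), []string{"C++"}},
	},
}

// languagesByContent looks up the heuristic for ext and applies it.
func languagesByContent(ext string, content []byte) []string {
	h, ok := contentHeuristics[ext]
	if !ok {
		return nil
	}
	return h.Match(content)
}

func main() {
	fmt.Println(languagesByContent(".h", []byte("namespace enry {\n}\n")))
	fmt.Println(languagesByContent(".txt", []byte("plain text")))
}
```

Keeping rules as data rather than as one hand-written function per extension is what makes it practical to generate them from a YAML file.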

// GetLanguagesByClassifier uses DefaultClassifier as a Classifier and returns a sorted slice of possible languages ordered by
@@ -455,9 +456,7 @@ func GetLanguageType(language string) (langType Type) {
// GetLanguageByAlias returns either the language related to the given alias and ok set to true
// or OtherLanguage and ok set to false if the alias is not recognized.
func GetLanguageByAlias(alias string) (lang string, ok bool) {
a := strings.Split(alias, `,`)[0]
a = strings.ToLower(a)
lang, ok = data.LanguagesByAlias[a]
lang, ok = data.LanguageByAlias(alias)
if !ok {
lang = OtherLanguage
}
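
The lookup that `GetLanguageByAlias` previously did inline (take the text before the first comma, lowercase it, consult the alias table) can be pictured as a small helper. The sketch below only mirrors those removed lines; `languagesByAlias` is a tiny hypothetical table, and the real `data.LanguageByAlias` may normalize aliases differently.

```go
package main

import (
	"fmt"
	"strings"
)

// languagesByAlias is a tiny, hypothetical stand-in for the generated alias table.
var languagesByAlias = map[string]string{
	"go":     "Go",
	"golang": "Go",
	"c++":    "C++",
	"cpp":    "C++",
}

// languageByAlias mirrors the removed inline logic: keep the text before
// the first comma, lowercase it, and look it up in the table.
func languageByAlias(alias string) (string, bool) {
	key := strings.ToLower(strings.Split(alias, ",")[0])
	lang, ok := languagesByAlias[key]
	return lang, ok
}

func main() {
	fmt.Println(languageByAlias("GoLang,ignored"))
	fmt.Println(languageByAlias("unknown"))
}
```

Moving this into one function in the `data` package lets the tests reuse the same normalization instead of repeating it.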
75 changes: 51 additions & 24 deletions common_test.go
@@ -11,6 +11,7 @@ import (
"gopkg.in/src-d/enry.v1/data"

"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"github.com/stretchr/testify/suite"
)

@@ -19,9 +20,36 @@ const linguistClonedEnvVar = "ENRY_TEST_REPO"

type EnryTestSuite struct {
suite.Suite
repoLinguist string
samplesDir string
cloned bool
tmpLinguist string
needToClone bool
samplesDir string
}

func (s *EnryTestSuite) TestRegexpEdgeCases() {
var regexpEdgeCases = []struct {
lang string
filename string
}{
{lang: "ActionScript", filename: "FooBar.as"},
{lang: "Forth", filename: "asm.fr"},
{lang: "X PixMap", filename: "cc-public_domain_mark_white.pm"},
//{lang: "SQL", filename: "drop_stuff.sql"}, // https://github.com/src-d/enry/issues/194
{lang: "Fstar", filename: "Hacl.Spec.Bignum.Fmul.fst"},
{lang: "C++", filename: "Types.h"},
}

for _, r := range regexpEdgeCases {
filename := fmt.Sprintf("%s/samples/%s/%s", s.tmpLinguist, r.lang, r.filename)

content, err := ioutil.ReadFile(filename)
require.NoError(s.T(), err)

lang := GetLanguage(r.filename, content)
s.T().Logf("File:%s, lang:%s", filename, lang)

expLang, _ := data.LanguageByAlias(r.lang)
require.EqualValues(s.T(), expLang, lang)
}
}

func Test_EnryTestSuite(t *testing.T) {
@@ -30,25 +58,24 @@ func Test_EnryTestSuite(t *testing.T) {

func (s *EnryTestSuite) SetupSuite() {
var err error
s.repoLinguist = os.Getenv(linguistClonedEnvVar)
s.cloned = s.repoLinguist == ""
if s.cloned {
s.repoLinguist, err = ioutil.TempDir("", "linguist-")
assert.NoError(s.T(), err)
}

s.samplesDir = filepath.Join(s.repoLinguist, "samples")

if s.cloned {
cmd := exec.Command("git", "clone", linguistURL, s.repoLinguist)
s.tmpLinguist = os.Getenv(linguistClonedEnvVar)
s.needToClone = s.tmpLinguist == ""
if s.needToClone {
s.tmpLinguist, err = ioutil.TempDir("", "linguist-")
require.NoError(s.T(), err)
s.T().Logf("Cloning Linguist repo to '%s' as %s was not set\n",
s.tmpLinguist, linguistClonedEnvVar)
cmd := exec.Command("git", "clone", linguistURL, s.tmpLinguist)
err = cmd.Run()
assert.NoError(s.T(), err)
require.NoError(s.T(), err)
}
s.samplesDir = filepath.Join(s.tmpLinguist, "samples")
s.T().Logf("using samples from %s", s.samplesDir)

cwd, err := os.Getwd()
assert.NoError(s.T(), err)

err = os.Chdir(s.repoLinguist)
err = os.Chdir(s.tmpLinguist)
assert.NoError(s.T(), err)

cmd := exec.Command("git", "checkout", data.LinguistCommit)
@@ -60,8 +87,8 @@ func (s *EnryTestSuite) SetupSuite() {
}

func (s *EnryTestSuite) TearDownSuite() {
if s.cloned {
err := os.RemoveAll(s.repoLinguist)
if s.needToClone {
err := os.RemoveAll(s.tmpLinguist)
assert.NoError(s.T(), err)
}
}
@@ -88,7 +115,7 @@ func (s *EnryTestSuite) TestGetLanguage() {
}

func (s *EnryTestSuite) TestGetLanguagesByModelineLinguist() {
var modelinesDir = filepath.Join(s.repoLinguist, "test/fixtures/Data/Modelines")
var modelinesDir = filepath.Join(s.tmpLinguist, "test/fixtures/Data/Modelines")

tests := []struct {
name string
@@ -400,15 +427,16 @@ func (s *EnryTestSuite) TestGetLanguageByAlias() {
func (s *EnryTestSuite) TestLinguistCorpus() {
const filenamesDir = "filenames"
var cornerCases = map[string]bool{
"hello.ms": true,
"drop_stuff.sql": true, // https://github.com/src-d/enry/issues/194
// .es and .ice fail heuristics parsing, but do not fail any tests
}

var total, failed, ok, other int
var expected string
filepath.Walk(s.samplesDir, func(path string, f os.FileInfo, err error) error {
if f.IsDir() {
if f.Name() != filenamesDir {
expected = f.Name()
expected, _ = data.LanguageByAlias(f.Name())
}

return nil
@@ -431,17 +459,16 @@ func (s *EnryTestSuite) TestLinguistCorpus() {
} else {
status = "failed"
failed++

}

if _, ok := cornerCases[filename]; ok {
fmt.Printf("\t\t[considered corner case] %s\texpected: %s\tobtained: %s\tstatus: %s\n", filename, expected, obtained, status)
s.T().Logf("\t\t[considered corner case] %s\texpected: %s\tobtained: %s\tstatus: %s\n", filename, expected, obtained, status)
} else {
assert.Equal(s.T(), expected, obtained, fmt.Sprintf("%s\texpected: %s\tobtained: %s\tstatus: %s\n", filename, expected, obtained, status))
}

return nil
})

fmt.Printf("\t\ttotal files: %d, ok: %d, failed: %d, other: %d\n", total, ok, failed, other)
s.T().Logf("\t\ttotal files: %d, ok: %d, failed: %d, other: %d\n", total, ok, failed, other)
}