Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Solve license detection bugs #1963

Conversation

AyanSinhaMahapatra
Copy link
Member

@AyanSinhaMahapatra AyanSinhaMahapatra commented Mar 10, 2020

Fixes #1907 #1910 #1911 #1912 #1914

Tasks

  • Reviewed contribution guidelines
  • PR is descriptively titled 📑 and links the original issue above 🔗
  • Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR
    Run tests locally to check for errors.
  • Commits are in uniquely-named feature branch and has no merge conflicts 📁

@AyanSinhaMahapatra
Copy link
Member Author

AyanSinhaMahapatra commented Mar 10, 2020

Also I've reproduced and checked other bugs #1933 #1932 #1931 #1930 #1929 #1928 #1927 #1926 #1925 #1924 #1923 #1922 #1921 #1920 #1919 #1918 #1917 #1915. And all these are already solved by you @pombredanne in the 12-03-license-updates branch.

I've collected all these files to reproduce these bugs, initial scan reproduces them - scan results file. Then all the rules added by @pombredanne are indexed and the scan results are here - scan results files . Comparing you can see that the bugs are solved. Some them were remaining, are solved in this PR - #1908 #1910 #1911 #1912 #1914.
Remaining - #1905
All of these were reported by @armijnhemel . Thanks!

@codecov
Copy link

codecov bot commented Mar 10, 2020

Codecov Report

Merging #1963 into 12-03-license-updates will decrease coverage by 0.66%.
The diff coverage is n/a.

Impacted file tree graph

@@                    Coverage Diff                    @@
##           12-03-license-updates    #1963      +/-   ##
=========================================================
- Coverage                  79.61%   78.94%   -0.67%     
=========================================================
  Files                        131      131              
  Lines                      17642    16946     -696     
=========================================================
- Hits                       14045    13378     -667     
+ Misses                      3597     3568      -29     
Impacted Files Coverage Δ
src/packagedcode/build.py 74.44% <0.00%> (-5.56%) ⬇️
src/packagedcode/rubygems.py 70.89% <0.00%> (-5.38%) ⬇️
src/scancode/resource.py 86.98% <0.00%> (-3.62%) ⬇️
src/summarycode/plugin_consolidate.py 94.21% <0.00%> (-2.36%) ⬇️
src/scancode/cli.py 76.96% <0.00%> (-2.23%) ⬇️
src/scancode/api.py 95.54% <0.00%> (-0.64%) ⬇️
src/licensedcode/match.py 77.69% <0.00%> (-0.37%) ⬇️
src/scancode/outdated.py 81.25% <0.00%> (+0.19%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2a90d9f...7984a2e. Read the comment docs.

Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AyanSinhaMahapatra This is looking great! You really grok license detection now!
See my comments in line.

implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the License for more information.

Copyright (C) 1984, 1989, 1990, 2000, 2001, 2002, 2003, 2004, 2005, 2006
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In most cases , we are trying to avoid keeping copyright statements in license detection rules. So IMHO you could remove these two lines

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works perfectly. Pushing changes with removed lines.

@@ -0,0 +1,20 @@
GNU Libltdl is free software; you can redistribute it and/or
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be interesting to also have another rule (and may be have only that rule) WITHOUT the libltdl reference?
For instance rather than start with:
GNU Libltdl is free software; you can redistribute it and/or
use this:
GNU is free software; you can redistribute it and/or

and remove the libtdl word elsewhere?
Because of the way the license detection ignores words it doe not know about (such as libtdl) this will not have any impact and will be detected with the automaton... And it can also detected exactly several other variants.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works perfectly for this case, i.e. (having only that rule) it would also detect if the text's exactly same apart from "Libltdl". I was wondering if I can keep the original rule with a higher relevance, and the non-liblltdl one with a lower relevance score, so it detects other similar stuff but in case of multiple matches would prioritize other more relevant matches. I'm pushing that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO I think there's some work to be done in standardizing relevance scores and minimum coverages across all the rules, after performing statistics on false positive/unknown matches, as these are kind of happening regularly in bugs. This would be one of the smaller important tasks of the GSoC project, what do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some doc on how relevance scores are given out to rules? I see the general trend in rule names/texts vs their relevance/relevance scores, and how they affect license scores, but are there any strict guidelines I mean?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

liblltdl

IMHO only the one without liblltdl would be enough

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO I think there's some work to be done in standardizing relevance scores and minimum coverages across all the rules, after performing statistics on false positive/unknown matches, as these are kind of happening regularly in bugs. This would be one of the smaller important tasks of the GSoC project, what do you think?

excellent idea 👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some doc on how relevance scores are given out to rules? I see the general trend in rule names/texts vs their relevance/relevance scores, and how they affect license scores, but are there any strict guidelines I mean?

The relevance is computed dynamically based on the size of a rule OR stored and assigned manually as a curated value based on judgment.
See

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO only the one without liblltdl would be enough

Done.

@AyanSinhaMahapatra
Copy link
Member Author

AyanSinhaMahapatra commented Mar 11, 2020

  1. bison 2.4.3: GPL-3.0-or-later file also recognized as "Free unknown" #1907 Even after having a 100% match coverage with the correct gpl-3.0-plus LICENSE(not rule), matches with another free-unknown Rule. File
    {
      "path": "Issues/1907-bison-2.4.3-getargs.c",
      "licenses": [
        {
          "key": "gpl-3.0-plus",
          "score": 100.0,
          "matched_rule": {
            "identifier": "gpl-3.0-plus.LICENSE",
            "license_expression": "gpl-3.0-plus",
            "licenses": [
              "gpl-3.0-plus"
            ],,
            "matcher": "2-aho",
            "rule_length": 102,
            "matched_length": 102,
            "match_coverage": 100.0,
            "rule_relevance": 100
          },
          "matched_text": "   This program is free software: you can redistribute it and/or modify\n   it under the terms of the GNU General Public License as published by\n   the Free Software Foundation, either version 3 of the License, or\n   (at your option) any later version.\n\n   This program is distributed in the hope that it will be useful,\n   but WITHOUT ANY WARRANTY; without even the implied warranty of\n   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n   GNU General Public License for more details.\n\n   You should have received a copy of the GNU General Public License\n   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */"
        },
        {
          "key": "free-unknown",
          "score": 100.0,
          "matched_rule": {
            "identifier": "free-unknown_1.RULE",
            "license_expression": "free-unknown",
            "licenses": [
              "free-unknown"
            ],
            "matcher": "2-aho",
            "rule_length": 23,
            "matched_length": 23,
            "match_coverage": 100.0,
            "rule_relevance": 100
          },
          "matched_text": "This is free software; see the source for copying conditions.  There is NO \\\nwarranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. \\"
        }
      ],
      "license_expressions": [
        "gpl-3.0-plus",
        "free-unknown"
      ],

Even if the free-unknown rule's relevance is decreased to 50, nothing changes.

@pombredanne
Copy link
Member

@AyanSinhaMahapatra in order to have an efficient discussion on each issue it may be simpler to have the comments you pasted above each in their own respective ticket, otherwise this is going to be a tad hairy to track comments and replies.
Some other points:

  • it would be best to avoid link to Google docs for external files as these may go away: either paste the file in a comment or attach it as a .txt.
  • when you paste snippets of scans do not truncate them in the middle (e.g. do not remove the line numbers for instance) as it makes things difficult to understand. You can always attach a whole scan as a .txt too.

@AyanSinhaMahapatra
Copy link
Member Author

@pombredanne Added commits solving #1908 .

Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AyanSinhaMahapatra may be you could find a slightly more descriptive commit message?

Add requested changes is not super useful: think about it this way: 6 months from now, you are looking at this same message. What information does this message tells you? IMHO not much.
Try to find something more explicit... tell a (mini) story or enough for this to be useful now for others and for you and others in the future.

@@ -1,7 +1,7 @@
license_expression: lgpl-2.0-plus WITH libtool-exception-2.0
is_license_notice: yes
minimum_coverage: 90
relevance: 100
relevance: 90
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this not 100% relevant? is there any ambiguity that is a lgpl-2.0-plus WITH libtool-exception-2.0? I do not think so, furthermore, this is a large enough rule, so you could/should eschew storing a relevance there IMHO

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to remove this change, i.e. after doing this. Fixing this and the commit message too. Thanks for pointing out.

Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you ++
We are almost there!

@@ -0,0 +1,4 @@
license_expression: gpl-2.0-plus WITH bison-exception-2.2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should instead:

license_expression: bsd-new AND gpl-2.0-plus WITH bison-exception-2.2
 is_license_notice: yes
minimum_coverage: 99
referenced_filenames:
    - Copyright.txt

You may also want to create a second rule with only the BSD parts:

 Distributed under the OSI-approved BSD License (the "License");
  see accompanying file Copyright.txt for details.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right! Updating.

Copy link
Member Author

@AyanSinhaMahapatra AyanSinhaMahapatra Mar 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I've created a second rule with the BSD parts beforehand only, in this commit but didn't add bsd-new AND to the license_expression of gpl-2.0-plus WITH bison-exception-2.2, which I've pushed now. Does this work?

@@ -0,0 +1 @@
label = "GPL2"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is very close to a possible license tag of sorts, could there be a word before or after that could be included to make this more specific?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Adding a line before.

@@ -0,0 +1 @@
label = "GPL3"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as for the label = gpl2 rule

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will add a line before.

@@ -0,0 +1 @@
S5PC100_GPL4(0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this really detected?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, only GPL1, GPL2 and GPL3 were detected. On second thought this is really unnecessary, removing and renaming these.

@@ -0,0 +1 @@
S5PC100_GPL0(0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this really detected?

Copy link
Member Author

@AyanSinhaMahapatra AyanSinhaMahapatra Mar 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. On second thought this is really unnecessary, removing these.

@@ -0,0 +1 @@
label = "GPL4"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this really detected?

Copy link
Member Author

@AyanSinhaMahapatra AyanSinhaMahapatra Mar 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. On second thought this is really unnecessary, removing these.

@@ -0,0 +1 @@
label = "GPL0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this really detected?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. On second thought this is really unnecessary, removing these.

Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks... Almost there: there is still a minor nitpicking/rename needed.

@@ -0,0 +1,34 @@
Distributed under the OSI-approved BSD License (the "License");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a convention we name the rule after the license expression...
So may be use something like bsd-new_and_gpl-2.0-plus_with_bison-exception_....
(within reason, unless there many many licenses)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

pombredanne and others added 5 commits April 2, 2020 15:52
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Solves aboutcode-org#1912

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Solves aboutcode-org#1914 aboutcode-org#1911

Add negative and gpl with exceptions rules
Add rules and minimum coverage for false positive case.
Add new GPL rule
Add new bsd license rules
Add bzip2 Rule

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Add bsd-new to gpl with bison exception yml file.
Also add referenced copyright file.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne pombredanne force-pushed the license-updates-bugs branch from 3e09782 to 7984a2e Compare April 2, 2020 14:54
@pombredanne
Copy link
Member

This looks ready to merge to me. The tests failing are not related to this.
Thanks!

@pombredanne pombredanne merged commit a4b133f into aboutcode-org:12-03-license-updates Apr 2, 2020
AyanSinhaMahapatra added a commit to AyanSinhaMahapatra/scancode-analyzer that referenced this pull request Jun 29, 2020
Adds functions that takes input data from Scancode Scan Results, in JSON, and
structures them similarly as the ClearlyDefined DataBase Data, so the same
fuctions can be used on them. Adds Jupyter Notebook to explain the Fuction Calls,
and Data, and JSON files as sample, from the issues in aboutcode-org/scancode-toolkit#1963.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants