False positive AGPL detection from a mere URL #2877

Open
pombredanne opened this issue Feb 25, 2022 · 3 comments

@pombredanne
Member

We are detecting AGPL via agpl-3.0-plus_152.RULE on the text http://www.ghostscript.com ... for instance in https://github.com/ReactiveX/rxjs/blob/6.x/README.md

This is noisy.

There are two ways out:

  1. remove these short URL rules and related rules, since they are not enough on their own to be a license detection, or
  2. treat and report separately mere clues such as this one: they could be an interesting insight in some cases, but alone they are too weak to be considered a license detection
@AyanSinhaMahapatra
Member

AyanSinhaMahapatra commented Aug 3, 2022

@pombredanne I have a couple of possible cases here which could be clues, as opposed to detections. What do you think?

That is, they would have is_clue set to True in their .yml files and would be reported in license_clues, not in license_detections.

Cases where is_clue = True:

  1. URLs to licenses in github/other code repos, like: http://github.com/dotnetcore/Util/blob/master/LICENSE or https://devshed.codeplex.com/license
  2. links to websites such as ghostscript, as above
  3. unknown references, like: http://licenses.nuget.org/
  4. links to github repos (and not licenses), like: https://github.com/micahlmartin/OAuth2Provider
  5. links to github licenses which are unknown, like: https://raw.github.com/markwoodhall/MFlow/master/license.txt
  6. references/words which are known to be a license, like: mupdf or ghostscript or Affero
  7. words that are license names but generic (?), like: beerware or borceux
  8. hash values, like: md5=1a6d268fd218675ffea8be556788b780 is lgpl-2.1
  9. abbreviations of licenses (could tags like gpl also be included in this?), like: PSFL
  10. other unknown license references which are references to files/websites/packages and could not be resolved successfully

Cases where is_clue = False (i.e. these are valid detections):

  1. links to license texts or specific licenses, like: https://spdx.org/licenses/bsd-2-clause

Also attaching a csv file with a subset of the rules (is_license_reference = True and relevance < 100):
clues_possible.csv
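
The proposed split could be sketched roughly as follows. This is a minimal illustration and not ScanCode's actual code: the `Rule` and `Match` classes, the `split_matches` helper, the rule name `bsd_spdx_url.RULE` and both license expressions are assumptions made up for the example. Only `is_clue`, `license_clues`, `license_detections` and `agpl-3.0-plus_152.RULE` come from the discussion above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical stand-in for a license rule and the flags in its .yml file."""
    identifier: str
    license_expression: str
    is_clue: bool = False


@dataclass(frozen=True)
class Match:
    """Hypothetical stand-in for one match of a rule in a scanned file."""
    rule: Rule
    matched_text: str


def split_matches(matches):
    """Report each match under license_clues or license_detections,
    depending only on the is_clue flag of the matched rule."""
    results = {"license_detections": [], "license_clues": []}
    for match in matches:
        key = "license_clues" if match.rule.is_clue else "license_detections"
        results[key].append(match.rule.license_expression)
    return results


# The license expressions below are assumed for illustration.
url_rule = Rule("agpl-3.0-plus_152.RULE", "agpl-3.0-plus", is_clue=True)
spdx_rule = Rule("bsd_spdx_url.RULE", "bsd-2-clause", is_clue=False)  # hypothetical rule name

report = split_matches([
    Match(url_rule, "http://www.ghostscript.com"),
    Match(spdx_rule, "https://spdx.org/licenses/bsd-2-clause"),
])
print(report)
```

With this shape, a bare website URL still shows up in the output, but it no longer counts as a license detection.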

@pombredanne
Member Author

This makes 100% sense... we have to tread lightly though.

  1. urls to github/other code repo licenses: example: http://github.com/dotnetcore/Util/blob/master/LICENSE or https://devshed.codeplex.com/license : ==> IMHO several are bona fide detection rules, not mere clues.
  2. links to websites like ghostscript like above: ==> this is the case for mere clues.
  3. unknown references like http://licenses.nuget.org/ : ==> for this bare URL, likely yes, but https://licenses.nuget.org/(LGPL-2.0-only WITH FLTK-exception OR Apache-2.0+) would need to be detected, possibly with a new matcher or by extending the matcher for SPDX license identifiers, and would in all cases not be a mere clue in a rule.
  4. links to github repos (and not licenses) like https://github.com/micahlmartin/OAuth2Provider : ==> agreed.
  5. links to github licenses which are unknown like: https://raw.github.com/markwoodhall/MFlow/master/license.txt : ==> it depends. In many cases these are well-known repos with stable licensing. Another possibility could be to have a step to fetch things at the URL end and detect that instead, but that is out of scope for this issue ;)
  6. references/words which are known to be a license, like: mupdf or ghostscript or Affero : ==> agreed.
  7. words that are license names but generic (?): like beerware or borceux : ==> beerware is surely a proper rule and not a mere clue; borceux would be a clue alright. So it really depends.
  8. hash values: example: md5=1a6d268fd218675ffea8be556788b780 is lgpl-2.1 : ==> this is borderline and could be a proper rule rather than a clue. Some thinking needed.
  9. abbreviations of licenses (could tags like gpl be also included in this?): like PSFL : ==> agreed. For the GPL one I think we would need special post-matching processing, possibly looking at case and mixed case. That would BTW mean that is_clue is an attribute of a license rule alright, BUT it could be overridden in a license match, and therefore licensematch should also have one IMHO.
  10. other unknown license references which are references to files/websites/packages and could not be resolved successfully : ==> agreed.
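
The point in item 9, that is_clue lives on the rule but may be overridden on an individual match, could look something like the sketch below. Everything here is hypothetical: the `Rule` and `LicenseMatch` classes, the `is_clue_override` field, the `gpl_tag.RULE` name and especially the case policy (only an all-uppercase tag keeps the rule's own setting) are assumptions for illustration, not a description of how ScanCode works or should work.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical license rule carrying the default is_clue flag."""
    identifier: str
    is_clue: bool = False


@dataclass
class LicenseMatch:
    """Hypothetical match: inherits is_clue from its rule unless overridden."""
    rule: Rule
    matched_text: str
    is_clue_override: Optional[bool] = None

    @property
    def is_clue(self) -> bool:
        if self.is_clue_override is not None:
            return self.is_clue_override
        return self.rule.is_clue


def downgrade_non_uppercase_tag(match: LicenseMatch) -> LicenseMatch:
    """Post-matching step (assumed policy): a short tag such as "gpl" or "Gpl"
    that is not fully uppercase becomes a clue; "GPL" keeps the rule's flag."""
    if not match.matched_text.strip().isupper():
        match.is_clue_override = True
    return match


gpl_tag_rule = Rule("gpl_tag.RULE")  # hypothetical rule name, not a clue by default

upper = downgrade_non_uppercase_tag(LicenseMatch(gpl_tag_rule, "GPL"))
lower = downgrade_non_uppercase_tag(LicenseMatch(gpl_tag_rule, "gpl"))
mixed = downgrade_non_uppercase_tag(LicenseMatch(gpl_tag_rule, "Gpl"))
print(upper.is_clue, lower.is_clue, mixed.is_clue)
```

Keeping the override on the match leaves the rule data untouched, so the same rule can yield a detection in one file and a clue in another.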

@rspier

rspier commented Aug 12, 2022

> treat and report separately mere clues such as this one: they could be an interesting insight in some cases, but alone they are too weak to be considered a license detection

This seems like a recipe for noise in the output. Or possibly it creates a need for more granular levels of clue (strong clue / weak clue). But I would probably lean towards "if it isn't actually a useful signal, it's not interesting". What are you going to do with the clues once you have them?

One of the challenges with these heuristics is context, or the lack thereof.

I had a case a few weeks ago where https://github.com/svaarala/duktape/blob/master/website/index/index.html got scanned.

It contains

Similar engines
There are multiple Javascript engines targeting similar use cases as Duktape, at least:

[Espruino](https://github.com/espruino/Espruino) (MPL v2.0)
[JerryScript](http://jerryscript.net/) (Apache License v2.0)
[MuJS](http://mujs.com/) (Affero GPL)
[quad-wheel](https://code.google.com/p/quad-wheel/) (MIT License)
[QuickJS](https://bellard.org/quickjs/) (MIT License)
[tiny-js](https://github.com/gfwilliams/tiny-js) (MIT license)
[v7](https://github.com/cesanta/v7) (GPL v2.0)

Triggering on a license name alone results in false positives for Duktape, even though this section is actually talking about other products.

This particular example is more complicated/subtle than most of the other examples in this bug, so might be a distraction, but it's still interesting.
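
To make the context problem concrete, here is a rough sketch of one possible heuristic: treat a license name that sits in parentheses right after a link as a statement about the linked project, not about the scanned file. It works only on the markdown-style excerpt quoted above (the real file is HTML and would need different parsing), and the `LINK_WITH_NOTE` pattern and `third_party_license_mentions` helper are made up for this example; nothing like this is known to exist in ScanCode.

```python
import re

# A markdown link immediately followed by a parenthesised note, e.g.
# "[MuJS](http://mujs.com/) (Affero GPL)". Hypothetical pattern for illustration.
LINK_WITH_NOTE = re.compile(
    r"\[(?P<name>[^\]]+)\]\((?P<url>[^)\s]+)\)\s*\((?P<note>[^)]+)\)"
)


def third_party_license_mentions(text):
    """Return (project name, note) pairs where the note directly follows a link,
    which suggests the license applies to the linked project."""
    return [(m.group("name"), m.group("note")) for m in LINK_WITH_NOTE.finditer(text)]


sample = """\
[MuJS](http://mujs.com/) (Affero GPL)
[v7](https://github.com/cesanta/v7) (GPL v2.0)
This file is under the MIT license.
"""

mentions = third_party_license_mentions(sample)
print(mentions)
```

Matches that fall inside such spans could then be reported as clues about other products, while the last line of the sample would still be matched as usual.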
