RFC: a plan for false positive license detection #2878

pombredanne · 2022-02-25T08:07:05Z

Context

We are reporting too many false positive licenses. We need to fix this!

Problem

There are several kinds of false positives, yet they boil down to these types:

1. False detection of very short and weak license detection rules that are matched exactly, such as:
   - a URL or a project name, such as a URL to a well-known AGPL-licensed project, which is not always a sign of AGPL, as in False positive AGPL detection from a mere URL #2877
   - the detection of the word GPL in a binary, as in Tracing "Start Line" of ScanCode report back to the Binary file. #2874
   - the detection of the longer phrase "may not be modified", as in False-positive proprietary-license finding in Guava source code #2865
2. Detection of a license text or notice fragment that is too weak to represent a bona fide license detection on its own.
3. Detection of longer unknown license references, such as:
   - a "license introduction" (as in "This is licensed under....") that may be noisy when followed by a bona fide license notice or text.
   - a reference to the license in a file (as in "See file COPYING for license") where we can follow the reference.
4. Lack of proper detection of a structured license tag found in a package manifest, which is returned as an unknown license.
5. Fragments of the same license detected with only copyrights added in between, as in license detection: Add the nunit license #2859
6. Sequences of SPDX license ids found in license detection tools.

Please add yours!

Solution elements

We could treat and report separately mere clues such as this one: they could be an interesting insight in some cases, but alone they are too weak to be considered a license detection
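
To make the idea of separate clues concrete, here is a minimal sketch. It assumes match records shaped like the `matched_rule` data that ScanCode reports (as in the scans later in this thread), with the yes/no flags already converted to booleans. The function name, the length threshold and the criteria are hypothetical illustrations, not ScanCode behavior.

```python
def split_clues(matches, min_rule_length=4):
    """Split matches into (detections, clues).

    A match is treated as a mere clue when its rule is very short, or when
    it is an unknown license reference or intro. The threshold and the
    criteria are illustrative assumptions, not ScanCode behavior.
    """
    detections, clues = [], []
    for match in matches:
        rule = match["matched_rule"]
        too_short = rule["rule_length"] < min_rule_length
        weak_unknown = rule["has_unknown"] and (
            rule["is_license_reference"] or rule["is_license_intro"]
        )
        (clues if too_short or weak_unknown else detections).append(match)
    return detections, clues


# Two records reduced to the fields used above.
matches = [
    {"key": "free-unknown", "matched_rule": {
        "rule_length": 3, "has_unknown": True,
        "is_license_reference": True, "is_license_intro": False}},
    {"key": "mpl-2.0", "matched_rule": {
        "rule_length": 6, "has_unknown": False,
        "is_license_reference": True, "is_license_intro": False}},
]
detections, clues = split_clues(matches)
print([m["key"] for m in detections], [m["key"] for m in clues])
```

Clues would still be reported, only in their own bucket, so the insight is not lost.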

The upcoming two-step process where license matches are grouped in a license detection is another way to consider. We could detect patterns of license matches that could be resolved in a detection. For instance a license intro followed by a license notice.
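
Another pattern that could be resolved at the detection step is the file reference from the problem list ("See file COPYING for license"). A minimal sketch, assuming a match carries a `referenced_filenames` list as in ScanCode's `matched_rule` data and assuming a hypothetical `detections_by_path` mapping from file path to the license expressions detected there. Looking only in the same directory is a simplification, and the sample data is invented.

```python
import posixpath


def resolve_reference(path, match, detections_by_path):
    """Return license expressions found in the files referenced by a match.

    `match` is a dict with a "referenced_filenames" list.
    `detections_by_path` is a hypothetical mapping of
    file path -> list of license expressions detected in that file.
    Only files in the same directory as `path` are considered.
    """
    directory = posixpath.dirname(path)
    resolved = []
    for name in match.get("referenced_filenames", []):
        candidate = posixpath.join(directory, name)
        for expression in detections_by_path.get(candidate, []):
            # Skip unknown expressions: they would not resolve anything.
            if "unknown" not in expression and expression not in resolved:
                resolved.append(expression)
    return resolved


match = {
    "license_expression": "unknown-license-reference",
    "referenced_filenames": ["LICENSE"],
}
# Invented detection results for illustration only.
detections = {"META-INF/LICENSE": ["apache-2.0"]}
print(resolve_reference("META-INF/NOTICE", match, detections))
```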

The scancode-analyzer heuristics and ML-based detection of false positives are another way.


porsche-rishisaxena · 2022-02-28T08:26:02Z

Hi Philippe,

Following our joint ORT community meeting two weeks ago, where we touched base on false-positive license detection in version v30.1.0, Porsche AG OSO has consolidated a report of false-positive cases. Please find the report attached for your reference and review.

report_false_positives.xlsx

CC: @sschuberth

PatteSI · 2022-02-28T18:44:24Z

Thank you for taking action here.
I will now have a deeper look into our false positive findings as well.
EDIT: I forgot to mention that everything I mention below was found using ScanCode 30.1.0.
At first glance it seems that many LicenseRef-scancode-free-unknown and LicenseRef-scancode-unknown-license-reference findings in our Java projects are actually found in META-INF/LICENSE files created by Maven inside the JARs. Example: https://repo1.maven.org/maven2/com/fasterxml/jackson/dataformat/jackson-dataformat-yaml/2.13.1/jackson-dataformat-yaml-2.13.1.jar
In this case line 3: "Jackson is a high-performance, Free/Open Source JSON processing library." and line 11-13: "Jackson core and extension components may be licensed under different licenses.
To find the details that apply to this artifact see the accompanying LICENSE file.
For more information, including possible other licensing options, contact"
The latter one would probably fall into point 3 mentioned above, a reference to another license file. No idea about the first one though. Not sure why they are only found in the JAR binary and not also in the actual source repo with the same text: https://github.com/FasterXML/jackson-dataformats-text/blob/2.14/properties/src/main/resources/META-INF/NOTICE

Another interesting example is okhttp3, because the false positive that is found was actually introduced by yourself @pombredanne ;-) : square/okhttp#4569. The current file in my example is this one: https://github.com/square/okhttp/blob/parent-5.0.0-alpha.3/okhttp/src/main/resources/okhttp3/internal/publicsuffix/NOTICE . The license LicenseRef-scancode-unknown-license-reference is found in line 4: "It is subject to the terms of the Mozilla Public License, v. 2.0:". I don't understand why MPL v. 2.0 is not recognized correctly in this case.

PatteSI · 2022-03-01T15:06:17Z

I created a Python parser that can parse the evaluated-model.json file created by the ORT Reporter.
It currently scans for a list of problematic LicenseRef-scancode license keys which mostly (always?) cause false positives: https://gist.github.com/PatteSI/5904f4bdfb149dc1ce8c73da53e2f6ae
I parsed a couple of our components and this is the result. Of course it still contains a lot of duplicates (it's a JSON file, but GitHub won't let me upload it as .json):
falsePosFindingFinal.txt
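
For readers who want to try something similar, here is a minimal sketch of that kind of parser. It is not the script from the gist. It makes no assumption about the layout of evaluated-model.json: it simply walks any nested JSON structure and reports where a wanted license id appears. The two ids are the ones named earlier in this thread; the sample document is invented.

```python
import json

PROBLEMATIC = {
    "LicenseRef-scancode-free-unknown",
    "LicenseRef-scancode-unknown-license-reference",
}


def find_license_ids(node, wanted, trail=()):
    """Yield (trail, value) for every string in `node` that is in `wanted`.

    Walks any nested dict/list structure, so it does not depend on the
    actual layout of the ORT file, which is not assumed here.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_license_ids(value, wanted, trail + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_license_ids(value, wanted, trail + (index,))
    elif isinstance(node, str) and node in wanted:
        yield trail, node


# Made-up document for illustration only; not the real ORT schema.
document = json.loads(
    '{"items": ['
    '{"path": "NOTICE", "license": "LicenseRef-scancode-free-unknown"},'
    '{"path": "LICENSE", "license": "Apache-2.0"}]}'
)
hits = list(find_license_ids(document, PROBLEMATIC))
print(hits)
```

Deduplicating the hits (for example with a set keyed on path and license id) would address the duplicates mentioned above.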

pombredanne · 2022-03-03T06:50:56Z

@porsche-rishisaxena Thank you ++ for the list of false positives in #2878 (comment) ... this is great and actionable!

pombredanne · 2022-03-03T07:07:22Z

@PatteSI re: #2878 (comment)

In this case line 3: "Jackson is a high-performance, Free/Open Source JSON processing library." and line 11-13: "Jackson core and extension components may be licensed under different licenses.

These two look like basic license-related clues, but they are not real license statements, all right.

Here is the detection I get:

headers:
    -   tool_name: scancode-toolkit
        tool_version: 31.0.0
        options:
            input:
                - jackson-dataformat-yaml-2.13.1.jar-extract/META-INF/NOTICE
            --license: yes
            --license-text: yes
            --license-text-diagnostics: yes
            --yaml: '-'
        notice: |
            Generated with ScanCode and provided on an "AS IS" BASIS, WITHOUT WARRANTIES
            OR CONDITIONS OF ANY KIND, either express or implied. No content created from
            ScanCode should be considered or used as legal advice. Consult an Attorney
            for any legal advice.
            ScanCode is a free software code scanning tool from nexB Inc. and others.
            Visit https://github.com/nexB/scancode-toolkit/ for support and download.
        start_timestamp: '2022-03-03T065555.808073'
        end_timestamp: '2022-03-03T065557.886174'
        output_format_version: 2.0.0
        duration: '2.078113555908203'
        message:
        errors: []
        extra_data:
            spdx_license_list_version: '3.16'
            files_count: 1
files:
    -   path: NOTICE
        type: file
        licenses:
            -   key: free-unknown
                score: '100.0'
                name: Free unknown license detected but not recognized
                short_name: Free unknown
                category: Unstated License
                is_exception: no
                is_unknown: yes
                owner: Unspecified
                homepage_url:
                text_url:
                reference_url: https://scancode-licensedb.aboutcode.org/free-unknown
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/free-unknown.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/free-unknown.yml
                spdx_license_key: LicenseRef-scancode-free-unknown
                spdx_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/free-unknown.LICENSE
                start_line: 3
                end_line: 3
                matched_rule:
                    identifier: free-unknown_85.RULE
                    license_expression: free-unknown
                    licenses:
                        - free-unknown
                    referenced_filenames: []
                    is_license_text: no
                    is_license_notice: no
                    is_license_reference: yes
                    is_license_tag: no
                    is_license_intro: no
                    has_unknown: yes
                    matcher: 2-aho
                    rule_length: 3
                    matched_length: 3
                    match_coverage: '100.0'
                    rule_relevance: 100
                matched_text: Free/Open Source
            -   key: unknown-license-reference
                score: '92.86'
                name: Unknown License file reference
                short_name: Unknown License reference
                category: Unstated License
                is_exception: no
                is_unknown: yes
                owner: Unspecified
                homepage_url:
                text_url:
                reference_url: https://scancode-licensedb.aboutcode.org/unknown-license-reference
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-license-reference.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-license-reference.yml
                spdx_license_key: LicenseRef-scancode-unknown-license-reference
                spdx_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-license-reference.LICENSE
                start_line: 11
                end_line: 13
                matched_rule:
                    identifier: unknown-license-reference_224.RULE
                    license_expression: unknown-license-reference
                    licenses:
                        - unknown-license-reference
                    referenced_filenames:
                        - LICENSE
                    is_license_text: no
                    is_license_notice: no
                    is_license_reference: yes
                    is_license_tag: no
                    is_license_intro: no
                    has_unknown: yes
                    matcher: 3-seq
                    rule_length: 28
                    matched_length: 26
                    match_coverage: '92.86'
                    rule_relevance: 100
                matched_text: |
                    licensed under different licenses.
                    To find the details that apply to this artifact see the accompanying LICENSE file.
                    For more information, including possible other licensing options,
        license_expressions:
            - free-unknown
            - unknown-license-reference
        percentage_of_license_text: '24.37'
        scan_errors: []

Not sure why they are only found in the JAR binary and not also in the actual source repo with the same text:

This is weird, and I got them the same way in both cases. Could it be ORT handling things differently in these cases?

Another interesting example is okhttp3 because the false positive that is found was actually introduced by yourself

Oh well.... as the saying goes, "no good deed goes unpunished!"

https://raw.githubusercontent.com/square/okhttp/parent-5.0.0-alpha.3/okhttp/src/main/resources/okhttp3/internal/publicsuffix/NOTICE scans this way:

headers:
    -   tool_name: scancode-toolkit
        tool_version: 31.0.0
        options:
            input:
                - NOTICE.1
            --license: yes
            --license-text: yes
            --license-text-diagnostics: yes
            --yaml: '-'
        notice: |
            Generated with ScanCode and provided on an "AS IS" BASIS, WITHOUT WARRANTIES
            OR CONDITIONS OF ANY KIND, either express or implied. No content created from
            ScanCode should be considered or used as legal advice. Consult an Attorney
            for any legal advice.
            ScanCode is a free software code scanning tool from nexB Inc. and others.
            Visit https://github.com/nexB/scancode-toolkit/ for support and download.
        start_timestamp: '2022-03-03T065940.353825'
        end_timestamp: '2022-03-03T065942.146784'
        output_format_version: 2.0.0
        duration: '1.792968988418579'
        message:
        errors: []
        extra_data:
            spdx_license_list_version: '3.16'
            files_count: 1
files:
    -   path: NOTICE.1
        type: file
        licenses:
            -   key: unknown-license-reference
                score: '60.0'
                name: Unknown License file reference
                short_name: Unknown License reference
                category: Unstated License
                is_exception: no
                is_unknown: yes
                owner: Unspecified
                homepage_url:
                text_url:
                reference_url: https://scancode-licensedb.aboutcode.org/unknown-license-reference
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-license-reference.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-license-reference.yml
                spdx_license_key: LicenseRef-scancode-unknown-license-reference
                spdx_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-license-reference.LICENSE
                start_line: 4
                end_line: 4
                matched_rule:
                    identifier: license-intro_3.RULE
                    license_expression: unknown-license-reference
                    licenses:
                        - unknown-license-reference
                    referenced_filenames: []
                    is_license_text: no
                    is_license_notice: no
                    is_license_reference: no
                    is_license_tag: no
                    is_license_intro: yes
                    has_unknown: yes
                    matcher: 2-aho
                    rule_length: 4
                    matched_length: 4
                    match_coverage: '100.0'
                    rule_relevance: 60
                matched_text: subject to the terms
            -   key: mpl-2.0
                score: '100.0'
                name: Mozilla Public License 2.0
                short_name: MPL 2.0
                category: Copyleft Limited
                is_exception: no
                is_unknown: no
                owner: Mozilla
                homepage_url: http://mpl.mozilla.org/2012/01/03/announcing-mpl-2-0/
                text_url: http://www.mozilla.com/MPL/2.0/
                reference_url: https://scancode-licensedb.aboutcode.org/mpl-2.0
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/mpl-2.0.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/mpl-2.0.yml
                spdx_license_key: MPL-2.0
                spdx_url: https://spdx.org/licenses/MPL-2.0
                start_line: 4
                end_line: 4
                matched_rule:
                    identifier: mpl-2.0_90.RULE
                    license_expression: mpl-2.0
                    licenses:
                        - mpl-2.0
                    referenced_filenames: []
                    is_license_text: no
                    is_license_notice: no
                    is_license_reference: yes
                    is_license_tag: no
                    is_license_intro: no
                    has_unknown: no
                    matcher: 2-aho
                    rule_length: 6
                    matched_length: 6
                    match_coverage: '100.0'
                    rule_relevance: 100
                matched_text: 'Mozilla Public License, v. 2.0:'
            -   key: mpl-2.0
                score: '50.0'
                name: Mozilla Public License 2.0
                short_name: MPL 2.0
                category: Copyleft Limited
                is_exception: no
                is_unknown: no
                owner: Mozilla
                homepage_url: http://mpl.mozilla.org/2012/01/03/announcing-mpl-2-0/
                text_url: http://www.mozilla.com/MPL/2.0/
                reference_url: https://scancode-licensedb.aboutcode.org/mpl-2.0
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/mpl-2.0.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/mpl-2.0.yml
                spdx_license_key: MPL-2.0
                spdx_url: https://spdx.org/licenses/MPL-2.0
                start_line: 5
                end_line: 5
                matched_rule:
                    identifier: spdx_license_id_mpl-2.0_for_mpl-2.0.RULE
                    license_expression: mpl-2.0
                    licenses:
                        - mpl-2.0
                    referenced_filenames: []
                    is_license_text: no
                    is_license_notice: no
                    is_license_reference: yes
                    is_license_tag: no
                    is_license_intro: no
                    has_unknown: no
                    matcher: 2-aho
                    rule_length: 3
                    matched_length: 3
                    match_coverage: '100.0'
                    rule_relevance: 50
                matched_text: MPL/2.0/
        license_expressions:
            - unknown-license-reference
            - mpl-2.0
            - mpl-2.0
        percentage_of_license_text: '33.33'
        scan_errors: []

You will note that I am using these command line options:
--license --license-text --license-text-diagnostics --yaml - which means: detect licenses, with the actual license text, but limited to the exact portion of text that was matched, and reported as YAML directly on screen (with the dash) rather than to a file.

Overall, this looks like a case where we could merge the license intro "subject to the terms" with a following notice.
Separately, there are also some missing rules that would help catch more of the MPL URL and more of the MPL details in general.
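
A minimal sketch of that merge, using the three matches from the NOTICE.1 scan above reduced to the fields that matter. It assumes the yes/no flags were converted to booleans; the function name and the one-line gap are hypothetical choices, not existing ScanCode behavior.

```python
def merge_intros(matches, max_line_gap=1):
    """Drop license intro matches that are directly followed by a known license.

    `matches` must be sorted by start_line. An intro is dropped when the next
    match is not unknown and starts within `max_line_gap` lines of the intro's
    end. Both the gap and the rule are assumptions for illustration.
    """
    kept = []
    for index, match in enumerate(matches):
        rule = match["matched_rule"]
        following = matches[index + 1] if index + 1 < len(matches) else None
        if (
            rule["is_license_intro"]
            and following is not None
            and not following["matched_rule"]["has_unknown"]
            and following["start_line"] - match["end_line"] <= max_line_gap
        ):
            continue  # the intro is absorbed by the following match
        kept.append(match)
    return kept


matches = [
    {"key": "unknown-license-reference", "start_line": 4, "end_line": 4,
     "matched_rule": {"is_license_intro": True, "has_unknown": True}},
    {"key": "mpl-2.0", "start_line": 4, "end_line": 4,
     "matched_rule": {"is_license_intro": False, "has_unknown": False}},
    {"key": "mpl-2.0", "start_line": 5, "end_line": 5,
     "matched_rule": {"is_license_intro": False, "has_unknown": False}},
]
print([m["key"] for m in merge_intros(matches)])
```

An intro that is not followed by a known license is kept, so it can still be reported as a clue.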

pombredanne · 2022-03-03T07:16:54Z

@PatteSI re: #2878 (comment)

I created a Python parser that can parse the evaluated-model.json file created by the ORT Reporter.

This is great! Ideally what I would need is a script that would fetch the code. With that I could run extractcode to extract any archive and run a scan to get the actual details. I think this can be derived from your JSON.
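
A minimal sketch of what such a script could plan for one archive. It only builds the commands and does not run them. Using `curl` for the download and a `work` directory are assumptions; the scancode options are the ones used earlier in this thread, and the `-extract` directory suffix is inferred from the scan input path shown above.

```python
import shlex
from pathlib import PurePosixPath
from urllib.parse import urlparse


def plan_rescan(url, workdir="work"):
    """Return the shell commands to fetch, extract and scan one archive.

    The commands are only built, not executed. `curl` is an assumption for
    the download step; the scancode options are the ones used in this thread.
    """
    filename = PurePosixPath(urlparse(url).path).name
    archive = PurePosixPath(workdir) / filename
    extracted = f"{archive}-extract"  # suffix inferred from the scan input above
    commands = [
        ["curl", "--location", "--output", str(archive), url],
        ["extractcode", str(archive)],
        ["scancode", "--license", "--license-text",
         "--license-text-diagnostics", "--yaml", "-", extracted],
    ]
    return [shlex.join(command) for command in commands]


url = ("https://repo1.maven.org/maven2/com/fasterxml/jackson/dataformat/"
       "jackson-dataformat-yaml/2.13.1/jackson-dataformat-yaml-2.13.1.jar")
for command in plan_rescan(url):
    print(command)
```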

pombredanne · 2022-03-03T07:19:43Z

@porsche-rishisaxena re: #2878 (comment)

The CSV is super useful and I can derive a script to automate rescanning from this too.

In your case and @PatteSI's case, creating this data required a lot of (useful) work.
I am wondering what tools would make it easier for you to report these false positives.

sschuberth · 2022-03-03T07:27:26Z

Could it be ORT handling things differently in these cases?

ORT is not handling findings in binaries or sources differently per se, and is taking ScanCode findings mostly as-is (except some post-processing to remedy #2873). But it might be that some project-specific path excludes were applied in that particular case.

sschuberth · 2022-03-03T07:29:57Z

I am wondering what tools would make it easier for you to report these false positives.

For ORT, if false-positives were addressed via package configurations, we could quite easily extract the detected_license vs. the concluded_license.
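
A minimal sketch of that comparison. The record layout (`id`, `detected_licenses`, `concluded_licenses` as flat lists) is a hypothetical simplification; real ORT package configurations and concluded license expressions may be structured differently, and the sample record is invented.

```python
def false_positive_candidates(packages):
    """Yield (package id, licenses detected but absent from the conclusion).

    Each package is a dict with hypothetical "id", "detected_licenses" and
    "concluded_licenses" keys; the real ORT data layout may differ.
    """
    for package in packages:
        detected = set(package["detected_licenses"])
        concluded = set(package["concluded_licenses"])
        dropped = sorted(detected - concluded)
        if dropped:
            yield package["id"], dropped


# Invented record for illustration only.
packages = [
    {
        "id": "example:lib:1.0",
        "detected_licenses": ["Apache-2.0", "LicenseRef-scancode-free-unknown"],
        "concluded_licenses": ["Apache-2.0"],
    },
]
print(list(false_positive_candidates(packages)))
```

A detected license missing from the conclusion is only a candidate: a reviewer may have dropped it for reasons other than a false positive.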

@fviernau, is that something that could be done from HERE's (probably massive) amount of package configurations?

pombredanne · 2022-03-03T08:37:04Z

@fviernau

is that something that could be done from HERE's (probably massive) amount of package configurations?

If there is something that can be shared, it could be used to fix some of these false positives at scale! :)

pombredanne · 2022-03-05T21:49:31Z

Here are some related issues:

Is this a false positive? a long list of license ids in code that is license-related #270 reported by @yahalom5776
False positive AGPL detection from a mere URL #2877
False-positive proprietary-license finding in Guava source code #2865 by @sschuberth and @PatteSI
False positive for LicenseRef-scancode-elastic-license-2018 #2815 by @sschuberth
SCTK 30.1.0 detects classpath-exception-2.0 based only on word "classpath" in Java comments #2769 by @mjherzog (which contains unknown words interspersed between the words of a license name)
False positive license? "may not be modified" in SQLIte source code #2735
ERROR: failed to run post-scan plugin: consolidate: and False positive on very long lines #2726
Improve false positive license detection for license lists #2651
Wrong spdx detection for file generator #2502 by @tardyp
Code wrongly detected as gpl-1.0 #2371 @xu1119
Discard matches to single GPL word and other very short rules with mixed, non-matching case and/or in a binary an/or not on a single line and/or in giberish #2403
gpl-1.0-plus false positives #2374
C source code line mistaken for MPL licence #2304 by @Thalley
Duplicates in license detection result #2170 by @qduanmu
License expression of 'GPLv3+ and LGPLv3+' is incorrect #1895
license_expression for 'ASL 2.0' is unknown #1731

These ones can likely be fixed with the new key phrases feature:

These could help with the diagnostic false positive:

This is an example of a weak detection for a new license:

Wrong detection for Apple MFi License as apple-excl #2503 by @tardyp

This may help with some false positives:

@rspier ping too

I think we should have a live call to discuss the options to fix these. What do you think?

bennati · 2022-03-07T10:10:19Z

@pombredanne I attach the false positives from a bunch of HERE curations, as produced by @PatteSI 's script.
Hope this helps,
falsepositives.txt

sschuberth · 2022-03-08T07:50:36Z

I think we should have a live call to discuss the options to fix these. What do you think?

To be frank, I believe having a live call with all reporters of the false positives mentioned here would be overkill. Also, I guess most people don't care too much how their issue is fixed as long as it is fixed.

From my side, however, I'd strongly vote against hard-coding just the reported cases as false-positives. Instead, we should

* ensure that rules always contain enough words / context to confidently identify licenses in general.
* never allow a score of 100% for unknown licenses.
* think about tweaking the score to be based on user feedback instead of being calculated: If a rule reportedly causes many false-positives, its score could be manually lowered.
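
The last two points could be sketched as follows. In the scans earlier in this thread the score appears to equal match_coverage times rule_relevance divided by 100 (for example, coverage 100.0 with relevance 60 gives 60.0), and the sketch assumes that relationship. The cap value and the `feedback_relevance` parameter are hypothetical.

```python
def adjusted_score(match_coverage, rule_relevance, is_unknown,
                   feedback_relevance=None, unknown_cap=99.0):
    """Compute a match score with a cap for unknown licenses.

    `feedback_relevance`, when given, replaces the rule's own relevance
    (a manually lowered value for a rule reported as noisy).
    `unknown_cap` is an arbitrary illustrative ceiling.
    """
    relevance = rule_relevance if feedback_relevance is None else feedback_relevance
    score = round(match_coverage * relevance / 100, 2)
    if is_unknown:
        score = min(score, unknown_cap)
    return score


# free-unknown match from the NOTICE scan: full coverage, relevance 100.
print(adjusted_score(100.0, 100, is_unknown=True))
# Same match after its rule was manually lowered based on user feedback.
print(adjusted_score(100.0, 100, is_unknown=True, feedback_relevance=30))
```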

armijnhemel · 2022-03-08T10:03:22Z

From my side, however, I'd strongly vote against hard-coding just the reported cases as false-positives. Instead, we should

* ensure that rules always contain enough words / context to confidently identify licenses in general.

* never allow a score of 100% for unknown licenses.

* think about tweaking the score to be based on user feedback instead of being calculated: If a rule reportedly causes many false-positives, its score could be manually lowered.

Be careful to not fall into the "perfect is the enemy of good" trap. If trying to avoid the false positives from happening in the first place significantly complicates the code (making it harder to maintain/change/etc.) then I don't see a problem with hardcoding the false positives.

But this depends on how many of the results are false positives. @pombredanne do you have an idea of the scale of false positives? How many results are false positives? 1%? 10%? 0.0000001%?

sschuberth · 2022-03-16T08:59:28Z

But this depends on how many of the results are false positives.

That's exactly the point. Based on feedback from ORT users, the false-positive rate for at least the "free-unknown", "unknown-license-reference" and "proprietary-license" license keys seems to be close to 100%.

pombredanne · 2022-03-16T09:25:05Z

@sschuberth re:

That's exactly the point. Based on feedback from ORT users, the false-positive rate for at least the "free-unknown", "unknown-license-reference" and "proprietary-license" license keys seems to be close to 100%.

My hunch and anecdotal evidence is that this is surely not close to 100% by a large margin. But I may be wrong.
That's why the hard data input is key here.
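
A minimal sketch of how such hard data could be turned into a rate per license key, assuming each finding has been manually reviewed and labeled. The sample findings are invented and say nothing about the real rates.

```python
from collections import Counter


def false_positive_rates(findings):
    """Return {license_key: false-positive rate} from reviewed findings.

    Each finding is a (license_key, is_false_positive) pair produced by a
    manual review; the sample data below is invented.
    """
    totals, false_positives = Counter(), Counter()
    for key, is_false_positive in findings:
        totals[key] += 1
        if is_false_positive:
            false_positives[key] += 1
    return {key: false_positives[key] / totals[key] for key in totals}


findings = [
    ("free-unknown", True),
    ("free-unknown", True),
    ("proprietary-license", True),
    ("proprietary-license", False),
]
print(false_positive_rates(findings))
```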

PatteSI · 2022-03-18T10:16:20Z

@sschuberth re:

That's exactly the point. Based on feedback from ORT users, the false-positive rate for at least the "free-unknown", "unknown-license-reference" and "proprietary-license" license keys seems to be close to 100%.

My hunch and anecdotal evidence is that this is surely not close to 100% by a large margin. But I may be wrong. That's why the hard data input is key here.

I could not identify a single "true" finding in our projects so that there was an actual hint to an "unknown license". You have 2 big text files here in this thread from 2 different persons with potential false positive findings that are categorized as "unknown" license references. I challenge you to show me a real finding in those files that are actual true hints to an unknown license behind some of those results.

armijnhemel · 2022-03-18T12:10:30Z

@sschuberth re:

That's exactly the point. Based on feedback from ORT users, the false-positive rate for at least the "free-unknown", "unknown-license-reference" and "proprietary-license" license keys seems to be close to 100%.

My hunch and anecdotal evidence is that this is surely not close to 100% by a large margin. But I may be wrong. That's why the hard data input is key here.

I could not identify a single "true" finding in our projects so that there was an actual hint to an "unknown license". You have 2 big text files here in this thread from 2 different persons with potential false positive findings that are categorized as "unknown" license references. I challenge you to show me a real finding in those files that are actual true hints to an unknown license behind some of those results.

I guess that there might be an interpretation issue here about what "unknown" means. Is it "not a known open source license" or "a license that ScanCode couldn't determine"? Correct me if I am wrong, but I think that you mean the former. For me it is definitely the latter. Could you clarify?

Depending on what is meant there are different solutions (if there are). It might be good to look at what other scanners did. A good example is Ninka, which is no longer maintained, but which I used extensively quite a few years ago. The goal of Ninka was not to detect as many licenses as possible, but to detect them with high fidelity. If Ninka wasn't very sure about a license, it would throw its hands up and say "I don't know" and report the license as "unknown". FOSSology on the other hand would report a license, but could be completely wrong for those files.

So what it in my opinion comes down to: do you want to have licenses reported with high fidelity, at the cost of a bigger number of "unknown", or do you want to have a license reported with lower fidelity but very few "unknown"?
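
The trade-off can be pictured as a single threshold. This is a minimal sketch with a hypothetical function and an arbitrary default, not how Ninka, FOSSology or ScanCode actually decide.

```python
def report_license(license_key, score, min_score=90.0):
    """Return the license key only when the score is high enough.

    Below `min_score` the result is "unknown". Raising the threshold trades
    more "unknown" results for fewer wrong licenses; 90.0 is arbitrary.
    """
    return license_key if score >= min_score else "unknown"


# The two mpl-2.0 matches from the NOTICE.1 scan scored 100.0 and 50.0.
print(report_license("mpl-2.0", 100.0))
print(report_license("mpl-2.0", 50.0))
print(report_license("mpl-2.0", 50.0, min_score=40.0))
```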

porsche-rishisaxena · 2022-03-18T14:30:24Z

Hi @pombredanne
We have found further false-positive license detections where ScanCode reported a blank SPDX expression, but a manual check showed that the actual license was present in the library on GitHub. Please find the report attached to this thread for your review.
Note: This time ScanCode did not even report "Unknown"; the result was just blank.

NoLicenseDetection-report.xlsx

CC: @sschuberth

pombredanne · 2022-03-18T15:32:51Z

I could not identify a single "true" finding in our projects so that there was an actual hint to an "unknown license". You have 2 big text files here in this thread from 2 different persons with potential false positive findings that are categorized as "unknown" license references. I challenge you to show me a real finding in those files that are actual true hints to an unknown license behind some of those results.

@PatteSI Thank you ++ that's super valuable input.

pombredanne · 2022-03-18T15:33:18Z

@porsche-rishisaxena re:

We have found further false-positive license detections where ScanCode reported a blank SPDX expression, but a manual check showed that the actual license was present in the library on GitHub. Please find the report attached to this thread for your review.
Note: This time ScanCode did not even report "Unknown"; the result was just blank.

Thanks. Super useful too.

PatteSI · 2022-03-18T16:12:49Z

@sschuberth re:

That's exactly the point. Based on feedback from ORT users, the false-positive rate for at least the "free-unknown", "unknown-license-reference" and "proprietary-license" license keys seems to be close to 100%.

My hunch and anecdotal evidence is that this is surely not close to 100% by a large margin. But I may be wrong. That's why the hard data input is key here.

I could not identify a single "true" finding in our projects so that there was an actual hint to an "unknown license". You have 2 big text files here in this thread from 2 different persons with potential false positive findings that are categorized as "unknown" license references. I challenge you to show me a real finding in those files that are actual true hints to an unknown license behind some of those results.

I guess that there might be an interpretation issue here about what "unknown" means. Is it "not a known open source license" or "a license that ScanCode couldn't determine"? Correct me if I am wrong, but I think that you mean the former. For me it is definitely the latter. Could you clarify?

Depending on what is meant there are different solutions (if there are). It might be good to look at what other scanners did. A good example is Ninka, which is no longer maintained, but which I used extensively quite a few years ago. The goal of Ninka was not to detect as many licenses as possible, but to detect them with high fidelity. If Ninka wasn't very sure about a license, it would throw its hands up and say "I don't know" and report the license as "unknown". FOSSology on the other hand would report a license, but could be completely wrong for those files.

So what it in my opinion comes down to: do you want to have licenses reported with high fidelity, at the cost of a bigger number of "unknown", or do you want to have a license reported with lower fidelity but very few "unknown"?

It's not about "open source" licenses. I am pretty sure we can detect almost all known open source licenses. It's about "hints" to "unknown" (usually proprietary) licenses. As far as I know there is no standard on what wording has to be used in a source code file in order to place it under some arbitrary license. I am not even sure if one has to use the word "license" in order to do so. So basically if some troll wanted to place certain parts of code under a proprietary license while the rest of the project is under a different known license, he could do that with some wording or weird character encoding that hides this section from automatic detection. So I would argue that we always have to make a trade-off here if we want to talk about "unknown" licenses, as it will never be possible to reach 100%. Like you said, we need to rely on heuristics that hopefully will trigger on the wording that someone uses when announcing a proprietary license (not yet available in any database) while having a high fidelity in such findings. In the end the end user should be able to decide how many of those "unknown" hints he wants to have. Some projects require very high fidelity on their license usage while others don't and also do not have the capacity to check that many findings.

armijnhemel · 2022-03-18T16:16:50Z

@sschuberth re:

That's exactly the point. Based on feedback from ORT users, the false-positive rate for at least the "free-unknown", "unknown-license-reference" and "proprietary-license" license keys seems to be close to 100%.

My hunch and anecdotal evidence is that this is surely not close to 100% by a large margin. But I may be wrong. That's why the hard data input is key here.

I could not identify a single "true" finding in our projects so that there was an actual hint to an "unknown license". You have 2 big text files here in this thread from 2 different persons with potential false positive findings that are categorized as "unknown" license references. I challenge you to show me a real finding in those files that are actual true hints to an unknown license behind some of those results.

I guess that there might be an interpretation issue here about what "unknown" means. Is it "not a known open source license" or "a license that Scancode couldn't determine"? Correct me if I am wrong, but I think that you mean the former. For me it is definitely the latter. Could you clarify?
Depending on what is meant there are different solutions (if there are any). It might be good to look at what other scanners did. A good example is Ninka, which is no longer maintained, but which I used extensively quite a few years ago. The goal of Ninka was not to detect as many licenses as possible, but to detect them with high fidelity. If Ninka wasn't very sure about a license, it would throw its hands up and say "I don't know" and report the license as "unknown". FOSSology on the other hand would report a license, but could be completely wrong for those files.
So what it comes down to, in my opinion: do you want to have licenses reported with high fidelity, at the cost of a bigger number of "unknown", or do you want to have a license reported with lower fidelity but very few "unknown"?

It's not about "open source" licenses. I am pretty sure we can detect almost all known open source licenses. It's about "hints" to "unknown" (usually proprietary) licenses. As far as I know there is no standard on what wording has to be used in a source code file in order to place it under some arbitrary license. I am not even sure if one has to use the word "license" in order to do so. So basically, if some troll wanted to place certain parts of the code under a proprietary license while the rest of the project is under a different known license, he could do that with some wording or weird character encoding that obfuscates this section from automatic detection. So I would argue that we always have to make a trade-off here if we want to talk about "unknown" licenses, as it will never be possible to get to 100%. Like you said, we need to rely on heuristics that will hopefully trigger on the wording someone uses when announcing a proprietary license (not yet available in any database) while having a high fidelity in such findings. In the end, the end user should be able to decide how many of those "unknown" hints he wants to have. Some projects require very high fidelity on their license usage, while others don't and also do not have the capacity to check that many findings.

So the core question really is: what do you think "unknown license" means? Is it "there is a license but scancode doesn't know which one because it is not in its knowledgebase" (whether or not it is open or closed) or "scancode couldn't detect which license it is and threw its hands up"? This is conceptually a big difference.

PatteSI · 2022-03-23T10:04:09Z

So the core question really is: what do you think "unknown license" means? Is it "there is a license but scancode doesn't know which one because it is not in its knowledgebase" (whether or not it is open or closed) or "scancode couldn't detect which license it is and threw its hands up"? This is conceptually a big difference.

I think this discussion is a bit of a deviation from the original problem here. It doesn't matter what anyone thinks "unknown license" means. We are discussing how the heuristics could be improved and ways to give the end users more options to evaluate findings. I am talking here as an end user of ORT, which is using ScanCode as a scanner. We have now started to migrate away from NexusIQ to ORT and we sometimes see hundreds of those "unknown-license" findings in big projects. There are many examples of trivial findings where the heuristics/rules used in ScanCode are just too broad and get triggered by simple comments using the word "license". We are discussing the general problem here of how to improve the rule-based findings. There are many examples given in the first post. It's not only about "unknown" licenses.

pombredanne · 2022-03-26T08:30:25Z

I have attached a presentation to better grasp a summary of the issue:

ScanCode-licenses-false-positive-2022-03.pptx.pdf

pombredanne · 2022-03-26T08:52:29Z

@richardfontana I would be interested to get some feedback too
@opensourcepilot I was re-reading your (thought-provoking) article in https://opensource.com/article/21/7/open-source-scanning-error and we are trying to fix false positive license detections in ScanCode with this issue. Your insights would be much appreciated!

richardfontana · 2022-03-26T15:34:49Z

@sutula may find this of interest

alext34ms · 2022-03-28T07:20:48Z

On the topic of making the current rule set a bit more stringent:
The feature of making certain tokens required by putting them in {{ }} is an awesome tool to sharpen SCTK even more.
The challenge is that we have 31.000+ rules. But that should not stop us. 😉
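
For readers who have not used the feature, here is a minimal Python sketch of the idea behind `{{ }}`: a rule only counts as matched if every phrase wrapped in double braces is present in the matched text. This is an illustration under assumptions, not how ScanCode implements it; the helper names `required_phrases` and `satisfies_required` are made up and the tokenizer is deliberately naive.

```python
import re

# Text between {{ and }} in a rule is treated as a required phrase.
REQUIRED = re.compile(r"\{\{(.+?)\}\}", re.DOTALL)


def tokens(text):
    """Naive tokenizer: lowercase alphanumeric words only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def required_phrases(rule_text):
    """Return each {{required phrase}} of a rule as a list of tokens."""
    return [tokens(phrase) for phrase in REQUIRED.findall(rule_text)]


def contains(haystack, needle):
    """True if the token list needle occurs contiguously in haystack."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def satisfies_required(rule_text, candidate_text):
    """True if the candidate contains every required phrase of the rule."""
    candidate = tokens(candidate_text)
    return all(contains(candidate, phrase) for phrase in required_phrases(rule_text))
```

With a rule such as `Licensed under the {{Apache License}}, version 2.0`, a text that drops the words "Apache License" no longer qualifies, even if most of the other words still match.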

Proposal: doing an "automated" retro-fit of all rules to include SPDX identifiers in {{ }}.
I made a very KISS "one liner" that does just that. The result seems quite OK when looking at some sample rules.

Some thoughts for this update:

- How does it affect AND/OR rules?
- How does it affect (false-positive*) rules and what are those even?
- Does it add any accuracy?
- Does it pass integration tests?
- Any others that I have missed?

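# Run from the directory that holds the .RULE files
cd src/scancode-toolkit/src/licensedcode/data/rules
# Walk the SPDX identifiers in reverse file order (tac)
for identifier in `tac ~/tmp/spdx_identifier.list`;
do
  echo $identifier;
  # Only touch rules where the identifier is not glued to other capital letters
  for rule in `egrep -l '([^A-Z]|^)('$identifier')([^A-Z]|$)' *.RULE`;
  do
    # Wrap the identifier in {{ }}, skipping occurrences already wrapped in braces
    sed -i -E s/\(\[\^A-Z\{\{\\]\|\^\)\($identifier\)\(\[\^A-Z\}\}\]\|\$\)/\\1\{\{$identifier\}\}\\3/g $rule;
  done;
done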

spdx_identifier.list

The one-liner above changes about 6,400 of the 31,000 rules.

pombredanne · 2022-03-29T14:07:17Z

@alext34ms re:

doing an "automated" retro-fit of all rules to include SPDX identifiers in {{ }}.
I made a very KISS "one liner" that does just that. The result seems quite OK when looking at some sample rules.

Sleek! very smart. I like it

How does it affect AND/OR rules?

I do not think it would have any impact.

How does it affect (false-positive*) rules and what are those even?

These should be left alone. These are rules that can be matched only exactly and that are about licenses but are NOT license notices or texts. They should be used sparingly as a last resort. For instance, this text:

copyright info have been adapted to avoid the violation of the GPL license

is NOT a GPL-related notice, but some commentary about the GPL license and would be a typical case for a "false positive" rule.
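
As a rough illustration of "matched only exactly", here is a minimal Python sketch. It is an assumption-laden stand-in: the set `FALSE_POSITIVE_TEXTS` and the helpers `normalize` and `is_known_false_positive` are hypothetical, and the sketch says nothing about how the rules are actually stored or indexed.

```python
import re


def normalize(text):
    """Lowercase and collapse punctuation and whitespace to single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


# Stand-in for a collection of "false positive" rule texts.
FALSE_POSITIVE_TEXTS = {
    normalize(
        "copyright info have been adapted to avoid the violation of the GPL license"
    ),
}


def is_known_false_positive(matched_text):
    """True only if the whole text equals a known false positive text."""
    return normalize(matched_text) in FALSE_POSITIVE_TEXTS
```

Because the comparison is on the whole normalized text, a longer notice that merely contains the same words is not filtered out, which is one reason such rules can only cover specific, known cases.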

Does it add any accuracy?

It should, but it may also degrade detection and miss some matches in a few corner cases. These could be caught separately by the --unknown-licenses option though.

Does it pass integration tests?

It will likely make some fail.

Any others that I have missed?

I think the approach could be refined using a Python script, since we have code that handles the RULEs and knows all SPDX licenses, and we could also expand this to a few more things:

- SPDX ids, as you suggest
- the license short and long name
- potentially the SPDX names, and URL

Some example scripts to use as a base are in https://github.com/nexB/scancode-toolkit/blob/develop/etc/scripts/licenses/
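
To give an idea of the shape such a script could take, here is a minimal Python sketch of the wrapping step only. It works on plain strings; loading the RULE files and the list of SPDX ids and license names is left out on purpose, and the `wrap_required` helper is a made-up name, not an existing ScanCode function. Unlike the shell one-liner, it escapes regex metacharacters such as `.` and `+` in the identifiers and processes longer phrases first.

```python
import re


def wrap_required(text, phrases):
    """Wrap standalone occurrences of each phrase in {{ }}.

    Occurrences that are already wrapped, or that are part of a longer
    identifier, are left alone.
    """
    # Longest first, so "GPL-2.0-or-later" is handled before "GPL-2.0".
    for phrase in sorted(phrases, key=len, reverse=True):
        pattern = re.compile(
            # Not preceded by an identifier character or an opening brace.
            r"(?<![A-Za-z0-9.+{-])"
            + re.escape(phrase)
            # Not followed by an identifier character, a closing brace,
            # or a dot that continues a version number.
            + r"(?![A-Za-z0-9+}-]|\.[0-9])"
        )
        text = pattern.sub(lambda m: "{{" + m.group(0) + "}}", text)
    return text
```

The same function could be fed license short names, long names or URLs instead of SPDX ids. Running it twice on the same text is intended to be a no-op, which matters if the retrofit is re-run as new rules are added.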

pombredanne · 2022-05-12T07:58:24Z

@vargenau Another case to track here #2905

borisbaldassari · 2022-05-17T09:04:52Z

Hi Philippe, all,

First things first: thanks for the good work people! You're great!
I'd like to contribute some feedback regarding false positives (/wrong license) we get at the Eclipse Foundation. One of our main issues is with the canonical headers used with the EPL-2.0.

/*********************************************************************
 * Copyright (c) 2019 Red Hat, Inc.
 *
 * This program and the accompanying materials are made
 * available under the terms of the Eclipse Public License 2.0
 * which is available at https://www.eclipse.org/legal/epl-2.0/
 *
 * SPDX-License-Identifier: EPL-2.0
 **********************************************************************/

Some lines (typically {4,5}) are recognised as LicenseRef-scancode-unknown-license-reference even with the SPDX tag sitting right behind. It seems that another license text (license-intro_29.RULE) is matched before the EPL-2.0 text, so even adding several variants of the header (line ends, etc.) doesn't help. Setting the license-score to 100 helps a bit (i.e. fewer wrong violations), but still not enough to make this case go away.
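
One way to picture a fix for this case is the "license intro followed by a license notice" pattern from the issue description: group matches that sit close together, then drop the intro when the same group also holds a proper license. The Python sketch below is only an illustration under assumptions. The `Match` record, the `is_intro` flag, the `max_gap` value and the line numbers are hypothetical, and this is not the ScanCode implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """Hypothetical, simplified license match (not ScanCode's real model)."""
    license_key: str
    start_line: int
    end_line: int
    is_intro: bool = False


def group_matches(matches, max_gap=4):
    """Group matches that start within max_gap lines of the previous end."""
    groups, current, last_end = [], [], None
    for match in sorted(matches, key=lambda m: m.start_line):
        if current and match.start_line - last_end > max_gap:
            groups.append(current)
            current = []
        last_end = max(last_end, match.end_line) if current else match.end_line
        current.append(match)
    if current:
        groups.append(current)
    return groups


def resolve(groups):
    """Drop intro matches from groups that also contain a proper license."""
    resolved = []
    for group in groups:
        proper = [m for m in group if not m.is_intro]
        # A lone intro is kept, since nothing better explains it.
        resolved.append(proper if proper else group)
    return resolved
```

For the header above, an intro match on lines 4 to 5 and an `epl-2.0` match on the SPDX line would fall into one group and resolve to `epl-2.0` alone, while an intro with no license nearby would still be reported.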

Would it be helpful to provide a list of false positives / wrong identifications? I'll be happy to provide one if so.

AyanSinhaMahapatra · 2022-05-17T18:15:24Z

@borisbaldassari Thanks for your feedback and report.

We are working on this, and this specific issue of a license_intro being present before detections is going to be fixed. This is WIP and hasn't landed yet.

Would it be helpful to provide a list of false positives / wrong identifications? I'll be happy to provide one if so.

This would be extremely helpful. We will use it to test this new feature extensively, as we are doing with the other lists contributed here. Thanks a lot!

borisbaldassari · 2022-05-18T14:14:06Z

Hi @AyanSinhaMahapatra Thanks for the heads-up! I'll wait for the landing and give it a try. :-)

Please find below a list of unknown-license false positives found in a few Eclipse projects (Che, JGit, CDT, Tycho). If needed I can analyse more projects -- but since we're using ORT we don't have direct access to the scancode output, so I need to run it separately (and manually).

scancode_fp_eclipse.tar.gz

borisbaldassari · 2022-05-18T14:43:53Z

Please also find attached the Python script used to generate the csv's, if it's useful.

extract_false_positives.py.tar.gz

pombredanne · 2022-05-20T04:58:38Z

@borisbaldassari Thank you ++

pombredanne · 2022-05-23T21:03:16Z

Another short SSPL false positive #2975

PatteSI · 2023-08-02T10:00:58Z

As @pombredanne also asked in my initial issue for a list of false positives, I just wanted to mention that the ORT community has also started sharing curations for false positives. I guess ScanCode is still one of the most widely used scanning components in ORT, so they might all be relevant for you. Check out their curations and package configurations: https://github.com/oss-review-toolkit/ort-config

pombredanne added the bug label Feb 25, 2022
pombredanne mentioned this issue Feb 25, 2022: license detection: Add the nunit license #2859 (Open)
pombredanne added the improve-license-detection label Feb 25, 2022
pombredanne mentioned this issue Feb 25, 2022: False-positive proprietary-license finding in Guava source code #2865 (Open)
pombredanne mentioned this issue Mar 2, 2022: RFC: Revamp "unknown" license detection #1675 (Closed)
PatteSI mentioned this issue Mar 3, 2022: Reports should show the license scores for findings oss-review-toolkit/ort#5128 (Closed)
AyanSinhaMahapatra mentioned this issue Mar 31, 2022: False positive email detection #2810 (Open)
pombredanne mentioned this issue Mar 31, 2022: Many duplicates in SPDX files #2905 (Closed)
AyanSinhaMahapatra mentioned this issue May 17, 2022: Combine license matches in new LicenseDetection #2961 (Merged)
AyanSinhaMahapatra mentioned this issue Jul 13, 2022: [RFC] Use new license key undetected-license #3021 (Closed)
AyanSinhaMahapatra mentioned this issue Aug 4, 2022: Consider CLA's as license clues? #3038 (Closed)
pombredanne mentioned this issue Oct 28, 2022: GPL-2.0-only detected in GPL-2.0-or-later #3128 (Open)
pombredanne mentioned this issue Jan 4, 2023: Proposal for avoiding false positives #1838 (Closed)
AyanSinhaMahapatra mentioned this issue Feb 15, 2023: Add required phrase rules automatically #3254 (Closed)
AyanSinhaMahapatra mentioned this issue Mar 26, 2023: Reduce license detection false positives #3300 (Open)
AyanSinhaMahapatra mentioned this issue Aug 2, 2023: Use curations from ORT to check license detections #3481 (Open)
AyanSinhaMahapatra mentioned this issue Apr 26, 2024: Add new Apache or MIT license rule #3738 #3750 (Merged)
pombredanne mentioned this issue Jul 3, 2024: MIT license not detected in package.json #3843 (Open)
AyanSinhaMahapatra mentioned this issue Sep 17, 2024: Update rules with required phrases automatically #3924 (Open)
