Derive license from info.license over classifiers in pypi registry data #586

Merged: 2 commits, Sep 26, 2024

Conversation

qtomlinson
Collaborator

@qtomlinson qtomlinson commented Jul 9, 2024

Based on the Python Packaging User Guide, the license field appears to hold more specific information than the classifiers. This change parses the license field before the classifiers. When the classifier carries version information and the license field does not, the classifier is used first.

Also updated the toolVersion in pypiExtract.

Task: #523
Task: #519
Task: #429
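
A minimal sketch of the precedence rule described above, written in Python for illustration only. The function names `has_version` and `choose_license_source`, and the rule that "version info" means "contains a digit", are assumptions; the PR's actual implementation is not shown in this thread.

```python
import re

# Assumption: a license string "has version info" if it contains any digit.
_VERSION_RE = re.compile(r"\d")


def has_version(text):
    """Return True if the (possibly missing) license text mentions a version."""
    return bool(text) and bool(_VERSION_RE.search(text))


def choose_license_source(license_field, classifier_license):
    """Pick the registry field to derive the license from.

    Returns "license", "classifiers", or None when neither field is set.
    """
    if license_field and classifier_license:
        # Prefer the classifier only when it is versioned and info.license is not.
        if has_version(classifier_license) and not has_version(license_field):
            return "classifiers"
        return "license"
    if license_field:
        return "license"
    if classifier_license:
        return "classifiers"
    return None
```

With this sketch, a versioned `info.license` such as `LGPL-2.1-or-later` wins over a versioned classifier, while an unversioned `info.license` yields to a versioned classifier.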

Based on the Python Packaging User Guide (https://packaging.python.org/en/latest/specifications/core-metadata/#license), it appears that the license field holds more specific information. Change to parsing the license field before the classifiers. When the classifier has the version info and the license field does not, use the classifier first.
@qtomlinson
Collaborator Author

qtomlinson commented Aug 14, 2024

@Jeffrey-Luszcz @elrayle Here is the file containing the extracted license information for 1000 pypi components. The structure of each entry is as follows:

{
    "coordinates": "clearlydefined coordinates",
    "info.license": "info.license field from pypi registry data",
    "info.classifiers": "license extracted from info.classifiers field in pypi registry data"
}

@qtomlinson
Collaborator Author

qtomlinson commented Aug 15, 2024

@Jeffrey-Luszcz Please find a more detailed JSON file with extracted SPDX licenses from the specified fields here: licenses3.json. In 150 instances, the non-empty SPDX licenses from the info.license field differ from those in the info.classifiers field. It appears that the licenses from the info.license field provide more detailed information.
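
A rough sketch of how the differing entries could be counted, in Python. It assumes the file is a list of entries using the keys seen in the entries quoted later in this thread (`spdxFomClassifier`, spelled as in the data, and `spdxFromLicenseInfo`). The helper name `differing_entries` is hypothetical, and the sample is built in memory rather than read from licenses3.json.

```python
def differing_entries(entries):
    """Return coordinates where both SPDX values are non-empty and differ."""
    result = []
    for entry in entries:
        from_classifier = entry.get("spdxFomClassifier")  # key spelled as in the data
        from_license = entry.get("spdxFromLicenseInfo")
        if from_classifier and from_license and from_classifier != from_license:
            result.append(entry["coordinates"])
    return result


# Two entries copied from this thread.
sample = [
    {
        "coordinates": "pypi/pypi/-/astroid/3.2.4",
        "info.license": "LGPL-2.1-or-later",
        "spdxFomClassifier": "LGPL-2.0-only",
        "spdxFromLicenseInfo": "LGPL-2.1-or-later",
    },
    {
        "coordinates": "pypi/pypi/-/starlette/0.38.0",
        "info.license": None,
        "spdxFomClassifier": "BSD-2-Clause",
    },
]

print(differing_entries(sample))
```

Entries with an empty `info.license` are skipped, which matches the "non-empty" wording of the count above.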

@Jeffrey-Luszcz

I think we have a BSD vs BSD-2 vs BSD-3 clause problem we need to figure out how to handle. By reviewing the JSON file I see data with the following styles of problems:

The license in the repo is BSD-3 [ https://github.com/encode/starlette/blob/master/LICENSE.md ], but the JSON variously says BSD and BSD-2:
{
"coordinates": "pypi/pypi/-/starlette/0.38.0",
"info.classifiers": " BSD License",
"info.license": null,
"spdxFomClassifier": "BSD-2-Clause",
"versionInClassifier": false,
"versionInLicenseInfo": false
},

The license in the repo is BSD-3 [ https://github.com/andialbrecht/sqlparse/blob/master/LICENSE ], but the JSON says BSD or BSD-2:
{
"coordinates": "pypi/pypi/-/sqlparse/0.5.1",
"info.classifiers": " BSD License",
"info.license": null,
"spdxFomClassifier": "BSD-2-Clause",
"versionInClassifier": false,
"versionInLicenseInfo": false
},

The license in the repo is BSD-3 [ https://github.com/google/re2/blob/main/LICENSE ], but the JSON variously says BSD and BSD-2:
{
"coordinates": "pypi/pypi/-/google-re2/1.1.20240702",
"info.classifiers": " BSD License",
"info.license": null,
"spdxFomClassifier": "BSD-2-Clause",
"versionInClassifier": false,
"versionInLicenseInfo": false
},

@Jeffrey-Luszcz

We have a similar problem with LGPL.

https://pypi.org/project/astroid/ The license on pypi.org says "License: GNU Lesser General Public License v2 (LGPLv2) (LGPL-2.1-or-later)", but the JSON mixes 2.0 and 2.1, and "only" and "or-later" (+):
{
"coordinates": "pypi/pypi/-/astroid/3.2.4",
"info.classifiers": " GNU Lesser General Public License v2 (LGPLv2)",
"info.license": "LGPL-2.1-or-later",
"spdxFomClassifier": "LGPL-2.0-only",
"versionInClassifier": true,
"spdxFromLicenseInfo": "LGPL-2.1-or-later",
"versionInLicenseInfo": true
},

@qtomlinson
Collaborator Author

qtomlinson commented Sep 19, 2024

@Jeffrey-Luszcz This pull request is focused on extracting license information from the core metadata of the Python registry. This does not involve inspecting the license file, as that task is handled by other tools such as Licensee and ScanCode. The discrepancies you mentioned here are related to the cases where there is a mismatch between the registry metadata and the license file. The JSON file only contains the information extracted from the registry metadata and does not reflect the information present in the license file.

@Jeffrey-Luszcz

To close out this issue on my side: I support the change described (i.e., derive the license from info.license over classifiers).
@qtomlinson Thanks for pulling together the JSON data; it really makes it clear.

My point in the comments above is that we should take another look at the mapping we use when a bare "BSD" license is declared in these fields. We currently map it to BSD-2 and likely should map it to BSD-3 (the more common license choice).
This is a separate issue, and I'll open one for discussion.
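
For that follow-up discussion, a small Python sketch of how the affected entries could be listed from the JSON data. It assumes the entry keys shown in the examples above (`info.classifiers`, `versionInClassifier`, `spdxFomClassifier`). The helper name `bare_bsd_candidates` is hypothetical, and this only flags entries for review rather than changing any mapping.

```python
def bare_bsd_candidates(entries):
    """Return coordinates whose classifier is a bare "BSD License" mapped to BSD-2-Clause."""
    candidates = []
    for entry in entries:
        classifier = (entry.get("info.classifiers") or "").strip()
        if (
            classifier == "BSD License"
            and not entry.get("versionInClassifier")
            and entry.get("spdxFomClassifier") == "BSD-2-Clause"
        ):
            candidates.append(entry["coordinates"])
    return candidates


# Entries copied from the examples earlier in this thread.
bsd_sample = [
    {
        "coordinates": "pypi/pypi/-/sqlparse/0.5.1",
        "info.classifiers": " BSD License",
        "info.license": None,
        "spdxFomClassifier": "BSD-2-Clause",
        "versionInClassifier": False,
        "versionInLicenseInfo": False,
    },
    {
        "coordinates": "pypi/pypi/-/astroid/3.2.4",
        "info.classifiers": " GNU Lesser General Public License v2 (LGPLv2)",
        "info.license": "LGPL-2.1-or-later",
        "spdxFomClassifier": "LGPL-2.0-only",
        "versionInClassifier": True,
    },
]

print(bare_bsd_candidates(bsd_sample))
```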

Collaborator

@elrayle elrayle left a comment

This all looks good. Thanks for adding tests to capture the new possible conditions.

@elrayle
Collaborator

elrayle commented Sep 26, 2024

I spoke with @Jeffrey-Luszcz. These are his comments. Bottom line is this PR is a go for merging.

I think we have a follow-on action to look more closely at the BSD => BSD-2 spdxFomClassifier choice. Most, if not all, of the actual licenses I've seen for these projects are BSD-3, not BSD-2.
Black Duck data shows BSD-3 as more popular than the other BSD variants. https://en.wikipedia.org/wiki/BSD_licenses

I'm fine with the merging of this PR

@qtomlinson qtomlinson merged commit 85544f8 into clearlydefined:master Sep 26, 2024
2 checks passed