Derive license from info.license over classifiers in pypi registry data #586
According to the Python Packaging User Guide (https://packaging.python.org/en/latest/specifications/core-metadata/#license), the license field appears to hold more specific information than the classifiers. This change parses the license field before the classifiers. When the classifiers carry version information and the license field does not, the classifiers are used first.
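The precedence described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `has_version` and `choose_license` are hypothetical names, both inputs are assumed to be license strings already normalized from `info.license` and `info.classifiers`, and "has version info" is approximated by a simple digit check.

```python
import re
from typing import Optional


def has_version(license_id: str) -> bool:
    """Crude heuristic (an assumption): a digit means a version is present."""
    return re.search(r"\d", license_id) is not None


def choose_license(from_license: Optional[str],
                   from_classifiers: Optional[str]) -> Optional[str]:
    """Pick one declared license from the two registry fields.

    Prefers the value derived from info.license, except when only the
    classifier-derived value carries version information.
    """
    if not from_license:
        # Nothing usable in info.license: fall back to the classifiers.
        return from_classifiers or None
    if not from_classifiers:
        return from_license
    if has_version(from_classifiers) and not has_version(from_license):
        # The classifier is the more specific of the two in this case.
        return from_classifiers
    return from_license
```

For example, `choose_license("LGPL", "LGPL-2.1-or-later")` returns the classifier value, while `choose_license("BSD-3-Clause", "BSD")` keeps the license field. The digit check is deliberately simplistic and does not verify that both values name the same license family.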
@Jeffrey-Luszcz @elrayle Here is the file containing the extracted license information for 1000 pypi components. The structure of each entry is as follows:
@Jeffrey-Luszcz Please find a more detailed JSON file with extracted SPDX licenses from the specified fields here: licenses3.json. In 150 instances, the non-empty SPDX licenses from the info.license field differ from those in the info.classifiers field. It appears that the licenses from the info.license field provide more detailed information.
I think we have a BSD vs. BSD-2 vs. BSD-3 clause problem that we need to figure out how to handle. Reviewing the JSON file, I see data with the following kinds of problems:
License in repo is BSD-3 [ https://github.com/encode/starlette/blob/master/LICENSE.md ] but the JSON says variously BSD and BSD-2.
License in repo is BSD-3 [ https://github.com/andialbrecht/sqlparse/blob/master/LICENSE ] but the JSON says BSD or BSD-2.
License in repo is BSD-3 [ https://github.com/google/re2/blob/main/LICENSE ] but the JSON says variously BSD and BSD-2.
We have a similar problem for LGPL: https://pypi.org/project/astroid/. The license shown on pypi.org is "License: GNU Lesser General Public License v2 (LGPLv2) (LGPL-2.1-or-later)", but the JSON contains a mix of 2.0 and 2.1, and of "only" and "+" variants.
@Jeffrey-Luszcz This pull request is focused on extracting license information from the core metadata of the Python registry. This does not involve inspecting the license file, as that task is handled by other tools such as Licensee and ScanCode. The discrepancies you mentioned here are related to the cases where there is a mismatch between the registry metadata and the license file. The JSON file only contains the information extracted from the registry metadata and does not reflect the information present in the license file.
To close out this issue on my side: I support the change described (i.e., derive license from info.license over classifiers). My point in the comments above is that we should take another look at the mapping we use when a bare "BSD" license is declared in these fields. We currently map it to BSD-2 and likely should map it to BSD-3 (the more common license choice).
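To make the mapping question concrete, here is a minimal sketch of what a default for bare names could look like. The table and function names are hypothetical and do not reflect the project's actual mapping code; the only grounded detail is that a bare "BSD" would resolve to BSD-3-Clause instead of BSD-2-Clause.

```python
# Hypothetical defaults for license names declared without a clause/version.
# Per the comment above, a bare "BSD" currently resolves to BSD-2-Clause;
# this table shows the suggested alternative.
BARE_LICENSE_DEFAULTS = {
    "BSD": "BSD-3-Clause",
}


def normalize_declared(declared: str) -> str:
    """Map a bare declared license name to a default SPDX identifier.

    Values that are not in the table are returned unchanged (trimmed).
    """
    cleaned = declared.strip()
    return BARE_LICENSE_DEFAULTS.get(cleaned.upper(), cleaned)
```

With this table, `normalize_declared("BSD")` gives `"BSD-3-Clause"`, and an already specific value such as `"BSD-2-Clause"` passes through untouched.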
This all looks good. Thanks for adding tests to capture the new possible conditions.
I spoke with @Jeffrey-Luszcz. These are his comments. Bottom line is this PR is a go for merging.
Based on the Python Packaging User Guide, it appears that the license field holds more specific information. Change to parsing license field before the classifiers. When classifier has the version info and the license field does not, use classifier first in that case.
Also updated the toolVersion in pypiExtract.
Task: #523
Task: #519
Task: #429