Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Increase PyPI parsing flexibility #3423

Merged
merged 12 commits into from
Aug 26, 2023
Merged

✨ Increase PyPI parsing flexibility #3423

merged 12 commits into from
Aug 26, 2023

Conversation

joshgc
Copy link
Contributor

@joshgc joshgc commented Aug 24, 2023

What kind of change does this PR introduce?

This is a likely a non-breaking change, but due to PyPI diversity I can't prove it doesn't break on any real-world package. I analyzed 396 packages (those I'm using in my current project). Of those only 63 could be correctly parsed by current parsing (before this PR) and I break none of them with this PR.

However after this PR, of the 396 packages 329 get parsed correctly..

Improvement opportunities:

  1. Some sphinx packages list a central repo for issues but a different, valid, repo for the source code. I don't think our approach of "parse everything" can handle these.
  2. Similar to sphinx, many of the types-* have the same problem. See types-protobuf

What is the current behavior?

Only use ["Info"]["project_urls"]["Source"] to find a github URL

What is the new behavior (if this is a feature change)?**

Look through all the ["Info"]["project_urls"] and ["Info"]["project_url"] to find all possible repos. Fail if that number is anything but 1.

  • Tests for the changes have been added (for bug fixes/features)

Which issue(s) this PR fixes

Fixes #3249

Special notes for your reviewer

None

Does this PR introduce a user-facing change?

Changes to the PyPI parsing will make scorecard fail on a small number of previously working packages 

@joshgc joshgc changed the title Increase PyPI parsing flexibility ✨ Increase PyPI parsing flexibility Aug 24, 2023
cmd/package_managers.go Show resolved Hide resolved
cmd/package_managers.go Outdated Show resolved Hide resolved
@spencerschrock
Copy link
Member

Note, you'll need to fix DCO on the commits you have already. Another repo has written a nice guide on fixing it here:
https://github.com/src-d/guide/blob/7bb18c0577b779a1d80c3f6cd6e3a825200e7dad/developer-community/fix-DCO.md#how-to-add-sign-offs-retroactively

I've enabled the rest of the CI to run, where the linter may pickup some things. I'll review the rest of the change when I have some time.

@joshgc joshgc temporarily deployed to gitlab August 24, 2023 18:29 — with GitHub Actions Inactive
@joshgc joshgc temporarily deployed to integration-test August 24, 2023 18:29 — with GitHub Actions Inactive
cmd/package_managers.go Fixed Show fixed Hide fixed
cmd/package_managers.go Fixed Show fixed Hide fixed
cmd/package_managers.go Fixed Show fixed Hide fixed
@codecov
Copy link

codecov bot commented Aug 24, 2023

Codecov Report

Merging #3423 (fdba658) into main (383e556) will increase coverage by 2.73%.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3423      +/-   ##
==========================================
+ Coverage   63.80%   66.54%   +2.73%     
==========================================
  Files         183      183              
  Lines       12951    12989      +38     
==========================================
+ Hits         8263     8643     +380     
+ Misses       4228     3861     -367     
- Partials      460      485      +25     

… hope its unique

Signed-off-by: Josh Cogan <joshgc@google.com>
…t to support gitlab better

Signed-off-by: Josh Cogan <joshgc@google.com>
…t check there is a single valid url

Signed-off-by: Josh Cogan <joshgc@google.com>
@joshgc joshgc temporarily deployed to gitlab August 24, 2023 19:53 — with GitHub Actions Inactive
@joshgc joshgc temporarily deployed to integration-test August 24, 2023 19:53 — with GitHub Actions Inactive
Signed-off-by: Josh Cogan <joshgc@google.com>
cmd/package_managers.go Outdated Show resolved Hide resolved
cmd/package_managers.go Outdated Show resolved Hide resolved
cmd/package_managers.go Outdated Show resolved Hide resolved
cmd/package_managers.go Outdated Show resolved Hide resolved
cmd/package_managers.go Outdated Show resolved Hide resolved
@spencerschrock
Copy link
Member

make check-linter should help flag some of these linter warnings locally.

@spencerschrock
Copy link
Member

spencerschrock commented Aug 24, 2023

inform,maturin,quantiphy lists the same github url twice but with different URLs. Github redirects these to the same repo but I don't think thats documented behavior

These just look like capitalization issues. I think it's safe to normalize for your deduplication logic with strings.ToLower or something, we do it elsewhere in the code as well.

{
  "changelog": "https://github.com/KenKundert/inform/blob/master/doc/releases.rst",
  "documentation": "https://inform.readthedocs.io",
  "homepage": "https://inform.readthedocs.io",
  "repository": "https://github.com/kenkundert/inform"
}

Some sphinx packages list a central repo for issues but a different, valid, repo for the source code. I don't think our approach of "parse everything" can handle these.

agree. unless we have a list of tags ("source code", "repository", etc) we prefer, before falling back to the "parse everything" approach.

Similar to sphinx, many of the types-* have the same problem. See types-protobuf

Some sort of counting approach could help? Of course this may introduce issues for other repos.

{
  "Changes": "https://github.com/typeshed-internal/stub_uploader/blob/main/data/changelogs/protobuf.md",
  "Chat": "https://gitter.im/python/typing",
  "GitHub": "https://github.com/python/typeshed",
  "Homepage": "https://github.com/python/typeshed",
  "Issue tracker": "https://github.com/python/typeshed/issues"
}

…fix filter to remove .git

Signed-off-by: Josh Cogan <joshgc@google.com>
Signed-off-by: Josh Cogan <joshgc@google.com>
@joshgc
Copy link
Contributor Author

joshgc commented Aug 24, 2023

inform,maturin,quantiphy lists the same github url twice but with different URLs. Github redirects these to the same repo but I don't think thats documented behavior

These just look like capitalization issues. I think it's safe to normalize for your deduplication logic with strings.ToLower or something, we do it elsewhere in the code as well.

{
  "changelog": "https://github.com/KenKundert/inform/blob/master/doc/releases.rst",
  "documentation": "https://inform.readthedocs.io",
  "homepage": "https://inform.readthedocs.io",
  "repository": "https://github.com/kenkundert/inform"
}

Some sphinx packages list a central repo for issues but a different, valid, repo for the source code. I don't think our approach of "parse everything" can handle these.

agree. unless we have a list of tags ("source code", "repository", etc) we prefer, before falling back to the "parse everything" approach.

Similar to sphinx, many of the types-* have the same problem. See types-protobuf

Some sort of counting approach could help? Of course this may introduce issues for other repos.

{
  "Changes": "https://github.com/typeshed-internal/stub_uploader/blob/main/data/changelogs/protobuf.md",
  "Chat": "https://gitter.im/python/typing",
  "GitHub": "https://github.com/python/typeshed",
  "Homepage": "https://github.com/python/typeshed",
  "Issue tracker": "https://github.com/python/typeshed/issues"
}

Yes I made this case insensitive. Even without the ranked key look up we have x5 the hit rate! I can also add the list look up method too. I ended up having ~10 items in it before

Signed-off-by: Josh Cogan <joshgc@google.com>
Copy link
Member

@spencerschrock spencerschrock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks, just the small question. And the PR will need to be marked ready for review.

cmd/package_managers.go Show resolved Hide resolved
@joshgc joshgc marked this pull request as ready for review August 25, 2023 18:20
@spencerschrock
Copy link
Member

Just need DCO fixed again for your very last commit

Signed-off-by: Josh Cogan <joshgc@google.com>
@joshgc
Copy link
Contributor Author

joshgc commented Aug 25, 2023

Just need DCO fixed again for your very last commit

Done. Sorry/Thanks!!

Signed-off-by: Josh Cogan <joshgc@google.com>
@joshgc joshgc temporarily deployed to gitlab August 26, 2023 00:33 — with GitHub Actions Inactive
@joshgc joshgc temporarily deployed to integration-test August 26, 2023 00:33 — with GitHub Actions Inactive
Copy link
Member

@spencerschrock spencerschrock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

@spencerschrock spencerschrock merged commit 9a844ab into ossf:main Aug 26, 2023
38 checks passed
ashearin pushed a commit to kgangerlm/scorecard-gitlab that referenced this pull request Nov 13, 2023
* Make PyPI parsing more flexible to find any github or gitlab url, and hope its unique

Signed-off-by: Josh Cogan <joshgc@google.com>

* Refactor the addRepo to not pass around a mutable object. Tweak a test to support gitlab better

Signed-off-by: Josh Cogan <joshgc@google.com>

* Ignore users called sponsors for github repos. Remove the set and just check there is a single valid url

Signed-off-by: Josh Cogan <joshgc@google.com>

* Remove unneeded variables and code

Signed-off-by: Josh Cogan <joshgc@google.com>

* Reducing indentation

Signed-off-by: Josh Cogan <joshgc@google.com>

* Make github url path parts case insensitive and use more explicit suffix filter to remove .git

Signed-off-by: Josh Cogan <joshgc@google.com>

* Appease the linter--may its wisdown never wane.

Signed-off-by: Josh Cogan <joshgc@google.com>

* CamelCase -> camelCase to prevent export

Signed-off-by: Josh Cogan <joshgc@google.com>

* Add test and allowance for gitlab to also be case insensitive

Signed-off-by: Josh Cogan <joshgc@google.com>

* hub vs lab typo

Signed-off-by: Josh Cogan <joshgc@google.com>

---------

Signed-off-by: Josh Cogan <joshgc@google.com>
Signed-off-by: Allen Shearin <allen.p.shearin@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Can I contribute to "Repo lookup from PyPI too conservative to use"
3 participants