
Reducing GitHub API calls to scale scanning repositories #202

Closed
naveensrinivasan opened this issue Feb 21, 2021 · 5 comments
Labels
duplicate (This issue or pull request already exists) · GitHub · priority/must-do (Upcoming release)

Comments

@naveensrinivasan
Member

GitHub API calls are throttled, which makes it hard to scale the number of repositories we can scan and provide results for.

The code would have to wait for tens of minutes before continuing:
{"level":"warn","ts":1613869247.8747272,"caller":"roundtripper/roundtripper.go:139","msg":"Rate limit exceeded. Waiting 44m34.125286853s to retry..."}

These Scorecard checks don't need the GitHub API; they only require a Git API:

  1. Active
  2. Frozen-Deps
  3. CodeQLInCheckDefinitions
  4. Security-Policy
  5. Packaging

Potential solution

  1. Clone the Git repo
  2. Git pull these repos on a cron to get updates
  3. Use an API to query these repositories directly instead of the GitHub API

The https://github.com/go-git/go-git project provides a Git API that could be used to avoid the GitHub API limitations.

With httpcache #80 (comment) and a reduced number of GitHub API calls, we should be able to scale the number of repositories scanned.

related to #80

@naveensrinivasan
Member Author

Also, this approach could be used to scan non-GitHub repositories such as GitLab, which gives us one API irrespective of the provider.

@inferno-chromium
Contributor

Cloning might be too heavyweight for some big repos, and slow too.

Maybe let's start with httpcache first. We can also scale the number of GitHub tokens. Right now we have 2; we can easily go to 4-5 (separated with commas).

@naveensrinivasan
Member Author

> Cloning might be too heavyweight for some big repos, and slow too.
>
> Maybe let's start with httpcache first. We can also scale the number of GitHub tokens. Right now we have 2; we can easily go to 4-5 (separated with commas).

Cloning can be async as another cron job, and it is a one-time effort. Cloning should not run as part of Scorecard; instead, give it an option to look in a location for cached git repos and, if they are not there, fetch them from github.com.

@inferno-chromium
Contributor

> Cloning might be too heavyweight for some big repos, and slow too.
> Maybe let's start with httpcache first. We can also scale the number of GitHub tokens. Right now we have 2; we can easily go to 4-5 (separated with commas).

> Cloning can be async as another cron job, and it is a one-time effort. Cloning should not run as part of Scorecard; instead, give it an option to look in a location for cached git repos and, if they are not there, fetch them from github.com.

Makes sense.

@azeemshaikh38
Contributor

Closing this since we are already tracking this here: #318

@azeemshaikh38 added the duplicate label on May 26, 2021