Scorecard scalability limitation: Reduce GitHub API calls #80

Closed
htuch opened this issue Nov 25, 2020 · 8 comments
Labels
priority/must-do Upcoming release

Comments

@htuch

htuch commented Nov 25, 2020

I've run scorecard over ~40 Envoy dependencies a few times in the past hour and, even with a GitHub personal access token, I've hit the rate limit. I'm wondering what can be done to make this work better in scorecard. Suggestions:

  • Profile GitHub API calls, determine which are the most costly, and switch those to cheaper API alternatives. Document publicly which calls are most expensive.
  • Add support for local caching of results for a time-bound period.
  • Provide a scorecard network service that collects results for projects periodically.
  • Generalize the scorecard JSON examples into a contributor-editable file of projects whose results are collected nightly.
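
For the local caching suggestion, here is a minimal sketch of what a time-bounded result cache could look like. It is written in Python with only the standard library to keep it short (scorecard itself is Go), and `ResultCache`, `cached_score`, and the `compute` callback are hypothetical names for illustration, not anything scorecard provides.

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional


class ResultCache:
    """File-backed cache that keeps each repo's result for a fixed time window."""

    def __init__(self, path: Path, ttl_seconds: float,
                 clock: Callable[[], float] = time.time):
        self.path = path
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so the expiry logic can be tested
        self.entries = json.loads(path.read_text()) if path.exists() else {}

    def get(self, repo: str) -> Optional[dict]:
        """Return the stored result, or None if it is missing or expired."""
        entry = self.entries.get(repo)
        if entry is None or self.clock() - entry["stored_at"] > self.ttl_seconds:
            return None
        return entry["result"]

    def put(self, repo: str, result: dict) -> None:
        """Store a result and persist the whole cache to disk."""
        self.entries[repo] = {"stored_at": self.clock(), "result": result}
        self.path.write_text(json.dumps(self.entries))


def cached_score(repo: str, cache: ResultCache,
                 compute: Callable[[str], dict]) -> dict:
    """Only call `compute` (the part that would hit the GitHub API) on a miss."""
    hit = cache.get(repo)
    if hit is not None:
        return hit
    result = compute(repo)
    cache.put(repo, result)
    return result
```

With something like this, repeated runs over the same dependencies within the time window would only spend API calls on the first pass.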
htuch changed the title from "Minimize and/or cache GitHub API calls" to "Scorecard scalability limitation: GitHub API calls" on Nov 25, 2020
@dlorenc
Contributor

dlorenc commented Nov 26, 2020

  • Provide a scorecard network service that collects results for projects periodically.

This is the approach we've taken so far. There's a cron job that runs daily on ~100 projects and publishes the results to GCS/BigQuery.

I'd be happy to add the Envoy deps to that list; it's here: https://github.com/ossf/scorecard/blob/main/cron/projects.txt

@dlorenc
Contributor

dlorenc commented Nov 28, 2020

Adding the envoy deps here: #84

@naveensrinivasan
Member

naveensrinivasan commented Feb 18, 2021

The GitHub API supports conditional requests: https://docs.github.com/en/rest/overview/resources-in-the-rest-api#conditional-requests

Most responses return an ETag header. Many responses also return a Last-Modified header. You can use the values of these headers to make subsequent requests to those resources using the If-None-Match and If-Modified-Since headers, respectively. If the resource has not changed, the server will return a 304 Not Modified.
Making a conditional request and receiving a 304 response does not count against your Rate Limit, so we encourage you to use it whenever possible.
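
To make the quoted behaviour concrete, here is a rough sketch of a conditional GET using `If-None-Match`. It uses only the Python standard library; `conditional_get` is a hypothetical helper, and sending the token as a `Bearer` header is an assumption about how the caller authenticates.

```python
import urllib.error
import urllib.request
from typing import Optional, Tuple


def conditional_get(url: str, etag: Optional[str] = None,
                    token: Optional[str] = None
                    ) -> Tuple[int, Optional[str], Optional[bytes]]:
    """GET `url`, sending If-None-Match when an ETag from an earlier response is known.

    Returns (status, etag, body). On a 304 the body is None and the caller
    is expected to reuse the copy it stored alongside the ETag.
    """
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if token:
        # Assumption: a personal access token passed as a Bearer credential.
        request.add_header("Authorization", f"Bearer {token}")
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.headers.get("ETag"), response.read()
    except urllib.error.HTTPError as err:
        # urllib reports 304 Not Modified as an HTTPError; treat it as a cache hit.
        if err.code == 304:
            return 304, etag, None
        raise
```

The first call for a URL would return 200 with an ETag to store; later calls that pass that ETag back get a 304 when nothing has changed.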


go-github (https://github.com/google/go-github) supports conditional requests: https://github.com/google/go-github#conditional-requests

As we scale to more and more projects, this would add a lot of value.

https://github.com/gregjones/httpcache

cc @inferno-chromium

@naveensrinivasan
Member

A visual representation of the proposed solution.

[Image: diagram of the proposed solution]

  1. A k8s cron job runs on a schedule.
  2. The initial run fetches information with httpcache as middleware, which caches the HTTP responses on a large disk (PVC) at first; we would probably move to Redis as the cache later.
  3. Subsequent cron runs use httpcache to check whether the content has been modified and load it from the cache if it has not, which reduces how often we hit the GitHub API rate limit.
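
The steps above can be sketched roughly as follows. This is not httpcache (a Go library) but a small Python stand-in for the same idea: a directory plays the role of the PVC, and every name here (`DiskHTTPCache`, `cron_run`, the `fetch` callback) is hypothetical. The `fetch` callback is assumed to perform a conditional GET and return `(status, etag, body)` with `body` set to `None` on a 304.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Iterable, Optional, Tuple

# (url, known_etag) -> (status, etag, body); body is None on a 304.
Fetch = Callable[[str, Optional[str]], Tuple[int, Optional[str], Optional[bytes]]]


class DiskHTTPCache:
    """Stores one JSON file per URL under `root` (standing in for the PVC)."""

    def __init__(self, root: Path, fetch: Fetch):
        self.root = root
        self.fetch = fetch
        root.mkdir(parents=True, exist_ok=True)

    def _path(self, url: str) -> Path:
        # Hash the URL so it is always a safe file name.
        return self.root / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".json")

    def get(self, url: str) -> Tuple[bytes, bool]:
        """Return (body, served_from_cache), revalidating with the stored ETag."""
        path = self._path(url)
        stored = json.loads(path.read_text()) if path.exists() else None
        status, etag, body = self.fetch(url, stored["etag"] if stored else None)
        if status == 304 and stored is not None:
            return stored["body"].encode("utf-8"), True
        if body is None:
            raise RuntimeError(f"no body returned for {url} (status {status})")
        # Assumption: response bodies are UTF-8 text such as JSON.
        path.write_text(json.dumps({"etag": etag, "body": body.decode("utf-8")}))
        return body, False


def cron_run(urls: Iterable[str], cache: DiskHTTPCache) -> int:
    """One scheduled run; returns how many URLs were served from the cache."""
    return sum(1 for url in urls if cache.get(url)[1])
```

Because the cache lives on disk, a later run that starts with a fresh process still finds the ETags from the previous run; swapping the directory for Redis would only change where the entries are stored.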

@dlorenc
Contributor

dlorenc commented Feb 21, 2021

Awesome! What would the cache keys be? URLs?

@naveensrinivasan
Member

naveensrinivasan commented Feb 21, 2021

Awesome! What would the cache keys be? URLs?

Here it is.
https://github.com/gregjones/httpcache/blob/901d90724c7919163f472a9812253fb26761123d/httpcache.go#L42
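
As far as I can tell, the linked line is httpcache's key function, which keys GET requests by their full URL and other methods by the method plus the URL. A Python rendering of that rule, as an illustration only (`cache_key` is a hypothetical name here, and the behaviour is my reading of the linked Go code rather than something confirmed in this thread):

```python
def cache_key(method: str, url: str) -> str:
    """Key GET requests by URL alone; prefix other methods to keep them distinct."""
    if method == "GET":
        return url
    return f"{method} {url}"
```

One consequence of keying on the full URL is that different query strings, such as different `page` values, would be cached as separate entries.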

@naveensrinivasan
Member

This should reduce GitHub API usage: #227

inferno-chromium added the "priority/must-do" (Upcoming release) label on Mar 22, 2021
inferno-chromium changed the title from "Scorecard scalability limitation: GitHub API calls" to "Scorecard scalability limitation: Reduce GitHub API calls" on Mar 30, 2021
@azeemshaikh38
Contributor

Closing this, since the solution is being tracked and implemented in #318.

5 participants