GitHub Enumeration Tool

This tool is used to reliably enumerate projects on GitHub.

The output of this tool can be used as input for the criticality_score tool or for the collect_signals worker.

Example

$ export GITHUB_TOKEN=ghp_x  # Personal Access Token goes here
$ enumerate_github \
    -start 2008-01-01 \
    -min-stars=10 \
    -workers=1 \
    -out=github_projects.txt

Install

$ go install github.com/ossf/criticality_score/v2/cmd/enumerate_github@latest

Usage

$ enumerate_github [FLAGS]...

The URL for each repository is written to the output. By default stdout is used for output.

FLAGS are optional. See below for documentation.

Authentication

An environment variable containing one or more comma-delimited GitHub Personal Access Tokens must be set.

Supported environment variables are GITHUB_AUTH_TOKEN, GITHUB_TOKEN, GH_TOKEN, or GH_AUTH_TOKEN.

Example:

$ export GITHUB_TOKEN=ghp_abc,ghp_123
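
To illustrate the comma-delimited format, here is a minimal sketch of how such a value could be split into individual tokens. It is not the tool's actual implementation: the helper names `splitTokens` and `tokensFromEnv` are hypothetical, and the order in which the variables are checked is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// tokenEnvVars lists the environment variables named in this README.
// The lookup order used here is an assumption, not confirmed behavior.
var tokenEnvVars = []string{"GITHUB_AUTH_TOKEN", "GITHUB_TOKEN", "GH_TOKEN", "GH_AUTH_TOKEN"}

// splitTokens splits a comma-delimited value into trimmed, non-empty tokens.
func splitTokens(value string) []string {
	var tokens []string
	for _, tok := range strings.Split(value, ",") {
		if tok = strings.TrimSpace(tok); tok != "" {
			tokens = append(tokens, tok)
		}
	}
	return tokens
}

// tokensFromEnv returns the tokens from the first supported variable that
// holds at least one token. getenv is injected so the lookup can be tested.
func tokensFromEnv(getenv func(string) string) []string {
	for _, name := range tokenEnvVars {
		if tokens := splitTokens(getenv(name)); len(tokens) > 0 {
			return tokens
		}
	}
	return nil
}

func main() {
	tokens := tokensFromEnv(os.Getenv)
	if len(tokens) == 0 {
		fmt.Fprintln(os.Stderr, "no GitHub token found in the environment")
		os.Exit(1)
	}
	fmt.Printf("found %d token(s)\n", len(tokens))
}
```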

Flags

Output flags

  • -out FILE specify the FILE to use for output. By default stdout is used.
  • -append appends output to FILE if it already exists.
  • -force overwrites FILE if it already exists and -append is not set.
  • -format {text|scorecard} indicates the format to use for output. text is used by default and consists of one URL per line. scorecard outputs a CSV file compatible with the scorecard project.

If FILE exists and neither -append nor -force is set the command will fail.

Date flags

  • -start date the start date to enumerate back to. Must be at or after 2008-01-01. Defaults to 2008-01-01.
  • -end date the end date to enumerate from. Defaults to today's date.

Query/Star flags

  • -min-stars int only enumerates repositories with at least this many stars. Defaults to 10.
  • -query string sets the base query to use for enumeration. Defaults to is:public. See GitHub's search help for more detail.
  • -require-min-stars aborts execution if -min-stars can't be reached during enumeration. If not set, some repositories created on a certain date may not be included.
  • -star-overlap int the number of stars to overlap between queries. Defaults to 5. An overlap is used to avoid missing repositories whose star count changes during enumeration.
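
One plausible reading of the overlap, shown here only as a sketch and not as the tool's confirmed algorithm: if results are paged by star count, each follow-up query raises its upper star bound by the overlap so that repositories whose count shifted slightly are seen again. The helpers `buildQuery` and `nextMaxStars` are hypothetical; the `stars:` and `created:` range qualifiers are standard GitHub search syntax.

```go
package main

import "fmt"

// buildQuery composes a search query from a base query plus star and
// creation-date range qualifiers. Hypothetical helper for illustration.
func buildQuery(base string, minStars, maxStars int, start, end string) string {
	return fmt.Sprintf("%s stars:%d..%d created:%s..%s", base, minStars, maxStars, start, end)
}

// nextMaxStars returns the upper star bound for the follow-up query:
// the lowest star count seen so far plus the overlap. This pagination
// scheme is an assumption.
func nextMaxStars(lowestSeen, overlap int) int {
	return lowestSeen + overlap
}

func main() {
	// Suppose the previous query ended at a repository with 120 stars
	// and the overlap is the default of 5.
	upper := nextMaxStars(120, 5)
	fmt.Println(buildQuery("is:public", 10, upper, "2022-06-14", "2022-06-21"))
	// prints: is:public stars:10..125 created:2022-06-14..2022-06-21
}
```

Under this reading, overlapping queries return some repositories more than once, so results would need to be de-duplicated before they are written out.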

Misc flags

  • -log level set the level of logging. Can be debug, info (default), warn or error.
  • -workers int the total number of concurrent workers to use. Default is 1.
  • -help displays help text.

How It Works

Refer to Milestone 1 for details on the algorithm.

Q&A

Q: What is the lowest practical setting for -min-stars?

A value of 10 has been tested successfully, although lower values may be possible.

TODO -- more detail

Q: How long does it take?

With a single GitHub Personal Access Token, it took about 4 hours to return all projects with >= 20 stars.

Faster performance can be achieved with more Personal Access Tokens and additional workers.

Q: How many workers should I use?

Generally, use 1 worker for each Personal Access Token.

More workers than tokens may result in secondary rate limits.

It is possible that more restricted searches will succeed with more workers per token.

Development

Rather than installing the binary, use go run to run the command.

For example:

$ go run ./cmd/enumerate_github [FLAGS]...

Limiting the data allows for runs to be completed quickly. For example:

$ go run ./cmd/enumerate_github \
    -log=debug \
    -start=2022-06-14 \
    -end=2022-06-21 \
    -min-stars=20