A dockerized HTTP server for scraping user and repository information from GitHub.
- Docker
  - Install Docker
  - OR brew users: run
    brew install --cask docker
- sbt
  - Install sbt
  - OR brew users: run
    brew install sbt
- Environment variables
  - To avoid being rate limited by GitHub's API, the app authenticates using username/token or username/password
  - Create an access token. The access token does not need any additional permissions.
  - run
    export GITHUB_USERNAME={your_username}
  - run
    export GITHUB_TOKEN={your_token}
  - OR (not recommended) run
    export GITHUB_PASSWORD={your_password}
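Before starting the server, it can help to confirm that the credentials are actually exported. The following is a minimal sketch and not part of the project: the helper name `github_credentials` is hypothetical, and preferring the token over the password is an assumption based on the recommendation above.

```python
import os
from typing import Mapping, Tuple


def github_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (username, secret) read from the variables described above.

    Hypothetical helper: prefers GITHUB_TOKEN and falls back to
    GITHUB_PASSWORD only when no token is set.
    """
    username = env.get("GITHUB_USERNAME")
    if not username:
        raise KeyError("GITHUB_USERNAME is not set")
    secret = env.get("GITHUB_TOKEN") or env.get("GITHUB_PASSWORD")
    if not secret:
        raise KeyError("set GITHUB_TOKEN (or GITHUB_PASSWORD)")
    return username, secret
```

Calling `github_credentials()` with no arguments reads the current shell environment and raises `KeyError` if a required variable is missing.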
$ make start
and you're up and running on http://localhost:8181!
make start
- Start the server
make stop
- Stop the server
make restart
- Restart the server
make drop-db
- Drop the volume holding the database data
The server is designed to run scraping jobs on a cron schedule. The schedule can be modified in scrape.cron; a server restart is required to pick up the changes.
The scraping jobs are configured via two JSON files: users.json and repos.json.
Example users.json configuration
{
"start": 0,
"count": 100,
"additional_users": [
"foo",
"bar"
]
}
This file tells the GitHub user scraping job to start at the first ID after 0 and query 100 users. In a perfect world these would be the users with IDs 1-100, but some users have been deleted, so there are gaps in the IDs.
It will also fetch additional users by username, as specified by additional_users.
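The effect of gaps in the ID space can be illustrated with a small sketch. This is not the server's implementation: the function `ids_to_scrape` and the list of existing IDs are hypothetical and only model the "first ID after start, then count users" behavior described above.

```python
import json
from typing import Iterable, List

EXAMPLE_CONFIG = """
{
  "start": 0,
  "count": 100,
  "additional_users": ["foo", "bar"]
}
"""


def ids_to_scrape(existing_ids: Iterable[int], start: int, count: int) -> List[int]:
    """Return the first `count` existing IDs strictly greater than `start`."""
    return sorted(i for i in existing_ids if i > start)[:count]


config = json.loads(EXAMPLE_CONFIG)

# Hypothetical ID space in which users 3 and 5 have been deleted.
existing = [1, 2, 4, 6, 7]

# Use a small count here so the gaps are easy to see.
selected = ids_to_scrape(existing, config["start"], 4)
print(selected)  # [1, 2, 4, 6]
```

With four users requested from ID 0, the result reaches ID 6 rather than 4, because the deleted IDs are skipped.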
The server has a REST API for querying scraped data. It also lets the user synchronously scrape more data.
Get all persisted user information
URL: http://localhost:8181/users
Code: 200 OK
Response:
[
{
"login": "travisemichael",
"id": 7723569,
"type": "User",
"name": "Travis E Michael",
"company": "Caffeine",
"blog": "",
"location": "Redwood City, CA",
"email": "travisemichael@gmail.com",
"publicRepos": 6,
"publicGists": 0,
"followers": 1,
"following": 0,
"createdAt": "2014-05-28T12:16:54Z",
"updatedAt": "2019-08-08T11:35:58Z"
},
{
"login": "Tesorio",
"id": 8165102,
"type": "Organization",
"name": "Tesorio",
"blog": "https://www.tesorio.com/",
"location": "San Francisco Bay Area, CA",
"email": "hello@tesorio.com",
"publicRepos": 13,
"publicGists": 0,
"followers": 0,
"following": 0,
"createdAt": "2014-07-15T04:09:06Z",
"updatedAt": "2018-12-11T19:22:06Z"
}
]
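Note that the second record in the response above has no "company" field, so a client should probably treat some fields as optional. A minimal parsing sketch follows; the `User` class and `parse_users` function are hypothetical, keep only a subset of the fields, and assume that "company" is the only field that may be missing.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class User:
    login: str
    id: int
    type: str
    company: Optional[str] = None  # absent for some records in the example


def parse_users(body: str) -> List[User]:
    """Parse a /users response body, keeping a subset of the fields."""
    return [
        User(
            login=item["login"],
            id=item["id"],
            type=item["type"],
            company=item.get("company"),  # tolerate a missing key
        )
        for item in json.loads(body)
    ]


# Trimmed copy of the example response above.
sample = """
[
  {"login": "travisemichael", "id": 7723569, "type": "User", "company": "Caffeine"},
  {"login": "Tesorio", "id": 8165102, "type": "Organization"}
]
"""
users = parse_users(sample)
print([u.login for u in users if u.company is None])  # ['Tesorio']
```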
Get persisted user information by user id
URL: http://localhost:8181/users/7723569
Code: 200 OK
Response:
{
"login": "travisemichael",
"id": 7723569,
"type": "User",
"name": "Travis E Michael",
"company": "Caffeine",
"blog": "",
"location": "Redwood City, CA",
"email": "travisemichael@gmail.com",
"publicRepos": 6,
"publicGists": 0,
"followers": 1,
"following": 0,
"createdAt": "2014-05-28T12:16:54Z",
"updatedAt": "2019-08-08T11:35:58Z"
}
Get persisted user information by user name
URL: http://localhost:8181/users/travisemichael
Code: 200 OK
Response:
{
"login": "travisemichael",
"id": 7723569,
"type": "User",
"name": "Travis E Michael",
"company": "Caffeine",
"blog": "",
"location": "Redwood City, CA",
"email": "travisemichael@gmail.com",
"publicRepos": 6,
"publicGists": 0,
"followers": 1,
"following": 0,
"createdAt": "2014-05-28T12:16:54Z",
"updatedAt": "2019-08-08T11:35:58Z"
}
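The two lookups above share one URL shape, with either the numeric ID or the login as the last path segment. A minimal client sketch using only the Python standard library is shown below. The names `user_url` and `fetch_user` are hypothetical, and the use of a plain GET request is an assumption based on the endpoint descriptions.

```python
import json
import urllib.request
from typing import Union
from urllib.parse import quote

BASE_URL = "http://localhost:8181"


def user_url(identifier: Union[int, str], base_url: str = BASE_URL) -> str:
    """Build the lookup URL for a numeric user ID or a login name."""
    return f"{base_url}/users/{quote(str(identifier), safe='')}"


def fetch_user(identifier: Union[int, str]) -> dict:
    """GET a persisted user; requires the server to be running locally."""
    with urllib.request.urlopen(user_url(identifier)) as response:
        return json.load(response)
```

For example, `fetch_user(7723569)` and `fetch_user("travisemichael")` request the two URLs shown above.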
Scrape and persist user information from GitHub, starting with the first user with a valid ID greater than start and continuing until count users have been scraped.
URL: http://localhost:8181/users?start=0&count=10
Code: 200 OK
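The scrape URL can be assembled from the start and count parameters rather than written by hand. The sketch below only builds the URL: the helper name `scrape_url` and its validation rules are hypothetical, and the HTTP method to send is left to the caller because it is not stated here.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8181"


def scrape_url(resource: str, start: int, count: int, base_url: str = BASE_URL) -> str:
    """Build a scrape URL such as /users?start=0&count=10."""
    if resource not in ("users", "repos"):
        raise ValueError(f"unknown resource: {resource!r}")
    if start < 0 or count <= 0:
        raise ValueError("start must be >= 0 and count must be > 0")
    query = urlencode({"start": start, "count": count})
    return f"{base_url}/{resource}?{query}"


print(scrape_url("users", 0, 10))  # http://localhost:8181/users?start=0&count=10
```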
Scrape and persist user information from GitHub by user name
URL: http://localhost:8181/users/travisemichael
Code: 200 OK
Response:
{
"login": "travisemichael",
"id": 7723569,
"type": "User",
"name": "Travis E Michael",
"company": "Caffeine",
"blog": "",
"location": "Redwood City, CA",
"email": "travisemichael@gmail.com",
"publicRepos": 6,
"publicGists": 0,
"followers": 1,
"following": 0,
"createdAt": "2014-05-28T12:16:54Z",
"updatedAt": "2019-08-08T11:35:58Z"
}
Get all persisted repo information
URL: http://localhost:8181/repos
Code: 200 OK
Response:
[
{
"id": 87983904,
"name": "django-saml2-auth",
"fullName": "Tesorio/django-saml2-auth",
"ownerId": 8165102,
"htmlUrl": "https://github.com/Tesorio/django-saml2-auth",
"description": "Django SAML2 Authentication Made Easy. Easily integrate with SAML2 SSO identity providers like Okta",
"fork": true
},
{
"id": 124510032,
"name": "charts",
"fullName": "google/charts",
"ownerId": 1342004,
"htmlUrl": "https://github.com/google/charts",
"fork": false
}
]
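As with users, the repo records above do not all carry the same fields: the second one has no "description". The following sketch parses the response and separates forks from original repos. The `Repo` class and both function names are hypothetical, and treating "description" as the only optional field is an assumption drawn from this one example.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Repo:
    id: int
    full_name: str
    owner_id: int
    fork: bool
    description: Optional[str] = None  # missing for google/charts above


def parse_repos(body: str) -> List[Repo]:
    """Parse a /repos response body into Repo records."""
    return [
        Repo(
            id=item["id"],
            full_name=item["fullName"],
            owner_id=item["ownerId"],
            fork=item["fork"],
            description=item.get("description"),
        )
        for item in json.loads(body)
    ]


def original_repos(repos: List[Repo]) -> List[Repo]:
    """Keep only repos that are not forks."""
    return [repo for repo in repos if not repo.fork]


# Trimmed copy of the example response above.
sample = """
[
  {"id": 87983904, "fullName": "Tesorio/django-saml2-auth", "ownerId": 8165102, "fork": true},
  {"id": 124510032, "fullName": "google/charts", "ownerId": 1342004, "fork": false}
]
"""
repos = parse_repos(sample)
print([repo.full_name for repo in original_repos(repos)])  # ['google/charts']
```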
Get persisted repo information using the repo id
URL: http://localhost:8181/repos/87983904
Code: 200 OK
Response:
{
"id": 87983904,
"name": "django-saml2-auth",
"fullName": "Tesorio/django-saml2-auth",
"ownerId": 8165102,
"htmlUrl": "https://github.com/Tesorio/django-saml2-auth",
"description": "Django SAML2 Authentication Made Easy. Easily integrate with SAML2 SSO identity providers like Okta",
"fork": true
}
Get persisted repo information using the repo owner's name and the repo name
URL: http://localhost:8181/repos/tesorio/django-saml2-auth
Code: 200 OK
Response:
{
"id": 87983904,
"name": "django-saml2-auth",
"fullName": "Tesorio/django-saml2-auth",
"ownerId": 8165102,
"htmlUrl": "https://github.com/Tesorio/django-saml2-auth",
"description": "Django SAML2 Authentication Made Easy. Easily integrate with SAML2 SSO identity providers like Okta",
"fork": true
}
Scrape and persist repo information from GitHub, starting with the first repo with a valid ID greater than start and continuing until count repos have been scraped.
URL: http://localhost:8181/repos?start=0&count=10
Code: 200 OK
Scrape and persist repo information from GitHub by the repo owner's name and the repo name
URL: http://localhost:8181/repos/tesorio/django-saml2-auth
Code: 200 OK
Response:
{
"id": 87983904,
"name": "django-saml2-auth",
"fullName": "Tesorio/django-saml2-auth",
"ownerId": 8165102,
"htmlUrl": "https://github.com/Tesorio/django-saml2-auth",
"description": "Django SAML2 Authentication Made Easy. Easily integrate with SAML2 SSO identity providers like Okta",
"fork": true
}