github-scraper

A dockerized HTTP server for scraping user and repository information from GitHub.

Setup

Docker
- Install Docker
- OR brew users: run `brew install --cask docker`
sbt
- Install sbt
- OR brew users: run `brew install sbt`
Environment variables
- To avoid being rate limited by GitHub's API, the app authenticates with either username/token or username/password.
- Create an access token. The token does not need any additional permissions.
- Run `export GITHUB_USERNAME={your_username}`
- Run `export GITHUB_TOKEN={your_token}`
- OR (not recommended; GitHub's REST API stopped accepting account passwords in November 2020) run `export GITHUB_PASSWORD={your_password}`
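
As a rough illustration of how these variables fit together, the sketch below builds an HTTP Basic `Authorization` header from them. It assumes Basic authentication, which this README does not confirm; the helper name `github_auth_header` is hypothetical, and the server itself is written in Scala, so this is not its actual code.

```python
import base64
import os

def github_auth_header(env=os.environ):
    """Build a Basic auth header from GITHUB_USERNAME plus a token or password."""
    username = env.get("GITHUB_USERNAME")
    # Prefer the token; fall back to the password only when no token is set.
    secret = env.get("GITHUB_TOKEN") or env.get("GITHUB_PASSWORD")
    if not username or not secret:
        raise KeyError("set GITHUB_USERNAME and GITHUB_TOKEN (or GITHUB_PASSWORD)")
    raw = f"{username}:{secret}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
```

Passing a plain dict instead of `os.environ` makes the helper easy to try without touching the real environment.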

Getting started

$ make start

and you're up and running on http://localhost:8181!

Make recipes

make start - Start the server
make stop - Stop the server
make restart - Restart the server
make drop-db - Drop the volume holding the database data

Configuration

CRON

The server is designed to run scraping jobs on a cron schedule. The schedule can be modified in scrape.cron; restart the server to pick up the changes.

Job configuration

The scraping jobs are configured via two JSON files: users.json and repos.json.

Example users.json configuration

{
  "start": 0,
  "count": 100,
  "additional_users": [
    "foo",
    "bar"
  ]
}

This file tells the GitHub user scraping job to start at the first ID after 0 and query 100 users. In a perfect world these would be the users with IDs 1-100, but some users have been deleted, so there are gaps in the IDs.

It will also fetch additional users by username as specified by additional_users.
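
The effect of `start`, `count`, and the ID gaps can be sketched as follows. This is an illustration only: `plan_user_scrape` is a hypothetical helper, the set of existing IDs is made-up data, and the configuration is a scaled-down variant of the example with `count` set to 4.

```python
import json

# Scaled-down variant of the users.json example (count is 4 instead of 100).
USERS_JSON = """
{
  "start": 0,
  "count": 4,
  "additional_users": ["foo", "bar"]
}
"""

def plan_user_scrape(config_text, existing_ids):
    """Return (ids_to_scrape, extra_logins) for a users.json-style config.

    existing_ids stands in for the IDs GitHub still has; it is made-up data.
    """
    cfg = json.loads(config_text)
    # Keep only IDs strictly greater than "start", in ascending order,
    # and stop once "count" of them have been collected.
    candidates = sorted(i for i in existing_ids if i > cfg["start"])
    return candidates[: cfg["count"]], cfg.get("additional_users", [])

# Hypothetical ID space with gaps at 3 and 6 (deleted accounts).
ids, extras = plan_user_scrape(USERS_JSON, {1, 2, 4, 5, 7, 8})
print(ids)     # [1, 2, 4, 5]
print(extras)  # ['foo', 'bar']
```

Because of the gaps, four users are found only by looking as far as ID 5.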

REST API

The server has a REST API for querying the scraped data. It also lets the user scrape more data synchronously.

Users API

Get Users: `GET /users`

Get all persisted user information

URL: http://localhost:8181/users

Code: 200 OK

Response:

[
    {
        "login": "travisemichael",
        "id": 7723569,
        "type": "User",
        "name": "Travis E Michael",
        "company": "Caffeine",
        "blog": "",
        "location": "Redwood City, CA",
        "email": "travisemichael@gmail.com",
        "publicRepos": 6,
        "publicGists": 0,
        "followers": 1,
        "following": 0,
        "createdAt": "2014-05-28T12:16:54Z",
        "updatedAt": "2019-08-08T11:35:58Z"
    },
    {
        "login": "Tesorio",
        "id": 8165102,
        "type": "Organization",
        "name": "Tesorio",
        "blog": "https://www.tesorio.com/",
        "location": "San Francisco Bay Area, CA",
        "email": "hello@tesorio.com",
        "publicRepos": 13,
        "publicGists": 0,
        "followers": 0,
        "following": 0,
        "createdAt": "2014-07-15T04:09:06Z",
        "updatedAt": "2018-12-11T19:22:06Z"
    }
]
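
In the sample above, the Organization entry has no "company" key, so a client should probably treat such fields as optional. The sketch below reads a trimmed copy of that response defensively; whether the server omits empty fields in general is inferred from this one sample, and `summarize_users` is a hypothetical helper.

```python
import json

# Trimmed copy of the response above: two entries, a few fields each.
SAMPLE = """
[
  {"login": "travisemichael", "id": 7723569, "type": "User", "company": "Caffeine"},
  {"login": "Tesorio", "id": 8165102, "type": "Organization"}
]
"""

def summarize_users(body):
    """Map each login to its company, or None when the field is absent."""
    users = json.loads(body)
    # dict.get avoids a KeyError for entries without a "company" key.
    return {u["login"]: u.get("company") for u in users}

print(summarize_users(SAMPLE))
# {'travisemichael': 'Caffeine', 'Tesorio': None}
```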

Get User: `GET /users/:id`

Get persisted user information by user id

URL: http://localhost:8181/users/7723569

Code: 200 OK

Response:

{
    "login": "travisemichael",
    "id": 7723569,
    "type": "User",
    "name": "Travis E Michael",
    "company": "Caffeine",
    "blog": "",
    "location": "Redwood City, CA",
    "email": "travisemichael@gmail.com",
    "publicRepos": 6,
    "publicGists": 0,
    "followers": 1,
    "following": 0,
    "createdAt": "2014-05-28T12:16:54Z",
    "updatedAt": "2019-08-08T11:35:58Z"
}

Get User: `GET /users/:name`

Get persisted user information by user name

URL: http://localhost:8181/users/travisemichael

Code: 200 OK

Response:

{
    "login": "travisemichael",
    "id": 7723569,
    "type": "User",
    "name": "Travis E Michael",
    "company": "Caffeine",
    "blog": "",
    "location": "Redwood City, CA",
    "email": "travisemichael@gmail.com",
    "publicRepos": 6,
    "publicGists": 0,
    "followers": 1,
    "following": 0,
    "createdAt": "2014-05-28T12:16:54Z",
    "updatedAt": "2019-08-08T11:35:58Z"
}
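
Both lookups share the same path shape, `/users/` followed by one segment. A client-side sketch for building either URL is shown below. `user_url` is a hypothetical helper, and how the server tells an ID from a name is not documented here; treating numeric values as IDs is an assumption.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:8181"  # default address from "Getting started"

def user_url(key, base_url=BASE_URL):
    """Build the lookup URL for a numeric ID (int) or a login (str)."""
    if isinstance(key, int):
        return f"{base_url}/users/{key}"
    # Percent-encode the login so it is safe as a single path segment.
    return f"{base_url}/users/{quote(key, safe='')}"

print(user_url(7723569))           # http://localhost:8181/users/7723569
print(user_url("travisemichael"))  # http://localhost:8181/users/travisemichael
```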

Scrape Users: `POST /users`

Scrape and persist user information from GitHub, starting with the first user with a valid ID greater than start and continuing until count users have been scraped.

URL: http://localhost:8181/users?start=0&count=10

Code: 200 OK
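
A minimal sketch of preparing this request with Python's standard library follows. It only constructs the request; `scrape_users_request` is a hypothetical helper name, and the base URL is the default address from "Getting started".

```python
from urllib.parse import urlencode
from urllib.request import Request

def scrape_users_request(start, count, base_url="http://localhost:8181"):
    """Prepare (but do not send) POST /users?start=...&count=..."""
    query = urlencode({"start": start, "count": count})
    return Request(f"{base_url}/users?{query}", method="POST")

req = scrape_users_request(0, 10)
print(req.get_method(), req.full_url)
# POST http://localhost:8181/users?start=0&count=10
```

With the server running, `urllib.request.urlopen(req)` would send it.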

Scrape User: `POST /users/:name`

Scrape and persist user information from GitHub by user name

URL: http://localhost:8181/users/travisemichael

Code: 200 OK

Response:

{
    "login": "travisemichael",
    "id": 7723569,
    "type": "User",
    "name": "Travis E Michael",
    "company": "Caffeine",
    "blog": "",
    "location": "Redwood City, CA",
    "email": "travisemichael@gmail.com",
    "publicRepos": 6,
    "publicGists": 0,
    "followers": 1,
    "following": 0,
    "createdAt": "2014-05-28T12:16:54Z",
    "updatedAt": "2019-08-08T11:35:58Z"
}

Repos API

Get Repos: `GET /repos`

Get all persisted repo information

URL: http://localhost:8181/repos

Code: 200 OK

Response:

[
    {
        "id": 87983904,
        "name": "django-saml2-auth",
        "fullName": "Tesorio/django-saml2-auth",
        "ownerId": 8165102,
        "htmlUrl": "https://github.com/Tesorio/django-saml2-auth",
        "description": "Django SAML2 Authentication Made Easy. Easily integrate with SAML2 SSO identity providers like Okta",
        "fork": true
    },
    {
        "id": 124510032,
        "name": "charts",
        "fullName": "google/charts",
        "ownerId": 1342004,
        "htmlUrl": "https://github.com/google/charts",
        "fork": false
    }
]
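
In the samples in this README, a repo's "ownerId" matches the "id" of the owning user (8165102 for Tesorio). Assuming that holds in general, the two endpoints can be joined on the client side. The sketch below uses trimmed copies of the sample data, and `repos_by_owner` is a hypothetical helper.

```python
import json

# Trimmed copies of the GET /users and GET /repos samples shown in this README.
USERS = json.loads('[{"login": "Tesorio", "id": 8165102}]')
REPOS = json.loads("""
[
  {"id": 87983904, "name": "django-saml2-auth", "ownerId": 8165102, "fork": true},
  {"id": 124510032, "name": "charts", "ownerId": 1342004, "fork": false}
]
""")

def repos_by_owner(users, repos):
    """Group repo names under the owner's login, or the raw ownerId if unknown."""
    logins = {u["id"]: u["login"] for u in users}
    grouped = {}
    for repo in repos:
        owner = logins.get(repo["ownerId"], repo["ownerId"])
        grouped.setdefault(owner, []).append(repo["name"])
    return grouped

print(repos_by_owner(USERS, REPOS))
# {'Tesorio': ['django-saml2-auth'], 1342004: ['charts']}
```

Owners that have not been scraped yet stay keyed by their numeric ID.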

Get Repo: `GET /repos/:id`

Get persisted repo information using the repo id

URL: http://localhost:8181/repos/87983904

Code: 200 OK

Response:

{
    "id": 87983904,
    "name": "django-saml2-auth",
    "fullName": "Tesorio/django-saml2-auth",
    "ownerId": 8165102,
    "htmlUrl": "https://github.com/Tesorio/django-saml2-auth",
    "description": "Django SAML2 Authentication Made Easy. Easily integrate with SAML2 SSO identity providers like Okta",
    "fork": true
}

Get Repo: `GET /repos/:owner/:name`

Get persisted repo information using the repo's owner name and the repo name

URL: http://localhost:8181/repos/tesorio/django-saml2-auth

Code: 200 OK

Response:

{
    "id": 87983904,
    "name": "django-saml2-auth",
    "fullName": "Tesorio/django-saml2-auth",
    "ownerId": 8165102,
    "htmlUrl": "https://github.com/Tesorio/django-saml2-auth",
    "description": "Django SAML2 Authentication Made Easy. Easily integrate with SAML2 SSO identity providers like Okta",
    "fork": true
}

Scrape Repos: `POST /repos`

Scrape and persist repo information from GitHub, starting with the first repo with a valid ID greater than start and continuing until count repos have been scraped.

URL: http://localhost:8181/repos?start=0&count=10

Code: 200 OK

Scrape Repo: `POST /repos/:owner/:name`

Scrape and persist repo information from GitHub by repo's owner name and repo name

URL: http://localhost:8181/repos/tesorio/django-saml2-auth

Code: 200 OK

Response:

{
    "id": 87983904,
    "name": "django-saml2-auth",
    "fullName": "Tesorio/django-saml2-auth",
    "ownerId": 8165102,
    "htmlUrl": "https://github.com/Tesorio/django-saml2-auth",
    "description": "Django SAML2 Authentication Made Easy. Easily integrate with SAML2 SSO identity providers like Okta",
    "fork": true
}
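
A client call for this endpoint might look like the sketch below. `scrape_repo` and its `send` parameter are hypothetical; `send` defaults to `urlopen`, and the dry run swaps in a fake transport that replays part of the sample response so that no server is needed.

```python
import io
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

def scrape_repo(owner, name, base_url="http://localhost:8181", send=urlopen):
    """POST /repos/:owner/:name and return the decoded JSON body."""
    path = f"/repos/{quote(owner, safe='')}/{quote(name, safe='')}"
    request = Request(base_url + path, method="POST")
    # send defaults to urlopen; a stand-in can be passed to avoid network I/O.
    with send(request) as response:
        return json.load(response)

# Dry run with a fake transport that replays part of the response above.
def fake_send(request):
    assert request.get_method() == "POST"
    return io.BytesIO(b'{"id": 87983904, "name": "django-saml2-auth", "fork": true}')

repo = scrape_repo("tesorio", "django-saml2-auth", send=fake_send)
print(repo["id"], repo["fork"])  # 87983904 True
```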
