Hub Crawl finds broken links in GitHub repositories. It finds links in the readme portions of repos (or the wiki-content section for wiki pages), scrapes the links in those sections, and continues the crawl beginning with those newfound links. Additionally, the requests are made in parallel to ensure a speedy crawl. It essentially performs a concurrent, breadth-first graph traversal and logs broken links as it goes.
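The gist of that traversal can be sketched as follows. This is an illustration of the approach rather than Hub Crawl's actual code: fetchLinks is a hypothetical helper that resolves to the links found on a page and rejects when the link is broken, and the sketch proceeds level by level rather than through a fixed pool of workers.

```js
// A simplified sketch of a concurrent, breadth-first link crawl.
async function crawl(entry, scope, fetchLinks) {
  const visited = new Set([entry]);
  const broken = [];
  let frontier = [entry];

  while (frontier.length > 0) {
    // Visit every link in the current frontier in parallel.
    const results = await Promise.all(
      frontier.map((url) =>
        fetchLinks(url).then(
          (links) => ({ url, links }),
          () => ({ url, links: null }) // the request failed, so the link is broken
        )
      )
    );

    const next = [];
    for (const { url, links } of results) {
      if (links === null) {
        broken.push(url);
        continue;
      }
      // Every link is visited, but only those within the scope are scraped.
      if (!url.startsWith(scope)) continue;
      for (const link of links) {
        if (!visited.has(link)) {
          visited.add(link);
          next.push(link);
        }
      }
    }

    frontier = next;
  }

  return broken;
}
```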
To begin using Hub Crawl, install it globally with npm:
npm install -g hub-crawl
Or, if you use yarn:
yarn global add hub-crawl
To add Hub Crawl to your project, install it with npm:
npm install hub-crawl
Or, if you use yarn:
yarn add hub-crawl
Regardless of how you choose to use Hub Crawl, the following terms are important:
The entry is the url that is first visited and scraped for links.
The scope is a url that defines the limit of link scraping. For example, let's assume the scope is set to https://github.com/louisscruz/hub-crawl. If https://github.com/louisscruz/hub-crawl/other is in the queue, it will be both visited and scraped. However, if https://google.com is in the queue, it will be visited, but not scraped, because the url does not begin with the scope url. This keeps Hub Crawl from scouring the entire internet. If you do not provide a scope, Hub Crawl defaults to using the entry that was provided.
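In other words, the scope can be thought of as a prefix test on each queued url. A minimal illustration using the example urls above (inScope is a hypothetical helper, not part of Hub Crawl's API):

```js
const scope = 'https://github.com/louisscruz/hub-crawl';

// A queued url is scraped only when it begins with the scope url.
const inScope = (url) => url.startsWith(scope);

inScope('https://github.com/louisscruz/hub-crawl/other'); // true: visited and scraped
inScope('https://google.com');                             // false: visited, but not scraped
```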
The number of workers determines the maximum number of parallel requests to be open at any given time. The optimal number of workers depends on your hardware and internet speed.
There are two ways to use Hub Crawl. For common usage, it is likely preferable to use the command line. If you are integrating Hub Crawl into a bigger project, it is probably worth importing or requiring the Hub Crawl class.
After Hub Crawl is installed globally, you can run hub-crawl in the command line. It accepts arguments and options in the following format:
hub-crawl [entry] [scope] -l -w 12
- entry: If not provided, the program will prompt you for it.
- scope: If not provided, the program will prompt you for it.
- -l: If this option is provided, an initial login window will appear so that the crawl is authenticated while running. This is useful for private repos.
- -w: If this option is provided, it sets the maximum number of workers. For instance, -w 24 would set a maximum of 24 workers.
- There is also a version option that shows the current version of hub-crawl.
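For example, following the format above, a crawl of this repository, scoped to itself, with a login window and 24 workers might be started like this:

hub-crawl https://github.com/louisscruz/hub-crawl https://github.com/louisscruz/hub-crawl -l -w 24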
If you would like to use Hub Crawl in a project, feel free to import it like so:
import HubCrawl from 'hub-crawl';
Or, if you're still not using import:
var HubCrawl = require('hub-crawl');
Create an instance:
const crawler = new HubCrawl(12, 'https://google.com');
HubCrawl takes the following as arguments at instantiation:
HubCrawl(workers, entry[, scope]);
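For instance, based on the signature above, an instance that crawls this repository with 12 workers and scopes the scrape to the same url could be created as follows (the urls are simply the ones used earlier in this readme):

```js
import HubCrawl from 'hub-crawl';

// 12 workers, starting at the repository and scoping the scrape to it.
const crawler = new HubCrawl(
  12,
  'https://github.com/louisscruz/hub-crawl',
  'https://github.com/louisscruz/hub-crawl'
);
```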
The methods available on HubCrawl instances can be seen here. However, the most important methods follow below.
The first of these performs the traversal and logs the broken links. Its login argument determines whether or not an initial window will pop up to log in.

The traverse method performs the traversal and returns the broken links. Its login argument likewise determines whether or not an initial window will pop up to log in. Note that the workers are left alive afterwards.
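As a rough sketch of how traverse might be used, assuming it returns a promise that resolves to the broken links (an assumption here rather than documented behavior, since the requests are made in parallel):

```js
// Assumption: traverse resolves asynchronously with the broken links.
// Passing false skips the initial login window.
crawler.traverse(false).then((brokenLinks) => {
  console.log(brokenLinks);
});
```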
As it currently stands, the crawler makes only a single, concurrent, breadth-first graph traversal. If a server happens to be temporarily down during the traversal, its link will still be counted as broken. In the future, a second check will be made on each of the broken links to ensure that they are indeed broken.
- Set the scope through user input, rather than defaulting to the entry
- Run the queries in parallel, rather than synchronously
- Make into NPM package
- Allow for CLI usage
- Also allow for fallback prompts
- Perform a second check on all broken links to minimize false positives
- Make the output look better
- Allow for the crawler to be easily distributed