Link Discoverer

A Node application that crawls a website and gathers all links that are subdirectories of the provided URL. For our scraping needs, we found it useful to break the crawl and the scrape into separate libraries. This is just the crawler, which returns a deduplicated array of URLs for a given website.
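
The crawler itself lives in index.js; the sketch below is only an illustration of the idea, not the library's actual code. It assumes Node 18+ (for the built-in fetch) and uses a simple regex instead of a real HTML parser: start at the provided URL, follow links that stay under its path, and dedupe with a Set.

// Illustrative sketch only (not the library's actual code): breadth-first crawl
// that keeps links under the starting URL's path and returns them deduplicated.
async function discoverLinks (startUrl) {
  const base = new URL(startUrl)
  const found = new Set([base.href])
  const queue = [base.href]

  while (queue.length > 0) {
    const pageUrl = queue.shift()
    let html
    try {
      const res = await fetch(pageUrl)
      const type = res.headers.get('content-type') || ''
      if (!res.ok || !type.includes('text/html')) continue
      html = await res.text()
    } catch (err) {
      continue // skip pages that fail to load
    }

    // Pull href values out of the markup; a regex is enough for a sketch.
    for (const match of html.matchAll(/href="([^"#]+)"/g)) {
      let link
      try {
        link = new URL(match[1], pageUrl) // resolves relative links
      } catch (err) {
        continue
      }
      // Keep only links that are subdirectories of the provided URL.
      if (link.origin === base.origin &&
          link.pathname.startsWith(base.pathname) &&
          !found.has(link.href)) {
        found.add(link.href)
        queue.push(link.href)
      }
    }
  }

  return [...found] // deduplicated array of URLs
}

discoverLinks('https://example.com/').then(urls => console.log(urls))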

Clone and Install Dependencies

git clone https://github.com/tylrhas/link-discoverer.git
cd link-discoverer
npm i

# TO RUN
npm run dev
# OR
node index.js

Google Cloud Run Deploy

You need an authenticated GCloud account with a billing-enabled project and access to the IAM and Cloud Run services.

You also need the project ID, which is referenced as $GCP_PROJECT_ID below.

gcloud iam service-accounts create link-discoverer-identity

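The deploy command below assumes the container image already exists in Container Registry. If you still need to build and push it (assuming a Dockerfile at the repo root), Cloud Build can do both in one step:

gcloud builds submit --tag gcr.io/$GCP_PROJECT_ID/link-discoverer
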
gcloud run deploy link-discoverer \
  --image gcr.io/$GCP_PROJECT_ID/link-discoverer \
  --service-account link-discoverer-identity \
  --no-allow-unauthenticated

When prompted, select [1] for Cloud Run (fully managed) and [21] for the us-west1 region.

Record the Service URL from the response and the service-account name you provided; both will be used by the service invoking this run.
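
Because the service is deployed with --no-allow-unauthenticated, the calling service's identity also needs the Cloud Run Invoker role. A sketch of that binding, assuming the caller runs as a hypothetical service account named caller-identity:

gcloud run services add-iam-policy-binding link-discoverer \
  --region us-west1 \
  --member serviceAccount:caller-identity@$GCP_PROJECT_ID.iam.gserviceaccount.com \
  --role roles/run.invoker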
