A few cool highlights about this scraper:
- Lightweight, and made to run on commodity computers - Low memory/CPU utilization due to efficient use of a modern web-scraping framework (https://github.com/ulixee).
- Avoids detection along the entire stack - Strong guarantees on the ability to safely scrape jobs and bypass Cloudflare.
- Customize how many pages you want to scrape - You can specify how many pages of jobs to scrape, from a single page up to all available pages.
- Node.js version 18 or higher. If not installed, download it from https://nodejs.org.
- git is required. If not installed, download it from https://git-scm.com.
- clone the repo
git clone https://github.com/krishgalani/jobsdb-scraper.git
- cd into the repo
cd jobsdb-scraper
- install dependencies
npm install
- compile the TypeScript
npm run build
To find the maxPages available to scrape for a region (hk or th):
node --no-warnings build/src/scrape_jobsdb maxPages <region>
To run the scraper (a full scrape can take up to ~10 minutes):
node --no-warnings build/src/scrape_jobsdb [options]
Options:
-r, --region <two_letters> hk (Hong Kong) or th (Thailand) (required)
-n, --numPages <number> Number of pages to scrape (default: "all")
-s, --saveDir <pathToDir> Directory to store results file (default: "./jobsdb_scrape_results")
Find maxPages available to scrape for Hong Kong
node --no-warnings build/src/scrape_jobsdb maxPages hk
Scrape all pages in Thailand
node --no-warnings build/src/scrape_jobsdb -r th
The result file is named jobsdb-<region>-<pages>-<date>.json and is saved in a folder called jobsdb_scrape_results by default.
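For reference, the default result path might be assembled roughly like this (a hypothetical sketch; `buildResultPath` and the exact date format are illustrative, not the repo's actual code):

```ts
import * as path from 'path';

// Hypothetical helper: builds the default result file path described above.
// The date format shown here is an assumption, not the scraper's exact output.
function buildResultPath(
  region: 'hk' | 'th',
  pages: number | 'all',
  saveDir = './jobsdb_scrape_results',
): string {
  const date = new Date().toISOString().slice(0, 10); // e.g. 2024-05-01
  return path.join(saveDir, `jobsdb-${region}-${pages}-${date}.json`);
}

// Example: jobsdb_scrape_results/jobsdb-hk-all-2024-05-01.json
console.log(buildResultPath('hk', 'all'));
```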
The server part of the program launches two locally hosted @ulixee/cloud server nodes as the engines behind page navigation and fetches; each node hosts a browser with many browsing sessions.
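A minimal sketch of what starting those two local server nodes could look like with the public @ulixee/cloud API (the ports and error handling here are illustrative, not the scraper's actual configuration):

```ts
import { CloudNode } from '@ulixee/cloud';

// Sketch: start two local CloudNode servers, one per worker pool.
// Ports 1818/1819 are placeholders chosen for illustration.
async function startCloudNodes(): Promise<CloudNode[]> {
  const nodes: CloudNode[] = [];
  for (const port of [1818, 1819]) {
    const node = new CloudNode({ port });
    await node.listen();
    nodes.push(node);
  }
  return nodes;
}

startCloudNodes().catch(err => {
  console.error('Failed to start CloudNode servers', err);
  process.exit(1);
});
```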
The client program uses the Ulixee framework (github.com/ulixee). Each worker (a @ulixee/hero instance connected to its own @ulixee/cloud server node) has a browser environment and works page by page through its chunk of the page range, making GET and POST fetches to the backend database. All workers share a page task queue. For each page, the jobIds are first parsed from the returned HTML; then, for each jobId, a fetch to the backend GraphQL API is initiated. Results are received in real time and written to a local JSON file.
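A simplified sketch of one worker's loop, using @ulixee/hero connected to a local CloudNode. The queue, the jobId regex, the GraphQL endpoint, and the query body below are placeholders rather than the repo's actual implementation:

```ts
import Hero from '@ulixee/hero';
import { writeFileSync } from 'fs';

// Placeholder endpoint: the real GraphQL URL and query live in the repo.
const JOBSDB_GRAPHQL_API = 'https://hk.jobsdb.com/graphql';

async function scrapePage(hero: Hero, pageUrl: string): Promise<unknown[]> {
  await hero.goto(pageUrl);
  await hero.waitForPaintingStable();

  // Parse the jobIds out of the returned HTML (the regex is illustrative).
  const html = await hero.document.documentElement.outerHTML;
  const jobIds = [...html.matchAll(/"jobId"\s*:\s*"(\d+)"/g)].map(m => m[1]);

  // For each jobId, fetch the job details from the backend GraphQL API.
  // The fetch runs inside the browser context managed by the CloudNode.
  const jobs: unknown[] = [];
  for (const jobId of jobIds) {
    const response = await hero.fetch(JOBSDB_GRAPHQL_API, {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify({ query: '/* job detail query */', variables: { jobId } }),
    });
    jobs.push(await response.json());
  }
  return jobs;
}

// One worker: connect a Hero instance to a local CloudNode and drain the shared page queue.
async function runWorker(coreHost: string, pageQueue: string[], outFile: string): Promise<void> {
  const hero = new Hero({ connectionToCore: { host: coreHost } });
  const results: unknown[] = [];
  let pageUrl: string | undefined;
  while ((pageUrl = pageQueue.shift()) !== undefined) {
    results.push(...(await scrapePage(hero, pageUrl)));
  }
  await hero.close();
  writeFileSync(outFile, JSON.stringify(results, null, 2));
}
```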