Hello you,

## Getting started

### Setting up the Python environment

`pipenv` is used to manage the whole Python environment.

- Install `pipenv` (this must only be done once)
- Run `pipenv install --dev` to create the Python virtual environment, if it does not exist yet, and to install the scraper's dependencies (this will have to be done again each time the dependencies change)
- Run `pipenv shell` to enter the Python virtual environment (this will have to be done each time you get back to work on the scraper); the full first-time sequence is sketched below
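Put together, a first-time setup looks like this (the `pip` line is just one common way to install `pipenv`, not the only one):

```sh
pip install --user pipenv   # once, if pipenv is not installed yet
pipenv install --dev        # create the virtualenv and install the dependencies
pipenv shell                # activate the environment for this session
```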

### Authentication

Disclaimer: when Algolia runs the scraper on Algolia's infrastructure, no authentication method is possible, since DocSearch focuses on publicly available websites.

WARNING: Please be aware that the scraper sends authentication headers to every scraped site, so use `allowed_domains` to adjust the scope accordingly!

#### Basic HTTP

Basic HTTP authentication is supported by setting these environment variables:

- `DOCSEARCH_BASICAUTH_USERNAME`
- `DOCSEARCH_BASICAUTH_PASSWORD`
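For example, export them in your shell before starting a crawl (the values below are placeholders):

```sh
export DOCSEARCH_BASICAUTH_USERNAME="scraper"   # placeholder username
export DOCSEARCH_BASICAUTH_PASSWORD="s3cret"    # placeholder; use the site's real credentials
```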

#### Cloudflare Access: Identity and Access Management

If you need to scrape sites protected by Cloudflare Access, you have to set the appropriate HTTP headers. The values for these headers are taken from the `CF_ACCESS_CLIENT_ID` and `CF_ACCESS_CLIENT_SECRET` environment variables.
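Cloudflare expects these values in the `CF-Access-Client-Id` and `CF-Access-Client-Secret` request headers; a sketch with placeholder service-token values:

```sh
export CF_ACCESS_CLIENT_ID="xxxxxxxx.access"        # placeholder service token ID
export CF_ACCESS_CLIENT_SECRET="xxxxxxxxxxxxxxxx"   # placeholder service token secret
```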

#### Google Cloud Identity-Aware Proxy

In the case of Google Cloud Identity-Aware Proxy, please specify these env variables:
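A minimal sketch, assuming the scraper reads the IAP OAuth client ID and a service-account JSON key from the two variables below (the names are an assumption, so verify them against the scraper's source):

```sh
# Assumed variable names; confirm them in the scraper's configuration code.
export IAP_AUTH_CLIENT_ID="1234-abc.apps.googleusercontent.com"    # placeholder: OAuth client ID of the IAP-protected app
export IAP_AUTH_SERVICE_ACCOUNT_JSON='{"type": "service_account"}' # placeholder: service-account key JSON
```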

### Installing Chrome Headless

Websites that need JavaScript for rendering are passed through ChromeDriver. Download the version suited to your OS, then update `CHROMEDRIVER_PATH` in your `.env` file.
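For example (the path below is a placeholder; point it at the binary you downloaded):

```sh
# .env
CHROMEDRIVER_PATH="/usr/local/bin/chromedriver"   # placeholder path
```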

You should be ready to go.

## Running

See the dedicated page on Algolia's documentation website.

## Lint code

The CI checks the code against linting rules with `pylint` (installed by `pipenv` as a dev package).

To run the linter, run the following command at the root of your clone:

```sh
pipenv run pylint scraper cli deployer
```

## Run the tests

To run the full test suite, run `./docsearch test`.

## Special instructions for Algolia employees

### Ready to deploy

If you are an Algolia employee and want to manage a DocSearch account, you'll need to copy and paste the required variables into your `.env` file.

Ping the @DocSearch team to get access to those credentials.

Clone the configurations repository into your docsearch-scraper directory. Run this command from the scraper root:

```sh
git clone git@github.com:algolia/docsearch-configs.git configs/public
```

The CLI will then offer more commands for you to run.

### Reason for a crawl failure from the Docker image

To spot why a crawl fails without reading the logs, we have defined some custom exit statuses:

| Exit code | Reason                                            |
| --------- | ------------------------------------------------- |
| 3         | No records extracted from the crawl               |
| 4         | Too many hits returned from the crawl             |
| 5         | The configuration provided is not valid JSON      |
| 6         | The endpoint to call is incorrect                 |
| 7         | The credentials used for the request are not set  |
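The status can then be checked right after a run; a sketch, with an illustrative `docker run` invocation (see the documentation for the exact flags):

```sh
docker run --env-file=.env algolia/docsearch-scraper   # illustrative flags
echo $?   # inspect the exit status, e.g. 3 means the crawl extracted no records
```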