CLI Reference
qdequele edited this page Nov 9, 2024 · 1 revision
Scrapix can be run as a CLI tool in two ways:
For a single crawling task, run Scrapix with either a configuration file or an inline JSON configuration:
# Using a config file path
npm run start -- -p /path/to/config.json
# Using inline JSON config
npm run start -- -c '{"start_urls":["https://example.com"],...}'
You can also specify a custom browser binary path:
npm run start -- -p /path/to/config.json -b /path/to/browser
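The invocations above can also be assembled from a script. The following is a minimal Python sketch, assuming Scrapix is launched through npm from its project directory. The helper name `build_crawl_command` is hypothetical and not part of Scrapix, and combining `-b` with `-c` is assumed to work the same way as with `-p`. The sketch only builds the argument list; it does not start a crawl.

```python
import json
from typing import Optional


def build_crawl_command(
    config_path: Optional[str] = None,
    config: Optional[dict] = None,
    browser_path: Optional[str] = None,
) -> list:
    """Build the argv list for a single Scrapix crawl (hypothetical helper).

    Exactly one of config_path (-p) or config (-c, inline JSON) must be given.
    """
    if (config_path is None) == (config is None):
        raise ValueError("pass exactly one of config_path or config")

    argv = ["npm", "run", "start", "--"]
    if config_path is not None:
        argv += ["-p", config_path]
    else:
        # json.dumps produces valid JSON; keeping it as a single argv element
        # avoids shell-quoting problems with inline JSON.
        argv += ["-c", json.dumps(config)]
    if browser_path is not None:
        # Assumption: -b is accepted alongside either -p or -c.
        argv += ["-b", browser_path]
    return argv


command = build_crawl_command(
    config_path="/path/to/config.json",
    browser_path="/path/to/browser",
)
print(command)
```

The resulting list could then be passed to `subprocess.run(command, check=True)` from the Scrapix project directory.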
To run Scrapix as a server that handles crawling requests:
npm run serve
The following environment variables can be configured:
- PORT: Server port number (default: 8080)
- REDIS_URL: Redis connection URL for task queue management
- WEBHOOK_URL: Default webhook URL for notifications
- WEBHOOK_TOKEN: Authentication token for webhook requests
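As an illustration of how these variables might be supplied when launching the server from a script, here is a minimal Python sketch. The helper name `build_server_env` is hypothetical, and the port and Redis URL in the example call are placeholder values rather than Scrapix defaults (only the 8080 default comes from the list above).

```python
import os
from typing import Optional


def build_server_env(
    port: int = 8080,
    redis_url: Optional[str] = None,
    webhook_url: Optional[str] = None,
    webhook_token: Optional[str] = None,
    base_env: Optional[dict] = None,
) -> dict:
    """Return an environment mapping for `npm run serve` (hypothetical helper)."""
    # Start from the current environment unless a base mapping is supplied.
    env = dict(os.environ if base_env is None else base_env)
    env["PORT"] = str(port)  # environment values must be strings
    optional = {
        "REDIS_URL": redis_url,
        "WEBHOOK_URL": webhook_url,
        "WEBHOOK_TOKEN": webhook_token,
    }
    # Only set the variables that were actually provided.
    for name, value in optional.items():
        if value is not None:
            env[name] = value
    return env


# Placeholder values for illustration only.
env = build_server_env(port=3000, redis_url="redis://localhost:6379", base_env={})
print(env)
```

In practice, leaving `base_env` unset inherits the current environment (including `PATH`), so the result could be used as `subprocess.run(["npm", "run", "serve"], env=build_server_env(), check=True)`.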
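The configuration file (JSON format) supports the following options. The comments and the "|"-separated alternatives below are annotations describing each field; an actual configuration file must be plain JSON without them: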
{
// Required Meilisearch Configuration
"meilisearch_index_uid": "string", // Unique identifier for the Meilisearch index
"meilisearch_url": "string", // URL of the Meilisearch server instance
"meilisearch_api_key": "string", // API key for Meilisearch authentication
"start_urls": ["string"], // Initial URLs to begin crawling from
// Crawler Configuration
"crawler_type": "cheerio" | "puppeteer" | "playwright", // Web scraping engine (default: "cheerio")
"strategy": "docssearch" | "default" | "schema" | "markdown" | "custom", // Content extraction strategy
// URL Control
"urls_to_exclude": ["string"], // URLs to skip during crawling
"urls_to_index": ["string"], // Specific URLs to index (overrides start_urls)
"urls_to_not_index": ["string"], // URLs to exclude from indexing but still crawl
// Performance Configuration
"max_concurrency": number, // Maximum concurrent requests (default: Infinity)
"max_requests_per_minute": number, // Rate limit for requests (default: Infinity)
"batch_size": number, // Documents per indexing batch (default: 1000)
// Webhook Configuration
"webhook_url": "string", // URL for webhook notifications
"webhook_payload": object, // Custom data for webhook payloads
// Additional Meilisearch Configuration
"primary_key": "string", // Unique identifier field for documents
"meilisearch_settings": object, // Custom Meilisearch index settings
// Additional Options
"additional_request_headers": object, // Custom HTTP headers for requests
"user_agents": ["string"], // Custom User-Agent strings to rotate
"not_found_selectors": ["string"], // Selectors indicating 404 pages
"schema_settings": { // Settings for schema-based extraction
"convert_dates": boolean,
"only_type": "string"
}
}
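To make the option list concrete, the following Python sketch builds a small configuration and checks it against the required keys and allowed values documented above. It is not Scrapix's own validation: the helper name `check_config` is hypothetical, and the index name, Meilisearch URL, and API key are placeholder values.

```python
import json

# Keys and allowed values taken from the option list above.
REQUIRED_KEYS = (
    "meilisearch_index_uid",
    "meilisearch_url",
    "meilisearch_api_key",
    "start_urls",
)
CRAWLER_TYPES = ("cheerio", "puppeteer", "playwright")
STRATEGIES = ("docssearch", "default", "schema", "markdown", "custom")


def check_config(config: dict) -> list:
    """Return a list of problems found in a config dict (empty if none)."""
    problems = [
        f"missing required key: {key}" for key in REQUIRED_KEYS if key not in config
    ]
    if "start_urls" in config:
        urls = config["start_urls"]
        if not (isinstance(urls, list) and all(isinstance(u, str) for u in urls)):
            problems.append("start_urls must be a list of strings")
    # crawler_type falls back to the documented default when omitted.
    if config.get("crawler_type", "cheerio") not in CRAWLER_TYPES:
        problems.append("crawler_type must be one of: " + ", ".join(CRAWLER_TYPES))
    if "strategy" in config and config["strategy"] not in STRATEGIES:
        problems.append("strategy must be one of: " + ", ".join(STRATEGIES))
    return problems


# Placeholder values for illustration only.
example = {
    "meilisearch_index_uid": "docs",
    "meilisearch_url": "http://localhost:7700",
    "meilisearch_api_key": "masterKey",
    "start_urls": ["https://example.com"],
    "crawler_type": "cheerio",
    "strategy": "default",
    "batch_size": 1000,
}

print(check_config(example))
print(json.dumps(example, indent=2))  # plain JSON, suitable for a config file
```

The `json.dumps` output contains no comments or type annotations, so it could be saved to a file and passed with `-p`.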