CLI Reference

qdequele edited this page Nov 9, 2024 · 1 revision

Scrapix can be run as a CLI tool in two ways:

One-shot Process

For a single crawling task, pass Scrapix its configuration either as a file path or as inline JSON:

# Using a config file path
npm run start -- -p /path/to/config.json

# Using inline JSON config
npm run start -- -c '{"start_urls":["https://example.com"],...}'

You can also specify a custom browser binary path:

npm run start -- -p /path/to/config.json -b /path/to/browser
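When the one-shot command is launched from a script, it can help to assemble its arguments programmatically. The following is a minimal sketch, not part of Scrapix: the helper name buildStartArgs and the OneShotOptions type are hypothetical, and it assumes -p and -c are alternatives, as the examples above suggest. Only the -p, -c and -b flags are taken from this page.

```typescript
// Hypothetical helper: assembles the argument list for `npm run start`.
// Only the -p, -c and -b flags come from this page; the rest is illustrative.
interface OneShotOptions {
  configPath?: string; // passed with -p
  inlineConfig?: object; // serialized to JSON and passed with -c
  browserPath?: string; // optional, passed with -b
}

function buildStartArgs(options: OneShotOptions): string[] {
  const { configPath, inlineConfig, browserPath } = options;
  // Assumption: exactly one configuration source is given.
  if ((configPath === undefined) === (inlineConfig === undefined)) {
    throw new Error("Provide either configPath or inlineConfig, but not both");
  }
  const args = ["run", "start", "--"];
  if (configPath !== undefined) {
    args.push("-p", configPath);
  } else {
    args.push("-c", JSON.stringify(inlineConfig));
  }
  if (browserPath !== undefined) {
    args.push("-b", browserPath);
  }
  return args;
}

// Mirrors the file-based command with a custom browser binary.
const fileArgs = buildStartArgs({
  configPath: "/path/to/config.json",
  browserPath: "/path/to/browser",
});

// Mirrors the inline JSON form (config shortened for illustration).
const inlineArgs = buildStartArgs({
  inlineConfig: { start_urls: ["https://example.com"] },
});
```

In Node, an array like this can be passed as the second argument to child_process.spawn("npm", args), which sidesteps shell quoting of the inline JSON.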

Server Mode

To run Scrapix as a server that handles crawling requests:

npm run serve

Environment Variables

The following environment variables can be configured:

  • PORT: Server port number (default: 8080)
  • REDIS_URL: Redis connection URL for task queue management
  • WEBHOOK_URL: Default webhook URL for notifications
  • WEBHOOK_TOKEN: Authentication token for webhook requests
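As an illustration of how these variables might be read, here is a minimal sketch. It is not Scrapix's actual startup code: the resolveServerEnv function and the ServerEnv shape are hypothetical, and the port validation is an added assumption. The variable names and the 8080 default come from the list above.

```typescript
// Hypothetical settings shape; Scrapix's internal names may differ.
interface ServerEnv {
  port: number;
  redisUrl: string | undefined;
  webhookUrl: string | undefined;
  webhookToken: string | undefined;
}

// Same shape as Node's process.env, kept generic so the sketch stays self-contained.
type EnvSource = Record<string, string | undefined>;

function resolveServerEnv(env: EnvSource): ServerEnv {
  // PORT falls back to the documented default of 8080.
  const rawPort = env["PORT"] ?? "8080";
  const port = Number(rawPort);
  // Assumption: reject anything that is not a valid TCP port.
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT value: ${rawPort}`);
  }
  return {
    port,
    redisUrl: env["REDIS_URL"],
    webhookUrl: env["WEBHOOK_URL"],
    webhookToken: env["WEBHOOK_TOKEN"],
  };
}

// No variables set: only the port default applies.
const defaults = resolveServerEnv({});

// Placeholder values for illustration.
const custom = resolveServerEnv({
  PORT: "3000",
  REDIS_URL: "redis://localhost:6379",
});
```

In a Node process the same function could be called as resolveServerEnv(process.env).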

Configuration Options

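The configuration file (JSON format) supports the following options. The listing below is an annotated schema rather than literal JSON: standard JSON does not allow comments, and the type placeholders ("string", number, object, boolean) and the | alternatives must be replaced with actual values: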

{
  // Required Configuration (Meilisearch connection and start URLs)
  "meilisearch_index_uid": "string",  // Unique identifier for the Meilisearch index
  "meilisearch_url": "string",        // URL of the Meilisearch server instance
  "meilisearch_api_key": "string",    // API key for Meilisearch authentication
  "start_urls": ["string"],           // Initial URLs to begin crawling from

  // Crawler Configuration
  "crawler_type": "cheerio" | "puppeteer" | "playwright", // Web scraping engine (default: "cheerio")
  "strategy": "docssearch" | "default" | "schema" | "markdown" | "custom", // Content extraction strategy

  // URL Control
  "urls_to_exclude": ["string"],      // URLs to skip during crawling
  "urls_to_index": ["string"],        // Specific URLs to index (overrides start_urls)
  "urls_to_not_index": ["string"],    // URLs to exclude from indexing but still crawl

  // Performance Configuration
  "max_concurrency": number,          // Maximum concurrent requests (default: Infinity)
  "max_requests_per_minute": number,  // Rate limit for requests (default: Infinity)
  "batch_size": number,               // Documents per indexing batch (default: 1000)

  // Webhook Configuration
  "webhook_url": "string",            // URL for webhook notifications
  "webhook_payload": object,          // Custom data for webhook payloads

  // Meilisearch Configuration
  "primary_key": "string",            // Unique identifier field for documents
  "meilisearch_settings": object,     // Custom Meilisearch index settings

  // Additional Options
  "additional_request_headers": object, // Custom HTTP headers for requests
  "user_agents": ["string"],           // Custom User-Agent strings to rotate
  "not_found_selectors": ["string"],   // Selectors indicating 404 pages
  "schema_settings": {                 // Settings for schema-based extraction
    "convert_dates": boolean,
    "only_type": "string"
  }
}
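The schema above can be expressed as a TypeScript type, which makes it easier to build a config and check it before writing it to disk. This is a sketch under stated assumptions: the ScrapixConfig interface and the missingRequiredKeys helper are hypothetical names, every option outside the first "Required" group is assumed to be optional, and all values in the example (URLs, API key, index name) are placeholders.

```typescript
// Hypothetical type mirroring the documented options.
// Assumption: only the first four fields are required.
interface ScrapixConfig {
  meilisearch_index_uid: string;
  meilisearch_url: string;
  meilisearch_api_key: string;
  start_urls: string[];
  crawler_type?: "cheerio" | "puppeteer" | "playwright";
  strategy?: "docssearch" | "default" | "schema" | "markdown" | "custom";
  urls_to_exclude?: string[];
  urls_to_index?: string[];
  urls_to_not_index?: string[];
  max_concurrency?: number;
  max_requests_per_minute?: number;
  batch_size?: number;
  webhook_url?: string;
  webhook_payload?: Record<string, unknown>;
  primary_key?: string;
  meilisearch_settings?: Record<string, unknown>;
  additional_request_headers?: Record<string, string>;
  user_agents?: string[];
  not_found_selectors?: string[];
  schema_settings?: { convert_dates?: boolean; only_type?: string };
}

const REQUIRED_KEYS = [
  "meilisearch_index_uid",
  "meilisearch_url",
  "meilisearch_api_key",
  "start_urls",
] as const;

// Returns the required keys that are absent or empty.
function missingRequiredKeys(config: Partial<ScrapixConfig>): string[] {
  return REQUIRED_KEYS.filter((key) => {
    const value = config[key];
    return value === undefined || value.length === 0;
  });
}

// Placeholder values for illustration only.
const exampleConfig: ScrapixConfig = {
  meilisearch_index_uid: "docs",
  meilisearch_url: "http://localhost:7700",
  meilisearch_api_key: "masterKey",
  start_urls: ["https://example.com"],
  crawler_type: "cheerio",
  strategy: "default",
  max_requests_per_minute: 120,
  batch_size: 1000,
};

// Valid JSON text, suitable for saving as config.json or passing with -c.
const configJson = JSON.stringify(exampleConfig, null, 2);
```

Running missingRequiredKeys on a config before starting a crawl gives an early, readable error instead of a failure partway through indexing.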