Crawler Configuration
The Quillium-Crawler is highly configurable through environment variables. This document provides a detailed explanation of all available configuration options.
Environment Variables
Configuration is managed through environment variables, which can be set in a .env file or directly in the environment. The crawler includes a .env.example file that you can copy and modify.
cp .env.example .env
# Edit .env with your preferred settings
Core Configuration
Start URLs
CRAWLER_START_URL=https://example.com
CRAWLER_START_URLS=https://example.com,https://example2.com
CRAWLER_START_URL
: Single start URL (used if CRAWLER_START_URLS is not set)
CRAWLER_START_URLS
: Comma-separated list of start URLs (each will spawn a separate crawler instance)
At least one of these must be set. If both are set, CRAWLER_START_URLS takes precedence.
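The precedence rule above can be sketched in a few lines of Python. This is only an illustration, not the crawler's actual code: the function name resolve_start_urls is hypothetical, and the handling of blank entries is an assumption.

```python
def resolve_start_urls(env):
    """Return the start URLs, preferring CRAWLER_START_URLS when it is set.

    `env` is any mapping of variable names to strings, e.g. os.environ.
    """
    # Comma-separated list; strip whitespace and drop empty entries (assumed behavior).
    multi = [u.strip() for u in env.get("CRAWLER_START_URLS", "").split(",") if u.strip()]
    if multi:
        return multi

    # Fall back to the single-URL variable.
    single = env.get("CRAWLER_START_URL", "").strip()
    if single:
        return [single]

    # Neither variable is set: at least one is required.
    raise ValueError("set CRAWLER_START_URL or CRAWLER_START_URLS")
```

Calling resolve_start_urls(os.environ) would return one URL per crawler instance to spawn.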
Depth and Limits
CRAWLER_MAX_DEPTH=3
CRAWLER_MAX_VISITS=1000
CRAWLER_MAX_DEPTH
: Maximum crawl depth from the start URL
CRAWLER_MAX_VISITS
: Maximum number of pages to visit per crawler instance
Parallelism and Delays
CRAWLER_PARALLEL_REQUESTS=10
CRAWLER_DELAY_MS=50
CRAWLER_RANDOM_DELAY_MS=50
CRAWLER_TIMEOUT_SEC=10
CRAWLER_PARALLEL_REQUESTS
: Number of parallel requests per crawler
CRAWLER_DELAY_MS
: Fixed delay between requests in milliseconds
CRAWLER_RANDOM_DELAY_MS
: Random delay added to each request in milliseconds
CRAWLER_TIMEOUT_SEC
: Request timeout in seconds
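One plausible reading of how the two delay settings combine is a fixed wait plus a uniformly random extra wait of up to CRAWLER_RANDOM_DELAY_MS. The Python sketch below assumes exactly that; the crawler's real formula may differ, and next_delay_seconds is a hypothetical name.

```python
import random


def next_delay_seconds(delay_ms, random_delay_ms, rng=random):
    """Return the wait before the next request, in seconds.

    Assumption: the random part is drawn uniformly from [0, random_delay_ms]
    and added on top of the fixed delay.
    """
    extra_ms = rng.uniform(0, random_delay_ms) if random_delay_ms > 0 else 0.0
    return (delay_ms + extra_ms) / 1000.0
```

Under this assumption, the example values above (50 and 50) would give a wait of between 50 and 100 milliseconds per request.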
Domain and URL Filtering
CRAWLER_ALLOWED_DOMAINS=example.com,example2.com
CRAWLER_DISALLOWED_DOMAINS=
CRAWLER_ALLOWED_URLS=
CRAWLER_DISALLOWED_URLS=
CRAWLER_IGNORE_QUERY_STRINGS=false
CRAWLER_RESPECT_ROBOTS_TXT=true
CRAWLER_ALLOWED_DOMAINS
: Only crawl these domains (comma-separated, empty for all)
CRAWLER_DISALLOWED_DOMAINS
: Domains to skip (comma-separated)
CRAWLER_ALLOWED_URLS
: Only crawl URLs containing these patterns (comma-separated)
CRAWLER_DISALLOWED_URLS
: Skip URLs containing these patterns (comma-separated)
CRAWLER_IGNORE_QUERY_STRINGS
: Whether to ignore query strings in URLs (true/false)
CRAWLER_RESPECT_ROBOTS_TXT
: Whether to respect robots.txt rules (true/false)
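To make the interaction of these filters concrete, here is a minimal Python sketch. It assumes exact hostname matching for the domain lists, plain substring matching for the URL patterns, and that disallow rules are checked before allow rules; none of this is confirmed by the crawler itself, and the function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def split_list(value):
    """Turn a comma-separated setting into a list, dropping blanks."""
    return [item.strip() for item in value.split(",") if item.strip()]


def normalize(url, ignore_query):
    """Drop the query string when CRAWLER_IGNORE_QUERY_STRINGS is true."""
    parts = urlsplit(url)
    if ignore_query:
        parts = parts._replace(query="")
    return urlunsplit(parts)


def is_allowed(url, allowed_domains, disallowed_domains, allowed_urls, disallowed_urls):
    """Apply the domain and URL filters; an empty allow list means 'allow all'."""
    host = urlsplit(url).hostname or ""
    if host in disallowed_domains:
        return False
    if allowed_domains and host not in allowed_domains:
        return False
    if any(pattern in url for pattern in disallowed_urls):
        return False
    if allowed_urls and not any(pattern in url for pattern in allowed_urls):
        return False
    return True
```

For example, with allowed_domains set to split_list("example.com,example2.com") and every other list empty, a link to https://other.org/ would be skipped.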
Elasticsearch Configuration
CRAWLER_ELASTICSEARCH_ADDRESSES=http://elasticsearch:9200
CRAWLER_ELASTICSEARCH_USERNAME=elastic
CRAWLER_ELASTICSEARCH_PASSWORD=changeme
CRAWLER_INDEX_NAME=crawled_data
CRAWLER_ENABLE_FULL_CONTENT=false
CRAWLER_ELASTICSEARCH_ADDRESSES
: Comma-separated list of Elasticsearch endpoints
CRAWLER_ELASTICSEARCH_USERNAME
: Elasticsearch username for authentication
CRAWLER_ELASTICSEARCH_PASSWORD
: Elasticsearch password for authentication
CRAWLER_INDEX_NAME
: Name of the Elasticsearch index to store crawled data
CRAWLER_ENABLE_FULL_CONTENT
: Whether to store the full HTML content of pages (true/false)
Metrics
CRAWLER_ENABLE_METRICS=true
CRAWLER_ENABLE_METRICS
: Enable Prometheus metrics endpoint (true/false)
Proxy Configuration
CRAWLER_PROXIES=http://user:pass@proxyserver:8080
CRAWLER_PROXIES
: Comma-separated list of proxies to rotate through
Proxies should be specified in the format protocol://[username:password@]host:port
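As a quick way to sanity-check proxy entries before starting a crawl, the format above can be validated with Python's standard urllib.parse module. The helper name parse_proxy and the returned dictionary layout are assumptions made for illustration; this is not how the crawler itself parses the value.

```python
from urllib.parse import urlsplit


def parse_proxy(proxy):
    """Split protocol://[username:password@]host:port into its parts."""
    parts = urlsplit(proxy.strip())
    # Protocol, host, and port are required; credentials are optional.
    if not parts.scheme or not parts.hostname or parts.port is None:
        raise ValueError(f"invalid proxy: {proxy!r}")
    return {
        "protocol": parts.scheme,
        "username": parts.username,  # None when no credentials are given
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
    }


def parse_proxy_list(value):
    """Parse a comma-separated CRAWLER_PROXIES value."""
    return [parse_proxy(item) for item in value.split(",") if item.strip()]
```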
Anti-Bot Measures
CRAWLER_ENABLE_USER_AGENT_ROTATION=true
CRAWLER_ENABLE_HEADER_RANDOMIZATION=true
CRAWLER_ENABLE_COOKIE_HANDLING=true
CRAWLER_ENABLE_SOPHISTICATED_DELAYS=true
CRAWLER_RANDOM_DELAY_FACTOR=1.5
CRAWLER_CUSTOM_USER_AGENTS=
CRAWLER_CUSTOM_ACCEPT_LANGUAGES=
CRAWLER_ENABLE_USER_AGENT_ROTATION
: Enable random user agent rotation
CRAWLER_ENABLE_HEADER_RANDOMIZATION
: Enable HTTP header randomization
CRAWLER_ENABLE_COOKIE_HANDLING
: Enable browser-like cookie handling
CRAWLER_ENABLE_SOPHISTICATED_DELAYS
: Enable more human-like delays
CRAWLER_RANDOM_DELAY_FACTOR
: Factor for random delay calculation (0.5-2.0 recommended)
CRAWLER_CUSTOM_USER_AGENTS
: Additional custom user agents (comma-separated)
CRAWLER_CUSTOM_ACCEPT_LANGUAGES
: Additional custom accept-language values (comma-separated)
Configuration Precedence
The crawler loads configuration in the following order of precedence:
- Environment variables
- .env file
- Default values
This means that environment variables will override values in the .env file, and any unspecified values will use defaults.
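The layering described here amounts to three dictionaries merged from lowest to highest priority. The Python sketch below illustrates the idea with a deliberately simple .env parser (no quoting or variable expansion); the function names and the default values used in the usage note are hypothetical and not taken from the crawler.

```python
def parse_dotenv(text):
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def resolve_config(environ, dotenv_text, defaults):
    """Merge settings: defaults < .env file < environment variables."""
    merged = dict(defaults)
    merged.update(parse_dotenv(dotenv_text))
    # Only CRAWLER_* variables from the process environment are relevant here.
    merged.update({k: v for k, v in environ.items() if k.startswith("CRAWLER_")})
    return merged
```

With a hypothetical default of CRAWLER_MAX_DEPTH=3, a .env file setting it to 5, and an exported value of 7, the resolved value would be 7.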