
Crawler Architecture

The Quillium-Crawler is built with a modular architecture that separates concerns and allows for flexible configuration. This document describes the main components and how they interact.

High-Level Architecture

The crawler is organized into several key components:

Quillium-Crawler/
├── internal/
│   ├── api/              # API server and routes
│   ├── crawler/          # Crawling logic, config, anti-bot, proxies
│   ├── dedup/            # Deduplication (bloom filter)
│   ├── elasticsearch/    # Elasticsearch integration
│   └── metrics/          # Metrics and monitoring
└── main.go               # Application entry point

Core Components

Main Application (main.go)

The entry point of the application. It:

  1. Loads configuration from environment variables
  2. Initializes the Elasticsearch client and creates the index if needed
  3. Creates the crawler manager and individual crawler instances
  4. Configures proxies if specified
  5. Starts the API server
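
As a rough illustration of the first step, the sketch below loads a crawler configuration from environment variables using only the standard library. The variable names (CRAWLER_START_URLS, ELASTICSEARCH_URL, CRAWLER_PROXIES, CRAWLER_MAX_DEPTH) and the defaults are illustrative assumptions, not necessarily the exact names main.go reads.

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Config mirrors the kind of settings main.go reads at startup.
type Config struct {
	StartURLs []string
	ESAddress string
	Proxies   []string
	MaxDepth  int
}

// loadConfig reads settings from environment variables, falling back to
// defaults when a variable is unset.
func loadConfig() Config {
	cfg := Config{
		StartURLs: strings.Split(getEnv("CRAWLER_START_URLS", "https://example.com"), ","),
		ESAddress: getEnv("ELASTICSEARCH_URL", "http://localhost:9200"),
		MaxDepth:  2,
	}
	if p := os.Getenv("CRAWLER_PROXIES"); p != "" {
		cfg.Proxies = strings.Split(p, ",")
	}
	if d, err := strconv.Atoi(os.Getenv("CRAWLER_MAX_DEPTH")); err == nil {
		cfg.MaxDepth = d
	}
	return cfg
}

func getEnv(key, fallback string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return fallback
}

func main() {
	cfg := loadConfig()
	fmt.Printf("%+v\n", cfg)
}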

Crawler Manager

The crawler manager coordinates multiple crawler instances, each responsible for crawling a specific start URL. It provides methods to:

  • Add new crawler instances
  • Start and stop crawlers
  • Monitor crawler status
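
A minimal sketch of such a manager is shown below. The crawlerInstance type, the method names, and the locking scheme are simplified assumptions; the real manager in internal/crawler may differ in detail.

package main

import (
	"context"
	"fmt"
	"sync"
)

// crawlerInstance is a stand-in for the project's real crawler type.
type crawlerInstance struct {
	startURL string
	cancel   context.CancelFunc
	running  bool
}

// Manager coordinates multiple crawler instances keyed by start URL.
type Manager struct {
	mu       sync.Mutex
	crawlers map[string]*crawlerInstance
}

func NewManager() *Manager {
	return &Manager{crawlers: make(map[string]*crawlerInstance)}
}

// Add registers a new crawler instance for a start URL.
func (m *Manager) Add(startURL string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.crawlers[startURL] = &crawlerInstance{startURL: startURL}
}

// Start launches a crawler in its own goroutine.
func (m *Manager) Start(startURL string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	c, ok := m.crawlers[startURL]
	if !ok || c.running {
		return
	}
	ctx, cancel := context.WithCancel(context.Background())
	c.cancel, c.running = cancel, true
	go func() {
		<-ctx.Done() // the real crawler would crawl here until cancelled
	}()
}

// Stop cancels a running crawler.
func (m *Manager) Stop(startURL string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if c, ok := m.crawlers[startURL]; ok && c.running {
		c.cancel()
		c.running = false
	}
}

// Status reports which crawlers are currently running.
func (m *Manager) Status() map[string]bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	status := make(map[string]bool, len(m.crawlers))
	for url, c := range m.crawlers {
		status[url] = c.running
	}
	return status
}

func main() {
	m := NewManager()
	m.Add("https://example.com")
	m.Start("https://example.com")
	fmt.Println(m.Status())
	m.Stop("https://example.com")
}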

Crawler Service

The crawler service is the core component responsible for:

  • Navigating websites by following links
  • Extracting content from pages
  • Applying URL filtering rules
  • Implementing anti-bot measures
  • Handling request delays and timeouts
  • Deduplicating URLs using bloom filters

The crawler is built on top of the Colly framework, which provides efficient web scraping capabilities.
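
The sketch below shows a minimal Colly collector covering link following, content extraction, depth limiting, request delays, and timeouts. The specific selectors, limits, and User-Agent string are illustrative assumptions, and the project's storage, deduplication, proxy, and anti-bot layers are omitted.

package main

import (
	"fmt"
	"time"

	"github.com/gocolly/colly/v2"
)

func main() {
	// A collector with a bounded depth, mirroring the crawler's depth limit.
	c := colly.NewCollector(
		colly.MaxDepth(2),
		colly.Async(true),
	)

	// Rate limiting and politeness: cap parallelism and add a random delay.
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
		RandomDelay: 2 * time.Second,
	})
	c.SetRequestTimeout(15 * time.Second)

	// Setting a custom User-Agent is one of the simpler anti-bot measures.
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("User-Agent", "Quillium-Crawler-example")
	})

	// Extract page content; the real service would send this to Elasticsearch.
	c.OnHTML("title", func(e *colly.HTMLElement) {
		fmt.Printf("%s -> %q\n", e.Request.URL, e.Text)
	})

	// Follow links, subject to the collector's URL filters and depth limit.
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	c.Visit("https://example.com")
	c.Wait()
}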

Elasticsearch Service

The Elasticsearch service manages the storage and retrieval of crawled data:

  • Initializes the Elasticsearch connection
  • Creates and manages the index with appropriate mappings
  • Stores page data with URL-based document IDs
  • Handles document updates for revisited pages
  • Provides methods to retrieve stored pages
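
A minimal sketch of the storage pattern with the official go-elasticsearch client is shown below. The index name ("pages"), the document fields, and the choice of a SHA-256 hash of the URL as the document ID are illustrative assumptions.

package main

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"log"
	"strings"

	"github.com/elastic/go-elasticsearch/v8"
	"github.com/elastic/go-elasticsearch/v8/esapi"
)

func main() {
	es, err := elasticsearch.NewClient(elasticsearch.Config{
		Addresses: []string{"http://localhost:9200"},
	})
	if err != nil {
		log.Fatalf("connecting to Elasticsearch: %v", err)
	}

	pageURL := "https://example.com/"
	doc, _ := json.Marshal(map[string]string{
		"url":     pageURL,
		"title":   "Example Domain",
		"content": "page text would go here",
	})

	// Deriving the document ID from the URL means re-indexing a revisited
	// page updates the existing document instead of creating a duplicate.
	sum := sha256.Sum256([]byte(pageURL))
	req := esapi.IndexRequest{
		Index:      "pages",
		DocumentID: hex.EncodeToString(sum[:]),
		Body:       strings.NewReader(string(doc)),
	}

	res, err := req.Do(context.Background(), es)
	if err != nil {
		log.Fatalf("indexing page: %v", err)
	}
	defer res.Body.Close()
	log.Println("index response:", res.Status())
}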

Deduplication Service

The deduplication service prevents crawling the same URL multiple times using a bloom filter:

  • Efficiently checks if a URL has been seen before
  • Optimizes memory usage with configurable filter size
  • Calculates optimal bloom filter parameters based on expected URL count
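
A minimal sketch of this behaviour using the bits-and-blooms/bloom package (whether the project uses this exact library is an assumption): the filter is sized from an expected URL count and a target false-positive rate, which together determine the optimal bit-array size and hash count.

package main

import (
	"fmt"

	"github.com/bits-and-blooms/bloom/v3"
)

func main() {
	// Size the filter for an expected number of URLs and a target
	// false-positive rate.
	expectedURLs := uint(1_000_000)
	falsePositiveRate := 0.01
	filter := bloom.NewWithEstimates(expectedURLs, falsePositiveRate)

	seen := func(url string) bool {
		if filter.TestString(url) {
			return true // probably seen before (false positives are possible)
		}
		filter.AddString(url)
		return false // definitely new
	}

	fmt.Println(seen("https://example.com/")) // false: first visit
	fmt.Println(seen("https://example.com/")) // true: duplicate, skipped
}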

Metrics Service

The metrics service collects and exposes performance metrics:

  • Pages crawled
  • Request success/failure counts
  • Content size statistics
  • Error rates

Metrics are exposed in Prometheus-compatible format via an HTTP endpoint.
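
A minimal sketch of this setup with the Prometheus Go client is shown below. The metric names, histogram buckets, and listen address are illustrative assumptions, not the ones the crawler actually registers.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Counters and histograms of the kind the metrics service tracks.
var (
	pagesCrawled = promauto.NewCounter(prometheus.CounterOpts{
		Name: "crawler_pages_crawled_total",
		Help: "Number of pages successfully crawled.",
	})
	requestErrors = promauto.NewCounter(prometheus.CounterOpts{
		Name: "crawler_request_errors_total",
		Help: "Number of failed HTTP requests.",
	})
	contentBytes = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "crawler_content_bytes",
		Help:    "Size distribution of crawled page bodies.",
		Buckets: prometheus.ExponentialBuckets(1024, 4, 8),
	})
)

func main() {
	// Simulate a successful crawl of a 10 KiB page; requestErrors would be
	// incremented on failed requests in the real crawler.
	pagesCrawled.Inc()
	contentBytes.Observe(10 * 1024)

	// Expose everything registered above in Prometheus format.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}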

API Server

The API server provides HTTP endpoints for controlling and monitoring the crawler:

  • Start/stop crawling
  • View crawler status
  • Access metrics
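
A minimal sketch of such endpoints using the standard net/http package: the route names, port, and the single running flag standing in for the crawler manager are illustrative assumptions.

package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync/atomic"
)

// running stands in for the crawler manager's actual state.
var running atomic.Bool

func main() {
	mux := http.NewServeMux()

	mux.HandleFunc("/start", func(w http.ResponseWriter, r *http.Request) {
		running.Store(true)
		w.WriteHeader(http.StatusAccepted)
	})
	mux.HandleFunc("/stop", func(w http.ResponseWriter, r *http.Request) {
		running.Store(false)
		w.WriteHeader(http.StatusAccepted)
	})
	mux.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(map[string]bool{"running": running.Load()})
	})

	log.Fatal(http.ListenAndServe(":8080", mux))
}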

Data Flow

  1. The crawler starts with one or more seed URLs
  2. For each URL, the crawler uses the deduplication service to check whether the URL has been seen before
  3. If the URL is new, it sends an HTTP request to fetch the page
  4. Anti-bot measures are applied to avoid detection
  5. The page content is extracted and processed
  6. The data is stored in Elasticsearch
  7. Links are extracted from the page and added to the crawl queue
  8. Metrics are updated
  9. The process repeats until reaching configured limits (depth, max visits, etc.)
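
The sketch below strings these steps together into a single, framework-free loop using only the standard library. Content extraction, anti-bot handling, storage, and metrics are reduced to placeholders so the control flow stays visible; the real crawler delegates each step to the components described above.

package main

import (
	"fmt"
	"io"
	"net/http"
	"regexp"
)

var linkRe = regexp.MustCompile(`href="(https?://[^"]+)"`)

func crawl(seed string, maxVisits int) {
	queue := []string{seed}
	seen := map[string]bool{}

	for visits := 0; len(queue) > 0 && visits < maxVisits; {
		url := queue[0]
		queue = queue[1:]

		// Step 2: skip URLs that have already been seen (deduplication).
		if seen[url] {
			continue
		}
		seen[url] = true

		// Steps 3-4: fetch the page (anti-bot measures and delays go here).
		resp, err := http.Get(url)
		if err != nil {
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		visits++

		// Steps 5-6 and 8: extract, store, and record metrics (placeholder).
		fmt.Printf("stored %s (%d bytes)\n", url, len(body))

		// Step 7: extract links and add them to the crawl queue.
		for _, m := range linkRe.FindAllStringSubmatch(string(body), -1) {
			queue = append(queue, m[1])
		}
	}
}

func main() {
	crawl("https://example.com", 10)
}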

Concurrency Model

The crawler uses Go's concurrency primitives to efficiently handle multiple requests in parallel:

  • Each crawler instance runs in its own goroutine
  • Multiple HTTP requests are made concurrently within configurable limits
  • Mutex locks protect shared resources
  • Context cancellation is used for graceful shutdown
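
A minimal sketch of this model: a shared context for graceful shutdown, a buffered channel acting as a semaphore to cap concurrent requests, and a WaitGroup to wait for in-flight work. The concurrency limit and the simulated fetch function are illustrative assumptions.

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// fetch stands in for an HTTP request plus page processing.
func fetch(ctx context.Context, url string) {
	select {
	case <-time.After(200 * time.Millisecond):
		fmt.Println("fetched", url)
	case <-ctx.Done():
		fmt.Println("cancelled", url)
	}
}

func main() {
	// Cancelling this context shuts every in-flight worker down gracefully.
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	urls := []string{"https://example.com/a", "https://example.com/b",
		"https://example.com/c", "https://example.com/d"}

	// A buffered channel acts as a semaphore, capping concurrent requests.
	const maxConcurrent = 2
	sem := make(chan struct{}, maxConcurrent)

	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			fetch(ctx, u)
		}(u)
	}
	wg.Wait()
}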

This architecture allows the crawler to scale efficiently while maintaining control over resource usage and respecting website rate limits.