URL Deduplication

The Quillium-Crawler implements an efficient URL deduplication system to avoid crawling the same pages multiple times. This document explains how deduplication works and how it is configured.

Overview

Deduplication is essential for web crawlers to:

  • Prevent infinite loops in cyclical link structures
  • Avoid wasting resources on already visited pages
  • Reduce load on target websites
  • Improve crawling efficiency

The Quillium-Crawler uses a bloom filter for efficient URL deduplication.

Bloom Filter Implementation

A bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It's particularly well-suited for URL deduplication because:

  • It has constant-time lookups regardless of the number of elements
  • It uses significantly less memory than storing the complete URL list
  • It never produces false negatives (though it can produce false positives)
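
To make these properties concrete, here is a minimal, self-contained Go sketch of the underlying idea, separate from the crawler's actual implementation: adding an element sets k bit positions derived from its hash, and a membership check tests whether all k positions are set.

package main

import (
    "fmt"
    "hash/fnv"
)

// toyBloom is a deliberately simple bloom filter: a bit set plus
// k positions per element derived from a single FNV hash.
type toyBloom struct {
    bits []bool
    k    uint64
}

// positions derives k bit positions via double hashing: pos_i = h1 + i*h2.
func (b *toyBloom) positions(s string) []uint64 {
    h := fnv.New64a()
    h.Write([]byte(s))
    h1 := h.Sum64()
    h2 := h1>>33 | h1<<31 // cheap second hash from a rotation of the first
    pos := make([]uint64, b.k)
    for i := uint64(0); i < b.k; i++ {
        pos[i] = (h1 + i*h2) % uint64(len(b.bits))
    }
    return pos
}

// Add sets all k positions for the element.
func (b *toyBloom) Add(s string) {
    for _, p := range b.positions(s) {
        b.bits[p] = true
    }
}

// Contains reports whether all k positions are set. A clear bit means the
// element was definitely never added; all set means "probably added".
func (b *toyBloom) Contains(s string) bool {
    for _, p := range b.positions(s) {
        if !b.bits[p] {
            return false
        }
    }
    return true
}

func main() {
    f := &toyBloom{bits: make([]bool, 1<<16), k: 7}
    f.Add("https://example.com/a")
    fmt.Println(f.Contains("https://example.com/a")) // true
    fmt.Println(f.Contains("https://example.com/b")) // false (barring a false positive)
}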

How It Works

  1. When a URL is discovered, the crawler checks if it's in the bloom filter
  2. If it's not in the filter, the URL is added to the crawl queue and marked as seen
  3. If it's already in the filter, the URL is skipped

Configuration

The bloom filter is automatically configured based on the expected number of URLs to be crawled:

// Calculate optimal bloom filter size based on expected visits
expectedURLs := 10000 // Default minimum size
if options.MaxVisits != nil && *options.MaxVisits > 0 {
    expectedURLs = *options.MaxVisits * 10 // Estimate 10x the max visits as potential URLs
}

// Create bloom filter with 1% false positive rate
bloomSize := dedup.CalculateOptimalSize(expectedURLs, 0.01)
hashFuncs := dedup.CalculateOptimalHashFunctions(bloomSize, expectedURLs)

The crawler automatically:

  • Sets a minimum expected URL count of 10,000
  • Estimates the total number of URLs as 10 times the maximum visit count
  • Configures the bloom filter for a 1% false positive rate
  • Calculates the optimal filter size and number of hash functions
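
The two Calculate* helpers follow the standard bloom filter sizing formulas: for n expected elements and target false positive rate p, the optimal bit count is m = -n * ln(p) / (ln 2)^2 and the optimal number of hash functions is k = (m/n) * ln 2. A sketch of how these functions could be implemented along those lines (the actual dedup package may differ in rounding and bounds checking):

package dedup

import "math"

// CalculateOptimalSize returns the bit count m = -n*ln(p) / (ln 2)^2 for
// n expected elements and target false positive rate p.
func CalculateOptimalSize(expectedElements int, falsePositiveRate float64) uint64 {
    n := float64(expectedElements)
    m := -n * math.Log(falsePositiveRate) / (math.Ln2 * math.Ln2)
    return uint64(math.Ceil(m))
}

// CalculateOptimalHashFunctions returns k = (m/n)*ln 2, clamped to at least 1.
func CalculateOptimalHashFunctions(size uint64, expectedElements int) uint64 {
    k := float64(size) / float64(expectedElements) * math.Ln2
    if k < 1 {
        return 1
    }
    return uint64(math.Round(k))
}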

Implementation Details

Bloom Filter Structure

The bloom filter is implemented in the dedup package:

type BloomFilter struct {
    bits       []uint64     // bit array, packed 64 bits per word
    size       uint64       // total number of bits in the filter
    hashFuncs  uint64       // number of hash functions applied per element
    setBits    uint64       // count of bits currently set to 1
    mutex      sync.RWMutex // guards concurrent Add/Contains calls
}

Key Methods

  • NewBloomFilter: Creates a new bloom filter with the specified size and hash function count
  • Add: Adds a URL to the filter
  • Contains: Checks if a URL is in the filter
  • CalculateOptimalSize: Determines the optimal filter size based on expected elements and false positive rate
  • CalculateOptimalHashFunctions: Calculates the optimal number of hash functions
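
Taken together, a typical use of the dedup API looks like the following sketch (signatures inferred from the configuration snippet above; the package import and error handling are assumed and omitted):

// Sketch only: assumes the dedup package described above is imported.
expectedURLs := 250000
size := dedup.CalculateOptimalSize(expectedURLs, 0.01)
hashes := dedup.CalculateOptimalHashFunctions(size, expectedURLs)
filter := dedup.NewBloomFilter(size, hashes)

url := "https://example.com/page"
if !filter.Contains(url) {
    filter.Add(url)
    // enqueue url for crawling
}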

Integration with Crawler

The bloom filter is integrated into the crawler's link handling process:

func (c *Crawler) handleLink(e *colly.HTMLElement) {
    link := e.Attr("href")
    absURL := e.Request.AbsoluteURL(link)

    // Skip if URL doesn't match allowed patterns
    if !c.isAllowedURL(absURL) {
        return
    }

    // Skip URLs we've already seen (deduplication using bloom filter)
    if c.urlFilter.Contains(absURL) {
        return
    }

    // Mark URL as seen before visiting
    c.urlFilter.Add(absURL)

    c.Collector.Visit(absURL)
}
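
For context, handleLink is the kind of callback that would be registered with colly's OnHTML hook; the crawler's actual setup code may differ, but the wiring would look roughly like this:

// Register the link handler for every anchor element with an href attribute.
c.Collector.OnHTML("a[href]", c.handleLink)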

Performance Considerations

Memory Usage

The bloom filter is very memory-efficient. For example, with default settings:

  • A filter for 100,000 URLs uses approximately 120 KB of memory
  • A filter for 1,000,000 URLs uses approximately 1.2 MB of memory

This is significantly less than storing the complete list of URLs.
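
These figures follow directly from the sizing formula above: at a 1% false positive rate a bloom filter needs roughly 9.6 bits per element (-ln(0.01) / (ln 2)^2 ≈ 9.59), so 100,000 URLs need about 960,000 bits ≈ 120 KB, and 1,000,000 URLs about 9.6 million bits ≈ 1.2 MB.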

False Positives

Bloom filters can produce false positives (incorrectly indicating a URL has been seen). The crawler configures the filter for a 1% false positive rate, which means:

  • About 1 in 100 new URLs might be incorrectly identified as duplicates
  • This is an acceptable trade-off for the memory efficiency gained
  • The false positive rate increases as the filter fills up

No Removal Support

Standard bloom filters don't support element removal. This is acceptable for the crawler because:

  • URLs are only added, never removed during a crawl session
  • The filter is recreated for each new crawl session

Best Practices

  1. Set Appropriate Max Visits: Configure CRAWLER_MAX_VISITS based on your expected crawl size to optimize the bloom filter (see the sizing example after this list)
  2. Monitor Memory Usage: For very large crawls, be aware of the memory used by the bloom filter
  3. Consider False Positives: If complete coverage is critical, be aware of the small false positive rate
  4. Restart for Fresh Crawls: For completely fresh crawls of the same site, restart the crawler to reset the bloom filter
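
As a sizing example for the first practice above: assuming CRAWLER_MAX_VISITS is set to 50000, the crawler would estimate roughly 500,000 potential URLs (10x the visit limit), and at a 1% false positive rate the formulas above give a filter of roughly 4.8 million bits, or about 600 KB.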