# Elasticsearch Integration

The Quillium-Crawler integrates with Elasticsearch to store and index crawled data. This document explains how the integration works and how to configure it.

## Overview
Elasticsearch is used as the primary storage backend for the crawler. It provides:
- Efficient storage of crawled page data
- Full-text search capabilities
- Scalability for large datasets
- Document versioning for updated content
## Configuration

Elasticsearch integration is configured through environment variables:

```bash
CRAWLER_ELASTICSEARCH_ADDRESSES=http://elasticsearch:9200
CRAWLER_ELASTICSEARCH_USERNAME=elastic
CRAWLER_ELASTICSEARCH_PASSWORD=changeme
CRAWLER_INDEX_NAME=crawled_data
CRAWLER_ENABLE_FULL_CONTENT=false
```

- `CRAWLER_ELASTICSEARCH_ADDRESSES`: Comma-separated list of Elasticsearch endpoints
- `CRAWLER_ELASTICSEARCH_USERNAME`: Elasticsearch username for authentication
- `CRAWLER_ELASTICSEARCH_PASSWORD`: Elasticsearch password for authentication
- `CRAWLER_INDEX_NAME`: Name of the Elasticsearch index that stores crawled data
- `CRAWLER_ENABLE_FULL_CONTENT`: Whether to store the full HTML content of pages (`true`/`false`)
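As a rough illustration, the crawler might read these variables along the following lines. This is a sketch: the `crawlerConfig` type and `loadConfigFromEnv` helper are hypothetical names, not the crawler's actual code.

```go
package crawler

import (
	"os"
	"strings"
)

// crawlerConfig is a hypothetical container for the variables above;
// the crawler's real configuration type may be shaped differently.
type crawlerConfig struct {
	Addresses         []string
	Username          string
	Password          string
	IndexName         string
	EnableFullContent bool
}

func loadConfigFromEnv() crawlerConfig {
	return crawlerConfig{
		// Comma-separated, e.g. "http://es1:9200,http://es2:9200".
		Addresses:         strings.Split(os.Getenv("CRAWLER_ELASTICSEARCH_ADDRESSES"), ","),
		Username:          os.Getenv("CRAWLER_ELASTICSEARCH_USERNAME"),
		Password:          os.Getenv("CRAWLER_ELASTICSEARCH_PASSWORD"),
		IndexName:         os.Getenv("CRAWLER_INDEX_NAME"),
		EnableFullContent: os.Getenv("CRAWLER_ENABLE_FULL_CONTENT") == "true",
	}
}
```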
## Connection Initialization
The crawler initializes the Elasticsearch client during startup:
- It creates a client using the provided addresses, username, and password
- It implements a retry mechanism with exponential backoff to wait for Elasticsearch to be ready
- It creates the index if it doesn't exist, with appropriate mappings
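A minimal sketch of this startup sequence, assuming the official `go-elasticsearch` client; the retry count and backoff values here are illustrative, not the crawler's actual settings:

```go
package crawler

import (
	"fmt"
	"time"

	"github.com/elastic/go-elasticsearch/v8"
)

// newClientWithRetry builds a client from the configured addresses and
// credentials, then pings the cluster with exponential backoff until it
// responds or the retries are exhausted.
func newClientWithRetry(addresses []string, username, password string) (*elasticsearch.Client, error) {
	es, err := elasticsearch.NewClient(elasticsearch.Config{
		Addresses: addresses,
		Username:  username,
		Password:  password,
	})
	if err != nil {
		return nil, err
	}

	backoff := time.Second
	for attempt := 1; attempt <= 6; attempt++ {
		res, err := es.Info() // cheap "is the cluster up?" request
		if err == nil {
			ready := !res.IsError()
			res.Body.Close()
			if ready {
				return es, nil
			}
		}
		time.Sleep(backoff)
		backoff *= 2 // 1s, 2s, 4s, ...
	}
	return nil, fmt.Errorf("elasticsearch not reachable after retries")
}
```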
## Index Mapping

The crawler creates an Elasticsearch index with the following mapping:

```json
{
  "mappings": {
    "properties": {
      "url": { "type": "keyword" },
      "title": { "type": "text", "analyzer": "standard" },
      "snippet": { "type": "text", "analyzer": "standard" },
      "full_content": { "type": "text", "analyzer": "standard" },
      "created_at": { "type": "date" },
      "updated_at": { "type": "date" }
    }
  }
}
```
This mapping provides:
- Exact matching on URLs using the `keyword` type
- Full-text search on titles, snippets, and content
- Timestamp tracking for creation and updates
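Creating the index with this mapping from Go could look roughly like the sketch below, again assuming the official client; `ensureIndex` is an illustrative stand-in for the `InitializeIndex` method described later in this document.

```go
package crawler

import (
	"fmt"
	"strings"

	"github.com/elastic/go-elasticsearch/v8"
)

// indexMapping mirrors the JSON mapping shown above.
const indexMapping = `{
  "mappings": {
    "properties": {
      "url":          { "type": "keyword" },
      "title":        { "type": "text", "analyzer": "standard" },
      "snippet":      { "type": "text", "analyzer": "standard" },
      "full_content": { "type": "text", "analyzer": "standard" },
      "created_at":   { "type": "date" },
      "updated_at":   { "type": "date" }
    }
  }
}`

// ensureIndex creates the index with the mapping above unless it already exists.
func ensureIndex(es *elasticsearch.Client, index string) error {
	res, err := es.Indices.Exists([]string{index})
	if err != nil {
		return err
	}
	res.Body.Close()
	if res.StatusCode == 200 {
		return nil // index already exists, nothing to do
	}

	res, err = es.Indices.Create(index, es.Indices.Create.WithBody(strings.NewReader(indexMapping)))
	if err != nil {
		return err
	}
	defer res.Body.Close()
	if res.IsError() {
		return fmt.Errorf("index creation failed: %s", res.String())
	}
	return nil
}
```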
## Data Storage
The crawler stores the following data for each crawled page:
- URL: The full URL of the page
- Title: The page title, extracted from the `<title>` tag or the first `<h1>`
- Snippet: A short description from the meta description or first paragraph
- Full Content: (Optional) The complete HTML content of the page
- Created At: Timestamp when the page was first crawled
- Updated At: Timestamp when the page was last updated
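In Go terms, a stored document might be modelled like this. The JSON field names match the index mapping above; the type name itself is illustrative.

```go
package crawler

import "time"

// pageDocument models one crawled page as stored in the index.
type pageDocument struct {
	URL         string    `json:"url"`
	Title       string    `json:"title"`
	Snippet     string    `json:"snippet"`
	FullContent string    `json:"full_content,omitempty"` // only populated when CRAWLER_ENABLE_FULL_CONTENT=true
	CreatedAt   time.Time `json:"created_at"`
	UpdatedAt   time.Time `json:"updated_at"`
}
```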
## Document IDs
The crawler uses MD5 hashes of URLs as document IDs. This provides:
- Consistent IDs for the same URL
- No issues with special characters in URLs
- Efficient document lookups
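Deriving such an ID is a one-liner; a sketch, with `documentID` as an illustrative name:

```go
package crawler

import (
	"crypto/md5"
	"encoding/hex"
)

// documentID derives a stable Elasticsearch document ID from a URL: the
// same URL always hashes to the same ID, and the hex digest sidesteps
// characters that would be awkward in a document ID.
func documentID(url string) string {
	sum := md5.Sum([]byte(url))
	return hex.EncodeToString(sum[:])
}
```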
## Update Handling
When a page is crawled multiple times, the crawler:
- Checks if the document already exists using the URL hash
- If it exists, preserves the original `created_at` timestamp
- Updates the `updated_at` timestamp
- Merges the new content with the existing data
This approach maintains a history of when pages were first discovered while keeping content up to date.
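Put together, the update flow could look like the sketch below, which reuses the `pageDocument` type and `documentID` helper from the earlier sketches. Error handling is abbreviated, and the crawler's actual `SavePage` may differ in detail.

```go
package crawler

import (
	"bytes"
	"context"
	"encoding/json"
	"time"

	"github.com/elastic/go-elasticsearch/v8"
)

// savePage looks up the document by its URL hash, keeps the original
// created_at if the document exists, stamps updated_at, and writes the
// merged document back.
func savePage(ctx context.Context, es *elasticsearch.Client, index string, doc pageDocument) error {
	id := documentID(doc.URL)
	now := time.Now().UTC()
	doc.CreatedAt, doc.UpdatedAt = now, now

	// If the document already exists, preserve its created_at timestamp.
	if res, err := es.Get(index, id, es.Get.WithContext(ctx)); err == nil {
		if !res.IsError() {
			var existing struct {
				Source pageDocument `json:"_source"`
			}
			if json.NewDecoder(res.Body).Decode(&existing) == nil {
				doc.CreatedAt = existing.Source.CreatedAt
			}
		}
		res.Body.Close()
	}

	body, err := json.Marshal(doc)
	if err != nil {
		return err
	}
	// refresh=true makes the document searchable immediately
	// (see Performance Considerations below).
	res, err := es.Index(index, bytes.NewReader(body),
		es.Index.WithDocumentID(id),
		es.Index.WithRefresh("true"),
		es.Index.WithContext(ctx),
	)
	if err != nil {
		return err
	}
	defer res.Body.Close()
	return nil
}
```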
## Implementation Details

### ESStorage Struct

The `ESStorage` struct in the `elasticsearch` package provides the interface between the crawler and Elasticsearch:

```go
type ESStorage struct {
	es    *elasticsearch.Client
	index string
}
```
### Key Methods

- `NewESStorage`: Creates a new storage instance with the given client and index name
- `InitializeIndex`: Creates the index with appropriate mappings if it doesn't exist
- `SavePage`: Stores a page in Elasticsearch, handling new documents and updates
- `GetPage`: Retrieves a page by URL
- `StorePageData`: Internal method that handles the details of storing and updating documents
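As an illustration, a `GetPage`-style lookup built on the MD5 document IDs might look like this, reusing `pageDocument` and `documentID` from the earlier sketches; the real method's signature may differ.

```go
package crawler

import (
	"context"
	"encoding/json"
	"fmt"

	"github.com/elastic/go-elasticsearch/v8"
)

// getPage fetches the stored document for a URL by hashing the URL to
// its document ID and decoding the _source field of the response.
func getPage(ctx context.Context, es *elasticsearch.Client, index, url string) (*pageDocument, error) {
	res, err := es.Get(index, documentID(url), es.Get.WithContext(ctx))
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()
	if res.IsError() {
		return nil, fmt.Errorf("get %s: %s", url, res.Status())
	}
	var hit struct {
		Source pageDocument `json:"_source"`
	}
	if err := json.NewDecoder(res.Body).Decode(&hit); err != nil {
		return nil, err
	}
	return &hit.Source, nil
}
```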
## Using the Stored Data
Once data is stored in Elasticsearch, you can:
- Use the Elasticsearch API to search and retrieve crawled pages
- Build custom applications that query the data
- Use Kibana for visualization and exploration
- Implement advanced text analysis and processing
## Example Queries

### Search for Pages Containing a Term

```
GET /crawled_data/_search
{
  "query": {
    "multi_match": {
      "query": "search term",
      "fields": ["title", "snippet", "full_content"]
    }
  }
}
```
### Get Recently Crawled Pages

```
GET /crawled_data/_search
{
  "sort": [
    { "updated_at": { "order": "desc" } }
  ],
  "size": 10
}
```
## Performance Considerations
- Full Content Storage: Storing full HTML content increases storage requirements but enables more comprehensive search
- Index Refresh: The crawler uses `refresh=true` to make documents immediately searchable
- Bulk Operations: For high-volume crawling, consider implementing bulk indexing (see the sketch after this list)
- Scaling: For large datasets, configure Elasticsearch with appropriate resources and sharding
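For the bulk-operations point above, the official client ships a `BulkIndexer` helper in its `esutil` package. A sketch, with illustrative worker and flush settings:

```go
package crawler

import (
	"context"
	"log"
	"strings"

	"github.com/elastic/go-elasticsearch/v8"
	"github.com/elastic/go-elasticsearch/v8/esutil"
)

// bulkIndex batches writes through the client's BulkIndexer instead of
// issuing one Index request per page. docs maps document IDs to their
// JSON bodies.
func bulkIndex(ctx context.Context, es *elasticsearch.Client, index string, docs map[string]string) error {
	bi, err := esutil.NewBulkIndexer(esutil.BulkIndexerConfig{
		Client:     es,
		Index:      index,
		NumWorkers: 4,       // parallel flush workers (illustrative)
		FlushBytes: 1 << 20, // flush roughly every 1 MB (illustrative)
	})
	if err != nil {
		return err
	}
	for id, body := range docs {
		if err := bi.Add(ctx, esutil.BulkIndexerItem{
			Action:     "index",
			DocumentID: id,
			Body:       strings.NewReader(body),
		}); err != nil {
			return err
		}
	}
	if err := bi.Close(ctx); err != nil { // flush any remaining items
		return err
	}
	stats := bi.Stats()
	log.Printf("indexed %d documents, %d failed", stats.NumIndexed, stats.NumFailed)
	return nil
}
```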