Web Crawler Configuration
DeepSearcher supports various web crawlers to collect data from websites for processing and indexing.
📝 Basic Configuration
config.set_provider_config("web_crawler", "(WebCrawlerName)", "(Arguments dict)")
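The call takes three values: the provider category `"web_crawler"`, the crawler name, and a dictionary of arguments (an actual dict, even though the placeholder above is shown in quotes). As a minimal sketch, the helper below checks those values against the crawler names listed on this page before they are passed on. `web_crawler_args` and `KNOWN_CRAWLERS` are hypothetical names used for illustration and are not part of DeepSearcher.

```python
# Crawler names as listed on this page. The helper itself is illustrative
# and is not part of DeepSearcher.
KNOWN_CRAWLERS = {
    "FireCrawlCrawler",
    "Crawl4AICrawler",
    "JinaCrawler",
    "DoclingCrawler",
}


def web_crawler_args(name, arguments=None):
    """Return the (category, name, arguments) triple for set_provider_config."""
    if name not in KNOWN_CRAWLERS:
        raise ValueError(f"Unknown web crawler: {name!r}")
    if arguments is None:
        arguments = {}
    if not isinstance(arguments, dict):
        raise TypeError("arguments must be a dict, for example {}")
    return ("web_crawler", name, arguments)


print(web_crawler_args("JinaCrawler"))
```

With a `config` object in hand, the result could be unpacked as `config.set_provider_config(*web_crawler_args("JinaCrawler"))`.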
📋 Available Web Crawlers
| Crawler | Description | Key Feature |
|---|---|---|
| FireCrawlCrawler | Cloud-based web crawling service | Simple API, managed service |
| Crawl4AICrawler | Browser automation crawler | Full JavaScript support |
| JinaCrawler | Content extraction service | High accuracy parsing |
| DoclingCrawler | Doc processing with crawling | Multiple format support |
🔍 Web Crawler Options
FireCrawl
FireCrawl is a cloud-based web crawling service designed for AI applications.
Key features:
- Simple API
- Managed Service
- Advanced Parsing
config.set_provider_config("web_crawler", "FireCrawlCrawler", {})
Setup Instructions
- Sign up for FireCrawl and get an API key
- Set the API key as an environment variable:
export FIRECRAWL_API_KEY="your_api_key"
- For more information, see the FireCrawl documentation
Crawl4AI
Crawl4AI is a Python package for web crawling with browser automation capabilities.
config.set_provider_config("web_crawler", "Crawl4AICrawler", {"browser_config": {"headless": True, "verbose": True}})
Setup Instructions
- Install Crawl4AI:
pip install crawl4ai
- Run the setup command:
crawl4ai-setup
- For more information, see the Crawl4AI documentation
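The arguments dict shown above nests its browser options under `browser_config`. The sketch below builds that dict from a single flag, on the assumption that a visible browser window and extra logging are mostly wanted while debugging locally. `crawl4ai_arguments` is a hypothetical helper, and only the two keys that appear on this page (`headless`, `verbose`) are used.

```python
def crawl4ai_arguments(debug=False):
    """Build the arguments dict for Crawl4AICrawler.

    Only the two browser_config keys shown on this page are used.
    """
    return {
        "browser_config": {
            # Run without a visible browser window unless debugging.
            "headless": not debug,
            # Extra logging is mainly useful while debugging.
            "verbose": debug,
        }
    }


print(crawl4ai_arguments())
print(crawl4ai_arguments(debug=True))
```

The returned dict can be passed as the third argument, for example `config.set_provider_config("web_crawler", "Crawl4AICrawler", crawl4ai_arguments())`.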
Jina Reader
Jina Reader is a service for extracting content from web pages with high accuracy.
config.set_provider_config("web_crawler", "JinaCrawler", {})
Setup Instructions
- Get a Jina API key
- Set the API key as an environment variable:
export JINA_API_TOKEN="your_api_key" # or export JINAAI_API_KEY="your_api_key"
- For more information, see the Jina Reader documentation
Docling Crawler
Docling provides web crawling capabilities alongside its document processing features.
config.set_provider_config("web_crawler", "DoclingCrawler", {})
Setup Instructions
- Install Docling:
pip install docling
- For information on supported formats, see the Docling documentation
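The setup steps for the four crawlers can be summarized as a small pre-flight check. This is a sketch under stated assumptions: the environment variable names and pip packages are the ones given on this page, the import names (`crawl4ai`, `docling`) are assumed to match the pip package names, and `PREREQUISITES` and `missing_prerequisites` are hypothetical names that are not part of DeepSearcher.

```python
import importlib.util
import os

# Prerequisites as described in the setup instructions on this page.
# "env_any": at least one of the listed variables must be set.
# "module": import name, assumed to match the pip package name.
PREREQUISITES = {
    "FireCrawlCrawler": {"env_any": ("FIRECRAWL_API_KEY",), "module": None},
    "Crawl4AICrawler": {"env_any": (), "module": "crawl4ai"},
    "JinaCrawler": {"env_any": ("JINA_API_TOKEN", "JINAAI_API_KEY"), "module": None},
    "DoclingCrawler": {"env_any": (), "module": "docling"},
}


def missing_prerequisites(name, env=None):
    """Return a list of readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    spec = PREREQUISITES[name]
    problems = []
    if spec["env_any"] and not any(env.get(var) for var in spec["env_any"]):
        problems.append("set " + " or ".join(spec["env_any"]))
    if spec["module"] and importlib.util.find_spec(spec["module"]) is None:
        problems.append("pip install " + spec["module"])
    return problems


# Report the status of every crawler in the current environment.
for crawler in PREREQUISITES:
    print(crawler, missing_prerequisites(crawler) or "ready")
```

Running the check before calling `config.set_provider_config` surfaces a missing API key or package early, rather than at the first crawl.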