FireCrawl Integration Example

This example demonstrates how to use FireCrawl with DeepSearcher to crawl and extract content from websites.

Overview

FireCrawl is a specialized web crawling service designed for AI applications. This example shows:

Setting up FireCrawl with DeepSearcher
Configuring API keys for the service
Crawling a website and extracting content
Querying the extracted content

Code Example

import logging
import os
from deepsearcher.offline_loading import load_from_website
from deepsearcher.online_query import query
from deepsearcher.configuration import Configuration, init_config

# Suppress unnecessary logging from third-party libraries
logging.getLogger("httpx").setLevel(logging.WARNING)

# Set API keys (ensure these are set securely in real applications)
os.environ['OPENAI_API_KEY'] = 'sk-***************'
os.environ['FIRECRAWL_API_KEY'] = 'fc-***************'


def main():
    # Step 1: Initialize configuration
    config = Configuration()

    # Set up Vector Database (Milvus) and Web Crawler (FireCrawlCrawler)
    config.set_provider_config("vector_db", "Milvus", {})
    config.set_provider_config("web_crawler", "FireCrawlCrawler", {})

    # Apply the configuration
    init_config(config)

    # Step 2: Load data from a website into Milvus
    website_url = "https://example.com"  # Replace with your target website
    collection_name = "FireCrawl"
    collection_description = "All Milvus Documents"

    # Crawl a single web page
    load_from_website(urls=website_url, collection_name=collection_name, collection_description=collection_description)
    # FireCrawl only: crawl multiple pages by setting max_depth, limit, and allow_backward_links
    # load_from_website(urls=website_url, max_depth=2, limit=20, allow_backward_links=True, collection_name=collection_name, collection_description=collection_description)

    # Step 3: Query the loaded data
    question = "What is Milvus?"  # Replace with your actual question
    result = query(question)


if __name__ == "__main__":
    main()

Running the Example

Install DeepSearcher: pip install deepsearcher
Sign up for a FireCrawl API key at firecrawl.dev
Replace the placeholder API keys with your actual keys
Change the website_url to the website you want to crawl
Run the script: python load_website_using_firecrawl.py
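
Hardcoding keys in the script is convenient for a demo but easy to leak. As a minimal sketch, assuming you export OPENAI_API_KEY and FIRECRAWL_API_KEY in your shell instead of assigning them in code, a small check can fail fast before any crawling starts. The helpers missing_keys and require_keys are hypothetical names for illustration and are not part of DeepSearcher:

```python
import os

# Environment variables this example relies on
REQUIRED_KEYS = ("OPENAI_API_KEY", "FIRECRAWL_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required keys that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


def require_keys(env=None):
    """Raise a clear error if any required key is missing."""
    missing = missing_keys(env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling require_keys() at the top of main() would replace the two os.environ assignments in the example.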

Advanced Crawling Options

When FireCrawl is the configured crawler, load_from_website accepts several options for crawling more than one page:

max_depth: Control how many links deep the crawler should go
limit: Set a maximum number of pages to crawl
allow_backward_links: Allow the crawler to navigate to parent/sibling pages
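
To make the interaction of these three options concrete, here is a toy breadth-first crawl over a hypothetical in-memory link graph. It is a simplified illustration only: the function toy_crawl and the SITE data are made up for this sketch, no network requests are made, and FireCrawl's own definitions of depth and backward links may differ in detail.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical site: each URL maps to the links found on that page.
SITE = {
    "https://example.com/docs/": ["https://example.com/docs/a", "https://example.com/about"],
    "https://example.com/docs/a": ["https://example.com/docs/a/b"],
    "https://example.com/docs/a/b": [],
    "https://example.com/about": ["https://example.com/docs/"],
}


def toy_crawl(start, links, max_depth=1, limit=10, allow_backward_links=False):
    """Breadth-first crawl over `links`, bounded by link depth and page count."""
    start_path = urlparse(start).path
    visited = [start]
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links beyond max_depth
        for nxt in links.get(url, []):
            if len(visited) >= limit:
                return visited  # page budget exhausted
            if nxt in visited:
                continue
            # In this sketch, a "backward" link is one outside the start URL's path
            is_below_start = urlparse(nxt).path.startswith(start_path)
            if not is_below_start and not allow_backward_links:
                continue
            visited.append(nxt)
            queue.append((nxt, depth + 1))
    return visited
```

With max_depth=0 only the start page is returned; raising max_depth reaches pages further down the link chain, limit caps the total, and allow_backward_links=True lets the crawl reach /about, which lies outside /docs/.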

Key Concepts

Web Crawling: Extracting content from websites
Depth Control: Managing how deep the crawler navigates
URL Processing: Handling multiple pages from a single starting point
Vector Storage: Storing the crawled content in a vector database for search
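
The vector storage idea can be sketched without any external service. The example below is a stand-in, not how DeepSearcher or Milvus work internally: it assumes a bag-of-words count vector in place of a real embedding model and a plain Python list in place of a Milvus collection. ToyVectorStore, embed, and cosine are hypothetical names used only here.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: a bag-of-words count vector (not a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Keeps (vector, chunk) pairs in memory and ranks them by similarity."""

    def __init__(self):
        self.rows = []

    def insert(self, chunk):
        self.rows.append((embed(chunk), chunk))

    def search(self, question, top_k=1):
        q = embed(question)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [chunk for _, chunk in ranked[:top_k]]
```

Crawled pages play the role of the inserted chunks, and the question passed to query plays the role of the search text.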