
Web Scraping Pipelines for Agent Knowledge: Crawling, Extracting, and Indexing Content

Build a production web scraping pipeline using Scrapy and Playwright that crawls websites, extracts structured content, deduplicates pages, and indexes knowledge for AI agent consumption.

Why Agents Need Web Scraping Pipelines

AI agents are only as useful as the knowledge they can access. Static document uploads cover internal knowledge, but many agent use cases demand fresh, continuously updated information from the open web — competitor pricing, regulatory updates, product documentation, forum discussions, and news.

A production scraping pipeline goes well beyond a simple requests.get() loop. It needs to handle JavaScript-rendered pages, respect rate limits and robots.txt, extract meaningful content from noisy HTML, deduplicate across crawls, and schedule recurring updates without manual intervention.

Architecture Overview

A robust scraping pipeline has four stages: crawling (fetching pages), extraction (pulling structured content from HTML), deduplication (avoiding redundant processing), and indexing (storing content for agent retrieval). Each stage runs independently so failures in one do not block the others.
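Keeping the stages decoupled is easiest when each one is a plain callable behind a small orchestration loop. A minimal sketch of that wiring (function names are hypothetical; the concrete stages are built in the sections below):

```python
from typing import Callable, Iterable, Optional

def run_pipeline(
    crawl: Callable[[], Iterable[dict]],
    extract: Callable[[dict], Optional[dict]],
    is_duplicate: Callable[[dict], bool],
    index: Callable[[dict], None],
) -> int:
    """Wire the four stages; one bad page never aborts the run."""
    indexed = 0
    for raw in crawl():
        try:
            page = extract(raw)
            if page is None or is_duplicate(page):
                continue
            index(page)
            indexed += 1
        except Exception:
            # In production, log and continue; stages stay independent
            continue
    return indexed
```

Because each stage is injected, you can swap the crawler for a file of cached responses in tests, or replace the indexer without touching extraction.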

Building the Crawler with Scrapy

Scrapy provides the crawling framework with built-in concurrency, politeness controls, and middleware support. For JavaScript-heavy sites, integrate Playwright as a download handler.

import scrapy
from urllib.parse import urlparse
from datetime import datetime, timezone

class KnowledgeCrawler(scrapy.Spider):
    name = "knowledge_crawler"
    custom_settings = {
        "CONCURRENT_REQUESTS": 4,
        "DOWNLOAD_DELAY": 2,
        "ROBOTSTXT_OBEY": True,
        "DEPTH_LIMIT": 3,
        "CLOSESPIDER_PAGECOUNT": 500,
        "HTTPCACHE_ENABLED": True,
        "HTTPCACHE_EXPIRATION_SECS": 86400,
    }

    def __init__(self, start_urls="", allowed_domains="", **kwargs):
        super().__init__(**kwargs)
        # Scrapy passes -a arguments as strings, so accept either
        # comma-separated values or plain lists
        self.start_urls = (
            start_urls.split(",") if isinstance(start_urls, str)
            else start_urls
        )
        self.allowed_domains = (
            allowed_domains.split(",") if isinstance(allowed_domains, str)
            else allowed_domains
        )

    def parse(self, response):
        # Skip non-HTML responses (PDFs, images, feeds)
        content_type = response.headers.get(
            "Content-Type", b""
        ).decode()
        if "text/html" not in content_type:
            return

        yield {
            "url": response.url,
            "html": response.text,
            "status": response.status,
            "crawled_at": datetime.now(timezone.utc).isoformat(),
            "domain": urlparse(response.url).netloc,
        }

        # Follow internal links; OffsiteMiddleware filters anything
        # outside allowed_domains
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

The HTTPCACHE_ENABLED setting is critical — it prevents re-downloading pages that have not changed between crawl runs, saving bandwidth and respecting the target server.


Content Extraction

Raw HTML is useless for agents. The extraction stage strips navigation, ads, and boilerplate to isolate the main content.

from bs4 import BeautifulSoup
from dataclasses import dataclass
from typing import List, Optional
import hashlib

@dataclass
class ExtractedPage:
    url: str
    title: str
    content: str
    headings: List[str]
    content_hash: str
    word_count: int
    crawled_at: str

class ContentExtractor:
    NOISE_TAGS = [
        "script", "style", "nav", "footer",
        "header", "aside", "iframe", "form",
    ]
    NOISE_CLASSES = [
        "sidebar", "menu", "nav", "footer",
        "advertisement", "cookie", "popup",
    ]

    def extract(self, raw: dict) -> Optional[ExtractedPage]:
        soup = BeautifulSoup(raw["html"], "html.parser")

        # Remove noise elements
        for tag in self.NOISE_TAGS:
            for el in soup.find_all(tag):
                el.decompose()
        for cls in self.NOISE_CLASSES:
            for el in soup.find_all(class_=lambda c: c and cls in c.lower()):
                el.decompose()

        # Extract main content
        main = (
            soup.find("main")
            or soup.find("article")
            or soup.find("div", role="main")
            or soup.find("body")
        )
        if not main:
            return None

        text = main.get_text(separator="\n", strip=True)
        if len(text.split()) < 50:
            return None  # skip thin pages

        # soup.title.string is None on an empty <title> tag
        title = (soup.title.string or "") if soup.title else ""
        headings = [
            h.get_text(strip=True)
            for h in main.find_all(["h1", "h2", "h3"])
        ]
        content_hash = hashlib.sha256(text.encode()).hexdigest()

        return ExtractedPage(
            url=raw["url"],
            title=title.strip(),
            content=text,
            headings=headings,
            content_hash=content_hash,
            word_count=len(text.split()),
            crawled_at=raw["crawled_at"],
        )

Deduplication Across Crawls

Agents should not have duplicate information in their knowledge base. Content hashing catches exact duplicates, but near-duplicates (the same article under different URLs, or pages that differ only in boilerplate) require similarity hashing such as SimHash or MinHash.

from datasketch import MinHash, MinHashLSH

class Deduplicator:
    def __init__(self, threshold: float = 0.85):
        self.lsh = MinHashLSH(threshold=threshold, num_perm=128)
        self.seen_hashes = set()
        self.seen_urls = set()

    def is_duplicate(self, page: ExtractedPage) -> bool:
        # Exact-duplicate check on the content hash
        if page.content_hash in self.seen_hashes:
            return True
        self.seen_hashes.add(page.content_hash)

        # Near-duplicate check: Jaccard similarity over the page's
        # word set, approximated with MinHash
        mh = MinHash(num_perm=128)
        for word in page.content.lower().split():
            mh.update(word.encode("utf-8"))

        if self.lsh.query(mh):
            return True
        # Guard against re-inserting a URL seen in an earlier crawl,
        # which would raise a ValueError inside MinHashLSH
        if page.url not in self.seen_urls:
            self.seen_urls.add(page.url)
            self.lsh.insert(page.url, mh)
        return False
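The fourth stage, indexing, typically splits extracted text into overlapping chunks before embedding, so retrieval returns focused passages rather than whole pages. A minimal sketch (window sizes are illustrative, not tuned):

```python
from typing import List

def chunk_for_index(
    content: str, max_words: int = 200, overlap: int = 40
) -> List[str]:
    """Split extracted text into overlapping word windows for embedding.

    Overlap keeps sentences that straddle a boundary retrievable
    from both neighboring chunks.
    """
    words = content.split()
    if len(words) <= max_words:
        return [content] if words else []
    chunks, step = [], max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk would then be embedded and stored alongside its source URL and `content_hash`, so stale chunks can be replaced when a page changes between crawls.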

Scheduling Recurring Crawls

Use a simple scheduler to re-crawl sources on different frequencies based on how often they update.

from apscheduler.schedulers.asyncio import AsyncIOScheduler

# run_crawl is your async entry point that launches the spider and
# downstream stages for the given start URLs
scheduler = AsyncIOScheduler()

# News sites: crawl every 6 hours
scheduler.add_job(
    run_crawl, "interval", hours=6,
    args=[["https://news.example.com"]],
    id="news_crawl",
)

# Documentation: crawl daily
scheduler.add_job(
    run_crawl, "interval", hours=24,
    args=[["https://docs.example.com"]],
    id="docs_crawl",
)

scheduler.start()
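As the number of sources grows, the per-job registrations above are easier to manage as a config table. A sketch of that pattern (the helper and table names are hypothetical):

```python
# source id -> (start URLs, crawl interval in hours)
CRAWL_SOURCES = {
    "news_crawl": (["https://news.example.com"], 6),
    "docs_crawl": (["https://docs.example.com"], 24),
}

def register_crawls(scheduler, run_crawl, sources=CRAWL_SOURCES):
    """Register one interval job per source on an APScheduler-style
    scheduler; replace_existing makes re-registration idempotent."""
    for job_id, (urls, hours) in sources.items():
        scheduler.add_job(
            run_crawl, "interval", hours=hours,
            args=[urls], id=job_id, replace_existing=True,
        )
```

With this in place, adding a source is a one-line config change rather than a new `add_job` call.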

FAQ

How do I handle JavaScript-rendered pages that Scrapy cannot parse?

Install scrapy-playwright and set the DOWNLOAD_HANDLERS to use Playwright for specific domains. Add meta={"playwright": True} to requests targeting JS-heavy sites. This launches a headless browser for those pages while keeping standard HTTP requests for everything else, balancing speed and completeness.
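Concretely, the setup looks like this. The handler and reactor values come from scrapy-playwright's documented configuration; the `JS_DOMAINS` allowlist and `request_meta` helper are illustrative:

```python
from urllib.parse import urlparse

# settings.py -- route downloads through Playwright; the asyncio
# reactor is required by the scrapy-playwright handler
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

JS_DOMAINS = {"app.example.com"}  # hypothetical JS-heavy hosts

def request_meta(url: str) -> dict:
    """Only requests to JS-heavy domains pay the headless-browser cost."""
    if urlparse(url).netloc in JS_DOMAINS:
        return {"playwright": True}
    return {}
```

In the spider, pass `meta=request_meta(url)` when building each request; everything outside `JS_DOMAINS` continues to use plain HTTP downloads.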

How do I respect robots.txt and avoid getting blocked?

Scrapy respects robots.txt by default with ROBOTSTXT_OBEY: True. Beyond that, set a DOWNLOAD_DELAY of at least 2 seconds, rotate user agents, limit concurrent requests per domain, and add your contact info to the user agent string so site owners can reach you if needed.
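A politeness baseline along those lines, using standard Scrapy settings (the contact address is a placeholder):

```python
# Standard Scrapy settings for a polite crawler
POLITENESS_SETTINGS = {
    "ROBOTSTXT_OBEY": True,
    "DOWNLOAD_DELAY": 2,                    # seconds between requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
    "AUTOTHROTTLE_ENABLED": True,           # back off when the server slows
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
    "USER_AGENT": "knowledge-crawler/1.0 (+mailto:crawler-ops@example.com)",
}
```

AutoThrottle adjusts the delay dynamically based on observed response latency, which is gentler on small sites than a fixed delay alone.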

Should I store raw HTML or just extracted text?

Store both. Raw HTML goes into object storage (S3 or local disk) as an archive, while extracted text goes into your vector database for retrieval. Keeping raw HTML lets you re-extract content when your extraction logic improves without re-crawling everything.


#WebScraping #DataPipelines #KnowledgeBase #Scrapy #Playwright #AgenticAI #LearnAI #AIEngineering
