Skip to content
Learn Agentic AI
Learn Agentic AI11 min read0 views

Web Scraping Tools for AI Agents: Fetching and Parsing External Data

Build HTTP-based tools that let AI agents fetch web pages, parse content, and extract structured data. Covers request handling, HTML parsing, rate limiting, error handling, and responsible scraping practices.

Why Agents Need Web Access

An AI agent limited to its training data and local tools can only answer questions about things it already knows. Web scraping tools unlock real-time information: current prices, live documentation, recent news, API responses, and public datasets. This makes agents dramatically more useful for research, monitoring, and data collection tasks.

This post builds a set of web-fetching tools with proper safety controls.

Tool 1: Fetch a URL and Extract Text

The most fundamental web tool fetches a URL and returns the readable text content:

flowchart TD
    START["Web Scraping Tools for AI Agents: Fetching and Pa…"] --> A
    A["Why Agents Need Web Access"]
    A --> B
    B["Tool 1: Fetch a URL and Extract Text"]
    B --> C
    C["The Tool Schema"]
    C --> D
    D["Tool 2: Structured Data Extraction"]
    D --> E
    E["Rate Limiting"]
    E --> F
    F["URL Allowlisting"]
    F --> G
    G["Error Handling Strategy"]
    G --> H
    H["FAQ"]
    H --> DONE["Key Takeaways"]
    style START fill:#4f46e5,stroke:#4338ca,color:#fff
    style DONE fill:#059669,stroke:#047857,color:#fff
import httpx
from bs4 import BeautifulSoup

async def fetch_page(url: str, timeout: int = 10) -> str:
    allowed_schemes = {"http", "https"}
    from urllib.parse import urlparse
    parsed = urlparse(url)

    if parsed.scheme not in allowed_schemes:
        return f"Error: Only HTTP and HTTPS URLs are allowed"

    if parsed.hostname in ("localhost", "127.0.0.1", "0.0.0.0"):
        return f"Error: Cannot fetch local/internal URLs"

    try:
        async with httpx.AsyncClient(
            timeout=timeout,
            follow_redirects=True,
            max_redirects=3,
        ) as client:
            response = await client.get(url, headers={
                "User-Agent": "AgentBot/1.0 (research assistant)"
            })
            response.raise_for_status()

            soup = BeautifulSoup(response.text, "html.parser")

            for tag in soup(["script", "style", "nav", "footer", "header"]):
                tag.decompose()

            text = soup.get_text(separator="\n", strip=True)
            # Truncate to avoid flooding the context window
            if len(text) > 8000:
                text = text[:8000] + "\n\n[Content truncated at 8000 characters]"

            return text
    except httpx.TimeoutException:
        return "Error: Request timed out"
    except httpx.HTTPStatusError as e:
        return f"Error: HTTP {e.response.status_code}"
    except httpx.RequestError as e:
        return f"Error: {str(e)}"

Key design decisions: blocking localhost prevents SSRF attacks, the timeout prevents hanging on slow servers, HTML is stripped to text to reduce token usage, and the result is truncated to prevent context window overflow.

The Tool Schema

fetch_tool_schema = {
    "type": "function",
    "function": {
        "name": "fetch_webpage",
        "description": "Fetch a webpage URL and return its text content. Use this to read documentation, articles, or any public webpage. Returns plain text with HTML tags stripped. Content is truncated to 8000 characters.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {
                    "type": "string",
                    "description": "The full URL to fetch, including https://"
                }
            },
            "required": ["url"]
        }
    }
}

Tool 2: Structured Data Extraction

Sometimes the agent needs specific data points rather than raw text. Build a tool that extracts structured content using CSS selectors:

async def extract_data(url: str, selectors: dict) -> str:
    import json

    try:
        async with httpx.AsyncClient(timeout=10, follow_redirects=True) as client:
            response = await client.get(url)
            response.raise_for_status()

        soup = BeautifulSoup(response.text, "html.parser")
        results = {}

        for key, selector in selectors.items():
            elements = soup.select(selector)
            if elements:
                results[key] = [el.get_text(strip=True) for el in elements[:20]]
            else:
                results[key] = []

        return json.dumps(results, indent=2)
    except Exception as e:
        return f"Error: {str(e)}"

The matching schema allows the LLM to specify what to extract:

extract_tool_schema = {
    "type": "function",
    "function": {
        "name": "extract_data",
        "description": "Extract specific elements from a webpage using CSS selectors. Returns a JSON object mapping selector names to extracted text content.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {
                    "type": "string",
                    "description": "The URL to fetch"
                },
                "selectors": {
                    "type": "object",
                    "description": "A mapping of label names to CSS selectors. Example: {"titles": "h2.title", "prices": ".price-tag"}"
                }
            },
            "required": ["url", "selectors"]
        }
    }
}

Rate Limiting

Agents can be aggressive with tool calls. Implement rate limiting to be a responsible web citizen:

See AI Voice Agents Handle Real Calls

Book a free demo or calculate how much you can save with AI voice automation.

import time
from collections import defaultdict

class RateLimiter:
    def __init__(self, requests_per_second: float = 1.0):
        self.min_interval = 1.0 / requests_per_second
        self.last_request: dict[str, float] = defaultdict(float)

    async def wait(self, domain: str):
        elapsed = time.monotonic() - self.last_request[domain]
        if elapsed < self.min_interval:
            await asyncio.sleep(self.min_interval - elapsed)
        self.last_request[domain] = time.monotonic()

rate_limiter = RateLimiter(requests_per_second=0.5)

async def fetch_page_with_rate_limit(url: str) -> str:
    from urllib.parse import urlparse
    domain = urlparse(url).hostname
    await rate_limiter.wait(domain)
    return await fetch_page(url)

This ensures no more than one request per two seconds to any given domain.

URL Allowlisting

In production, restrict which domains the agent can access:

ALLOWED_DOMAINS = {
    "docs.python.org",
    "developer.mozilla.org",
    "api.github.com",
    "en.wikipedia.org",
}

def is_allowed_url(url: str) -> bool:
    from urllib.parse import urlparse
    hostname = urlparse(url).hostname
    return hostname in ALLOWED_DOMAINS

This prevents the agent from being tricked via prompt injection into fetching arbitrary URLs.

Error Handling Strategy

Web requests fail frequently. Design your error responses to help the LLM recover:

def format_fetch_error(url: str, error: Exception) -> str:
    if isinstance(error, httpx.TimeoutException):
        return f"The page at {url} took too long to respond. Try a different source."
    elif isinstance(error, httpx.HTTPStatusError):
        status = error.response.status_code
        if status == 404:
            return f"Page not found at {url}. Check the URL or try a different page."
        elif status == 403:
            return f"Access denied to {url}. This site blocks automated requests."
        elif status >= 500:
            return f"Server error at {url}. The site may be temporarily down."
    return f"Could not fetch {url}: {str(error)}"

Descriptive error messages let the LLM adapt its strategy — trying a different URL, rephrasing a search, or informing the user — rather than blindly retrying.

FAQ

How do I handle JavaScript-rendered pages?

The tools above only fetch raw HTML. For JavaScript-heavy sites, you need a headless browser like Playwright. However, this adds significant complexity and resource usage. For most agent use cases, raw HTML fetching covers 80% of needs. Add Playwright as a separate, heavier tool only when required.

Should I cache fetched pages?

Yes. Add a simple TTL cache keyed by URL. If the agent fetches the same documentation page three times during a single conversation, there is no reason to hit the server each time. A 5-minute cache TTL works well for most cases.

How do I prevent prompt injection from web content?

Web pages can contain text designed to manipulate the LLM. Sanitize fetched content by removing suspicious patterns, truncating aggressively, and framing the content in the tool result as untrusted data. In the system prompt, instruct the agent to treat fetched content as user-provided input, not as instructions.


#WebScraping #HTTP #ToolDesign #AIAgents #Python #AgenticAI #LearnAI #AIEngineering

Share
C

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.