
AI-Powered Lead Scoring: Ranking Prospects with Machine Learning and LLMs

Build a hybrid lead scoring system that combines traditional ML feature engineering with LLM-based qualitative analysis for more accurate prospect ranking and CRM integration.

The Problem with Traditional Lead Scoring

Most CRM lead scoring relies on manually assigned point values: a whitepaper download earns 10 points, a pricing-page visit earns 20. These static rules cannot adapt to changing buyer behavior and miss subtle signals that predict intent. A hybrid approach layers a machine learning model for behavioral signals on top of an LLM evaluator for qualitative signals, producing scores that are both data-driven and context-aware.

Feature Engineering for Lead Scoring

The ML component needs structured features derived from prospect behavior and firmographic data. Good features capture both recency and intensity of engagement.

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LeadFeatures:
    # Behavioral features
    page_views_7d: int = 0
    pricing_page_visits: int = 0
    content_downloads: int = 0
    email_opens_30d: int = 0
    email_clicks_30d: int = 0
    days_since_last_activity: int = 999
    form_submissions: int = 0

    # Firmographic features
    company_size_bucket: int = 0  # 0=unknown, 1=small, 2=mid, 3=enterprise
    industry_match: bool = False
    tech_stack_fit: float = 0.0  # 0.0 to 1.0

    def to_vector(self) -> list[float]:
        return [
            self.page_views_7d,
            self.pricing_page_visits * 3,  # weighted
            self.content_downloads,
            self.email_opens_30d,
            self.email_clicks_30d,
            max(0, 30 - self.days_since_last_activity),  # recency score
            self.form_submissions * 5,
            self.company_size_bucket,
            float(self.industry_match),
            self.tech_stack_fit,
        ]


def extract_features(lead_id: str, events: list[dict]) -> LeadFeatures:
    features = LeadFeatures()
    now = datetime.utcnow()
    seven_days_ago = now - timedelta(days=7)
    thirty_days_ago = now - timedelta(days=30)
    last_activity = None

    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        if last_activity is None or ts > last_activity:
            last_activity = ts
        if event["type"] == "page_view" and ts > seven_days_ago:
            features.page_views_7d += 1
            if "/pricing" in event.get("url", ""):
                features.pricing_page_visits += 1
        elif event["type"] == "content_download":
            features.content_downloads += 1
        elif event["type"] == "email_open" and ts > thirty_days_ago:
            features.email_opens_30d += 1
        elif event["type"] == "email_click" and ts > thirty_days_ago:
            features.email_clicks_30d += 1
        elif event["type"] == "form_submit":
            features.form_submissions += 1

    if last_activity is not None:
        features.days_since_last_activity = (now - last_activity).days
    return features

Training a Gradient Boosted Scoring Model

Gradient boosted trees handle the mixed feature types and nonlinear relationships common in lead scoring. We train on historical conversion data where the label is whether a lead became a customer.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
import joblib


def train_scoring_model(
    features: list[list[float]], labels: list[int]
) -> GradientBoostingClassifier:
    X = np.array(features)
    y = np.array(labels)

    model = GradientBoostingClassifier(
        n_estimators=200,
        max_depth=4,
        learning_rate=0.1,
        subsample=0.8,
        random_state=42,
    )

    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"Cross-validated AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")

    model.fit(X, y)
    joblib.dump(model, "lead_scoring_model.pkl")
    return model


def predict_score(model, features: LeadFeatures) -> float:
    """Return a probability score between 0 and 100."""
    vector = np.array([features.to_vector()])
    probability = model.predict_proba(vector)[0][1]
    return round(probability * 100, 1)
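Raw probabilities tend to cluster in a narrow band, which makes absolute scores hard to compare across retraining cycles. One option is to report each lead's percentile rank against a window of recent scores instead. This is a minimal sketch; the helper name score_to_percentile and the midpoint fallback are our own conventions, not from the article's pipeline.

```python
from bisect import bisect_left


def score_to_percentile(score: float, historical_scores: list[float]) -> float:
    """Map a raw 0-100 score to its percentile rank among historical scores.

    Counts strictly lower historical scores, so ties rank below the input.
    """
    if not historical_scores:
        return 50.0  # no history yet: fall back to the midpoint
    ordered = sorted(historical_scores)
    rank = bisect_left(ordered, score)
    return round(100 * rank / len(ordered), 1)
```

Percentile ranks stay meaningful even when a retrained model shifts the overall probability distribution.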

LLM-Based Qualitative Scoring

The ML model captures behavioral patterns but misses qualitative signals buried in free-text fields like job titles, company descriptions, and conversation transcripts. An LLM evaluator fills this gap.

from openai import AsyncOpenAI

client = AsyncOpenAI()

QUALITATIVE_PROMPT = """Analyze this prospect and score their fit for our
B2B SaaS product (AI-powered customer service platform).

Prospect info:
- Title: {title}
- Company description: {company_desc}
- Recent conversation notes: {notes}
- LinkedIn headline: {headline}

Return JSON with:
- "qualitative_score": integer 1-100
- "buying_signals": list of observed signals
- "objection_risks": list of potential objections
- "recommended_approach": one sentence on how to engage
"""


async def qualitative_score(prospect: dict) -> dict:
    import json
    response = await client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # lower temperature reduces run-to-run score variance
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "Return valid JSON only."},
            {
                "role": "user",
                "content": QUALITATIVE_PROMPT.format(
                    title=prospect.get("title", "Unknown"),
                    company_desc=prospect.get("company_desc", ""),
                    notes=prospect.get("notes", "No notes"),
                    headline=prospect.get("headline", ""),
                ),
            },
        ],
    )
    return json.loads(response.choices[0].message.content)
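Even with a JSON response format, the model can return an out-of-range score or omit a key. A defensive normalizer keeps downstream code simple. This is a sketch; the function name normalize_qualitative and the neutral default of 50 are assumptions, not part of the article's pipeline.

```python
def normalize_qualitative(result: dict) -> dict:
    """Coerce an LLM scoring response into the expected shape.

    Clamps the score to 1-100 and substitutes safe defaults for
    missing or malformed fields.
    """
    try:
        score = int(result.get("qualitative_score", 50))
    except (TypeError, ValueError):
        score = 50  # unparseable score: fall back to neutral
    return {
        "qualitative_score": max(1, min(100, score)),
        "buying_signals": result.get("buying_signals") or [],
        "objection_risks": result.get("objection_risks") or [],
        "recommended_approach": result.get("recommended_approach") or "",
    }
```

Running the parsed response through this before blending prevents a single malformed reply from corrupting a composite score.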

Combining ML and LLM Scores

The final composite score blends the behavioral ML score with the qualitative LLM score. A weighted average lets you tune the balance based on which signal has proven more predictive for your business.

async def composite_score(
    model,
    features: LeadFeatures,
    prospect: dict,
    ml_weight: float = 0.6,
) -> dict:
    ml_score = predict_score(model, features)
    qual_result = await qualitative_score(prospect)
    llm_score = qual_result["qualitative_score"]

    final_score = (ml_score * ml_weight) + (llm_score * (1 - ml_weight))

    tier = "hot" if final_score >= 75 else "warm" if final_score >= 40 else "cold"

    return {
        "final_score": round(final_score, 1),
        "ml_score": ml_score,
        "llm_score": llm_score,
        "tier": tier,
        "buying_signals": qual_result.get("buying_signals", []),
        "recommended_approach": qual_result.get("recommended_approach"),
    }
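The blending arithmetic and tier cutoffs can be isolated into a pure function so they are unit-testable without hitting the model or the LLM. A hypothetical helper, mirroring the weights and thresholds used above:

```python
def blend_scores(
    ml_score: float, llm_score: float, ml_weight: float = 0.6
) -> tuple[float, str]:
    """Weighted blend of behavioral and qualitative scores, plus tier."""
    if not 0.0 <= ml_weight <= 1.0:
        raise ValueError("ml_weight must be between 0 and 1")
    final = ml_score * ml_weight + llm_score * (1 - ml_weight)
    tier = "hot" if final >= 75 else "warm" if final >= 40 else "cold"
    return round(final, 1), tier
```

Keeping the cutoffs in one place also makes it easy to recalibrate tiers when score distributions shift after retraining.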

CRM Integration

Scores are only useful if they flow back into your sales team's workflow. Push composite scores and tier assignments back to your CRM so reps see them inline with their lead views.

import httpx


async def sync_score_to_hubspot(
    lead_email: str, score_data: dict, api_key: str
):
    async with httpx.AsyncClient() as client:
        # Find contact by email
        search_resp = await client.post(
            "https://api.hubapi.com/crm/v3/objects/contacts/search",
            headers={"Authorization": f"Bearer {api_key}"},
            json={
                "filterGroups": [{
                    "filters": [{
                        "propertyName": "email",
                        "operator": "EQ",
                        "value": lead_email,
                    }]
                }]
            },
        )
        search_resp.raise_for_status()
        contacts = search_resp.json().get("results", [])
        if not contacts:
            return
        contact_id = contacts[0]["id"]

        # Update custom properties (ai_lead_score and ai_lead_tier must be
        # created as contact properties in HubSpot beforehand)
        await client.patch(
            f"https://api.hubapi.com/crm/v3/objects/contacts/{contact_id}",
            headers={"Authorization": f"Bearer {api_key}"},
            json={
                "properties": {
                    "ai_lead_score": str(score_data["final_score"]),
                    "ai_lead_tier": score_data["tier"],
                }
            },
        )
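CRM APIs rate-limit and fail transiently, so sync calls benefit from retries with backoff. A minimal generic wrapper, sketched with stdlib asyncio; the name with_retries and the delay schedule are our own assumptions:

```python
import asyncio


async def with_retries(coro_factory, attempts: int = 3, base_delay: float = 0.5):
    """Run an async callable, retrying with exponential backoff on failure.

    coro_factory must be a zero-argument callable returning a fresh
    coroutine each attempt (a coroutine object can only be awaited once).
    """
    for attempt in range(attempts):
        try:
            return await coro_factory()
        except Exception:
            if attempt == attempts - 1:
                raise  # retries exhausted: surface the last error
            await asyncio.sleep(base_delay * (2 ** attempt))
```

A sync call could then be wrapped as, e.g., `await with_retries(lambda: sync_score_to_hubspot(email, data, key))`.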

FAQ

How often should I retrain the ML scoring model?

Retrain monthly or whenever your conversion rate shifts by more than 10 percent. Use a holdout set from the most recent 30 days to validate that the model still generalizes. Stale models degrade quietly because the AUC on historical test sets stays high even when real-world accuracy drops.
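The retraining trigger described above can be automated with a simple drift check. A sketch, where the relative-shift threshold of 10 percent matches the rule of thumb in the answer and the function name is our own:

```python
def conversion_drift(
    baseline_rate: float, recent_rate: float, threshold: float = 0.10
) -> bool:
    """True when the relative shift in conversion rate exceeds the threshold."""
    if baseline_rate <= 0:
        return True  # no meaningful baseline: retrain to be safe
    return abs(recent_rate - baseline_rate) / baseline_rate > threshold
```

Run this on a schedule against the trailing 30-day conversion rate and kick off retraining when it returns True.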

Can the LLM scoring replace the ML model entirely?

Not reliably. LLMs are excellent at qualitative judgment but inconsistent with numerical scoring across large batches. The same prompt can return different scores for identical inputs on different runs. The ML model provides a stable baseline, and the LLM adds nuance that structured features cannot capture.
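One way to damp the run-to-run variance is to score the same prospect several times and take the median, which is robust to a single outlier run. A sketch; the helper name and sample count are assumptions:

```python
from statistics import median


def stable_llm_score(samples: list[int]) -> float:
    """Median of several LLM scoring runs for the same prospect."""
    if not samples:
        raise ValueError("at least one sample is required")
    return float(median(samples))
```

Three to five runs is usually enough, trading a few extra API calls for a noticeably steadier score.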

What happens when the ML and LLM scores disagree sharply?

Flag these cases for human review. A lead with a high behavioral score but low qualitative score might be a bot or a researcher, not a buyer. Conversely, a low behavioral score with high qualitative fit might indicate a new lead who has not had time to engage yet and deserves proactive outreach.
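The review-flagging rule can be made explicit with a small predicate on the score gap. A sketch; the 35-point threshold and function name are illustrative choices, not values from the article:

```python
def needs_review(
    ml_score: float, llm_score: float, gap_threshold: float = 35.0
) -> bool:
    """Flag leads whose behavioral and qualitative scores diverge sharply."""
    return abs(ml_score - llm_score) >= gap_threshold
```

Flagged leads can be routed to a manual-review queue instead of being auto-tiered.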


#LeadScoring #MachineLearning #LLM #CRMIntegration #Python #AgenticAI #LearnAI #AIEngineering
