# AI-Powered Lead Scoring: Ranking Prospects with Machine Learning and LLMs

Build a hybrid lead scoring system that combines traditional ML feature engineering with LLM-based qualitative analysis for more accurate prospect ranking and CRM integration.
## The Problem with Traditional Lead Scoring

Most CRM lead scoring relies on manually assigned point values: downloading a whitepaper earns 10 points, visiting the pricing page earns 20. These static rules cannot adapt to changing buyer behavior and miss subtle signals that predict intent. A hybrid approach layers a machine learning model for behavioral signals on top of an LLM evaluator for qualitative ones, producing scores that are both data-driven and context-aware.
## Feature Engineering for Lead Scoring

The ML component needs structured features derived from prospect behavior and firmographic data. Good features capture both the recency and the intensity of engagement.
```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class LeadFeatures:
    # Behavioral features
    page_views_7d: int = 0
    pricing_page_visits: int = 0
    content_downloads: int = 0
    email_opens_30d: int = 0
    email_clicks_30d: int = 0
    days_since_last_activity: int = 999
    form_submissions: int = 0

    # Firmographic features
    company_size_bucket: int = 0  # 0=unknown, 1=small, 2=mid, 3=enterprise
    industry_match: bool = False
    tech_stack_fit: float = 0.0  # 0.0 to 1.0

    def to_vector(self) -> list[float]:
        return [
            self.page_views_7d,
            self.pricing_page_visits * 3,  # weighted
            self.content_downloads,
            self.email_opens_30d,
            self.email_clicks_30d,
            max(0, 30 - self.days_since_last_activity),  # recency score
            self.form_submissions * 5,
            self.company_size_bucket,
            float(self.industry_match),
            self.tech_stack_fit,
        ]
```
```python
def extract_features(lead_id: str, events: list[dict]) -> LeadFeatures:
    features = LeadFeatures()
    now = datetime.utcnow()
    seven_days_ago = now - timedelta(days=7)
    thirty_days_ago = now - timedelta(days=30)

    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        if event["type"] == "page_view" and ts > seven_days_ago:
            features.page_views_7d += 1
            if "/pricing" in event.get("url", ""):
                features.pricing_page_visits += 1
        elif event["type"] == "email_open" and ts > thirty_days_ago:
            features.email_opens_30d += 1
        elif event["type"] == "email_click" and ts > thirty_days_ago:
            features.email_clicks_30d += 1
        elif event["type"] == "content_download":
            features.content_downloads += 1
        elif event["type"] == "form_submit":
            features.form_submissions += 1

        # Track recency across all event types
        days_ago = max(0, (now - ts).days)
        features.days_since_last_activity = min(
            features.days_since_last_activity, days_ago
        )
    return features
```
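The windowing logic in `extract_features` is easy to get subtly wrong, so it helps to exercise it in isolation. The `count_recent` helper and the timestamps below are illustrative, not part of the pipeline:

```python
from datetime import datetime, timedelta

def count_recent(events, event_type, now, window_days):
    """Count events of a given type whose timestamp falls inside the window."""
    cutoff = now - timedelta(days=window_days)
    return sum(
        1 for e in events
        if e["type"] == event_type
        and datetime.fromisoformat(e["timestamp"]) > cutoff
    )

now = datetime(2024, 6, 30)
events = [
    {"type": "page_view", "timestamp": "2024-06-28T10:00:00"},   # inside 7d
    {"type": "page_view", "timestamp": "2024-06-01T10:00:00"},   # outside 7d
    {"type": "email_open", "timestamp": "2024-06-10T10:00:00"},  # inside 30d
]
print(count_recent(events, "page_view", now, 7))    # 1
print(count_recent(events, "email_open", now, 30))  # 1
```

The same pattern generalizes to any rolling window you add later, such as a 90-day bucket for slower sales cycles.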
## Training a Gradient Boosted Scoring Model

Gradient boosted trees handle the mixed feature types and nonlinear relationships common in lead scoring. We train on historical conversion data where the label is whether a lead became a customer.
```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
import joblib

def train_scoring_model(
    features: list[list[float]], labels: list[int]
) -> GradientBoostingClassifier:
    X = np.array(features)
    y = np.array(labels)
    model = GradientBoostingClassifier(
        n_estimators=200,
        max_depth=4,
        learning_rate=0.1,
        subsample=0.8,
        random_state=42,
    )
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"Cross-validated AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")
    model.fit(X, y)
    joblib.dump(model, "lead_scoring_model.pkl")
    return model

def predict_score(model, features: LeadFeatures) -> float:
    """Return a probability score between 0 and 100."""
    vector = np.array([features.to_vector()])
    probability = model.predict_proba(vector)[0][1]
    return round(probability * 100, 1)
```
## LLM-Based Qualitative Scoring

The ML model captures behavioral patterns but misses qualitative signals buried in free-text fields like job titles, company descriptions, and conversation transcripts. An LLM evaluator fills this gap.
```python
import json
from openai import AsyncOpenAI

client = AsyncOpenAI()

QUALITATIVE_PROMPT = """Analyze this prospect and score their fit for our
B2B SaaS product (AI-powered customer service platform).

Prospect info:
- Title: {title}
- Company description: {company_desc}
- Recent conversation notes: {notes}
- LinkedIn headline: {headline}

Return JSON with:
- "qualitative_score": integer 1-100
- "buying_signals": list of observed signals
- "objection_risks": list of potential objections
- "recommended_approach": one sentence on how to engage
"""

async def qualitative_score(prospect: dict) -> dict:
    response = await client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "Return valid JSON only."},
            {
                "role": "user",
                "content": QUALITATIVE_PROMPT.format(
                    title=prospect.get("title", "Unknown"),
                    company_desc=prospect.get("company_desc", ""),
                    notes=prospect.get("notes", "No notes"),
                    headline=prospect.get("headline", ""),
                ),
            },
        ],
    )
    return json.loads(response.choices[0].message.content)
```
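Even with `response_format` set, LLM output can drift from the requested schema, so a defensive parsing layer is worth a few extra lines. The `parse_qualitative` helper below is an assumption layered on top of the code above, not part of the OpenAI API; the neutral fallback score of 50 is an arbitrary choice:

```python
import json

def parse_qualitative(raw: str) -> dict:
    """Defensively parse the LLM response: clamp the score to 1-100 and
    default missing fields so downstream code never raises KeyError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    score = data.get("qualitative_score", 50)
    if not isinstance(score, (int, float)):
        score = 50  # neutral fallback when the model returns a non-number
    return {
        "qualitative_score": max(1, min(100, int(score))),
        "buying_signals": data.get("buying_signals", []),
        "objection_risks": data.get("objection_risks", []),
        "recommended_approach": data.get("recommended_approach", ""),
    }

print(parse_qualitative('{"qualitative_score": 140}')["qualitative_score"])  # 100
print(parse_qualitative("not json")["qualitative_score"])                    # 50
```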
## Combining ML and LLM Scores

The final composite score blends the behavioral ML score with the qualitative LLM score. A weighted average lets you tune the balance based on which signal has proven more predictive for your business.
```python
async def composite_score(
    model,
    features: LeadFeatures,
    prospect: dict,
    ml_weight: float = 0.6,
) -> dict:
    ml_score = predict_score(model, features)
    qual_result = await qualitative_score(prospect)
    llm_score = qual_result["qualitative_score"]
    final_score = (ml_score * ml_weight) + (llm_score * (1 - ml_weight))
    tier = "hot" if final_score >= 75 else "warm" if final_score >= 40 else "cold"
    return {
        "final_score": round(final_score, 1),
        "ml_score": ml_score,
        "llm_score": llm_score,
        "tier": tier,
        "buying_signals": qual_result.get("buying_signals", []),
        "recommended_approach": qual_result.get("recommended_approach"),
    }
```
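Pulling the blending arithmetic out of the async flow makes the weights and tier thresholds easy to sanity-check without touching the model or the LLM. This `blend` helper is a hypothetical extraction that mirrors the logic above:

```python
def blend(ml_score: float, llm_score: float, ml_weight: float = 0.6):
    """Weighted average of the two scores plus tier assignment."""
    final = ml_score * ml_weight + llm_score * (1 - ml_weight)
    tier = "hot" if final >= 75 else "warm" if final >= 40 else "cold"
    return round(final, 1), tier

print(blend(82.0, 70))  # (77.2, 'hot')
print(blend(30.0, 65))  # (44.0, 'warm')
print(blend(10.0, 20))  # (14.0, 'cold')
```

Note that with `ml_weight=0.6`, a strong qualitative score alone can never push a behaviorally silent lead into the hot tier, which is usually the desired behavior.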
## CRM Integration

Scores are only useful if they flow back into your sales team's workflow. Push composite scores and tier assignments back to your CRM so reps see them inline with their lead views.
```python
import httpx

async def sync_score_to_hubspot(
    lead_email: str, score_data: dict, api_key: str
):
    async with httpx.AsyncClient() as client:
        # Find the contact by email
        search_resp = await client.post(
            "https://api.hubapi.com/crm/v3/objects/contacts/search",
            headers={"Authorization": f"Bearer {api_key}"},
            json={
                "filterGroups": [{
                    "filters": [{
                        "propertyName": "email",
                        "operator": "EQ",
                        "value": lead_email,
                    }]
                }]
            },
        )
        contacts = search_resp.json().get("results", [])
        if not contacts:
            return
        contact_id = contacts[0]["id"]

        # Update custom properties
        await client.patch(
            f"https://api.hubapi.com/crm/v3/objects/contacts/{contact_id}",
            headers={"Authorization": f"Bearer {api_key}"},
            json={
                "properties": {
                    "ai_lead_score": str(score_data["final_score"]),
                    "ai_lead_tier": score_data["tier"],
                }
            },
        )
```
## FAQ

### How often should I retrain the ML scoring model?
Retrain monthly or whenever your conversion rate shifts by more than 10 percent. Use a holdout set from the most recent 30 days to validate that the model still generalizes. Stale models degrade quietly because the AUC on historical test sets stays high even when real-world accuracy drops.
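A minimal sketch of such a time-based holdout split, assuming each labeled lead carries a `closed_at` date (the field name is illustrative):

```python
from datetime import datetime, timedelta

def time_holdout(records, now, holdout_days=30):
    """Split labeled leads into train/holdout sets by close date, so the
    holdout reflects the most recent buyer behavior."""
    cutoff = now - timedelta(days=holdout_days)
    train = [r for r in records if datetime.fromisoformat(r["closed_at"]) < cutoff]
    holdout = [r for r in records if datetime.fromisoformat(r["closed_at"]) >= cutoff]
    return train, holdout

now = datetime(2024, 6, 30)
records = [
    {"closed_at": "2024-03-01", "label": 1},
    {"closed_at": "2024-06-20", "label": 0},
]
train, holdout = time_holdout(records, now)
print(len(train), len(holdout))  # 1 1
```

Evaluating AUC on the holdout rather than on a random split is what surfaces the quiet degradation described above.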
### Can the LLM scoring replace the ML model entirely?
Not reliably. LLMs are excellent at qualitative judgment but inconsistent with numerical scoring across large batches. The same prompt can return different scores for identical inputs on different runs. The ML model provides a stable baseline, and the LLM adds nuance that structured features cannot capture.
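One common mitigation for the run-to-run variance is to sample the LLM several times and take the median score. The `stabilize` helper below is a hypothetical sketch of that idea; note that repeated calls cost proportionally more tokens:

```python
from statistics import median

def stabilize(scores: list[int]) -> float:
    """Reduce run-to-run LLM score variance by taking the median of repeats."""
    return median(scores)

# Three runs of the same prompt on the same prospect (illustrative values)
print(stabilize([62, 70, 58]))  # 62
```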
### What happens when the ML and LLM scores disagree sharply?
Flag these cases for human review. A lead with a high behavioral score but low qualitative score might be a bot or a researcher, not a buyer. Conversely, a low behavioral score with high qualitative fit might indicate a new lead who has not had time to engage yet and deserves proactive outreach.
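Flagging can be as simple as a gap threshold on the two component scores. The threshold of 40 below is an illustrative starting point, not a tuned value:

```python
def needs_review(ml_score: float, llm_score: float, gap_threshold: float = 40) -> bool:
    """Flag leads where the behavioral and qualitative scores diverge sharply."""
    return abs(ml_score - llm_score) >= gap_threshold

print(needs_review(85, 20))  # True  (high behavior, low fit: possible bot/researcher)
print(needs_review(15, 70))  # True  (low behavior, high fit: worth proactive outreach)
print(needs_review(60, 55))  # False
```

Leads returning `True` can be routed to a review queue instead of receiving an automated tier.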
CallSphere Team