
Benchmarking Vector Databases: Latency, Throughput, and Recall at Scale

Learn how to rigorously benchmark vector databases with proper methodology — measuring latency, throughput, and recall under realistic conditions to make informed infrastructure decisions.

Why Benchmark Your Own Workload

Vendor benchmarks are marketing. They show optimal configurations on favorable datasets under ideal conditions. Your application has specific embedding dimensions, query patterns, filter complexity, and concurrency levels that no generic benchmark captures.

The only benchmark that matters is one that simulates your actual workload. This guide covers the methodology, metrics, and tooling to run rigorous vector database benchmarks that inform real infrastructure decisions.

The Three Metrics That Matter

1. Recall at K — What fraction of the true nearest neighbors does the system return? Recall of 0.95 at K=10 means that, averaged across queries, 9.5 of the 10 true nearest neighbors appear in the results.

2. Query Latency — How long does a single query take? Measure P50, P95, and P99 — averages hide tail latency that affects user experience.

3. Queries Per Second (QPS) — How many concurrent queries can the system handle before latency degrades? This determines how many users your system can serve.

These three metrics are in tension. Higher recall requires searching more candidates, which increases latency and reduces throughput. Every index configuration is a point on this three-way tradeoff surface.
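
The warning about averages is easy to see with a toy set of hypothetical latencies:

```python
import numpy as np

# 90 fast queries and 10 slow outliers: the mean understates the tail badly
latencies_ms = [5.0] * 90 + [500.0] * 10
print(
    f"mean={np.mean(latencies_ms):.1f}ms  "
    f"p50={np.percentile(latencies_ms, 50):.1f}ms  "
    f"p99={np.percentile(latencies_ms, 99):.1f}ms"
)
# mean=54.5ms  p50=5.0ms  p99=500.0ms
```

A dashboard showing the 54.5 ms mean hides the fact that one query in ten takes half a second.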

Building a Benchmark Suite

Start with a reproducible benchmark framework:

import time
import numpy as np
from dataclasses import dataclass, field

@dataclass
class BenchmarkResult:
    recall_at_k: float
    latencies_ms: list[float] = field(default_factory=list)

    @property
    def p50_ms(self) -> float:
        return float(np.percentile(self.latencies_ms, 50))

    @property
    def p95_ms(self) -> float:
        return float(np.percentile(self.latencies_ms, 95))

    @property
    def p99_ms(self) -> float:
        return float(np.percentile(self.latencies_ms, 99))

    @property
    def qps(self) -> float:
        # single-client throughput; assumes the queries were timed sequentially
        total_seconds = sum(self.latencies_ms) / 1000.0
        return len(self.latencies_ms) / total_seconds if total_seconds > 0 else 0.0

Computing Ground Truth

To measure recall, you need exact nearest neighbors as ground truth. Generate these with brute-force search:

import faiss

def compute_ground_truth(
    vectors: np.ndarray,
    queries: np.ndarray,
    k: int = 10
) -> np.ndarray:
    """Compute exact nearest neighbors using brute-force search."""
    dimension = vectors.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(vectors)
    distances, indices = index.search(queries, k)
    return indices  # shape: (num_queries, k)
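
If FAISS is not installed, the same ground truth can be computed with plain NumPy — a sketch that is fine up to a few hundred thousand vectors (`brute_force_ground_truth` is our helper name, not a library function):

```python
import numpy as np

def brute_force_ground_truth(
    vectors: np.ndarray, queries: np.ndarray, k: int = 10
) -> np.ndarray:
    """Exact k-NN via pairwise squared L2 distances."""
    # ||q - v||^2 = ||q||^2 - 2 q.v + ||v||^2 (the ||q||^2 term does not change the ranking)
    dists = (
        np.sum(queries ** 2, axis=1, keepdims=True)
        - 2.0 * queries @ vectors.T
        + np.sum(vectors ** 2, axis=1)
    )
    return np.argsort(dists, axis=1)[:, :k]

rng = np.random.default_rng(42)
vectors = rng.standard_normal((1000, 32)).astype(np.float32)
queries = vectors[:5]  # query with stored vectors: each should rank itself first
gt = brute_force_ground_truth(vectors, queries, k=10)
```

Its O(queries × vectors) cost is exactly why you compute ground truth once, offline, and reuse it across every configuration you test.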

Measuring Recall

Compare ANN results against ground truth:


def compute_recall(
    ann_results: list[list[int]],
    ground_truth: np.ndarray,
    k: int = 10
) -> float:
    """Compute recall@k: fraction of true neighbors found."""
    total_recall = 0.0
    for i, ann_ids in enumerate(ann_results):
        true_ids = set(ground_truth[i][:k])
        found = len(set(ann_ids[:k]) & true_ids)
        total_recall += found / k
    return total_recall / len(ann_results)
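
As a worked example of the formula (with made-up IDs):

```python
# recall@5 by hand for two queries
true_1, ann_1 = {0, 1, 2, 3, 4}, {0, 1, 2, 3, 99}  # 4 of 5 true neighbors found -> 0.8
true_2, ann_2 = {5, 6, 7, 8, 9}, {5, 6, 7, 8, 9}   # 5 of 5 found -> 1.0
recall = (len(ann_1 & true_1) / 5 + len(ann_2 & true_2) / 5) / 2
print(recall)  # 0.9
```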

Benchmarking pgvector

import psycopg
from pgvector.psycopg import register_vector

def benchmark_pgvector(
    conn,
    queries: np.ndarray,
    ground_truth: np.ndarray,
    k: int = 10,
    ef_search: int = 40
) -> BenchmarkResult:
    register_vector(conn)
    conn.execute(f"SET hnsw.ef_search = {ef_search}")

    latencies = []
    all_results = []

    for query_vec in queries:
        start = time.perf_counter()
        rows = conn.execute(
            "SELECT id FROM documents ORDER BY embedding <=> %s LIMIT %s",
            (query_vec, k)  # register_vector adapts numpy arrays to the vector type
        ).fetchall()
        elapsed_ms = (time.perf_counter() - start) * 1000

        latencies.append(elapsed_ms)
        all_results.append([row[0] for row in rows])

    recall = compute_recall(all_results, ground_truth, k)
    return BenchmarkResult(recall_at_k=recall, latencies_ms=latencies)

Benchmarking Pinecone

from pinecone import Pinecone

def benchmark_pinecone(
    index,
    queries: np.ndarray,
    ground_truth: np.ndarray,
    k: int = 10
) -> BenchmarkResult:
    latencies = []
    all_results = []

    for query_vec in queries:
        start = time.perf_counter()
        response = index.query(
            vector=query_vec.tolist(),
            top_k=k
        )
        elapsed_ms = (time.perf_counter() - start) * 1000

        latencies.append(elapsed_ms)
        result_ids = [int(m["id"]) for m in response["matches"]]
        all_results.append(result_ids)

    recall = compute_recall(all_results, ground_truth, k)
    return BenchmarkResult(recall_at_k=recall, latencies_ms=latencies)

Concurrent Load Testing

Single-query latency tells only part of the story. Test under concurrent load to find throughput limits:

import concurrent.futures

def concurrent_benchmark(
    search_fn,
    queries: np.ndarray,
    concurrency: int = 10
) -> dict:
    latencies = []

    def run_query(query_vec):
        start = time.perf_counter()
        search_fn(query_vec)
        return (time.perf_counter() - start) * 1000

    start_all = time.perf_counter()

    with concurrent.futures.ThreadPoolExecutor(
        max_workers=concurrency
    ) as executor:
        futures = [
            executor.submit(run_query, q)
            for q in queries
        ]
        for future in concurrent.futures.as_completed(futures):
            latencies.append(future.result())

    total_time = time.perf_counter() - start_all
    return {
        "concurrency": concurrency,
        "total_queries": len(queries),
        "total_time_s": total_time,
        "qps": len(queries) / total_time,
        "p50_ms": float(np.percentile(latencies, 50)),
        "p95_ms": float(np.percentile(latencies, 95)),
        "p99_ms": float(np.percentile(latencies, 99)),
    }
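
To see why concurrent testing matters without a live database, a simulated network-bound search function is enough (`fake_search` and its 5 ms delay are stand-ins, not a real client):

```python
import concurrent.futures
import time

def fake_search(query):
    time.sleep(0.005)  # stand-in for a ~5 ms network-bound ANN query

def run_at(concurrency: int, n_queries: int = 64) -> float:
    """Return QPS when n_queries are issued with the given worker count."""
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as executor:
        list(executor.map(fake_search, range(n_queries)))
    return n_queries / (time.perf_counter() - start)

qps_1, qps_16 = run_at(1), run_at(16)
print(f"1 worker: {qps_1:.0f} QPS, 16 workers: {qps_16:.0f} QPS")
```

Because the simulated work is I/O-bound, threads multiply throughput almost linearly until the server saturates; a CPU-bound local index would behave differently under the GIL, so measure both regimes if they apply to you.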

Running a Sweep

Test multiple configurations to find the optimal recall-latency tradeoff:

def parameter_sweep_pgvector(conn, queries, ground_truth):
    results = []
    for ef_search in [10, 20, 40, 80, 160, 320]:
        result = benchmark_pgvector(
            conn, queries, ground_truth,
            k=10, ef_search=ef_search
        )
        results.append({
            "ef_search": ef_search,
            "recall": result.recall_at_k,
            "p50_ms": result.p50_ms,
            "p95_ms": result.p95_ms,
            "qps": result.qps,
        })
        print(
            f"ef_search={ef_search}: "
            f"recall={result.recall_at_k:.3f}, "
            f"p50={result.p50_ms:.1f}ms, "
            f"p95={result.p95_ms:.1f}ms"
        )
    return results
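
The sweep is only useful with a decision rule on top. A common one: pick the cheapest configuration that clears your recall floor (the numbers below are hypothetical sweep output):

```python
sweep = [
    {"ef_search": 10,  "recall": 0.89,  "p95_ms": 1.2},
    {"ef_search": 40,  "recall": 0.96,  "p95_ms": 2.8},
    {"ef_search": 160, "recall": 0.995, "p95_ms": 9.5},
]
RECALL_FLOOR = 0.95
viable = [r for r in sweep if r["recall"] >= RECALL_FLOOR]
best = min(viable, key=lambda r: r["p95_ms"])
print(best["ef_search"])  # 40: the lowest-latency config that still meets the floor
```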

Benchmarking Best Practices

Use realistic data. Random vectors behave differently from real embeddings. Use a subset of your actual production embeddings or a standard benchmark dataset from ANN-Benchmarks (sift-128, gist-960, or deep-image-96).

Warm up before measuring. Run 100-200 throwaway queries to fill caches and warm JIT-compiled code paths. Only measure after warmup.
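
A warm-up pass is only a few lines (`warm_up` is our helper name; any search callable works):

```python
def warm_up(search_fn, queries, n_warmup: int = 200) -> None:
    """Issue throwaway queries so caches, connection pools, and lazy code paths are hot."""
    for query in queries[:n_warmup]:
        search_fn(query)

# verify it issues exactly n_warmup calls against a recording stub
calls = []
warm_up(calls.append, list(range(1000)), n_warmup=200)
print(len(calls))  # 200
```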

Test with filters. If your application uses metadata filtering, include filters in your benchmark. Filtered search performance can differ dramatically from unfiltered.

Measure at your target scale. Performance at 100K vectors does not predict performance at 10M vectors. Load your benchmark with the volume you expect in production.

Run multiple trials. Network variability (especially for cloud databases) can skew individual measurements. Run each configuration 3-5 times and report the median.
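
Median-of-trials can also be a small helper (`median_of_trials` is ours; the latencies below are made up to show one noisy outlier being damped):

```python
import statistics

def median_of_trials(run_trial, n_trials: int = 5) -> float:
    """Run a benchmark n_trials times and report the median result."""
    return statistics.median(run_trial() for _ in range(n_trials))

# e.g. five p95 measurements (ms) where one trial hit a network blip
samples = [12.1, 48.0, 11.8, 13.2, 11.9]
it = iter(samples)
result = median_of_trials(lambda: next(it))
print(result)  # 12.1
```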

Real-World Performance Expectations

Based on publicly available benchmarks and community reports for 1M vectors at 1536 dimensions with HNSW:

Database                  | P50 Latency | Recall@10 | QPS (single client)
pgvector (PostgreSQL 16)  | 3-8ms       | 0.95-0.99 | 200-500
Pinecone (serverless)     | 10-30ms     | 0.95+     | 100-300
Weaviate (self-hosted)    | 2-5ms       | 0.95-0.99 | 300-800
Chroma (self-hosted)      | 5-15ms      | 0.95+     | 100-400

These numbers vary significantly based on hardware, index configuration, and query complexity. Always benchmark your own workload.

FAQ

How many queries should I run to get statistically meaningful benchmark results?

At minimum, run 1,000 queries per configuration. For latency percentiles (P95, P99), you need at least 10,000 queries to get stable measurements. Use different query vectors for each run — repeating the same queries can bias results due to caching effects.

Should I benchmark with or without metadata filters?

Both. Run a baseline without filters to understand raw vector search performance, then add filters that match your production query patterns. The performance gap between filtered and unfiltered search reveals how much overhead your filter strategy adds, which helps you design better metadata schemas.

How do I compare self-hosted vs managed vector databases fairly?

Match the compute resources. If your self-hosted pgvector runs on a 4-core, 16GB machine, compare it against a similarly sized managed instance, not the vendor's top-tier offering. Also account for operational costs — the managed service includes monitoring, backups, and scaling that you would need to build yourself.


#Benchmarking #VectorDatabase #Performance #Latency #Recall #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
