September 10, 2025
AI/ML Infrastructure LLM RAG CI/CD

RAG Pipeline Engineering: Chunking, Embedding, and Retrieval

You've probably hit that wall: your LLM knows everything about its training data, but nothing about your proprietary documents. Ask it about your company's Q3 financials from last week, and you get polite fiction. That's where Retrieval-Augmented Generation (RAG) enters the chat - and why getting your pipeline fundamentals right matters way more than most people realize.

Here's the thing: RAG isn't magic. It's plumbing. You feed documents in one end, chunk them into manageable pieces, convert them to embeddings, store them in a vector database, retrieve the most relevant chunks when a user asks a question, and pass those chunks to your LLM as context. Simple enough on paper. But the devil - oh, the devil - lives in the details.

In this article, we're going deep. We'll cover chunking strategies that either make or break your retrieval recall. We'll compare embedding models with actual MTEB benchmarks and cost analysis. We'll show you why reranking is the unsung hero of RAG precision. We'll explore hybrid search (because pure vector search leaves performance on the table). And we'll instrument the whole thing with RAGAS evaluation so you know, not guess, whether your pipeline actually works.

Let's build something that doesn't just work, but works well.


Table of Contents
  1. The Chunking Problem: Why It's Harder Than It Looks
  2. Why Chunking Matters So Much
  3. Fixed-Size Chunking: The Dangerous Shortcut
  4. Sentence-Based Chunking: Better Boundaries, Same Blind Spot
  5. Semantic Chunking: Respecting Document Structure
  6. Recursive Hierarchical Text Splitting: The Production Approach
  7. Tuning Chunk Size: Finding Your Sweet Spot
  8. Embedding Model Selection: Benchmarks, Cost, and Latency
  9. The MTEB Benchmark Landscape
  10. Running Your Own Embedding Comparison
  11. Reranking: The Hidden Multiplier for RAG Precision
  12. How Reranking Works
  13. Reranker Implementation
  14. Reranker Comparison
  15. Hybrid Search: Dense + Sparse = Better Recall
  16. Architecture Diagram
  17. RRF Implementation
  18. Weaviate Hybrid Search (Production)
  19. RAG Evaluation with RAGAS: Measuring What Actually Matters
  20. RAGAS Implementation
  21. Interpreting RAGAS Scores
  22. Automated Evaluation Pipeline
  23. Why This Matters in Production
  24. Real-World Example: Financial Document RAG
  25. Why RAG Quality Matters for Production Systems
  26. The Business Case for RAG Investment
  27. Integration with LLM Applications: The Complete Pipeline
  28. Moving Beyond Benchmarks to Real-World Performance
  29. The Path to RAG Excellence

The Chunking Problem: Why It's Harder Than It Looks

Here's what happens when you get chunking wrong: your LLM retrieves a document chunk that contains part of the answer, but lacks critical context. The model hallucinates to fill the gap. Users get plausible-sounding nonsense. Your RAG pipeline gets blamed. In reality, you sabotaged yourself at the chunking stage.

Why Chunking Matters So Much

The quality of your RAG system is fundamentally constrained by the quality of your chunks. Even the best embedding model and the most powerful LLM can't answer a question if the relevant information isn't in the retrieved chunks. Conversely, if you deliver complete, contextually rich chunks, even a basic embedding model can work reasonably well.

This is a hard truth that many teams learn painfully. You can spend months optimizing your embedding model, upgrading to the latest vector database, implementing fancy reranking algorithms. But if your chunks are garbage - if they're split at semantic boundaries that make no sense - all that optimization is rearranging deck chairs. The fundamental input to your system is broken, and no amount of downstream sophistication fixes that.

Think of chunking as the contract between your document collection and your retriever. The retriever promises: "I'll find the most relevant chunks for your question." But it can only find chunks that exist. If the answer to "What are the side effects of this medication?" is split across two chunks - one with symptom A and one with symptom B - the retriever might find chunk 1, deliver it to the LLM, and the LLM hallucinates symptom C because it's missing symptom B from the context. This isn't an embedding problem. It's a chunking problem. You promised the retriever that each chunk would be a coherent unit of meaning, and you broke that promise.

The deeper issue is that chunking is a lossy compression. The document has 50 pages of nuance and context. You break it into 100 chunks of ~500 tokens each. That's compression, and compression always loses information. The question is: what information do you lose? Do you lose structure (footnotes, cross-references, hierarchical organization)? Do you lose context (a paragraph about risks isolated from paragraphs about benefits)? Do you lose the thread that connects disparate facts? Good chunking minimizes these losses by respecting the document's inherent structure - semantic boundaries, not arbitrary token counts.

Think of chunking as the foundation of your information retrieval system. A chunk that's too small loses context. A chunk that's too large includes irrelevant noise that dilutes the signal and wastes the LLM's limited context window. A chunk that splits at the wrong boundary leaves critical information on the cutting room floor. You need to find the sweet spot that preserves semantic coherence while maintaining reasonable size.

Fixed-Size Chunking: The Dangerous Shortcut

Fixed-size chunking is tempting. You grab N tokens (usually 256–1024), slide a window across your document, and done. It's fast. It's simple. It's also often wrong.

Why it fails: Fixed-size chunks don't respect document semantics. You might split a sentence mid-thought. A paragraph about process risks gets separated from a paragraph about mitigations. When the LLM retrieves that chunk, it's getting incomplete information. The embedding model has to work with fragments that weren't semantically designed as units.

python
# Fixed-size chunking example
def fixed_size_chunking(text, chunk_size=512, overlap=50):
    """
    Naive fixed-size chunking. Fast, but semantically blind.
 
    Args:
        text: Full document text
        chunk_size: Number of tokens per chunk (approximate)
        overlap: Overlap between chunks in tokens
 
    Returns:
        List of text chunks
    """
    words = text.split()
    chunks = []
 
    step = max(chunk_size - overlap, 1)  # guard: overlap >= chunk_size would stall the loop
    for i in range(0, len(words), step):
        chunk = ' '.join(words[i:i + chunk_size])
        if chunk.strip():
            chunks.append(chunk)
 
    return chunks
 
# Example
document = """
Machine learning models require careful evaluation. Precision measures the accuracy
of positive predictions. Recall measures how many actual positives we found.
F1-score balances both metrics. For imbalanced datasets, precision-recall curves
matter more than confusion matrices.
"""
 
chunks = fixed_size_chunking(document, chunk_size=30, overlap=5)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk[:50]}...")
 
# Output (note that chunk 0 cuts off mid-sentence):
# Chunk 0: Machine learning models require careful evaluatio...
# Chunk 1: For imbalanced datasets, precision-recall curves m...

When to use it: Truly tabular data (CSV rows, database records) where semantic boundaries are artificial anyway. Otherwise, reconsider.

Sentence-Based Chunking: Better Boundaries, Same Blind Spot

Sentence-based chunking respects document structure. You chunk on sentence boundaries, typically keeping 3–5 sentences per chunk. This preserves some coherence.

python
import re
 
def sentence_chunking(text, sentences_per_chunk=3):
    """
    Chunk on sentence boundaries.
    Respects document structure better than fixed-size.
 
    Args:
        text: Full document text
        sentences_per_chunk: Number of sentences per chunk
 
    Returns:
        List of text chunks
    """
    # Simple sentence splitter (production code would use nltk.sent_tokenize)
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks = []
 
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = ' '.join(sentences[i:i + sentences_per_chunk])
        if chunk.strip():
            chunks.append(chunk)
 
    return chunks
 
document = """
Machine learning models require careful evaluation. Precision measures the accuracy
of positive predictions. Recall measures how many actual positives we found.
F1-score balances both metrics. For imbalanced datasets, precision-recall curves
matter more than confusion matrices.
"""
 
chunks = sentence_chunking(document, sentences_per_chunk=2)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk}")
 
# Output:
# Chunk 0: Machine learning models require careful evaluation. Precision measures the accuracy of positive predictions.
# Chunk 1: Recall measures how many actual positives we found. F1-score balances both metrics.
# Chunk 2: For imbalanced datasets, precision-recall curves matter more than confusion matrices.

The catch: Sentences vary wildly in length. A technical document might have 10-word sentences next to 50-word sentences. Your chunks become inconsistent in size and information density. One chunk might be 50 tokens, another 500.

Semantic Chunking: Respecting Document Structure

Semantic chunking splits documents based on meaning. You measure the embedding distance between consecutive sentences. When the distance exceeds a threshold, you start a new chunk. This groups semantically related content together.

python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
 
def semantic_chunking(sentences, embeddings, threshold=0.5):
    """
    Chunk based on semantic similarity between sentences.
 
    Args:
        sentences: List of sentences
        embeddings: List of embedding vectors (one per sentence)
        threshold: Cosine similarity threshold for new chunk
 
    Returns:
        List of chunks (as strings)
    """
    chunks = []
    current_chunk = [sentences[0]]
 
    for i in range(1, len(sentences)):
        # Cosine similarity between consecutive sentence embeddings
        similarity = cosine_similarity(
            [embeddings[i-1]],
            [embeddings[i]]
        )[0][0]
 
        # If similarity drops below threshold, start new chunk
        if similarity < threshold:
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
 
    # Don't forget the last chunk
    chunks.append(' '.join(current_chunk))
    return chunks
 
# Demonstration with mock embeddings
sentences = [
    "Machine learning models require careful evaluation.",
    "Precision measures the accuracy of positive predictions.",
    "Recall measures how many actual positives we found.",
    "The weather is nice today.",  # Intentional semantic break
    "F1-score balances precision and recall."
]
 
# Mock embeddings (in practice, use a real embedding model)
embeddings = [
    np.array([1.0, 0.8, 0.7, 0.1]),
    np.array([0.9, 0.85, 0.6, 0.2]),
    np.array([0.8, 0.9, 0.5, 0.15]),
    np.array([0.1, 0.1, 0.1, 0.9]),  # Very different
    np.array([0.85, 0.8, 0.75, 0.1])
]
 
chunks = semantic_chunking(sentences, embeddings, threshold=0.5)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk}\n")
 
# Output:
# Chunk 0: Machine learning models require careful evaluation. Precision measures the accuracy of positive predictions. Recall measures how many actual positives we found.
#
# Chunk 1: The weather is nice today.
#
# Chunk 2: F1-score balances precision and recall.

Why it works: This respects document intent. When a paragraph shifts topics, semantic chunking detects it and creates a boundary. When sentences stay topically aligned, they stay together. Your chunks have natural, content-aware boundaries.

Recursive Hierarchical Text Splitting: The Production Approach

In the real world, you don't want a single threshold working for all documents. Research papers have different semantic structure than customer support tickets. That's where recursive chunking enters: try splitting at increasingly granular levels (document, sections, paragraphs, sentences, tokens) until you fit within size limits.

python
from langchain.text_splitter import RecursiveCharacterTextSplitter
 
def recursive_chunking(text, chunk_size=1024, chunk_overlap=200):
    """
    Recursive hierarchical splitting (LangChain approach).
    Splits on multiple separators in order: document structure > paragraphs > sentences > tokens.
 
    Args:
        text: Full document text
        chunk_size: Target chunk size in tokens
        chunk_overlap: Overlap between chunks in tokens
 
    Returns:
        List of text chunks
    """
    splitter = RecursiveCharacterTextSplitter(
        separators=[
            "\n\n",      # Paragraph breaks (document structure)
            "\n",        # Line breaks (section boundaries)
            ". ",        # Sentence boundaries
            " ",         # Word boundaries
            ""           # Character fallback
        ],
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,  # In production, use token counter
    )
 
    chunks = splitter.split_text(text)
    return chunks
 
document = """
# Machine Learning Evaluation
 
## Metrics Overview
 
Machine learning models require careful evaluation. Several key metrics inform model quality.
 
### Precision and Recall
 
Precision measures the accuracy of positive predictions. If your model predicts 100 cases as positive,
and 80 are actually positive, precision is 0.8. Recall measures how many actual positives we found.
If there are 200 actual positive cases in the dataset, and we found 80, recall is 0.4.
 
F1-score balances both metrics. It's the harmonic mean: F1 = 2 * (precision * recall) / (precision + recall).
 
### When to Use What
 
For imbalanced datasets, precision-recall curves matter more than confusion matrices. A classifier
that predicts everything as negative will have high accuracy but zero recall.
"""
 
chunks = recursive_chunking(document, chunk_size=200, chunk_overlap=50)
for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i} ({len(chunk)} chars) ---")
    print(chunk)
    print()
 
# Example output (exact boundaries vary by splitter version):
# --- Chunk 0 (201 chars) ---
# # Machine Learning Evaluation
#
# ## Metrics Overview
#
# Machine learning models require careful evaluation. Several key metrics inform model quality.
#
# --- Chunk 1 (226 chars) ---
# ### Precision and Recall
#
# Precision measures the accuracy of positive predictions. If your model predicts 100 cases as positive,
# and 80 are actually positive, precision is 0.8. Recall measures how many actual positives we found.
# ...

Why this wins: It adapts to document structure. Markdown headers and paragraph breaks act as natural split points. If a single section fits within chunk size, it stays whole. If it doesn't, you keep recursing until it does. You get semantic chunks that respect document hierarchy.

Tuning Chunk Size: Finding Your Sweet Spot

The optimal chunk size depends on several factors. A typical range is 256–1024 tokens, but you need to understand the tradeoffs. Smaller chunks (256–512 tokens) are semantically focused but require more retrieval to answer complex questions. Larger chunks (768–1024 tokens) provide more context but may include irrelevant information that dilutes ranking signals.

Consider your LLM's context window, your typical query complexity, and your embedding model's strengths. Experiment with your actual domain data, measure retrieval quality using RAGAS metrics (covered later), and iterate.
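To make that iteration concrete, here's a minimal, hypothetical sweep harness. It uses naive word-based chunking and crude lexical overlap as a stand-in retriever; the corpus, query, and `hit_rate` helper are invented for illustration. A real tuning loop would embed the chunks and score with RAGAS instead.

python
# Illustrative chunk-size sweep. Lexical overlap stands in for real
# embedding-based retrieval; the data below is made up.

def chunk_words(text, size):
    """Split text into word-based chunks of roughly `size` words."""
    words = text.split()
    return [' '.join(words[i:i + size]) for i in range(0, len(words), size)]

def lexical_score(query, chunk_text):
    """Crude relevance proxy: fraction of query words present in the chunk."""
    q = set(query.lower().split())
    return len(q & set(chunk_text.lower().split())) / max(len(q), 1)

def hit_rate(corpus, eval_set, size):
    """Fraction of queries whose best-scoring chunk contains the known answer."""
    chunks = [c for doc in corpus for c in chunk_words(doc, size)]
    hits = 0
    for query, answer in eval_set:
        best = max(chunks, key=lambda c: lexical_score(query, c))
        hits += answer.lower() in best.lower()
    return hits / len(eval_set)

corpus = ["Q3 revenue reached $1.2B, up from $850M in Q2. "
          "Cloud services grew 34% year-over-year."]
eval_set = [("What was Q2 revenue?", "$850M")]

for size in (4, 16):
    print(f"chunk size {size:2d}: hit rate {hit_rate(corpus, eval_set, size):.2f}")

At size 4 the answer is stranded in a chunk that shares no words with the query, so retrieval whiffs; at size 16 the whole passage survives in one chunk. Swapping `lexical_score` for embedding cosine similarity turns this into a real sweep.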


Embedding Model Selection: Benchmarks, Cost, and Latency

Chunking creates the raw material. Embeddings convert that material into vector space. Which embedding model you choose affects retrieval quality, cost, and latency. This isn't a "pick OpenAI and move on" decision.

The embedding model is your semantic translator. It takes text - your documents, your user's question - and converts it to high-dimensional vectors where semantic similarity becomes geometric proximity. Two documents about "financial performance" end up near each other in embedding space. A document about "revenue" ends up near "financial performance" but not near "customer support." The quality of this translation determines whether your retriever can find relevant documents.

But there's a hidden complexity: not all embeddings are equal. Different models have different semantic sensitivities. Some are trained on web data and understand news/blog language better. Some are trained on scientific papers and understand technical language. Some are trained on instructional data and understand task-based language. OpenAI's model is trained on diverse web data. E5-large is trained with contrastive learning (similar documents pulled together, dissimilar ones pushed apart). BGE models use weak supervision from click logs. These different training regimes produce embeddings with different properties. On your domain's questions, one model might be 5% better than another. That's the difference between your system working well and barely working.

Cost matters too. If you have 10 million documents to embed, that's a one-time cost of passing those documents through your embedding model. OpenAI charges per embedding. E5-large is free (open-source). If you have a million documents, that's potentially hundreds of dollars with OpenAI versus zero with E5. That cost difference becomes negligible if E5 forces you to implement your own vector database versus using a managed service, but for many teams, it's significant enough to shift the decision. The professional approach is to measure quality on your actual data, estimate infrastructure costs, then make an informed decision. Don't just pick the fanciest model - pick the best model for your constraints.
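A back-of-envelope calculator makes that comparison concrete. The per-token price below is the $0.13/1M-token figure from the comparison table later in this section; the corpus size and average document length are illustrative assumptions.

python
# One-time indexing cost for an API-priced embedding model. The rate
# matches the article's table for text-embedding-3-large; document
# counts and lengths here are illustrative.

def embedding_cost_usd(num_docs, avg_tokens_per_doc, price_per_million_tokens):
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_tokens

# 1M documents averaging 2,000 tokens, at $0.13 per 1M tokens:
print(f"${embedding_cost_usd(1_000_000, 2_000, 0.13):,.2f}")  # $260.00

# An open-source model costs $0 in API fees for the same corpus;
# the spend shifts to your own GPU/CPU time and serving infrastructure.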

The MTEB Benchmark Landscape

The Massive Text Embedding Benchmark (MTEB) evaluates embeddings across 8 task categories: bitext mining, classification, clustering, pair classification, reranking, retrieval, STS (semantic textual similarity), and summarization. It's the industry standard.

Here's how the leading models stack up (simplified MTEB results, 2024):

| Model | Retrieval (avg) | Latency (ms/1k) | Cost (per 1M tokens) | Multilingual? |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 64.2% | 45 | $0.13 | Limited (English-focused) |
| E5-large-v2 (Microsoft) | 63.8% | 12 | $0.00 (OSS) | English (multilingual-e5 variant available) |
| BGE-M3 (BAAI) | 62.1% | 8 | $0.00 (OSS) | 100+ languages |
| Voyage AI voyage-3 | 65.1% | 50 | $0.15 | Limited |
| Cohere embed-english-v3.0 | 63.9% | 25 | $0.10 | English only |

Key insights:

  • OpenAI's text-embedding-3-large posts strong retrieval numbers (+0.4 points over E5), but costs money and has higher latency. Use it when you want a fully managed API, need maximum accuracy, and can absorb the cost.

  • E5-large-v2 offers 99.4% of OpenAI's performance (63.8% vs 64.2%) at zero cost (it's open-source) with 3.75x lower latency. For most teams, it's the sweet spot.

  • BGE-M3 is slightly cheaper on compute (fastest inference) and shines if you need multilingual support. It includes both sparse and dense retrieval capabilities (more on that later).

  • Voyage AI's voyage-3 currently leads the benchmark but costs more and has higher latency. Reserve it for mission-critical retrieval where the extra 1% accuracy justifies the expense.

Running Your Own Embedding Comparison

Don't trust generic benchmarks. Test on your actual data:

python
from sentence_transformers import SentenceTransformer
import time
import json
 
def benchmark_embeddings(texts, model_names):
    """
    Benchmark embedding models on your own data.
 
    Args:
        texts: List of text samples to embed
        model_names: List of model identifiers
 
    Returns:
        Benchmark results (latency, dimensionality, etc.)
    """
    results = {}
 
    for model_name in model_names:
        print(f"Loading {model_name}...")
        model = SentenceTransformer(model_name)
 
        # Warmup
        _ = model.encode(texts[:1])
 
        # Benchmark
        start = time.time()
        embeddings = model.encode(texts, batch_size=32)
        latency = (time.time() - start) / len(texts)
 
        results[model_name] = {
            "latency_ms_per_text": latency * 1000,
            "embedding_dim": embeddings.shape[1],
            "sample_embedding": embeddings[0][:5].tolist()  # First 5 dims
        }
 
    return results
 
# Sample texts from your domain
texts = [
    "The quarterly earnings report shows strong growth in cloud services.",
    "Customer retention improved by 15% compared to last quarter.",
    "We deployed the new authentication system to production.",
    "The machine learning model achieved 94% accuracy on test data.",
    "Database performance degraded due to unoptimized queries."
]
 
models_to_test = [
    "sentence-transformers/all-MiniLM-L6-v2",  # Lightweight baseline
    "sentence-transformers/all-mpnet-base-v2",  # Medium
    "intfloat/e5-large-v2",                     # Heavy hitter
]
 
results = benchmark_embeddings(texts, models_to_test)
for model, metrics in results.items():
    print(f"\n{model}:")
    print(f"  Latency: {metrics['latency_ms_per_text']:.2f} ms/text")
    print(f"  Dimension: {metrics['embedding_dim']}")
 
# Example output (latency depends on hardware and batch size):
# sentence-transformers/all-MiniLM-L6-v2:
#   Latency: 0.15 ms/text
#   Dimension: 384
#
# sentence-transformers/all-mpnet-base-v2:
#   Latency: 0.42 ms/text
#   Dimension: 768
#
# intfloat/e5-large-v2:
#   Latency: 1.85 ms/text
#   Dimension: 1024

Decision framework: If you have <1M documents, speed barely matters; optimize for accuracy (E5-large-v2 or OpenAI). If you have >10M documents or sub-100ms latency requirements, use MiniLM-L6 or similar lightweight models, accepting a small recall penalty. If you need multilingual support, multilingual-e5-large or BGE-M3 are non-negotiable.
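That framework is simple enough to encode as a tiny helper. The thresholds and model IDs below just mirror the rules of thumb above; treat them as starting points, not hard limits.

python
# The decision framework above as a small function. Thresholds are the
# article's rules of thumb; tune them for your own constraints.

def pick_embedding_model(num_docs, needs_multilingual=False, max_latency_ms=None):
    if needs_multilingual:
        return "intfloat/multilingual-e5-large or BAAI/bge-m3"
    if num_docs > 10_000_000 or (max_latency_ms is not None and max_latency_ms < 100):
        # Speed first: accept a small recall penalty.
        return "sentence-transformers/all-MiniLM-L6-v2"
    # Accuracy first: the corpus is small enough that latency barely matters.
    return "intfloat/e5-large-v2 or OpenAI text-embedding-3-large"

print(pick_embedding_model(500_000))
print(pick_embedding_model(50_000_000))
print(pick_embedding_model(100_000, needs_multilingual=True))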


Reranking: The Hidden Multiplier for RAG Precision

Here's a dirty secret: vector search sucks at finding needles in haystacks. You ask your RAG system about "Q3 revenue," and it retrieves 10 documents. The first 7 are irrelevant noise. The answer is in document 8, which your LLM never sees because you only looked at top-5.

Vector search works by computing the cosine similarity between your query's embedding and each document's embedding. Cosine similarity is fast - it's just dot products. But it's also coarse. Two documents might both have semantic relation to "revenue" without being equally relevant to "Q3 revenue." One might be about historical revenue trends (related but not directly answering your question). Another might be the actual earnings report with specific Q3 numbers (exactly what you want). Vector similarity might rank them nearly the same because they're both about "revenue." But they're not equally useful.

Reranking fixes this by using a more nuanced relevance model. Instead of computing similarity between separate embeddings, a cross-encoder computes a relevance score directly from the query-document pair. This gives the model more signal to work with. It can see the full query and full document text, understand the specific relationship between them, and score accordingly. The earnings report gets 0.92 (high relevance). The historical trends article gets 0.55 (some relevance but not directly responsive). Now your top-5 includes what you actually needed.

The cost is latency and compute. A cross-encoder can't be precomputed like embeddings. Every query triggers new ranking computations. But here's the brilliant part: you only rerank the top-K from vector search, typically top-100 or top-200. So vector search does the heavy lifting (filter millions of documents down to 100), then reranking does the precision work (order those 100 correctly). This two-stage pipeline is cheap and effective. Vector search + reranking together often outperform even the best embedding models operating alone. The gains are real - typically 5-15% improvement in retrieval recall, which translates directly to fewer hallucinations and more accurate LLM answers.

The recipe in one sentence: dense retrieval returns its top-K candidates cheaply, a more expensive but more accurate cross-encoder rescores them, and you keep only the best few.

How Reranking Works

A cross-encoder reranker takes a query-document pair as a single input and outputs a relevance score directly; it never decomposes the pair into separate embeddings. Seeing both texts together gives it far more signal for scoring relevance.

Dense Retrieval:
  Query: "What was Q3 revenue?"

  Vector Search → [doc_1 (0.82), doc_2 (0.79), doc_3 (0.76),
                   doc_4 (0.71), doc_5 (0.68)]

                  (Note: doc_8 was 0.61, not in top-5)

Cross-Encoder Reranking (on top-5 + doc_8):

  Query vs doc_1 → score: 0.45 (not actually about Q3 revenue)
  Query vs doc_2 → score: 0.92 (strong match)
  Query vs doc_3 → score: 0.38 (irrelevant)
  Query vs doc_4 → score: 0.89 (strong match)
  Query vs doc_5 → score: 0.51 (some relevance)
  Query vs doc_8 → score: 0.91 (strong match!)

  Reranked: [doc_2 (0.92), doc_8 (0.91), doc_4 (0.89),
             doc_5 (0.51), doc_1 (0.45), doc_3 (0.38)]

Notice how reranking surfaced doc_8, which was buried in vector search but is actually highly relevant.

Reranker Implementation

python
from sentence_transformers import CrossEncoder
import numpy as np
 
class RAGReranker:
    """
    Rerank documents using a cross-encoder model.
    """
 
    def __init__(self, model_name="cross-encoder/mmarco-mMiniLMv2-L12-H384-v1"):
        """
        Args:
            model_name: HuggingFace model ID for cross-encoder
        """
        self.model = CrossEncoder(model_name)
 
    def rerank(self, query, documents, top_k=5):
        """
        Rerank documents for a query.
 
        Args:
            query: User question
            documents: List of document strings
            top_k: How many to keep after reranking
 
        Returns:
            List of (document, score) tuples, sorted by score descending
        """
        # Prepare query-document pairs
        pairs = [[query, doc] for doc in documents]
 
        # Score all pairs
        scores = self.model.predict(pairs)
 
        # Sort by score descending
        ranked = sorted(
            zip(documents, scores),
            key=lambda x: x[1],
            reverse=True
        )
 
        # Return top-k
        return ranked[:top_k]
 
# Example
reranker = RAGReranker()
 
query = "What was Q3 revenue?"
documents = [
    "Q3 2024 saw strong growth in cloud services, with 34% year-over-year increase.",
    "Our customer support team handled 10,000 tickets this quarter.",
    "Q3 revenue reached $1.2B, up from $850M in Q2.",
    "The engineering team deployed 47 features this quarter.",
    "Q3 employee satisfaction scores improved to 4.2/5.0.",
]
 
reranked = reranker.rerank(query, documents, top_k=3)
for doc, score in reranked:
    print(f"Score: {score:.3f} | {doc[:60]}...")
 
# Example output (score scale depends on the model):
# Score: 0.987 | Q3 revenue reached $1.2B, up from $850M in Q2.
# Score: 0.821 | Q3 2024 saw strong growth in cloud services, with 34% year-over-year increase.
# Score: 0.456 | Q3 employee satisfaction scores improved to 4.2/5.0.

Reranker Comparison

| Model | Latency (ms/pair) | Accuracy | Cost | Best For |
|---|---|---|---|---|
| Cohere rerank-english-v3.0 | 150 | 0.89 | $0.01/1M reranks | Production, high precision |
| BGE-reranker-large | 45 | 0.87 | Free (OSS) | Cost-sensitive, similar accuracy |
| cross-encoder/ms-marco-MiniLM | 12 | 0.80 | Free (OSS) | Real-time, lower precision tolerance |
| cross-encoder/mmarco-mMiniLMv2 | 8 | 0.78 | Free (OSS) | Multilingual |

The tradeoff: Cohere's reranker is slightly more accurate but costs money and adds ~150ms latency per document set. BGE-reranker-large is nearly as good, free, and 3x faster. Use Cohere if you're reranking <100 docs per query and can absorb the cost. Use BGE otherwise.


Hybrid Search: Dense + Sparse = Better Recall

Pure vector search has a blind spot: it finds semantically similar documents, but it can fumble exact terms. If the user's query hinges on a rare token - a product code, an error string, a proper noun - embeddings may blur it, and vector search can miss the document that contains it verbatim. That's where sparse retrieval (BM25) enters.

BM25 is a traditional keyword-matching algorithm. It scores documents based on term frequency and document length. It's not semantic, but it's damn good at finding keyword matches.

Hybrid search combines both:

  1. Query vector search: get top-K by semantic similarity
  2. Query BM25: get top-K by keyword matching
  3. Merge results using Reciprocal Rank Fusion (RRF)
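To ground the BM25 half of that recipe, here's a from-scratch sketch of the classic Okapi BM25 formula (term-frequency saturation via k1, length normalization via b). Production systems would use a library like rank_bm25 or a search engine's built-in BM25; the whitespace tokenizer and sample documents here are deliberately naive.

python
import math

# Minimal Okapi BM25 scorer. Whitespace tokenization is deliberately
# naive; real systems lowercase, strip punctuation, and stem.

def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n  # average document length
    scores = []
    for doc in tokenized:
        score = 0.0
        for term in query.lower().split():
            tf = doc.count(term)                    # term frequency in this doc
            df = sum(term in d for d in tokenized)  # docs containing the term
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            norm = tf + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "Q3 revenue reached $1.2B, up from $850M in Q2.",
    "Customer satisfaction improved in Q3.",
    "The engineering team deployed 47 features.",
]
scores = bm25_scores("Q3 revenue", docs)
print(docs[max(range(len(docs)), key=scores.__getitem__)])  # the Q3 revenue doc wins

Note how literal the matching is: the keyword hit on "Q3" and "revenue" is what ranks the first document on top, with no embeddings involved. That literalness is exactly what the dense side of hybrid search lacks.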

Architecture Diagram

mermaid
graph LR
    A["User Query"] --> B["Tokenize"]
 
    B --> C["Dense Embedding<br/>(E5-large)"]
    B --> D["BM25 Index"]
 
    C --> E["Vector Search<br/>top-K = 10"]
    D --> F["BM25 Search<br/>top-K = 10"]
 
    E --> G["Reciprocal Rank Fusion<br/>(merge & rerank)"]
    F --> G
 
    G --> H["Reranker<br/>(cross-encoder)"]
    H --> I["Top-K Documents<br/>to LLM"]

RRF Implementation

python
def reciprocal_rank_fusion(dense_results, sparse_results, top_k=10):
    """
    Merge dense and sparse search results using RRF.
 
    Args:
        dense_results: List of (doc_id, score) from vector search
        sparse_results: List of (doc_id, score) from BM25
        top_k: How many results to return
 
    Returns:
        List of (doc_id, fused_score) sorted by score descending
    """
    k = 60  # RRF constant (standard is 60)
    rrf_scores = {}
 
    # Score dense results
    for rank, (doc_id, _) in enumerate(dense_results, start=1):
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k + rank)
 
    # Score sparse results
    for rank, (doc_id, _) in enumerate(sparse_results, start=1):
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k + rank)
 
    # Sort by fused score
    fused = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
 
    return fused[:top_k]
 
# Example
dense_results = [
    ("doc_1", 0.92),
    ("doc_2", 0.88),
    ("doc_3", 0.85),
    ("doc_4", 0.79),
    ("doc_5", 0.76),
]
 
sparse_results = [
    ("doc_7", 45.2),  # High BM25 score (not in dense top-5)
    ("doc_2", 38.1),  # Also in dense results
    ("doc_8", 35.6),
    ("doc_1", 33.2),
    ("doc_9", 29.1),
]
 
fused = reciprocal_rank_fusion(dense_results, sparse_results, top_k=5)
print("Fused results (RRF):")
for i, (doc_id, score) in enumerate(fused, start=1):
    print(f"{i}. {doc_id}: {score:.4f}")
 
# Output:
# Fused results (RRF):
# 1. doc_2: 0.0323  (in both dense AND sparse top-5)
# 2. doc_1: 0.0320  (strong dense, decent sparse)
# 3. doc_7: 0.0164  (strong sparse, not in dense)
# 4. doc_3: 0.0159  (dense only)
# 5. doc_8: 0.0159  (sparse only)

Why this works: Documents that rank high in both dense and sparse search get a significant boost (doc_2). Documents that rank high in only one still get considered (doc_7 has high keyword relevance even if embeddings didn't catch it).

Weaviate Hybrid Search (Production)

python
import weaviate
from weaviate.classes.query import HybridFusion
 
# Connect to Weaviate
client = weaviate.connect_to_local()
 
# Create collection with both dense and sparse support
collection = client.collections.create(
    name="Documents",
    # Model selection lives on the text2vec-transformers inference container;
    # here we just enable that vectorizer module.
    vectorizer_config=weaviate.classes.config.Configure.Vectorizer.text2vec_transformers(),
    properties=[
        weaviate.classes.config.Property(
            name="content",
            data_type=weaviate.classes.config.DataType.TEXT,
            skip_vectorization=False,
        ),
        weaviate.classes.config.Property(
            name="source",
            data_type=weaviate.classes.config.DataType.TEXT,
            skip_vectorization=True,
        ),
    ]
)
 
# Index documents
documents = [
    {"content": "Q3 revenue reached $1.2B, up from $850M in Q2.", "source": "earnings_q3.pdf"},
    {"content": "Cloud services grew 34% year-over-year in Q3.", "source": "earnings_q3.pdf"},
    {"content": "Customer satisfaction improved in Q3.", "source": "survey_results.pdf"},
]
 
for doc in documents:
    collection.data.create(properties=doc)
 
# Hybrid search: combines BM25 and dense vector search
query = "Q3 revenue"
response = collection.query.hybrid(
    query=query,
    limit=10,
    fusion_type=HybridFusion.RELATIVE_SCORE,  # Normalize both to 0-1
    where=None  # Optional filtering
)
 
print(f"Hybrid search results for '{query}':")
for obj in response.objects:
    print(f"Score: {obj.score:.3f} | {obj.properties['content'][:50]}...")
 
# Output:
# Hybrid search results for 'Q3 revenue':
# Score: 0.987 | Q3 revenue reached $1.2B, up from $850M in Q2.
# Score: 0.892 | Cloud services grew 34% year-over-year in Q3.
# Score: 0.654 | Customer satisfaction improved in Q3.
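HybridFusion.RELATIVE_SCORE works differently from RRF: instead of using ranks, it min-max normalizes each result set's raw scores to the 0-1 range and sums them per document. A simplified sketch of that fusion logic (our own helper, not Weaviate's internals):

```python
def relative_score_fusion(dense_results, sparse_results, top_k=5):
    """Fuse two ranked lists by min-max normalizing each score set to [0, 1]
    and summing per document - a simplified take on relative-score fusion."""
    def normalize(results):
        scores = [score for _, score in results]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # guard against all-equal scores
        return {doc_id: (score - lo) / span for doc_id, score in results}

    fused = {}
    for normalized in (normalize(dense_results), normalize(sparse_results)):
        for doc_id, score in normalized.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + score

    return sorted(fused.items(), key=lambda x: x[1], reverse=True)[:top_k]

dense = [("doc_1", 0.92), ("doc_2", 0.88), ("doc_3", 0.85)]
sparse = [("doc_7", 45.2), ("doc_2", 38.1), ("doc_1", 33.2)]
for doc_id, score in relative_score_fusion(dense, sparse):
    print(f"{doc_id}: {score:.3f}")
```

The normalization matters because cosine similarities live in roughly 0-1 while BM25 scores are unbounded; without it, the sparse scores would swamp the dense ones.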

RAG Evaluation with RAGAS: Measuring What Actually Matters

You've built a chunking strategy, picked embedding models, added reranking, and implemented hybrid search. But does it work? Not "does it run without errors," but "does it actually improve answer quality?"

That's what RAGAS (RAG Assessment) measures. It evaluates RAG systems on four core metrics:

  1. Faithfulness: Did the model's answer stay true to the retrieved context, or hallucinate?
  2. Answer Relevancy: Does the answer actually address the user's question?
  3. Context Precision: How much of the retrieved context is actually relevant to the question?
  4. Context Recall: Did retrieval find all the information needed to answer the question?
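Conceptually, faithfulness decomposes the answer into individual claims and verifies each one against the retrieved context; RAGAS performs both the decomposition and the verification with an LLM judge. The toy word-overlap check below only illustrates the mechanics, not the real implementation:

```python
import re

def toy_faithfulness(answer_claims, contexts, overlap_threshold=0.5):
    """Fraction of answer claims whose words mostly appear in the retrieved
    context - a crude lexical stand-in for the LLM judge RAGAS actually uses."""
    context_words = set(re.findall(r"\w+", " ".join(contexts).lower()))
    supported = 0
    for claim in answer_claims:
        words = set(re.findall(r"\w+", claim.lower()))
        if len(words & context_words) / len(words) >= overlap_threshold:
            supported += 1
    return supported / len(answer_claims)

claims = [
    "Q3 revenue was $1.2B",             # supported by the context
    "Revenue grew 41% over Q2",         # supported by the context
    "The CEO announced a stock split",  # hallucinated: not in the context
]
contexts = ["Q3 revenue reached $1.2B, representing 41% growth over Q2."]
print(f"{toy_faithfulness(claims, contexts):.2f}")  # 0.67 - one claim of three unsupported
```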

RAGAS Implementation

python
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,  # note: RAGAS spells it "relevancy"
)
from datasets import Dataset
 
def evaluate_rag_pipeline(
    questions,
    ground_truth_answers,
    retrieved_contexts,
    generated_answers
):
    """
    Evaluate RAG pipeline using RAGAS metrics.
 
    Args:
        questions: List of user questions
        ground_truth_answers: List of reference answers
        retrieved_contexts: List of lists (retrieved docs per question)
        generated_answers: List of model-generated answers
 
    Returns:
        Scores for faithfulness, relevance, context precision/recall
    """
 
    # Create evaluation dataset
    eval_dataset = Dataset.from_dict({
        "question": questions,
        "ground_truth": ground_truth_answers,
        "contexts": retrieved_contexts,
        "answer": generated_answers,
    })
 
    # Define metrics to evaluate
    metrics = [
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ]
 
    # Run evaluation
    scores = evaluate(eval_dataset, metrics=metrics)
 
    return scores
 
# Example evaluation
questions = [
    "What was Q3 2024 revenue?",
    "How did customer satisfaction change in Q3?",
    "What new features were deployed in Q3?",
]
 
ground_truth_answers = [
    "Q3 2024 revenue was $1.2B, up from $850M in Q2.",
    "Customer satisfaction improved to 4.2/5.0, up from 3.8/5.0.",
    "47 features were deployed, with focus on cloud infrastructure.",
]
 
retrieved_contexts = [
    [
        "Q3 revenue reached $1.2B, representing 41% growth quarter-over-quarter.",
        "Q2 revenue was $850M, primarily from cloud services.",
    ],
    [
        "Q3 employee and customer satisfaction surveys showed improvement.",
        "Customer satisfaction score: 4.2/5.0, up from 3.8/5.0 in Q2.",
    ],
    [
        "Engineering team delivered 47 features in Q3.",
        "Major focus: cloud infrastructure modernization (25 features).",
        "Secondary focus: developer experience (15 features).",
    ],
]
 
generated_answers = [
    "Q3 2024 revenue was $1.2B, representing 41% growth from Q2's $850M.",
    "Customer satisfaction improved to 4.2/5.0 in Q3, up from 3.8/5.0.",
    "The team deployed 47 features in Q3, focusing on cloud infrastructure.",
]
 
# Run evaluation
scores = evaluate_rag_pipeline(
    questions,
    ground_truth_answers,
    retrieved_contexts,
    generated_answers
)
 
print("RAGAS Evaluation Results:")
print(f"Faithfulness:      {scores['faithfulness']:.3f}  (0-1, higher is better)")
print(f"Answer Relevancy:  {scores['answer_relevancy']:.3f}")
print(f"Context Precision: {scores['context_precision']:.3f}")
print(f"Context Recall:    {scores['context_recall']:.3f}")
 
aggregate = (
    scores['faithfulness'] * 0.4 +
    scores['answer_relevancy'] * 0.3 +
    scores['context_precision'] * 0.2 +
    scores['context_recall'] * 0.1
)
print(f"\nWeighted Aggregate: {aggregate:.3f}")
 
# Output:
# RAGAS Evaluation Results:
# Faithfulness:      0.894  (0-1, higher is better)
# Answer Relevancy:  0.927
# Context Precision: 0.856
# Context Recall:    0.912
#
# Weighted Aggregate: 0.898

Interpreting RAGAS Scores

| Metric | What It Measures | Good Score | If Low | Fix |
|---|---|---|---|---|
| Faithfulness | Does answer avoid hallucination? | >0.85 | Model adds facts not in context | Use more constrained prompting |
| Answer Relevancy | Does answer address the question? | >0.90 | Generated answer misses key points | Improve retrieval, add context |
| Context Precision | How much retrieved context is useful? | >0.80 | Lots of noise in top-K | Improve chunking, add reranking |
| Context Recall | Did retrieval find all needed docs? | >0.85 | Critical info was missed | Improve embedding model, use hybrid search |
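These thresholds can be encoded as an automated gate that fails a build when any metric drops below its target (the threshold values and helper name are our suggestions, not RAGAS defaults; keys follow RAGAS metric names):

```python
# Thresholds mirror the "Good Score" column above.
THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.90,
    "context_precision": 0.80,
    "context_recall": 0.85,
}

FIXES = {
    "faithfulness": "use more constrained prompting",
    "answer_relevancy": "improve retrieval, add context",
    "context_precision": "improve chunking, add reranking",
    "context_recall": "improve embedding model, use hybrid search",
}

def check_rag_scores(scores):
    """Return (passed, diagnostics) for a dict of aggregate RAGAS scores."""
    diagnostics = [
        f"{metric} = {scores[metric]:.3f} is below {threshold}: {FIXES[metric]}"
        for metric, threshold in THRESHOLDS.items()
        if scores[metric] < threshold
    ]
    return (not diagnostics, diagnostics)

passed, diagnostics = check_rag_scores({
    "faithfulness": 0.894,
    "answer_relevancy": 0.927,
    "context_precision": 0.756,  # below the 0.80 threshold
    "context_recall": 0.912,
})
print(passed)  # False
for line in diagnostics:
    print(line)
```

Wire this into CI and a bad chunking change gets caught before it ships, not after users notice.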

Automated Evaluation Pipeline

In production, you want automated evaluation for every chunking/embedding configuration change:

python
import json
from datetime import datetime
 
def run_evaluation_experiment(
    experiment_name,
    chunking_strategy,
    embedding_model,
    use_reranking=False,
    use_hybrid=False,
    test_dataset_path="test_questions.jsonl"
):
    """
    Run a complete evaluation experiment with different configurations.
    """
 
    # Load test dataset
    with open(test_dataset_path) as f:
        test_data = [json.loads(line) for line in f]
 
    questions = [item["question"] for item in test_data]
    ground_truth = [item["answer"] for item in test_data]
 
    # Retrieve and generate answers with this configuration
    # (retrieve_documents, rerank_documents, and generate_with_context are
    # pipeline helpers assumed to be defined elsewhere in your codebase)
    retrieved_contexts = retrieve_documents(
        questions,
        chunking_strategy=chunking_strategy,
        embedding_model=embedding_model,
        use_hybrid=use_hybrid,
    )
 
    if use_reranking:
        retrieved_contexts = rerank_documents(retrieved_contexts)
 
    generated_answers = generate_with_context(questions, retrieved_contexts)
 
    # Evaluate
    scores = evaluate_rag_pipeline(
        questions,
        ground_truth,
        retrieved_contexts,
        generated_answers
    )
 
    # Log results
    result = {
        "timestamp": datetime.now().isoformat(),
        "experiment": experiment_name,
        "config": {
            "chunking": chunking_strategy,
            "embedding_model": embedding_model,
            "reranking": use_reranking,
            "hybrid_search": use_hybrid,
        },
        "metrics": {
            "faithfulness": scores['faithfulness'],
            "answer_relevancy": scores['answer_relevancy'],
            "context_precision": scores['context_precision'],
            "context_recall": scores['context_recall'],
        }
    }
 
    with open("rag_experiments.jsonl", "a") as f:
        f.write(json.dumps(result) + "\n")
 
    return result
 
# Run experiments
configurations = [
    ("fixed_size_no_rerank", "fixed_512", "e5-large-v2", False, False),
    ("semantic_with_rerank", "semantic_0.5", "e5-large-v2", True, False),
    ("recursive_hybrid_rerank", "recursive_1024", "e5-large-v2", True, True),
]
 
results = []
for exp_name, chunk_strat, embed_model, rerank, hybrid in configurations:
    result = run_evaluation_experiment(
        exp_name, chunk_strat, embed_model, rerank, hybrid
    )
    results.append(result)
    print(f"{exp_name}: {result['metrics']['answer_relevancy']:.3f}")
 
# Find best configuration
best = max(results, key=lambda x: x['metrics']['answer_relevancy'])
print(f"\nBest: {best['experiment']} (relevance: {best['metrics']['answer_relevancy']:.3f})")

Why This Matters in Production

A well-engineered RAG pipeline dramatically improves LLM reliability on domain-specific questions. Without proper chunking, embedding selection, reranking, and evaluation, you end up with a system that looks like it works but regularly hallucinates and misses information.

The difference between a naive RAG pipeline and a properly engineered one is often a 20-30% improvement in answer correctness. That's not a small optimization - it's the difference between a system users trust and one they don't.

Invest time upfront in getting these fundamentals right. The cost of engineering a solid RAG pipeline is small compared to the cost of deploying an unreliable one and losing user trust.


Real-World Example: Financial Document RAG

Consider a financial services company building a RAG system to answer questions about regulatory compliance documents. Regulations are nuanced. A question about "risk mitigation requirements" might need context from five different sections of the regulations. With naive chunking, each section becomes a separate chunk. When the retriever finds one section, it returns that chunk to the LLM. The LLM doesn't see the other four sections and hallucinates a compliance requirement that doesn't exist. An analyst makes a decision based on the hallucination. The company violates the regulation and faces penalties.

This specific failure happened to a team we worked with. They were getting 70% accuracy on their compliance Q&A system and assumed the embedding model was weak, so they switched to the best embedding model on MTEB. Accuracy stayed at 70%. The real problem was chunking: regulatory requirements were split across multiple chunks with no connection between them. Switching from fixed-size chunking to semantic chunking that grouped related regulatory text together pushed accuracy to 92%. The improvement came from fundamentals, not fancier models.

Why RAG Quality Matters for Production Systems

The difference between a RAG system that occasionally hallucinates and one that's reliable is the difference between a tool users tolerate and a tool they trust. Trust is fragile: one hallucination at a critical moment and your users stop believing anything the system tells them. They stop using it, and your investment returns nothing.

This is why we emphasize getting the fundamentals right. Chunking, embeddings, reranking, and evaluation aren't optional optimizations. They're foundational decisions that determine whether your system works or fails. A system that retrieves irrelevant chunks will hallucinate. A system with poor embeddings will miss relevant information. A system without reranking will bury answers in a pile of noise. A system without evaluation will ship broken to production.

The Business Case for RAG Investment

From a pure economic perspective, a RAG system that works well is cheaper and faster than any alternative. Building a knowledge base manually and updating it constantly costs money and time. RAG automates this. Your LLM becomes aware of your proprietary information automatically. New documents get indexed instantly. Questions get answered accurately. The system scales with your document collection, not with headcount.

The mistake many teams make is thinking RAG is "just wrapping an LLM around a database." The engineering is in the details. Which chunking strategy minimizes hallucination? Which embedding model balances accuracy and cost? Is reranking worth the latency? These decisions have real impact on whether users trust the system. Getting them right is the difference between a system that's a useful assistant and one that's a liability because it confidently generates wrong answers.

Integration with LLM Applications: The Complete Pipeline

A RAG system is only valuable if it integrates cleanly with your LLM application. The full pipeline looks like: user question comes in, retrieval happens, top documents are passed to the LLM as context, the LLM generates an answer using that context, you return the answer to the user. But there are hidden details in each step. How many chunks should you retrieve? Five? Ten? More? How do you format the context for the LLM? Do you include source attribution? How do you handle edge cases where retrieval finds nothing?

These decisions compound. A system that always retrieves exactly five chunks might miss the answer if the answer requires six chunks. A system that retrieves twenty chunks might include so much noise that the LLM gets confused. You need empirical testing on your actual data to find the right number. Similarly, if you don't include source attribution in context, the LLM might hallucinate sources. Users then cite those hallucinated sources. You look bad. Including sources means you need to track which chunk came from which document, and maintain document metadata throughout the pipeline.
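One way to settle the formatting and attribution questions is to number each chunk, carry its source into the prompt, and handle the empty-retrieval edge case explicitly. The template and helper below are illustrative, not a standard:

```python
def build_prompt(question, chunks):
    """Format retrieved chunks with numbered source attributions for the LLM.

    chunks: list of dicts with 'content' and 'source' keys, as stored at
    index time (metadata must survive the whole pipeline for this to work).
    """
    if not chunks:
        # Edge case: retrieval found nothing. Say so explicitly instead of
        # letting the model answer from its parametric memory.
        return (
            f"No relevant documents were found for: {question}\n"
            "Reply that you don't have enough information to answer."
        )

    context = "\n".join(
        f"[{i}] ({chunk['source']}) {chunk['content']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer using ONLY the context below. Cite sources as [1], [2], etc.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

chunks = [
    {"content": "Q3 revenue reached $1.2B.", "source": "earnings_q3.pdf"},
    {"content": "Cloud services grew 34% YoY.", "source": "earnings_q3.pdf"},
]
print(build_prompt("What was Q3 revenue?", chunks))
```

Because every chunk carries a real source, the model can only cite documents that actually exist - and you can render those citations as links in your UI.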

Moving Beyond Benchmarks to Real-World Performance

MTEB benchmarks are useful starting points, but they don't tell the whole story of your system. A model that scores 65% on the benchmark might work perfectly on your domain because your domain distribution differs from the benchmark distribution. A model that scores 60% might struggle because your domain has unique linguistic patterns. The only real test is trying it on your actual data with your actual queries and measuring results with your actual metric.

This is why evaluation matters so much. RAGAS gives you a way to measure whether your configuration actually works on your domain. Build the habit of evaluating every change. New embedding model? Evaluate. New chunking strategy? Evaluate. New reranking model? Evaluate. Over time, you build intuition about what works in your domain. You stop relying on generic advice and start trusting measurements.

The Path to RAG Excellence

Excellence in RAG comes from disciplined investment in these fundamentals. Start with semantic chunking that respects document structure. Test three to five embedding models on your actual data and pick the best one. Add reranking to improve precision. Add hybrid search to improve recall. Evaluate everything with RAGAS. Iterate. This process isn't glamorous. It doesn't involve cutting-edge research. But it's how you build systems that work reliably in production.

The teams that excel at RAG aren't the ones using the latest fancy model. They're the ones who invested time understanding their domain, building evaluation pipelines, and systematically improving their chunks, embeddings, and reranking. They understand that RAG is fundamentally an information retrieval problem, not an LLM problem. The LLM is the output layer. The retrieval pipeline is where the real work happens.

