Why RAG Pipelines Fail: 5 Common Pitfalls (and What to Watch For)
RAG works in demos, fails in production. The five failure modes below cause most of those failures — and they all show up downstream of the data layer, not the model.
- Schema drift breaks retrieval silently — set up distributional-shift monitoring (KS-stat for numeric columns, chi-squared for categorical) on every field.
- Near-duplicate detection is a recall problem at scale; hash-based dedup misses 30%+.
- Chunk-strategy mistakes (fixed-size on prose) cause retrieval recall collapse.
- Retrieval recall@10 is the metric most teams never measure, and the one most failures trace back to.
- Embedding-model mismatch between query and corpus is the silent killer.
A RAG pipeline is a retrieval-augmented generation system — a pattern that routes a query through a vector index to pull relevant document chunks, then passes those chunks to a language model as context. When it works, the model answers from your corpus rather than from its training weights. When it fails, it answers confidently from neither, and the failure is almost never visible in the output.
That last point is what makes RAG failures hard to catch. The model still generates text. The application still responds. The pipeline still processes documents. But retrieval silently misses the right chunks, the index quietly diverges from the live corpus, and the system degrades in a way that only a held-out evaluation set would catch — a set most teams don't build until something breaks badly enough to justify it.
The five failure modes below account for most of the RAG production failures that have been publicly documented, presented at conferences, or written up in vendor post-mortems. Each one is a data-layer problem that shows up as a model-layer symptom, which is exactly why they persist: the root cause is upstream of where teams look.
What does it mean for a RAG pipeline to fail?
In strict terms, a RAG pipeline fails when the system returns a response that does not accurately reflect the relevant content in the corpus — whether by missing that content entirely, by surfacing the wrong chunks, or by corrupting the context the model receives. Retrieval failure is distinct from model failure: the language model may be behaving correctly given the context it received; the problem is that the context was wrong.
This definition matters because it sets the scope of diagnosis. When a RAG system produces a hallucination or a wrong answer, the natural instinct is to investigate the model — the prompt, the temperature, the model version. But in production systems with mature underlying models, the more common cause is retrieval: the right document existed in the corpus, the model would have answered correctly if it had been retrieved, and the pipeline didn't surface it.
RAG failures also vary by mode. Some failures are total — the corpus doesn't contain the answer, and the pipeline correctly signals that. Those are easy to catch. The hard failures are partial and silent: the corpus contains the answer, retrieval almost works, and the model fills the gap with a plausible hallucination that passes review. These are the failures the five pitfalls below produce.
Why does schema drift break retrieval silently?
Schema drift is the change that happens to a data source over time — a field renamed, a date format flipped, an upstream API that starts returning nulls where it used to return strings — without any corresponding change to the downstream pipeline that consumes it. In a RAG context, drift doesn't just create bad rows; it degrades the quality of the entire index in ways that are invisible until recall collapses.
The canonical public example is what happens when an enterprise RAG system ingests Confluence pages. Confluence's export format is version-dependent. An org that upgrades Confluence mid-deployment may silently change the HTML structure of its page exports — headings flatten to paragraphs, code blocks lose their semantic markers — without any API error, without any pipeline exception, and without any change to the document count. The corpus looks healthy. Completeness metrics pass. But the parser was trained on the old format, and the chunks it produces from the new format are longer, noisier, and have their structural signal stripped out. Retrieval on technical queries — "how do we configure X" — degrades immediately. Retrieval on prose queries degrades more slowly, but it still degrades.
The same pattern appears in SEC filing pipelines (EDGAR's XBRL taxonomy versioning changes every year), in news data (wire services periodically rename or reformat metadata fields), and in internal ticketing exports (Jira's CSV export format changed twice in 2023 and 2024 in ways that affected how body text was delimited).
What to watch for: changes in field distributions that don't correspond to real-world events. If the average length of a body_text field drops by 30% overnight and there was no change in document volume, something upstream changed. If a numeric field that used to have a mean of 14.2 now has a mean of 0.4, the source changed — or the parser broke. Distributional shift tests (Kolmogorov–Smirnov for numeric columns, chi-squared for categorical) catch these without requiring you to read every document. The problem isn't detecting the drift; it's having the monitoring in place to see it when it's happening rather than three months later when someone asks why the system can't answer questions it used to answer well.
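If you keep last week's snapshot next to this week's, the check itself is a few lines. A minimal sketch, assuming both snapshots are pandas DataFrames with the same schema and using SciPy's ks_2samp and chi2_contingency; the alpha thresholds are placeholders to calibrate on a quiet week, not recommendations:

```python
# Minimal drift check: KS test for numeric columns, chi-squared for categorical.
# Assumes `baseline` and `current` are pandas DataFrames with matching columns;
# the alert thresholds below are illustrative only.
import pandas as pd
from scipy import stats

def detect_drift(baseline: pd.DataFrame, current: pd.DataFrame,
                 ks_alpha: float = 0.01, chi2_alpha: float = 0.01) -> list[str]:
    alerts = []
    for col in baseline.columns:
        b, c = baseline[col].dropna(), current[col].dropna()
        if pd.api.types.is_numeric_dtype(baseline[col]):
            # Kolmogorov–Smirnov: did the numeric distribution shift?
            stat, p = stats.ks_2samp(b, c)
            if p < ks_alpha:
                alerts.append(f"{col}: numeric drift (KS={stat:.3f}, p={p:.2g})")
        else:
            # Chi-squared: did the category frequencies shift?
            counts = pd.concat([b.value_counts(), c.value_counts()], axis=1).fillna(0)
            stat, p, _, _ = stats.chi2_contingency(counts.T)
            if p < chi2_alpha:
                alerts.append(f"{col}: categorical drift (chi2={stat:.1f}, p={p:.2g})")
    return alerts
```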
Why does hash-based deduplication miss most duplicates at scale?
The default deduplication approach in most data pipelines is hash-based: compute a hash of the document content (MD5, SHA-1, or a normalized variant), store it in a set, and skip any document whose hash you've already seen. This catches exact duplicates — the same byte sequence appearing twice. It misses everything else, and at corpus scale, everything else is most of the duplication problem.
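For reference, the exact-hash pass is only a few lines, which is part of why it's the default. A minimal sketch, with a crude lowercase-and-collapse-whitespace normalization standing in for whatever the real pipeline does:

```python
# Exact-duplicate filter: normalize, hash, keep the first occurrence.
# Catches byte-identical (or near-identical, post-normalization) copies only.
import hashlib

def dedup_exact(docs: list[str]) -> list[str]:
    seen: set[str] = set()
    kept = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())   # crude normalization, assumed
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```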
Consider a press release. A company issues one. It gets scraped by Reuters, Associated Press, Bloomberg, MarketBeat, and six industry newsletters. The AP version has the boilerplate footer stripped. Bloomberg adds a dateline. MarketBeat reformats the paragraph breaks and adds an ad slot. The industry newsletters add "as reported by" headers and change the headline capitalization. None of these six versions has the same hash. All six go into the corpus. If a user asks a question that this press release answers, retrieval returns six near-identical chunks — consuming six context slots out of the ten available — instead of one clean canonical document plus five slots of additional, non-redundant context.
This isn't hypothetical. It's documented in the Common Crawl deduplication literature: the CCNet pipeline (Wenzek et al., 2020) and follow-up deduplication studies found that hash-based dedup on web text corpora left roughly 30–40% of near-duplicate content intact, depending on how "duplicate" was defined. The FineWeb dataset blog post (Hugging Face, 2024) includes detailed ablations showing that adding MinHash-based near-duplicate detection on top of exact-hash dedup substantially reduces training perplexity — a proxy for corpus noise. The same dynamic applies to RAG corpora, with recall degradation as the metric instead of training perplexity.
Near-duplicate detection approaches (MinHash with locality-sensitive hashing is the canonical reference in the literature; SimHash and learned-embedding similarity are alternatives with different tradeoff profiles) find document pairs that are similar above some threshold without requiring exact byte-level matches. They scale — MinHash in particular can be parallelized across shards efficiently. But they require tuning decisions that exact-hash dedup doesn't: what threshold counts as a duplicate, which version of a near-duplicate set do you keep, and how do you handle the case where the "duplicate" you drop happens to be the only version that contains a key detail the others omitted.
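A sketch of what a MinHash + LSH pass looks like, assuming the datasketch library; the word-level shingling, num_perm, and 0.8 threshold are illustrative tuning choices, not recommendations:

```python
# Near-duplicate detection with MinHash + locality-sensitive hashing (datasketch).
# Shingling strategy, num_perm, and threshold are all tuning decisions.
from datasketch import MinHash, MinHashLSH

def build_minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for shingle in set(text.lower().split()):   # word shingles; n-grams are common too
        m.update(shingle.encode("utf-8"))
    return m

def find_near_duplicates(docs: dict[str, str], threshold: float = 0.8) -> dict[str, list[str]]:
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    minhashes = {doc_id: build_minhash(text) for doc_id, text in docs.items()}
    for doc_id, m in minhashes.items():
        lsh.insert(doc_id, m)
    # For each document, which other documents exceed the similarity threshold?
    return {doc_id: [d for d in lsh.query(m) if d != doc_id]
            for doc_id, m in minhashes.items()}
```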
What to watch for: unusual clustering of retrieved chunks under recall evaluation. If you build a held-out set of (query, expected-document) pairs and find that your top-10 retrieval results frequently contain three or more chunks from documents with near-identical text, your dedup pass isn't catching near-duplicates. Run a spot-check: pick 100 random documents from the corpus and compute pairwise similarity against a sample of the rest. If more than 5–10% have a near-duplicate sibling, hash-based dedup is leaving significant noise in your index.
Why does fixed-size chunking break retrieval on prose?
Chunking is the step that divides a document into model-sized pieces before embedding. The naive implementation — split every document into fixed-size token windows, with some overlap — is simple to implement, produces predictable output sizes, and works well in demos. It fails systematically on prose because it ignores the semantic structure of the content it's cutting.
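The naive version, sketched below with word-splitting standing in for a real tokenizer, makes the problem visible: the split points are arithmetic, not semantic.

```python
# Naive fixed-size chunking: slide a fixed token window with overlap, ignoring structure.
# Word-splitting stands in for whatever tokenizer the real pipeline uses.
def chunk_fixed(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), step):
        # The boundary lands wherever the arithmetic says, mid-argument or not.
        chunks.append(" ".join(tokens[start:start + size]))
    return chunks
```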
A fixed-size chunker doesn't know where an argument ends and the next one begins. It doesn't know that the sentence at position 510 is the conclusion of a paragraph that began at position 420, and that the chunk boundary you drew at position 512 cut the conclusion off from its supporting evidence. It doesn't know that the heading at position 256 provides the key context for everything that follows until position 480, and that a chunk beginning at 300 will retrieve without that context.
The failure mode is consistent: dense prose — legal documents, technical specifications, narrative analysis, SEC filings — retrieves poorly on question-style queries because the most relevant content spans chunk boundaries. A question about the conclusion of an argument surfaces the conclusion chunk but not the evidence. A question about the implications of a policy retrieves the implications sentence but not the policy definition that precedes it by 200 tokens. The embedding model did its job; the chunk it embedded just didn't contain enough of the relevant document to be useful.
The "Lost in the Middle: How Language Models Use Long Contexts" study (Liu et al., 2023) found that models use long contexts unevenly: answer quality degrades when the relevant content sits in the middle of the provided context rather than near its start or end, even when the context window is theoretically large enough to hold it. Chunking creates the same shape of problem one stage earlier: when a boundary splits the relevant passage, the signal left in each chunk is diluted by the unrelated text on the other side of the boundary. Fixed-size chunking maximizes the probability of putting relevant content at boundaries, because it has no information about where boundaries should be.
What to watch for: asymmetric recall across query types. If your evaluation set includes both factual questions (which typically anchor to a single sentence) and conceptual questions (which require multi-sentence context), and your system performs well on factual but poorly on conceptual, fixed-size chunking is a likely cause. The tell is a visual inspection of the top-retrieved chunks for conceptual queries: if the retrieved chunks routinely begin or end mid-argument, mid-definition, or mid-example, the chunking strategy is cutting semantic units apart.
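One structure-aware alternative is to split on paragraph boundaries and pack whole paragraphs into a chunk up to a token budget, so a semantic unit is never cut in half. A sketch, with a word count standing in for a real tokenizer:

```python
# Structure-aware chunking sketch: split on blank lines (paragraphs), then pack
# whole paragraphs into chunks up to a token budget so semantic units stay intact.
# `count_tokens` is a stand-in for the pipeline's real tokenizer.
def count_tokens(text: str) -> int:
    return len(text.split())   # rough proxy; swap for the real tokenizer

def chunk_by_paragraph(document: str, max_tokens: int = 512) -> list[str]:
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        plen = count_tokens(para)
        if current and current_len + plen > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += plen
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```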
Why is retrieval recall the metric that matters most?
There are many metrics you can track on a RAG system: answer quality (human ratings, model-based evaluation), faithfulness (does the answer reflect the provided context), context precision (are the retrieved chunks relevant to the query), end-to-end latency, and throughput. Retrieval recall — specifically, the fraction of queries for which the correct document appears in the top-k retrieved results — is the metric that most teams don't measure and that most failures trace back to.
The reason recall matters more than the other metrics is that it's a precondition for all of them. If the right document isn't in the top-10 retrieved chunks, no amount of prompting, model quality, or answer-faithfulness scoring fixes the response. The model answers from whatever context it received. A system with 0.70 retrieval recall@10 is producing wrong answers on 30% of answerable queries before the model is even involved — and most of those wrong answers are confident and fluent, because the model received context, just not the right context.
The absence of recall measurement creates a specific organizational failure mode: teams optimize the wrong thing. They improve the prompt. They swap to a better model. They tune the reranker. Each change improves evaluation scores on the test set, because the test set and the retrieval pipeline share the same corpus at a point in time. But production recall degrades silently as the corpus drifts, as new document types enter the pipeline, as the dedup pass misses new near-duplicate patterns. The evaluation set ages. The production system diverges from it. The gap only shows up when a user sends a query that the aging test set doesn't cover.
The literature on RAG evaluation — the RAGAS framework (Es et al., 2023), the ARES framework (Saad-Falcon et al., 2023), and Anthropic's own internal evaluations published in various contexts — consistently identifies retrieval recall as the most sensitive leading indicator for overall system quality. It's also the easiest to compute: you need a held-out set of (query, document) pairs, and you need to score whether the document appears in the top-k retrieved results. The evaluation is deterministic. It doesn't require human raters. It catches regressions before they reach users.
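The computation itself is small. A sketch, assuming a held-out list of (query, expected document ID) pairs and a retrieve function that returns ranked document IDs from your index:

```python
# Recall@k over a held-out evaluation set.
# `retrieve(query, k)` is assumed to return the top-k document IDs from the index.
from typing import Callable

def recall_at_k(eval_set: list[tuple[str, str]],
                retrieve: Callable[[str, int], list[str]],
                k: int = 10) -> float:
    hits = 0
    for query, expected_doc_id in eval_set:
        if expected_doc_id in retrieve(query, k):
            hits += 1
    return hits / len(eval_set) if eval_set else 0.0
```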
What to watch for: retrieval recall@10 below your launch baseline on a stable held-out set. The right threshold depends on your application — high-stakes question answering (medical, legal, financial) needs higher recall than an internal document search. But any decline in recall on a stable held-out set is a regression signal, regardless of the absolute number. If you don't have a held-out set yet, the most important thing you can add to your RAG system isn't a better model or a better prompt — it's one.
Why does embedding-model mismatch cause silent recall collapse?
An embedding model maps text into a vector space. The structure of that space — which documents land near each other, which queries map close to which documents — is determined by how the model was trained. Two embedding models trained on different data or with different objectives produce different vector spaces. A query embedding from model A is not directly comparable to document embeddings from model B. Querying model A embeddings against a model B corpus produces nonsense similarity scores and near-zero effective recall, because the query vector is navigating a space it was never trained to navigate.
This seems obvious stated directly, but it creates a specific and common failure mode in production: embedding-model version drift. A pipeline that ships with text-embedding-3-small builds its index. Six months later, the team upgrades to text-embedding-3-large because it benchmarks better. They update the query embedding call. They don't re-embed the corpus — that's expensive and requires a migration — or they re-embed only new documents going forward. Now the index contains a mix of old-model and new-model embeddings. Queries run through the new model. Recall on old-model-embedded documents collapses. The system still returns results. The results look plausible. But documents embedded before the upgrade are effectively invisible to the new-model query.
The same failure occurs across model families, not just versions. An organization might embed a corpus with OpenAI's text-embedding-ada-002 and then switch to Cohere Embed v3 for cost reasons. Or a team might be running a RAG system that was built using one embedding provider, and a new team adds a second corpus embedded with a different provider, and the two corpora end up in the same index queried by a single model. Each of these is a mismatch. Each produces the same result: silent recall collapse on the affected subset of the corpus.
The problem doesn't always affect the whole corpus. Mixed-model indexes are the common case — you re-embed new documents as your pipeline evolves, and old documents keep their original embeddings. In practice this means recall degrades progressively as the proportion of old-model embeddings grows relative to the total, and the degradation is invisible unless you track recall segmented by document age or by when the embedding was generated.
The Massive Text Embedding Benchmark (MTEB) leaderboard, maintained publicly on Hugging Face, documents how much models vary not just in absolute retrieval performance but in which query types and domains each model handles well. A mismatch isn't just a dimension issue — it's a semantic-space issue. A model fine-tuned on legal text embeds legal queries differently than a general-purpose model. Using the wrong model for your domain is a form of mismatch even when the version is consistent.
What to watch for: recall segmented by document creation date or embedding generation timestamp. If recall@10 is strong on documents ingested in the last 90 days and weak on documents ingested more than 6 months ago, the older documents were likely embedded with a different model or a different preprocessing step. The fix isn't obvious — re-embedding is expensive and requires a migration window — but identifying the pattern is the first step, and it's a pattern that only shows up if you're tracking the right segmentation.
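One way to surface the pattern is to run the same recall computation segmented by ingestion date. A sketch, assuming each evaluation pair also carries the expected document's ingestion date and a retrieve function as before; the 180-day boundary is an arbitrary placeholder:

```python
# Recall@10 segmented by document age, to expose mixed-embedding-model indexes.
# Assumes each eval item carries the expected document's ingestion date.
from collections import defaultdict
from datetime import date, timedelta

def recall_by_age(eval_set: list[tuple[str, str, date]],
                  retrieve, k: int = 10, boundary_days: int = 180) -> dict[str, float]:
    buckets: dict[str, list[int]] = defaultdict(list)
    cutoff = date.today() - timedelta(days=boundary_days)
    for query, expected_doc_id, ingested_on in eval_set:
        bucket = "recent" if ingested_on >= cutoff else "old"
        buckets[bucket].append(1 if expected_doc_id in retrieve(query, k) else 0)
    # A large gap between the "recent" and "old" buckets points at mismatch.
    return {bucket: sum(scores) / len(scores) for bucket, scores in buckets.items()}
```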
A secondary signal: query performance degradation on a specific domain or document type without any change to the ingestion pipeline. If a corpus of financial filings suddenly retrieves poorly while general web text retrieves fine, the embedding model may have been updated upstream (through an API, through a framework dependency, through an automatic model alias resolution) and the new version may have a different strength/weakness profile on financial text. The mismatch doesn't have to be explicit to be real.
These five failure modes — schema drift, hash-based dedup gaps, fixed-size chunking on prose, missing retrieval recall measurement, and embedding-model mismatch — share a structural property: they all occur upstream of the model, they all manifest downstream as model-quality problems, and they're all invisible without instrumentation specifically designed to catch them. Standard application monitoring (error rates, latency, throughput) misses every one of them. The pipeline looks healthy until someone asks the right question on the wrong day.
The inverse is also true. A pipeline that monitors for distributional drift, validates dedup recall on a held-out set, chunks on semantic boundaries, tracks recall@10 weekly, and enforces embedding-model consistency across its entire corpus doesn't need to be lucky. It catches the failures before users do.
How do I detect drift in production?
Run a KS-statistic per numeric column and a chi-squared per categorical column, weekly. Alert when either crosses a threshold calibrated on a quiet week.
What's the right chunk size?
It depends on the embedding model's context window and your retrieval pattern. A common default is 512 tokens with 64-token overlap, but semantic chunking outperforms fixed sizes on prose.
How do I measure retrieval recall in production?
Build a held-out evaluation set of (query, expected-doc) pairs from real queries. Score recall@10 weekly. Anything below your launch baseline is a regression.
Want to skip the work?
Diagest absorbs the parse / clean / dedup / chunk / embed work and hands your AI exactly what it needs.
Contact us now →