RedHop vs LangChain vs LlamaIndex

We’d rather you trust the numbers than the marketing. Below is the same contract question done three ways, then a full, reproducible benchmark across scenarios, so you can judge it against your own workload.

You have a contract.pdf and one question: “What is the governing law?” Here’s the code path in each library to get the LLM the right context.

import redhop
from openai import OpenAI

query = "What is the governing law?"
ctx = redhop.Document.from_file("contract.pdf").context(query)
# parsed, chunked, retrieved, and token-budgeted internally
response = OpenAI().chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"{ctx.text()}\n\nQuestion: {query}"}],
)
print(response.choices[0].message.content)

What you stand up: nothing. Point it at the file and ask; parsing, chunking, retrieval, and token-budgeting happen inside — and every call returns a Decision Report explaining what it kept and why.

| | RedHop | LangChain | LlamaIndex |
|---|---|---|---|
| Document parsing | built-in (from_file) | a loader | a reader |
| Chunking strategy | internal default | you tune it | you tune it |
| Embedding model | optional (off by default) | required | required |
| Vector store / ANN | none, at any tier | FAISS / etc. | built-in index |
| Retriever wiring | none | manual | query engine |
| Cost to index | $0, ~1ms (BM25) | 1 embed call/chunk | 1 embed call/chunk |
| Why it kept a passage | Decision Report | opaque | opaque |

That’s the categorical difference: RedHop is one bounded step (from_file → context) with no vector database, at any tier. The frameworks are pipelines you assemble, embed into, and operate. RedHop’s default needs no model at all, so out of the box it’s queryable instantly with no embedding step.

On speed, RedHop is queryable instantly on its lexical default (no embedding step) and answers warm queries in ~1–6ms in-process. The full numbers are on the Speed page. But speed isn’t why you’d pick RedHop. The reasons are the runtime itself: the bounded API, conditional pruning, the Decision Report, no infrastructure. The fair question is whether that simplicity costs answer quality, so we measured it, head to head, below.

Same documents, BM25 for all three (so we compare context assembly, not retrieval engines), same token budget. Two datasets, CUAD (real contracts) and HotpotQA (multi-hop), across two tiers: evidence retention (no LLM) and downstream answer quality (gpt-4o-mini).

Evidence retention: RedHop vs LangChain vs LlamaIndex on HotpotQA multi-hop and CUAD contracts. RedHop leads multi-hop at 80%. On CUAD contracts a 4th striped bar shows RedHop reaching 90.7% with the Stripper + workload-curated Vocabulary chain (the same preprocessing is not applied to LlamaIndex), beating LlamaIndex's 86% by 4.7 points.

Evidence retention (gold-evidence recall ≥0.8, n=300, latest rerun 2026-06-06):

| dataset | RedHop | LangChain | LlamaIndex |
|---|---|---|---|
| HotpotQA (multi-hop) | 80% | 71% | 72% |
| CUAD (raw 24-word template) | 81.3% | 73% | 86% |
| CUAD (template-stripped query)† | 87.7% | | |
| CUAD (Stripper + Vocabulary)‡ | 90.7% | | |

† Same RedHop runtime, with redhop.Stripper(boilerplate_terms) (a compiled, token-level boilerplate-removal rewrite) applied to the query before retrieval, which drops CUAD’s fixed 24-word wrapper (Highlight the parts (if any)…) and passes the quoted clause name + Details: elaboration to BM25. We did not apply the same preprocessor to LangChain/LlamaIndex, so the striped bar in the chart is not an apples-to-apples comparison with theirs. Mechanism (BM25 boilerplate dilution) + recipe: CUAD_RECALL_GAP.

‡ Same as above, plus redhop.Vocabulary({...}) with a 34-key workload dictionary mapping clause names to high-IDF synonyms (Change of Control → merger, successor, acquisition, …). The vocabulary selectively raises the BM25 score of the gold-bearing chunk, the opposite mechanism to unweighted PRF (which was falsified on the same workload because it added corpus-pervasive low-IDF terms, see CUAD_PRF_NULL). Worked example + 4-arm probe: CUAD_CLAUSE_EXPANSION.
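To make the two footnoted rewrites concrete, here is a minimal plain-Python sketch of token-level boilerplate removal followed by clause-name expansion. It is not RedHop's implementation: `tokenize`, `strip_boilerplate`, `expand`, the boilerplate set, and the one-entry vocabulary are all illustrative assumptions, and the query uses only the visible part of the CUAD template.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def strip_boilerplate(query: str, boilerplate: set[str]) -> list[str]:
    """Drop whole tokens found in the boilerplate set; substrings are untouched."""
    return [tok for tok in tokenize(query) if tok not in boilerplate]

def contains_phrase(tokens: list[str], phrase: list[str]) -> bool:
    """True if `phrase` occurs as a contiguous run of tokens inside `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def expand(query: str, vocabulary: dict[str, list[str]]) -> list[str]:
    """Collect the synonyms of every vocabulary key that appears in the query."""
    tokens = tokenize(query)
    extra: list[str] = []
    for clause, synonyms in vocabulary.items():
        if contains_phrase(tokens, tokenize(clause)):
            extra.extend(synonyms)
    return extra

# Hypothetical boilerplate terms and a one-entry vocabulary, for illustration only.
BOILERPLATE = {"highlight", "the", "parts", "if", "any", "of",
               "this", "contract", "related", "to"}
VOCABULARY = {"Change of Control": ["merger", "successor", "acquisition"]}

query = 'Highlight the parts (if any) of this contract related to "Change of Control"'
rewritten = strip_boilerplate(query, BOILERPLATE) + expand(query, VOCABULARY)
print(rewritten)

# Token-level matching removes the word "of" but leaves "office" intact.
print(strip_boilerplate("office of the clerk", {"of", "the"}))
```

Expansion is matched against the original query rather than the stripped tokens, so a clause name that contains a boilerplate word (the "of" in "Change of Control") still matches its vocabulary key.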

Do you have to write a stripper to get this lift? Only if your queries follow a fixed template (legal QA, support-ticket triage, form-filled queries from a UI). For variable natural-language queries it’s a no-op. RedHop ships the primitives that compose the full workflow at the public API surface:

  • analyze_query_set: detects the pattern.
  • Stripper(boilerplate): compiled token-level boilerplate removal.
  • Vocabulary({...}): workload-curated synonyms.
  • doc.context_with_rewrites(query, [stripper, vocab]): runs the chain, with an audit trail on ctx.report.query_rewrites.
  • evaluate: scores the lift deterministically with no LLM judge (see EVALUATE_API).

The analyzer flags CUAD as templated without firing on diverse natural-language QA (HotpotQA and MuSiQue both stay quiet, see QUERY_SET_ANALYZER). The decision rule and runnable code are on the Choosing a configuration page → “Templated queries with heavy boilerplate”.

Or, the one-knob alternative. retrieval="hybrid" recovers most of the lift on the raw query (+5.3 points) without a stripper or dict. But on CUAD specifically, BM25 + Stripper + Vocabulary is Pareto-optimal: higher retention (90.7% vs 89.0% for hybrid + cross-encoder) AND 270× lower latency (~2.5ms vs ~683ms). The two paths are substitutes, not complements. Pick one. See CUAD_HYBRID_RERANK for the 6-arm probe.

Answer quality (gpt-4o-mini, F1 / EM, n=150):

| dataset | RedHop | LangChain | LlamaIndex |
|---|---|---|---|
| HotpotQA | 0.51 / 0.41 | 0.50 / 0.39 | 0.50 / 0.42 |
| CUAD | 0.34 / 0.17 | 0.25 / 0.11 | 0.35 / 0.16 |
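For readers unfamiliar with the F1 / EM pair, the sketch below shows one common way to compute them: SQuAD-style token overlap after light normalization. Whether bench/tier3.py normalizes answers exactly this way is an assumption; `normalize`, `exact_match`, and `f1` are illustrative names, not part of any library named on this page.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# A verbose but correct answer scores partial F1 and zero EM.
print(f1("The State of Delaware", "Delaware"), exact_match("The State of Delaware", "Delaware"))
# A terse answer that differs only in case and punctuation scores full marks.
print(f1("Delaware.", "delaware"), exact_match("Delaware.", "delaware"))
```

This is why F1 sits above EM in every cell of the table: an answer that contains the gold span plus extra words earns partial F1 credit but no exact match.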

On a real contract (the contract.pdf path itself)

We ran RedHop’s Document.from_text → context() path on 50 real CUAD contracts (644 clause questions) with BM25, budget 2,048 tok, the exact path the code above uses. Numbers are end-to-end (after Auto pruning). “Retained” means gold-span word-recall, a lexical retention proxy, not downstream answer quality:

  • −80% tokens: a ~9.3k-token contract becomes a ~1.9k-token context.
  • Gold evidence retained at ≥0.8 word-recall on 88% of queries (≥0.5 on 96%). The no-prune retrieval ceiling is 98%, so pruning costs ~6 points.
  • ~1.7ms/query p50 (warm in-memory index, single local CPU), ~1ms to chunk+index a whole contract, on the default BM25 path.
  • Auto chose to prune on 94% of queries. Real contracts are large, so the regime where pruning is measured to help is the common case.

Full conditions and the skeptic’s checklist are on the benchmarks page.
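As a rough illustration of the retention metric used in this section, the sketch below computes gold-span word-recall and the share of queries that clear a threshold. The exact tokenization and counting rules of the benchmark are not stated here, so the multiset counting, the `tokenize` regex, and the two example pairs are assumptions.

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def word_recall(gold_span: str, context: str) -> float:
    """Fraction of gold-span words (with multiplicity) present in the context."""
    gold = Counter(tokenize(gold_span))
    ctx = Counter(tokenize(context))
    total = sum(gold.values())
    if total == 0:
        return 0.0
    hits = sum(min(count, ctx[word]) for word, count in gold.items())
    return hits / total

def retention_rate(pairs: list[tuple[str, str]], threshold: float = 0.8) -> float:
    """Share of (gold_span, context) pairs whose word-recall meets the threshold."""
    kept = sum(word_recall(gold, ctx) >= threshold for gold, ctx in pairs)
    return kept / len(pairs)

# Two made-up (gold span, assembled context) pairs, for illustration only.
pairs = [
    ("governed by the laws of Delaware",
     "This Agreement shall be governed by the laws of the State of Delaware."),
    ("either party may terminate upon thirty days notice",
     "Either party may terminate this Agreement."),
]
for gold, ctx in pairs:
    print(word_recall(gold, ctx))
print(retention_rate(pairs, 0.8), retention_rate(pairs, 0.5))
```

Because the metric only checks word overlap, a context can score high recall while presenting the words out of order or out of context, which is why the page calls it a lexical proxy and not a measure of answer quality.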

  • RedHop leads multi-hop retention by a clear margin (+8 points over LlamaIndex, +9 over LangChain), roughly ties LlamaIndex on answer quality, and is ahead of LangChain. LlamaIndex edges RedHop on contract extraction with the raw CUAD query (see the next bullet).
  • The CUAD 4-point gap to LlamaIndex is mechanism-known: BM25 template-boilerplate dilution. CUAD asks every question with the same 24-word fixed template (Highlight the parts (if any) of this contract related to "X"…). BM25 weights each query term by IDF over the corpus, so the 19 boilerplate words dilute the 5 real signal words. redhop.Stripper(boilerplate) removes the template tokens at query time (compiled token-level matching, so an "of" stripper does not erase the "of" inside "office"). On RedHop, Stripper lifts ≥0.8 retention 82% → 87.7%. Adding a hand-authored 34-key clause-name redhop.Vocabulary({...}) reaches 90.7%.
  • Fair-preprocessing footnote (n=300, 2026-06-08). Applying the same Stripper to every system’s query (the apples-to-apples comparison) lifts all three: LlamaIndex 86% → 94%, RedHop 82% → 88%, LangChain 73% → 79%. LlamaIndex actually benefits more from the same Stripper than RedHop does. Its BM25 retriever is the stronger one on contract-extraction. The 90.7% RedHop result adds Vocabulary on top of Stripper, but that recipe was not applied to LlamaIndex in the published comparison, and given LlamaIndex’s bigger lift from the Stripper step alone, an unmeasured-but-likely outcome is that LlamaIndex with the same Vocabulary would match or beat 90.7%. Reproduce with bench/.venv/bin/python bench/compare.py. Full investigation in CUAD_RECALL_GAP and CUAD_CLAUSE_EXPANSION.
  • Retention is a loose proxy for answers: RedHop’s bigger retention lead shrinks to a near-tie on answer quality, because at a sensible budget every system gives the model enough to roughly tie. We show both numbers.
  • LangChain’s deficit is mostly refusals (CUAD 59% vs ~47%): its chunking surfaced the answer span less often, so the model bailed more.
  • These are BM25-vs-BM25 results. The frameworks’ default vector retrievers aren’t covered here.
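The boilerplate-dilution mechanism described in the bullets above can be reproduced on a toy corpus. The sketch below is a self-contained Okapi BM25 scorer, not RedHop's or LlamaIndex's retriever; the `K1` and `B` values are common defaults assumed for illustration, and the four tiny "chunks" are invented to exaggerate the effect.

```python
import math
from collections import Counter

K1, B = 1.5, 0.75  # common BM25 defaults (assumed, not RedHop's settings)

def bm25_scores(query: list[str], docs: list[list[str]]) -> list[float]:
    """Score every doc against the query with Okapi BM25 (Lucene-style IDF)."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        length_norm = K1 * (1 - B + B * len(d) / avgdl)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # a term absent from this chunk contributes nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (K1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

def top_chunk(query: list[str], docs: list[list[str]]) -> int:
    """Index of the highest-scoring chunk (first one wins on ties)."""
    scores = bm25_scores(query, docs)
    return scores.index(max(scores))

chunks = [
    "governing law delaware".split(),                     # 0: gold chunk
    "parts of this contract related to notices".split(),  # 1: distractor
    "parts of this contract related to payment".split(),  # 2: distractor
    "law firm fees".split(),                              # 3: distractor
]
raw_query = "highlight the parts if any of this contract related to governing law".split()
stripped_query = "governing law".split()

print("raw query top chunk:     ", top_chunk(raw_query, chunks))
print("stripped query top chunk:", top_chunk(stripped_query, chunks))
```

In this toy corpus the raw query ranks a boilerplate-heavy distractor first, because six low-IDF template words together outscore the two signal words; the stripped query ranks the gold chunk first. Real contract chunks are far longer, so the effect is subtler there than in this example.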

Answer quality is in the same band across all three (the numbers above), so the deciding factors are what the frameworks don’t offer:

  1. A Decision Report for every call: what it did, why, and why it chose not to intervene. No black box.
  2. Conditional optimization: prunes only when large/diluted (measured to help), and passes small contexts through untouched.
  3. An evidence layer: every default traces to a measured finding, including the experiments that failed.
  4. A tiny, bounded surface: Document.from_text(...).context(query), no vector infrastructure to run.

The big frameworks give you the full pipeline kit (many loaders, retrievers, vector stores, agents) when you want to assemble and tune that machinery yourself. RedHop is the opposite bet: document-centric retrieval as one bounded, in-process step, where simplicity and explainability matter more than wiring.

python3 -m venv bench/.venv
bench/.venv/bin/pip install redhop rank-bm25 langchain-community llama-index-core llama-index-retrievers-bm25
bench/.venv/bin/python bench/compare.py # retention (free)
bench/.venv/bin/python bench/tier3.py --n 150 # answer quality (needs OPENROUTER_API_KEY)

  • gpt-4o-mini only, one budget per dataset, two datasets. CUAD extraction F1 is low in absolute terms (hard task). The relative ranking is the signal.
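  • LlamaIndex’s contract edge on the raw CUAD query is real, and now mechanism-attributed: BM25 boilerplate dilution from CUAD’s fixed 24-word template (see CUAD_RECALL_GAP). The Stripper + Vocabulary chain closes it and puts RedHop ahead (90.7% vs 86%), but only with that preprocessing applied to RedHop alone; with the same Stripper applied to every system, LlamaIndex leads (94% vs 88%).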
  • RedHop’s reasoning_preserving strategy does not beat plain top-k downstream. Its value is the runtime decisions and transparency, not a better ranking algorithm.
  • The CUAD contract numbers above are evidence retention (word-recall), not downstream answer quality. The token reduction and latency are end-to-end.

Next: Benchmarks, every number, reproducible, with full methodology.

Looking for a specific framework alternative?
