RedHop vs LangChain vs LlamaIndex
We’d rather you trust the numbers than the marketing. Below is the same contract question done three ways, then a full, reproducible benchmark across scenarios, so you can judge it against your own workload.
The same question, three ways
You have a contract.pdf and one question: “What is the governing law?” Here’s
the code path in each library to get the LLM the right context.

    import redhop
    from openai import OpenAI

    query = "What is the governing law?"

    # parsed, chunked, retrieved, and token-budgeted internally
    ctx = redhop.Document.from_file("contract.pdf").context(query)

    response = OpenAI().chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"{ctx.text()}\n\nQuestion: {query}"}],
    )
    print(response.choices[0].message.content)

What you stand up: nothing. Point it at the file and ask; parsing, chunking, retrieval, and token-budgeting happen inside, and every call returns a Decision Report explaining what it kept and why.

    from langchain_community.document_loaders import PyMuPDFLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_openai import OpenAIEmbeddings, ChatOpenAI
    from langchain_community.vectorstores import FAISS
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.runnables import RunnablePassthrough
    from langchain_core.output_parsers import StrOutputParser

    query = "What is the governing law?"

    pages = PyMuPDFLoader("contract.pdf").load()
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    ).split_documents(pages)

    store = FAISS.from_documents(chunks, OpenAIEmbeddings())
    retriever = store.as_retriever(search_kwargs={"k": 4})

    prompt = ChatPromptTemplate.from_template(
        "Answer using only the context.\n\n{context}\n\nQuestion: {input}"
    )

    chain = (
        {"context": retriever, "input": RunnablePassthrough()}
        | prompt
        | ChatOpenAI(model="gpt-4o-mini")
        | StrOutputParser()
    )

    print(chain.invoke(query))

What you stand up: a splitter (you choose chunk_size/overlap), an embedding model, a FAISS vector store, a retriever, a prompt template, and a retrieval chain: six wired pieces, and embeddings cost a call per chunk.

    from llama_index.core import VectorStoreIndex, Settings
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.readers.file import PyMuPDFReader
    from llama_index.embeddings.openai import OpenAIEmbedding
    from llama_index.llms.openai import OpenAI

    query = "What is the governing law?"

    Settings.embed_model = OpenAIEmbedding()
    Settings.llm = OpenAI(model="gpt-4o-mini")

    docs = PyMuPDFReader().load(file_path="contract.pdf")

    index = VectorStoreIndex.from_documents(
        docs,
        transformations=[SentenceSplitter(chunk_size=512, chunk_overlap=50)],
    )

    engine = index.as_query_engine(similarity_top_k=4)
    print(engine.query(query))

What you stand up: a node parser, an embedding model, a vector index, and a query engine. Cleaner than LangChain, but still an embed-and-index pipeline you own and pay for.
What each approach makes you own
| | RedHop | LangChain | LlamaIndex |
|---|---|---|---|
| Document parsing | built-in (from_file) | a loader | a reader |
| Chunking strategy | internal default | you tune it | you tune it |
| Embedding model | optional (off by default) | required | required |
| Vector store / ANN | none, at any tier | FAISS / etc. | built-in index |
| Retriever wiring | none | manual | query engine |
| Cost to index | $0, ~1ms (BM25) | 1 embed call/chunk | 1 embed call/chunk |
| Why it kept a passage | Decision Report | opaque | opaque |
That’s the categorical difference: RedHop is one bounded step (from_file → context)
with no vector database, at any tier. The frameworks are pipelines you assemble,
embed into, and operate. RedHop’s default needs no model at all, so out of the box
it’s queryable instantly with no embedding step.
On speed, RedHop is queryable instantly on its lexical default (no embedding step) and answers warm queries in ~1–6ms in-process. The full numbers are on the Speed page. But speed isn’t why you’d pick RedHop. The reasons are the runtime itself: the bounded API, conditional pruning, the Decision Report, no infrastructure. The fair question is whether that simplicity costs answer quality, so we measured it, head to head, below.
The benchmark
Same documents, BM25 for all three (so we compare context assembly, not retrieval engines), same token budget. Two datasets, CUAD (real contracts) and HotpotQA (multi-hop), across two tiers: evidence retention (no LLM) and downstream answer quality (gpt-4o-mini).
Evidence retention (gold-evidence recall ≥0.8, n=300, latest rerun 2026-06-06):
| dataset | RedHop | LangChain | LlamaIndex |
|---|---|---|---|
| HotpotQA (multi-hop) | 80% | 71% | 72% |
| CUAD (raw 24-word template) | 81.3% | 73% | 86% |
| CUAD (template-stripped query)† | 87.7% | — | — |
| CUAD (Stripper + Vocabulary)‡ | 90.7% | — | — |
† Same RedHop runtime, with redhop.Stripper(boilerplate_terms) (a
compiled, token-level boilerplate-removal rewrite) applied to the query
before retrieval, which drops CUAD’s fixed 24-word wrapper
(Highlight the parts (if any)…) and passes the quoted clause name +
Details: elaboration to BM25. We did not apply the same preprocessor
to LangChain/LlamaIndex, so the striped bar in the chart is not an
apples-to-apples comparison with theirs. Mechanism (BM25 boilerplate
dilution) + recipe: CUAD_RECALL_GAP.
‡ Same as above, plus redhop.Vocabulary({...}) with a 34-key workload
dictionary mapping clause names to high-IDF synonyms
(Change of Control → merger, successor, acquisition, …). The vocabulary
selectively raises the BM25 score of the gold-bearing chunk. That is the
opposite mechanism to unweighted pseudo-relevance feedback (PRF), which
failed on the same workload because it added corpus-pervasive low-IDF terms
(see CUAD_PRF_NULL).
Worked example + 4-arm probe: CUAD_CLAUSE_EXPANSION.
Do you have to write a stripper to get this lift? Only if your queries
follow a fixed template (legal QA, support-ticket triage, form-filled queries
from a UI). For variable natural-language queries it’s a no-op. RedHop ships
the primitives that compose the full workflow at the public API surface:
analyze_query_set (detect the pattern), Stripper(boilerplate) (compiled
token-level boilerplate removal), Vocabulary({...}) (workload-curated
synonyms), doc.context_with_rewrites(query, [stripper, vocab]) (runs the
chain with audit trail on ctx.report.query_rewrites), and evaluate (score
the lift deterministically with no LLM judge, see
EVALUATE_API).
The analyzer flags CUAD as templated without firing on diverse natural-language
QA (HotpotQA + MuSiQue both stay quiet, see
QUERY_SET_ANALYZER).
Decision rule + runnable code on the
Choosing a configuration page → “Templated queries with heavy boilerplate”.
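
To make the two rewrites concrete, here is a minimal pure-Python sketch of token-level boilerplate stripping followed by vocabulary expansion. It is not RedHop's implementation: `tokenize`, `rewrite_query`, the boilerplate list, the one-entry vocabulary, and the sample query are all hypothetical stand-ins chosen for illustration.

```python
import re


def tokenize(text):
    """Lowercase word tokens; punctuation is discarded."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rewrite_query(query, boilerplate_terms, vocabulary):
    """Drop boilerplate tokens, then append synonyms for any matched phrase.

    Hypothetical helper: it mimics the idea of a stripper plus a vocabulary,
    not the behavior of any real library call.
    """
    tokens = tokenize(query)
    drop = {term.lower() for term in boilerplate_terms}

    # Token-level matching: only whole tokens are removed.
    kept = [tok for tok in tokens if tok not in drop]

    # Phrase lookup runs on the unstripped tokens so multi-word keys still match.
    padded = " " + " ".join(tokens) + " "
    extra = [
        synonym
        for phrase, synonyms in vocabulary.items()
        if " " + " ".join(tokenize(phrase)) + " " in padded
        for synonym in synonyms
    ]
    return kept + extra


boilerplate = ["highlight", "the", "parts", "if", "any", "of", "this",
               "contract", "related", "to"]
vocabulary = {"Change of Control": ["merger", "successor", "acquisition"]}
query = ('Highlight the parts (if any) of this contract related to '
         '"Change of Control" in the office lease')

rewritten = rewrite_query(query, boilerplate, vocabulary)
print(rewritten)

# Substring removal, by contrast, damages unrelated words.
naive = "office".replace("of", "")
print(naive)
```

In this sketch the token "of" is removed wherever it stands alone, including inside the clause name, while "office" survives intact. The synonyms are appended as extra query terms, which is one simple way to model a vocabulary that raises the score of chunks containing them.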
Or, the one-knob alternative. retrieval="hybrid" recovers most of the
lift on the raw query (+5.3 points) without a stripper or dict. But on CUAD
specifically, BM25 + Stripper + Vocabulary is Pareto-optimal: higher
retention (90.7% vs 89.0% for hybrid + cross-encoder) AND 270× lower latency
(~2.5ms vs ~683ms). The two paths are substitutes, not complements. Pick
one. See
CUAD_HYBRID_RERANK
for the 6-arm probe.
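
The dominance claim can be checked with a few lines of arithmetic. The sketch below uses only the two arms quoted above (the other arms of the probe are not reproduced here), and `Arm` and `dominates` are illustrative names rather than part of any benchmark script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arm:
    name: str
    retention_pct: float  # share of queries with gold-evidence recall >= 0.8
    latency_ms: float     # approximate per-query latency


def dominates(a, b):
    """True when `a` is no worse than `b` on both axes and better on one."""
    no_worse = a.retention_pct >= b.retention_pct and a.latency_ms <= b.latency_ms
    better = a.retention_pct > b.retention_pct or a.latency_ms < b.latency_ms
    return no_worse and better


lexical = Arm("BM25 + Stripper + Vocabulary", 90.7, 2.5)
hybrid = Arm("hybrid + cross-encoder", 89.0, 683.0)

speedup = hybrid.latency_ms / lexical.latency_ms
print(dominates(lexical, hybrid), round(speedup))
```

With these two figures the latency ratio works out to roughly 273, which the text rounds to 270×.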
Answer quality (gpt-4o-mini, F1 / EM, n=150):
| dataset | RedHop | LangChain | LlamaIndex |
|---|---|---|---|
| HotpotQA | 0.51 / 0.41 | 0.50 / 0.39 | 0.50 / 0.42 |
| CUAD | 0.34 / 0.17 | 0.25 / 0.11 | 0.35 / 0.16 |
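
For readers unfamiliar with the two metrics, here is a sketch of token-level F1 and exact match in the common SQuAD style. Whether the benchmark scripts normalize answers in exactly this way is an assumption; the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 when the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall against the gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the State of Delaware", "Delaware"))     # 0.5
print(exact_match("the State of Delaware", "Delaware"))  # 0.0
print(exact_match("Delaware.", "delaware"))              # 1.0
```

Under this scoring a refusal shares no tokens with the gold answer and scores 0 on both metrics, so a system that refuses more often loses F1 even when its non-refused answers are good.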
On a real contract (the contract.pdf path itself)
We ran RedHop’s Document.from_text → context() path on 50 real CUAD contracts
(644 clause questions) with BM25, budget 2,048 tok, the exact path the code above
uses. Numbers are end-to-end (after Auto pruning). “Retained” means gold-span
word-recall, a lexical retention proxy, not downstream answer quality:
- −80% tokens: a ~9.3k-token contract becomes a ~1.9k-token context.
- Gold evidence retained at ≥0.8 word-recall on 88% of queries (≥0.5 on 96%). The no-prune retrieval ceiling is 98%, so pruning costs ~6 points.
- ~1.7ms/query p50 (warm in-memory index, single local CPU), ~1ms to chunk+index a whole contract, on the default BM25 path.
- Auto chose to prune on 94% of queries. Real contracts are large, so the regime where pruning is measured to help is the common case.
Full conditions and the skeptic’s checklist are on the benchmarks page.
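
As a rough illustration of the retention proxy, the sketch below computes gold-span word-recall and the share of queries clearing a threshold. The exact tokenization and matching rules used in the benchmark are assumptions here, and the sample strings and recall values are made up for the example.

```python
import re


def tokenize(text):
    """Lowercase word tokens; punctuation is discarded."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_recall(context, gold_span):
    """Fraction of gold-span tokens that appear anywhere in the context."""
    gold = tokenize(gold_span)
    if not gold:
        return 1.0
    present = set(tokenize(context))
    return sum(1 for word in gold if word in present) / len(gold)


def retention_rate(recalls, threshold=0.8):
    """Share of queries whose word-recall meets the threshold."""
    return sum(r >= threshold for r in recalls) / len(recalls)


recall = word_recall(
    "This Agreement is governed by the laws of Delaware",
    "governed by the laws of New York",
)
print(round(recall, 3))

print(retention_rate([1.0, recall, 0.85, 0.4]))

# Token reduction for the contract sizes quoted above: ~9.3k down to ~1.9k.
print(round(1 - 1900 / 9300, 2))
```

Because the measure is purely lexical, a context can score high recall without containing the answer in a usable form, which is why it is reported as a proxy and not as answer quality.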
How to read this
- RedHop leads multi-hop retention by a clear margin (+8 over LlamaIndex, +9 over LangChain) and is ≈ LlamaIndex / ahead of LangChain on answers. LlamaIndex edges RedHop on contract extraction with the raw CUAD query (see the next bullet).
- The CUAD 4-point gap to LlamaIndex is mechanism-known: BM25
  template-boilerplate dilution. CUAD asks every question with the same
  24-word fixed template
  (Highlight the parts (if any) of this contract related to "X"…).
  BM25 weights each query term by IDF over the corpus, so the 19 boilerplate
  words dilute the 5 real signal words. redhop.Stripper(boilerplate) removes
  the template tokens at query time (compiled token-level matching, so an
  "of" stripper does not erase the "of" inside "office"). On RedHop, Stripper
  lifts ≥0.8 retention 82% → 87.7%. Adding a hand-authored 34-key clause-name
  redhop.Vocabulary({...}) reaches 90.7%.
- Fair-preprocessing footnote (n=300, 2026-06-08). Applying the
  same Stripper to every system’s query (the apples-to-apples
  comparison) lifts all three: LlamaIndex 86% → 94%, RedHop
  82% → 88%, LangChain 73% → 79%. LlamaIndex actually benefits
  more from the same Stripper than RedHop does. Its BM25 retriever is
  the stronger one on contract-extraction. The 90.7% RedHop result adds
  Vocabulary on top of Stripper, but that recipe was not applied to
  LlamaIndex in the published comparison, and given LlamaIndex’s bigger
  lift from the Stripper step alone, an unmeasured-but-likely outcome is
  that LlamaIndex with the same Vocabulary would match or beat 90.7%.
  Reproduce with bench/.venv/bin/python bench/compare.py. Full
  investigation in CUAD_RECALL_GAP and CUAD_CLAUSE_EXPANSION.
- Retention is a loose proxy for answers: RedHop’s bigger retention lead shrinks to a near-tie on answer quality, because at a sensible budget every system gives the model enough to roughly tie. We show both numbers.
- LangChain’s deficit is mostly refusals (CUAD 59% vs ~47%): its chunking surfaced the answer span less often, so the model bailed more.
- These are BM25-vs-BM25 results. The frameworks’ default vector retrievers aren’t covered here.
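
The dilution mechanism can be seen on a toy corpus. The sketch below implements plain Okapi BM25 in a few lines and scores a templated query and a stripped query against three invented chunks. The corpus is far too small for realistic IDF values, and only the visible part of the CUAD template is used, so this shows the shape of the effect and does not reproduce the benchmark.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score every tokenized doc against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        length_norm = k1 * (1 - b + b * len(d) / avg_len)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def top_hit(query, docs):
    """Index of the highest-scoring doc for a whitespace-tokenized query."""
    scores = bm25_scores(query.split(), docs)
    return max(range(len(docs)), key=scores.__getitem__)


# Invented chunks: index 0 holds the gold evidence, index 1 echoes the template.
docs = [
    "the governing law of this agreement is the law of delaware".split(),
    "the parts of this contract related to payment are set out in the schedule".split(),
    "notices shall be sent to the address of each party".split(),
]

templated = "highlight the parts if any of this contract related to governing law"
stripped = "governing law"

print(top_hit(templated, docs))  # 1: the template words pull up the payment chunk
print(top_hit(stripped, docs))   # 0: the governing-law chunk wins
```

The templated query contributes several matching terms to the payment chunk, and their combined score outweighs the two signal terms that only the gold chunk contains. Removing the template leaves only the signal terms, so the gold chunk ranks first.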
So why pick RedHop?
Answer quality is in the same band across all three (the numbers above), so the deciding factors are what the frameworks don’t offer:
- A Decision Report for every call: what it did, why, and why it chose not to intervene. No black box.
- Conditional optimization: prunes only when large/diluted (measured to help), and passes small contexts through untouched.
- An evidence layer: every default traces to a measured finding, including the experiments that failed.
- A tiny, bounded surface: Document.from_text(...).context(query), no vector infrastructure to run.
The big frameworks give you the full pipeline kit (many loaders, retrievers, vector stores, agents) when you want to assemble and tune that machinery yourself. RedHop is the opposite bet: document-centric retrieval as one bounded, in-process step, where simplicity and explainability matter more than wiring.
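
The idea of conditional pruning with an explanation attached can be sketched in a few lines. This is a hypothetical illustration only: `assemble_context`, the whitespace token counter, and the report fields are invented for the example and do not describe RedHop's internals or the real Decision Report schema.

```python
def count_tokens(text):
    """Crude token count by whitespace; a real system would use a tokenizer."""
    return len(text.split())


def assemble_context(ranked_chunks, budget_tokens):
    """Pass small inputs through untouched; prune only when over budget.

    Returns the kept chunks plus a small report explaining the decision.
    """
    total = sum(count_tokens(chunk) for chunk in ranked_chunks)
    if total <= budget_tokens:
        return {
            "chunks": list(ranked_chunks),
            "action": "pass_through",
            "reason": f"{total} tokens fit within the budget of {budget_tokens}",
        }

    kept, used = [], 0
    for chunk in ranked_chunks:  # assumed to be ordered best-first
        size = count_tokens(chunk)
        if used + size > budget_tokens:
            continue  # skip chunks that would overflow, keep trying smaller ones
        kept.append(chunk)
        used += size
    return {
        "chunks": kept,
        "action": "prune",
        "reason": f"{total} tokens exceed the budget of {budget_tokens}; kept {used}",
    }


chunks = ["a b c", "d e f g", "h i"]
small = assemble_context(chunks, budget_tokens=20)
tight = assemble_context(chunks, budget_tokens=5)
print(small["action"], tight["action"], tight["chunks"])
```

Returning the reason alongside the result is the design point: the caller can always see whether anything was dropped and why, including the case where nothing was done.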
Reproduce it yourself

    python3 -m venv bench/.venv
    bench/.venv/bin/pip install redhop rank-bm25 langchain-community llama-index-core llama-index-retrievers-bm25
    bench/.venv/bin/python bench/compare.py        # retention (free)
    bench/.venv/bin/python bench/tier3.py --n 150  # answer quality (needs OPENROUTER_API_KEY)

Scope & caveats
- gpt-4o-mini only, one budget per dataset, two datasets. CUAD extraction F1 is low in absolute terms (hard task). The relative ranking is the signal.
- LlamaIndex’s contract edge on the raw CUAD query is real, and now
  mechanism-attributed: BM25 boilerplate dilution from CUAD’s fixed 24-word
  template (see CUAD_RECALL_GAP). The Stripper + Vocabulary chain closes it
  and puts RedHop ahead (90.7% vs 86% for LlamaIndex on the raw query).
- RedHop’s reasoning_preserving strategy does not beat plain top-k downstream. Its value is the runtime decisions and transparency, not a better ranking algorithm.
- The CUAD contract numbers above are evidence retention (word-recall), not downstream answer quality. The token reduction and latency are end-to-end.
Next: Benchmarks, every number, reproducible, with full methodology.
Looking for a specific framework alternative?
- RedHop as a LangChain alternative: code-vs-code, what each gives you, migration guide
- RedHop as a LlamaIndex alternative: same shape, plus where the CUAD template-dilution gap lives and how to close it
- RedHop as a Haystack alternative: three calls vs pipelines, components, document stores