Retrieval & context tips
These are operational laws RedHop’s experiments converged on, measured across four model families, reported with bootstrap 95% confidence intervals, and reproducible from the evidence layer. They’re useful whatever tool you use. RedHop just applies them for you.
Should I optimize this context at all?
- Small & focused (fits comfortably, few distractors) → pass it through. Pruning here is neutral-to-harmful: you risk dropping reasoning evidence for no gain.
- Large or junk-heavy (diluted) → prune to budget. Pruning recovers accuracy lost to attention dilution (lost-in-the-middle).
redhop.Document.from_text(text).context(query) uses strategy="auto" by
default. It decides this from input size and reports which it chose and why.
Know your workload shape
Before any knob, ask one question: how does a correct answer get assembled from your corpus? Almost every workload falls into one of three shapes, and the shape, not the file format, tells you which lever to pull. You usually know which one you are just by reading a handful of your own queries.
| Your workload looks like… | The answer is… | The failure mode to expect | Reach for |
|---|---|---|---|
| Single-hop extraction: contracts, runbooks, API refs, financial filings, “find the clause / value / section that says X” | in one place, and the query usually shares words with it | if every query has a fixed wrapper, the boilerplate dilutes the real terms | lexical default. strategy="raw_topk" for single-doc extraction. Strip the template (below) |
| Multi-hop reasoning: “what’s the nationality of the director of film X?” or anything that chains two+ facts | spread across several chunks, and the later “hop” is often low-relevance to the question | a relevance filter prunes the bridge chunk, so the model can’t complete the chain | strategy="reasoning_preserving" (what the default "auto" routes to). Don’t hard-filter. Don’t apply a cross-encoder uniformly |
| Paraphrase / vocabulary mismatch: support & HR KBs where users phrase things nothing like the docs | in one place, but worded with different words than the query | BM25 keyword matching misses it entirely | retrieval="semantic" (dense), or add rerank="cross-encoder" (verify on your corpus first) |
Two notes that save pain:
- The shapes can combine. A templated and single-hop workload (legal review) wants both raw_topk and a template stripper. A multi-hop workload with paraphrased questions wants reasoning_preserving and semantic. Read the table as “which rows apply,” not “pick one.”
- When unsure, you’re probably multi-hop-ish, so under-filter. The expensive mistake is treating a multi-hop workload as single-hop and hard-filtering away the second hop. strategy="auto" (the default) is built for exactly this uncertainty.
The laws below are the evidence behind each of these calls.
The laws
These come in two kinds: a few that reframe how to think about the problem, and a few that are just what the measurements forced us to accept. Each links to the finding it rests on, including the ones we falsified.
How to think
- Under-filter: the cost is asymmetric. This is the spine of everything else, in three beats:
  - Relevance ≠ reasoning usefulness. A chunk can be low-relevance to the query yet essential: the multi-hop “second hop” (second-hop tax).
  - So removing the wrong chunk costs more than keeping junk. Across models, aggressive filtering was net-harmful: the lost reasoning evidence cost more than the distractors removed (distractor robustness).
  - So make “do nothing” the default and intervention the exception. Avoiding damage beats chasing average lift (reasoning preservation).

  The practical rule: don’t hard-filter by query relevance, and keep low-relevance chunks that are linked to relevant ones.
- The decision is the value, not a magic optimizer. Naive top-k captures most of the gain, and no pruning algorithm dominates. Getting the when right matters more than the how (context economics).
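The practical rule above ("keep low-relevance chunks that are linked to relevant ones") can be sketched in a few lines. This is an illustration of the idea only, not RedHop's selection algorithm: the `Chunk` fields, the `links` sets, and the `seeds` parameter are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    id: str
    tokens: int                      # token count of the chunk
    relevance: float                 # query-relevance score from any retriever
    links: frozenset = frozenset()   # ids of chunks this one is linked to (hypothetical)


def select_under_filtered(chunks, budget, seeds=2):
    """Keep the top `seeds` chunks by relevance, then chunks linked to a seed,
    then everything else by relevance, stopping only at the token budget."""
    by_relevance = sorted(chunks, key=lambda c: c.relevance, reverse=True)
    seed_ids = {c.id for c in by_relevance[:seeds]}

    def tier(chunk):
        if chunk.id in seed_ids:
            return 0                 # the relevant chunks themselves
        if chunk.links & seed_ids:
            return 1                 # low-relevance but linked: the "second hop"
        return 2                     # unlinked filler

    # sorted() is stable, so relevance order is preserved inside each tier.
    ordered = sorted(by_relevance, key=tier)

    kept, used = [], 0
    for chunk in ordered:
        if used + chunk.tokens <= budget:
            kept.append(chunk.id)
            used += chunk.tokens
    return kept


chunks = [
    Chunk("film", 100, 0.9, frozenset({"director"})),
    Chunk("director", 100, 0.1, frozenset({"film"})),   # bridge chunk, low relevance
    Chunk("noise_a", 100, 0.6),
    Chunk("noise_b", 100, 0.5),
]
print(select_under_filtered(chunks, budget=200, seeds=1))
```

With room for two chunks, a pure relevance cut would keep `film` and `noise_a`. The sketch keeps `film` and `director` instead, which is the chunk the multi-hop chain needs.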
What we measured
- Optimize under dilution, not by raw length. 20k focused tokens beat 5k noisy ones. The driver is junk fraction / evidence density, not size alone (context dilution).
- Stronger rerankers aren’t universally safer. A cross-encoder applied uniformly on multi-hop can lower recall by demoting the bridge evidence (reranking limits).
- Optimization is model-aware. Frontier models tolerate distractors. Smaller/open models are more sensitive. The same policy isn’t optimal for all (distractor robustness).
- Defaults are priors, not guarantees: measure on your own corpus. Every
default here is the average call across our benchmarks, and your workload is
n=1 to us. Before trusting a non-default knob (a cross-encoder,
semantic, a template stripper), confirm it lifts your numbers, not ours.
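"Measure on your own corpus" can start very small. The sketch below is a stand-alone harness, independent of RedHop's own evaluation tooling: it computes recall@k for two arms on a handful of gold-labelled queries and puts a paired percentile-bootstrap 95% interval on the difference. The query ids, chunk ids, and function names are made up for illustration.

```python
import random


def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of the gold chunks that appear in the top k results."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(gold & set(ranked_ids[:k])) / len(gold)


def paired_bootstrap_ci(base, new, n_boot=2000, seed=0):
    """Percentile 95% CI for mean(new) - mean(base), resampling queries
    with replacement and keeping each query's two scores paired."""
    rng = random.Random(seed)
    n = len(base)
    diffs = []
    for _ in range(n_boot):
        sample = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(new[i] - base[i] for i in sample) / n)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]


# Toy gold labels and the ranked chunk ids each arm returned (illustrative data).
gold = {"q1": ["c3"], "q2": ["c7", "c8"], "q3": ["c1"]}
baseline_runs = {"q1": ["c3", "c9", "c2"], "q2": ["c7", "c5", "c4"], "q3": ["c6", "c2", "c9"]}
candidate_runs = {"q1": ["c3", "c2", "c9"], "q2": ["c7", "c8", "c4"], "q3": ["c2", "c1", "c9"]}

base = [recall_at_k(baseline_runs[q], gold[q], k=3) for q in gold]
new = [recall_at_k(candidate_runs[q], gold[q], k=3) for q in gold]
low, high = paired_bootstrap_ci(base, new)
print(f"baseline={sum(base) / len(base):.2f} candidate={sum(new) / len(new):.2f} "
      f"diff 95% CI=({low:.2f}, {high:.2f})")
```

If the interval on the difference straddles zero on your sample, the non-default knob has not yet earned its place.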
Retrieval: BM25 first, dense for vocabulary gaps
This law is about finding candidates, not optimizing them. That is a different pipeline stage from the laws above, and it corresponds to the “paraphrase / vocabulary mismatch” row of the workload table.
BM25 (zero-dependency) is the default and the best choice for lexical/keyword queries, but it
misses paraphrased / low-overlap ones. To recover those, opt into dense retrieval
(retrieval="semantic"). It embeds every chunk and scores the query against
all of them by cosine similarity. The search is exact, with no ANN index (just name a model, still no vector
DB). Measured on HotpotQA: BM25 ≈ 0.49 → dense ≈ 0.80 (recall@3). On a
synonym-mismatch probe: BM25 20% → dense 88% (recall@1)
(semantic mismatch).
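"Exact and ANN-free" just means scoring the query against every chunk vector and sorting. A minimal sketch of that brute-force search follows. The three-dimensional vectors are hand-made stand-ins for real embeddings, and `dense_top_k` is a hypothetical helper, not a RedHop function.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dense_top_k(query_vec, chunk_vecs, k=3):
    """Exact search: score every chunk, sort by similarity, take the best k."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy "embeddings": in practice an embedding model produces these.
chunk_vecs = {
    "pto_policy": [0.9, 0.1, 0.0],
    "expense_policy": [0.1, 0.9, 0.0],
    "parking": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.1]   # stands in for a query such as "how many vacation days do I get?"
print(dense_top_k(query_vec, chunk_vecs, k=2))
```

The cost is linear in the number of chunks per query, which is why this works without a vector database at single-document scale.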
Templated queries: strip the boilerplate first
This one’s worth a plain-English walkthrough, because the words “boilerplate” and “stripping” hide a very simple idea.
When does it apply? When every query in your workload is the same shape and only a word or two changes. Legal review, support triage, anything fed from a structured form looks like this:

"Highlight the parts of this contract a lawyer should review regarding Termination."
"Highlight the parts of this contract a lawyer should review regarding Governing Law."
"Highlight the parts of this contract a lawyer should review regarding Indemnification."

Boilerplate is the copy-paste wrapper that repeats in every query:
Highlight the parts of this contract a lawyer should review regarding …. It
tells the retriever nothing, because it matches every query and every document
equally. The only words that actually distinguish one query from another (the
discriminators) are Termination, Governing Law, Indemnification.
Stripping just means deleting that wrapper before you search, so the retriever sees only the words that matter:
Before: Highlight the parts of this contract a lawyer should review regarding Termination
After: Termination

Why it helps: BM25 weights every query word, so the 19 boilerplate words dilute the signal from the 1–2 real ones. Strip them and the match sharpens onto the real target. On CUAD this is a measured 81.3% → 87.7% retention lift.
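A stripper for the simplified wrapper shown above can be a single anchored regular expression. The sketch below is illustrative: the pattern is copied from the example queries on this page, and the function body is an assumption about what such a stripper might look like, not the recipe RedHop's experiments used.

```python
import re

# Hypothetical wrapper pattern, taken from the example queries above.
_WRAPPER = re.compile(
    r"^\s*highlight the parts of this contract a lawyer should review regarding\s+",
    re.IGNORECASE,
)


def strip_cuad_template(query: str) -> str:
    """Return only the discriminator terms of a templated query.
    Queries that do not start with the wrapper are returned unchanged."""
    match = _WRAPPER.match(query)
    if match is None:
        return query
    rest = query[match.end():].strip().rstrip(".").strip()
    return rest or query   # never hand the retriever an empty query


print(strip_cuad_template(
    "Highlight the parts of this contract a lawyer should review regarding Governing Law."
))
```

Falling back to the original query on a non-match keeps the stripper safe to leave on when a free-form question slips into the workload.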
Two things to keep honest:
- Your boilerplate isn’t CUAD’s, so your stripper isn’t this one. The strip_cuad_template name has the workload baked in on purpose: you’d write strip_support_template, strip_invoice_template, etc., one per workload. RedHop deliberately ships no built-in strip_template(): templates are workload-specific, and a built-in would make the wrong call on the next one.
- For single-doc extraction also set strategy="raw_topk". On contract-shape tasks the auto-routed reasoning_preserving strategy solves a multi-hop problem you don’t have. raw_topk beats it by ~4 points.
Full mechanism, numbers, and the runnable recipe: Choosing a config → Templated queries with heavy boilerplate.
The one-knob alternative: just turn on retrieval="hybrid"
If writing a stripper sounds like more work than you want to do, here’s the honest alternative: you can get most of the same lift by flipping a single flag. Hybrid retrieval reads each chunk as semantic content (via a small embedding model) instead of just counting tokens, so the boilerplate ratio stops mattering: the model knows the wrapper words are uninformative without anyone having to tell it.
doc = redhop.Document.from_file("contract.pdf", retrieval="hybrid", model="bge-small")
ctx = doc.context(user_query)  # raw query, no preprocessing

Measured on the same CUAD setup: hybrid on the raw template query gives +5.3 points (81.3% → 86.7%), close to the +6.4 that template stripping gives on its own. You pay ~10ms per query (vs ~2.5ms BM25) and an 80MB model download on first use.
So when do you reach for which?
- Lowest-effort, near-best: turn on retrieval="hybrid". One config flag, ~+5 points automatic, no dict to maintain.
- Best-quality and fastest: stay on the BM25 default, compile a Stripper(...) (or use analyze_query_set to surface the boilerplate for you), and pair it with a workload Vocabulary({...}) (the next section). Run both through doc.context_with_rewrites(query, [stripper, vocab]). On CUAD this gets to 90.7% at ~2.5ms, higher retention and lower latency than hybrid+CE.
The trade is straightforward: hybrid saves you the dict-and-stripper work but caps your headroom at what the embedder can do unsupervised (~86–88%). Stripping + expansion takes more setup but stacks productively and runs at native BM25 speed.
Expand the discriminators (when stripping isn’t enough)
Once the boilerplate is gone, your query is small but might still miss the
gold passage, because the clause uses different words than the query
asks for. Ask about Change of Control in a contract and the relevant
clause probably talks about a merger, a successor, or an acquisition.
The query and the gold span are semantically identical but lexically
disjoint, so BM25 can’t connect them.
The fix is the mirror image of stripping. Stripping subtracts low-IDF noise (the wrapper that fires on everything). Expansion adds high-IDF discriminators (the rare terms that appear in the gold but not in the query).
import redhop
# YOUR workload's taxonomy. Build the dict by reading a handful of your gold
# spans and noting the recurring high-IDF terms. They're surprisingly stable
# per topic in any "fixed-taxonomy" workload (legal clauses, support-ticket
# categories, HR-policy buckets, etc.).
vocab = redhop.Vocabulary({
    "change of control": ["merger", "successor", "acquisition"],
    "non-compete": ["restraint", "non-competition"],
    "indemnification": ["hold harmless", "defend", "liability"],
})

stripper = redhop.Stripper(my_boilerplate)
ctx = doc.context_with_rewrites(user_query, [stripper, vocab])

# The audit trail makes every transformation observable.
for rec in ctx.report.query_rewrites:
    print(rec.stage, rec.matched, rec.added)

Vocabulary is token-level (an "ip" vocabulary key does NOT fire on
"recipient"), doesn’t recursively chain, and dedupes synonyms across
overlapping matches. The original query is preserved verbatim: the
synonyms are appended, never substituted. Vocabulary.bidirectional({...})
gives symmetric matching (PTO ↔ “paid time off” ↔ “vacation”).
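The matching rules described above (token-level, no recursive chaining, deduped, append-only) are easy to misread, so here is a stand-alone sketch of them in plain Python. It mimics the described behaviour for illustration only. `expand_query` and `tokenize` are hypothetical helpers and say nothing about how Vocabulary is implemented internally.

```python
import re


def tokenize(text):
    """Lowercased word tokens."""
    return re.findall(r"\w+", text.lower())


def expand_query(query, vocab):
    """Append synonyms for every vocabulary key that occurs in the query as a
    whole-token sequence. Returns (expanded_query, added_terms)."""
    query_tokens = tokenize(query)
    added = []
    for key, synonyms in vocab.items():
        key_tokens = tokenize(key)
        width = len(key_tokens)
        # Token-level match: the key must line up with whole tokens, so "ip"
        # cannot fire inside "recipient".
        hit = width > 0 and any(
            query_tokens[i:i + width] == key_tokens
            for i in range(len(query_tokens) - width + 1)
        )
        if hit:
            for synonym in synonyms:
                if synonym not in added:   # dedupe across overlapping matches
                    added.append(synonym)
    # Keys were matched against the original tokens only, so an added synonym
    # never triggers another key (no recursive chaining).
    expanded = query if not added else query + " " + " ".join(added)
    return expanded, added


vocab = {
    "change of control": ["merger", "successor"],
    "control": ["successor", "authority"],
    "merger": ["amalgamation"],
    "ip": ["intellectual property"],
}
print(expand_query("Change of Control", vocab))
print(expand_query("notify the recipient", vocab))
```

The first call shows dedupe (`successor` appears once) and no chaining (`amalgamation` is not added even though `merger` was). The second shows that `ip` does not fire on `recipient`.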
Why this isn’t PRF. Pseudo-relevance feedback (“read the top chunks, add their most common words to the query”) looks superficially similar but fails on boilerplate-heavy corpora. See CUAD_PRF_NULL. PRF re-injects the same low-IDF terms you just spent the strip step removing, because the top chunks share a lot of corpus boilerplate. Your hand-curated dict is the opposite: the synonyms are chosen to be high IDF on your corpus (rare across non-matching documents), so they sharpen the ranking instead of washing it out.
Measured on CUAD. With only a Stripper, ≥0.8 retention on the framework comparison is 87.7% (already past LlamaIndex’s 86%). Add a 34-key clause-name Vocabulary and it’s 90.7%, more than 4 points past LlamaIndex.
| arm | ≥0.8 retention |
|---|---|
| raw 24-word template | 81.3% |
| Stripper | 87.7% |
| Stripper + Vocabulary | 90.7% |
| raw + Vocabulary (control) | 86.3% |
Honest scope, two things worth knowing:
- Hand-curated synonyms ≠ a recipe for synonyms. The dict was built by reading CUAD gold spans and noting recurring terms. An unfamiliar workload needs the same domain inspection. RedHop deliberately ships no automated synonym miner here: that’s the falsified PRF arc.
- The mechanism direction matters. Adding the right terms (high-IDF, workload-curated) gives the lift. Adding the wrong terms (low-IDF, corpus-frequency-derived) takes it away. If you’re not sure your dict is high-IDF on your corpus, A/B it with redhop.evaluate against a small gold sample before committing.
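Before a full A/B, a quick sanity check is to see where your candidate synonyms fall in your own corpus's IDF range. The sketch below assumes the common Lucene-style BM25 IDF formula, which may differ in detail from RedHop's scoring. The corpus and helper names are made up for illustration.

```python
import math
import re


def idf_table(chunks):
    """Lucene-style BM25 IDF per token: log(1 + (N - df + 0.5) / (df + 0.5))."""
    n = len(chunks)
    doc_freq = {}
    for chunk in chunks:
        for token in set(re.findall(r"\w+", chunk.lower())):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {tok: math.log(1 + (n - df + 0.5) / (df + 0.5)) for tok, df in doc_freq.items()}


def rank_by_idf(candidates, idf):
    """Sort single-token candidates from least to most discriminating.
    Tokens absent from the corpus score 0.0: they cannot match anything."""
    return sorted(candidates, key=lambda term: idf.get(term, 0.0))


corpus = [
    "This agreement covers the merger of the parties.",
    "This agreement sets the initial term.",
    "This agreement names the successor entity.",
    "This agreement is governed by the law of the state.",
]
idf = idf_table(corpus)
print(rank_by_idf(["merger", "agreement", "successor"], idf))
```

Terms that land at the low end (here `agreement`, which appears in every chunk) behave like boilerplate and are poor expansion candidates. Terms at the high end are the discriminators worth adding.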
Full mechanism, the CUAD dict, and the four-arm probe that justifies the numbers: CUAD_CLAUSE_EXPANSION.
Knobs (and sane defaults)
| Knob | Where | Default | When to change |
|---|---|---|---|
| chunk_size | from_text (index-time) | 128 | smaller for very tight budgets |
| strategy | from_text | "auto" | rarely |
| budget | context() (query-time) | doc default | per-query, freely |
| language | from_text (index-time) | raw pipeline (no stemming) | "english" for code search / inflection-heavy content (CamelCase split + Snowball stem). Language code ("german", "french", …) for non-English |
| code_neighbors_default | from_text/from_file (index-time) | 1 | 0 for memory-tight code search where BM25 already surfaces body chunks. 2/3 at loose budgets to recover function bodies further from the seed (CODE_NEIGHBORS_DEFAULT) |
| prose_heading_default | from_text/from_file (index-time) | true | false to skip auto-attaching section headings. Attaching them is measurably helpful at typical budgets (+7pt ≥0.8) but a wash on categorical “## Setup”-style headings (PROSE_HEADING_DEFAULT) |
chunk_size is fixed at construction (it’s how the index is built). budget is
per-query and free to vary without re-indexing.
Why the default is a minimal analyzer (and when to opt back in)
Up through 0.3.1 the default analyzer applied English Snowball stemming
(so "highlighted" matched "highlight"), plus CamelCase splitting and
stopword filtering. In 0.3.2 the default flipped to a minimal
pipeline (Unicode tokenization, lowercase, ASCII fold, nothing else)
because measurement said the heavier pipeline was hurting more than
helping:
| Workload | english ≥0.8 | raw (new default) ≥0.8 | english p50 | raw p50 |
|---|---|---|---|---|
| CUAD | 86% | 91% (+5) | 6.4 ms | 3.8 ms |
| HotpotQA | 100% | 100% (tied) | 2.9 ms | 2.3 ms |
| MuSiQue | 90% | 97% (+7) | 3.4 ms | 2.3 ms |
Stemming was hurting via false-positive stem collisions ("settles" /
"settling" / "settled" all → "settl"), inflating BM25 scores on
chunks that shared any form and drowning out the discriminating
proper nouns.
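The collision is easy to reproduce. The sketch below contrasts a minimal analyzer (tokenize, lowercase, strip accents) with a toy suffix stripper. The stripper is a deliberately crude stand-in for Snowball, used here only so the example needs no third-party package, and both function names are hypothetical.

```python
import re
import unicodedata


def minimal_analyze(text):
    """Unicode tokenization, lowercase, accent folding. No stemming, no stopwords."""
    decomposed = unicodedata.normalize("NFKD", text)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.findall(r"\w+", folded.lower())


def toy_stem(token):
    """Crude suffix stripper standing in for a real stemmer (NOT Snowball)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


tokens = minimal_analyze("Settles, settling, settled")
print(tokens)                          # three distinct terms under the minimal analyzer
print([toy_stem(t) for t in tokens])   # one shared stem once suffixes are stripped
```

Under the minimal analyzer a query term only matches the exact surface form, so a chunk gets no credit for merely sharing a stem with the query.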
# 0.3.2 default: raw pipeline, no extra arguments needed.
doc = redhop.Document.from_text(text)

# Opt back in to English Snowball (CamelCase + stopwords + stemmer):
doc = redhop.Document.from_text(text, language="english")

When to opt back in to language="english":
- Code search. The CamelCase splitter is what makes "compressVideo" matchable via "compress".
- Heavy inflectional variation between query and doc: queries about “acquisitions” against doc text mentioning “acquired”, “acquiring”. Test with redhop.evaluate(..., gold_chunks=...) on a sample.
Non-English content: use the language code ("german", "french", …).
The language-specific stemmer handles morphology your content needs.
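For the code-search case, the CamelCase split mentioned above is the part doing the work. Here is a rough sketch of such a splitter. The regular expression and the decision to lowercase the parts are assumptions for illustration, and RedHop's analyzer may split differently.

```python
import re

# Alternatives, in order: an acronym followed by a capitalized word ("HTTP" in
# "HTTPServer"), a capitalized or lowercase word, a trailing acronym, a digit run.
_CAMEL_PARTS = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_camel(identifier):
    """Split a CamelCase / snake_case identifier into lowercase word parts."""
    return [part.lower() for part in _CAMEL_PARTS.findall(identifier)]


print(split_camel("compressVideo"))
print(split_camel("HTTPServer2"))
```

Once `compressVideo` is indexed as `compress` and `video`, a plain keyword query for "compress" can reach it.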
Full evidence + workload-specific recommendations: RAW_ANALYZER.
Bring your own chunker, if the workload calls for it
RedHop’s chunker is well-tuned for sentence-aware prose (MULTIHOP_CHUNK_SIZE_NULL
shows bigger chunk_size regresses on multi-hop, and smaller doesn’t help much).
But if you’ve measured a different chunker that fits your workload (semantic
chunkers, AST-aware code chunkers, schema-aware splitters for tabular data,
or any third-party Markdown / LaTeX / academic-paper splitter), wire it in
via Document.from_chunks(...):
import redhop

# Use any chunker you want; just hand RedHop the resulting strings as Chunks.
from your_chunker import chunk_into_sections

sections = chunk_into_sections(open("paper.tex").read())
chunks = [
    redhop.Chunk(text, source="paper.tex", id=f"sec-{i}", metadata={"section": title})
    for i, (title, text) in enumerate(sections)
]
doc = redhop.Document.from_chunks(chunks)
ctx = doc.context("What is the main contribution?")

The constant-chunking matrix
(MULTIHOP_CONSTANT_CHUNKING)
showed two things worth knowing before you spend time on this:
- The chunker dominates (it’s the lever: RedHop’s BM25 vs LangChain’s vs LlamaIndex’s is essentially flat on the same chunks, and the chunker choice is where ±20pts of retention live).
- There’s no universally-best chunker. RedHop’s sentence-aware chunker wins on HotpotQA’s short-paragraph shape. LangChain’s char-recursive chunker ties on MuSiQue’s compositional multi-hop. If your workload is something else (legal cross-references, scientific papers, structured data), test on your own corpus with redhop.evaluate(..., gold_chunks=...) before committing.
The full evidence behind each law, including the hypotheses that were falsified, lives in the project’s evidence layer on GitHub.
Next: vs LangChain / LlamaIndex (the same contract question, three ways) · Benchmarks (every number, reproducible).