Choosing a configuration

If you’re not sure which Document settings to use, this page tells you in 60 seconds, based on what the docs you’re loading actually look like.

Your corpus is…

  • "normal": code, API refs, runbooks, handbooks, reports, mixed folders
      Document.from_file(p)
      doc.context(q)

  • "structured": a contract / policy with near-duplicate clauses (e.g. governing-law overrides per region)
      Document.from_file(p, retrieval="hybrid", model="bge-small")
      doc.context(q, include_heading=True, neighbors=1)

  • "synonym-mismatch": HR FAQs, support tickets where users ask in different words than the docs use
      Document.from_file(p, retrieval="hybrid", model="bge-small", rerank="cross-encoder")

Three recipes cover the practical space.

The lexical default: no model download, ~50ms warm queries, no ONNX runtime.

import redhop
doc = redhop.Document.from_file("contract.pdf")
ctx = doc.context("What is the governing law?")
prompt = ctx.text() # feed to any LLM
print(ctx.report) # see what was retrieved and why

When this is right: code, API references, internal docs, runbooks, financial reports, handbooks, mixed folders (from_folder). The queries share vocabulary with the answers, which is the case for most technical and policy content.
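One rough way to check whether your queries share vocabulary with your documents, before reaching for the heavier recipes, is to measure term overlap directly. The sketch below is plain Python and does not call redhop; the tokenizer and the `vocab_overlap` helper are illustrative assumptions, not part of the library.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vocab_overlap(query: str, document: str) -> float:
    """Fraction of distinct query terms that also occur in the document."""
    q_terms = set(tokenize(query))
    if not q_terms:
        return 0.0
    d_terms = set(tokenize(document))
    return len(q_terms & d_terms) / len(q_terms)


# Toy document text, made up for illustration.
doc_text = "This Agreement is governed by the law of England and Wales."

shared = vocab_overlap("What is the governing law?", doc_text)
disjoint = vocab_overlap("Which country's rules apply?", doc_text)
print(f"shared vocabulary: {shared:.2f}, disjoint vocabulary: {disjoint:.2f}")
```

If most of your real queries score well above zero on a check like this, the lexical default is probably the right starting point. Consistently low overlap hints at the synonym-mismatch case.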

Hybrid + heading-aware retrieval. Adds an ~80MB embedding model download on first run, and warm queries climb to ~150ms. Worth it only if your doc has clauses like “main clause X” and “EU override of clause X” and “Japan override of clause X”. Heading awareness disambiguates them.

doc = redhop.Document.from_file(
    "msa.pdf",
    retrieval="hybrid",
    model="bge-small",
)
ctx = doc.context(
    "What law applies in the UK?",
    include_heading=True,
    neighbors=1,
)

When this is right: legal contracts with regional variations, multi-jurisdiction policies, vendor security questionnaires with repeated sub-sections. When it’s wrong: clean single-chapter docs. Adding neighbors=1 to well-structured chapters can dilute well-targeted retrieval rather than help it.
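To make the trade-off concrete, here is a minimal sketch of what heading inclusion and neighbor expansion could look like. It assumes `neighbors=1` means "also keep one chunk on each side of every hit" and that `include_heading=True` prefixes each chunk with its section heading. The `Chunk` class and `expand_hits` function are hypothetical and are not redhop's implementation.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    index: int
    heading: str
    text: str


def expand_hits(chunks, hit_indices, neighbors=1, include_heading=True):
    """Return hit chunks plus `neighbors` chunks on each side, in document order."""
    keep = set()
    for i in hit_indices:
        lo = max(0, i - neighbors)
        hi = min(len(chunks) - 1, i + neighbors)
        keep.update(range(lo, hi + 1))
    out = []
    for i in sorted(keep):
        c = chunks[i]
        out.append(f"[{c.heading}] {c.text}" if include_heading else c.text)
    return out


# Toy contract, made up for illustration.
chunks = [
    Chunk(0, "12. Governing law", "This Agreement is governed by New York law."),
    Chunk(1, "12.1 EU override", "For EU customers, Irish law applies."),
    Chunk(2, "12.2 UK override", "For UK customers, English law applies."),
    Chunk(3, "13. Notices", "Notices must be sent in writing."),
]

expanded = expand_hits(chunks, hit_indices=[2])
for line in expanded:
    print(line)
```

In this toy contract the heading prefix tells the UK override apart from the EU one, but the neighbor window also pulls in the unrelated Notices chunk. That is the dilution effect on well-structured documents.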

Adds a cross-encoder reranker, which closes the synonym gap (the canonical “employee left” vs “staff terminated” case). Adds ~300MB of model download and 5–10× query latency.

doc = redhop.Document.from_file(
    "support_kb.md",
    retrieval="hybrid",
    model="bge-small",
    rerank="cross-encoder",
)
ctx = doc.context("why did the worker leave?")

When this is right: corpora where queries and answers regularly share no surface words (HR, support FAQs translated from internal phrasing, multilingual content). When it’s wrong: anywhere the lexical default already works: it adds latency without recovering anything. Verify it helps on your corpus before adopting.
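A small harness is enough to verify whether the reranker helps on your corpus. The sketch below is library-agnostic: `hit_rate` is a hypothetical helper, the retrievers are stand-in lookups, and the labelled queries are toy data. How you pull chunk ids out of a real redhop context is deliberately left out.

```python
from typing import Callable, Iterable


def hit_rate(
    retrieve: Callable[[str], Iterable[str]],
    labelled: list[tuple[str, str]],
) -> float:
    """Share of queries whose gold chunk id appears among the retrieved ids."""
    if not labelled:
        return 0.0
    hits = sum(1 for query, gold in labelled if gold in set(retrieve(query)))
    return hits / len(labelled)


# Toy labelled set: (query, id of the chunk that answers it).
labelled = [
    ("why did the worker leave?", "hr-7"),
    ("notice period", "hr-2"),
]

# Stand-ins for two configurations; replace with wrappers around your own setup.
lexical_results = {"why did the worker leave?": ["hr-1"], "notice period": ["hr-2"]}
reranked_results = {"why did the worker leave?": ["hr-7", "hr-1"], "notice period": ["hr-2"]}

base = hit_rate(lambda q: lexical_results[q], labelled)
cand = hit_rate(lambda q: reranked_results[q], labelled)
print(f"lexical={base:.2f} reranked={cand:.2f} lift={cand - base:+.2f}")
```

If the lift is at or near zero on a representative sample of your own queries, the extra download and latency are not buying anything.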

| | Lexical default | Hybrid + bge | + cross-encoder rerank |
|---|---|---|---|
| First-run model download | none | ~80MB (bge-small) | + ~300MB (cross-encoder) |
| Warm query latency | ~50ms | ~150ms | ~1000ms |
| Compile-time deps | none | ONNX runtime | ONNX runtime |
| Where it helps | most document QA | regional overrides, parallel sub-sections | synonym-mismatch retrieval |
| Where it hurts | | adds latency on docs lexical already handles | adds latency without recovering anything unless the failure mode is synonym mismatch |
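The latency figures above will vary with hardware and document size, so it is worth measuring warm latency on your own machine. The sketch below times any callable with the standard library. `warm_latency_ms` is a hypothetical helper and `fake_context` is a stand-in for something like `lambda q: doc.context(q)`.

```python
import statistics
import time


def warm_latency_ms(fn, query, warmup=3, runs=20):
    """Median wall-clock latency of fn(query) in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn(query)  # discard cold-start effects (model load, caches)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(query)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def fake_context(query):
    """Stand-in workload; swap in your real retrieval call."""
    return sorted(query * 200)


median_ms = warm_latency_ms(fake_context, "What is the governing law?")
print(f"median warm latency: {median_ms:.3f} ms")
```

The median is used rather than the mean so that a single slow outlier does not distort the comparison between tiers.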

The library can only retrieve what your query gives it. Three patterns no config can fix:

1. Ambiguous single-word queries

'vendor' retrieves the vendor-management section, not the liability cap (even when both mention vendors). 'settle' can retrieve the indemnification clause (“settle a claim”) rather than the arbitration clause (“settle a dispute”), even with a cross-encoder reranker, because both readings are defensible.

Fix it in the query, not the config: add one disambiguating word. 'liability cap for vendor' correctly finds the cap clause. 'arbitration forum to settle disputes' finds the arbitration clause.
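The effect of one disambiguating word can be reproduced with a few lines of Okapi BM25. This is a self-contained sketch on a made-up three-chunk corpus with common default parameters (k1=1.5, b=0.75). It illustrates the scoring behaviour and is not redhop's internal ranker.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank doc ids by Okapi BM25 score for `query`, best first."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n = len(tokens)
    avgdl = sum(len(t) for t in tokens.values()) / n
    # Document frequency: in how many chunks each term appears.
    df = Counter(term for t in tokens.values() for term in set(t))
    scores = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


# Toy corpus, made up for illustration.
docs = {
    "vendor_mgmt": "vendor management vendor onboarding vendor review and vendor audits",
    "liability": "liability cap the total liability of the vendor shall not exceed the fees",
    "notices": "notices shall be sent in writing",
}

print(bm25_rank("vendor", docs)[0])
print(bm25_rank("liability cap for vendor", docs)[0])
```

The bare keyword goes to the chunk that repeats it most often. The added terms "liability" and "cap" are rare in the corpus, so they carry high IDF and move the cap clause to the top.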

2. Natural-language paraphrase with no shared vocabulary

'How long do I have to cancel and get my money back?' against a contract that uses “refund” and “termination for convenience” (not “cancel” or “money back”) can return an empty or weak context across every tier.

Fix in the query: use the doc’s vocabulary. “What’s the refund window?” finds the relevant clause immediately.

Fix at the config level (sometimes): retrieval="hybrid" adds a dense embedder that can match refund to cancel through semantic similarity. Hybrid is a strict superset of lexical (BM25-tail fallback fills any chunks the dense pool missed), so you never lose candidates by turning it on. The cost is the ~80MB embedder download and ~3× warm latency.
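The superset property can be pictured as a simple merge: take the dense pool first, then append every lexical hit the dense pool did not already contain. The sketch below is one reading of "BM25-tail fallback", under the assumption that both retrievers return ranked chunk ids. `hybrid_candidates` and the chunk ids are hypothetical, and redhop's actual fusion may differ.

```python
def hybrid_candidates(dense_ranked, bm25_ranked, dense_pool=3):
    """Top `dense_pool` dense hits, followed by BM25 hits not already present."""
    merged = list(dense_ranked[:dense_pool])
    seen = set(merged)
    for chunk_id in bm25_ranked:
        if chunk_id not in seen:
            merged.append(chunk_id)
            seen.add(chunk_id)
    return merged


# Toy rankings, made up for illustration.
dense = ["c7", "c2", "c9", "c4"]
bm25 = ["c2", "c5", "c1"]

merged = hybrid_candidates(dense, bm25, dense_pool=3)
print(merged)
```

Because every BM25 hit ends up in the merged list, anything the lexical default would have found remains a candidate. What changes is the ordering and the extra dense-only chunks.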

3. Templated queries with heavy boilerplate

If every query in your workload follows a fixed template, the boilerplate competes with the real signal. Examples: “Highlight the parts (if any) of this contract related to X that should be reviewed by a lawyer. Details: …”, “Help me with X, my account is Y, the error is Z”, or form-filled queries from a structured UI. BM25 weights each query term by corpus IDF, not by how often the term appears across your query set, so it cannot tell that the template words are boilerplate. The 19 boilerplate words dilute the 5 real signal words, and retention suffers.
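The gap between corpus IDF and query-set frequency is easy to see on a toy example. In the sketch below (plain Python, made-up chunks and queries, hypothetical helper names), the template word "lawyer" appears in every query but in only one chunk, so BM25-style IDF rates it exactly as highly as the real signal word.

```python
import math
import re


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def idf(term, docs_tokens):
    """BM25-style inverse document frequency over the corpus chunks."""
    n = len(docs_tokens)
    df = sum(1 for toks in docs_tokens if term in toks)
    return math.log((n - df + 0.5) / (df + 0.5) + 1.0)


def query_set_share(term, queries_tokens):
    """Fraction of queries in the workload that contain the term."""
    return sum(1 for toks in queries_tokens if term in toks) / len(queries_tokens)


chunks = [
    "The Supplier shall provide indemnification for third-party claims.",
    "Each party should consult a lawyer before signing this contract.",
    "Notices must be sent in writing to the registered address.",
    "Fees are payable within thirty days of invoice.",
]
queries = [
    "Highlight the parts of this contract related to indemnification that should be reviewed by a lawyer.",
    "Highlight the parts of this contract related to notices that should be reviewed by a lawyer.",
]

chunk_tokens = [set(tokenize(c)) for c in chunks]
query_tokens = [set(tokenize(q)) for q in queries]

for term in ("lawyer", "indemnification"):
    print(term, round(idf(term, chunk_tokens), 3), query_set_share(term, query_tokens))
```

A term that shows up in every query carries no information about what the user wants, yet the ranker only sees its corpus statistics. Stripping it before retrieval is the subtraction this section describes.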

This is measured. On CUAD’s fixed 24-word template, stripping the boilerplate to just <clause name> <details> before calling context() lifts ≥0.8 retention from 81.3% to 87.7% (n=300, BM25, budget 2,000 tok), overtaking LlamaIndex’s 86%. Full mechanism and numbers: CUAD_CLAUSE_EXPANSION (the controlled three-arm run).

Two paths up the same hill: pick one, don’t combine. Measured on CUAD (CUAD_HYBRID_RERANK):

| path | what you do | retention | latency |
|---|---|---|---|
| One-knob | retrieval="hybrid" (BGE-small embedder) | ~86–88% | ~10 ms/q |
| Best-quality | BM25 default + analyze_query_set + Stripper + Vocabulary chain via doc.context_with_rewrites(...) | 90.7% | ~2.5 ms/q |

Hybrid retrieval reads chunks as semantic content rather than counting tokens, so the boilerplate ratio stops mattering. It substitutes for template stripping by a different mechanism. Running both gives diminishing returns: once one mechanism has fixed the boilerplate dilution, the other adds only +0.3 points. Strip + expand is Pareto-optimal on CUAD (higher retention AND lower latency) but takes the upfront work of writing a stripper and building a synonym dict.
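"Pareto-optimal" here has a precise meaning: no worse on either axis and strictly better on at least one. The sketch below checks that against the two CUAD rows reported above. The `Path` type and `dominates` function are illustrative, and the hybrid row uses the upper end of its reported ~86–88% range to give it the benefit of the doubt.

```python
from typing import NamedTuple


class Path(NamedTuple):
    name: str
    retention_pct: float
    latency_ms: float


def dominates(a: Path, b: Path) -> bool:
    """True if `a` is no worse than `b` on both axes and strictly better on one."""
    no_worse = a.retention_pct >= b.retention_pct and a.latency_ms <= b.latency_ms
    better = a.retention_pct > b.retention_pct or a.latency_ms < b.latency_ms
    return no_worse and better


# Figures taken from the CUAD comparison in this page.
one_knob = Path("hybrid", retention_pct=88.0, latency_ms=10.0)
best_quality = Path("strip + expand", retention_pct=90.7, latency_ms=2.5)

print(dominates(best_quality, one_knob))
print(dominates(one_knob, best_quality))
```

The check says nothing about setup cost, which is the axis where the one-knob path wins.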

Recommended workflow if you go the best-quality path: detect → compile → run-through-rewrites → A/B. Every step ships in the public API. The rewrites compile once (the analyzer pass happens at construction time, not per query) and run through doc.context_with_rewrites(query, [stripper, vocab]). The per-stage audit trail lands on ctx.report.query_rewrites so every transform is observable:

import redhop

# 1. Detect: the analyzer reports the shape of your query set.
report = redhop.analyze_query_set(my_queries[:300])
# Cross-workload probe (findings/QUERY_SET_ANALYZER.md):
#   CUAD     → is_templated=True,  share=0.66, cost="high"
#   HotpotQA → is_templated=False, share=0.00, cost="none"
#   MuSiQue  → is_templated=False, share=0.12, cost="none"

if report.is_templated:
    # 2. Compile the rewrites. Stripper compiles the boilerplate
    #    once via the analyzer. Matching is token-level, so an "of"
    #    stripper does NOT erase the "of" inside "office".
    stripper = redhop.Stripper(report.boilerplate_terms)

    # 3. (optional) Vocabulary. If your workload has a known taxonomy
    #    of "topics" with predictable synonyms (clause types, error
    #    codes, issue categories), compile them once. On CUAD this
    #    lifts retention from 87.7% to 90.7% on top of the Stripper.
    #    Mechanism: workload-curated high-IDF discriminators raise
    #    the BM25 score of the relevant chunk. Opposite mechanism
    #    direction from PRF (which fails on boilerplate-heavy
    #    corpora; see CUAD_PRF_NULL).
    vocab = redhop.Vocabulary({
        # YOUR workload's keys → synonyms. Worked CUAD example in
        # CUAD_CLAUSE_EXPANSION.md.
        "change of control": ["merger", "successor", "acquisition"],
        "non-compete": ["restraint", "non-competition"],
    })

    # 4. Run the chain inside context_with_rewrites; the per-stage
    #    audit lands on ctx.report.query_rewrites automatically.
    doc = redhop.Document.from_text(your_document)
    ctx_a = doc.context(user_query)  # baseline
    ctx_b = doc.context_with_rewrites(user_query, [stripper, vocab])

    # 5. A/B: redhop.evaluate scores both arms deterministically.
    #    No LLM judge; same primitives the Decision Report uses.
    eval_a = redhop.evaluate(user_query, ctx_a, gold_chunks=gold_ids)
    eval_b = redhop.evaluate(user_query, ctx_b, gold_chunks=gold_ids)
    # eval_b.overall - eval_a.overall is the per-query lift.

    # 6. Inspect: every rewrite is observable.
    for rec in ctx_b.report.query_rewrites:
        print(rec.stage, "matched=", rec.matched,
              "added=", rec.added, "removed=", rec.removed)

The analyzer is conservative by design: HotpotQA and MuSiQue both stay quiet on the probe (is_templated=False), while CUAD fires (is_templated=True, share 0.66). The analyzer measures the shape of your queries; it does not promise a specific retention lift. The roughly +6-point lift (81.3% to 87.7%) was measured on CUAD specifically. On a different templated workload, the magnitude depends on how much of your real signal was being drowned, which is why step 3 matters.
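Because the lift is workload-specific, it helps to aggregate the per-query differences over a labelled sample rather than trust a single query. The sketch below assumes you have already collected one score per query for each arm (for example the `overall` values from the A/B step). `summarize_lift` is a hypothetical helper and the numbers are toy values.

```python
from statistics import mean


def summarize_lift(baseline, rewritten):
    """Mean per-query lift plus win/loss/tie counts for paired scores."""
    if len(baseline) != len(rewritten):
        raise ValueError("both arms must score the same queries")
    lifts = [b - a for a, b in zip(baseline, rewritten)]
    return {
        "mean_lift": mean(lifts),
        "wins": sum(1 for x in lifts if x > 0),
        "losses": sum(1 for x in lifts if x < 0),
        "ties": sum(1 for x in lifts if x == 0),
    }


# Toy paired scores: same three queries, baseline arm vs rewritten arm.
baseline_scores = [0.5, 1.0, 0.25]
rewritten_scores = [1.0, 1.0, 0.0]

summary = summarize_lift(baseline_scores, rewritten_scores)
print(summary)
```

Looking at wins and losses alongside the mean shows whether the rewrites help broadly or help a few queries while hurting others.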

For single-doc extraction workloads, also set strategy="raw_topk". On contract-shape tasks the Auto-routed reasoning_preserving strategy is solving a multi-hop problem you don’t have, and raw_topk beats it by ~4 points at every chunk size.

We deliberately do not ship a CUAD-specific strip_template() helper. Templates are workload-specific, and embedding one into the library would make the wrong call for the next workload. Stripper(...) and Vocabulary({...}) take your boilerplate / synonym dict so the call stays on your side.
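To show the kind of transformation these two rewrites perform, here is a plain-Python stand-in: a token-level stripper followed by a synonym expander. The functions `strip_tokens` and `expand` are hypothetical and are not how redhop implements `Stripper` or `Vocabulary`. The boilerplate set and synonym dict are the parts you would supply for your own workload.

```python
import re


def strip_tokens(query, boilerplate):
    """Drop whole tokens found in `boilerplate`; longer words containing them survive."""
    drop = {b.lower() for b in boilerplate}
    kept = [
        tok for tok in query.split()
        if re.sub(r"\W+", "", tok).lower() not in drop
    ]
    return " ".join(kept)


def expand(query, synonyms):
    """Append synonyms for every key phrase that occurs in the query."""
    lowered = query.lower()
    extra = []
    for key, words in synonyms.items():
        if key in lowered:
            extra.extend(w for w in words if w not in lowered)
    return " ".join([query, *extra]) if extra else query


boilerplate = {"highlight", "the", "parts", "related", "to"}
synonyms = {"non-compete": ["restraint", "non-competition"]}

stripped = strip_tokens("Highlight the parts related to non-compete", boilerplate)
rewritten = expand(stripped, synonyms)
print(stripped)
print(rewritten)
print(strip_tokens("the office of the registrar", {"of", "the"}))
```

One thing this sketch makes visible: order and term choice interact. If your boilerplate list contained "of", a key phrase such as "change of control" would no longer match after stripping, so it is worth inspecting the rewritten queries before measuring.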

What about PRF / query expansion? Tested twice on RedHop, falsified twice with two different failure mechanisms. The dilution win here is subtraction at the query boundary, not addition. See CUAD_PRF_NULL for the mechanism that predicts where unweighted PRF will fail on a new workload.

  • Context optimization strategy, when to prune what was retrieved: Tips guide.
  • All parameters, the full reference: Options.
  • Loaders, every on-ramp (from_text, from_file, from_folder, from_bytes): Loaders.