Choosing a configuration

If you’re not sure which Document settings to use, this page helps you choose in about 60 seconds, based on what the documents you’re loading actually look like.

The decision tree

                  Your corpus is…
                        │
   ┌────────────────────┼─────────────────────────┐
   │                    │                         │
"normal" — code,    "structured" — a              "synonym-mismatch" — HR
API refs, runbooks, contract / policy with        FAQs, support tickets
handbooks, reports, near-duplicate clauses        where users ask in
mixed folders       (e.g. governing-law           different words than
                    overrides per region)         the docs use
        │               │                         │
        ▼               ▼                         ▼
  Document.from_file(p)   Document.from_file(p,    Document.from_file(p,
  doc.context(q)          retrieval="hybrid",     retrieval="hybrid",
                          model="bge-small")      model="bge-small",
                                                  rerank="cross-encoder")
                          doc.context(q,
                              include_heading=True,
                              neighbors=1)

Three recipes cover the practical space.
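
The tree can also be written down as a lookup table. The sketch below is plain Python and does not import redhop: RECIPES and recipe_for are hypothetical names introduced here for illustration, and only the keyword arguments are taken from the tree above.

```python
# Hypothetical helper: encodes the decision tree as data.
# Only the keyword arguments come from this page; the dict layout and
# the function name are assumptions made for this sketch.
RECIPES = {
    "normal": {"load": {}, "context": {}},
    "structured": {
        "load": {"retrieval": "hybrid", "model": "bge-small"},
        "context": {"include_heading": True, "neighbors": 1},
    },
    "synonym-mismatch": {
        "load": {
            "retrieval": "hybrid",
            "model": "bge-small",
            "rerank": "cross-encoder",
        },
        "context": {},
    },
}


def recipe_for(corpus_kind: str) -> dict:
    """Return load/context keyword arguments for one branch of the tree."""
    try:
        return RECIPES[corpus_kind]
    except KeyError:
        raise ValueError(f"unknown corpus kind: {corpus_kind!r}") from None


recipe = recipe_for("structured")
# Intended use (not executed here, since it needs redhop and a real file):
#   doc = redhop.Document.from_file(path, **recipe["load"])
#   ctx = doc.context(query, **recipe["context"])
print(recipe["load"])
```

Keeping the settings as data makes it easy to log which branch a given corpus was assigned to.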

The recipes

Default: for most docs

No model download, no ONNX runtime, and ~50ms warm queries.

import redhop

doc = redhop.Document.from_file("contract.pdf")
ctx = doc.context("What is the governing law?")
prompt = ctx.text()         # feed to any LLM
print(ctx.report)           # see what was retrieved and why

const { Document } = require("redhop");
const doc = Document.fromFile("contract.pdf");
const ctx = doc.context("What is the governing law?");
const prompt = ctx.text;
console.log(ctx.report.rendered);

let mut doc = redhop::read_file("contract.pdf")?;
let ctx = doc.context("What is the governing law?")?;
let prompt = ctx.text();

When this is right: code, API references, internal docs, runbooks, financial reports, handbooks, mixed folders (from_folder). The queries share vocabulary with the answers, which is the case for most technical and policy content.

Structured docs with parallel clauses

Hybrid + heading-aware retrieval. Adds an ~80MB embedding model download on first run, and warm queries climb to ~150ms. Worth it only if your doc has parallel clauses such as “main clause X”, “EU override of clause X”, and “Japan override of clause X”; heading awareness disambiguates them.

doc = redhop.Document.from_file(
    "msa.pdf",
    retrieval="hybrid",
    model="bge-small",
)
ctx = doc.context(
    "What law applies in the UK?",
    include_heading=True,
    neighbors=1,
)

const doc = Document.fromFile("msa.pdf", {
  retrieval: "hybrid",
  model: "bge-small",
});
const ctx = doc.context(
  "What law applies in the UK?",
  undefined,   // budget — keep default
  1,           // neighbors
  true,        // includeHeading
);

let mut doc = redhop::read_file_with("msa.pdf", &redhop::LoadOptions {
    retrieval: Some("hybrid".into()),
    model: Some("bge-small".into()),
    ..Default::default()
})?;
let ctx = doc.context_with("What law applies in the UK?", &redhop::ContextOptions {
    include_heading: true,
    neighbors: 1,
    ..Default::default()
})?;

When this is right: legal contracts with regional variations, multi-jurisdiction policies, vendor security questionnaires with repeated sub-sections. When it’s wrong: clean single-chapter docs. Adding neighbors=1 to well-structured chapters can dilute well-targeted retrieval rather than help it.

Synonym-mismatch corpora

Adds a cross-encoder reranker, which closes the synonym gap (the canonical “employee left” vs “staff terminated” case). Adds ~300MB of model download and raises query latency 5–10× over hybrid (about 1000ms warm versus ~150ms).

doc = redhop.Document.from_file(
    "support_kb.md",
    retrieval="hybrid",
    model="bge-small",
    rerank="cross-encoder",
)
ctx = doc.context("why did the worker leave?")

const doc = Document.fromFile("support_kb.md", {
  retrieval: "hybrid",
  model: "bge-small",
  rerank: "cross-encoder",
});
const ctx = doc.context("why did the worker leave?");

When this is right: corpora where queries and answers regularly share no surface words (HR, support FAQs translated from internal phrasing, multilingual content). When it’s wrong: anywhere the lexical default already works, where it adds latency without recovering anything. Verify that it helps on your corpus before adopting it.

Trade-offs at a glance

	Lexical default	Hybrid + bge	+ cross-encoder rerank
First-run model download	none	~80MB (bge-small)	+ ~300MB (cross-encoder)
Warm query latency	~50ms	~150ms	~1000ms
Compile-time deps	none	ONNX runtime	ONNX runtime
Where it helps	most document QA	regional overrides, parallel sub-sections	synonym-mismatch retrieval
Where it hurts	—	adds latency on docs lexical already handles	adds latency without recovering anything unless the failure mode is synonym mismatch
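
One way to use the table is as a budget check. The sketch below hard-codes the approximate figures from the table (TIERS and tiers_within_budget are hypothetical names, not part of redhop) and lists the tiers whose warm latency fits a given budget. Download sizes are treated as cumulative, so the rerank tier is counted as roughly 80MB + 300MB.

```python
# Approximate figures copied from the trade-off table; treat them as
# rough guides, not guarantees. All names here are illustrative only.
TIERS = [
    # (name, cumulative download in MB, warm latency in ms)
    ("lexical default", 0, 50),
    ("hybrid + bge-small", 80, 150),
    ("hybrid + cross-encoder rerank", 80 + 300, 1000),
]


def tiers_within_budget(max_latency_ms: float) -> list[str]:
    """Return the tier names whose warm latency fits the budget."""
    return [name for name, _mb, ms in TIERS if ms <= max_latency_ms]


print(tiers_within_budget(200))
```

A latency budget only narrows the options. Whether a heavier tier is worth its cost still depends on the failure mode in your corpus.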

Query writing: the part the user controls

The library can only retrieve what your query gives it. Three patterns no config can fix:

1. One-word polysemy queries

'vendor' retrieves the vendor-management section, not the liability cap (even when both mention vendors). 'settle' can retrieve the indemnification clause (“settle a claim”) rather than the arbitration clause (“settle a dispute”), even with a cross-encoder reranker, because both readings are defensible.

Fix it in the query, not the config: add one disambiguating word. 'liability cap for vendor' correctly finds the cap clause. 'arbitration forum to settle disputes' finds the arbitration clause.

2. Natural-language paraphrase with no shared vocabulary

'How long do I have to cancel and get my money back?' against a contract that uses “refund” and “termination for convenience” (not “cancel” or “money back”) can return an empty or weak context across every tier.

Fix in the query: use the doc’s vocabulary. “What’s the refund window?” finds the relevant clause immediately.

Fix at the config level (sometimes): retrieval="hybrid" adds a dense embedder that can match “refund” to “cancel” through semantic similarity. Hybrid is a strict superset of lexical (a BM25-tail fallback fills in any chunks the dense pool missed), so you never lose candidates by turning it on. The cost is the ~80MB embedder download and ~3× warm latency (~50ms to ~150ms).
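
To see why the paraphrase comes back empty under lexical retrieval, it helps to count shared tokens. The sketch below is plain Python with a made-up clause sentence (an assumption for illustration, not text from any real contract) and a deliberately crude tokenizer. redhop's actual tokenization and scoring are not described on this page and may differ.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase and split on non-letters: a crude stand-in tokenizer."""
    return set(re.findall(r"[a-z]+", text.lower()))


# Toy clause, invented for this example.
clause = "Customer may request a refund following termination for convenience."

paraphrase = "How long do I have to cancel and get my money back?"
doc_vocab_query = "What's the refund window?"

print(tokens(paraphrase) & tokens(clause))       # no shared terms
print(tokens(doc_vocab_query) & tokens(clause))  # shares "refund"
```

With no shared terms, a purely lexical scorer has nothing to rank on, which is the gap a dense embedder or a reworded query is meant to close.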

3. Templated queries with heavy boilerplate

If every query in your workload follows a fixed template (“Highlight the parts (if any) of this contract related to X that should be reviewed by a lawyer. Details: …”, “Help me with X, my account is Y, the error is Z”, form-filled queries from a structured UI), the boilerplate competes with the signal. BM25 weights each query term by corpus IDF, not by how often the term appears across your query set, so it cannot tell that the template words repeat in every query. In a 24-word template like CUAD’s, the 19 boilerplate words dilute the 5 real signal words, and retention suffers.

This is measured. On CUAD’s fixed 24-word template, stripping the boilerplate to just <clause name> <details> before calling context() lifts ≥0.8 retention from 81.3% to 87.7% (n=300, BM25, budget 2,000 tok), overtaking LlamaIndex’s 86%. Full mechanism and numbers: CUAD_CLAUSE_EXPANSION (the controlled three-arm run).
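
Stripping at the token level, rather than by substring, is what keeps the signal words intact. The sketch below is a plain-Python illustration of that idea, not redhop's implementation: strip_boilerplate is a hypothetical helper, the template wording is taken from the example above, and the clause name and details are invented.

```python
import re


def strip_boilerplate(query: str, boilerplate_terms: list[str]) -> str:
    """Drop whole tokens that appear in the boilerplate list."""
    drop = {term.lower() for term in boilerplate_terms}
    kept = [t for t in re.findall(r"[A-Za-z0-9]+", query) if t.lower() not in drop]
    return " ".join(kept)


# Words of the fixed template; everything else is treated as signal.
TEMPLATE_WORDS = (
    "highlight the parts if any of this contract related to "
    "that should be reviewed by a lawyer details"
).split()

query = (
    'Highlight the parts (if any) of this contract related to "Governing Law" '
    "that should be reviewed by a lawyer. Details: office lease jurisdiction"
)

stripped = strip_boilerplate(query, TEMPLATE_WORDS)
print(stripped)  # Governing Law office lease jurisdiction

# A naive substring replace would damage signal words instead:
print("office".replace("of", ""))  # fice
```

Note that a plain token filter like this one would also drop a template word if it reappeared in the details, so the boilerplate list is worth reviewing by hand.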

Two paths up the same hill: pick one, don’t combine. Measured on CUAD (CUAD_HYBRID_RERANK):

path	what you do	retention	latency
One-knob	`retrieval="hybrid"` (BGE-small embedder)	~86–88%	~10 ms/q
Best-quality	BM25 default + `analyze_query_set` → `Stripper` + `Vocabulary` chain via `doc.context_with_rewrites(...)`	90.7%	~2.5 ms/q

Hybrid retrieval reads chunks as semantic content rather than counting tokens, so the boilerplate ratio stops mattering. It substitutes for template stripping by a different mechanism. Running both gives diminishing returns: once one mechanism has fixed the boilerplate dilution, the other adds only +0.3 points. Strip + expand is Pareto-optimal on CUAD (higher retention AND lower latency) but takes the upfront work of writing a stripper and building a synonym dict.
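
The Pareto claim can be checked directly from the two rows of the table. A minimal sketch, using the upper end of the hybrid retention range (88%) as the most favorable case for hybrid. The Path tuple and dominates function are hypothetical names used only here.

```python
from typing import NamedTuple


class Path(NamedTuple):
    name: str
    retention_pct: float   # higher is better
    latency_ms: float      # lower is better


def dominates(a: Path, b: Path) -> bool:
    """True if a is at least as good as b on both axes and better on one."""
    no_worse = a.retention_pct >= b.retention_pct and a.latency_ms <= b.latency_ms
    better = a.retention_pct > b.retention_pct or a.latency_ms < b.latency_ms
    return no_worse and better


# Figures from the CUAD table; 88.0 is the top of the "~86-88%" range.
hybrid = Path("one-knob hybrid", 88.0, 10.0)
strip_expand = Path("strip + expand", 90.7, 2.5)

print(dominates(strip_expand, hybrid))
print(dominates(hybrid, strip_expand))
```

The comparison covers only retention and latency. The upfront effort of writing the stripper and synonym dict is a third axis that it does not capture.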

Recommended workflow if you go the best-quality path: detect → compile → run-through-rewrites → A/B. Every step ships in the public API. The rewrites compile once (the analyzer pass happens at construction time, not per query) and run through doc.context_with_rewrites(query, [stripper, vocab]). The per-stage audit trail lands on ctx.report.query_rewrites so every transform is observable:

import redhop

# 1. Detect — analyzer reports the shape of your query set.
report = redhop.analyze_query_set(my_queries[:300])
# Cross-workload probe (findings/QUERY_SET_ANALYZER.md):
#   CUAD     → is_templated=True,  share=0.66, cost="high"
#   HotpotQA → is_templated=False, share=0.00, cost="none"
#   MuSiQue  → is_templated=False, share=0.12, cost="none"

if report.is_templated:
    # 2. Compile the rewrites. Stripper compiles the boilerplate
    #    once via the analyzer — token-level matching, so an "of"
    #    stripper does NOT erase the "of" inside "office".
    stripper = redhop.Stripper(report.boilerplate_terms)

    # 3. (optional) Vocabulary. If your workload has a known taxonomy
    #    of "topics" with predictable synonyms (clause types, error
    #    codes, issue categories), compile them once. On CUAD this
    #    lifts retention from 87.7% to 90.7% on top of the Stripper.
    #    Mechanism: workload-curated high-IDF discriminators raise
    #    the BM25 score of the relevant chunk. Opposite mechanism
    #    direction from PRF (which fails on boilerplate-heavy
    #    corpora; see CUAD_PRF_NULL).
    vocab = redhop.Vocabulary({
        # YOUR workload's keys → synonyms. Worked CUAD example in
        # CUAD_CLAUSE_EXPANSION.md.
        "change of control": ["merger", "successor", "acquisition"],
        "non-compete":       ["restraint", "non-competition"],
    })

    # 4. Run the chain inside context_with_rewrites; the per-stage
    #    audit lands on ctx.report.query_rewrites automatically.
    doc = redhop.Document.from_text(your_document)
    ctx_a = doc.context(user_query)                              # baseline
    ctx_b = doc.context_with_rewrites(user_query, [stripper, vocab])

    # 5. A/B — redhop.evaluate scores both arms deterministically.
    #          No LLM judge; same primitives the Decision Report uses.
    eval_a = redhop.evaluate(user_query, ctx_a, gold_chunks=gold_ids)
    eval_b = redhop.evaluate(user_query, ctx_b, gold_chunks=gold_ids)
    # eval_b.overall - eval_a.overall is the per-query lift.

    # 6. Inspect — every rewrite is observable.
    for rec in ctx_b.report.query_rewrites:
        print(rec.stage, "matched=", rec.matched,
              "added=", rec.added, "removed=", rec.removed)

The analyzer is conservative by design: HotpotQA and MuSiQue both register quiet on the probe (is_templated=False), while CUAD fires (is_templated=True, share 0.66). The analyzer measures the shape of your queries. It does not promise a specific retention lift. The +6 point lift from stripping was measured on CUAD specifically. On a different templated workload the magnitude depends on how much of your real signal was being drowned, which is why the A/B step (step 5) matters.
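
Because the lift is workload-specific, the A/B comparison is usually aggregated over a sample of queries rather than read off a single query. The sketch below assumes you have already collected one overall score per query for each arm (for example from eval_a.overall and eval_b.overall in the workflow above). mean_lift is a hypothetical helper, and the scores are placeholder values, not measurements.

```python
from statistics import mean


def mean_lift(baseline: list[float], rewritten: list[float]) -> float:
    """Average per-query difference (rewritten minus baseline)."""
    if len(baseline) != len(rewritten):
        raise ValueError("both arms must score the same queries")
    if not baseline:
        raise ValueError("need at least one query")
    return mean(b - a for a, b in zip(baseline, rewritten))


# Placeholder scores for four queries; substitute your own results.
scores_a = [0.50, 0.75, 1.00, 0.25]   # baseline arm
scores_b = [0.75, 0.75, 1.00, 0.50]   # with rewrites

print(mean_lift(scores_a, scores_b))
```

A mean near zero on your own sample is a sign that the rewrites are not addressing the failure mode in your workload.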

For single-doc extraction workloads, also set strategy="raw_topk". On contract-shape tasks, the Auto-routed reasoning_preserving strategy solves a multi-hop problem you don’t have, and raw_topk beats it by ~4 points at every chunk size.

We deliberately do not ship a CUAD-specific strip_template() helper. Templates are workload-specific, and embedding one into the library would make the wrong call for the next workload. Stripper(...) and Vocabulary({...}) take your boilerplate / synonym dict so the call stays on your side.

What about PRF / query expansion? Tested twice on RedHop, falsified twice with two different failure mechanisms. The dilution win here is subtraction at the query boundary, not addition. See CUAD_PRF_NULL for the mechanism that predicts where unweighted PRF will fail on a new workload.