
RedHop: the context layer that shows its work


RedHop makes RAG easy. Hand it your documents and a question, and it pulls just the sections that matter, packages them for your LLM, and explains every decision, with citations back to the source. On real contracts it cut prompt tokens by 80% while keeping the gold evidence, at about 1.7 ms per query. Python, Node, and Rust bindings share a single Rust core. Chunking, retrieval, and token-budgeting run in-process, so there is nothing to wire up and no service to run.


import redhop
from openai import OpenAI

query = "What is the governing law?"

doc = redhop.Document.from_file("contract.pdf")   # parsed + indexed
ctx = doc.context(query)   # just the sections that matter

resp = OpenAI().chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"{ctx.text()}\n\nQuestion: {query}"}],
)

A short, fixed pipeline. You bring the documents. RedHop owns chunking, retrieval, and allocation. You get back a context object with the prompt, the citations, and a Decision Report.

RedHop pipeline: Document(s) (PDF, DOCX, PPTX, XLSX, MD, code, bytes, or a folder) feed into Chunking (sentence-budgeted, 128-token default), then Retrieval (BM25, Hybrid, or Dense; no ANN, no vector DB), then Allocation (ContextStrategy, size-gated Auto), producing a BuiltContext with the prompt, chunks, citations, and a Decision Report. Italic labels in the diagram point to the calibrating finding in docs/findings/.
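Sentence-budgeted chunking, the first stage in that pipeline, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not RedHop's implementation: whitespace-separated words stand in for model tokens, sentences are split on terminal punctuation, and `chunk_sentences` / `approx_tokens` are hypothetical names.

```python
import re


def approx_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())


def chunk_sentences(text: str, budget: int = 128) -> list[str]:
    """Group whole sentences into chunks of at most `budget` approximate tokens.

    A single sentence longer than the budget becomes its own chunk instead of
    being cut mid-sentence.
    """
    # Naive splitter (assumption): break after ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for sentence in sentences:
        n = approx_tokens(sentence)
        # Close the current chunk if this sentence would push it over budget.
        if current and used + n > budget:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "This Agreement is governed by the laws of Delaware. "
    "Either party may terminate on notice. Fees are due monthly."
)
print(chunk_sentences(text, budget=12))
```

With a 12-word budget, the first sentence (9 words) fills its own chunk and the next two sentences (6 + 4 words) share the second. Keeping sentences whole trades a little budget efficiency for chunks that read cleanly as citations.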

The whole surface is load → ask → read. Each call returns a normal Python object you can hand to any LLM client.

01 · Load

Point at a file or folder

doc = Document.from_file("contract.pdf")

A whole PDF is parsed, chunked, and indexed in a millisecond or two. from_folder handles a directory the same way.

02 · Ask

Send a question

ctx = doc.context("What is the governing law?")

Retrieval, token-budgeting, and the assembly decision all happen in-process. No service to call.
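Token-budgeting can be pictured as packing ranked chunks into a fixed budget. The sketch below is hypothetical: `Scored`, `pack_budget`, and the greedy highest-score-first policy are assumptions and are not RedHop's `ContextStrategy`. It keeps the best-scoring chunks that fit, logs why each chunk was kept or dropped (the kind of record a Decision Report carries), and restores document order for the prompt.

```python
from dataclasses import dataclass


@dataclass
class Scored:
    chunk_id: int  # position of the chunk in the document
    text: str
    score: float  # retrieval score; higher is more relevant


def pack_budget(candidates: list[Scored], budget: int) -> tuple[list[Scored], list[str]]:
    """Greedily keep the highest-scoring chunks that fit within `budget` tokens."""
    kept: list[Scored] = []
    log: list[str] = []
    used = 0
    for cand in sorted(candidates, key=lambda s: s.score, reverse=True):
        cost = len(cand.text.split())  # assumption: whitespace tokens
        if used + cost <= budget:
            kept.append(cand)
            used += cost
            log.append(f"kept chunk {cand.chunk_id} (score {cand.score:.2f}, {cost} tokens)")
        else:
            log.append(
                f"dropped chunk {cand.chunk_id}: would exceed budget ({used}+{cost}>{budget})"
            )
    kept.sort(key=lambda s: s.chunk_id)  # restore document order for the prompt
    return kept, log


candidates = [
    Scored(0, "governed by the laws", 0.9),
    Scored(1, "fees due", 0.2),
    Scored(2, "of Delaware state", 0.5),
]
kept, log = pack_budget(candidates, budget=7)
for line in log:
    print(line)
```

Greedy packing is the simplest policy. A real allocator might also weigh chunk adjacency or reserve room for the question, which is why the strategy is worth reading off the report rather than assuming.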

03 · Read

Use the result

ctx.text()      # prompt
ctx.citations   # sources
ctx.report      # decision

One context object carries the prompt for the LLM, the per-chunk citations, and the Decision Report.
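To make the read step concrete, here is a hypothetical stand-in for such a context object. `ContextSketch` and `Citation` are illustrative names, the character-offset fields are assumptions, and the tagged prompt format is not necessarily what RedHop's `ctx.text()` emits; the point is only that prompt text, citations, and report travel together in one value.

```python
from dataclasses import dataclass, field


@dataclass
class Citation:
    source: str  # file the chunk came from
    chunk_id: int
    start: int  # assumed: character offset where the chunk begins in the source
    end: int  # assumed: character offset where the chunk ends


@dataclass
class ContextSketch:
    chunks: list[str]
    citations: list[Citation]  # one citation per chunk, same order
    report: dict = field(default_factory=dict)  # stand-in for a Decision Report

    def text(self) -> str:
        # Tag each chunk with its source so an answer can point back to it.
        parts = [
            f"[{cite.source}#{cite.chunk_id}] {chunk}"
            for chunk, cite in zip(self.chunks, self.citations)
        ]
        return "\n\n".join(parts)


ctx = ContextSketch(
    chunks=["This Agreement is governed by the laws of Delaware."],
    citations=[Citation("contract.pdf", 7, 1200, 1251)],
    report={"kept": [7], "strategy": "greedy"},
)
print(ctx.text())
```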

Different corpora reward different retrieval strategies. RedHop offers three layers of tuning, each measurable on the same Decision Report.

Retrieval
  • Lexical (BM25) by default
  • Hybrid with a small dense embedder
  • Cross-encoder rerank tier
Rewrites
  • Template stripping
  • Vocabulary expansion
  • Per-stage audit on the report
Eval
  • No LLM judge
  • Same engine as runtime
  • Milliseconds per query
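The default lexical tier is BM25. A compact textbook Okapi BM25 scorer looks like the sketch below. It is not RedHop's Rust implementation; the `\w+` tokenizer and the `k1=1.2`, `b=0.75` defaults are common choices assumed here for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase word tokens, punctuation dropped.
    return re.findall(r"\w+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n if n else 0.0
    # Document frequency: how many documents contain each term.
    df: Counter = Counter()
    for toks in tokenized:
        df.update(set(toks))
    q_terms = tokenize(query)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        dl = len(toks)
        s = 0.0
        for term in q_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            f = tf[term]
            # Term-frequency saturation (k1) and length normalization (b).
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores


docs = [
    "This agreement is governed by the laws of Delaware.",
    "Payment is due within thirty days.",
    "Governing law: the laws of New York apply.",
]
print(bm25_scores("governing law", docs))
```

In this example the first clause is relevant but scores zero, because "governed" and "laws" never match "governing" and "law" token-for-token. That vocabulary gap is what the rewrite layer targets.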
CUAD retention lift across the detect, strip, expand workflow: 81.3% on the raw 24-word CUAD template, 87.7% after Stripper, and 90.7% after also applying a workload Vocabulary. LlamaIndex sits at 86%, between the two RedHop steps; the same Stripper applied to LlamaIndex lifts it to 94%. Every rewrite stage lands on the Decision Report as an audit trail.
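The strip and expand stages can be sketched as two query rewrites that each append an audit entry. Everything here is hypothetical: the template word list, the vocabulary map, and `rewrite` are illustrative, and RedHop's Stripper and Vocabulary may detect and apply rewrites differently.

```python
import re

# Hypothetical boilerplate shared by every query in a templated workload.
TEMPLATE_WORDS = {
    "highlight", "the", "parts", "if", "any", "of", "this", "contract",
    "related", "to", "that", "should", "be", "reviewed", "by", "a", "lawyer",
}

# Hypothetical workload vocabulary: query term -> document-side variants.
VOCABULARY = {"governing": ["governed"], "law": ["laws"]}


def rewrite(query: str) -> tuple[str, list[dict]]:
    """Strip template words, then expand with vocabulary, auditing each stage."""
    audit: list[dict] = []
    terms = re.findall(r"\w+", query.lower())
    audit.append({"stage": "input", "terms": terms})

    # Stage 1: drop words that belong to the shared template, not the question.
    stripped = [t for t in terms if t not in TEMPLATE_WORDS]
    removed = [t for t in terms if t in TEMPLATE_WORDS]
    audit.append({"stage": "strip", "removed": removed, "terms": stripped})

    # Stage 2: append document-side variants so lexical retrieval can match them.
    expanded = list(stripped)
    for t in stripped:
        for extra in VOCABULARY.get(t, []):
            if extra not in expanded:
                expanded.append(extra)
    audit.append({"stage": "expand", "added": expanded[len(stripped):], "terms": expanded})

    return " ".join(expanded), audit


q = ("Highlight the parts (if any) of this contract related to "
     "Governing Law that should be reviewed by a lawyer.")
new_query, audit = rewrite(q)
print(new_query)
for entry in audit:
    print(entry["stage"], entry["terms"])
```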

Choosing a configuration → · Benchmarks →

You have a contract.pdf and one question: “What is the governing law?” Here’s the RedHop code path that gets the LLM the right context; the full head-to-head benchmark against the other libraries, at the same answer quality, is on the comparison page.

import redhop
from openai import OpenAI

query = "What is the governing law?"
# parsed, chunked, retrieved, and token-budgeted internally
ctx = redhop.Document.from_file("contract.pdf").context(query)

response = OpenAI().chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"{ctx.text()}\n\nQuestion: {query}"}],
)
print(response.choices[0].message.content)

What you stand up: nothing. Point it at the file and ask. Parsing, chunking, retrieval, and token-budgeting happen in-process, and every call returns a Decision Report explaining what it kept and why.

Honest positioning, so you can decide before you invest the time. The wins are measured, but they are workload-shaped, so check them against your own workload.

Pick RedHop when
  • You want three small calls (load → ask → read) instead of a framework
  • You need a Decision Report and per-rewrite audit trail (regulated or debugging-heavy contexts)
  • You’re doing multi-hop reasoning where the bridge passage matters
  • Your corpus fits in memory
  • Your team ships in Python, Node, or Rust and wants the same engine, defaults, and reports in each
Look elsewhere when
  • You have millions of chunks and need ANN-scale search infrastructure
  • Your queries share almost no vocabulary with your documents; a dense-first vector stack is the better fit
  • You want a broad connector and agent ecosystem rather than a focused context layer
  • You want a managed or hosted offering; RedHop is library-only
  • 81%: HotpotQA hybrid ≥0.8, ahead of LangChain (77%) and LlamaIndex (67%)
  • +8 / +5: multi-hop retention lead on HotpotQA / MuSiQue (n=300 each)
  • −80%: prompt tokens on real contracts, gold evidence kept, at ~1.7 ms per query
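Because evaluation uses no LLM judge, figures like these reduce to deterministic arithmetic over which gold-evidence chunks survive. The sketch below assumes that retention means the fraction of gold chunks kept, and reads "≥0.8" as the share of queries reaching 0.8 retention; both are interpretations, and the function names are hypothetical rather than RedHop's eval API.

```python
def retention(gold_ids: set[int], kept_ids: set[int]) -> float:
    """Fraction of gold-evidence chunks present in the built context."""
    if not gold_ids:
        return 1.0  # assumption: nothing to retain counts as full retention
    return len(gold_ids & kept_ids) / len(gold_ids)


def token_reduction(full_tokens: int, context_tokens: int) -> float:
    """Share of prompt tokens saved versus sending the whole document."""
    return 1 - context_tokens / full_tokens


def hit_rate(per_query: list[float], threshold: float = 0.8) -> float:
    """Share of queries whose retention meets the threshold (assumed reading)."""
    return sum(r >= threshold for r in per_query) / len(per_query)


print(retention({1, 2, 3, 4}, {1, 2, 3, 9}))  # 3 of 4 gold chunks kept
print(token_reduction(10_000, 2_000))
print(hit_rate([0.9, 0.8, 0.5, 1.0]))
```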

Comparisons are apples-to-apples, with every library using the same bge-small embedder (measured with the post-0.3.1 pure-rerank fix). On MuSiQue, LangChain still leads narrowly (39% vs 34%). CUAD reaches 90.7% with the Stripper + Vocabulary recipe, but the same Stripper lifts LlamaIndex to 94%, so that result reflects a reproducible, audited workflow rather than a retrieval-engine win. RedHop’s hybrid tier is currently 2–5× slower than the competitors’ hybrid retrieval; this is a known open item. We have not measured against dense-only services at scale. Full head-to-head → · How to read this