RedHop: the context layer that shows its work
RedHop makes RAG easy. Hand it your documents and a question, and it pulls just the sections that matter, hands them to your LLM, and explains every decision, with citations back to the source. On real contracts it cut prompt tokens by 80% with the gold evidence kept, at about 1.7ms per query. Python, Node, and Rust over a single Rust core. Chunking, retrieval, and token-budgeting run in-process, with nothing to wire and no services to run.
import redhop
from openai import OpenAI
query = "What is the governing law?"
doc = redhop.Document.from_file("contract.pdf") # parsed + indexed
ctx = doc.context(query) # just the sections that matter
resp = OpenAI().chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"{ctx.text()}\n\nQuestion: {query}"}],
)

const { Document } = require("redhop");
const OpenAI = require("openai");
const query = "What is the governing law?";
const doc = Document.fromFile("contract.pdf"); // parsed + indexed
const ctx = doc.context(query); // just the sections that matter
const resp = await new OpenAI().chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: `${ctx.text}\n\nQuestion: ${query}` }],
});

use redhop::read_file;
let query = "What is the governing law?";
let mut doc = read_file("contract.pdf")?; // parsed + indexed
let ctx = doc.context(query)?; // just the sections that matter
// hand ctx.text() to any LLM client, no lock-in:
let prompt = format!(
    "{}\n\nQuestion: {query}", ctx.text(),
);
let answer = llm.complete(&prompt).await?;

How it works
A short, fixed pipeline. You bring the documents. RedHop owns chunking, retrieval, and allocation. You get back a context object with the prompt, the citations, and a Decision Report.
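To make the allocation step concrete, here is a minimal sketch of token-budgeted selection: keep the highest-scoring chunks that fit a budget, and log a reason for every chunk considered. This is an illustration only, not RedHop's implementation. The `Chunk` class, `count_tokens`, `allocate`, and the word-count tokenizer are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    text: str
    score: float  # relevance score from any retriever (hypothetical field)


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def allocate(chunks: list[Chunk], budget: int):
    """Greedily keep the highest-scoring chunks that fit within `budget` tokens.

    Returns the kept chunks plus one log entry per chunk considered.
    """
    kept, log, used = [], [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = count_tokens(chunk.text)
        if used + cost <= budget:
            kept.append(chunk)
            used += cost
            log.append((chunk.chunk_id, "kept", cost))
        else:
            log.append((chunk.chunk_id, "dropped: over budget", cost))
    return kept, log


chunks = [
    Chunk("s1", "This Agreement is governed by the laws of Delaware.", 0.92),
    Chunk("s2", "Either party may terminate with thirty days written notice.", 0.40),
    Chunk("s3", "Notices must be sent to the addresses listed in Schedule A.", 0.15),
]
kept, log = allocate(chunks, budget=20)
for entry in log:
    print(entry)
```

The log is the interesting part: recording why each chunk was kept or dropped is the kind of information a decision report can surface, whatever the real allocation policy looks like.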
Three calls do the work
The whole surface is load → ask → read. Each call returns a normal Python
object you can hand to any LLM client.
Point at a file or folder
doc = Document.from_file("contract.pdf")
A whole PDF is parsed, chunked, and indexed in a millisecond or two. from_folder handles a directory the same way.
Send a question
ctx = doc.context("What is the governing law?")
Retrieval, token-budgeting, and the assembly decision all happen in-process. No service to call.
Use the result
ctx.text() # prompt
ctx.citations # sources
ctx.report # decision
One context object carries the prompt for the LLM, the per-chunk citations, and the Decision Report.
Tune retrieval, measure the lift
Different corpora reward different retrieval. Three layers of tuning, each measurable on the same Decision Report.
- Lexical (BM25) by default
- Hybrid with a small dense embedder
- Cross-encoder rerank tier
- Template stripping
- Vocabulary expansion
- Per-stage audit on the report
- No LLM judge
- Same engine as runtime
- Milliseconds per query
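For readers unfamiliar with the lexical default, the sketch below scores documents with the standard Okapi BM25 formula. It is a textbook version written for illustration, not RedHop's engine. The whitespace tokenizer, the `bm25_scores` helper, and the parameter values `k1=1.5` and `b=0.75` are assumptions.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Simplistic tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "this agreement is governed by the laws of delaware",
    "either party may terminate with thirty days notice",
    "notices are sent to the addresses in schedule a",
]
scores = bm25_scores("governing law delaware", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Only "delaware" matches here, because "governing" and "law" do not appear verbatim in any document. Exact-term matching is what makes lexical scoring fast and explainable, and it is also why hybrid retrieval and vocabulary expansion exist as further tiers.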
CUAD retention: 81.3% on the raw template, 87.7% after Stripper, 90.7%
after also applying Vocabulary. LlamaIndex sits at 86%, between RedHop's
raw-template and Stripper results. Every rewrite stage lands on the
Decision Report as an audit trail.
Choosing a configuration → · Benchmarks →
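A retention figure like the ones above can be computed without an LLM judge. The sketch below assumes retention means the share of questions whose gold evidence text survives in the selected context. That definition, the `retention` helper, and the example records are assumptions for illustration, not the benchmark's actual harness.

```python
def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so formatting differences do not matter.
    return " ".join(text.lower().split())


def retention(examples: list[dict]) -> float:
    """Fraction of examples whose gold evidence appears in the selected context."""
    hits = sum(
        1 for ex in examples
        if normalize(ex["gold"]) in normalize(ex["context"])
    )
    return hits / len(examples)


examples = [
    {
        "gold": "governed by the laws of Delaware",
        "context": "This Agreement is governed by the laws of Delaware.",
    },
    {
        "gold": "thirty days written notice",
        "context": "Notices must be sent to the addresses in Schedule A.",
    },
]
rate = retention(examples)
print(f"retention: {rate:.1%}")
```

Because the check is plain string containment, it is deterministic and cheap to rerun after each configuration change.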
The same question, three ways
You have a contract.pdf and one question: “What is the governing law?” Here’s
the code path to get the LLM the right context in each library, with the same
answer quality. The full head-to-head benchmark is on the
comparison page.
import redhop
from openai import OpenAI

query = "What is the governing law?"

ctx = redhop.Document.from_file("contract.pdf").context(query)
# parsed, chunked, retrieved, and token-budgeted internally

response = OpenAI().chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"{ctx.text()}\n\nQuestion: {query}"}],
)
print(response.choices[0].message.content)

What you stand up: nothing. Point it at the file and ask; parsing, chunking, retrieval, and token-budgeting happen inside, and every call returns a Decision Report explaining what it kept and why.
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

query = "What is the governing law?"

pages = PyMuPDFLoader("contract.pdf").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
).split_documents(pages)

store = FAISS.from_documents(chunks, OpenAIEmbeddings())
retriever = store.as_retriever(search_kwargs={"k": 4})

prompt = ChatPromptTemplate.from_template(
    "Answer using only the context.\n\n{context}\n\nQuestion: {input}"
)

chain = (
    {"context": retriever, "input": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

print(chain.invoke(query))

What you stand up: a splitter (you choose chunk_size/overlap), an embedding model, a FAISS vector store, a retriever, a prompt template, and a retrieval chain. That is six wired pieces, and embeddings cost a call per chunk.
from llama_index.core import VectorStoreIndex, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.readers.file import PyMuPDFReader
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

query = "What is the governing law?"

Settings.embed_model = OpenAIEmbedding()
Settings.llm = OpenAI(model="gpt-4o-mini")

docs = PyMuPDFReader().load(file_path="contract.pdf")

index = VectorStoreIndex.from_documents(
    docs,
    transformations=[SentenceSplitter(chunk_size=512, chunk_overlap=50)],
)

engine = index.as_query_engine(similarity_top_k=4)
print(engine.query(query))

What you stand up: a node parser, an embedding model, a vector index, and a query engine. Cleaner than LangChain, but still an embed-and-index pipeline you own and pay for.
Where RedHop fits
Honest positioning so you can decide before you invest the time. The wins are measured, and they are workload-shaped, so check yours against them.
RedHop is a good fit when:
- You want three small calls (load → ask → read) instead of a framework
- You need a Decision Report and per-rewrite audit trail (regulated or debugging-heavy contexts)
- You’re doing multi-hop reasoning where the bridge passage matters
- Your corpus fits in memory
- Your team ships in Python, Node, or Rust and wants the same engine, defaults, and reports in each
Look elsewhere when:
- You have millions of chunks and need ANN-scale search infrastructure
- Your queries share almost no vocabulary with your documents; a dense-first vector stack is the better fit
- You want a broad connector and agent ecosystem rather than a focused context layer
- You want a managed or hosted offering; RedHop is library-only
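One way to check the vocabulary point against your own workload is to measure how many query terms actually occur in your documents. The sketch below is a rough diagnostic under simple assumptions: the `terms` and `query_overlap` helpers are hypothetical, tokenization is a plain regex, and there is no stemming or stopword handling.

```python
import re


def terms(text: str) -> set[str]:
    # Distinct lowercase alphanumeric tokens.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def query_overlap(query: str, corpus: list[str]) -> float:
    """Share of distinct query terms that occur anywhere in the corpus."""
    vocab = set().union(*(terms(doc) for doc in corpus))
    q = terms(query)
    return len(q & vocab) / len(q) if q else 0.0


corpus = [
    "This Agreement is governed by the laws of Delaware.",
    "Either party may terminate with thirty days written notice.",
]
high = query_overlap("Which laws govern this agreement?", corpus)
low = query_overlap("Can I cancel early?", corpus)
print(high, low)
```

If typical queries score near zero on a sample of your corpus, lexical retrieval has little to match on, and a dense-first stack is worth evaluating first.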
These comparisons are apples-to-apples, run with the same bge-small (post-0.3.1 pure-rerank fix). On MuSiQue, LangChain still leads narrowly (39% vs 34%). CUAD reaches 90.7% with the Stripper + Vocabulary recipe, but the same Stripper lifts LlamaIndex to 94%, so that result reflects a reproducible, audited workflow, not a retrieval-engine win. RedHop’s hybrid tier is currently 2–5× slower than the competitors’ hybrid, a known open item. We have not measured against dense-only services at scale.
Full head-to-head → · How to read this