
Loaders

RedHop owns everything from chunking onward. You just get content in. There are a few on-ramps: text you already have, a file or a whole folder, or bytes straight from a cloud bucket. All of them return a Document and take the same options (Retrieval options). The API is the same in Python, Node, and Rust; the examples here use Python.

| On-ramp | For |
| --- | --- |
| from_text | text you already have (your own parser/OCR, a DB field) |
| from_chunks | content you already chunked |
| from_file | a file on disk: PDF, DOCX, PPTX, XLSX, or text/code |
| from_bytes | bytes you already fetched: S3 / Azure Blob / GCS / HTTP / DB blobs |
| from_folder | a whole folder in one index (with an optional incremental on-disk index) |

Got the text already (a DB field, an API response, your own parser or OCR)? Hand it straight to RedHop. This is also the escape hatch for formats from_file doesn’t parse, or scanned PDFs that need OCR first.

import redhop
doc = redhop.Document.from_text(my_text)

Already split your content (your own chunker, a DB dump, a data dictionary)? Wrap each as a typed Chunk: source flows through to citations, id is your stable identifier, and metadata is an open dict whose page / heading / line keys are picked up by the citation getter:

doc = redhop.Document.from_chunks([
    redhop.Chunk(
        "clause one …",
        source="msa.pdf",
        id="c1",
        metadata={"page": 1, "heading": "Definitions"},
    ),
    redhop.Chunk(
        "clause two …",
        source="msa.pdf",
        id="c2",
        metadata={"page": 1, "heading": "Definitions"},
    ),
])

Strings and plain dicts are no longer accepted (0.3.0 breaking change). The typed constructor was added so manually-built chunks can carry the same metadata into citations that file loaders populate. See redhop.Chunk below for the full field list.

Point it at a path and RedHop reads, parses, chunks, and indexes it: PDF, DOCX, PPTX, XLSX, or any text/code file. It records the file path as each chunk’s source, plus the chunk’s structural location (citations), so a citation points at contract.pdf, p.3, or notes.md → Setup, or main.py:42, not just the filename.

doc = redhop.Document.from_file("contract.pdf")
ctx = doc.context("how long is the refund window?")

Indexing is type-aware. Each format carries the structural location it has:

| Format | How it’s split | Citation |
| --- | --- | --- |
| Markdown | by heading | heading + line |
| Code (.py .ts .rs …) | by definition (function/class), kept verbatim | symbol + line |
| Text / data | by blank-line block | line |
| DOCX | paragraphs (heading-aware) + tables | heading |
| PPTX | one section per slide | page (slide #) |
| XLSX / ODS | one section per sheet, rows pipe-joined | heading (sheet) |
| PDF | one section per page | page |

Code is chunked verbatim (formatting preserved) and labeled with its nearest definition (auth.py → def login), and prose is sentence-packed. Code is also retrieved lexically under hybrid. See Retrieval options.
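
To make the "by heading" row concrete, here is a minimal sketch of heading-based splitting that records the heading and starting line for each section. It is an illustration only, not RedHop's implementation: the function name `split_markdown_by_heading` and the returned dict shape are hypothetical, and it deliberately ignores edge cases such as `#` lines inside fenced code.

```python
import re

# ATX-style headings: one to six '#' characters followed by the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown_by_heading(text):
    """Split Markdown into sections, one per heading (hypothetical helper).

    Returns a list of dicts with 'heading', 'line' (1-based line where the
    section starts), and 'text'. Sections with an empty body are dropped.
    """
    sections = []
    current = {"heading": None, "line": 1, "lines": []}
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = HEADING.match(line)
        if match:
            # Close the section in progress before starting a new one.
            if any(l.strip() for l in current["lines"]):
                sections.append(current)
            current = {"heading": match.group(2), "line": lineno, "lines": []}
        else:
            current["lines"].append(line)
    if any(l.strip() for l in current["lines"]):
        sections.append(current)
    return [
        {"heading": s["heading"], "line": s["line"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]
```

Text before the first heading ends up in a section whose heading is None, which mirrors how a citation can carry a line without a heading.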

The file isn’t always on local disk. Sometimes it’s in S3, Cloudflare R2, Azure Blob, GCS, behind a URL, or in a DB column. RedHop doesn’t bundle cloud SDKs (credentials and auth aren’t its job). Fetch the bytes with the client you already have, and hand them over. The name (e.g. "contract.pdf") picks the parser and becomes the citation source, so pass something meaningful like the object key.

import boto3, redhop
obj = boto3.client("s3").get_object(Bucket="my-bucket", Key="contract.pdf")
doc = redhop.Document.from_bytes(obj["Body"].read(), source="s3://my-bucket/contract.pdf")

Every selected chunk remembers where it came from: source plus whichever of page / heading / line the format provides (the others are None):

for c in ctx.citations:
    print(c["source"], c["page"], c["heading"], c["line"])
# contract.pdf 3 None None → "contract.pdf, p.3"
# notes.md None Setup 12 → "notes.md → Setup"

This is what lets you show the model’s evidence trail (“answer grounded in contract.docx → Refund Policy”) without a separate store. from_text and from_chunks give you source and the chunk text, plus any page / heading / line keys you set in a Chunk’s metadata. For the file loaders, the structural fields are filled in per format.
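
If you want the display strings shown in the comments above, a small formatter is enough. The sketch below assumes each citation is a dict-like object with source, page, heading, and line keys, as in the loop above; `format_citation` is a hypothetical helper, not part of RedHop.

```python
def format_citation(citation):
    """Render a citation mapping as a short label (hypothetical helper).

    Prefers page, then heading, then line, and falls back to the bare source.
    """
    source = citation.get("source") or "unknown source"
    page = citation.get("page")
    heading = citation.get("heading")
    line = citation.get("line")
    if page is not None:
        return f"{source}, p.{page}"
    if heading is not None:
        return f"{source} → {heading}"
    if line is not None:
        return f"{source}:{line}"
    return source
```

The precedence (page, then heading, then line) is a design choice of this sketch; swap the order if your UI prefers line numbers for Markdown.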

The on-ramp for file apps and coding agents: index a whole folder and ask, no vector DB to operate. RedHop walks the directory, reads every file it can, and builds one combined index: each chunk keeps its own file path as source, so citations point at the right file across the corpus.

doc = redhop.Document.from_folder("./docs")
ctx = doc.context("what's our deprecation policy?")

It reads everything from_file can. Files it can’t parse are skipped. Hidden entries and build/cache dirs (node_modules, target, __pycache__, venv, dist, build) are always ignored. It respects your .gitignore (even outside a git checkout). Add your own excludes, or turn gitignore off:

doc = redhop.Document.from_folder(
    "./repo",
    retrieval="hybrid",  # BM25 → dense; scales, no vector DB
    ignore=["*.lock", "tests/**", "*.min.js"],  # extra excludes
    # gitignore=False,  # to include .gitignore'd files
    recursive=True,
)
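
To preview which files a set of excludes would leave, a rough approximation of the walk can help. This sketch is not RedHop's walker: `iter_indexable_files` is a hypothetical name, it matches globs with Python's `fnmatch` (where `*` also crosses `/`) and not with gitignore semantics, and it does not read .gitignore at all.

```python
import fnmatch
import os

# Build/cache directories the text above says are always ignored.
ALWAYS_SKIPPED = {"node_modules", "target", "__pycache__", "venv", "dist", "build"}


def iter_indexable_files(root, ignore=()):
    """Yield POSIX-style paths relative to root, skipping hidden entries,
    build/cache directories, and anything matching an ignore glob."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into skipped directories.
        dirnames[:] = sorted(
            d for d in dirnames if not d.startswith(".") and d not in ALWAYS_SKIPPED
        )
        for name in sorted(filenames):
            if name.startswith("."):
                continue
            rel = os.path.relpath(os.path.join(dirpath, name), root).replace(os.sep, "/")
            # Match against both the relative path and the bare file name.
            if any(
                fnmatch.fnmatchcase(rel, pat) or fnmatch.fnmatchcase(name, pat)
                for pat in ignore
            ):
                continue
            yield rel
```

Running it with the same ignore list you plan to pass to from_folder gives a quick sanity check, with the caveat that the real matcher may differ in corner cases.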

By default the index lives in memory and is rebuilt on each run. Turn on persistence to save it to disk and reload it incrementally. On the next run, files whose modified time and size are unchanged are reused from the cache (no re-parsing, no re-embedding), new or changed files are processed, and removed files are dropped. This is what makes a folder of thousands of files practical. The win is biggest on the semantic and hybrid tiers, where the saved index carries the embeddings.

# First run embeds everything and saves to ./docs/.redhop; later runs reuse it.
doc = redhop.Document.from_folder("./docs", persist=True, retrieval="hybrid")
# Put the index elsewhere:
doc = redhop.Document.from_folder("./docs", persist=True, index_dir="/var/cache/redhop")
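
The reuse rule described above (unchanged modified time and size means reuse) can be sketched in a few lines. This is an illustration of the idea under stated assumptions, not RedHop's on-disk format: `plan_incremental_update` and the `{path: (mtime_ns, size)}` cache shape are hypothetical.

```python
import os


def plan_incremental_update(root, paths, cache):
    """Compare files on disk with a cache of {path: (mtime_ns, size)} stamps.

    Returns (reuse, reprocess, dropped, new_cache):
    - reuse: paths whose stamp matches the cache
    - reprocess: new or changed paths
    - dropped: cached paths that are no longer present
    """
    reuse, reprocess, new_cache = [], [], {}
    for rel in paths:
        stat = os.stat(os.path.join(root, rel))
        stamp = (stat.st_mtime_ns, stat.st_size)
        new_cache[rel] = stamp
        if cache.get(rel) == stamp:
            reuse.append(rel)
        else:
            reprocess.append(rel)
    dropped = sorted(set(cache) - set(new_cache))
    return reuse, reprocess, dropped, new_cache
```

A stamp check like this never reads file contents, which is why it stays cheap on large folders; the trade-off is that an edit preserving both size and modified time would go unnoticed.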

The cache is keyed by a fingerprint of your indexing settings (chunk size, retrieval tier, model), so changing any of them rebuilds rather than serving a stale index.
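
One common way to build such a fingerprint is to hash a canonical serialization of the settings. The sketch below shows that general technique; the setting names (`chunk_size`, `retrieval`, `model`) and the function itself are assumptions for illustration, and RedHop's actual fingerprint may be computed differently.

```python
import hashlib
import json


def settings_fingerprint(settings):
    """Return a stable hex digest for a dict of indexing settings.

    Keys are sorted before hashing so that dict ordering does not change
    the result; any changed value produces a different digest.
    """
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing the digest next to the saved index lets a loader compare it with the current settings and rebuild on mismatch.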

from_chunks requires redhop.Chunk instances (or new Chunk(...) in Node, redhop::core::Chunk::new(...) in Rust). Strings and plain dicts will raise ValueError with a migration hint. Two concepts are kept distinct:

  • source (provenance): where the chunk came from (file path, URL, logical handle). This is what ctx.citations[*].source displays.
  • id (identity): a stable identifier for dedup and gold-chunk evaluation. Auto-generated as c0, c1, … if you don’t supply one.

redhop.Chunk(
    text,
    source=None,       # provenance: what citations show
    id=None,           # identity: stable handle for dedup / eval
    metadata=None,     # open dict; page/heading/line picked up by citations
    token_count=None,  # whitespace-counted if omitted
    embedding=None,    # for pre-computed dense vectors
)
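
If you have pre-0.3.0 code that passed strings or plain dicts, a small adapter can prepare the arguments for the typed constructor. This is a sketch under loud assumptions: the legacy dict shape (a "text" key plus optional "source", "id", and extra keys) is hypothetical, as is the helper name `to_chunk_args`; adjust it to whatever your old data actually looks like.

```python
def to_chunk_args(item, index):
    """Turn a legacy string or dict into (text, kwargs) for a typed chunk.

    Assumes legacy dicts carry their content under a 'text' key
    (hypothetical shape). Ids default to c0, c1, ... by position.
    """
    if isinstance(item, str):
        return item, {"id": f"c{index}"}
    if isinstance(item, dict):
        fields = dict(item)  # copy so the caller's dict is not mutated
        text = fields.pop("text")
        kwargs = {
            "source": fields.pop("source", None),
            "id": fields.pop("id", f"c{index}"),
        }
        # Anything left over (page, heading, line, ...) goes into metadata.
        kwargs["metadata"] = fields or None
        return text, kwargs
    raise TypeError(f"unsupported chunk type: {type(item).__name__}")
```

Each pair could then be passed along as redhop.Chunk(text, **kwargs), following the signature shown above.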

Why the typed constructor. Before 0.3.0, manually-built chunks couldn’t carry page/heading/line into citations: those fields were always None on the manual path. The typed Chunk closes that gap: metadata you attach flows through, so when a support agent reads ctx.citations[0].heading it shows the article title you set.

→ Once content is loaded, pick how it’s retrieved: Retrieval options.