Loaders

RedHop owns everything from chunking onward; your only job is to get content in. There are five on-ramps, ranging from “I already have text,” through a single file or a whole folder, to bytes fetched straight from a cloud bucket. Each returns a Document and accepts the same options (see Retrieval options). The API is the same in Python, Node, and Rust, and each example below appears in that order.

On-ramp	For
`from_text`	text you already have (your own parser/OCR, a DB field)
`from_chunks`	content you already chunked
`from_file`	a file on disk: PDF, DOCX, PPTX, XLSX, or text/code
`from_bytes`	bytes you already fetched: S3 / Azure Blob / GCS / HTTP / DB blobs
`from_folder`	a whole folder in one index (with an optional incremental on-disk index)

from_text: text you already have

Got the text already (a DB field, an API response, your own parser or OCR)? Hand it straight to RedHop. This is also the escape hatch for formats from_file doesn’t parse, or scanned PDFs that need OCR first.

import redhop
doc = redhop.Document.from_text(my_text)

const { Document } = require("redhop");
const doc = Document.fromText(myText);

let doc = redhop::Document::from_text("notes", my_text)?;

from_chunks: you already chunked

Already split your content (your own chunker, a DB dump, a data dictionary)? Wrap each as a typed Chunk: source flows through to citations, id is your stable identifier, and metadata is an open dict whose page / heading / line keys are picked up by the citation getter:

doc = redhop.Document.from_chunks([
    redhop.Chunk(
        "clause one …",
        source="msa.pdf",
        id="c1",
        metadata={"page": 1, "heading": "Definitions"},
    ),
    redhop.Chunk(
        "clause two …",
        source="msa.pdf",
        id="c2",
        metadata={"page": 1, "heading": "Definitions"},
    ),
])

const { Document, Chunk } = require("redhop");
const doc = Document.fromChunks([
  new Chunk("clause one …", {
    source: "msa.pdf",
    id: "c1",
    metadata: { page: 1, heading: "Definitions" },
  }),
  new Chunk("clause two …", {
    source: "msa.pdf",
    id: "c2",
    metadata: { page: 1, heading: "Definitions" },
  }),
]);

use redhop::core::{Chunk, ChunkId, TokenCount};
use std::collections::HashMap;
let mut meta = HashMap::new();
meta.insert("page".to_string(), serde_json::json!(1));
meta.insert("heading".to_string(), serde_json::json!("Definitions"));
let chunks = vec![
    Chunk::new(ChunkId::new("c1"), "clause one …", "msa.pdf", TokenCount(3))
        .with_metadata(meta.clone()),
    Chunk::new(ChunkId::new("c2"), "clause two …", "msa.pdf", TokenCount(3))
        .with_metadata(meta),
];
let doc = redhop::Document::from_chunks(chunks)?;

Strings and plain dicts are no longer accepted (0.3.0 breaking change). The typed constructor was added so manually-built chunks can carry the same metadata into citations that file loaders populate. See redhop.Chunk below for the full field list.

from_file: a file on disk, one line

Point it at a path and RedHop reads, parses, chunks, and indexes the file: PDF, DOCX, PPTX, XLSX, or any text/code file. Each chunk records the file path as its source, plus its structural location within the file (see Citations), so a citation points at contract.pdf, p.3 or notes.md → Setup or main.py:42, not just the filename.

doc = redhop.Document.from_file("contract.pdf")
ctx = doc.context("how long is the refund window?")

const doc = Document.fromFile("contract.pdf");
const ctx = doc.context("how long is the refund window?");

let mut doc = redhop::read_file("contract.pdf")?;
let ctx = doc.context("how long is the refund window?")?;

Indexing is type-aware. Each format carries the structural location it has:

Format	How it’s split	Citation
Markdown	by heading	heading + line
Code (`.py` `.ts` `.rs` …)	by definition (function/class), kept verbatim	symbol + line
Text / data	by blank-line block	line
DOCX	paragraphs (heading-aware) + tables	heading
PPTX	one section per slide	page (slide #)
XLSX / ODS	one section per sheet, rows pipe-joined	heading (sheet)
PDF	one section per page	page

Code is chunked verbatim (formatting preserved) and labeled with its nearest definition (auth.py → def login), and prose is sentence-packed. Code is also retrieved lexically under hybrid. See Retrieval options.
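
To make the "by heading" row concrete, here is a minimal sketch of heading-based splitting that records the heading and starting line a citation would need. It is an illustration only, not RedHop's implementation: the function name `split_markdown_by_heading` and the returned dict shape are assumptions, and it ignores edge cases such as `#` lines inside fenced code.

```python
import re

# ATX-style Markdown headings: one to six '#' characters, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")

def split_markdown_by_heading(text):
    """Split Markdown into one section per heading (hypothetical helper).

    Each section records its heading and 1-based starting line.
    """
    sections = []
    current = {"heading": None, "line": 1, "lines": []}
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = HEADING.match(line)
        if match:
            # Close the section in progress, unless it holds no text at all.
            if any(l.strip() for l in current["lines"]):
                sections.append(current)
            current = {"heading": match.group(2), "line": lineno, "lines": [line]}
        else:
            current["lines"].append(line)
    if any(l.strip() for l in current["lines"]):
        sections.append(current)
    return [
        {"heading": s["heading"], "line": s["line"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]

notes = "# Setup\n\nInstall the package.\n\n## Usage\n\nCall the loader.\n"
sections = split_markdown_by_heading(notes)
for s in sections:
    print(s["heading"], s["line"])
```

Text that appears before the first heading becomes a section whose heading is None, which mirrors the idea that a format only carries the structural location it actually has.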

from_bytes: cloud storage, HTTP, blobs

The file isn’t always on local disk. Sometimes it’s in S3, Cloudflare R2, Azure Blob, GCS, behind a URL, or in a DB column. RedHop doesn’t bundle cloud SDKs (credentials and auth aren’t its job). Fetch the bytes with the client you already have, and hand them over. The name (e.g. "contract.pdf") picks the parser and becomes the citation source, so pass something meaningful like the object key.

import boto3, redhop
obj = boto3.client("s3").get_object(Bucket="my-bucket", Key="contract.pdf")
doc = redhop.Document.from_bytes(obj["Body"].read(), source="s3://my-bucket/contract.pdf")

const { Document } = require("redhop");
// e.g. const buf = Buffer.from(await (await fetch(url)).arrayBuffer());
const doc = Document.fromBytes(buf, "s3://my-bucket/contract.pdf");

// let bytes = your_s3_client.get(...).await?;  // any client
let doc = redhop::read_bytes(&bytes, "s3://my-bucket/contract.pdf")?;
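
Because the name picks the parser, an opaque name can send a PDF down the plain-text path. The sketch below shows one plausible way a name could be mapped to a parser; it assumes the choice is made from the file extension, which the text above does not spell out. `PARSER_BY_SUFFIX` and `guess_parser` are hypothetical names, not part of RedHop.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical extension-to-parser map, based on the formats listed for from_file.
PARSER_BY_SUFFIX = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".pptx": "pptx",
    ".xlsx": "xlsx",
    ".md": "markdown",
}

def guess_parser(name):
    """Pick a parser label from a file name, object key, or URL (illustrative only)."""
    # For URLs such as s3://bucket/key.pdf, look only at the path component.
    path = urlparse(name).path or name
    suffix = PurePosixPath(path).suffix.lower()
    # Anything unrecognised falls back to plain text in this sketch.
    return PARSER_BY_SUFFIX.get(suffix, "text")

print(guess_parser("s3://my-bucket/contract.pdf"))
print(guess_parser("https://example.com/download?id=42"))
```

The second call has no extension to go on, which is the situation a meaningful name such as the object key avoids.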

Citations

Every selected chunk remembers where it came from: source plus whichever of page / heading / line the format provides (the others come back empty: None in Python, null in Node):

for c in ctx.citations:
    print(c["source"], c["page"], c["heading"], c["line"])
    # contract.pdf  3  None  None  → "contract.pdf, p.3"
    # notes.md   None "Setup"  12  → "notes.md → Setup"

for (const c of ctx.citations) {
  console.log(c.source, c.page, c.heading, c.line);
  // contract.pdf  3  null  null  → "contract.pdf, p.3"
}

for c in redhop::citations(&ctx) {
    println!("{} {:?} {:?} {:?}", c.source, c.page, c.heading, c.line);
}

This is what lets you show the model’s evidence trail (“answer grounded in contract.docx → Refund Policy”) without a separate store. from_text gives you source and the chunk text; from_chunks adds whatever page / heading / line you set in each chunk’s metadata. For the file loaders, the structural fields are filled in per format.
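
Turning a citation into a display label is a few lines of code. The helper below is a sketch that assumes each citation is a dict with the keys used in the Python loop above; `citation_label` and its precedence (page, then heading, then line) are illustrative choices, not RedHop behaviour.

```python
def citation_label(c):
    """Render one citation dict as a short human-readable label (hypothetical helper)."""
    source = c["source"]
    page, heading, line = c.get("page"), c.get("heading"), c.get("line")
    # Prefer the most specific location the format provided.
    if page is not None:
        return f"{source}, p.{page}"
    if heading is not None:
        return f"{source} → {heading}"
    if line is not None:
        return f"{source}:{line}"
    return source

examples = [
    {"source": "contract.pdf", "page": 3, "heading": None, "line": None},
    {"source": "notes.md", "page": None, "heading": "Setup", "line": 12},
    {"source": "main.py", "page": None, "heading": None, "line": 42},
]
labels = [citation_label(c) for c in examples]
print(labels)
```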

from_folder: point at a directory

The on-ramp for file apps and coding agents: index a whole folder and ask, no vector DB to operate. RedHop walks the directory, reads every file it can, and builds one combined index: each chunk keeps its own file path as source, so citations point at the right file across the corpus.

doc = redhop.Document.from_folder("./docs")
ctx = doc.context("what's our deprecation policy?")

const doc = Document.fromFolder("./docs");
const ctx = doc.context("what's our deprecation policy?");

let mut doc = redhop::read_folder("./docs")?;
let ctx = doc.context("what's our deprecation policy?")?;

It reads everything from_file can. Files it can’t parse are skipped. Hidden entries and build/cache dirs (node_modules, target, __pycache__, venv, dist, build) are always ignored. It respects your .gitignore (even outside a git checkout). Add your own excludes, or turn gitignore off:

doc = redhop.Document.from_folder(
    "./repo",
    retrieval="hybrid",                          # BM25 → dense; scales, no vector DB
    ignore=["*.lock", "tests/**", "*.min.js"],   # extra excludes
    # gitignore=False,                            # to include .gitignore'd files
    recursive=True,
)

const doc = Document.fromFolder("./repo", {
  recursive: true,
  ignore: ["*.lock", "tests/**", "*.min.js"],   // extra excludes
  // gitignore: false,                            // to include .gitignore'd files
  options: { retrieval: "hybrid" },              // BM25 → dense; scales, no vector DB
});

use redhop::{read_folder_with, FolderOptions, LoadOptions};

let opts = FolderOptions {
    recursive: Some(true),
    ignore: vec!["*.lock".into(), "tests/**".into(), "*.min.js".into()],
    // gitignore: Some(false),
    load: LoadOptions { retrieval: Some("hybrid".into()), ..Default::default() },
    ..Default::default()
};
let mut doc = read_folder_with("./repo", &opts)?;
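
To reason about which files an ignore list would exclude, a rough approximation in plain Python can help. This sketch is not RedHop's matcher: it uses the standard library's `fnmatch` (where `*` also matches `/`), it does not read .gitignore files, and `is_excluded` and `ALWAYS_IGNORED_DIRS` are hypothetical names.

```python
import fnmatch
from pathlib import PurePosixPath

# Build/cache directories named in the text above as always ignored.
ALWAYS_IGNORED_DIRS = {"node_modules", "target", "__pycache__", "venv", "dist", "build"}

def is_excluded(rel_path, ignore=()):
    """Return True if a folder-relative POSIX path would be skipped (approximation)."""
    parts = PurePosixPath(rel_path).parts
    # Hidden files and directories are skipped regardless of options.
    if any(p.startswith(".") for p in parts):
        return True
    # So is anything underneath a build/cache directory.
    if any(p in ALWAYS_IGNORED_DIRS for p in parts[:-1]):
        return True
    # Finally, apply the caller's extra glob patterns (case-sensitive).
    return any(fnmatch.fnmatchcase(rel_path, pattern) for pattern in ignore)

ignore = ["*.lock", "tests/**", "*.min.js"]
for path in ["src/app.py", "poetry.lock", "tests/unit/test_a.py", "web/app.min.js"]:
    print(path, is_excluded(path, ignore))
```

Real gitignore semantics are richer (negation, anchoring, directory-only patterns), so treat this as a quick sanity check rather than a faithful model.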

Persist the index: incremental reload

By default the index lives in memory and is rebuilt on every run. Turn on persistence to save it to disk and reload incrementally. On the next run, files whose modified time and size are unchanged are reused from the cache (no re-parsing, no re-embedding), new or changed files are processed, and removed files are dropped. This is what makes a folder of thousands of files practical. The win is biggest on the semantic/hybrid tiers, where the saved index carries the embeddings.

# First run embeds everything and saves to ./docs/.redhop; later runs reuse it.
doc = redhop.Document.from_folder("./docs", persist=True, retrieval="hybrid")
# Put the index elsewhere:
doc = redhop.Document.from_folder("./docs", persist=True, index_dir="/var/cache/redhop")

// First run embeds everything and saves to ./docs/.redhop; later runs reuse it.
let doc = Document.fromFolder("./docs", { persist: true, options: { retrieval: "hybrid" } });
// Put the index elsewhere:
doc = Document.fromFolder("./docs", { persist: true, indexDir: "/var/cache/redhop" });

use redhop::{read_folder_with, FolderOptions, LoadOptions};

// First run embeds everything and saves to ./docs/.redhop; later runs reuse it.
let opts = FolderOptions {
    persist: true,
    load: LoadOptions { retrieval: Some("hybrid".into()), ..Default::default() },
    ..Default::default()
};
let mut doc = read_folder_with("./docs", &opts)?;
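
The reuse rule described above (unchanged modified time and size means reuse) can be sketched with the standard library alone. This is an illustration of the idea, not RedHop's cache format; `snapshot` and `plan_reload` are hypothetical helpers.

```python
import os
import tempfile

def snapshot(folder):
    """Map each file's folder-relative path to its (mtime_ns, size) pair."""
    result = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            st = os.stat(path)
            result[os.path.relpath(path, folder)] = (st.st_mtime_ns, st.st_size)
    return result

def plan_reload(previous, current):
    """Split paths into reuse / process / drop by comparing two snapshots."""
    reuse = sorted(p for p in current if previous.get(p) == current[p])
    process = sorted(p for p in current if previous.get(p) != current[p])
    drop = sorted(p for p in previous if p not in current)
    return reuse, process, drop

with tempfile.TemporaryDirectory() as folder:
    for name in ("a.md", "b.md", "c.md"):
        with open(os.path.join(folder, name), "w") as f:
            f.write("first version\n")
    before = snapshot(folder)

    # Simulate edits between two runs: change b.md, delete c.md, add d.md.
    with open(os.path.join(folder, "b.md"), "w") as f:
        f.write("second version, now longer\n")  # size differs, so b.md is stale
    os.remove(os.path.join(folder, "c.md"))
    with open(os.path.join(folder, "d.md"), "w") as f:
        f.write("new file\n")

    reuse, process, drop = plan_reload(before, snapshot(folder))

print("reuse:", reuse, "process:", process, "drop:", drop)
```

Comparing metadata instead of hashing contents keeps the check cheap, at the cost of missing an edit that preserves both size and modified time.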

The cache is keyed by a fingerprint of your indexing settings (chunk size, retrieval tier, model), so changing any of them rebuilds rather than serving a stale index.
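
One common way to build such a fingerprint is to hash a canonical serialization of the settings. The sketch below assumes that approach; the setting names, the model name, and `settings_fingerprint` are all hypothetical, and RedHop's actual key derivation may differ.

```python
import hashlib
import json

def settings_fingerprint(settings):
    """Hash indexing settings into a short cache key that ignores key order."""
    # sort_keys makes the serialization canonical, so equal settings hash equally.
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Hypothetical settings; the names are placeholders, not RedHop options.
base = {"chunk_size": 512, "retrieval": "hybrid", "model": "example-embedder"}
same = {"model": "example-embedder", "retrieval": "hybrid", "chunk_size": 512}
changed = dict(base, chunk_size=256)

print(settings_fingerprint(base) == settings_fingerprint(same))
print(settings_fingerprint(base) == settings_fingerprint(changed))
```

Any change to a setting yields a different key, so a cache stored under the old key is simply never looked up again.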

redhop.Chunk: the typed chunk primitive

from_chunks requires redhop.Chunk instances (or new Chunk(...) in Node, redhop::core::Chunk::new(...) in Rust). Strings and plain dicts will raise ValueError with a migration hint. Two concepts are kept distinct:

source (provenance): where the chunk came from (file path, URL, logical handle). This is what ctx.citations[*].source displays.
id (identity): a stable identifier for dedup and gold-chunk evaluation. Auto-generated as c0, c1, … if you don’t supply one.

redhop.Chunk(
    text,
    source=None,            # provenance — what citations show
    id=None,                # identity — stable handle for dedup / eval
    metadata=None,          # open dict; page/heading/line picked up by citations
    token_count=None,       # whitespace-counted if omitted
    embedding=None,         # for pre-computed dense vectors
)

Why the typed constructor. Before 0.3.0, manually-built chunks couldn’t carry page/heading/line into citations: those fields were always None on the manual path. The typed Chunk closes that gap: metadata you attach flows through, so when a support agent reads ctx.citations[0].heading it shows the article title you set.
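
The split between provenance and identity, and the defaults listed in the constructor above, can be modelled with a small stand-in class. `SketchChunk`, `assign_ids`, and `to_citation` are hypothetical and exist only to illustrate the behaviour described here; they are not RedHop types.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SketchChunk:
    """Stand-in for a typed chunk: text plus provenance, identity, and metadata."""
    text: str
    source: Optional[str] = None      # provenance: what a citation would show
    id: Optional[str] = None          # identity: stable handle for dedup / eval
    metadata: dict = field(default_factory=dict)
    token_count: Optional[int] = None

    def __post_init__(self):
        # Whitespace-count the tokens when no count is supplied.
        if self.token_count is None:
            self.token_count = len(self.text.split())

def assign_ids(chunks):
    """Fill in missing ids as c0, c1, ... by position."""
    for i, chunk in enumerate(chunks):
        if chunk.id is None:
            chunk.id = f"c{i}"
    return chunks

def to_citation(chunk):
    """Carry source plus page / heading / line from metadata into a citation dict."""
    keys = ("page", "heading", "line")
    return {"source": chunk.source, **{k: chunk.metadata.get(k) for k in keys}}

chunks = assign_ids([
    SketchChunk("clause one about refunds", source="kb/refunds.md",
                metadata={"heading": "Refund Policy"}),
    SketchChunk("clause two", source="kb/contact.md", id="contact-1"),
])
print([c.id for c in chunks], to_citation(chunks[0]))
```

Two chunks can share a source while keeping distinct ids, which is why the two fields are kept separate.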

→ Once content is loaded, pick how it’s retrieved: Retrieval options.