Akroasis owns the long-term offline reference store through pinax, the planned
knowledge layer. The repository does not ship an instance/ tree today; this
document defines the target layout and migration policy so the existing
theke/_reference staging area can move only after the source inventory is
visible and checksummed.
The canonical runtime path is:

    ${AKROASIS_INSTANCE:-~/akroasis/instance}/reference/

The store is instance data, not source code. It should not be committed to this repository. Repo docs and manifests may describe captures, checksums, and index state, but the payloads live under the operator instance root.
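
The path rule above can be mirrored in tooling. The following is a minimal sketch, assuming a hypothetical helper named `reference_root` (nothing with that name exists in the repository today) and assuming that an empty `AKROASIS_INSTANCE` should fall back to the default, as the shell `:-` form does:

```python
import os
from pathlib import Path


def reference_root(env=None):
    """Return the reference store root, honoring AKROASIS_INSTANCE when set.

    Hypothetical helper for illustration; not an existing akroasis API.
    """
    env = os.environ if env is None else env
    # Like the shell ${VAR:-default} form, an unset or empty variable
    # falls back to ~/akroasis/instance.
    instance = env.get("AKROASIS_INSTANCE") or "~/akroasis/instance"
    return Path(instance).expanduser() / "reference"
```

Passing a plain dict as `env` keeps the helper easy to test without touching the real environment.
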
Expected top-level layout:

    reference/
      captures/
        frozen/
        refreshable/
      indexes/
        rg/
        tantivy/
        embeddings/
      manifests/
      adapters/
      staging/

captures/frozen/ holds material captured for grid-down use where upstream
availability is the risk. captures/refreshable/ holds mirrors that can be
re-pulled from upstream while the network is available. indexes/ is generated
and disposable; it can be rebuilt from captures and manifests. staging/ is for
imports before checksums, content class, license notes, and refresh policy are
assigned.
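
Creating the empty skeleton is mechanical. The sketch below is one way to do it in Python; the `LAYOUT` tuple and `create_layout` function are hypothetical names, and the nesting (for example `rg/`, `tantivy/`, and `embeddings/` under `indexes/`) is an assumption read from the layout listing:

```python
from pathlib import Path

# Assumed nesting of the target layout, relative to reference/.
LAYOUT = (
    "captures/frozen",
    "captures/refreshable",
    "indexes/rg",
    "indexes/tantivy",
    "indexes/embeddings",
    "manifests",
    "adapters",
    "staging",
)


def create_layout(root):
    """Create the directory skeleton under root and return the paths.

    Safe to run repeatedly: existing directories are left untouched.
    """
    root = Path(root)
    created = []
    for rel in LAYOUT:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```
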
Initial supported content classes:
| Class | Destination | Notes |
|---|---|---|
| Dash docsets | `captures/frozen/docsets/` or `captures/refreshable/docsets/` | Preserve upstream docset structure and record feed URL when available. |
| Curated PDFs | `captures/frozen/pdfs/` | Keep original filename plus manifest title, source URL, capture date, and checksum. |
| Preserved wikis | `captures/frozen/wikis/` | Store static export plus source revision or dump metadata. |
| Stack Exchange snapshots | `captures/frozen/stack-exchange/` | Store dump metadata and license/provenance in manifest. |
| Model-agnostic corpora | `captures/frozen/corpora/` | Plain text, markdown, or other durable formats preferred. |
| Frequency/protocol references | `captures/refreshable/radio/` or `captures/frozen/radio/` | Use refreshable for public frequency databases; frozen for field manuals and protocol captures. |

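
The table can be read as a lookup from content class and refresh policy to a destination directory. A minimal sketch, assuming the directory names in the table double as class identifiers; `ALLOWED_POLICIES` and `destination` are hypothetical names used only for illustration:

```python
# Which refresh policies each content class may use, per the table.
ALLOWED_POLICIES = {
    "docsets": {"frozen", "refreshable"},
    "pdfs": {"frozen"},
    "wikis": {"frozen"},
    "stack-exchange": {"frozen"},
    "corpora": {"frozen"},
    "radio": {"frozen", "refreshable"},
}


def destination(content_class, policy):
    """Return the payload directory relative to reference/.

    Raises ValueError for an unknown class or a policy the table
    does not list for that class.
    """
    allowed = ALLOWED_POLICIES.get(content_class)
    if allowed is None:
        raise ValueError(f"unknown content class: {content_class}")
    if policy not in allowed:
        raise ValueError(f"{content_class} does not support policy {policy}")
    return f"captures/{policy}/{content_class}/"
```

For example, `destination("radio", "refreshable")` yields `captures/refreshable/radio/`, while `destination("wikis", "refreshable")` is rejected.
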
Per-crate docs stay in the crate or docs/. Operational reference payloads,
large manuals, docsets, and preserved external corpora live in the instance
reference store. Theke may link to this store, but it should not own akroasis
reference data after migration.
Each capture set has a manifest under manifests/ with:
- stable capture id
- content class
- relative payload path
- title and source
- upstream URL or local provenance
- capture timestamp
- refresh policy: `frozen` or `refreshable`
- checksum set
- license or redistribution notes
- index adapters enabled for the payload
Manifests are the contract between raw filesystem access, index builders, and future query APIs.
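
The on-disk manifest format is not fixed by this document. As a sketch only, the fields above could be modeled as a Python dataclass and serialized to JSON; the field names, the JSON choice, and the shape of `checksums` (relative file path mapped to a SHA-256 hex digest) are all assumptions:

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import PurePosixPath


@dataclass
class Manifest:
    """Hypothetical manifest record; field names are illustrative."""

    capture_id: str
    content_class: str
    payload_path: str    # relative to reference/
    title: str
    source: str
    provenance: str      # upstream URL or local provenance note
    captured_at: str     # timestamp string, e.g. ISO 8601
    refresh_policy: str  # "frozen" or "refreshable"
    checksums: dict      # assumed: relative file path -> sha256 hex digest
    license_notes: str
    index_adapters: list = field(default_factory=list)

    def validate(self):
        """Check the invariants this document states; return self."""
        if self.refresh_policy not in ("frozen", "refreshable"):
            raise ValueError(f"bad refresh policy: {self.refresh_policy}")
        path = PurePosixPath(self.payload_path)
        if path.is_absolute() or ".." in path.parts:
            raise ValueError("payload path must stay relative to the store root")
        if not self.checksums:
            raise ValueError("checksum set must not be empty")
        return self

    def to_json(self):
        """Serialize to a stable, human-diffable JSON string."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

Rejecting absolute paths and `..` segments keeps manifests valid wherever the store root is mounted.
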
Access starts with the filesystem and grows toward services:
- Raw tree: agents and humans can use `rg`, `fd`, and direct file reads under `reference/captures/`.
- Text index: generated full-text indexes under `reference/indexes/` for fast local search.
- Embedded index: optional vector indexes for semantic lookup. These are generated artifacts and must be rebuildable from captures.
- Query API: future MCP or HTTP tools read manifests and indexes instead of scraping arbitrary paths. This is the programmatic surface; it should not become the only way to access the corpus.
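
To illustrate the last tier, a query tool can locate payloads through manifests instead of walking arbitrary directories. This is a sketch under assumptions: manifests are stored as `*.json` files with `content_class` and `payload_path` keys, and `find_payloads` is a hypothetical function, not a planned API:

```python
import json
from pathlib import Path


def find_payloads(reference_root, content_class):
    """Return payload paths for one content class, as listed in manifests.

    Only manifests/ is read; the captures tree is never scanned.
    """
    root = Path(reference_root)
    hits = []
    for manifest_file in sorted((root / "manifests").glob("*.json")):
        manifest = json.loads(manifest_file.read_text(encoding="utf-8"))
        if manifest.get("content_class") == content_class:
            hits.append(root / manifest["payload_path"])
    return hits
```
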
Docsets alone were reported as 6.1 GB in the staging issue, and the expected
store is 50-100 GB. The instance root may therefore live on a larger mount such
as /storage or another operator-selected drive. The canonical path remains
~/akroasis/instance/reference; if payloads live elsewhere, use a single
symlink at ~/akroasis/instance/reference and keep all internal paths relative
to that canonical root.
Do not scatter per-content symlinks across /, /data, /storage, and
removable menos drives. Mount or link the store root once, then let manifests
describe the content inside it.
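
The single-link rule can be checked mechanically. The sketch below, with a hypothetical function name `stray_symlinks`, resolves the store root once (so a symlinked root is fine) and reports any symlink inside the store whose target lands outside it:

```python
import os
from pathlib import Path


def stray_symlinks(reference_root):
    """Return symlinks inside the store that resolve outside the store root."""
    root = Path(reference_root)
    real_root = root.resolve()  # follows the one permitted root symlink
    offenders = []
    # followlinks=False: symlinked directories are listed but not entered.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            if not path.is_symlink():
                continue
            target = path.resolve()
            if target != real_root and real_root not in target.parents:
                offenders.append(path)
    return sorted(offenders)
```

An empty result means every link inside the store stays within the canonical root.
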
Before moving data out of the current staging area:
- Verify the source path exists on the migration host.
- Generate checksums for every payload.
- Classify each content set as frozen or refreshable.
- Write manifests for each content set.
- Create the target `reference/` tree on the chosen drive.
- Copy payloads into `staging/`, verify checksums, then promote them into `captures/`.
- Build the raw/text indexes.
- Update fleet pointers that currently name the staging path.
- Remove the staging copy only after checksum verification and operator signoff.
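
The checksum and copy steps above can be sketched as follows. This assumes SHA-256 as the digest (the document does not name an algorithm), and the function names are hypothetical:

```python
import hashlib
import shutil
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash one file in chunks so large payloads do not load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_tree(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_tree(root, expected):
    """Return relative paths that are missing or whose digest differs."""
    actual = checksum_tree(root)
    return sorted(rel for rel, digest in expected.items() if actual.get(rel) != digest)


def copy_and_verify(source, staging):
    """Checksum the source, copy it into staging, then verify the copy."""
    expected = checksum_tree(source)
    shutil.copytree(source, staging, dirs_exist_ok=True)
    return expected, verify_tree(staging, expected)
```

Promotion into `captures/` and removal of the old staging copy would only proceed when the returned mismatch list is empty and the operator has signed off.
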
As of this design note, akroasis only documents the planned pinax knowledge
layer. There is no checked-in instance/ directory, no crates/pinax, and no
verified local copy of the source theke/_reference tree in this worktree.