Context
Splits out the "supplementary masked structured data" half of #68
that we closed without implementing because the shape was
unspecified.
Today the redaction pipeline produces an [Audit<M>] per document:
a per-entity record with (location, entity_kind, decision, execution, replacement, rule_id, confidence, …). Audits are
the engine's authoritative log of what happened. They serialize
through serde for in-process consumption, but there's no path to
ask "give me every redaction in CSV form" or "give me the JSON
manifest for this run."
Use case
Analytics teams and compliance reviewers want to query "how many
SSNs were redacted in Q3" / "which rules fired on this document"
without writing a Rust program. Today they'd have to load the
serialized audit JSON and reshape it. A direct CSV/JSON export
keyed on the per-entity row makes the data dumb-tools-friendly
(grep, spreadsheets, BI tools).
Proposed shape
Two output formats, each per-pass:
- CSV: one row per AuditEntry. Columns:
  document_id, modality, entity_id, entity_kind, decision, rule_id, confidence, replacement_kind, location, executed_at.
  Location is serialised as a single string (text:0..10,
  tabular:1,2, image:bbox(0,0,100,100), audio:0us-1000us)
  so the column type stays uniform.
- JSON: array of audit entries with the typed AnyAudit
  per-modality shape preserved (existing serde output). Same as
  what the registry already persists; this just exposes it
  through a named export.
Both formats are produced alongside the primary redacted
artefact, not in place of it. The primary redacted file (CSV →
masked CSV, PDF → masked PDF, …) keeps shipping unchanged.
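
To make the single-string location column concrete, here is a minimal sketch of how the four example strings above could be produced. The `Location` enum and its field names are hypothetical stand-ins, not the engine's real types; in particular, reading `tabular:1,2` as row,column and `bbox(0,0,100,100)` as x,y,width,height is an assumption.

```rust
use std::fmt;

/// Hypothetical stand-in for the engine's per-modality location types.
/// Field names and their meaning are assumptions made for illustration.
#[derive(Debug, Clone, PartialEq)]
pub enum Location {
    Text { start: usize, end: usize },
    Tabular { row: usize, column: usize },
    Image { x: u32, y: u32, width: u32, height: u32 },
    Audio { start_us: u64, end_us: u64 },
}

impl fmt::Display for Location {
    /// Renders every variant as `<modality>:<payload>` so the CSV column
    /// holds one uniform string type.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Location::Text { start, end } => write!(f, "text:{start}..{end}"),
            Location::Tabular { row, column } => write!(f, "tabular:{row},{column}"),
            Location::Image { x, y, width, height } => {
                write!(f, "image:bbox({x},{y},{width},{height})")
            }
            Location::Audio { start_us, end_us } => {
                write!(f, "audio:{start_us}us-{end_us}us")
            }
        }
    }
}
```

Note that the tabular and image forms contain commas, so the CSV writer has to quote the location field.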
API sketch
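/// Named export of the audit records produced by a pass.
pub trait AuditExport {
    /// One CSV row per audit entry.
    fn to_csv(audits: &[AnyAudit]) -> Result<String>;
    /// JSON array preserving the typed per-modality shape.
    fn to_json(audits: &[AnyAudit]) -> Result<String>;
}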
Lives somewhere in nvisy-engine next to the audit types.
Engine `Exporter` gains optional flags to also write
{document_id}.audit.csv / {document_id}.audit.json next to
the primary output.
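
As a rough illustration of the CSV half, the sketch below writes the proposed columns with RFC 4180 style quoting and derives the two sidecar file names. It does not use the real `AnyAudit` type: `AuditRow` is a hypothetical, already-flattened row, all fields except `confidence` are plain strings for simplicity, and the function returns a `String` rather than the `Result<String>` in the trait above.

```rust
/// Hypothetical flattened row; field names follow the proposed CSV columns.
pub struct AuditRow {
    pub document_id: String,
    pub modality: String,
    pub entity_id: String,
    pub entity_kind: String,
    pub decision: String,
    pub rule_id: String,
    pub confidence: f32,
    pub replacement_kind: String,
    pub location: String,
    pub executed_at: String,
}

const HEADER: &str = "document_id,modality,entity_id,entity_kind,decision,\
rule_id,confidence,replacement_kind,location,executed_at";

/// Quotes a field if it contains a comma, a double quote, or a line break,
/// doubling any embedded double quotes.
fn escape(field: &str) -> String {
    if field.contains(|c: char| matches!(c, ',' | '"' | '\n' | '\r')) {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_string()
    }
}

/// Renders a header line followed by one line per row.
pub fn to_csv(rows: &[AuditRow]) -> String {
    let mut out = String::from(HEADER);
    out.push('\n');
    for r in rows {
        let confidence = r.confidence.to_string();
        let fields = [
            r.document_id.as_str(),
            r.modality.as_str(),
            r.entity_id.as_str(),
            r.entity_kind.as_str(),
            r.decision.as_str(),
            r.rule_id.as_str(),
            confidence.as_str(),
            r.replacement_kind.as_str(),
            r.location.as_str(),
            r.executed_at.as_str(),
        ];
        let line: Vec<String> = fields.iter().map(|f| escape(f)).collect();
        out.push_str(&line.join(","));
        out.push('\n');
    }
    out
}

/// Sidecar file names written next to the primary output.
pub fn audit_sidecar_names(document_id: &str) -> (String, String) {
    (
        format!("{document_id}.audit.csv"),
        format!("{document_id}.audit.json"),
    )
}
```

Quoting matters here because location strings such as tabular:1,2 contain commas; without it, spreadsheet and BI tools would split the location across columns.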
Out of scope
- Extracted-tabular-content-with-masked-values (the "PDF → CSV"
interpretation that requires table detection inside PDFs). That
needs a PDF table extractor; revisit when one ships.
- Anonymized datasets / k-anonymity exports. Separate concern.
References
INGESTION.md §4.2: original spec describing supplementary masked structured data outputs.