Re-export Audit as CSV or JSON for analytics consumption #270

@martsokha

Context

Splits out the "supplementary masked structured data" half of #68
that we closed without implementing because the shape was
unspecified.

Today the redaction pipeline produces an [Audit<M>] per document:
a per-entity record with (location, entity_kind, decision, execution, replacement, rule_id, confidence, …). Audits are
the engine's authoritative log of what happened. They serialize
through serde for in-process consumption, but there's no path to
ask "give me every redaction in CSV form" or "give me the JSON
manifest for this run."

Use case

Analytics teams and compliance reviewers want to query "how many
SSNs were redacted in Q3" / "which rules fired on this document"
without writing a Rust program. Today they'd have to load the
serialized audit JSON and reshape it. A direct CSV/JSON export
keyed on the per-entity row makes the data dumb-tools-friendly
(grep, spreadsheets, BI tools).

Proposed shape

Two output formats, each per-pass:

  • CSV — one row per AuditEntry. Columns:
    document_id, modality, entity_id, entity_kind, decision, rule_id, confidence, replacement_kind, location, executed_at.
    Location is serialised as a single string (`text:0..10`,
    `tabular:1,2`, `image:bbox(0,0,100,100)`, `audio:0us-1000us`)
    so the column type stays uniform.

  • JSON — array of audit entries with the typed AnyAudit
    per-modality shape preserved (existing serde output). Same as
    what the registry already persists; this just exposes it
    through a named export.
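
The single-string location column could be produced with a `Display` impl along these lines. This is only a sketch: the `Location` enum and its field names are hypothetical stand-ins for the engine's real per-modality location types, and the image variant keeps the four bbox numbers as an opaque array because the issue does not say whether they are corner coordinates or origin plus size.

```rust
use std::fmt;

/// Hypothetical stand-in for the engine's per-modality location types.
#[derive(Debug, Clone, PartialEq)]
pub enum Location {
    /// Offsets into a text document (assumed half-open range).
    Text { start: usize, end: usize },
    /// A cell in a table, row first then column (assumed ordering).
    Tabular { row: usize, col: usize },
    /// Four bounding-box numbers, kept in the order they were given.
    Image { bbox: [u32; 4] },
    /// A time span in microseconds.
    Audio { start_us: u64, end_us: u64 },
}

impl fmt::Display for Location {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Location::Text { start, end } => write!(f, "text:{start}..{end}"),
            Location::Tabular { row, col } => write!(f, "tabular:{row},{col}"),
            Location::Image { bbox: [a, b, c, d] } => {
                write!(f, "image:bbox({a},{b},{c},{d})")
            }
            Location::Audio { start_us, end_us } => {
                write!(f, "audio:{start_us}us-{end_us}us")
            }
        }
    }
}
```

With this, `Location::Text { start: 0, end: 10 }.to_string()` gives `text:0..10`, matching the format listed above. Note that the tabular and image forms contain commas, so the CSV writer has to quote the location field.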

Both formats are produced alongside the primary redacted
artefact, not in place of it. The primary redacted file (CSV →
masked CSV, PDF → masked PDF, …) keeps shipping unchanged.

API sketch

pub trait AuditExport {
    fn to_csv(audits: &[AnyAudit]) -> Result<String>;
    fn to_json(audits: &[AnyAudit]) -> Result<String>;
}

Lives somewhere in nvisy-engine next to the audit types.
Engine `Exporter` gains optional flags to also write
{document_id}.audit.csv / {document_id}.audit.json next to
the primary output.
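
A rough sketch of what the CSV half and the sidecar file naming might look like, using only the standard library. Everything here is an assumption made for illustration: `AuditRow` is a hypothetical flattened view of an audit entry (the real `AnyAudit` fields may differ), `to_csv` is written as a free function returning `String` rather than the trait method above because nothing in this simplified version can fail, and `audit_path` is an invented helper name.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical flattened row; the real audit entry types may differ.
pub struct AuditRow {
    pub document_id: String,
    pub modality: String,
    pub entity_id: String,
    pub entity_kind: String,
    pub decision: String,
    pub rule_id: String,
    pub confidence: f64,
    pub replacement_kind: String,
    pub location: String,
    pub executed_at: String,
}

/// Column order taken from the proposed shape in this issue.
pub const HEADER: &str = "document_id,modality,entity_id,entity_kind,decision,\
rule_id,confidence,replacement_kind,location,executed_at";

/// Quote a field (RFC 4180 style) when it contains a comma, a double quote,
/// or a line break; embedded double quotes are doubled.
pub fn csv_field(value: &str) -> String {
    if value.contains(|c: char| matches!(c, ',' | '"' | '\n' | '\r')) {
        format!("\"{}\"", value.replace('"', "\"\""))
    } else {
        value.to_string()
    }
}

/// Render a header line followed by one line per row.
pub fn to_csv(rows: &[AuditRow]) -> String {
    let mut out = String::from(HEADER);
    out.push('\n');
    for r in rows {
        let confidence = r.confidence.to_string();
        let fields = [
            r.document_id.as_str(),
            r.modality.as_str(),
            r.entity_id.as_str(),
            r.entity_kind.as_str(),
            r.decision.as_str(),
            r.rule_id.as_str(),
            confidence.as_str(),
            r.replacement_kind.as_str(),
            r.location.as_str(),
            r.executed_at.as_str(),
        ];
        let line: Vec<String> = fields.iter().map(|f| csv_field(f)).collect();
        out.push_str(&line.join(","));
        out.push('\n');
    }
    out
}

/// Build `{document_id}.audit.{ext}` in the same directory as the primary output.
pub fn audit_path(primary_output: &Path, document_id: &str, ext: &str) -> PathBuf {
    primary_output.with_file_name(format!("{document_id}.audit.{ext}"))
}
```

The quoting step matters for the location column in particular: values such as `tabular:1,2` contain commas and would otherwise split into extra columns. A real implementation would more likely lean on an existing CSV crate than hand-roll the escaping.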

Out of scope

  • Extracted-tabular-content-with-masked-values (the "PDF → CSV"
    interpretation that requires table detection inside PDFs). That
    needs a PDF table extractor; revisit when one ships.
  • Anonymized datasets / k-anonymity exports. Separate concern.

Labels

  • engine (redaction engine, pipeline runtime, orchestration, configuration)
  • feat (request for or implementation of a new feature)
