Skip to content

Add msgvault backup: init, create, list, verify, and restore commands (#339)#428

Open
wesm wants to merge 13 commits into
mainfrom
backup-restore-design
Open

Add msgvault backup: init, create, list, verify, and restore commands (#339)#428
wesm wants to merge 13 commits into
mainfrom
backup-restore-design

Conversation

@wesm

@wesm wesm commented Jul 4, 2026

Copy link
Copy Markdown
Member

Implements the backup and restore side of #339: msgvault backup init / create / list / verify / restore against an app-owned backup repository. Restore is included and proves itself before reporting success — see "Restore" below.

Docs: usage guide · format reference

How this maps to the #339 proposal

Kept as proposed:

  • msgvault owns the correctness boundary; retention, remote storage, and scheduling stay with restic/rclone/cron. The repository is a plain directory of write-once, immutable files — designed to be what those tools sync.
  • --include-config / --include-tokens flags, off by default. Added on top: capturing either into an unencrypted repository requires an explicit --allow-plaintext-secrets, so credentials can't leak by default.
  • Rebuildable derived state (vector db, analytics, caches) is excluded and the exclusion list is recorded in the manifest.
  • The manifest is written last, after all content is durable. A crash at any point leaves either a complete snapshot or no snapshot.
  • A missing or corrupt attachment fails the backup loudly: every attachment is re-read and re-hashed from disk at capture time, read from the storage path the database records for it (importers may namespace paths beyond the loose <aa>/<hash> layout).
  • The manifest records the msgvault version, backup format version, creation timestamp, capture options, attachment counts (content hashes live in dedicated list objects), message/conversation/account/label counts, and the exclusion list.

Where the design goes further than the proposal:

  • Snapshot repository instead of one backup directory. A repository holds many snapshots with content-addressed dedup: the database is captured at 4 KB page granularity (only changed pages stored after the first snapshot), and unchanged attachments cost nothing. Nightly backups scale with churn, not archive size, and you keep history rather than one restore point.
  • Consistency without VACUUM INTO. Instead of a full-copy snapshot, the backup checkpoints the WAL and pins a read transaction on its own connection, coordinating with a running daemon through a short freeze window (loopback API, watchdog-protected). No double-disk requirement, no downtime; the daemon resumes writes while pages are scanned. backup create refuses to run unfrozen against a live daemon.
  • Verification is stronger than existence + hash checks. verify (full mode) re-derives every referenced blob's SHA-256 from its bytes, proves the page map covers the database exactly, and cross-checks manifest counts. Failures name the damaged blob and the pack file holding it. --quick does structural checks; --all covers every snapshot, reading shared content once.
  • Versioning for the long haul. Compatibility is enforced at three levels (repository min_reader_version, per-object magic+version on every binary format, per-snapshot min_reader_version), so an older msgvault refuses a newer repository or snapshot with an explicit upgrade message instead of misreading it. Manifests are deterministic JSON; snapshot IDs are content-derived from the manifest bytes and re-checked on load, so a renamed or forged manifest is rejected.
  • Attachments are packed. Content goes into ~32 MB sealed pack files with per-blob zstd (skipped when content doesn't compress), so a 200k-file attachment tree syncs off-site as a few hundred immutable files. This is backup-repository-internal only — the live archive layout is unchanged.

Restore

backup restore --target DIR [SNAPSHOT] [--overwrite] materializes any snapshot into a usable archive home and does not trust itself:

  • Every database page is checked against the snapshot's page-hash map as it is written; every blob read re-derives its SHA-256 identity.
  • After materialization, the restored database must pass PRAGMA integrity_check and reproduce the manifest's recorded stats through exactly the queries capture ran inside the freeze; any mismatch fails the restore.
  • The end-to-end test asserts the restored database is byte-identical to the live file at capture time, for the latest snapshot and for a parent restored from an incremental chain.
  • Attachments land at the storage paths the restored database records; extras (deletions manifests, config, tokens) land at their captured relative paths and file modes, with traversal-free path validation.
  • A non-empty target is refused without --overwrite; overwrite merges but always removes the database and stale -wal/-shm sidecars first. Restoring into a running daemon's archive home is refused outright. Old backups go through normal schema migration on first open.

Performance

Page scanning hashes on all CPUs while reads stay strictly sequential (safe for NAS/spinning disks with no flag). Verify, restore, attachment capture, and changed-page packing fan out workers with a --jobs flag; --jobs 1 serializes reads for spinning-disk repositories. Progress renders as in-place bars on a TTY and plain lines through pipes, with per-stage elapsed time that keeps counting through long quiet stages such as the restore proof's integrity_check.

Changes

  • New packages: internal/pack (sealed pack container) and internal/backup (repository, locking, incremental page capture, manifests, verify, restore).
  • New CLI: msgvault backup init | create | list | verify | restore; [backup] config section (repo, zstd_level); --jobs on create/verify/restore; --progress rendering.
  • Daemon: loopback freeze endpoints with a 60s watchdog; the CLI proxy admits only backup create from the backup group — verify and restore always run locally.
  • Docs: user guide (docs/usage/backup.md) and on-disk format reference (docs/architecture/backup-format.md), including the roadmap for verify --restore-check, encryption, and retention.
  • TUI: launching against an archive with no usable analytics cache (a freshly restored one, for example) prints a notice and shows it while aggregate views load, instead of a silent slow first screen.

Sequenced later (format hooks in place)

  • Repository encryption (age-wrapped repo key, per-object AEAD) and forget/prune retention. Until prune ships, content purged from the archive persists in earlier snapshots — documented, with the privacy implications (including SQLite free-page remnants), in the usage guide.
  • verify --restore-check: the full restore materialization proof against scratch space without writing a target.

🤖 Generated with Claude Code

Comment thread internal/pack/crypter.go Fixed
@roborev-ci

roborev-ci Bot commented Jul 4, 2026

Copy link
Copy Markdown

roborev: Combined Review (89ece82)

Medium confidence: the PR has two medium-severity issues that should be fixed before merge.

Medium

  • internal/backup/manifest.go:123
    LoadManifest trusts the JSON body without recomputing the content-derived snapshot ID or checking it against the requested filename. Corrupted manifest metadata can therefore be accepted by list, latest, and verify.
    Fix: After unmarshal, parse created_at, recompute ComputeSnapshotID, and reject if it differs from the requested ID or embedded snapshot_id.

  • internal/backup/frozen.go:187
    MAX(size) can be NULL because attachments.size is nullable. Scanning it into int64 can make backup creation fail for otherwise valid attachment rows that have content but missing size metadata.
    Fix: Use COALESCE(MAX(size), -1) or scan into sql.NullInt64 and let capture fill the actual size from the file.


Reviewers: 2 done | Synthesis: codex, 11s | Total: 15m33s

@roborev-ci

roborev-ci Bot commented Jul 4, 2026

Copy link
Copy Markdown

roborev: Combined Review (d86b7ad)

Medium severity findings:

  • internal/backup/attachments.go:109: Attachment capture reconstructs paths as <hash[:2]>/<hash> instead of using the DB storage_path / thumbnail_path. Existing SyncTech MMS imports store files under paths like synctech-sms/<prefix>/<hash>, so backup create can fail to read otherwise valid attachments.
    Fix: Carry the relative storage path from AttachmentRefs into capture, read that path after validation, and keep the hash-derived path only as a legacy fallback. Add a test with a prefixed storage_path.

  • internal/backup/fsutil.go:49: If the final directory sync fails after os.Rename, the object is left visible even though writeFileAtomic returns an error. Later operations trust visible manifests and indexes, so a retry can build on a file already reported as not durable; a subsequent crash can leave a child manifest or dedup reference pointing at an index/manifest that disappears.
    Fix: On post-rename SyncDir failure, remove the published final file best-effort and sync the directory again, or otherwise mark successfully published objects so later loads ignore files from failed publishes.

  • internal/backup/manifest.go:117: LoadManifest trusts the manifest JSON loaded from snapshots/<id>.mvmanifest without verifying that the file name, snapshot_id, and recomputed content-derived ID still match. A tampered backup repository could edit an existing manifest to point to older valid blobs while preserving the filename and snapshot_id, causing backup verify <snapshot> to verify the attacker-selected graph and report success for a snapshot ID that no longer represents its original contents.
    Fix: On manifest load, validate the requested ID equals m.SnapshotID, recompute ComputeSnapshotID from m.CreatedAt and the loaded manifest, and reject mismatches before verification or parent-chain traversal. Consider checking a longer digest than the current 8 hex chars for stronger tamper resistance.

  • internal/backup/pagehash.go:181: ApplyHashDelta allocates d.PageCount * pageHashSize bytes based on an untrusted page count from repository metadata after only checking integer overflow. A malicious repository can provide a tiny hash-delta object with a huge PageCount and no changed pages, causing backup verify or backup create to allocate enough memory to OOM.
    Fix: Validate delta growth before allocation. Reject deltas where PageCount grows beyond base.PageCount + len(d.Pages) or appended pages are not explicitly present, and impose a sane maximum supported database/hash-map size.


Reviewers: 2 done | Synthesis: codex, 15s | Total: 19m10s

wesm added a commit that referenced this pull request Jul 4, 2026
Two roborev findings on PR #428, both medium:

LoadManifest trusted the JSON body as-is, so corrupted, hand-edited, or
renamed manifest files were accepted by list, latest, and verify. It now
recomputes the content-derived snapshot ID and rejects any manifest whose
computed ID does not match both the requested filename and the embedded
snapshot_id. A side effect is that forged parent cycles can no longer
load at all (a cycle would need SHA-256 fixed-point IDs), so the mapChain
cycle tests are replaced by rejected-at-load coverage and the depth-limit
chain is built with genuine WriteManifest IDs; the verify geometry test
now models a buggy writer with a self-consistent ID, the only way wrong
geometry can still reach full verify.

AttachmentRefs scanned MAX(size) into int64, but attachments.size is
nullable: a content-bearing hash whose rows all lack size metadata made
backup create fail. COALESCE to the same -1 sentinel thumbnails already
use; capture resolves the real size from the file.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@roborev-ci

roborev-ci Bot commented Jul 4, 2026

Copy link
Copy Markdown

roborev: Combined Review (7dc29e4)

Summary: Two medium-severity correctness/concurrency issues need attention; no high or critical findings were reported.

Medium

  • internal/backup/verify.go:336
    Full verification validates the hash-map chain and page-map chain independently, then checks only that page-map blobs exist and page ranges cover the manifest count. It does not validate pm.PageSize against the manifest or confirm that each mapped page’s bytes hash to the materialized page-hash-map entry. A checksum-valid but semantically incorrect page map could pass backup verify while restoring incorrect database bytes.
    Fix: Return the materialized hash map from checkHashMapChain, validate pm.PageSize == m.DB.PageSize, and in full mode walk page-map runs to bounds-check blob offsets and compare each mapped page’s PageHash against the hash map.

  • internal/backup/lock.go:132
    Stale lock reaping performs Stat and later Remove by path without serializing or rechecking ownership. If two processes recover the same stale exclusive.json, one can remove the other’s freshly planted lock between stale check and remove, allowing both operations to proceed as exclusive holders.
    Fix: Serialize lock-directory mutation with a guard/advisory lock, or atomically revalidate the specific lock identity immediately before unlinking and planting the replacement.


Reviewers: 2 done | Synthesis: codex, 11s | Total: 14m29s

@wesm wesm force-pushed the backup-restore-design branch from 7dc29e4 to 620a2e5 Compare July 4, 2026 13:53
@roborev-ci

roborev-ci Bot commented Jul 4, 2026

Copy link
Copy Markdown

roborev: Combined Review (620a2e5)

High-level verdict: changes need fixes before merge due to attachment backup correctness and snapshot consistency risks.

High

  • internal/backup/attachments.go:252
    Attachment capture hardcodes <hash[:2]>/<hash> instead of using attachments.storage_path. Existing SyncTech SMS MMS imports store files under synctech-sms/<hash[:2]>/<hash>, so backups for those archives will fail to read valid local attachments.
    Fix: Include storage_path/thumbnail_path in the frozen attachment refs and read from the recorded path, validating the hash from the bytes read.

Medium

  • internal/backup/frozen.go:62
    The daemon operation gate is released immediately after pinning the SQLite snapshot, but attachment files are read much later. A concurrent gated purge such as remove-account can delete files still referenced by the pinned DB snapshot, causing the backup to fail or no longer represent the frozen point.
    Fix: Keep a backup file-read guard active through attachment capture, or make attachment deletion defer/skip while a backup session is active.

Reviewers: 2 done | Synthesis: codex, 8s | Total: 27m42s

@roborev-ci

roborev-ci Bot commented Jul 4, 2026

Copy link
Copy Markdown

roborev: Combined Review (e78858e)

High-level verdict: the PR has one High-risk restore correctness issue and three Medium issues around backup consistency, WAL pressure, and extras path safety.

High

  • internal/backup/restore.go:151
    --overwrite accepts a non-empty target without clearing SQLite sidecars, so an existing msgvault.db-wal or msgvault.db-shm can remain next to the restored msgvault.db. The restore proof opens the DB with immutable=1, which ignores those sidecars, so restore can report success for a target that a normal msgvault/SQLite open may read differently.

    Fix: Remove msgvault.db-wal and msgvault.db-shm before proofing/reporting success, or restore into a clean temp directory and replace the target contents.

Medium

  • internal/backup/frozen.go:62
    The daemon operation gate is released immediately after pinning the SQLite snapshot, but attachment files are captured later from refs in that snapshot. A concurrent purge/remove-account can delete a file that is still referenced by the frozen DB image before CaptureAttachments reads it, causing backups of otherwise valid snapshots to fail.

    Fix: Keep a daemon-side guard that prevents attachment deletion until capture finishes, or make purge paths defer/remain blocked while a backup is using frozen attachment refs.

  • internal/backup/create.go:84
    The pinned SQLite read transaction is deferred until the end of Create, so it stays open through attachment capture, extras capture, pack sealing, index writing, and manifest writing. During long backups this prevents WAL checkpointing and can let the live WAL grow until the whole backup completes.

    Fix: Close the frozen session as soon as ScanPages and storePageBlobs finish reading the database bytes; use a separate guard for attachment deletion if needed.

  • internal/backup/restore.go:536
    restoreExtrasEntry allows any local relative extras path, and Restore writes extras after restoring the verified msgvault.db and attachments/ tree. An attacker who can tamper with an off-site/synced backup repository can add a valid manifest/index/pack whose extras tree names msgvault.db or attachments/<hash[:2]>/<hash>. Restore then overwrites previously verified files and can still report success, since the final DB proof only checks SQLite integrity/stats and attachments are not re-hashed after extras are written.

    Fix: Restrict extras restore paths to the capture-produced roots (deletions/, config.toml, tokens/, client_secret*.json) and explicitly reject msgvault.db, attachments/, duplicates, and overlapping paths. Alternatively, restore extras before core files and verify the final completed tree.


Reviewers: 2 done | Synthesis: codex, 15s | Total: 16m54s

@wesm wesm changed the title Add backup repositories: incremental, verifiable snapshots (part 1 of #339) Add backup and restore: incremental snapshots with a self-proving restore (part 1 of #339) Jul 4, 2026
Squashed branch commits:
- fix: verify progress bar tracks blobs, not snapshots
- feat: backup restore with layered self-proof
- fix: address roborev findings on delta safety, storage paths, and progress

Generated with Codex (GPT-5)
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
@wesm wesm force-pushed the backup-restore-design branch from 17be3ed to f5473cc Compare July 4, 2026 16:44
Restore now places attachment blobs at database-recorded storage paths,
but version-1 readers (earlier builds of this branch) placed every blob
at the canonical <aa>/<hash> derivation. A snapshot whose database
records namespaced paths (synctech-sms style) would restore
"successfully" under such a reader while the database points at files
that do not exist.

Manifests now record format/min_reader version 2 when any attachment
row's storage or thumbnail path deviates from the canonical derivation,
so old readers refuse them explicitly via the existing LoadManifest
gate. The check inspects every row rather than the captured refs,
because MIN(storage_path) can select a canonical path while another row
records the same hash at a namespaced one. All-canonical snapshots keep
version 1 and remain readable by older code; SupportedReaderVersion
rises to 2.

Addresses roborev review 27340.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@roborev-ci

roborev-ci Bot commented Jul 4, 2026

Copy link
Copy Markdown

roborev: Combined Review (060a73c)

Medium-risk issues found; no Critical or High findings reported.

Medium

  • internal/backup/attachments.go:168: Attachment capture defaults to one worker per CPU, and each worker later reads and retains an entire attachment file while hashing/compressing it. A library with several large videos can hold many whole files in memory at once and OOM during a default backup create.

    Fix: Add a byte-weighted memory limit or stream hashing/compression, and consider a much smaller default parallelism for attachment capture.

  • internal/backup/restore.go:594: restoreExtrasEntry only checks that an extras path is local before writing it under the restore target. A tampered backup repository can include extras such as attachments/<aa>/<hash> or msgvault.db and overwrite already-restored archive files, allowing restore to report success while materializing attacker-controlled content.

    Fix: Reject extras paths that overlap managed restore outputs, including msgvault.db, SQLite sidecars, and anything under attachments/, or restore extras before DB/attachments while still reserving those paths explicitly.

  • internal/backup/manifest.go:91: Snapshot IDs use only the first 4 bytes of the manifest SHA-256. A repository writer can feasibly brute-force a 32-bit target, replace a known <snapshot-id>.mvmanifest, and have list, verify, or restore accept the forged manifest under the same ID.

    Fix: Include a substantially longer digest in the snapshot ID, at least 128 bits, or add a signed/MACed manifest if repository tamper resistance is intended.


Reviewers: 2 done | Synthesis: codex, 12s | Total: 17m19s

Addresses roborev reviews 27340 and 27341 (five Medium findings):

- Attachment capture admits work through a byte gate (1 GiB budget) in
  addition to the worker-count window: a per-CPU pool holding whole
  files could OOM a default backup of a video-heavy library. Oversized
  files are admitted alone, and the gate unblocks on failure so an
  aborted capture cannot deadlock the dispatcher.
- restoreExtrasEntry rejects entries naming the restored database, its
  SQLite sidecars, or the attachments tree (case-insensitively, for
  macOS filesystems): a tampered extras tree could otherwise overwrite
  already-proven restore outputs after the proof ran on them.
- Snapshot IDs now embed 128 bits of the manifest SHA-256 instead of
  32: LoadManifest's recompute check is what rejects forged manifests,
  and a 32-bit suffix is brute-forceable. Full tamper resistance
  arrives with repository encryption; the longer digest closes the
  cheap forgery in the meantime. Pre-release snapshots with short IDs
  are no longer readable — recreate dev repositories.
- Full verify checks the materialized page map's page size against the
  manifest, mirroring the hash-map geometry check; the mismatch was
  previously only caught at restore time.
- writeRun's blob bounds check is subtraction-based: a decoded page-map
  run with a huge BlobOffset could wrap the addition-based comparison
  and panic on the slice instead of failing the restore.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@roborev-ci

roborev-ci Bot commented Jul 4, 2026

Copy link
Copy Markdown

roborev: Combined Review (b8bbeda)

High-level verdict: Changes need attention before merge due to plaintext backup exposure risk.

High

  • internal/backup/create.go:133
    Create builds the pack appender with a nil crypter while repositories are initialized with encryption = "none" and encrypted repos are refused. This means the new backup path writes message database pages and attachments into readable backup packs, while the docs recommend syncing that repository off-site. Anyone with read access to the synced repository, such as a cloud-sync provider or compromised backup destination, could restore or parse the backup and recover archived messages and attachments.

    Suggested remediation: implement and require authenticated repository encryption before recommending untrusted/off-site sync, or clearly gate/document plaintext repositories as trusted-storage-only. At minimum, avoid describing plaintext repositories as safe for off-site sync without a confidentiality warning.


Reviewers: 2 done | Synthesis: codex, 7s | Total: 17m42s

wesm and others added 2 commits July 4, 2026 13:09
On a first backup every page is dirty, so create spends most of its wall
clock compressing and writing ~all page content into packs — a phase
that had no progress stage of its own (the scan bar sat frozen at its
last tick for minutes, looking hung) and ran zstd on a single core
(observed 85% CPU / 383 MB/s against an 8-core machine on a 66 GB
archive).

storePageBlobs now reports a dedicated "pack" stage (skipped entirely
when nothing is dirty) and fans read+hash+compress out to --jobs
workers with a strictly in-order collector, so pack contents and the
page-map delta are byte-identical to a serial run. --jobs 1 keeps
reads strictly serial for archives on spinning disks or NAS shares,
matching the attachment-capture contract.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The restore proof runs PRAGMA integrity_check, which reads the entire
restored database inside SQLite with no events to report — on a 66 GB
archive the proof bar sat at 0% for minutes looking hung. Stage lines
also gave no sense of how long each phase took.

Every progress line (TTY and plain) now carries the stage's elapsed
wall-clock time once it passes one second, and final lines record the
stage's total duration. An idle ticker re-renders an open TTY line
when no event has arrived for a throttle interval, so silent stages
visibly count up instead of freezing. The proof stage also reports a
sub-step: integrity_check done (1/2), stats comparison done (2/2).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@roborev-ci

roborev-ci Bot commented Jul 4, 2026

Copy link
Copy Markdown

roborev: Combined Review (940d5fe)

No issues found.


Reviewers: 2 done | Synthesis: codex | Total: 20m3s

wesm added 3 commits July 4, 2026 13:44
Launching the TUI against an archive with no analytics cache (e.g. a
freshly restored backup) left the screen blank for minutes with only a
small spinner: in the default engine="auto" mode the daemon silently
falls back to live SQL, and the first aggregate query over a large
database can take a long time. Nothing told the user why.

Now the tui command asks the daemon for cache stats before launching
(best-effort, 5s timeout) and, when the cache is missing, prints a
notice recommending 'msgvault build-cache' and shows the same message
on the TUI info line while aggregate views load, in both email and
texts modes.
…sfires

Review findings from jobs 27346, 27350, and 27352:

- restoreExtrasEntry validated the raw extras path but wrote through
  filepath.Join, which cleans ".." away — a tampered tree entry like
  safe/../msgvault.db passed the reserved-name check yet landed on the
  restored database. Paths are now cleaned before validation.
- The progress renderer's throttle discarded events before recording
  them, so the idle ticker kept repainting stale counts (a quick proof
  1/2 inside the throttle window stayed rendered as 0/2 through the
  long stats comparison). lastEvent is now recorded before throttling.
- The TUI analytics-cache notice asked the daemon's /cli/cache-stats,
  which runs DuckDB scans over the whole cache when one exists (slowing
  every launch) and reports missing Parquet files for PostgreSQL
  backends where build-cache does not even apply. The notice now
  mirrors the daemon's own engine decision from shared local config at
  stat/glob cost and stays silent for postgres, engine="sql", and
  configured remotes.
…uide

The TUI guide claimed the daemon builds or updates the Parquet cache
"when aggregate views need it", but aggregate views never trigger a
build: the engine is chosen once at daemon startup, stale caches are
refreshed after scheduled syncs, and startup builds happen only under
engine="duckdb". This is exactly the surprise behind the blank first
view on a freshly restored archive, so document the actual triggers
and the new launch notice pointing at 'msgvault build-cache'.
@wesm wesm changed the title Add backup and restore: incremental snapshots with a self-proving restore (part 1 of #339) Add msgvault backup: init, create, list, verify, and restore commands (part 1 of #339) Jul 4, 2026
Review findings from jobs 27353 and 27354, plus the nix CI failure:

- The TUI cache notice inferred live-SQL fallback from local disk state,
  but the daemon chooses its engine once at startup: after a cache
  build the notice went silent while the running daemon stayed on live
  SQL, and the staleness probe opened the SQLite archive directly
  (checkpointing the WAL on close) on every launch. The daemon now
  records its startup choice (duckdb, sql, sql-fallback, postgres) and
  reports it as analytics_engine on /health; the notice keys off that
  authoritative O(1) signal, works for configured remotes, and tells
  the user a daemon restart is needed after build-cache.
- docs: auto_build_cache only controls the startup build under
  engine="duckdb" — scheduled-sync refreshes ignore it. Corrected the
  tui.md sample comment and the storage.md claim that the daemon
  builds the cache "when aggregate views need it".
- nix: update vendorHash for this branch's dependency changes
  (verified with a local nix build).
@roborev-ci

roborev-ci Bot commented Jul 4, 2026

Copy link
Copy Markdown

roborev: Combined Review (f848d0d)

Summary verdict: one medium-severity restore-safety issue remains.

Medium

  • internal/backup/verify.go:620 — Full backup verify only checks that page-map blob IDs exist and can be read; it does not validate each page-map run’s BlobOffset/length bounds or that the mapped page bytes match the materialized page-hash map. A self-consistent bad snapshot can verify cleanly but later fail restore in writeRun.
    • Fix: In full verify, validate page-map runs against the raw page blobs and hash map, reusing or extracting the restore-side bounds and page-hash checks without writing files.

Reviewers: 2 done | Synthesis: codex, 7s | Total: 21m27s

@wesm wesm changed the title Add msgvault backup: init, create, list, verify, and restore commands (part 1 of #339) Add msgvault backup: init, create, list, verify, and restore commands (#339) Jul 4, 2026
The Windows CI lane failed across internal/backup and internal/config:

- restoredDBDSN passed the OS path straight into a file: URI, so a
  Windows drive-letter path full of backslashes was percent-encoded
  and read by SQLite's URI parser as an authority ("invalid uri
  authority: C:%5CUsers..."), failing every restore proof. The path is
  now slash-converted and slash-rooted (file:///C:/dir/msgvault.db).
- Extras "not a regular file" errors printed the OS-separator relative
  path while tree entries and tests use the canonical slash form.
- TestRestoreTargetPathWithURISyntax used '?' in a directory name,
  which Windows filenames cannot contain; the Windows variant keeps
  '#' and space, and a new unit test pins the drive-letter DSN shape.
- TestLoadWithBackupConfig hardcoded "/mnt/...", which is not absolute
  on Windows and was resolved relative to the home directory; the
  fixture now substitutes a platform-absolute path.
@roborev-ci

roborev-ci Bot commented Jul 4, 2026

Copy link
Copy Markdown

roborev: Combined Review (8d96010)

Restore has one medium-severity path handling bug that should be fixed before merge.

Medium

  • internal/backup/restore.go:472: Relative restore targets are converted into root-relative SQLite URIs. For --target restore, restoreDBDSN("restore/msgvault.db") becomes file:///restore/msgvault.db?..., so the proof/attachment path query opens the wrong database and the restore fails.
    • Fix: Resolve non-drive-letter relative paths with filepath.Abs before building the file: URI, or make TargetDir absolute at the start of Restore; add a relative-target restore test.

Reviewers: 2 done | Synthesis: codex, 6s | Total: 16m4s

…s lane

Review 27358 (Medium): restoredDBDSN's Windows fix slash-rooted every
non-rooted path, so a relative --target like "restore-out" had its
database proven at /restore-out/msgvault.db instead of the materialized
file. The path is now made absolute before URI construction, with a
relative-target restore test and DSN unit tests (the drive-letter shape
check is Windows-only since filepath.Abs treats drive paths as relative
elsewhere).

Also the last Windows CI failure: the extras mode-preservation assert
expects the captured 0640 to round-trip, but Windows has no POSIX
permission bits and Stat reports 0666 for any writable file, so the
exact-mode check is now POSIX-only.
@roborev-ci

roborev-ci Bot commented Jul 4, 2026

Copy link
Copy Markdown

roborev: Combined Review (92e30d2)

Medium-risk issue found; no Critical or High findings.

Medium

  • cmd/msgvault/cmd/backup.go:364backup create uses urlFromDaemonRuntime(rt) for the daemon freeze API, but freeze handlers reject non-loopback RemoteAddr. If the daemon is bound to a specific non-loopback address like 192.168.x.x, the subprocess calls the freeze endpoint through that address and receives 404, preventing backups for an otherwise valid remote/NAS bind configuration.
    Suggested fix: Use a reachable loopback listener for freeze control, authenticate/relax the freeze endpoint for same-daemon subprocess calls, or reject and clearly handle specific non-loopback bind addresses for backup create.

Reviewers: 2 done | Synthesis: codex, 7s | Total: 22m33s

Combined review of 92e30d2 (Medium): backup create dials the daemon at
its configured bind address, but the freeze handlers required a
loopback RemoteAddr — so a daemon bound to a specific LAN IP (a NAS
bound to 192.168.x.x) answered its own freeze subprocess with 404 and
backups were impossible on an otherwise valid configuration.

The freeze gate now accepts same-host callers: loopback, or a source
address assigned to one of this machine's own interfaces, which is
exactly how a same-host connection to a LAN bind arrives. The check
still reads r.RemoteAddr (never forwarded-for headers) and API-key
auth is still required, including for the new same-host case.
@roborev-ci

roborev-ci Bot commented Jul 4, 2026

Copy link
Copy Markdown

roborev: Combined Review (68a3c4b)

Medium findings:

  • internal/backup/create.go:122 - Create accepts a context, but large scan, changed-page packing, and attachment capture paths do not observe cancellation. Canceling backup create during these phases can leave work running until completion or daemon termination, keeping the repo lock fresh until stale-lock timeout. Thread ctx through ScanPages, storePageBlobs, and CaptureAttachments; check cancellation before dispatch/read work and return ctx.Err() after draining workers so cleanup defers run.

  • internal/backup/repo.go:156 - CleanStaging trusts <repo>/staging as a real directory. If another principal can replace staging with a symlink, the next cleanup can remove entries in the symlink target as the msgvault user. Validate repo structural directories with os.Lstat before use and refuse or replace symlinks; for staging cleanup, remove and recreate the staging path itself or otherwise operate only on a verified real directory under the repo root.


Reviewers: 2 done | Synthesis: codex, 9s | Total: 25m7s

…aging

Combined review of 68a3c4b (two Mediums):

- Canceling backup create during the scan, changed-page packing, or
  attachment capture phases left the phase running to completion, and
  the lock heartbeat kept the repo locked meanwhile. ScanPages,
  storePageBlobs, and CaptureAttachments now take ctx: dispatchers
  stop on ctx.Done, collectors fail fast and drain their workers, and
  a post-drain ctx check prevents an early dispatcher exit from
  returning partial work as success.
- CleanStaging trusted <repo>/staging blindly: a "staging" symlink
  planted by another principal would have its target's entries
  RemoveAll'd as the msgvault user on the next cleanup. The path is
  now Lstat'd and anything that is not a real directory is refused.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants