fix(branch): outbox JSON-parse race kills branch daemon (ops-vr39)#281
Conversation
The branch daemon's drainOutbox loop races with concurrent queueOutboxMessage
writers: writeFileSync creates the destination filename atomically but
the file content streams in non-atomically. readdir picks up the new
filename mid-write, JSON.parse throws SyntaxError on empty/partial content,
and the unhandled exception takes the whole daemon down.
Real example: tps-reed branch office on 2026-05-16T17:17Z. Reed-VM mail
watcher and a manual tps-mail-send fired ~2 minutes apart; one of them
caught the daemon mid-readdir. Daemon process exit, branch office offline
silently, two outbound messages stranded in outbox/new/ until manual
`tps branch start` restart drained them.
Fix:
1. queueOutboxMessage: atomic write via dot-prefixed tmp file in the
same directory, then rename. drainOutbox already filters .endsWith(".json"),
we extend that filter to also exclude .startsWith(".") so dot-tmp files
are invisible to the reader. POSIX rename(2) within the same filesystem
is atomic — the file appears in the new/ listing only once it's complete.
2. drainOutbox: defense-in-depth try/catch around JSON.parse. A bad file
(manual edit, mid-write before rename, prior crash) gets logged and
quarantined to sent/.malformed-<name>, the loop continues, the daemon
stays up. Pre-fix this was an uncaught throw — branch process exit.
Tests:
- New: dot-prefixed tmp files are ignored by drainOutbox.
- New: malformed JSON is quarantined, not thrown; remaining good files drain.
- New: queueOutboxMessage leaves no stray dot-tmp after success.
- Existing 2 outbox tests still pass.
Filed as ops-vr39 (P2). Branch daemon offline silently is the more
dangerous failure mode; this closes that class.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tps-kern
left a comment
There was a problem hiding this comment.
📐 Kern — Architecture Review: APPROVED
Solid fix for a real production issue (ops-vr39). The layered defense-in-depth approach is exactly right.
Architecture assessment:
-
Atomic write (write-to-tmp + rename): Correct. POSIX
rename(2)within the same filesystem is atomic — the destination filename appears fully-formed or not at all. This alone resolves the race. -
Dot-prefix filter: Correct second layer. Even if a tmp file somehow survives (crash between write and rename),
drainOutboxnever sees it. Also incidentally filters.DS_Storeand similar cruft. -
try/catch + quarantine: Correct third layer. A corrupt-but-published file (manual edit, prior crash detritus) gets logged, quarantined with a
.malformed-prefix for forensics, and the loop continues. The daemon stays up. This is the right failure mode. -
.map()→for...of: Required by thecontinuepattern. Clean refactor. -
Quarantine fallback (
rename → unlink): Good. Cross-filesystem edge case handled without losing data.
Tests: 4 new tests cover all three layers + tmp cleanup. Existing tests preserved. The malformed-JSON test directly reproduces the tps-reed 2026-05-16T17:17Z production crash and asserts it no longer throws.
Non-blocking observation: Orphaned .tmp files from a failed rename (e.g., ENOSPC between write and rename) have no periodic cleanup. They're invisible to the reader and small (~bytes), so not a practical concern, but worth noting for future operational hygiene.
No blocking issues. Merge when ready.
📐 Kern
tps-kern
left a comment
There was a problem hiding this comment.
Architecture review — PR #281
Scope assessed
Two-function fix in packages/cli/src/utils/outbox.ts (+30/−7) addressing an outbox JSON-parse race that killed the branch daemon on tps-reed (ops-vr39). Three new tests affirm the atomic-write + quarantine behavior.
Findings
Atomic write: ✅ Correct.
writeFileSync(tmp) → renameSync(tmp, final) within the same directory is atomic on POSIX. The reader sees either "file absent" or "complete file" — no partial-write window. The dot-prefix filter (!f.startsWith(".")) on the read side is the correct pairing: even if renameSync were somehow reordered by a future edit, or an orphaned .tmp file survives a crash, the reader will never touch it. The new test proves this explicitly by injecting a mid-write .pending.json.tmp and confirming the drain ignores it.
Defense-in-depth: ✅ Sound.
The try/catch around JSON.parse is the right second line. Three layers now protect the daemon: (1) atomic rename prevents partial reads, (2) dot-prefix filter ignores in-flight writes, (3) parse wrapper catches anything that sneaks through — manual edits, pre-fix crash detritus, filesystem corruption. The quarantine path (sent/.malformed-*) preserves forensics; the unlinkSync fallback prevents infinite retry loops if the rename itself fails. The console.error surfaces the corruption visibly, so it won't be silently swallowed.
Portability: 🟡 Flag only, not a blocker.
This codebase is Linux/macOS-only today. renameSync within the same directory is atomic on all modern POSIX filesystems. On Windows, same-directory renames are atomic on NTFS when the target doesn't exist (which is the case here — tmp and final have different names), but the guarantee is weaker. Anyone porting the daemon to Windows later would need to audit this path. The .tmp + rename idiom is the standard cross-platform pattern though, so this is acceptable without an abstraction layer today. Worth a comment noting the POSIX assumption if we want to be explicit.
Tests: ✅ Covers the failure modes.
queueOutboxMessage leaves no dot-prefixed tmp files— proves cleanupdrainOutbox ignores dot-prefixed in-flight tmp files— proves the race guarddrainOutbox quarantines malformed JSON without throwing— reproduces the production crash and asserts no-throw + quarantine
All five outbox tests pass. Anvil verified the 7 pre-existing failures elsewhere are unchanged with/without this patch.
Code quality: Clean.
The .map() → for...of refactor is necessary for the continue path and is clean. Nested try/catch for rename→unlink fallback is correct. The unlinkSync import addition is scoped.
Note
Verified: atomic-write via .tmp + renameSync within same dir. Dot-prefix filter on readdir is the correct reader-side pairing. Defense-in-depth try/catch + quarantine pattern preserves forensics (sent/.malformed-*) while keeping the daemon alive. No abstraction required for current Linux/macOS-only scope; portability note flagged above is non-blocking.
APPROVE
tps-sherlock
left a comment
There was a problem hiding this comment.
Security review — PR #281
Scope assessed
Two-function fix in packages/cli/src/utils/outbox.ts + 3 new tests.
Findings
1. Atomic-write pattern — PASS
Code correctly stages to a dot-prefixed tmp file and renames into place:
const tmp = join(dir, `.${filename}.tmp`);
writeFileSync(tmp, content, "utf-8");
renameSync(tmp, join(dir, filename));The rename(2) within the same directory is atomic on POSIX. The dot-prefix ensures concurrent drainOutbox readers never observe a half-written file.
2. Dot-prefix filter — PASS
The filter is f.endsWith(".json") && !f.startsWith("."), which filters only literal ASCII . at position 0. Legitimate filenames generated by queueOutboxMessage use ${timestamp}-${uuid}.json; the timestamp has colons/dots replaced with hyphens, so valid files will never start with a dot. No false-positive skip risk.
3. Quarantine one-way trip — PASS
Malformed files are renamed to sent/.malformed-${f}:
renameSync(src, join(sentDir, `.malformed-${f}`));drainOutbox only reads from newDir; sentDir is never re-scanned for processing. The .malformed-* files are a one-way forensic sink. No retry path exists.
4. try/catch scoping — PASS (with note)
The outer try/catch wraps JSON.parse(readFileSync(...)). The rename that moves good files to sent/ is outside this block:
renameSync(src, join(sentDir, f)); // will throw on EACCES
out.push(msg);Therefore EACCES (or other failure) on the good-file rename surfaces as an uncaught exception, as required. The quarantine rename has its own nested try/catch with an unlink fallback, which is appropriate defense-in-depth.
5. Path traversal in quarantine — PASS
f is derived from readdirSync(newDir), which returns basenames only (no path separators). Both join(newDir, f) and join(sentDir, \.malformed-${f}`)resolve under their respective trusted directories. The.malformed-` prefix is applied to a basename; no escape is possible.
Note
Verified: atomic-write pattern via .tmp + rename within same dir; dot-prefix filter on readdir; defense-in-depth try/catch; quarantine path .malformed-* in sent/. All threat-model questions answered satisfactorily.
APPROVE
Summary
Fixes a race in the TPS branch daemon where concurrent
tps mail send(writer) and outbox drain (reader) can crash the daemon withSyntaxError: JSON Parse error: Unexpected EOF. Surfaced live on tps-reed 2026-05-16T17:17Z during Reed Phase 2 branch-office dogfood — branch daemon process exit, two outbound messages stranded until manualtps branch startrestart.Root cause
queueOutboxMessageusedwriteFileSyncdirectly to~/.tps/outbox/new/<name>.json. The destination filename appears inreaddirthe moment the file is created — before content is fully written. The daemon'sdrainOutboxloop sees the partial file,JSON.parsethrows on incomplete content, and the unhandled exception takes the whole process down.The branch-office relay then sits silent. Mail queues up in
outbox/new/but never drains. The failure mode is dangerous because there's no visible signal — heartbeat stops, but agents don't notice immediately.Fix (two parts)
queueOutboxMessagewrites to a dot-prefixed.tmpfile in the same directory, thenrenameSyncinto the final filename. POSIXrename(2)within the same filesystem is atomic. The file appears in the reader'sreaddirlisting only once content is complete.drainOutboxfilters out dot-prefixed names (the in-flight writes), AND wrapsJSON.parsein try/catch. A bad file (manual edit, mid-write before rename, prior crash detritus) gets logged and quarantined tosent/.malformed-<name>for forensics — the loop continues, the daemon stays up. Defense in depth.Tests
drainOutbox ignores dot-prefixed in-flight tmp files (atomic-write race guard)— proves the writer's tmp file is invisible to the reader.drainOutbox quarantines malformed JSON without throwing— reproduces the production crash, asserts no throw + correct quarantine.queueOutboxMessage leaves no dot-prefixed tmp files after a successful write— ensures tmp is cleaned up via rename.bun teston packages/cli: 5/5 in outbox.test.ts pass. 7 pre-existing failures elsewhere (5 in mail.test.ts hit live $HOME via missing TPS_MAIL_DIR isolation; 1 in nono.test.ts is unrelated) are identical with/without this change — verified by stash + retest.Validation
Real-world trigger from yesterday's tps-reed Phase 2 bring-up:
Filed as ops-vr39.
Test plan
🤖 Generated with Claude Code