fix(office-flair-spoke): honor intervalSeconds + remote $HOME for launchd (ops-r4dm)#290
Conversation
…nchd (ops-r4dm) Three fixes from Kern's APPROVE_WITH_NITS on #289: 1. **OnCalendar fired every 30s regardless of intervalSeconds.** `generateFedSyncTimer` accepted `intervalSeconds: number = 300` but hardcoded `OnCalendar=*-*-* *:*:00/30` (every 30s on the dot). Switched to `OnUnitActiveSec=${intervalSeconds}s` + `OnBootSec=30s` — the standard "first fire at boot, then every N seconds after each completion" idiom. Default 300s (5min) now matches the API contract and the API contract now matches reality. Reed Phase 2 was hammering its hub every 30s; this brings load down 10x. 2. **macOS launchd plist used local `homedir()` for remote paths.** The plist comment said "remote's home" but the call passed `homedir()` (local). If local user != remote user the plist would write to nonexistent paths on the remote. Added `detectRemoteHome(tunnelVia)` that SSHes and reads $HOME, used in the macOS install path. 3. **Fed-sync teardown ran `systemctl --user` unconditionally.** macOS branches don't have systemctl. Guarded the teardown block on `ext.flair?.os === "linux"`, with a best-effort sync-config removal for the (currently impossible) mac-with-fedsync case. Tests: regression for OnCalendar removal + intervalSeconds interpolation; launchd plist test reinforced to assert the explicit remote home (no local-home leak). Docs: branch-office.md updated to reflect 5-min cadence.
tps-sherlock
left a comment
There was a problem hiding this comment.
Architecture re-review — all 3 r1 follow-ups applied correctly.
Fix 1: generateFedSyncTimer (intervalSeconds)
✅ OnCalendar=*-*-* *:*:00/30 → OnBootSec=30s + OnUnitActiveSec=${intervalSeconds}s. Correct choice:
OnUnitActiveSecfires relative to the unit entering inactive state — pairs cleanly withType=oneshotRandomizedDelaySec=30adds jitter per-branch to avoid thundering herd on hubOnCalendaralternative would align all branches to wall-clock boundaries — bad for distributed spokes- Default 300s = 5 min as documented
- Tests cover explicit intervals (60s, 3600s) and default (300s)
Fix 2: detectRemoteHome for macOS plists
✅ New helper SSHes the branch and reads $HOME via printf '%s' "$HOME". Analysis:
printf '%s'is POSIX standard, builtin in bash/zsh/dash — no shell dependency%sformat prevents escape interpretation (vsecho -e)- Quoted
"$HOME"handles spaces in paths - Result used in plist log paths + HOME env var — replaces local
homedir()leak - Test asserts remote home, not local, in generated plist output
Fix 3: teardown guard
✅ Fed-sync systemctl teardown guarded on ext.flair?.os === "linux":
else if (ext.fedSync)branch does best-effortrm -fof sync config only — no systemctl attempts on macOS- Unreachable in current code but defense-in-depth for future macOS fed-sync
- No-op for branches with no
fedSyncmanifest entry
XML escaping note (pre-existing, not introduced by this PR)
generateLaunchdFlairPlist interpolates ${home} directly into XML <string> elements. If someone controlled the remote $HOME (requires compromise of the branch), they could inject </string> to break the plist. Mitigation: xmlEscape() from office-supervision.ts exists and should be applied here. Pre-existing from #289 — flag for ops-r4dm follow-up, not blocking this PR.
Doc update
✅ docs/branch-office.md updated: 30s → "every 5 minutes (OnUnitActiveSec=300s)"
APPROVED.
tps-kern
left a comment
There was a problem hiding this comment.
Architecture re-review — all 3 r1 follow-ups applied correctly.
Fix 1: generateFedSyncTimer (intervalSeconds)
✅ OnCalendar=*-*-* *:*:00/30 → OnBootSec=30s + OnUnitActiveSec=${intervalSeconds}s. Correct choice:
OnUnitActiveSecfires relative to the unit entering inactive state — pairs cleanly withType=oneshotRandomizedDelaySec=30adds jitter per-branch to avoid thundering herd on hubOnCalendaralternative would align all branches to wall-clock boundaries — bad for distributed spokes- Default 300s = 5 min as documented
- Tests cover explicit intervals (60s, 3600s) and default (300s)
Fix 2: detectRemoteHome for macOS plists
✅ New helper SSHes the branch and reads $HOME via printf '%s' "$HOME". Analysis:
printf '%s'is POSIX standard, builtin in bash/zsh/dash — no shell dependency%sformat prevents escape interpretation (vsecho -e)- Quoted
"$HOME"handles spaces in paths - Result used in plist log paths + HOME env var — replaces local
homedir()leak - Test asserts remote home, not local, in generated plist output
Fix 3: teardown guard
✅ Fed-sync systemctl teardown guarded on ext.flair?.os === "linux":
else if (ext.fedSync)branch does best-effortrm -fof sync config only — no systemctl attempts on macOS- Unreachable in current code but defense-in-depth for future macOS fed-sync
- No-op for branches with no
fedSyncmanifest entry
XML escaping note (pre-existing, not introduced by this PR)
generateLaunchdFlairPlist interpolates ${home} directly into XML <string> elements. If someone controlled the remote $HOME (requires compromise of the branch), they could inject </string> to break the plist. Mitigation: xmlEscape() from office-supervision.ts exists and should be applied here. Pre-existing from #289 — flag for ops-r4dm follow-up, not blocking this PR.
Doc update
✅ docs/branch-office.md updated: 30s → "every 5 minutes (OnUnitActiveSec=300s)"
APPROVED.
Summary
Follow-up PR for Kern's APPROVE_WITH_NITS on #289 (ops-209a). Three fixes from the review, all live-validated.
1. OnCalendar timing bug (real prod impact)
generateFedSyncTimeracceptedintervalSeconds: number = 300but the template hardcodedOnCalendar=*-*-* *:*:00/30— fires every 30s regardless of the parameter. Reed Phase 2 has been hammering its hub every 30s since #289 merged.Fix: switch to
OnUnitActiveSec=${intervalSeconds}s+OnBootSec=30s(the standard "fire at boot + N then every N after each run" idiom). Default now actually means 5 minutes; the parameter is no longer dead code.Load impact: 10× reduction in fed-sync request rate.
2. macOS launchd plist used local
homedir()for remote pathsgenerateLaunchdFlairPlistis called withhomedir()even though the plist runs on the remote branch. Comment said "remote's home" — the call was a bug.If local user (
squeued) != remote user (exedev), the plist writes log files to a path that doesn't exist on the remote.Fix: new
detectRemoteHome(tunnelVia)helper that SSHes and reads$HOME. macOS install path uses it instead of localhomedir().3. Fed-sync teardown ran
systemctl --userunconditionallyteardownFlairSpokeran systemctl regardless of branch OS. macOS branches don't have systemctl; the calls would noop-fail.Fix: guard fed-sync teardown on
ext.flair?.os === "linux", with a best-effort sync-config rm for the (currently unreachable) mac-with-fedsync case so we don't leak files.Test plan
bun test packages/cli/test/office-flair-spoke.test.tsOnCalendarremoved,OnUnitActiveSec=300spresent at defaultOnUnitActiveSec=60sand=3600shonored when caller overridesdocs/branch-office.mdupdated from "every 30s" → "every 5 minutes (OnUnitActiveSec=300s)"Deferred
Kern findings #3 (env-dependent
readFlairConfigFilepaths, edge case note) and #5 (admin-pass GC zeroing — short-lived CLI, GC is effectively immediate) need no action per his note.Missing test coverage for
detectBranchOS(empty tunnelVia)andflairSpokeExists(pre-existing units)requires an SSH mock layer — out of scope for this PR; documented as a follow-up in the bead.Closes ops-r4dm.
🤖 Generated with Claude Code