
fix(account): defensive against malformed KV user records (FAULT 17 — Sytze hotfix)#49

Merged
SoapyRED merged 2 commits into main from hotfix/prod-incident-2026-05-21
May 21, 2026


@SoapyRED
Owner

Summary — production incident hotfix

Sytze (first paying Pro customer) saw "Application error: a server-side exception has occurred while loading www.freightutils.com (Digest: 3327402320)" at ~09:14 UTC.

Root cause is NOT a deploy regression. Production SHA is e9749a8e55115a54872cd8ef3a415243278ec6c9 from 2026-05-17 and was unchanged through the incident (verified against the Vercel target: "production" deployment, the Sentry release tag, and the runtime logs). PR #47 (admin-signup-notifications) is unmerged, base SHA = current main HEAD — its preview never reached production.

Root cause is a data-shape regression from the ~08:30 UTC Chrome dashboard edit (FAULT 16 manual recovery): the Upstash JSON edit was a full-string replace, not a merge, so user:spmosselaar@gmail.com was left as { plan: "pro", stripeCustomerId: "cus_UYLSdNQnCwt5Tf" } — missing the apiKey, email, and createdAt fields the application code expects. Both /api/auth/me:51 and /account/page.tsx:291 then did user.apiKey.slice(-4) and threw TypeError: Cannot read properties of undefined (reading 'slice').
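The replace-versus-merge difference can be reproduced in isolation. The following is a minimal sketch, not the application's code: the `UserRecord` interface, the placeholder values, and the helper names are assumptions made for illustration; only the field names and the `apiKey.slice(-4)` expression come from this PR.

```typescript
// Hypothetical record shape, inferred from the fields named in this PR.
interface UserRecord {
  email: string;
  plan: string;
  apiKey: string;
  stripeCustomerId?: string;
  createdAt: string;
}

// Placeholder values for illustration only.
const before: UserRecord = {
  email: "customer@example.com",
  plan: "free",
  apiKey: "fu_live_example0000",
  createdAt: "2026-01-01T00:00:00.000Z",
};

// The fields the manual edit intended to change.
const edit = { plan: "pro", stripeCustomerId: "cus_example" };

// Full replace: only the edited fields survive, everything else is lost.
const replaced = JSON.parse(JSON.stringify(edit)) as Partial<UserRecord>;

// Merge: existing fields are kept, edited ones are overwritten.
const merged: UserRecord = { ...before, ...edit };

// Mirrors the throwing expression; the cast models code that trusts the stored shape.
function last4Unsafe(user: Partial<UserRecord>): string {
  return (user as UserRecord).apiKey.slice(-4);
}

// Runs fn and reports whether it threw a TypeError.
function throwsTypeError(fn: () => unknown): boolean {
  try {
    fn();
    return false;
  } catch (e) {
    return e instanceof TypeError;
  }
}

const replacedThrows = throwsTypeError(() => last4Unsafe(replaced));
const mergedLast4 = last4Unsafe(merged);
```

The replaced record reaches `slice` with `apiKey` undefined and throws; the merged record does not.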

Sentry caught it (FREIGHTUTILS-3 on /api/auth/me at 09:13:08 UTC, FREIGHTUTILS-4 on /account at 09:36:14 UTC; 3 events total, all from one geo), but no alert rule was wired, so Soap didn't see it until Sytze emailed. Anonymous traffic and every other authenticated user were unaffected throughout the incident.

No rollback — there is no broken deploy to roll back; the deploy is innocent and rolling back wouldn't repair Sytze's KV record.

Phase 1 diagnosis

Committed first as docs/incidents/2026-05-21-prod-homepage-500.md (commit 0a945b7) per the hard rule that the audit/incident doc lands before any code or destructive action. Contains the full evidence trail: prod deployment ID, Sentry event details, Vercel runtime logs, file:line references for both throw sites, and the timeline.

Phase 3 code fix (this PR)

| File | Change |
| --- | --- |
| app/api/auth/me/route.ts | key_last4 nullish-falls-back when user.apiKey is missing; email falls back to the session email; Sentry.captureMessage('malformed_user_record', { level: 'warning' }) on missing apiKey. |
| app/account/page.tsx | API-key chip renders "not available — contact support" when user.apiKey is missing; new "needs attention" banner with mailto:contact@freightutils.com; usage lookups gated on apiKey presence; Sentry.captureMessage('malformed_user_record', { level: 'warning' }) on either missing field. |
| docs/incidents/2026-05-21-prod-homepage-500.md | Phase 1 diagnosis (separate commit 0a945b7). |
| docs/FAULT-HISTORY-AND-PREVENTION.md | FAULT CATEGORY 17 (Dashboard-edit data-shape regression) + log row. |
| STATE.md | Customer incidents section + lastUpdated → 21 May. |
| CHANGELOG.md + lib/changelog-data.ts | Mirrored 2026-05-21 entry (FAULT 5). |

Two surfaces stop 500'ing on malformed records. The Sentry capture means the next dashboard edit that drops a field pages Soap before the customer notices (assuming the alert rule below is wired).
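The guard described for /api/auth/me could look roughly like the sketch below. It is an illustration under assumptions, not the code in this PR: `StoredUser`, `MeResponse`, `buildMeResponse`, and the injected `Reporter` are hypothetical names, and the reporter stands in for `Sentry.captureMessage` so the example runs without the SDK.

```typescript
// Stand-in for Sentry.captureMessage(message, { level: 'warning' }); injected so
// the sketch has no SDK dependency. Hypothetical type, not part of this PR.
type Reporter = (message: string, context: { level: "warning" }) => void;

// Every field is optional because a dashboard edit may have dropped any of them.
interface StoredUser {
  email?: string;
  plan?: string;
  apiKey?: string;
  createdAt?: string;
}

// Assumed response shape; only key_last4 and email are named in the PR.
interface MeResponse {
  email: string;
  plan: string | null;
  key_last4: string | null;
}

function buildMeResponse(
  user: StoredUser,
  sessionEmail: string,
  report: Reporter,
): MeResponse {
  if (!user.apiKey) {
    // Surface the malformed record instead of letting it crash the request.
    report("malformed_user_record", { level: "warning" });
  }
  return {
    // Fall back to the session email when the stored one is missing.
    email: user.email ?? sessionEmail,
    plan: user.plan ?? null,
    // Optional chaining avoids calling slice on undefined.
    key_last4: user.apiKey?.slice(-4) ?? null,
  };
}

// Usage with the malformed shape from the incident (placeholder values).
const reported: string[] = [];
const response = buildMeResponse(
  { plan: "pro" },
  "customer@example.com",
  (message) => { reported.push(message); },
);
```

Injecting the reporter keeps the function pure enough to unit-test with a malformed-record fixture.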

Soap's checklist before merge

1. Sytze KV record repair — Chrome (cannot do from this sandbox)

The dashboard edit dropped 3 fields from user:spmosselaar@gmail.com. Need to restore them. The apiKey value is recoverable from the mirror key — every key:fu_live_* whose JSON value contains "email": "spmosselaar@gmail.com" has that apiKey as its key-name suffix.
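The mirror-key recovery can be sketched as a scan over key/value pairs. This assumes the entries have already been exported into an in-memory `Map` (how they are fetched from Upstash is out of scope here), that the mirror value is a JSON object with a top-level `email` field, and that the apiKey is the key name with the `key:` prefix removed. `recoverApiKey` is a hypothetical helper name.

```typescript
// Scans mirror entries ("key:fu_live_..." -> JSON string) for the one whose
// value carries the given email, and returns the apiKey from the key name.
// Returns null when there is no match or when the match is ambiguous.
function recoverApiKey(mirror: Map<string, string>, email: string): string | null {
  const namePrefix = "key:";
  const matches: string[] = [];

  mirror.forEach((raw, name) => {
    if (!name.startsWith("key:fu_live_")) return;

    let value: unknown;
    try {
      value = JSON.parse(raw);
    } catch (_err) {
      return; // Skip values that are not valid JSON.
    }

    if (
      typeof value === "object" &&
      value !== null &&
      (value as { email?: unknown }).email === email
    ) {
      // Assumption: the apiKey is the key name without the "key:" prefix.
      matches.push(name.slice(namePrefix.length));
    }
  });

  // Refuse to guess if more than one mirror key claims the same email.
  return matches.length === 1 ? (matches[0] ?? null) : null;
}

// Usage with placeholder data.
const mirrorEntries = new Map<string, string>([
  ["key:fu_live_aaaa1111", JSON.stringify({ email: "other@example.com" })],
  ["key:fu_live_bbbb2222", JSON.stringify({ email: "customer@example.com" })],
  ["user:customer@example.com", JSON.stringify({ plan: "pro" })],
]);
const recovered = recoverApiKey(mirrorEntries, "customer@example.com");
```

Returning null on multiple matches is a deliberate choice: restoring the wrong key would be worse than leaving the field empty until a human checks.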

Target post-restore JSON (preserve every existing field byte-identical, only overwrite the three dropped ones):

{
  "email":            "spmosselaar@gmail.com",
  "plan":             "pro",
  "apiKey":           "fu_live_<recover from mirror key name>",
  "stripeCustomerId": "cus_UYLSdNQnCwt5Tf",
  "createdAt":        "2026-05-20T18:21:00.000Z"
}

createdAt is approximated from the Stripe customer's created timestamp; this field is display-only on /account and doesn't gate any logic, so a sensible approximation is fine.

The rule going forward (and PLEASE pass to any future Chrome dashboard-edit agent): start by copy-pasting the current JSON into a scratchpad VERBATIM, modify only the targeted fields, paste the full result back. The Upstash UX is a string replace, not a merge.
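One way to make the rule checkable before pasting back is to diff the top-level keys of the scratchpad copy against the edited value. This is a small sketch with a hypothetical helper name (`droppedFields`); it only compares top-level keys and assumes both values are JSON objects.

```typescript
// Returns the top-level fields present in the original JSON but absent from
// the edited JSON. An empty array means the edit dropped nothing.
function droppedFields(beforeJson: string, afterJson: string): string[] {
  const before = JSON.parse(beforeJson) as Record<string, unknown>;
  const after = JSON.parse(afterJson) as Record<string, unknown>;
  return Object.keys(before).filter((key) => !(key in after));
}

// Usage with placeholder values shaped like the incident.
const scratchpadCopy = JSON.stringify({
  email: "customer@example.com",
  plan: "free",
  apiKey: "fu_live_example0000",
  createdAt: "2026-01-01T00:00:00.000Z",
});
const badEdit = JSON.stringify({ plan: "pro", stripeCustomerId: "cus_example" });
const goodEdit = JSON.stringify({
  ...JSON.parse(scratchpadCopy),
  plan: "pro",
  stripeCustomerId: "cus_example",
});

const droppedByBadEdit = droppedFields(scratchpadCopy, badEdit);
const droppedByGoodEdit = droppedFields(scratchpadCopy, goodEdit);
```

For the incident-shaped edit the check reports `email`, `apiKey`, and `createdAt`; for the merged edit it reports nothing.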

2. Sentry alert rule (one-time, manual)

Sentry → Alerts → Create Alert Rule → Issue Alert:

  • When: event matches message:"malformed_user_record" AND level:warning
  • Frequency: any single occurrence
  • Action: email Soap

This is the early-warning for the next FAULT 17 incident. Without it the only signal was Sytze's email — the 3 Sentry events sat unread in the issue list for ~25 minutes.

3. PR #48 (FAULT 16 fix) interaction

PR #48 is still open and untouched in this fix. Both PRs touch STATE.md, CHANGELOG.md, lib/changelog-data.ts, and docs/FAULT-HISTORY-AND-PREVENTION.md. Whichever PR merges second will have a small mechanical conflict on those files (both PRs append; resolution is "keep both"). The PR #48 runbook docs/runbooks/customer-tier-sync.md should be amended post-merge with:

Dashboard-edit rule (FAULT 17): every JSON edit in the Upstash console MUST start by copy-pasting the current value VERBATIM and modifying only the targeted fields. Upstash's edit-JSON UX is a string replace, not a merge. Dropped fields will cause downstream TypeError on any code path that reads them. Recovery for a dropped apiKey: scan key:fu_live_* keys for the one whose value contains the user's email; that key's name suffix IS the apiKey value.

4. Smoke verification post-merge

  • curl https://www.freightutils.com/ → 200 (anonymous, already verified via Vercel proxy fetch)
  • After step 1 KV repair: curl https://www.freightutils.com/api/auth/whoami -H "X-API-Key: <Sytze's key>" → tier: "pro"
  • Before step 1 KV repair (i.e. this PR alone): /account no longer 500s — renders the "needs attention" banner with the support email; Sytze can verify his tier is Pro on the page.
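The first two checks above could be scripted along these lines. This is a sketch under assumptions: the `Fetcher` type and `smoke` function are hypothetical, the fetcher is injected so the example runs without network access, and the whoami response is assumed to be a JSON object with a top-level `tier` field.

```typescript
// Minimal response surface the checks need; the global fetch Response fits it.
interface SmokeResponse {
  status: number;
  json(): Promise<unknown>;
}

type Fetcher = (
  url: string,
  init?: { headers?: Record<string, string> },
) => Promise<SmokeResponse>;

// Runs the smoke checks and returns a list of failure descriptions.
// An empty list means every check passed.
async function smoke(
  fetcher: Fetcher,
  baseUrl: string,
  apiKey: string,
): Promise<string[]> {
  const failures: string[] = [];

  // Anonymous homepage must return 200.
  const home = await fetcher(`${baseUrl}/`);
  if (home.status !== 200) {
    failures.push(`GET / returned ${home.status}`);
  }

  // After the KV repair, the key must resolve to the pro tier.
  const who = await fetcher(`${baseUrl}/api/auth/whoami`, {
    headers: { "X-API-Key": apiKey },
  });
  if (who.status !== 200) {
    failures.push(`GET /api/auth/whoami returned ${who.status}`);
  } else {
    const body = (await who.json()) as { tier?: unknown };
    if (body.tier !== "pro") {
      failures.push(`whoami tier was ${String(body.tier)}`);
    }
  }

  return failures;
}

// A fake fetcher for local runs; swap in the global fetch for a real check.
function fakeFetcher(tier: string): Fetcher {
  return async (url) => ({
    status: 200,
    json: async () => (url.endsWith("/api/auth/whoami") ? { tier } : {}),
  });
}
```

Calling `smoke(fetch, "https://www.freightutils.com", key)` in a runtime with a global `fetch` would run the same checks against production; the /account banner check still needs a browser session.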

Customer follow-up email (draft — copy/paste)

Subject: Sorry — second issue on your end, now resolved

Hi Sytze,

Quick second follow-up. Earlier today's manual fix to your account accidentally dropped a couple of fields from the record when it was being saved, which is why you saw the "server-side exception" page when you visited the site. Diagnosed it inside the hour — your subscription was never at risk, only the website's account page was crashing on the read.

What I've done:

  • Code fix shipped — the account page and the identity API now handle the missing-field case gracefully instead of crashing. Same fix protects every other future paying customer.
  • Your record will be fully restored shortly — your API key, email, and Pro status all stay the same; just repopulating the metadata fields the dashboard edit had dropped.
  • A second month on the house — that's two months free on the £19/mo Pro plan as apology. Your next two invoices will be £0.
  • Better alerting — this used to surface to me only when a customer emailed; from today it pages me the moment the same shape of problem happens to anyone.

Try the site in 5 minutes — should be solid. If anything still looks off please reply directly and I'll be on it within the hour.

Sincerely sorry for the back-to-back rough start. Thanks for your patience getting us through these two — both of these are now permanently solved, not just for you but for everyone.

Marius
founder, FreightUtils

FAULT 5 checklist

Most items N/A — no new pages, no new endpoints, no new MCP tools, no displayed-number changes. Items that apply:

  • CHANGELOG.md entry added — 2026-05-21
  • /changelog page (lib/changelog-data.ts) renders the new entry
  • STATE.md updated (Customer incidents section + lastUpdated bumped)
  • Incident doc landed first (separate commit 0a945b7)
  • npm run build passes with zero errors
  • FAULT 17 encoded in docs/FAULT-HISTORY-AND-PREVENTION.md
  • [N/A] siteStats.ts, app/sitemap.ts, public/openapi.json, /api-docs page, nav dropdown, homepage tool grid, MCP registration, footer links, freightutils-mcp README, npm bump, Postman, tool-page word count, withAuditRest (no new routes — /api/auth/me already excluded), generateMetadata (no new pages), indexnow-submit (no new URLs)

Test plan

  • npx tsc --noEmit clean.
  • npm run lint:audit passes.
  • npm run lint:seo-titles passes.
  • npm run build succeeds.
  • Phase 1 incident doc committed BEFORE the code fix (commit order: 0a945b7 then 58553d0).
  • No production rollback (deploy is innocent — verified via Vercel target: "production" lookup; release SHA unchanged through incident).
  • Soap: merge → preview-deploy smoke (/account with malformed-record fixture, then with valid record).
  • Soap: Sytze KV record restoration via Chrome (Block 1 in this PR body).
  • Soap: /api/auth/whoami returns tier: "pro" for Sytze post-restoration.
  • Soap: Sentry alert rule on malformed_user_record created.
  • Soap: PR #48 "fix(stripe): pro-tier sync failure + reconciliation backstop (FAULT 16)" runbook (docs/runbooks/customer-tier-sync.md) amended with the dashboard-edit-rule paragraph.

https://claude.ai/code/session_019A4f9SxA6vyzdoC67JLmTZ


Generated by Claude Code

claude added 2 commits May 21, 2026 09:44
… 1 diagnosis

The Chrome dashboard edit at ~08:30 UTC overwrote user:spmosselaar@gmail.com
with a partial JSON value missing the apiKey, email, and createdAt fields.
Both /api/auth/me (line 51) and /account/page.tsx (line 291) do
user.apiKey.slice(-4) → TypeError → 500 → "server-side exception" page.

Confirmed Sytze-specific: production deploy SHA matches main HEAD from
2026-05-17 (no new deploy), only 3 Sentry events (all from one user/geo),
anonymous traffic clean. No rollback needed — deploy is innocent.

Phase 3 fix follows: defensive guards in /api/auth/me, /account, and
getUser shape validation, plus Sentry capture on malformed records so
the next dashboard edit doesn't silently break a paying customer.

Hotfix for the prod 500 on /account + /api/auth/me caused by the
~08:30 UTC Chrome dashboard edit (FAULT 16 manual recovery) which
replaced the JSON at user:spmosselaar@gmail.com instead of merging
into it. Post-edit the record was missing the apiKey, email, and
createdAt fields. Both /api/auth/me:51 and /account/page.tsx:291 did
user.apiKey.slice(-4) and threw TypeError — Next.js rendered the
generic server-side-exception page.

- /api/auth/me/route.ts: key_last4 nullish on missing apiKey; email
  falls back to the session email; Sentry.captureMessage on a missing
  apiKey ("malformed_user_record" warning level).
- /account/page.tsx: API key chip renders "not available — contact
  support" when apiKey is missing; "needs attention" banner with a
  mailto:contact link explains the customer's subscription is intact;
  Sentry.captureMessage on either missing field.
- usage lookups gated on apiKey presence so the page renders even
  when the record is partially malformed.

FAULT 17 added to FAULT-HISTORY-AND-PREVENTION.md. STATE.md +
CHANGELOG.md + lib/changelog-data.ts mirror the incident per the
FAULT 5 release-hygiene rule. Production release SHA was unchanged
through the incident (e9749a8 from 2026-05-17) — this is a data-shape
fix, not a code-regression rollback.
@vercel

vercel Bot commented May 21, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| freighttools | Ready | Preview, Comment | May 21, 2026 9:52am |


@SoapyRED SoapyRED marked this pull request as ready for review May 21, 2026 10:08
@SoapyRED SoapyRED merged commit 00a2f5a into main May 21, 2026
2 checks passed