Feature · 5:57 AM · ZRosserMcIntosh
add database health check + graceful fallback for connection pool errors
Update · 5:51 AM · ZRosserMcIntosh
connection pool exhaustion in new arrivals cache + add twitter CSP domains
Update · 5:47 AM · ZRosserMcIntosh
productview entrypage and exitpage schema type - string not boolean
Docs · 5:35 AM · ZRosserMcIntosh
update nivoda schema fix with final enum and structure corrections
Update · 5:33 AM · ZRosserMcIntosh
nivoda diamonditem schema - correct nested structure and enum handling
Docs · 5:20 AM · ZRosserMcIntosh
add nivoda graphql schema fix documentation
Update · 5:19 AM · ZRosserMcIntosh
nivoda diamond graphql schema mismatch - update field names and structure
Docs · 5:10 AM · ZRosserMcIntosh
add incident report for P1 infrastructure failure (May 11 05:00 UTC)
Update · 5:09 AM · ZRosserMcIntosh
brevo product URL mapping + connection pool cleanup
- Products were syncing with invalid URLs to Brevo
- Caused 110+ product sync failures with '400 Invalid product URL'
- Root cause: URL path mismatch (the synced path didn't exist in the app)
- Fix: Use correct route /[locale]/products/[slug]
- Prevents database connection pool exhaustion
- Caused P1001 errors on /admin/support page load
- Fix: Call db.$disconnect() in finally block
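A minimal sketch of the "disconnect in a finally block" fix described above. The `DisconnectableClient` interface and the `withDisconnect` helper are hypothetical names introduced for illustration; the only thing assumed about the real client is that it exposes `$disconnect()`, as the commit message states.

```typescript
// Hypothetical minimal shape: only the $disconnect() method is assumed.
interface DisconnectableClient {
  $disconnect(): Promise<void>;
}

// Runs `work` and always releases the client's connections afterwards,
// whether `work` resolved or threw.
async function withDisconnect<C extends DisconnectableClient, T>(
  db: C,
  work: (db: C) => Promise<T>,
): Promise<T> {
  try {
    return await work(db);
  } finally {
    // Runs on both the success and the error path.
    await db.$disconnect();
  }
}
```

Whether disconnecting per request is appropriate depends on how the client is shared; for a long-lived shared client this pattern may need adjusting.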
5:05 AM · ZRosserMcIntosh
meetings admin page pagination + search (load 10 initial, add load more button)
Update · 4:56 AM · ZRosserMcIntosh
Standardize all cron auth patterns and add comprehensive health monitoring
- x-vercel-cron: 1 header (Vercel managed crons)
- ?secret=${CRON_SECRET} query param (manual triggers)
- x-cron-secret header (custom integrations)
- Added cron health tracking: monitors recent errors, status, schedule
- Enhanced database health: detects P2024 connection pool exhaustion
- Added service tracking: Brevo, LiveKit in addition to existing services
- Health endpoint now shows:
- Overall system status
- All external services status
- All cron jobs recent error count
- Database connection pool status
- Last check timestamps
- Track cron errors from up to the 5 most recent minutes
- Distinguish between temporary (recovering) and persistent (failing) issues
- Pool exhaustion detection (P2024 errors flagged)
- Per-cron error counts help identify problem crons
- Email scheduled 401 errors (auth logic)
- Openclaw reclaim 200 with error logs (auth + error handling)
- Email snooze 401 errors (auth logic)
- No visibility into cron failures (now tracked in health endpoint)
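The three cron auth patterns listed in this commit could be checked by one shared helper, sketched below. `CronRequestLike` and `isAuthorizedCron` are hypothetical names; the header and query parameter names come from the commit message. Whether the `x-vercel-cron` header can be trusted on its own depends on the deployment, so treat that branch as an assumption.

```typescript
// Hypothetical, framework-free view of an incoming request.
interface CronRequestLike {
  headers: Record<string, string | undefined>; // header names lower-cased
  query: Record<string, string | undefined>;
}

function isAuthorizedCron(
  req: CronRequestLike,
  cronSecret: string | undefined,
): boolean {
  // Pattern 1: Vercel-managed crons send `x-vercel-cron: 1`.
  if (req.headers["x-vercel-cron"] === "1") return true;

  // Fail closed when no secret is configured.
  if (!cronSecret) return false;

  // Pattern 2: manual trigger via ?secret=...
  // Pattern 3: custom integration via the x-cron-secret header.
  return (
    req.query["secret"] === cronSecret ||
    req.headers["x-cron-secret"] === cronSecret
  );
}
```

Routing every cron through one function like this keeps the 401 behavior consistent across routes.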
Update · 4:46 AM · ZRosserMcIntosh
Add Prisma connection pool retry middleware and support escalation cron hardening
- Add Prisma middleware to automatically retry queries on P2024 (connection pool timeout)
- Retry logic: exponential backoff 50ms, 100ms, 200ms (3 attempts)
- Handles Supabase pooler connection limit (10 connections) with 20s timeout
- Prevents cascading failures when pool is temporarily exhausted
- Product page P2024 timeouts (connection pool exhaustion)
- Support escalation 401 errors (auth logic)
- Support escalation 500 errors (database connectivity)
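The retry behavior described above can be sketched independently of Prisma as a plain wrapper. `withPoolRetry` and `isPoolTimeout` are hypothetical names, and the reading of "3 attempts" as three retries after the initial call (waiting 50ms, 100ms, then 200ms) is an assumption; the actual middleware may count differently.

```typescript
// Backoff delays from the commit message.
const BACKOFF_MS = [50, 100, 200];

// P2024 is the error code the commit message associates with pool timeouts.
function isPoolTimeout(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { code?: unknown }).code === "P2024"
  );
}

async function withPoolRetry<T>(
  query: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await query();
    } catch (err) {
      const delay = BACKOFF_MS[attempt];
      // Give up on any other error, or once the backoff schedule is used up.
      if (!isPoolTimeout(err) || delay === undefined) throw err;
      await sleep(delay);
    }
  }
}
```

Only pool timeouts are retried; every other error surfaces immediately so genuine failures are not delayed.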
Update · 4:26 AM · ZRosserMcIntosh
Resolve cron 500 errors and database connectivity issues
- 204 No Content means empty body per HTTP spec
- Response.json() always attaches a body, which violates the spec and causes an 'Invalid response status code' error
- Must use: new Response(null, { status: 204 })
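A small sketch of the 204 fix, using the standard `Response` constructor named in the commit message. The `noContent` helper name is hypothetical.

```typescript
// Returns an empty 204 response. A 204 must not carry a body,
// so the body argument is null.
function noContent(): Response {
  return new Response(null, { status: 204 });
}
```

Centralizing this in one helper makes it harder to reintroduce a body-carrying 204 in individual cron routes.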
4:23 AM · ZRosserMcIntosh
Fix cron 500 errors, meeting grace period bug, and guest invite slowness
Update · 4:04 AM · ZRosserMcIntosh
critical cron routes throwing 500 errors instead of 204
- Added missing `maxDuration = 20` export
- Removed broken `requireKaturaAdmin()` call (was causing 401 → 500)
- Fixed error handler to return 204 instead of 500 (prevents retry cascade)
- Kept kill switch and Brevo timeout protections
- Status: Still running every minute as expected, now returns 204 on errors
- Added missing `maxDuration = 30` export (Vercel free tier: 60s limit)
- Added kill switch: FEATURE_CRON_CONSULTATION_REMINDERS (default: false)
- Added Brevo timeout wrapper (3s Promise.race)
- Fixed error handler to return 204 instead of cascading
- Per-tick cap already in place (50 reminders max)
- Fixed variable shadowing bug (livekitTimeout Promise shadowing counter)
- Renamed Promise to livekitTimeoutPromise for clarity
- Fixed error handler to return 204 instead of 500
- Have explicit `maxDuration` configured
- Return 204 on errors (prevents Vercel retry cascade)
- Are killed if they time out (Promise.race wrappers)
- Have kill switches to disable safely (except zombie sweep, which is already hardened)
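The Promise.race timeout wrappers mentioned in this commit could look roughly like the sketch below. `withTimeout` is a hypothetical helper name. The timeout promise gets its own distinct name (`timeoutPromise`), which is the kind of naming that avoids the shadowing bug noted above.

```typescript
// Rejects with an error named "TimeoutError" if `work` does not settle
// within `ms` milliseconds. The underlying work is not cancelled; it is
// only abandoned by the caller.
async function withTimeout<T>(
  work: Promise<T>,
  ms: number,
  label: string,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeoutPromise = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      const err = new Error(`${label} timed out after ${ms}ms`);
      err.name = "TimeoutError";
      reject(err);
    }, ms);
  });
  try {
    return await Promise.race([work, timeoutPromise]);
  } finally {
    // Clear the timer so a fast result does not leave a pending timeout.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

A call such as `withTimeout(sendReminder(), 3000, "brevo")` would then fail fast after 3 seconds; `sendReminder` is a placeholder for whatever external call is being wrapped.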
Docs · 3:39 AM · ZRosserMcIntosh
add comprehensive infrastructure hardening sprint summary
- Kill switch (FEATURE_* env, default: false)
- Hard timeout (Vercel maxDuration explicit)
- API/DB timeout (Promise.race, 2-3 seconds)
- Per-tick cap (LIMIT/take clauses)
- State logging (duration_ms + error context)
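A sketch of how three of the layers above (kill switch first, per-tick cap, state logging with duration_ms) might fit together in one cron tick. All names here (`runCronTick`, `TickOptions`, the `"true"` flag value) are hypothetical, and returning 200 on success is an assumption; the source only specifies 204 for the disabled and error paths. Timeouts are left out to keep the sketch short.

```typescript
interface TickOptions<T> {
  env: Record<string, string | undefined>;
  flag: string; // name of a FEATURE_* variable
  cap: number; // per-tick cap
  fetchBatch: (limit: number) => Promise<T[]>;
  handle: (item: T) => Promise<void>;
  log: (entry: Record<string, unknown>) => void;
}

interface TickResult {
  status: number;
  processed: number;
  duration_ms: number;
}

async function runCronTick<T>(opts: TickOptions<T>): Promise<TickResult> {
  const started = Date.now();
  const done = (
    status: number,
    processed: number,
    extra: Record<string, unknown> = {},
  ): TickResult => {
    const duration_ms = Date.now() - started;
    opts.log({ status, processed, duration_ms, ...extra }); // state logging
    return { status, processed, duration_ms };
  };

  // Kill switch first: disabled unless the flag is explicitly "true",
  // checked before any database or API work.
  if (opts.env[opts.flag] !== "true") {
    return done(204, 0, { skipped: "disabled" });
  }

  let processed = 0;
  try {
    const batch = await opts.fetchBatch(opts.cap);
    // Enforce the cap again in case the source returns more than asked.
    for (const item of batch.slice(0, opts.cap)) {
      await opts.handle(item);
      processed += 1;
    }
    return done(200, processed);
  } catch (err) {
    // Return 204 on errors so the scheduler does not start a retry cascade.
    const error = err instanceof Error ? err.message : String(err);
    return done(204, processed, { error });
  }
}
```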
Update · 3:39 AM · ZRosserMcIntosh
harden /api/cron/meetings-zombie-sweep with kill switch + timeouts
- Raw SQL query has no timeout
- deleteRoom() call to LiveKit has no timeout
- If either is slow, the entire cron hangs for up to 20s
- Disabled by default
- Fast 204 return if disabled
- Limit 100 results per query
- Fail fast on slow database
- Each meeting update wrapped
- deleteRoom() fails fast instead of hanging
- Track timeout count separately
- All responses include duration_ms
- Timeout errors logged with context
- Tighter constraint
Update · 3:38 AM · ZRosserMcIntosh
harden /api/cron/blog-publish with kill switch + DB timeout
- Prisma queries (findMany, updateMany) have no timeout
- If DB is slow or connection pool exhausted, entire cron hangs
- Hits Vercel maxDuration: 20s limit every time
- Creates retry storm on cascading infrastructure
- Disabled by default
- Enable via env var
- Fast 204 return if disabled
- Both findMany and updateMany wrapped
- Fail fast instead of hanging
- Log timeout errors with context
- Limit work per invocation
- Prevents unbounded queries
- Monitor performance trends
- Detect slowdowns early
- Actionable error messages
- Timeout vs. DB error distinction
- Kill switch first ✓
- Hard timeout (Vercel maxDuration: 20) ✓
- DB timeout (2s Promise.race) ✓
- Per-tick cap (LIMIT 50) ✓
- State logging (start/duration/errors) ✓
Docs · 3:37 AM · ZRosserMcIntosh
update crash prevention framework with recent fixes
Update · 3:36 AM · ZRosserMcIntosh
harden /api/admin/email/scheduled with kill switch + timeout
- Cron runs every minute (1,440x/day) sending emails via Brevo
- Brevo API calls have no timeout
- If Brevo is slow, entire cron hangs for full 20-second maxDuration
- Results in 504 errors every minute, Vercel retry storms
- Cron disabled by default
- Enable deliberately via env var
- Fast 204 return if disabled
- All Brevo API calls wrapped
- Fail fast instead of hanging
- Logged timeout errors with context
- Max 10 emails processed per invocation
- Log start, duration, skip reasons
- Distinguish between timeout vs. API error
- Don't lose failure information
- Kill switch first ✓
- Hard timeout (implicit via Vercel maxDuration: 20) ✓
- External API timeout (3s Promise.race) ✓
- Per-tick cap (LIMIT 10) ✓
- State logging (start/duration/errors) ✓
Docs · 3:33 AM · ZRosserMcIntosh
document Next.js routing conflict root cause & fix
Update · 3:33 AM · ZRosserMcIntosh
resolve critical Next.js routing conflict in /admin/invoices
- [id] directory
- [orderId] directory
- Moved /api/admin/invoices/[orderId]/ to /api/admin/invoices/[id]/actions/
- Updated route.ts to use 'id' parameter (destructured as orderId internally)
- Updated documentation comments to reflect new URL pattern
Docs · 3:16 AM · ZRosserMcIntosh
add crash prevention framework overview
- Why each document exists
- How to use them (3 scenarios: new cron, investigation, cost reduction)
- The 5 blast-door rules (kill switch, schema drift, no minutely fan-out, fast-fail, state logging)
- Implementation timeline (4 phases over 2 months)
- The bottom line: optimize for crash-proofing, not cost savings
Docs · 3:15 AM · ZRosserMcIntosh
transform cron analysis into hardened policy with teeth
- Kill switch first (before Prisma/APIs)
- Hard timeout on all routes
- External API timeouts (Promise.race)
- Idempotent operations
- Per-tick caps (LIMIT/MAX)
- Claim/lease pattern before expensive work
- New tables require migration gates
- Default-disabled for new crons
- No minutely AI workflows
- Fast-fail on missing dependencies
- State transition logging
- Never be the first path to discover schema drift
- Level 0: Static/no dependency (5–10 min OK)
- Level 1: DB read-only (5–15 min OK)
- Level 2: DB writes/notifications (5–15 min OK)
- Level 3: External APIs (10–30 min, business-critical allows more)
- Level 4: AI/agents/automation (NEVER minutely without queue + cap + meter)
- Migration exists
- Applied to production
- Verified in prod
- Feature flag disabled until verified
- Handles missing table gracefully
- Rollback path documented
- The hard rule: 'A cron route must never be the first production code path to discover a missing table.'
- Marked as 'NEEDS IMMEDIATE REDESIGN' (not just frequency reduction)
- Architectural issue: admin routes shouldn't contain cron logic
- Action plan: rename to /api/cron/email-scheduled, add feature flag, per-tick cap, timeout, deduplication, claim pattern, retry counter
- Corrected 'Consider reducing to hourly if queue backed up' to 'Consider INCREASING to hourly only if backed up AND response < 10s'
- Added clarification: 'Increasing frequency means MORE load when backed up'
Docs · 2:59 AM · ZRosserMcIntosh
add root cause + cron analysis to postmortem
- /api/admin/email/scheduled: EVERY MINUTE (1,440x/day) — too aggressive
- /api/cron/meetings-zombie-sweep: */5 (288x/day) — LiveKit dependency
- /api/cron/calendar-reminders: */10 (144x/day) — calendar API dependency
- /api/admin/email/snooze: */5 (288x/day) — unnecessary frequency
- /api/cron/blog-publish: */5 (288x/day) — not user-facing
- /api/health?ping=true: */5 (288x/day) — overkill for health check
- Reduce email/scheduled from minutely to */5 or move to event-driven
- Reduce 5-minute jobs to */10 (except critical paths)
- Add 2–5 second timeouts to all crons to prevent hangs
- Add metrics + logging to detect issues early
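The per-day counts quoted above follow directly from the schedule interval, as this small worked example shows. `invocationsPerDay` is a hypothetical helper name.

```typescript
// Number of invocations per day for a cron that fires every N minutes.
function invocationsPerDay(everyMinutes: number): number {
  if (!Number.isInteger(everyMinutes) || everyMinutes <= 0) {
    throw new RangeError("everyMinutes must be a positive integer");
  }
  return (24 * 60) / everyMinutes;
}

// every minute -> 1440, */5 -> 288, */10 -> 144
const perDay = [1, 5, 10].map(invocationsPerDay);
```

Moving a job from every minute to */5 therefore cuts its invocations by a factor of five, from 1,440 to 288 per day.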
heartbeats · 2:50 AM · ZRosserMcIntosh
completely delete yen-heartbeats cron
- Deleted src/app/api/cron/yen-heartbeats/route.ts
- Removed cron entry from vercel.json
- Updated postmortem with complete removal notes
- Queue-based dequeue model (one heartbeat per minute)
- Atomic state transitions (claim before work)
- Per-heartbeat timeout (10s max)
- Single-heartbeat cap (MAX_FIRES_PER_TICK = 1)
- Comprehensive logging + metrics
- Chaos test suite
Update · heartbeats · 2:31 AM · ZRosserMcIntosh
disable cron + postmortem
- Unbounded fan-out: no cap on heartbeats per tick
- Missing overlap guard: checked nonexistent updatedAt field
- State written too late: lastFiredAt updated after completion
- Each failure created duplicate invocation in next minute
- 1.5K invocations in 2 hours, 46s P75, 100% error rate
- Cascaded to notifications (79% error), admin pollers (99% error)
- Peak burn: ~/day run rate on Fluid Compute
- Redesign as dequeue (process one heartbeat per minute)
- Add application-level logging + metrics
- Write chaos tests (Claude API timeouts, etc.)
- Soft launch at 5-minute intervals
- Monitor 24h before re-enabling at full schedule
Update · yen-heartbeats · 2:26 AM · ZRosserMcIntosh
correct overlap guard + tighten maxDuration
- Two bugs in the previous patch:
- 1. Overlap guard used hb.updatedAt, which does not exist on the YenHeartbeat schema. The lease check was always falsy, meaning RUNNING state never actually blocked a concurrent cron tick. Fixed to use lastFiredAt (which IS written atomically before runAgent starts) as the lease timestamp.
- 2. maxDuration=15 was lower than DB_TIMEOUT_MS(8s) + AGENT_TIMEOUT_MS(10s). Raised to 20s with a comment explaining the budget arithmetic. This prevents Vercel from killing the function before our own timeouts fire, which would cause a TCP reset (NetworkError) instead of a clean JSON error.
- Control plane summary after both patches:
- maxDuration: 20s (was 60s)
- DB query timeout: 8s (fail-fast on Supabase slowness)
- Agent timeout: 10s (was 30s)
- Max heartbeats/tick: 1 (was unlimited)
- Claim written: BEFORE runAgent (atomic overlap guard)
- RUNNING lease: lastFiredAt-based (correct field)
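The budget arithmetic and the lastFiredAt-based lease can be illustrated with the in-memory sketch below. The timeout values come from the commit message; `LEASE_MS`, the `"IDLE"` status, the `Heartbeat` shape, and the helper names are hypothetical. A real implementation would need the claim to be a single conditional database write to be atomic, which this sketch does not attempt.

```typescript
const DB_TIMEOUT_MS = 8000;
const AGENT_TIMEOUT_MS = 10000;
const LEASE_MS = 20000; // hypothetical lease length

// True when the function's own timeouts all fire before the platform limit.
function budgetFits(maxDurationS: number, ...timeoutsMs: number[]): boolean {
  const total = timeoutsMs.reduce((sum, ms) => sum + ms, 0);
  return total < maxDurationS * 1000;
}

interface Heartbeat {
  id: string;
  status: "IDLE" | "RUNNING";
  lastFiredAt: Date | null;
}

// A RUNNING heartbeat blocks other ticks until its lease expires.
function isLeaseHeld(hb: Heartbeat, now: Date, leaseMs: number): boolean {
  return (
    hb.status === "RUNNING" &&
    hb.lastFiredAt !== null &&
    now.getTime() - hb.lastFiredAt.getTime() < leaseMs
  );
}

// Claim BEFORE doing the work: returns the claimed record, or null if
// another tick still holds the lease.
function tryClaim(hb: Heartbeat, now: Date, leaseMs: number): Heartbeat | null {
  if (isLeaseHeld(hb, now, leaseMs)) return null;
  return { ...hb, status: "RUNNING", lastFiredAt: now };
}
```

With these numbers, `budgetFits(15, DB_TIMEOUT_MS, AGENT_TIMEOUT_MS)` is false (18s of timeouts against a 15s limit) while `budgetFits(20, ...)` is true, which matches the reason given for raising maxDuration.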
Update · yen-heartbeats · 2:21 AM · ZRosserMcIntosh
add timeout + early-exit for 20x Fluid Compute spike
- 1.5K yen-heartbeat invocations × 46s CPU = massive Fluid Compute bill
- Each failure slowed down downstream pollers (notifications, etc.)
- Admin users saw 79% error rate on /api/admin/notifications
- Retry storms amplified the load
Update · 2:11 AM · ZRosserMcIntosh
production hardening from Opus session
- Polled intervals: reduce admin page load from 10 req/min to 4 req/min (header-clock-in, notification-bell, parked-transactions, meetings list)
- LiveKit timeout: 7-second race on ensureRoom() to prevent function hangs eating /day in Fluid Compute costs
- Meetings UX: same-tab navigation (remove dead about:blank), persistent error toasts, support modal fixes
- Notifications: graceful schema drift handling (500→200 empty list)
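One way the "500 → 200 empty list" behavior for schema drift might be structured is sketched below. The commit does not say how drift is detected; treating Prisma error codes P2021 (table does not exist) and P2022 (column does not exist) as drift is an assumption, and `listOrEmpty` and `isSchemaDrift` are hypothetical names.

```typescript
// Assumed drift indicators: missing table (P2021) or missing column (P2022).
const SCHEMA_DRIFT_CODES = new Set(["P2021", "P2022"]);

function isSchemaDrift(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const code = (err as { code?: unknown }).code;
  return typeof code === "string" && SCHEMA_DRIFT_CODES.has(code);
}

// Returns an empty list on schema drift so the caller can still answer 200;
// any other error is rethrown unchanged.
async function listOrEmpty<T>(
  load: () => Promise<T[]>,
  warn: (message: string) => void,
): Promise<T[]> {
  try {
    return await load();
  } catch (err) {
    if (!isSchemaDrift(err)) throw err;
    warn("schema drift detected, returning empty list");
    return [];
  }
}
```

Logging the drift is important here, since swallowing the error otherwise hides a missing migration.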
Update · costs+meetings · 1:41 AM · ZRosserMcIntosh
slash Fluid Compute spend + LiveKit timeout + UX fixes
- header-clock-in: poll 10s → 60s. Clock-status hitting every 10s on every admin page was the biggest contributor — kept function instances perpetually warm under Fluid Compute billing.
- notification-bell: 30s → 60s (+ stops after 5 consecutive 5xx failures)
- parked-transactions: 30s → 60s
- meetings list: 30s → 60s
- Combined: drops persistent poll load from ~10 req/min/tab to ~4 req/min/tab.
- livekit.ts: Promise.race with 7-second timeout on ensureRoom(). Without this, a slow or misconfigured LiveKit server caused Vercel functions to hang for the full platform maxDuration (60s) before being killed with a TCP reset. The browser then showed a NetworkError with no diagnostic info, and each hung invocation ate ~60 GB-seconds of Fluid Compute memory.
- create + join routes: maxDuration=15 so any surviving hangs get a clean 504 JSON error instead of a TCP reset.
- Same-tab navigation (no more dead about:blank tab)
- Persistent error toasts (Infinity duration, stable id, stopPropagation)
- Support ticket modal: createPortal to body + re-asserts pointer-events (Radix Dialog sets pointer-events:none on body, was blocking the panel)
- 500 on schema drift now returns 200 + empty list so the bell renders
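The notification-bell rule above (stop polling after 5 consecutive 5xx failures) can be sketched as a small counter. `PollGate` and its method names are hypothetical, and resetting the counter on any non-5xx response is an assumption about what "consecutive" means here.

```typescript
const MAX_CONSECUTIVE_5XX = 5;

// Tracks consecutive 5xx responses and tells the poller when to stop.
class PollGate {
  private consecutive5xx = 0;

  // Call with the HTTP status of each poll response.
  record(status: number): void {
    if (status >= 500 && status <= 599) {
      this.consecutive5xx += 1;
    } else {
      this.consecutive5xx = 0; // any non-5xx response resets the streak
    }
  }

  shouldContinue(): boolean {
    return this.consecutive5xx < MAX_CONSECUTIVE_5XX;
  }
}
```

A poller would call `record()` after each response and skip scheduling the next poll once `shouldContinue()` returns false.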