Ops Runbook for Always-On AI Viral Labs
Fun labs still need serious operations. This runbook explains how we keep credits, models, ads, and support humming even when traffic spikes or a provider blinks.
Tiered incident levels
We label incidents P1 through P3. P1 means users cannot run labs (model outage, corrupt credit ledger). P2 means degraded performance (slow responses, partial credit spends). P3 covers cosmetic issues (typos, share-copy mismatches). Each tier has an owner and an SLA: P1 fixes land within 30 minutes, P2 within 4 hours, P3 within the next deployment window.
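For tooling that has to reason about tiers (paging rules, report templates), the same table can be captured as a small config. The sketch below assumes a Python codebase; the module shape and field names are illustrative, and only the tier names and SLAs come from the text above.

    # incident_tiers.py -- hypothetical sketch; tier definitions mirror the runbook table.
    from dataclasses import dataclass
    from datetime import timedelta

    @dataclass(frozen=True)
    class IncidentTier:
        name: str
        description: str
        sla: timedelta | None  # None means "next deployment window"

    TIERS = {
        "P1": IncidentTier("P1", "users cannot run labs", timedelta(minutes=30)),
        "P2": IncidentTier("P2", "degraded performance", timedelta(hours=4)),
        "P3": IncidentTier("P3", "cosmetic issues", None),
    }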
Provider rotation plan
Our default stack calls OpenAI. If latency climbs or the error rate exceeds 5%, the service fails over to DeepSeek. Gemini and Claude sit as tertiary options that we can manually activate from the admin console. Switching providers also swaps prompt templates, because each model responds best to different instruction lengths. The runbook includes ready-to-use prompt variations so we can flip within 2 minutes.
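A minimal sketch of the automatic failover check, assuming Python and a per-provider stats window: the 5% error-rate trigger and the rotation order come from the text above, while the latency ceiling, the ProviderStats shape, and the pick_provider helper are assumptions for illustration.

    # provider_failover.py -- illustrative only; thresholds follow the runbook text.
    from dataclasses import dataclass

    ERROR_RATE_THRESHOLD = 0.05    # automatic failover trigger from the runbook
    LATENCY_THRESHOLD_MS = 8_000   # assumed latency ceiling; tune per lab

    @dataclass
    class ProviderStats:
        name: str
        error_rate: float      # errors / requests over the last window
        p95_latency_ms: float

    # Order encodes the rotation: OpenAI default, DeepSeek automatic fallback,
    # Gemini and Claude as manually activated tertiary options.
    ROTATION = ["openai", "deepseek", "gemini", "claude"]

    def pick_provider(stats: dict[str, ProviderStats], manual_override: str | None = None) -> str:
        if manual_override:          # Gemini/Claude are flipped on from the admin console
            return manual_override
        for name in ROTATION[:2]:    # only OpenAI -> DeepSeek switches automatically
            s = stats.get(name)
            if s and s.error_rate <= ERROR_RATE_THRESHOLD and s.p95_latency_ms <= LATENCY_THRESHOLD_MS:
                return name
        return ROTATION[1]           # both unhealthy: stay on the fallback provider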
Credit ledger integrity
Credits live in SQLite for now, with one row per client ID. Every API request that burns or grants credits writes an append-only event to a log table. Nightly jobs reconcile totals and raise an alert if the ledger and the event table differ by more than 0.5%. If corruption appears, we freeze new burns, rehydrate balances from the event log, and post a banner explaining the pause.
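A reconciliation job along these lines can run nightly with the standard-library sqlite3 module. The 0.5% drift threshold is the one above; the balances and credit_events table names and columns are assumed for the sketch, not our actual schema.

    # reconcile_credits.py -- hypothetical schema; the 0.5% threshold comes from the runbook.
    import sqlite3

    DRIFT_THRESHOLD = 0.005  # alert when ledger and event log disagree by more than 0.5%

    def reconcile(db_path: str) -> list[str]:
        conn = sqlite3.connect(db_path)
        rows = conn.execute(
            """
            SELECT b.client_id, b.balance, COALESCE(SUM(e.delta), 0) AS replayed
            FROM balances b
            LEFT JOIN credit_events e ON e.client_id = b.client_id
            GROUP BY b.client_id
            """
        ).fetchall()
        conn.close()

        drifted = []
        for client_id, balance, replayed in rows:
            baseline = max(abs(replayed), 1)  # avoid dividing by zero on empty accounts
            if abs(balance - replayed) / baseline > DRIFT_THRESHOLD:
                drifted.append(client_id)
        return drifted  # non-empty list -> freeze burns and rehydrate from the event log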
Ad and fallback behavior
Ads fund the lab, but they also break. When the ad server fails, we swap in a viral orb animation that keeps the space alive and displays a friendly CTA ("Want to sponsor? Email us"). The runbook documents how to clear ad caches, rotate creative, and confirm that tracking pixels resume sending data once the incident clears.
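Server-side, the fallback can be as simple as catching the ad-server failure and returning the orb payload. The endpoint URL and payload fields below are placeholders that show the shape, not our real ad integration.

    # ad_fallback.py -- placeholder endpoint and payload; the real integration differs.
    import json
    import urllib.error
    import urllib.request

    AD_SERVER_URL = "https://ads.example.com/slot/lab-sidebar"  # hypothetical slot
    FALLBACK = {"type": "orb_animation", "cta": "Want to sponsor? Email us"}

    def fetch_ad(timeout: float = 2.0) -> dict:
        try:
            with urllib.request.urlopen(AD_SERVER_URL, timeout=timeout) as resp:
                return json.load(resp)
        except (urllib.error.URLError, TimeoutError, json.JSONDecodeError):
            return FALLBACK  # keep the slot alive with the orb animation + sponsor CTA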
Support macros and escalation
Support uses canned responses stored in the support hub. Each macro references the incident number, impact, and current ETA. If a player escalates via social DM, on-call ops responds in-thread and links to the macro; this keeps the narrative consistent. Chargeback threats or harassment escalate to founders immediately.
Observability stack
We monitor three dashboards: Credits, Compute, and Commerce. Credits shows issuance vs burn per hour, with alerts on anomalies. Compute tracks request latency per provider plus error breakdown by model. Commerce shows Stripe Checkout sessions, conversion, and refund volume. PagerDuty ties into these dashboards so the right person gets paged based on metric thresholds.
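The threshold-to-pager mapping can live in a small routing table like the sketch below; the metric names, threshold values, and PagerDuty service names are assumptions that show the shape, not our live configuration.

    # alert_routing.py -- illustrative thresholds; dashboards mirror Credits / Compute / Commerce.
    ALERT_RULES = [
        # (dashboard, metric, threshold, comparison, PagerDuty service)
        ("credits",  "burn_minus_issuance_per_hour", 10_000, ">", "oncall-credits"),
        ("compute",  "provider_error_rate",          0.05,   ">", "oncall-compute"),
        ("compute",  "p95_latency_ms",               8_000,  ">", "oncall-compute"),
        ("commerce", "refund_volume_per_hour",       500,    ">", "oncall-commerce"),
    ]

    def route(dashboard: str, metric: str, value: float) -> str | None:
        """Return the PagerDuty service to page, or None if the metric is healthy."""
        for dash, name, threshold, cmp, service in ALERT_RULES:
            if dash == dashboard and name == metric:
                breached = value > threshold if cmp == ">" else value < threshold
                return service if breached else None
        return None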
Chaos drills
Once a month we run a chaos drill: intentionally fail one dependency (e.g., block OpenAI) and verify that fallbacks kick in. We log the findings in the runbook and update automation scripts. Chaos drills also keep the team familiar with manual overrides, so nobody fumbles when a real outage hits.
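A drill can be scripted as a tiny harness that simulates the blocked dependency and asserts the rotation engages, reusing the hypothetical pick_provider sketch from the provider rotation section.

    # chaos_drill.py -- monthly drill sketch; reuses the hypothetical pick_provider helper.
    from provider_failover import ProviderStats, pick_provider

    def drill_block_openai() -> None:
        # Simulate OpenAI being blocked by reporting a 100% error rate for it.
        stats = {
            "openai":   ProviderStats("openai",   error_rate=1.00, p95_latency_ms=0),
            "deepseek": ProviderStats("deepseek", error_rate=0.01, p95_latency_ms=1_200),
        }
        chosen = pick_provider(stats)
        assert chosen == "deepseek", f"fallback did not engage, got {chosen}"
        print("drill passed: traffic would fail over to deepseek")

    if __name__ == "__main__":
        drill_block_openai()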
Post-incident reviews
Every P1 or P2 triggers a 24-hour postmortem. The template includes timeline, blast radius, credits refunded, share copy updates, and new action items. We publish sanitized summaries on the internal blog so future teammates understand why certain guardrails exist.