A single /reflect drain run burned 41.5M tokens in 9.6 min (~$713) for zero net-new learnings. It handed a 123K-token transcript to an Opus agent that roamed with Bash for 223 turns; the same transcript was reflected 16×, and the daily cap blew from 20 to 61. A 30-day backfill showed reflect burning ~1.2B tokens across 446 runs (~$7k), almost all on Opus. This is the five-workstream fix that stops it.
The work is sequenced so the bleeding stops first (W1–W3, low-risk additive guards), then capture behaviour changes (W4), then the structural rebuild lands (W5). All five have shipped, and the 82 in-scope tests pass.
W1: Hard caps (--max-turns 8, a 180 s wall-clock limit, and poisoning of any run over 2M tokens), an atomic mkdir lock replacing the racy PID file, a debounce window, a REFLECT_DISABLED kill switch, and Sonnet as the default model. A repeat of the runaway is now structurally impossible.
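The admission and post-run checks above fit in a few lines. What follows is a minimal Python sketch under stated assumptions. The debounce length (DEBOUNCE_SECONDS), the function names, and the rule that any REFLECT_DISABLED value other than empty or "0" disables draining are all hypothetical. Only the 2M-token poison threshold comes from the text.

```python
import os
import time

DEBOUNCE_SECONDS = 300     # hypothetical: the real debounce window is not stated
TOKEN_MAX = 2_000_000      # the >2M-token poison cap from the text


def drain_allowed(last_run_epoch, now=None, env=None):
    """Return (allowed, reason) for starting a drain run."""
    env = os.environ if env is None else env
    now = time.time() if now is None else now
    # Kill switch (assumed semantics): any value other than "" or "0" disables reflect.
    if env.get("REFLECT_DISABLED", "") not in ("", "0"):
        return False, "disabled"
    # Debounce: refuse to start again too soon after the previous run.
    if last_run_epoch is not None and now - last_run_epoch < DEBOUNCE_SECONDS:
        return False, "debounced"
    return True, "ok"


def classify_run(total_tokens):
    """Post-hoc check: a run over TOKEN_MAX is poisoned and never retried."""
    return "poison" if total_tokens > TOKEN_MAX else "done"
```

Keeping the kill switch ahead of the debounce means an operator can stop draining without waiting for any window to expire.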
W2: A $0 regex gate over a transcript's dialogue that skips reflect-on-reflect, no-signal, and clean sessions, plus anything already queued or processed. The incident transcript is now skipped at enqueue for $0.
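A $0 gate can be as simple as plain regexes over user/assistant dialogue lines plus a set-membership dedup. The sketch below is minimal. The patterns in SIGNAL_PATTERNS and the helper names are hypothetical illustrations, not the real gate's rules. The point is that nothing here calls a model.

```python
import re

# Hypothetical signal patterns (corrections, standing instructions); not the real gate's rules.
SIGNAL_PATTERNS = [
    re.compile(r"\b(?:no|wrong|don't|do not|stop)\b.*\binstead\b", re.IGNORECASE),
    re.compile(r"\b(?:remember|always|never)\b", re.IGNORECASE),
    re.compile(r"\b(?:that's not|actually)\b", re.IGNORECASE),
]


def detect_signals(dialogue):
    """Return dialogue lines matching any signal pattern. Pure regex, so it costs $0."""
    return [
        line
        for line in dialogue.splitlines()
        if any(p.search(line) for p in SIGNAL_PATTERNS)
    ]


def already_seen(path, queued, terminal):
    """Dedup: skip anything already queued or already terminal in the cost log."""
    return path in queued or path in terminal
```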
W3: The full token envelope for each run, recorded in the cost log. Alongside it: a reflect cost CLI that breaks spend down by day, transcript, model, or outcome, with the cached-vs-uncached split and outlier flags; and a backfill that reconstructs history. This is how the ~$7k/30-day figure surfaced.
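With a per-run envelope in the log, the breakdown becomes a simple group-by. The sketch assumes hypothetical record fields (input, output, cache_read, cache_creation, plus grouping keys such as day or model). It also assumes a hypothetical outlier rule: a run whose total exceeds 5× the median. Cache reads count as cached, while fresh input plus cache writes count as uncached. Zero-output runs are flagged too.

```python
from collections import defaultdict
from statistics import median


def summarize(records, key, outlier_factor=5.0):
    """Group cost-log records by `key`; flag zero-output and outlier runs.

    Hypothetical record fields: input, output, cache_read, cache_creation,
    transcript, plus grouping fields such as day, model, outcome.
    """
    groups = defaultdict(lambda: {"runs": 0, "cached": 0, "uncached": 0, "output": 0})
    for r in records:
        g = groups[r[key]]
        g["runs"] += 1
        g["cached"] += r["cache_read"]                      # cheap reuse
        g["uncached"] += r["input"] + r["cache_creation"]   # fresh input + cache writes
        g["output"] += r["output"]

    totals = [r["input"] + r["cache_read"] + r["cache_creation"] + r["output"] for r in records]
    med = median(totals) if totals else 0
    flags = []
    for r, total in zip(records, totals):
        if r["output"] == 0:
            flags.append((r["transcript"], "zero-output"))
        if med and total > outlier_factor * med:
            flags.append((r["transcript"], "outlier"))
    return dict(groups), flags
```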
W4: Slices the transcript down to just its signal-bearing windows (~10× smaller) before /reflect, which then runs on Sonnet under the W1 caps. A real 150K-token dialogue sliced down to 15K. It reuses the existing write workflow rather than reinventing the KB layer.
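Slicing to signal-bearing windows amounts to keeping a neighbourhood around each signal and merging overlaps. This is a minimal sketch, assuming the transcript is a list of messages and a signal predicate is available. slice_windows and its default radius of 2 messages are hypothetical, since the real window size is not stated.

```python
def slice_windows(messages, is_signal, radius=2):
    """Keep only messages within `radius` of a signal-bearing message.

    Overlapping or touching windows are merged so no message appears twice.
    Returns a list of windows, each a list of consecutive messages.
    """
    hits = [i for i, m in enumerate(messages) if is_signal(m)]
    windows = []  # [start, end) index pairs, built in ascending order
    for i in hits:
        lo, hi = max(0, i - radius), min(len(messages), i + radius + 1)
        if windows and lo <= windows[-1][1]:
            windows[-1][1] = max(windows[-1][1], hi)  # extend the previous window
        else:
            windows.append([lo, hi])
    return [messages[lo:hi] for lo, hi in windows]
```

A transcript with no signals yields no windows at all, which lines up with the gate's "no-signal" skip.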
W5: The surfacer is retired, leaving a single consumer; graphml_repair.py self-heals the doubled-close-tag corruption that caused the rabbit hole; a neutral cwd; a one-shot backlog re-gate (114 → 13 entries); and a weekly Opus synthesis pass, with launchd timers driving both it and the drain.
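Here is a minimal sketch of the kind of self-heal graphml_repair.py performs. It assumes the corruption is the same closing tag emitted twice in a row, for example `</graph></graph>`. The function name and regex are hypothetical. The sketch only touches files that fail to parse, and it re-parses the result, so a different kind of damage still raises instead of being silently "repaired".

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the corruption is one closing tag repeated back to back,
# e.g. "</graph></graph>" or "</graphml>\n</graphml>".
DOUBLED_CLOSE = re.compile(r"(</([\w:.-]+)>)(?:\s*</\2>)+")


def repair_doubled_close_tags(text):
    """Collapse runs of identical consecutive closing tags, then verify the result parses."""
    try:
        ET.fromstring(text)
        return text  # already well-formed: leave untouched
    except ET.ParseError:
        pass
    repaired = DOUBLED_CLOSE.sub(r"\1", text)
    ET.fromstring(repaired)  # raises ParseError if the damage was something else
    return repaired
```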
The old design ran every transcript through one fat Opus path. The new design puts a $0 gate in front that drops most work, then runs the survivors sliced and cheap on Sonnet under hard caps. The surfacer is gone; a launchd timer (not per-session spawns) drives the single drainer.
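The timer-driven shape can be written as two launchd property lists. This sketch generates them with Python's standard plistlib. The 600 s StartInterval matches the drain interval given for the drain plist. The labels, program paths, and the Sunday 03:00 synthesis slot are hypothetical placeholders.

```python
import plistlib


def drain_plist(program_args, interval_seconds=600):
    """launchd job that runs the single drainer on a fixed interval."""
    return {
        "Label": "com.example.reflect.drain",   # hypothetical label
        "ProgramArguments": program_args,
        "StartInterval": interval_seconds,       # seconds between runs
        "RunAtLoad": False,
    }


def synthesis_plist(program_args, weekday=0, hour=3):
    """launchd job for the weekly synthesis pass (Weekday 0 = Sunday)."""
    return {
        "Label": "com.example.reflect.synthesis",  # hypothetical label
        "ProgramArguments": program_args,
        "StartCalendarInterval": {"Weekday": weekday, "Hour": hour, "Minute": 0},
    }


# Serialize to XML plist bytes, ready to write into ~/Library/LaunchAgents.
xml_bytes = plistlib.dumps(drain_plist(["/usr/local/bin/reflect-drain"]))
```

Because the schedule lives in launchd rather than in each session, the number of drain starts no longer scales with the number of sessions.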
Red = the unbounded v3.6 path that caused the incident. Olive = the v4.0 gated/sliced path. Clay dashed = the $0 drop that eliminates ~89% of work before any model runs.
Two things compounded: the cache never amortised (so 223 turns each re-paid a ~180K context), and five independent guard-rails were missing. The cost lever was context × turns × cache-miss, not model price.
| Metric | Value |
|---|---|
| cache_read (cheap reuse) | frozen at 21,670 |
| cache_creation (writes at 2× rate) | grew 59K → 199K |
| creation share of 41.5M | 67% |
| read share | 27% |
| avg context replayed / turn | 176K (max 221K) |
Only the static ~21.7K-token system head was ever reused. A volatile block above the transcript busted the cache for everything below it, so the 123K transcript was re-cached every turn at the 2× write rate.
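Under prefix caching, a block can be reused only if everything above it is unchanged. That is why one volatile block above the transcript was enough to defeat reuse. The simplified model below compares the two layouts turn by turn. It uses hypothetical helper names and an assumed 500-token volatile block. It ignores growing tool history and bills everything after the shared prefix as a cache write.

```python
# Simplified prefix-cache model: a block is reused only if it and every block
# above it match the previous turn's prompt exactly.
def cache_split(turn_prompts):
    """Return per-turn (cache_read, cache_creation) token counts.

    Each prompt is a list of (block_id, tokens). Everything after the first
    block that differs from the previous turn is treated as a cache write.
    """
    prev, out = [], []
    for prompt in turn_prompts:
        common = 0
        for a, b in zip(prev, prompt):
            if a != b:
                break
            common += 1
        read = sum(tokens for _, tokens in prompt[:common])
        created = sum(tokens for _, tokens in prompt[common:])
        out.append((read, created))
        prev = prompt
    return out


SYSTEM = ("system", 21_670)           # static head, from the table
TRANSCRIPT = ("transcript", 123_000)  # ~123K transcript
VOLATILE_TOKENS = 500                 # assumed size of the per-turn volatile block


def volatile_above(turn):
    return [SYSTEM, (f"volatile-{turn}", VOLATILE_TOKENS), TRANSCRIPT]


def volatile_below(turn):
    return [SYSTEM, TRANSCRIPT, (f"volatile-{turn}", VOLATILE_TOKENS)]
```

With the volatile block above the transcript, cache reads stay frozen at the 21,670-token head while the transcript is re-written every turn, the same shape as the table. With it below, the transcript turns into reads. The model is illustrative only, not a billing calculator.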
The two load-bearing pieces: the $0 gate verdict (W2) and the drain's hard caps + atomic lock (W1).
```python
def evaluate(path):
    if is_reflect_on_reflect(path):
        return GateVerdict("skip", "reflect-on-reflect")
    text = extract_dialogue(path)  # user/asst text only, NOT tool output noise
    signals = detect_signals(text)
    if not signals:
        return GateVerdict("skip", "no-signal")
    return GateVerdict("reflect", "has-signal", len(signals))

# dedup: already in queue OR terminal in cost log → skip
```
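```bash
# hard caps (was --max-turns 25, 600s, no token cap)
MAX_TURNS=8; ENTRY_TIMEOUT=180; TOKEN_MAX=2000000
DRAIN_MODEL=sonnet

# atomic mkdir lock (was racy check-then-write PID file)
acquire_lock() {
  if mkdir "$LOCK_DIR" 2>/dev/null; then
    echo $$ > "$LOCK_DIR/pid"        # record owner PID (assumed lock layout)
    return 0
  fi
  # owner alive? defer : reclaim stale
  local owner
  owner=$(cat "$LOCK_DIR/pid" 2>/dev/null)
  if [ -z "$owner" ] || kill -0 "$owner" 2>/dev/null; then
    return 1                         # owner alive (or PID not written yet): defer
  fi
  rm -rf "$LOCK_DIR"                 # owner gone: reclaim the stale lock
  # (two processes reclaiming the same stale lock at once can still race here)
  mkdir "$LOCK_DIR" 2>/dev/null && echo $$ > "$LOCK_DIR/pid"
}

# post-hoc: run > TOKEN_MAX → poison (never retried)
```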
reflect cost flags zero-output runs, a manual /reflect re-run is retained, and the weekly synthesis pass is a backstop.
… regate_backlog.py cover the immediate need.
Run regate_backlog.py once to collapse the live 114-entry backlog, load the two launchd plists (drain every 600 s, synthesis weekly), and verify Sonnet extract quality on the first few real runs via reflect cost.
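Here is a minimal sketch of a one-shot re-gate like regate_backlog.py. It assumes a hypothetical JSON-lines queue with a "path" key per entry, plus a gate callable that returns "reflect" or "skip". The real script's queue format and interface are not shown here. Writing to a temp file in the same directory and swapping it in with os.replace keeps the rewrite atomic.

```python
import json
import os
import tempfile


def regate_backlog(queue_path, gate):
    """Re-run `gate(path)` over a JSON-lines queue, keeping only "reflect" entries.

    Hypothetical queue format: one JSON object per line with a "path" key.
    Returns (entries_before, entries_kept).
    """
    with open(queue_path, encoding="utf-8") as f:
        entries = [json.loads(line) for line in f if line.strip()]
    kept = [e for e in entries if gate(e["path"]) == "reflect"]

    # Write to a temp file on the same filesystem, then atomically swap it in,
    # so a crash mid-rewrite never leaves a half-written queue.
    queue_dir = os.path.dirname(os.path.abspath(queue_path))
    fd, tmp = tempfile.mkstemp(dir=queue_dir)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for e in kept:
            f.write(json.dumps(e) + "\n")
    os.replace(tmp, queue_path)
    return len(entries), len(kept)
```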