Implementation · reflect plugin · v4.0.0 (shipped)

Reflect cost rearchitecture

Why this exists

One background /reflect drain run burned 41.5M tokens in 9.6 min (~$713) for zero net-new learnings. It handed a 123K-token transcript to an Opus agent that roamed with Bash for 223 turns, the same transcript was reflected 16×, and the daily cap overshot from 20 runs to 61. A 30-day backfill showed reflect was burning ~1.2B tokens across 446 runs (~$7k), almost all on Opus. This is the five-workstream fix that stops it.
Incident run: 41.5M tok · ~$713
30-day reflect spend: ~1.2B tok · ~$7k
Backlog re-gate: 114 → 13 (−89%)
Version: 3.6.0 → 4.0.0
01

Five workstreams

Sequenced so the bleeding stops first (W1–W3, low-risk additive guards), then capture behaviour (W4), then the structural rebuild (W5). All shipped; 82 in-scope tests green.

W1 · done

Circuit breaker

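Hard caps on every run (--max-turns 8, a 180s wall clock, and a poison mark for any run that exceeds 2M tokens, so it is never retried), an atomic mkdir lock replacing the racy PID file, a debounce window, a REFLECT_DISABLED kill switch, and sonnet as the default model. With turns, wall time, and retries all bounded, a runaway of the incident's shape can no longer occur.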

reflect-drain-bg.sh · 6 tests · 50131d2
W2 · done

Skip-gate + dedup

A $0 regex gate over a transcript's dialogue: skips reflect-on-reflect / no-signal / clean sessions and anything already queued or processed. The incident transcript now skips at enqueue for $0.
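A minimal sketch of what a regex signal pass of this kind could look like. The patterns below are hypothetical placeholders chosen for illustration; the actual pattern list in reflect_gate.py is not shown in this document.

```python
import re

# Hypothetical signal patterns (corrections and standing instructions).
# These are illustrative assumptions, not the real reflect_gate.py list.
SIGNAL_PATTERNS = [
    re.compile(r"\bno,? (?:that's|that is) (?:wrong|not right)\b", re.I),
    re.compile(r"\b(?:always|never) (?:use|do|run)\b", re.I),
    re.compile(r"\bremember (?:that|to)\b", re.I),
]


def detect_signals(text: str) -> list[str]:
    """Return each dialogue line that matches at least one signal pattern."""
    hits = []
    for line in text.splitlines():
        if any(p.search(line) for p in SIGNAL_PATTERNS):
            hits.append(line.strip())
    return hits


if __name__ == "__main__":
    print(detect_signals("thanks, looks good"))
    print(detect_signals("No, that's wrong.\nPlease never use rm -rf here."))
```

Because the pass is pure string matching, it costs no model tokens; an empty result is what would map to a "no-signal" skip.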

reflect_gate.py + both producers · 15 tests · 2c421d9
W3 · done

Cost observability

Full token envelope per run in the cost log, a reflect cost CLI (by day/transcript/model/outcome with the cached-vs-uncached split + outlier flags), and a backfill that reconstructs history. This is how the ~$7k/30d figure surfaced.
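The grouping behind such a report can be sketched as follows. The row field names (`input`, `output`, `cache_read`, `cache_creation`) and the outlier threshold are assumptions made for this example; the real cost-log schema and the reflect cost CLI flags are not shown in this document.

```python
from collections import defaultdict


def summarize(rows, key="model", outlier_tokens=2_000_000):
    """Group cost-log rows by `key` and split tokens into cached vs uncached.

    Assumed (hypothetical) row fields: input, output, cache_read, cache_creation,
    plus whichever grouping field is named by `key` (day, transcript, model, outcome).
    """
    groups = defaultdict(lambda: {"runs": 0, "cached": 0, "uncached": 0, "outliers": 0})
    for row in rows:
        cached = row["cache_read"]  # tokens served from the prompt cache
        uncached = row["input"] + row["cache_creation"] + row["output"]
        bucket = groups[row[key]]
        bucket["runs"] += 1
        bucket["cached"] += cached
        bucket["uncached"] += uncached
        if cached + uncached > outlier_tokens:
            bucket["outliers"] += 1  # flag runs above the threshold
    return dict(groups)


if __name__ == "__main__":
    demo = [
        {"model": "opus", "input": 100, "output": 50,
         "cache_read": 1_000, "cache_creation": 3_000_000},
        {"model": "sonnet", "input": 10, "output": 5,
         "cache_read": 20, "cache_creation": 30},
    ]
    for name, stats in summarize(demo).items():
        print(name, stats)
```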

reflect_cost.py · backfill_costs.py · 8 tests · 73e0241
W4 · done

Cascade (gate + slice)

Slices the transcript to just signal-bearing windows (~10× smaller) before /reflect, which then runs on Sonnet under the W1 caps. A real 150K-token dialogue sliced to 15K. Reuses the existing write workflow — no KB-layer reinvention.
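One way to picture the slicing step is a context window around every signal-bearing line, with overlapping windows merged. The function name, the `radius` parameter, and line-level granularity are assumptions for this sketch; reflect_cascade.py may slice differently.

```python
def slice_windows(lines, is_signal, radius=2):
    """Keep each signal line plus `radius` lines of context on either side.

    Overlapping windows merge naturally because kept indices live in a set.
    """
    keep = set()
    for i, line in enumerate(lines):
        if is_signal(line):
            start = max(0, i - radius)
            stop = min(len(lines), i + radius + 1)
            keep.update(range(start, stop))
    return [lines[i] for i in sorted(keep)]


if __name__ == "__main__":
    dialogue = [f"l{i}" for i in range(20)]
    print(slice_windows(dialogue, lambda s: s in {"l3", "l5"}, radius=1))
```

A transcript with no signal lines slices to nothing, which is consistent with the gate skipping it before any model runs.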

reflect_cascade.py · default-on · 9 tests · cfe39cb
W5 · done

Structural rebuild

Surfacer retired (single consumer), graphml_repair.py self-heals the doubled-close-tag corruption that caused the rabbit hole, neutral cwd, a one-shot backlog re-gate (114→13), and a weekly Opus synthesis pass + launchd timers.

graphml_repair.py · regate_backlog.py · reflect_synthesis.py · 11 tests · 2eca356
02

Architecture — before & after

The old design ran every transcript through one fat Opus path. The new design puts a $0 gate in front that drops most work, then runs the survivors sliced and cheap on Sonnet under hard caps. The surfacer is gone; a launchd timer (not per-session spawns) drives the single drainer.

[Diagram: architecture before and after, generated via /fireworks-tech-graph]

BEFORE (v3.6, unbounded, one fat path; every transcript goes to a full Opus agent): 2 producers (PreCompact · Stop) → JSONL queue (flat, no dedup) → 2 consumers (surfacer + drainer) → unbounded Opus agent (full 123K context, Bash, 223 turns) → inline reindex. Outcome: 41.5M tokens · ~$713 · 0 net-new.

AFTER (v4.0, gated + capped; the $0 gate drops most work and survivors run sliced on Sonnet under hard caps): 2 producers (PreCompact · Stop) → $0 skip-gate (reflect-on-reflect / no-signal / dup → dropped at $0, 89% of the backlog) → background drainer (sole consumer; driven by launchd every 600s rather than per-session; atomic lock, debounce) → cascade (slice to ~10%, then Sonnet) → caps (8 turns · 180s · 2M-token poison) → write + heal · reindex gate. Also shown: surfacer retired, neutral cwd, weekly Opus synthesis as a separate batch.

Red = the unbounded v3.6 path that caused the incident. Olive = the v4.0 gated/sliced path. Clay dashed = the $0 drop that eliminates ~89% of work before any model runs.

03

The failure mode

Two things compounded: the cache never amortised (so 223 turns each re-paid a ~180K context), and five independent guard-rails were missing. The cost lever was context × turns × cache-miss, not model price.

A · Inverted cache (per-turn, ×223)
cache_read (cheap reuse): frozen 21,670
cache_creation (2× writes): grew 59K → 199K
creation share of 41.5M: 67%
read share: 27%
avg context replayed / turn: 176K (max 221K)

Only the static ~21.7K system head was ever reused; a volatile block above the transcript busted the cache for everything below it, so the 123K transcript was re-cached every turn at 2× rates.
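As a rough check that context × turns was the cost lever, the figures above can be multiplied out. This is back-of-envelope arithmetic on the reported numbers, not output from the cost log. The counterfactual assumes, hypothetically, that slicing shrinks the per-turn context tenfold and that a capped run uses all 8 permitted turns.

```python
# Figures reported for the incident run.
TURNS = 223
AVG_CONTEXT_PER_TURN = 176_000
RUN_TOTAL_TOKENS = 41_500_000

# Context replayed across all turns, and its share of the run total.
replayed = TURNS * AVG_CONTEXT_PER_TURN
share = replayed / RUN_TOTAL_TOKENS

# Illustrative counterfactual (assumption): ~10x smaller context, 8-turn cap.
CAPPED_TURNS = 8
capped = CAPPED_TURNS * (AVG_CONTEXT_PER_TURN // 10)

if __name__ == "__main__":
    print(f"replayed context: {replayed:,} tokens ({share:.1%} of the run)")
    print(f"capped + sliced:  {capped:,} tokens")
```

Replayed context alone accounts for roughly 39.2M of the 41.5M tokens, which is why bounding turns and shrinking context matter more than the per-token model price.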

B · Five missing guard-rails
  • No turn/token bound — only a 600s wall-clock; the agent did 223 turns inside it. → W1
  • No gate — reflect-on-reflect transcript processed in full for 0 net-new. → W2/W4
  • No dedup — same transcript reflected 16×. → W2
  • Cap race — check-then-write PID file let the 20/day cap hit 61. → W1
  • Wrong cwd — ran in research-tech while analysing a cochilli transcript. → W5
04

Key code

The two load-bearing pieces: the $0 gate verdict (W2) and the drain's hard caps + atomic lock (W1).

scripts/reflect_gate.py — verdict (W2)
def evaluate(path):
    if is_reflect_on_reflect(path):
        return GateVerdict("skip", "reflect-on-reflect")
    text    = extract_dialogue(path)   # user/asst text only,
                                       # NOT tool output noise
    signals = detect_signals(text)
    if not signals:
        return GateVerdict("skip", "no-signal")
    return GateVerdict("reflect", "has-signal", len(signals))

# dedup: already in queue OR terminal in cost log → skip
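The dedup rule in the comment above can be sketched as a set-membership check. The `DedupIndex` class and its in-memory sets are hypothetical stand-ins for the JSONL queue and the cost log; the real lookup in reflect_gate.py is not shown in this document.

```python
from dataclasses import dataclass, field


@dataclass
class DedupIndex:
    """Hypothetical in-memory view of the queue and the cost log."""

    queued: set = field(default_factory=set)    # paths waiting to be drained
    terminal: set = field(default_factory=set)  # paths with a terminal outcome

    def should_enqueue(self, transcript_path: str) -> bool:
        """Enqueue only paths that are neither queued nor already terminal."""
        if transcript_path in self.queued or transcript_path in self.terminal:
            return False
        self.queued.add(transcript_path)
        return True

    def mark_terminal(self, transcript_path: str) -> None:
        """Record a finished run so the same transcript is never re-reflected."""
        self.queued.discard(transcript_path)
        self.terminal.add(transcript_path)


if __name__ == "__main__":
    idx = DedupIndex()
    print(idx.should_enqueue("a.jsonl"))
    print(idx.should_enqueue("a.jsonl"))
```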
hooks/reflect-drain-bg.sh — caps + lock (W1)
# hard caps (was --max-turns 25, 600s, no token cap)
MAX_TURNS=8; ENTRY_TIMEOUT=180; TOKEN_MAX=2000000
DRAIN_MODEL=sonnet

# atomic mkdir lock (was racy check-then-write PID file)
acquire_lock() {
  if mkdir "$LOCK_DIR" 2>/dev/null; then return 0; fi
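  # owner alive? defer : reclaim stale (that logic is elided in this excerpt)
  return 1  # excerpt default: lock is held, so defer instead of falling through with status 0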
}
# post-hoc: run > TOKEN_MAX → poison (never retried)
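The reason mkdir closes the cap race is that creating a directory is a single atomic operation: exactly one caller succeeds, with no gap between check and write. The same idea in Python, as a sketch; the helper names are illustrative, and stale-owner reclaim is deliberately left out.

```python
import os


def try_acquire(lock_dir: str) -> bool:
    """Atomically take the lock; exactly one concurrent caller can succeed."""
    try:
        os.mkdir(lock_dir)  # fails with FileExistsError if the lock is held
        return True
    except FileExistsError:
        return False


def release(lock_dir: str) -> None:
    """Drop the lock by removing the (empty) lock directory."""
    os.rmdir(lock_dir)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        lock = os.path.join(tmp, "reflect.lock")
        print(try_acquire(lock), try_acquire(lock))
        release(lock)
```

A check-then-write PID file, by contrast, lets two processes both see "no owner" before either writes, which is how a 20/day cap can be exceeded.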
05

Risks & honest edges

| Edge | Sev | Handling |
| --- | --- | --- |
| Cascade extraction misses a real lesson (straight-to-default, no A/B). | MED | reflect cost flags zero-output runs; manual /reflect re-run retained; weekly synthesis is a backstop. |
| Live Sonnet extract quality, KB vector dedup, launchd timers, and the weekly Opus call are not unit-testable in a bare worktree. | MED | Deterministic cores are unit-tested (82 in-scope tests); LLM/launchd edges verify on deploy. |
| Cache-miss root cause (volatile block busting the prefix) is a hypothesis. | LOW | Evidence is the frozen-21,670-read / growing-creation signature across all 223 turns; the fix (slice + caps) holds regardless of exact trigger. |
| graphml truncate-repair could in theory drop valid trailing data. | LOW | Repairs only on parse-failure, backs up first, re-validates, and restores the original if the repair doesn't parse. |
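The repair-only-on-failure, back-up, re-validate, restore sequence from the last row can be sketched as below. The truncation rule (cut after the first `</graphml>`) and the `.bak` suffix are assumptions made for illustration; graphml_repair.py may use a different repair strategy.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def parses(path: Path) -> bool:
    """True if the file is well-formed XML."""
    try:
        ET.parse(str(path))
        return True
    except ET.ParseError:
        return False


def repair_graphml(path: Path) -> str:
    """Return 'ok', 'repaired', or 'restored' (original put back untouched)."""
    if parses(path):
        return "ok"  # never touch a file that already parses
    backup = path.with_suffix(path.suffix + ".bak")
    shutil.copy2(path, backup)  # back up before any write
    text = path.read_text(encoding="utf-8")
    close = "</graphml>"
    cut = text.find(close)
    if cut != -1:
        # Assumed repair: drop everything after the first closing tag.
        path.write_text(text[: cut + len(close)] + "\n", encoding="utf-8")
        if parses(path):  # re-validate the repaired file
            return "repaired"
    shutil.copy2(backup, path)  # repair failed: restore the original bytes
    return "restored"
```

For a file ending in a doubled `</graphml></graphml>`, this returns "repaired"; for a file with no closing tag at all, it leaves the original content in place and returns "restored".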
06

Deferred & open

Full sqlite-queue migration — deferred
W2 dedup (path + processed-set) and the W4 signal-hash already deliver idempotency in practice. The sqlite rewrite is robustness polish that carries the highest migration risk, so it is tracked as a focused follow-up; the JSONL queue + regate_backlog.py cover the immediate need.
Follow-up · own PR
Deploy-time steps
Run regate_backlog.py once to collapse the live 114-entry backlog; load the two launchd plists (drain 600s, synthesis weekly); verify Sonnet extract quality on the first few real runs via reflect cost.
On deploy
Tunables left at sane defaults
Weekly synthesis cadence (Sun 03:00), launchd drain interval (10 min), vector-dedup threshold (cos > 0.85). All overridable; revisit once cost telemetry accrues.
Tune later
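For the vector-dedup tunable, a cosine threshold check might look like the sketch below. Plain Python lists stand in for embeddings, and the function names are hypothetical; only the cos > 0.85 default comes from this document.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(candidate, existing, threshold=0.85) -> bool:
    """A learning counts as a duplicate if any stored vector exceeds the threshold."""
    return any(cosine(candidate, vec) > threshold for vec in existing)


if __name__ == "__main__":
    print(is_duplicate([1.0, 0.0], [[0.9, 0.1]]))
    print(is_duplicate([1.0, 0.0], [[0.0, 1.0]]))
```

Raising the threshold admits more near-duplicates into the KB; lowering it risks discarding distinct learnings, which is the trade-off to revisit once telemetry accrues.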