Implementation · reflect plugin · v4.0.0 (shipped)

Reflect cost rearchitecture

Why this exists

One background /reflect drain run burned 41.5M tokens in 9.6 min (~$713) for zero net-new learnings. It handed a 123K-token transcript to an Opus agent that roamed with Bash for 223 turns, the same transcript was reflected 16×, and the daily run count overshot its cap of 20 and reached 61. A 30-day backfill showed reflect burning ~1.2B tokens across 446 runs (~$7k), almost all on Opus. This is the five-workstream fix that stops it.

Incident run	41.5M tok · ~$713
30-day reflect spend	~1.2B tok · ~$7k
Backlog re-gate	114 → 13 (−89%)
Version	3.6.0 → 4.0.0

Five workstreams

Sequenced so the bleeding stops first (W1–W3, low-risk additive guards), then capture behaviour (W4), then the structural rebuild (W5). All shipped; 82 in-scope tests green.

W1 · done

Circuit breaker

Hard caps (--max-turns 8, 180s wall, >2M-token poison), an atomic mkdir lock replacing the racy PID file, a debounce window, a REFLECT_DISABLED kill switch, and sonnet as the default model. The runaway is now structurally impossible.

reflect-drain-bg.sh · 6 tests · 50131d2
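The kill switch and debounce window described above can be pictured as a preflight check that runs before any model call. This is a minimal Python sketch under stated assumptions, not the shipped shell logic: the function name `should_drain`, the stamp-file mechanism, and the 300-second window are hypothetical (the source names REFLECT_DISABLED and a debounce window but gives no value).

```python
import os
import time
from pathlib import Path

DEBOUNCE_SECONDS = 300  # hypothetical window; the source does not state the value


def should_drain(stamp: Path, env=os.environ, now=None) -> tuple[bool, str]:
    """Return (run?, reason) for one drain attempt, at zero model cost."""
    now = time.time() if now is None else now
    # Kill switch: any non-empty REFLECT_DISABLED value stops the drainer outright.
    if env.get("REFLECT_DISABLED"):
        return False, "disabled"
    # Debounce: skip if the previous accepted attempt was too recent.
    if stamp.exists() and now - stamp.stat().st_mtime < DEBOUNCE_SECONDS:
        return False, "debounced"
    stamp.touch()  # record this attempt for the next caller
    return True, "ok"
```

Checking the kill switch first means a disabled drainer never touches the stamp file, so re-enabling it does not inherit a stale debounce.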

W2 · done

Skip-gate + dedup

A $0 regex gate over a transcript's dialogue: skips reflect-on-reflect / no-signal / clean sessions and anything already queued or processed. The incident transcript now skips at enqueue for $0.

reflect_gate.py + both producers · 15 tests · 2c421d9

W3 · done

Cost observability

Full token envelope per run in the cost log, a reflect cost CLI (by day/transcript/model/outcome with the cached-vs-uncached split + outlier flags), and a backfill that reconstructs history. This is how the ~$7k/30d figure surfaced.

reflect_cost.py · backfill_costs.py · 8 tests · 73e0241
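To illustrate the kind of rollup the cost CLI produces, here is a hedged Python sketch that groups cost-log records by an arbitrary key and splits cached from uncached tokens. The record field names (`input`, `cache_read`, `cache_creation`, `output`), the function `rollup`, and the choice to reuse the 2M-token poison line as the outlier threshold are all assumptions for illustration; the real log schema is not shown in this document.

```python
from collections import defaultdict

OUTLIER_TOKENS = 2_000_000  # assumption: reuse the W1 poison threshold as the outlier line


def rollup(records, key="day"):
    """Group cost-log records by `key` and split cached vs uncached tokens."""
    out = defaultdict(lambda: {"runs": 0, "cached": 0, "uncached": 0, "outliers": 0})
    for r in records:
        cached = r["cache_read"]
        # Everything that was not a cache read is counted as uncached here,
        # including cache_creation (newly written prefix tokens).
        uncached = r["input"] + r["cache_creation"] + r["output"]
        row = out[r[key]]
        row["runs"] += 1
        row["cached"] += cached
        row["uncached"] += uncached
        row["outliers"] += (cached + uncached) > OUTLIER_TOKENS
    return dict(out)
```

Passing `key="model"` or `key="transcript"` gives the other groupings with the same code, provided each record carries that field.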

W4 · done

Cascade (gate + slice)

Slices the transcript to just signal-bearing windows (~10× smaller) before /reflect, which then runs on Sonnet under the W1 caps. A real 150K-token dialogue sliced to 15K. Reuses the existing write workflow — no KB-layer reinvention.

reflect_cascade.py · default-on · 9 tests · cfe39cb
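The slicing step can be sketched as "keep each signal-bearing turn plus a little context, drop the rest". The Python below is an illustration only: the `SIGNAL` regex, the `radius` parameter, and the function name `slice_windows` are hypothetical, since the real cascade's patterns and window size are not given here.

```python
import re

# Hypothetical signal patterns; the real detector's regexes are not shown in the source.
SIGNAL = re.compile(r"\b(actually|instead|root cause|wrong)\b", re.I)


def slice_windows(turns, radius=1):
    """Keep each signal-bearing turn plus `radius` turns of context on each side."""
    keep = set()
    for i, text in enumerate(turns):
        if SIGNAL.search(text):
            lo = max(0, i - radius)
            hi = min(len(turns), i + radius + 1)
            keep.update(range(lo, hi))
    # Preserve original order so the sliced dialogue still reads chronologically.
    return [turns[i] for i in sorted(keep)]
```

A transcript with no matching turn slices to an empty list, which lines up with the gate's no-signal skip.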

W5 · done

Structural rebuild

Surfacer retired (single consumer), graphml_repair.py self-heals the doubled-close-tag corruption that caused the rabbit hole, neutral cwd, a one-shot backlog re-gate (114→13), and a weekly Opus synthesis pass + launchd timers.

graphml_repair.py · regate_backlog.py · reflect_synthesis.py · 11 tests · 2eca356

Architecture — before & after

The old design ran every transcript through one fat Opus path. The new design puts a $0 gate in front that drops most work, then runs the survivors sliced and cheap on Sonnet under hard caps. The surfacer is gone; a launchd timer (not per-session spawns) drives the single drainer.

Red = the unbounded v3.6 path that caused the incident. Olive = the v4.0 gated/sliced path. Clay dashed = the $0 drop that eliminates ~89% of work before any model runs.

The failure mode

Two things compounded: the cache never amortised (so 223 turns each re-paid a ~180K context), and five independent guard-rails were missing. The cost lever was context × turns × cache-miss, not model price.

A · Inverted cache (per-turn, ×223)

cache_read (cheap reuse)	frozen 21,670
cache_creation (2× writes)	grew 59K → 199K
creation share of 41.5M	67%
read share	27%
avg context replayed / turn	176K (max 221K)

Only the static ~21.7K system head was ever reused; a volatile block above the transcript busted the cache for everything below it, so the 123K transcript was re-cached every turn at 2× rates.
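A quick back-of-envelope check shows how the growing cache_creation figure alone accounts for most of the run. This sketch assumes the per-turn creation size grew linearly from 59K to 199K across the 223 turns, which the table does not state; under that assumption the total is a trapezoid sum.

```python
TURNS = 223
CREATION_START, CREATION_END = 59_000, 199_000  # per-turn cache_creation, from the table
RUN_TOTAL = 41_500_000                          # whole-run token count

# Assumption: linear growth, so the sum over turns is turns * average.
creation_total = TURNS * (CREATION_START + CREATION_END) // 2
share = creation_total / RUN_TOTAL

print(f"{creation_total:,} tokens, {share:.0%} of the run")
# prints "28,767,000 tokens, 69% of the run"
```

The estimate lands close to the 67% creation share reported in the table, which supports the reading that re-writing the transcript into the cache every turn was the dominant cost.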

B · Five missing guard-rails

No turn/token bound — only a 600s wall-clock; the agent did 223 turns inside it. → W1
No gate — reflect-on-reflect transcript processed in full for 0 net-new. → W2/W4
No dedup — same transcript reflected 16×. → W2
Cap race — check-then-write PID file let the 20/day cap hit 61. → W1
Wrong cwd — ran in research-tech while analysing a cochilli transcript. → W5

Key code

The two load-bearing pieces: the $0 gate verdict (W2) and the drain's hard caps + atomic lock (W1).

scripts/reflect_gate.py — verdict (W2)
def evaluate(path):
    if is_reflect_on_reflect(path):
        return GateVerdict("skip", "reflect-on-reflect")
    text    = extract_dialogue(path)   # user/asst text only,
                                       # NOT tool output noise
    signals = detect_signals(text)
    if not signals:
        return GateVerdict("skip", "no-signal")
    return GateVerdict("reflect", "has-signal", len(signals))

# dedup: already in queue OR terminal in cost log → skip
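For readers who want something runnable, the excerpt above and its dedup note can be approximated in one self-contained function. This is a sketch under assumptions, not the contents of reflect_gate.py: the `Verdict` field names, the signal regexes, the substring test for reflect-on-reflect, and the use of plain sets for the queue and processed log are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns; the source does not list the real signal regexes.
SIGNAL_PATTERNS = [
    re.compile(p, re.I)
    for p in (r"\bthat'?s wrong\b", r"\binstead\b", r"\bnext time\b")
]


@dataclass(frozen=True)
class Verdict:
    action: str       # "skip" or "reflect"
    reason: str
    signals: int = 0


def gate(path, dialogue, queued, processed):
    """Decide, without any model call, whether a transcript is worth reflecting on."""
    # Dedup first: already queued or already terminal in the cost log.
    if path in queued or path in processed:
        return Verdict("skip", "duplicate")
    if "/reflect" in dialogue:  # crude stand-in for the reflect-on-reflect check
        return Verdict("skip", "reflect-on-reflect")
    hits = sum(1 for p in SIGNAL_PATTERNS if p.search(dialogue))
    if hits == 0:
        return Verdict("skip", "no-signal")
    return Verdict("reflect", "has-signal", hits)
```

Ordering the checks from cheapest to most expensive (set lookup, substring, regex scan) keeps the gate effectively free even on a large backlog.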

hooks/reflect-drain-bg.sh — caps + lock (W1)
# hard caps (was --max-turns 25, 600s, no token cap)
MAX_TURNS=8; ENTRY_TIMEOUT=180; TOKEN_MAX=2000000
DRAIN_MODEL=sonnet

# atomic mkdir lock (was racy check-then-write PID file)
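acquire_lock() {
  # mkdir is atomic: it either creates the lock dir or fails
  if mkdir "$LOCK_DIR" 2>/dev/null; then return 0; fi
  # lock held: owner alive? defer : reclaim stale (logic elided in this excerpt)
  return 1
}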
# post-hoc: run > TOKEN_MAX → poison (never retried)
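The lock and poison rule can be spelled out more fully in Python. This is a sketch of the general technique (atomic directory creation plus stale-owner reclaim), not a port of reflect-drain-bg.sh: the `pid` file inside the lock directory, the helper names, and the single retry are assumptions, and `os.kill(pid, 0)` as a liveness probe assumes a POSIX system.

```python
import os
from pathlib import Path

TOKEN_MAX = 2_000_000  # from the caps above


def pid_alive(pid: int) -> bool:
    try:
        os.kill(pid, 0)  # signal 0 probes for existence without delivering a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True      # process exists but belongs to another user
    return True


def acquire_lock(lock_dir: Path) -> bool:
    """mkdir either creates the directory or fails, so only one caller can win."""
    for _ in range(2):   # the second pass runs only after reclaiming a stale lock
        try:
            lock_dir.mkdir()
        except FileExistsError:
            pid_file = lock_dir / "pid"  # hypothetical owner record
            try:
                owner = int(pid_file.read_text())
            except (FileNotFoundError, ValueError):
                return False             # owner has not written its pid yet: defer
            if pid_alive(owner):
                return False             # live owner: defer
            try:
                pid_file.unlink()
                lock_dir.rmdir()         # stale owner: reclaim, then retry mkdir
            except OSError:
                return False             # another process reclaimed first
            continue
        (lock_dir / "pid").write_text(str(os.getpid()))
        return True
    return False


def is_poison(total_tokens: int) -> bool:
    """Post-hoc check: a run above TOKEN_MAX is marked poison and never retried."""
    return total_tokens > TOKEN_MAX
```

Unlike a check-then-write PID file, there is no window between "is it free?" and "claim it": the claim is the check.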

Risks & honest edges

Edge	Sev	Handling
Cascade extraction misses a real lesson (straight-to-default, no A/B).	MED	reflect cost flags zero-output runs; manual /reflect re-run retained; weekly synthesis is a backstop.
Live Sonnet extract quality, KB vector dedup, launchd timers, weekly Opus call: not unit-testable in a bare worktree.	MED	Deterministic cores are unit-tested (82 in-scope tests); LLM/launchd edges verify on deploy.
Cache-miss root cause (volatile block busting the prefix) is a hypothesis.	LOW	Evidence is the frozen-21,670-read / growing-creation signature across all 223 turns; the fix (slice + caps) holds regardless of exact trigger.
graphml truncate-repair could in theory drop valid trailing data.	LOW	Repairs only on parse-failure, backs up first, re-validates, and restores the original if the repair doesn't parse.
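The graphml safeguard sequence (repair only on parse failure, back up, re-validate, restore on failure) can be sketched as follows. This is an illustration of that sequence, not the code in graphml_repair.py: the truncate-after-first-close-tag strategy, the `.bak` suffix, and the return values are assumptions.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path

CLOSE = "</graphml>"


def parses(path: Path) -> bool:
    try:
        ET.parse(path)
        return True
    except ET.ParseError:
        return False


def repair(path: Path) -> str:
    """Return 'ok', 'repaired', or 'restored'."""
    if parses(path):
        return "ok"                               # repair only on parse failure
    backup = path.with_name(path.name + ".bak")   # hypothetical backup name
    shutil.copy2(path, backup)                    # back up before touching anything
    text = path.read_text(encoding="utf-8")
    end = text.find(CLOSE)
    if end != -1:
        # Truncate after the first closing tag, dropping a doubled close.
        path.write_text(text[: end + len(CLOSE)] + "\n", encoding="utf-8")
    if parses(path):                              # re-validate the repaired file
        return "repaired"
    shutil.copy2(backup, path)                    # repair did not parse: restore original
    return "restored"
```

Because the file is only modified after it has already failed to parse, the truncation can at worst replace one unparseable file with another, and the restore step undoes even that.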

Deferred & open

Full sqlite-queue migration — deferred

W2 dedup (path + processed-set) and W4 signal-hash already deliver idempotency in practice. The sqlite rewrite is robustness polish that carries the highest migration risk, so it is tracked as a focused follow-up. The JSONL queue + regate_backlog.py cover the immediate need.

Follow-up · own PR

Deploy-time steps

Run regate_backlog.py once to collapse the live 114-entry backlog; load the two launchd plists (drain 600s, synthesis weekly); verify Sonnet extract quality on the first few real runs via reflect cost.

On deploy

Tunables left at sane defaults

Weekly synthesis cadence (Sun 03:00), launchd drain interval (10 min), vector-dedup threshold (cos > 0.85). All overridable; revisit once cost telemetry accrues.

Tune later
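To make the vector-dedup tunable concrete, here is a small Python sketch of a cosine-similarity check against the 0.85 threshold. It assumes learnings are already embedded as plain lists of floats; the function names are hypothetical and the KB's actual embedding and lookup path is not described in this document.

```python
import math

COS_THRESHOLD = 0.85  # from the tunables above


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # treat a zero vector as dissimilar


def is_duplicate(candidate, existing):
    """True if the candidate is closer than the threshold to any stored vector."""
    return any(cosine(candidate, e) > COS_THRESHOLD for e in existing)
```

Raising the threshold admits more near-duplicates into the KB; lowering it risks discarding learnings that are merely related.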