Options Paper · memory architecture decision

Recall/Reflect architecture vs Hindsight memory

Decision support for whether Lambda should bin the custom reflect/recall plugin and standardize on Vectorize Hindsight.

Date: 2026-06-08
Author: Motoko 👁️
Scope: architecture · features · token cost · self-improvement
Sourced draft

Unbiased call

Do not bin the custom reflect/recall system wholesale yet. Adopt Hindsight as a shadow/hybrid memory substrate first. Keep the custom self-improvement loop, correction ledger, fleet learnings, and human-editable observability layer.

Best target state: Hindsight for temporal/entity/graph recall; custom Lambda reflect pipeline for policy, correction capture, skill promotion, and ops-grade auditability. If the pilot proves acceptable token cost and correction visibility, retire the custom retrieval backend, but not the self-improvement control plane.

Problem

What Stevie cares about

Self-improvement matters more than generic long-term memory.
Temporal memory matters: what happened when, what changed, what was learned after correction.
Observability/correction must be first-class, not buried in opaque embeddings.
Token usage matters. Memory that auto-injects 4k tokens every turn can become a tax.

The real decision

This is not “custom memory vs packaged memory.” It is control-plane vs substrate.

Custom reflect/recall is a Lambda-specific learning control plane. Hindsight is a general memory engine with stronger retrieval, temporal modeling, entity graph, observations, and API/UI surface.

Evaluation criteria

Weights are tuned to your stated priorities and sum to 1.00. Scores run from 1 to 5, so the weighted total is out of 5.0.

Criterion	What it covers	Weight
Self-improvement fit	corrections → patterns → skills → behavior	0.25
Temporal fidelity	session lineage, event time, learned time, recency	0.22
Observability & correction	inspect, edit, delete, replay, trace evidence	0.22
Token efficiency	retrieval budget, auto-injection behavior, LLM calls	0.20
Maintenance drag	local code, bugs, upgrades, integration work	0.08
Portability	multi-agent, multi-profile, cloud/local modes	0.03
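
The weighting can be sketched as a weighted mean. This is a minimal illustration, not the scoring script behind this paper: the key names are made up, and skipping unscored criteria with renormalized weights is an assumption. The option cards list five scores and no Portability score, so their totals depend on inputs this sketch cannot see; treat its output as illustrative.

```python
# Weights as listed in the evaluation criteria; the key names are illustrative.
WEIGHTS = {
    "self_improvement": 0.25,
    "temporal": 0.22,
    "observability": 0.22,
    "tokens": 0.20,
    "maintenance": 0.08,
    "portability": 0.03,
}


def weighted_total(scores, weights=WEIGHTS):
    """Return the weighted mean of 1-5 scores on a /5.0 scale.

    Assumption: criteria without a score are skipped and the remaining
    weights are renormalized, so a partial score card still lands on /5.0.
    """
    if not scores:
        raise ValueError("at least one score is required")
    unknown = set(scores) - set(weights)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    for name, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{name} score {score} is outside 1-5")
    covered = sum(weights[name] for name in scores)
    return sum(weights[name] * score for name, score in scores.items()) / covered


# Hypothetical score card, not one of the options in this paper.
example = {
    "self_improvement": 3,
    "temporal": 4,
    "observability": 4,
    "tokens": 3,
    "maintenance": 3,
    "portability": 3,
}
print(round(weighted_total(example), 2))
```

Because the two largest weights sit on self-improvement and temporal fidelity, a one-point change there moves the total roughly three times as far as a one-point change in maintenance drag.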

Architecture: current recall/reflect

Current system is file-first and hook-driven. It treats memory as learning artifacts, not a general conversation database.

Mechanics found in source

recall.py wraps ~/.learnings/cli/learnings for GraphRAG/vector search.
It fans out to QMD BM25 in parallel and fuses results via Reciprocal Rank Fusion.
It reranks by confidence, recency, and tag overlap, then renders compact markdown or JSON.
SessionStart hook injects top-3 learnings with a 1500 character cap.
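
The fusion and cap steps above can be sketched as follows. This is a generic Reciprocal Rank Fusion, not the code in recall.py: k=60 is the commonly used constant (the value recall.py uses is not stated here), the confidence/recency/tag rerank is left out, and truncating the joined text is only one plausible reading of the 1500 character cap. All function and variable names are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def render_top(doc_ids, texts, top_n=3, char_cap=1500):
    """Join the top_n snippets as bullets and truncate to char_cap characters."""
    body = "\n".join(f"- {texts[doc_id]}" for doc_id in doc_ids[:top_n])
    return body[:char_cap]


vector_hits = ["a", "b", "c"]  # hypothetical GraphRAG/vector ranking
bm25_hits = ["b", "c", "a"]    # hypothetical QMD BM25 ranking
fused = rrf_fuse([vector_hits, bm25_hits])
print(fused)  # ['b', 'a', 'c']
```

RRF only needs ranks, not comparable scores, which is why it suits fusing a vector ranking with a BM25 ranking.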

Self-improvement path

PreCompact hook cannot run an LLM, so it queues transcript metadata to ~/.reflect/pending_reflections.jsonl.
Next SessionStart surfaces pending work and instructs the agent to run /reflect.
Corrections, discoveries, patterns, and skills remain file-backed and human-editable.
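
The queue-then-drain pattern above is small enough to sketch. This is an assumed shape, not the contents of precompact_reflect.py: the record fields, function names, and reminder wording are hypothetical. The point is that the PreCompact side only appends a line, and the SessionStart side only reads and reminds.

```python
import json
from pathlib import Path


def enqueue_reflection(queue_path, transcript_path, session_id):
    """Append one pending-reflection record; no LLM call is involved."""
    record = {"session_id": session_id, "transcript": str(transcript_path)}
    queue_path.parent.mkdir(parents=True, exist_ok=True)
    with queue_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


def pending_reflections(queue_path):
    """Return queued records, skipping blank or malformed lines."""
    if not queue_path.exists():
        return []
    records = []
    for line in queue_path.read_text(encoding="utf-8").splitlines():
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # tolerate a partially written line
    return records


def session_start_notice(queue_path):
    """Build the reminder a SessionStart hook could surface to the agent."""
    count = len(pending_reflections(queue_path))
    if count == 0:
        return ""
    return f"{count} pending reflection(s) queued; run /reflect."
```

An append-only JSONL file keeps the queue greppable and hand-editable, which matches the file-first stance of the current system.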

Architecture: Hindsight

Hindsight is a memory engine. It turns content into structured facts, entities, graph links, temporal metadata, observations, and reflect responses with citations.

Hindsight core

retain() extracts structured facts, resolves entities, and records graph links, causal links, event time, and learned time.
recall() runs semantic, keyword, graph, and temporal search, fuses results, reranks, and stops at a token budget.
reflect() runs an agentic loop over mental models, observations, raw facts, and expand tools.
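
The "stops at a token budget" behavior of recall() is what makes cost controllable. A minimal sketch of the idea, not Hindsight's implementation: the four-characters-per-token estimate and both function names are assumptions.

```python
def estimate_tokens(text):
    """Rough token estimate: about four characters per token (assumption)."""
    return max(1, len(text) // 4)


def pack_to_budget(ranked_snippets, max_tokens):
    """Keep ranked snippets in order until the next one would exceed the budget."""
    kept, used = [], 0
    for snippet in ranked_snippets:
        cost = estimate_tokens(snippet)
        if used + cost > max_tokens:
            break
        kept.append(snippet)
        used += cost
    return kept, used


# Three hypothetical snippets of about 100 tokens each, packed into 250 tokens.
snippets = ["a" * 400, "b" * 400, "c" * 400]
kept, used = pack_to_budget(snippets, max_tokens=250)
print(len(kept), used)  # 2 200
```

Because packing follows rank order, a lower budget drops the weakest results first.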

Hermes integration

Provider modes: cloud, local embedded, local external.
Memory modes: context, tools, hybrid.
Controls: auto_retain, auto_recall, retain_every_n_turns, retain_async, recall_max_tokens, recall_budget, tags, bank templates.

Options

The decision is not binary. Four realistic paths.

Option A

Keep custom only

Leave reflect/recall as the source of truth. No Hindsight adoption except maybe manual research.

Scores: Improve 4 · Temporal 2 · Observe 5 · Tokens 4 · Maint 2

Weighted total: 3.76 / 5

Option B

Adopt Hindsight only

Replace custom reflect/recall and depend on Hindsight retain/recall/reflect plus its UI/API.

Scores: Improve 3 · Temporal 5 · Observe 4 · Tokens 2 · Maint 4

Weighted total: 3.76 / 5

Option C

Hybrid substrate

Hindsight stores/retrieves temporal memory. Custom Lambda reflect remains the learning/control plane.

Scores: Improve 5 · Temporal 5 · Observe 4 · Tokens 4 · Maint 3

Weighted total: 4.59 / 5

Option D

Shadow pilot first

Run Hindsight silently beside the current system. Compare recall quality, tokens, corrections, and latency before cutover.

Scores: Improve 4 · Temporal 4 · Observe 5 · Tokens 5 · Maint 2

Weighted total: 4.42 / 5

Full feature comparison

Feature	Current reflect/recall	Hindsight	Winner	Notes
Primary purpose	Fleet self-improvement and learning retrieval	General agent memory engine	Depends	Different jobs. Current system encodes Lambda process. Hindsight encodes memory mechanics.
Storage substrate	Markdown/YAML/JSONL, `~/.learnings`, `~/.reflect`, graph cache, QMD docs	Memory banks on PostgreSQL/Oracle or embedded/local/cloud	Hindsight	Hindsight has a coherent database model. Custom is inspectable but fragmented.
Automatic write path	PreCompact queues transcripts; agent later runs /reflect; discoveries/corrections append explicitly	Hermes provider can auto-retain every N turns; async retain; document/session metadata	Hindsight	Hindsight is stronger for ordinary memory ingestion. Current is safer for intentional learning capture.
Fact extraction	LLM-driven /reflect creates learnings/patterns by instruction; no universal fact schema	LLM extracts structured facts, speaker perspective, entities, time, causal links	Hindsight	Hindsight wins factual memory. Custom wins policy-specific reflection.
Entity resolution	Mostly whatever GraphRAG/learnings index infers; not exposed as a first-class correction surface	Explicit entity recognition/resolution and graph links	Hindsight	Major Hindsight advantage.
Temporal model	Recency via archive timestamp half-life; session/queue timestamps; branch/project query context	Tracks event time and learned time; temporal recall parses date windows and spreads across period	Hindsight	This maps directly to your “temporal matters” requirement.
Search modes	GraphRAG via learnings CLI + QMD BM25; RRF fusion; confidence/recency/tag rerank	Semantic + keyword + graph + temporal; RRF fusion; cross-encoder rerank; boosts for recency/proof/time	Hindsight	Current is credible. Hindsight is deeper and productized.
Token budgeting	Hard character caps: SessionStart top 3, 1500 chars; explicit recall default 2000 chars	`max_tokens` budget; `budget` low/mid/high; recall default 4096 tokens in Hermes provider	Custom by default	Hindsight has better knobs. Current defaults are cheaper. Misconfigured Hindsight can tax every turn.
Read-path LLM use	recall retrieval itself no LLM; /reflect uses main agent when asked/queued	`recall()` no LLM; `reflect()` uses LLM loop; retain/consolidation use LLM	Tie	For retrieval only, both can be cheap. Hindsight write path costs more.
Self-improvement workflow	Corrections → discoveries → patterns → skills; fleet rules and promotion pipeline exist	Observations evolve with evidence; directives and mental models exist, but not Lambda correction/skill workflow	Custom	This is the strongest reason not to bin custom.
Observation/consolidation	Patterns are explicit, but promotion is agent/process-driven	Automatic observations with proof count, freshness, evidence quotes, contradiction handling	Hindsight	Hindsight’s observations are closer to a memory substrate; custom patterns are closer to operational policy.
Correction/deletion semantics	Edit files directly; corrections logged in markdown; archive queues manually	Delete memory/document/bank; derived observations invalidated and re-consolidated; clear observations endpoint	Hindsight	Custom is more transparent; Hindsight is more internally consistent after deletion.
Observability	Plain files, JSONL logs, recall cache/log, forensics breadcrumbs; easy grep/diff	Control plane/UI, API, traces/tool calls in reflect, usage metrics, Prometheus metrics	Tie	Different observability. Custom is Unix-visible. Hindsight is product-visible.
Citations/evidence	Learning snippets and markdown sources; not always proof-counted	Reflect returns based_on, citations, trace, usage; observations carry source memories and quotes	Hindsight	Important if you want memory to defend itself.
Human editability	Excellent. Edit markdown/JSONL. Git diffable.	Good via API/UI, but database-backed and less “open a file and patch.”	Custom	For emergency correction, files are hard to beat.
Multi-agent isolation	Depends on paths/profile convention and fleet discipline	Bank IDs, tags, bank templates: profile/workspace/platform/user/session	Hindsight	Hindsight provider code already supports dynamic bank ID templates and tags.
Operational maturity	Local custom scripts; fragile but owned	Public repo, client SDKs, local/cloud modes, metrics; still new and dependency-heavy	Hindsight	Hindsight lowers maintenance, but creates vendor/project dependency.
Local/offline	Works if local learnings/QMD/GraphRAG stack exists	Local embedded/external supported; can use local LLM/embeddings/reranker	Tie	Both need dependency hygiene.
Failure mode	Silent no-op by design; empty context if CLI/hook fails	Provider logs failures; retain queue can drop on shutdown timeout; daemon/client failure modes	Depends	Current fails quiet. Hindsight fails richer but has more moving parts.

Token economics

Cheap configuration

Use Hindsight in tools-only or low-injection mode during pilot:

memory_mode=tools or auto_recall=false for normal turns.
When auto recall is enabled: recall_budget=low, recall_max_tokens=1024–2048, recall_max_input_chars=400–800.
retain_async=true, retain_every_n_turns > 1 for noisy channels.
Reserve hindsight_reflect for explicit synthesis, not every prompt.
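
The cheap settings above can be written down and linted before the pilot starts. The key names come from the provider controls listed in this paper; the dict layout, the checker, the value 4 for retain_every_n_turns, and treating unset keys as the expensive case are all assumptions, not the provider's actual config format or defaults.

```python
# Illustrative pilot settings; not the provider's real config file format.
PILOT_CONFIG = {
    "memory_mode": "tools",
    "auto_recall": False,
    "recall_budget": "low",
    "recall_max_tokens": 1024,
    "recall_max_input_chars": 800,
    "retain_async": True,
    "retain_every_n_turns": 4,  # hypothetical value; the paper only asks for > 1
}


def config_warnings(config):
    """Return warnings for settings this paper calls expensive."""
    warnings = []
    # Assumption: an unset auto_recall is treated as enabled, the worst case.
    injecting = (config.get("memory_mode") != "tools"
                 and config.get("auto_recall", True))
    if injecting:
        if config.get("recall_budget") != "low":
            warnings.append("auto recall is on: set recall_budget=low")
        # 4096 is the provider default cited in this paper.
        if config.get("recall_max_tokens", 4096) > 2048:
            warnings.append("auto recall is on: keep recall_max_tokens <= 2048")
    if not config.get("retain_async", False):
        warnings.append("set retain_async=true")
    if config.get("retain_every_n_turns", 1) <= 1:
        warnings.append("retain_every_n_turns should be > 1 on noisy channels")
    return warnings


print(config_warnings(PILOT_CONFIG))  # []
```

Running the same check against every channel's settings is a cheap way to catch a configuration that drifted back toward the defaults.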

Expensive configuration

Do not enable the naive “auto retain every turn + auto recall 4096 tokens + reflect prefetch” setup across Discord.

That creates two costs: write-side LLM extraction/consolidation, and read-side context inflation.

Path	Retrieval LLM?	Write LLM?	Context injected	Cost risk
Current SessionStart recall	No	Only later /reflect	~1500 chars cap	Low
Current explicit /reflect	Main agent reasoning	Learning creation	Intentional	Medium
Hindsight recall tool	No	Retain/consolidation elsewhere	Caller controls max_tokens	Low-medium
Hindsight auto context	No	Depends on auto_retain	Default 4096 tokens unless tuned	High if untuned
Hindsight reflect	Yes, agentic loop	May depend on retained data	Response + evidence	Medium-high
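
The read-side tax is plain multiplication. A rough sketch, assuming one injection per turn at the full budget, a hypothetical 40-turn session, and about four characters per token for the current 1500 character cap:

```python
def injected_tokens_per_session(turns, tokens_per_injection, every_turn=True):
    """Upper bound on memory tokens injected over one session."""
    injections = turns if every_turn else 1
    return injections * tokens_per_injection


TURNS = 40  # hypothetical session length

untuned = injected_tokens_per_session(TURNS, 4096)  # default auto context
tuned = injected_tokens_per_session(TURNS, 1024)    # low end of the cheap range
# Current SessionStart recall: one injection of ~1500 chars, ~375 tokens.
current = injected_tokens_per_session(TURNS, 1500 // 4, every_turn=False)

print(untuned, tuned, current)  # 163840 40960 375
```

These are ceilings, since recall can return less than its budget, but they show why the untuned default rates as high risk while the SessionStart cap rates as low.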

Recommendation

1st: Hybrid substrate (4.59)

2nd: Shadow pilot (4.42)

3rd (tie): Custom only (3.76)

3rd (tie): Hindsight only (3.76)

What I would do

Run Hindsight in shadow mode for 2 weeks on Motoko only, tools mode first.
Mirror selected session summaries/corrections into Hindsight with tags: agent:motoko, clan:lambda, type:correction, type:incident, type:pattern.
Build a small comparison harness: same query → current recall vs Hindsight recall → judge relevance, tokens, latency, correction inspectability.
If Hindsight wins ≥80% of temporal/entity queries and stays under the token budget, replace the custom retrieval backend.
Keep custom correction logging, discovery gossip, fleet patterns, and skill promotion. Those are not Hindsight’s job.
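
The comparison harness from the plan above could start as small as this. It is a sketch under stated assumptions: both recall backends and the judge are passed in as plain callables (hypothetical stand-ins for the real recall.py call, a Hindsight client call, and a human or LLM judge), and tokens are estimated at four characters each.

```python
import time


def estimate_tokens(text):
    """Rough token estimate: about four characters per token (assumption)."""
    return max(1, len(text) // 4)


def run_backend(recall, query):
    """Call one backend and record its output, token estimate, and latency."""
    start = time.perf_counter()
    text = recall(query)
    latency_s = time.perf_counter() - start
    return {"text": text, "tokens": estimate_tokens(text), "latency_s": latency_s}


def compare(queries, current_recall, hindsight_recall, judge):
    """Run each query through both backends.

    judge(query, current_text, hindsight_text) must return
    'current', 'hindsight', or 'tie'.
    """
    rows = []
    for query in queries:
        cur = run_backend(current_recall, query)
        hs = run_backend(hindsight_recall, query)
        rows.append({"query": query, "current": cur, "hindsight": hs,
                     "winner": judge(query, cur["text"], hs["text"])})
    return rows


def hindsight_win_rate(rows):
    """Fraction of queries the judge awarded to Hindsight."""
    if not rows:
        return 0.0
    return sum(row["winner"] == "hindsight" for row in rows) / len(rows)
```

Keeping the judge a separate callable lets the first runs use manual verdicts and later runs swap in an automated one without touching the harness. The ≥80% rule then becomes `hindsight_win_rate(rows) >= 0.80` over the temporal/entity query set.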

Decision rule

Bin custom retrieval only after Hindsight proves:

p95 recall latency is acceptable in the chosen provider mode (local or cloud), against a threshold fixed before the pilot.
Average injected memory < 1.5k tokens in normal chat.
Every memory used in a recommendation is traceable to source/citation.
Correction workflow can delete/update source facts and verify observation invalidation.
Fleet self-improvement artifacts remain inspectable and git-diffable somewhere.
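
The decision rule can be made mechanical so the cutover is not argued from memory. A sketch only: the function and argument names are hypothetical, the latency threshold is a parameter because this paper does not fix one, and the three non-numeric gates are passed in as results of manual checks.

```python
def p95(values):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def retrieval_cutover_ready(latencies_s, injected_tokens, max_p95_latency_s,
                            untraceable_memories, correction_checks_passed,
                            artifacts_git_diffable):
    """Evaluate the five gates; return (all_passed, per-gate results)."""
    checks = {
        "latency": p95(latencies_s) <= max_p95_latency_s,
        "tokens": sum(injected_tokens) / len(injected_tokens) < 1500,
        "traceable": untraceable_memories == 0,
        "correction": bool(correction_checks_passed),
        "artifacts": bool(artifacts_git_diffable),
    }
    return all(checks.values()), checks
```

Returning the per-gate results alongside the verdict shows which gate blocked the cutover, so a failure can be fixed rather than rerun.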

Pilot acceptance tests

Test	Pass condition	Why
Temporal query	“What changed after the heartbeat cutover?” returns ordered facts with event/learned time separation.	Validates Stevie’s temporal priority.
Correction query	Inject wrong fact, correct it, verify old observation is stale/refined or removed.	Validates memory correction, not just recall.
Token budget	Normal Discord turns inject ≤1500 tokens p95; explicit research can request more.	Avoids permanent memory tax.
Self-improvement loop	A correction still lands in corrections/patterns/skill pipeline, with Hindsight as evidence substrate only.	Preserves Lambda behavior.
Observability	Operator can inspect source memory, derived observation, proof count, trace, and usage within 60 seconds.	2am test.

Source notes

Current recall: ~/.claude/skills/recall/scripts/recall.py — GraphRAG wrapper, QMD BM25 booster, RRF fusion, confidence/recency/tag rerank, cache and logs.
Current SessionStart recall: ~/.claude/skills/recall/hooks/session_start_recall.py — project/branch/recent-commit query, top-3 injection, 1500 char cap.
Current reflect queue: ~/.claude/skills/reflect/hooks/precompact_reflect.py and sessionstart_drain_reflections.py — transcript queue, next-session LLM processing.
Hermes Hindsight provider: d/git/hermes-agent/plugins/memory/hindsight/__init__.py — cloud/local modes, auto retain/recall, tags, bank templates, tools/context/hybrid mode, prefetch, retain queue, session switch handling.
Hindsight docs: retain.md, retrieval.md, observations.mdx, reflect.mdx, performance.md, configuration.md from vectorize-io/hindsight.