Options Paper · memory architecture decision

Recall/Reflect architecture vs Hindsight memory

Decision support for whether Lambda should bin the custom reflect/recall plugin and standardize on Vectorize Hindsight.

Date: 2026-06-08
Author: Motoko 👁️
Scope: architecture · features · token cost · self-improvement
Sourced draft

Unbiased call

Do not bin the custom reflect/recall system wholesale yet. Adopt Hindsight as a shadow/hybrid memory substrate first. Keep the custom self-improvement loop, correction ledger, fleet learnings, and human-editable observability layer.

Best target state: Hindsight for temporal/entity/graph recall; custom Lambda reflect pipeline for policy, correction capture, skill promotion, and ops-grade auditability. If the pilot proves out on token cost and correction visibility, retire the custom retrieval backend, but not the self-improvement control plane.

01

Problem

What Stevie cares about

  • Self-improvement matters more than generic long-term memory.
  • Temporal memory matters: what happened when, what changed, what was learned after correction.
  • Observability/correction must be first-class, not buried in opaque embeddings.
  • Token usage matters. Memory that auto-injects 4k tokens every turn can become a tax.

The real decision

This is not “custom memory vs packaged memory.” It is control-plane vs substrate.

Custom reflect/recall is a Lambda-specific learning control plane. Hindsight is a general memory engine with stronger retrieval, temporal modeling, entity graph, observations, and API/UI surface.

02

Evaluation criteria

Weights are tuned to your stated priorities. Each criterion is scored 1–5, and the weighted total is out of 5.0.

  • Self-improvement fit (weight 0.25): corrections → patterns → skills → behavior
  • Temporal fidelity (weight 0.22): session lineage, event time, learned time, recency
  • Observability & correction (weight 0.22): inspect, edit, delete, replay, trace evidence
  • Token efficiency (weight 0.20): retrieval budget, auto-injection behavior, LLM calls
  • Maintenance drag (weight 0.08): local code, bugs, upgrades, integration work
  • Portability (weight 0.03): multi-agent, multi-profile, cloud/local modes

03

Architecture: current recall/reflect

Current system is file-first and hook-driven. It treats memory as learning artifacts, not a general conversation database.

Recall path: SessionStart (project/branch query) → recall.py (GraphRAG + QMD/BM25) → RRF + rerank (confidence × recency × tags) → additionalContext (top 3, max 1500 chars)
Reflect path: PreCompact hook (queue transcript path) → pending_reflections.jsonl (producer/consumer handoff) → Next SessionStart (agent runs /reflect) → learnings KB (markdown + graph cache)
Artifacts: corrections.md · patterns.jsonl/md · discoveries.jsonl · skills pipeline

Mechanics found in source

  • recall.py wraps ~/.learnings/cli/learnings for GraphRAG/vector search.
  • It fans out to QMD BM25 in parallel and fuses results via Reciprocal Rank Fusion.
  • It reranks by confidence, recency, and tag overlap, then renders compact markdown or JSON.
  • SessionStart hook injects top-3 learnings with a 1500 character cap.
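The fuse, rerank, and cap steps listed above can be sketched in a few lines. This is a minimal illustration, not the code in recall.py: the `Learning` fields, the RRF constant `k=60`, the 30-day half-life, and the way the three rerank factors are multiplied together are all assumptions made for the example. Only the top-3 and 1500-character limits come from the source.

```python
from dataclasses import dataclass, field


@dataclass
class Learning:
    # Hypothetical record shape; the real learnings schema may differ.
    id: str
    text: str
    confidence: float            # 0..1
    age_days: float              # days since the learning was archived
    tags: set = field(default_factory=set)


def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


def rerank(scores, learnings, query_tags, half_life_days=30.0):
    """Scale the fused score by confidence, recency decay, and tag overlap."""
    ranked = []
    for doc_id, fused in scores.items():
        item = learnings[doc_id]
        recency = 0.5 ** (item.age_days / half_life_days)
        overlap = len(item.tags & query_tags) / max(len(query_tags), 1)
        final = fused * item.confidence * recency * (1.0 + overlap)
        ranked.append((final, doc_id))
    ranked.sort(reverse=True)
    return [doc_id for _, doc_id in ranked]


def render(ids, learnings, top_n=3, max_chars=1500):
    """Compact markdown for injection: top_n items, hard character cap."""
    lines = [f"- {learnings[i].text}" for i in ids[:top_n]]
    return "\n".join(lines)[:max_chars]


if __name__ == "__main__":
    kb = {
        "a": Learning("a", "Rebase before pushing to main.", 0.9, 2.0, {"git"}),
        "b": Learning("b", "Heartbeat cutover changed the cron path.", 0.7, 40.0, {"ops"}),
    }
    fused = rrf_fuse([["a", "b"], ["b", "a"]])   # graph hits, BM25 hits
    print(render(rerank(fused, kb, {"git"}), kb))
```

A hard character cap is cheap to enforce but can cut a learning mid-sentence; a production renderer would probably drop whole items instead.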

Self-improvement path

  • PreCompact hook cannot run an LLM, so it queues transcript metadata to ~/.reflect/pending_reflections.jsonl.
  • Next SessionStart surfaces pending work and instructs the agent to run /reflect.
  • Corrections, discoveries, patterns, and skills remain file-backed and human-editable.
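The producer/consumer handoff described above can be sketched as an append-only JSONL queue. This is an illustration under assumptions: the record fields (`transcript_path`, `session_id`) and both function names are hypothetical, and the demo writes to a temporary directory rather than ~/.reflect.

```python
import json
import tempfile
from pathlib import Path


def enqueue_reflection(queue_path, transcript_path, session_id):
    """Producer (PreCompact side): append one JSON line. No LLM call needed."""
    record = {"transcript_path": str(transcript_path), "session_id": session_id}
    queue_path.parent.mkdir(parents=True, exist_ok=True)
    with queue_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def drain_reflections(queue_path):
    """Consumer (next SessionStart): read pending records, then clear the queue."""
    if not queue_path.exists():
        return []
    pending = []
    for line in queue_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            pending.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # skip a torn line rather than fail the hook
    queue_path.write_text("", encoding="utf-8")
    return pending


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        queue = Path(tmp) / "pending_reflections.jsonl"
        enqueue_reflection(queue, "/tmp/session-1.jsonl", "session-1")
        for item in drain_reflections(queue):
            print("pending /reflect for", item["transcript_path"])
```

The sketch ignores concurrent writers; a real hook that can fire while a drain is in progress would need file locking or an atomic rename.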
04

Architecture: Hindsight

Hindsight is a memory engine. It turns content into structured facts, entities, graph links, temporal metadata, observations, and reflect responses with citations.

Write path: retain() (content + context) → LLM extraction (facts, entities, time) → Memory bank (Postgres / Oracle) → background consolidation (observations + proof count + freshness)
Read path: recall() (semantic · BM25 · graph · temporal · rerank) → token budget (max_tokens + budget)
Reflect path: reflect() (agentic loop: mental models → obs → facts) → response (citations · trace · usage)
Hermes provider plugin: cloud / local · embedded / external · auto retain / auto recall · tools / context / hybrid

Hindsight core

  • retain() extracts structured facts, entity resolution, graph links, causal links, event time and learned time.
  • recall() runs semantic, keyword, graph, and temporal search, fuses results, reranks, and stops at a token budget.
  • reflect() runs an agentic loop over mental models, observations, raw facts, and expand tools.
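Two ideas in the list above, separate event time and learned time, and recall that stops at a token budget, can be illustrated without touching Hindsight itself. The sketch below is not the Hindsight API or schema: the `Fact` record, `recall_window`, and the four-characters-per-token estimate are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Fact:
    # Hypothetical record; Hindsight's real fact schema is not reproduced here.
    text: str
    event_time: date     # when the thing happened
    learned_time: date   # when the memory system recorded it
    score: float         # relevance from an upstream ranker (assumed)


def estimate_tokens(text):
    """Crude estimate of about 4 characters per token; not a real tokenizer."""
    return max(1, len(text) // 4)


def recall_window(facts, start, end, max_tokens):
    """Keep facts whose event_time is in [start, end], best first, within budget."""
    in_window = [f for f in facts if start <= f.event_time <= end]
    in_window.sort(key=lambda f: f.score, reverse=True)
    picked, used = [], 0
    for fact in in_window:
        cost = estimate_tokens(fact.text)
        if used + cost > max_tokens:
            break  # stop at the budget instead of truncating a fact
        picked.append(fact)
        used += cost
    return picked, used


if __name__ == "__main__":
    facts = [
        Fact("Heartbeat moved to the new scheduler.", date(2026, 5, 3), date(2026, 5, 4), 0.9),
        Fact("Old cron entry was removed.", date(2026, 5, 6), date(2026, 5, 20), 0.6),
    ]
    picked, used = recall_window(facts, date(2026, 5, 1), date(2026, 5, 31), max_tokens=64)
    for f in picked:
        print(f.event_time, "(learned", f.learned_time, ")", f.text)
    print("tokens used:", used)
```

Filtering on `event_time` while still carrying `learned_time` is what lets a query distinguish "what happened in May" from "what we only found out about later".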

Hermes integration

  • Provider modes: cloud, local embedded, local external.
  • Memory modes: context, tools, hybrid.
  • Controls: auto_retain, auto_recall, retain_every_n_turns, retain_async, recall_max_tokens, recall_budget, tags, bank templates.
05

Options

The decision is not binary. Four realistic paths.

Option A

Keep custom only

Leave reflect/recall as the source of truth. No Hindsight adoption except maybe manual research.

Improve 4 · Temporal 2 · Observe 5 · Tokens 4 · Maint 2
Weighted: 3.76 / 5

Option B

Adopt Hindsight only

Replace custom reflect/recall and depend on Hindsight retain/recall/reflect plus its UI/API.

Improve 3 · Temporal 5 · Observe 4 · Tokens 2 · Maint 4
Weighted: 3.76 / 5

Option C

Hybrid substrate

Hindsight stores/retrieves temporal memory. Custom Lambda reflect remains the learning/control plane.

Improve 5 · Temporal 5 · Observe 4 · Tokens 4 · Maint 3
Weighted: 4.59 / 5

Option D

Shadow pilot first

Run Hindsight silently beside the current system. Compare recall quality, tokens, corrections, and latency before cutover.

Improve 4 · Temporal 4 · Observe 5 · Tokens 5 · Maint 2
Weighted: 4.42 / 5
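A weighted total of this kind is a plain weighted sum of the per-criterion scores. The sketch below uses the six weights from the evaluation criteria; the dictionary keys and the `weighted_total` helper are names chosen for the example. The option cards do not print a Portability score, so the function requires one as input and the sketch makes no attempt to reproduce the printed totals.

```python
# Weights from the evaluation criteria section; they sum to 1.0.
WEIGHTS = {
    "improve": 0.25,
    "temporal": 0.22,
    "observe": 0.22,
    "tokens": 0.20,
    "maint": 0.08,
    "portability": 0.03,
}


def weighted_total(scores):
    """Weighted sum of 1-5 scores; every criterion must be supplied."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Illustrative scores only, not one of the four options.
    example = {"improve": 5, "temporal": 5, "observe": 5,
               "tokens": 5, "maint": 5, "portability": 5}
    print(round(weighted_total(example), 2))  # 5.0, the maximum possible
```

Because Portability carries a weight of 0.03, it can move a total by at most 0.15 across the full 1 to 5 range.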
06

Full feature comparison

| Feature | Current reflect/recall | Hindsight | Winner | Notes |
| --- | --- | --- | --- | --- |
| Primary purpose | Fleet self-improvement and learning retrieval | General agent memory engine | Depends | Different jobs. Current system encodes Lambda process. Hindsight encodes memory mechanics. |
| Storage substrate | Markdown/YAML/JSONL, ~/.learnings, ~/.reflect, graph cache, QMD docs | Memory banks on PostgreSQL/Oracle or embedded/local/cloud | Hindsight | Hindsight has a coherent database model. Custom is inspectable but fragmented. |
| Automatic write path | PreCompact queues transcripts; agent later runs /reflect; discoveries/corrections append explicitly | Hermes provider can auto-retain every N turns; async retain; document/session metadata | Hindsight | Hindsight is stronger for ordinary memory ingestion. Current is safer for intentional learning capture. |
| Fact extraction | LLM-driven /reflect creates learnings/patterns by instruction; no universal fact schema | LLM extracts structured facts, speaker perspective, entities, time, causal links | Hindsight | Hindsight wins factual memory. Custom wins policy-specific reflection. |
| Entity resolution | Mostly whatever GraphRAG/learnings index infers; not exposed as a first-class correction surface | Explicit entity recognition/resolution and graph links | Hindsight | Major Hindsight advantage. |
| Temporal model | Recency via archive timestamp half-life; session/queue timestamps; branch/project query context | Tracks event time and learned time; temporal recall parses date windows and spreads across period | Hindsight | This maps directly to your “temporal matters” requirement. |
| Search modes | GraphRAG via learnings CLI + QMD BM25; RRF fusion; confidence/recency/tag rerank | Semantic + keyword + graph + temporal; RRF fusion; cross-encoder rerank; boosts for recency/proof/time | Hindsight | Current is credible. Hindsight is deeper and productized. |
| Token budgeting | Hard character caps: SessionStart top 3, 1500 chars; explicit recall default 2000 chars | max_tokens budget; budget low/mid/high; recall default 4096 tokens in Hermes provider | Custom by default | Hindsight has better knobs. Current defaults are cheaper. Misconfigured Hindsight can tax every turn. |
| Read-path LLM use | recall retrieval itself no LLM; /reflect uses main agent when asked/queued | recall() no LLM; reflect() uses LLM loop; retain/consolidation use LLM | Tie | For retrieval only, both can be cheap. Hindsight write path costs more. |
| Self-improvement workflow | Corrections → discoveries → patterns → skills; fleet rules and promotion pipeline exist | Observations evolve with evidence; directives and mental models exist, but not Lambda correction/skill workflow | Custom | This is the strongest reason not to bin custom. |
| Observation/consolidation | Patterns are explicit, but promotion is agent/process-driven | Automatic observations with proof count, freshness, evidence quotes, contradiction handling | Hindsight | Hindsight’s observations are closer to a memory substrate; custom patterns are closer to operational policy. |
| Correction/deletion semantics | Edit files directly; corrections logged in markdown; archive queues manually | Delete memory/document/bank; derived observations invalidated and re-consolidated; clear observations endpoint | Hindsight | Custom is more transparent; Hindsight is more internally consistent after deletion. |
| Observability | Plain files, JSONL logs, recall cache/log, forensics breadcrumbs; easy grep/diff | Control plane/UI, API, traces/tool calls in reflect, usage metrics, Prometheus metrics | Tie | Different observability. Custom is Unix-visible. Hindsight is product-visible. |
| Citations/evidence | Learning snippets and markdown sources; not always proof-counted | Reflect returns based_on, citations, trace, usage; observations carry source memories and quotes | Hindsight | Important if you want memory to defend itself. |
| Human editability | Excellent. Edit markdown/JSONL. Git diffable. | Good via API/UI, but database-backed and less “open a file and patch.” | Custom | For emergency correction, files are hard to beat. |
| Multi-agent isolation | Depends on paths/profile convention and fleet discipline | Bank IDs, tags, bank templates: profile/workspace/platform/user/session | Hindsight | Hindsight provider code already supports dynamic bank ID templates and tags. |
| Operational maturity | Local custom scripts; fragile but owned | Public repo, client SDKs, local/cloud modes, metrics; still new and dependency-heavy | Hindsight | Hindsight lowers maintenance, but creates vendor/project dependency. |
| Local/offline | Works if local learnings/QMD/GraphRAG stack exists | Local embedded/external supported; can use local LLM/embeddings/reranker | Tie | Both need dependency hygiene. |
| Failure mode | Silent no-op by design; empty context if CLI/hook fails | Provider logs failures; retain queue can drop on shutdown timeout; daemon/client failure modes | Depends | Current fails quiet. Hindsight fails richer but has more moving parts. |

07

Token economics

Cheap configuration

Use Hindsight in tools-only or low-injection mode during pilot:

  • memory_mode=tools or auto_recall=false for normal turns.
  • When auto recall is enabled: recall_budget=low, recall_max_tokens=1024–2048, recall_max_input_chars=400–800.
  • retain_async=true, retain_every_n_turns > 1 for noisy channels.
  • Reserve hindsight_reflect for explicit synthesis, not every prompt.
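The cheap settings above can be captured as a config plus a small lint that flags drift toward the expensive setup. This is a sketch under assumptions: the key names and the 4096-token default come from this paper, but the dict layout, the other `.get()` fallbacks, the `cost_warnings` helper, and the value `retain_every_n_turns=3` are illustrative and not the Hermes provider's actual config format.

```python
# Pilot settings taken from the cheap-configuration list; layout is assumed.
PILOT_CONFIG = {
    "memory_mode": "tools",
    "auto_recall": False,
    "recall_budget": "low",
    "recall_max_tokens": 1024,
    "recall_max_input_chars": 400,
    "retain_async": True,
    "retain_every_n_turns": 3,   # any value > 1; 3 is illustrative
}


def cost_warnings(cfg):
    """Return human-readable warnings for settings likely to inflate token cost."""
    warnings = []
    if cfg.get("auto_recall", False):
        if cfg.get("recall_budget") != "low":
            warnings.append("auto_recall is on but recall_budget is not 'low'")
        # 4096 is the provider default quoted in this paper.
        if cfg.get("recall_max_tokens", 4096) > 2048:
            warnings.append("recall_max_tokens exceeds 2048")
        if cfg.get("recall_max_input_chars", 0) > 800:
            warnings.append("recall_max_input_chars exceeds 800")
    if not cfg.get("retain_async", False):
        warnings.append("retain_async is off; retains may block the turn")
    if cfg.get("retain_every_n_turns", 1) <= 1:
        warnings.append("retaining every turn; raise retain_every_n_turns")
    return warnings


if __name__ == "__main__":
    for line in cost_warnings(PILOT_CONFIG) or ["pilot config looks cheap"]:
        print(line)
```

Running a check like this in CI or at provider startup makes an accidental switch to hybrid mode with default budgets visible before it shows up on the bill.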

Expensive configuration

Do not enable the naive “auto retain every turn + auto recall 4096 tokens + reflect prefetch” setup across Discord.

That creates two costs: write-side LLM extraction/consolidation, and read-side context inflation.
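The read-side half of that cost is simple arithmetic. The sketch below compares worst-case injected context over one session, assuming a 50-turn session and roughly four characters per token; both numbers are assumptions for illustration, and the per-turn figures are budget ceilings, not measured usage.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb, not a tokenizer


def session_read_tokens(turns, per_turn_tokens, session_start_tokens=0):
    """Upper bound on memory tokens injected across one session."""
    return session_start_tokens + turns * per_turn_tokens


TURNS = 50  # hypothetical session length

# Current system: one SessionStart injection capped at 1500 characters.
current = session_read_tokens(TURNS, 0, session_start_tokens=1500 // CHARS_PER_TOKEN)
# Hindsight auto context at the 4096-token default, filled on every turn.
untuned = session_read_tokens(TURNS, 4096)
# Hindsight auto context tuned down to 1024 tokens per turn.
tuned = session_read_tokens(TURNS, 1024)

print(current)  # 375
print(untuned)  # 204800
print(tuned)    # 51200
```

Even the tuned ceiling is two orders of magnitude above the SessionStart cap, which is why tools mode is the safer default for the pilot.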

| Path | Retrieval LLM? | Write LLM? | Context injected | Cost risk |
| --- | --- | --- | --- | --- |
| Current SessionStart recall | No | Only later /reflect | ~1500 chars cap | Low |
| Current explicit /reflect | Main agent reasoning | Learning creation | Intentional | Medium |
| Hindsight recall tool | No | Retain/consolidation elsewhere | Caller controls max_tokens | Low-medium |
| Hindsight auto context | No | Depends on auto_retain | Default 4096 tokens unless tuned | High if untuned |
| Hindsight reflect | Yes, agentic loop | May depend on retained data | Response + evidence | Medium-high |

08

Recommendation

  • 1st: Hybrid substrate, 4.59
  • 2nd: Shadow pilot, 4.42
  • 3rd (tie): Custom only, 3.76
  • 3rd (tie): Hindsight only, 3.76

What I would do

  1. Run Hindsight in shadow mode for 2 weeks on Motoko only, tools mode first.
  2. Mirror selected session summaries/corrections into Hindsight with tags: agent:motoko, clan:lambda, type:correction, type:incident, type:pattern.
  3. Build a small comparison harness: same query → current recall vs Hindsight recall → judge relevance, tokens, latency, correction inspectability.
  4. If Hindsight wins ≥80% of temporal/entity queries and stays under the token budget, replace the custom retrieval backend.
  5. Keep custom correction logging, discovery gossip, fleet patterns, and skill promotion. Those are not Hindsight’s job.
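The comparison harness in step 3 could start as small as the sketch below. Everything here is hypothetical scaffolding: the two recall callables stand in for the current recall path and a Hindsight recall call, the judge is any function returning "current", "hindsight", or "tie", and the token estimate is a crude four-characters-per-token guess. Only the 80% win threshold and the 1500-token budget come from this paper.

```python
import time
from dataclasses import dataclass


@dataclass
class RecallResult:
    backend: str
    text: str
    latency_ms: float
    tokens: int


def estimate_tokens(text):
    """Crude estimate of about 4 characters per token; not a real tokenizer."""
    return max(1, len(text) // 4)


def run_backend(name, recall_fn, query):
    """Time one recall call and size its output."""
    start = time.perf_counter()
    text = recall_fn(query)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return RecallResult(name, text, latency_ms, estimate_tokens(text))


def compare(queries, current_fn, hindsight_fn, judge_fn, token_budget=1500):
    """Same query through both backends; judge_fn names the more relevant one."""
    rows, wins, over_budget = [], 0, 0
    for query in queries:
        cur = run_backend("current", current_fn, query)
        hs = run_backend("hindsight", hindsight_fn, query)
        winner = judge_fn(query, cur.text, hs.text)
        if winner == "hindsight":
            wins += 1
        if hs.tokens > token_budget:
            over_budget += 1
        rows.append({"query": query, "winner": winner, "current": cur, "hindsight": hs})
    win_rate = wins / len(queries) if queries else 0.0
    return {
        "win_rate": win_rate,
        "over_budget": over_budget,
        "passes": win_rate >= 0.80 and over_budget == 0,
        "rows": rows,
    }
```

Keeping the judge as a pluggable function lets the pilot begin with human labels and switch to an automated judge later without touching the harness.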

Decision rule

Bin custom retrieval only after Hindsight proves:

  • p95 recall latency acceptable on local/cloud mode.
  • Average injected memory < 1.5k tokens in normal chat.
  • Every memory used in a recommendation is traceable to source/citation.
  • Correction workflow can delete/update source facts and verify observation invalidation.
  • Fleet self-improvement artifacts remain inspectable and git-diffable somewhere.
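The two measurable conditions in this rule, p95 latency and average injected tokens, can be checked mechanically from pilot logs. The sketch below assumes the pilot records one latency and one injected-token count per recall; the function names are hypothetical, and the latency threshold is left as a required parameter because the paper does not state one.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def gate(latencies_ms, injected_tokens, max_p95_latency_ms, max_avg_tokens=1500):
    """Evaluate the latency and token conditions of the decision rule."""
    p95 = percentile(latencies_ms, 95)
    avg = sum(injected_tokens) / len(injected_tokens)
    return {
        "p95_latency_ms": p95,
        "avg_injected_tokens": avg,
        "latency_ok": p95 <= max_p95_latency_ms,
        "tokens_ok": avg < max_avg_tokens,
    }


if __name__ == "__main__":
    # Made-up samples purely to show the call shape.
    report = gate([120, 140, 95, 300, 110], [800, 1200, 600, 1400, 900],
                  max_p95_latency_ms=500)
    print(report)
```

The remaining conditions (traceability, observation invalidation, git-diffable artifacts) are qualitative and still need a human check.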
09

Pilot acceptance tests

| Test | Pass condition | Why |
| --- | --- | --- |
| Temporal query | “What changed after the heartbeat cutover?” returns ordered facts with event/learned time separation. | Validates Stevie’s temporal priority. |
| Correction query | Inject wrong fact, correct it, verify old observation is stale/refined or removed. | Validates memory correction, not just recall. |
| Token budget | Normal Discord turns inject ≤1500 tokens p95; explicit research can request more. | Avoids permanent memory tax. |
| Self-improvement loop | A correction still lands in corrections/patterns/skill pipeline, with Hindsight as evidence substrate only. | Preserves Lambda behavior. |
| Observability | Operator can inspect source memory, derived observation, proof count, trace, and usage within 60 seconds. | 2am test. |

10

Source notes