Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion apps/memos-local-plugin/adapters/openclaw/tools.ts
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ import { Type, type Static } from "@sinclair/typebox";

import type { AgentKind, RuntimeNamespace, SkillId, TraceId } from "../../agent-contract/dto.js";
import type { MemoryCore } from "../../agent-contract/memory-core.js";
import { reflectionAsText } from "../../core/capture/types.js";

import { bridgeSessionId } from "./bridge.js";
import type {
Expand Down Expand Up @@ -242,7 +243,7 @@ export function registerOpenClawTools(api: OpenClawPluginApi, opts: ToolsOptions
episodeId: trace.episodeId,
ts: trace.ts,
value: trace.value,
reflection: clip(trace.reflection, bodyCap),
reflection: clip(reflectionAsText(trace.reflection) ?? undefined, bodyCap),
userText: clip(trace.userText, bodyCap),
toolCalls: trace.toolCalls.map((tc) => ({
name: tc.name,
Expand Down
200 changes: 102 additions & 98 deletions apps/memos-local-plugin/core/capture/ALGORITHMS.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,103 +26,102 @@ Edge cases:
The V7 spec keeps all sub-agent traces under the root episode so
`R_task` backprops correctly up the decision tree.

## V7 §3.2.2Reflection extraction
## V7 §3.2 — Windowed binary path-relevance scoring

Procedure `ExtractReflection(τ_t)`:
The original per-step reflection scorer (`reflection-extractor` →
`reflection-synth` → `alpha-scorer`) was removed in the 2026-05 redesign
(see [docs/superpowers/specs/2026-05-27-l1-batch-reflection-binary-design.md](../../docs/superpowers/specs/2026-05-27-l1-batch-reflection-binary-design.md)).
Reflection no longer produces free-form natural-language text. Instead, every
step gets a fixed-label path relevance judgement and an aligned numeric `α`:

```
if τ_t.meta.reflection is non-empty:
return τ_t.meta.reflection # adapter-native
elif regex_match(τ_t.agentText):
return cleaned_match(…) # inline reasoning
elif config.synthReflections:
return LLM(Synthesis, τ_t) # synthesized
else:
return ∅
α_t ∈ {0, 0.5, 1}
reflection_t ∈ { "PIVOTAL", "RELATED", "IRRELEVANT", "RELATED_DEFAULT" }
```

Implemented by `reflection-extractor.ts` (steps 1-2) +
`reflection-synth.ts` (step 3). Prompt for synthesis is minimal and
temperature=0.1 — we want a terse, agent-voiced explanation, never a
judgment.
with the semantics:
- `PIVOTAL` → `α_t = 1` —关键转折点。
- `RELATED` → `α_t = 0.5` —相关但非关键路径。
- `IRRELEVANT` → `α_t = 0` —无关/偏航路径。
- `RELATED_DEFAULT` → `α_t = 0.5` —missing-window 或 episode fallback 的安全默认值。

## V7 §3.2.3 — α scoring
### Window topology

V7 defines the "reflection utility" α via a four-axis rubric:
Windows are owned by `runEpisodeBatchScoring` in `capture.ts`. Two passes:

```
α_t = judge(state_t, action_t, outcome_t, reflection_t)
= weighted_mean(faithfulness, causal_insight,
transferability, concreteness)
usable_t = 1 iff α_t ≥ 0.4 AND non_tautological(reflection_t)
if usable_t = 0:
α_t ← 0 # equation 5: unusable reflections cannot skew backprop
```

The judge is `REFLECTION_SCORE_PROMPT` (see
`core/llm/prompts/reflection.ts`), which returns a JSON object. Our
implementation clamps α to [0, 1], applies the `usable` mask, and
guarantees finite values.
| Pass | `windowSize` | `overlap` | per-window retries |
|---------|--------------|-----------|--------------------|
| primary | 20 | 3 | 1 |
| degrade | 9 | 3 | 2 |

When `alphaScoring=false` OR the LLM fails:

```
α_t = 0.5 # neutral; Phase 7 backprop still runs, half-weighted
usable_t = 1
```
Stride is `windowSize − overlap` (17 for primary, 6 for degrade). The
last window of either pass is allowed to be shorter than `windowSize`.
`buildWindows(length, windowSize, overlap)` returns half-open `[start,
end)` pairs in ascending order.

This preserves the "graceful degradation" property V7 asks for: a local
setup without a paid LLM still accrues L1 traces with meaningful
priority once reward arrives.
### Merge rule

## V7 §3.2 batched variant — `batch-scorer.ts`
`mergeWindowScores` aggregates per-window results by absolute
`global_idx = win.start + i`. Per-step combination is:

The per-step path (`reflection-synth.ts` + `alpha-scorer.ts`) issues 2N
LLM calls per N-step episode. `batch-scorer.ts` collapses them into ONE:
Window overlap 合并按标签优先级(已替代旧的二值 merge 口径):

```
inputs = [{idx, state, action, outcome, reflection, synth_allowed}, …]
↓ BATCH_REFLECTION_PROMPT
outputs = {scores: [{idx, reflection_text, alpha, usable, reason}, …]}
PIVOTAL > RELATED / RELATED_DEFAULT > IRRELEVANT
```

Dispatch (in `capture.ts`):

| `cfg.batchMode` | `cfg.batchThreshold` | behavior |
|-------------------|----------------------|----------|
| `per_step` | (ignored) | legacy: 2N calls |
| `per_episode` | (ignored) | always batch |
| `auto` (default) | `12` | batch when `N ≤ 12`; else per-step |

The dispatcher also refuses to batch when no LLM is wired — same fallback
path as missing-LLM in per-step mode.

Why batched mode tends to produce **better** reflections (not just cheaper):
the prompt sees the full episode timeline including the final outcome, so
it can credit-attribute across steps. V7 §3.2.3's `causal_insight` and
`transferability` axes both benefit from the wider context. Per-step
synth, in contrast, can only rationalize from local `(s, a, o)`.

Failure handling:
Numeric `alpha` follows final label mapping:

- LLM throws / facade gives up after `malformedRetries=1` → capture
catches in `runBatchScoring`, surfaces a `{stage: "batch"}` warning,
and the per-step path runs as a fallback.
- Validator rejects on length mismatch, missing/non-numeric `alpha`,
non-boolean `usable`, non-string `reflection_text`. Same fallback.

Bookkeeping (`CaptureResult.llmCalls`):
```
PIVOTAL=1, RELATED=0.5, RELATED_DEFAULT=0.5, IRRELEVANT=0
```

- `batchedReflection`: 0 or 1 per episode (1 on a successful batch).
- `reflectionSynth` / `alphaScoring`: only nonzero when the per-step path
ran (either selected directly, or as fallback after a batch failure).
> 旧口径(`alpha=1` 覆盖 `alpha=0`,且 missing-window 默认 `alpha=1`)已废弃。

### Failure ladder

1. **Per-window** — up to `maxRetries+1` calls (1 attempt + retries).
A malformed payload from the LLM is one of: array length ≠ window
length, `relevance` outside {IRRELEVANT, RELATED, PIVOTAL}, or
missing `idx`. The validator in
`batch-scorer.ts :: validateBatchPayload` raises
`LLM_OUTPUT_MALFORMED` and the facade's own malformed-retry triggers
once before our outer retry kicks in. A missing/empty `reason` is
NOT malformed — the entry is kept and we emit a `batch.reason_missing`
warn instead, so a stray reason omission never costs the whole
episode its (relevance, alpha) signal.
2. **Window pass** — if every window in the primary pass eventually
succeeded, we accept its results. Otherwise we discard the partial
primary results and re-run with the degrade pass over the whole
episode.
3. **Episode-wide fallback** — if the degrade pass also has any failed
window, every step in the episode is overwritten with
`{ alpha: 0.5, text: "RELATED_DEFAULT", reason: "FALLBACK_RELATED_DEFAULT" }`
and we log `reflection_fallback_related_default` at error level with
`{ degraded: true, episodeId, stepsCount, failedWindows }`.
4. **No reflect LLM wired** — short-circuits straight to the
episode-wide fallback (`reason: "no_llm"`).

The downstream reward / L2 / Skill chain runs in every case; the
fallback is meant to keep the pipeline available, not to gate it.

### Bookkeeping (`CaptureResult.llmCalls`)

- `batchedReflection` — number of successful batch calls this episode.
One per window that actually returned a usable payload (so a long
episode can be >1, and the degrade pass can add more).
- `reflectionSynth` / `alphaScoring` — permanently `0`. Retained on the
`CaptureResult` interface for backward-compatible analytics consumers.

### Stable prompt fingerprint

Stable prompt fingerprint:
```
op = capture.reflection.batch.v<BATCH_REFLECTION_PROMPT.version>
```

- `op = capture.reflection.batch.v3` (see `BATCH_OP_TAG` constant; version
matches `BATCH_REFLECTION_PROMPT.version`).
Bumping `BATCH_REFLECTION_PROMPT.version` changes the op tag so audit
rows remain attributable.
Bumping `BATCH_REFLECTION_PROMPT.version` in
`core/llm/prompts/reflection.ts` rolls the op tag automatically so audit
rows stay attributable to a specific prompt revision.

## V7 §3.2.4 — Reward wiring

Expand All @@ -131,8 +130,8 @@ Capture does NOT compute `r_step` or `V_t`. It writes:
```
trace.value = 0 # V_t will be filled by Phase 7
trace.r_human = null # assigned on feedback (Phase 7 R_human path)
trace.alpha = α_t # from §3.2.3
trace.priority = 0 # recomputed after backprop
trace.alpha = α_t # {0, 0.5, 1} from relevance mapping
trace.priority = 0.5 # seeded so retrieval can find it pre-reward
```

Phase 7 updates these via `tracesRepo.updateScore` once the
Expand All @@ -148,7 +147,7 @@ priority(f¹_t) ∝ max(V_t, 0) · decay(Δt)
- `decay(Δt)` = half-life ≈ 30 days (Phase 7 constant)
- `V_t` = backpropagated value from the R_task + step rewards (Phase 7)

Capture initialises `priority=0`. The formula activates in
Capture initialises `priority=0.5`. The formula activates in
`core/reward/backprop.ts` (Phase 7).

## Text & vector conventions
Expand All @@ -171,26 +170,31 @@ marker. Rationale:
- Tail keeps "what the agent concluded with" — often the most useful
sentence for Tier 2 recall.
- Dropping the middle rarely hurts (that's usually thinking + tool
rationales that the reflection already summarises).
rationales that the windowed scorer already collapses into a binary
judgement).

Per-tool-call outputs use the same clamp with `maxToolOutputChars`.

## Concurrency

Reflection + α stages iterate per-step. We run them with
`config.capture.llmConcurrency` workers (default 4). The embedding stage
uses the embedder's own batching — one call for ALL steps.

Typical budget for a 10-step episode with alpha scoring on and an
external LLM: 10 α calls ÷ 4 workers ≈ 3 batches, plus one embed call.
Wall clock usually 3-10s on a mid-tier OpenAI-compat endpoint.

## Stable prompt fingerprints

Every LLM call carries:
- `op = capture.alpha.reflection.score.v1` (alpha scorer)
- `op = capture.reflection.synth` (reflection synth)

Bumping `REFLECTION_SCORE_PROMPT.version` in `core/llm/prompts/reflection.ts`
changes the op tag automatically, so historical α values remain
attributable to their scoring prompt generation.
The windowed scorer is sequential per episode (windows run in order,
not in parallel) because the merge rule benefits from short feedback
loops on failures — a failing primary pass is detected before the
degrade pass starts. Summariser and embedder stages still use
`config.capture.llmConcurrency` workers (default 4).

Typical budget for a 60-step episode with the primary pass succeeding:
`ceil((60 - 3) / 17) = 4` batch calls, plus one embed call. Wall clock
is dominated by the batch latency of the reflect model.

## Downstream consumers and the enum reflection field

`traces.reflection` is now one of `PIVOTAL | RELATED | IRRELEVANT |
RELATED_DEFAULT` (plus legacy free-form text from pre-2026-05 traces).
Downstream modules that previously fed the reflection string into LLM
prompts, error-signature heuristics, or keyword blobs use the
`reflectionAsText` helper exported from `core/capture/types.ts` to
filter the three fixed labels back to `null`. That keeps the L2
signature bucket, L2/L3 induction prompts, skill crystallisation /
verification, feedback evidence, and feedback-builder notes from
treating `RELATED_DEFAULT` as natural language.
Loading