SII-MemQ/MemQ

MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs

Note: This repository is also maintained at jwliao-ai/MemQ. If you find this work useful, please consider starring both repositories!

arXiv | License: MIT | Python 3.9+

Episodic memory allows LLM agents to accumulate and retrieve experience, but current methods treat each memory independently—evaluating retrieval quality in isolation without accounting for the dependency chains through which memories enable the creation of future memories.

MemQ applies TD(λ) eligibility traces to memory Q-values, propagating credit backward through a provenance DAG that records which memories were retrieved when each new memory was created. Credit weight decays as (γλ)^d with DAG depth d, replacing temporal distance with structural proximity.
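
As a rough illustration of the (γλ)^d weighting, the sketch below walks a toy provenance DAG breadth-first and applies a decayed update to each ancestor. The function name `propagate_credit`, the `parents` dictionary layout, the choice to treat the directly retrieved memories as depth 0, and the rule of updating each ancestor once at its shortest depth are all assumptions made for this example, not the repository's API.

```python
from collections import deque


def propagate_credit(q, parents, retrieved, td_error,
                     alpha=0.3, gamma=0.5, lambd=0.8):
    """Spread one TD error backward through a provenance DAG.

    q         : dict memory_id -> Q-value (updated in place)
    parents   : dict memory_id -> ids of memories retrieved when it was created
    retrieved : memory ids used for the current task (assumed to be depth 0)
    Returns a dict memory_id -> depth at which the memory was reached.
    """
    depth = {m: 0 for m in retrieved}
    queue = deque(retrieved)
    while queue:
        node = queue.popleft()
        d = depth[node]
        # Credit weight decays as (gamma * lambd) ** d with DAG depth d.
        q[node] = q.get(node, 0.0) + alpha * (gamma * lambd) ** d * td_error
        for p in parents.get(node, []):
            if p not in depth:  # BFS: the first visit is the shortest depth
                depth[p] = d + 1
                queue.append(p)
    return depth


# Toy chain: memory "a" helped create "b", which helped create "c".
q_values = {"a": 0.0, "b": 0.0, "c": 0.0}
provenance = {"c": ["b"], "b": ["a"]}
depths = propagate_credit(q_values, provenance, ["c"], td_error=1.0,
                          alpha=0.5, gamma=0.5, lambd=0.8)
print(depths, q_values)
```

In this toy chain, with alpha=0.5 and γλ=0.4, a TD error of 1.0 at "c" yields updates of 0.5, 0.2 and 0.08 at depths 0, 1 and 2.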

[Figure: MemQ framework overview]

Key Features

  • No Weight Updates: Frozen LLM backbone — all learning happens through Q-value updates on episodic memories
  • Provenance DAG: Tracks memory creation dependencies for multi-step credit assignment
  • TD(λ) Eligibility Traces: Propagates credit backward through memory chains, not just single-step updates
  • EC-MDP Formalization: Exogenous-Context MDP that factors state into an exogenous task stream and endogenous memory store
  • Q-Integrated Retrieval: Two-phase retrieval with locality filtering followed by Q-guided ε-greedy selection

Algorithm

[Figure: MemQ algorithm]

Results

MemQ achieves the highest success rate on all six benchmarks in both settings: generalization evaluation on held-out test tasks and runtime learning.

Generalization Evaluation on Held-out Test Tasks

[Figure: Evaluation results]

Runtime Learning Results

[Figure: Runtime learning results]

Installation

pip install -e .

Or with development dependencies:

pip install -e ".[dev]"

Quick Start

  1. Copy a template config and fill in your API credentials:
cp configs/llb_template.yaml configs/my_experiment.local.yaml
# Edit the file: set api_key, base_url, model, etc.
  2. Run an experiment:
# Lifelong Agent Bench (OS Interaction)
python run/run_llb.py --config configs/my_experiment.local.yaml

# GPQA (Graduate-level QA)
python run/run_gpqa.py --config configs/gpqa_template.yaml

# BFCL (Function Calling)
python run/run_bfcl.py --config configs/bfcl_template.yaml

# ERQA (Embodied Reasoning)
python run/run_erqa.py --config configs/erqa_template.yaml

# MMMU Pro (Multimodal)
python run/run_mmmu.py --config configs/mmmu_template.yaml

# LiveCodeBench
python run/run_lcb.py --config configs/lcb_template.yaml

Algorithm Variants

Configure via experiment.algorithm in YAML configs:

| Algorithm | Description |
|---|---|
| memq | MemQ: TD(λ) eligibility traces + per-task Q updates over provenance DAG |
| memrl | MemRL: Single-step Q-value updates per task (γ=0) |
| self_rag | Self-RAG: Adaptive retrieval with self-reflection |
| rag | RAG: Standard similarity-based retrieval |
| memp | MemP: Memory-only proceduralization (no RL) |
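
For orientation, the fragment below shows where the algorithm choice might sit in a config file. Only the `experiment.algorithm` key is taken from this README; the template files may contain other required fields that are not shown here.

```yaml
# Fragment only; other keys from the template config are omitted.
experiment:
  algorithm: memq   # one of: memq, memrl, self_rag, rag, memp
```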

Architecture

memq/
├── agent/          # LLM agent (frozen backbone)
├── configs/        # Pydantic configuration models
├── providers/      # LLM & Embedding provider interfaces
├── service/        # Core memory service
│   ├── memory_service.py   # Build → Retrieve → Update loop
│   ├── memory_store.py     # In-memory vector store with Q-values
│   ├── value_driven.py     # Q-guided ε-greedy selection
│   ├── builders.py         # Proceduralization / trajectory → memory
│   ├── retrievers.py       # Locality filtering + scoring
│   └── updater.py          # Reflection-based update strategies
├── run/            # Benchmark runners
└── *_eval/         # Benchmark-specific adapters

Core Loop

  1. Retrieve: Locality filter (cosine similarity ≥ θ) → Q-guided ε-greedy top-k selection
  2. Act: Frozen LLM generates trajectory conditioned on retrieved memories
  3. Build: Compress trajectory into new episodic memory (proceduralization or reflection)
  4. Update: Compute TD error, BFS backward through provenance DAG, accumulate ΔQ with (γλ)^d decay
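
The Retrieve step above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the names `cosine` and `retrieve`, the list-based memory layout, and the per-slot ε-greedy draw are hypothetical and are not taken from `retrievers.py` or `value_driven.py`.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def retrieve(query, mem_vecs, q_values, theta=0.5, topk=3, epsilon=0.05, rng=None):
    """Return indices of selected memories.

    Phase 1: keep memories whose cosine similarity to the query is >= theta.
    Phase 2: fill up to topk slots; each slot explores (uniform random pick)
             with probability epsilon, otherwise takes the highest Q-value.
    """
    rng = rng or random.Random()
    candidates = [i for i, v in enumerate(mem_vecs) if cosine(query, v) >= theta]
    selected = []
    while candidates and len(selected) < topk:
        if rng.random() < epsilon:
            pick = rng.choice(candidates)
        else:
            pick = max(candidates, key=lambda i: q_values[i])
        selected.append(pick)
        candidates.remove(pick)
    return selected


memories = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.2]]
q_vals = [0.1, 0.9, 5.0, 0.5]
# Memory 2 has the highest Q-value but fails the locality filter.
print(retrieve([1.0, 0.0], memories, q_vals, theta=0.8, topk=2, epsilon=0.0))
```

Filtering by similarity first means a memory with a high Q-value cannot be selected for a task it is unrelated to.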

Key Hyperparameters

| Parameter | Description | Typical Range |
|---|---|---|
| alpha | Q-learning rate | 0.1–0.5 |
| gamma | Discount factor (provenance horizon) | 0.3–0.8 |
| lambd (λ) | Eligibility trace decay | 0.5–0.9 |
| k_retrieve | Number of candidates within local consistency | 3–10 |
| topk | Number of memories forming the context | 3–10 |
| weight_sim | Similarity weight in hybrid scoring | 0.3–1.0 |
| weight_q | Q-value weight in hybrid scoring | 0.0–0.7 |
| epsilon | Exploration probability | 0.0–0.1 |
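
To make the roles of `weight_sim` and `weight_q` concrete, the sketch below ranks candidates with a plain linear combination of similarity and Q-value. The linear form, the absence of any normalization, and the names `hybrid_score` and `rank` are assumptions for illustration; the repository's actual scoring may differ.

```python
def hybrid_score(similarity, q_value, weight_sim=0.7, weight_q=0.3):
    """Assumed linear blend of similarity and Q-value."""
    return weight_sim * similarity + weight_q * q_value


def rank(candidates, weight_sim, weight_q):
    """Sort (memory_id, similarity, q_value) tuples by hybrid score, best first."""
    return sorted(
        candidates,
        key=lambda c: hybrid_score(c[1], c[2], weight_sim, weight_q),
        reverse=True,
    )


pool = [("m1", 0.9, 0.1), ("m2", 0.7, 0.9)]
# weight_q = 0.0 reduces to pure similarity ranking.
print([m for m, _, _ in rank(pool, weight_sim=1.0, weight_q=0.0)])
# A larger weight_q lets a useful but less similar memory win.
print([m for m, _, _ in rank(pool, weight_sim=0.3, weight_q=0.7)])
```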

Benchmarks

| Benchmark | Domain | Task Type |
|---|---|---|
| Lifelong Agent Bench (LLAB) | OS Interaction / Database | Multi-step interactive |
| Berkeley Function Calling Leaderboard (BFCL) | Function Calling | Multi-turn API calls |
| Graduate-Level Google-Proof QA (GPQA) | Science QA | Expert-level reasoning |
| Embodied Reasoning QA (ERQA) | Embodied Reasoning | Visual + spatial |
| MMMU Pro | Multimodal Understanding | Image + text |
| LiveCodeBench (LCB) | Code Generation | Programming |

Citation

@misc{liao2026memqintegratingqlearningselfevolving,
      title={MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs}, 
      author={Junwei Liao and Haoting Shi and Ruiwen Zhou and Jiaqian Wang and Shengtao Zhang and Wei Zhang and Weinan Zhang and Ying Wen and Zhiyu Li and Feiyu Xiong and Bo Tang and Muning Wen},
      year={2026},
      eprint={2605.08374},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.08374}, 
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
