📢 News • 🧭 Introduction • 🏆 Leaderboard • 🚀 Getting Started
🛠️ Run • 📦 Outputs • 🙏 Acknowledgement • 📚 Citation
- [May 2026] π-BENCH is available on arXiv: 2605.14678.
- [May 2026] Project page is online: https://simplified-reasoning.github.io/Pi-Bench/
π-BENCH evaluates proactive personal assistant agents in long-horizon,
multi-session workflows. It contains 100 multi-turn tasks across 5
personas (researcher, marketer, pharmacist, law_trainee,
financier) in persistent workspaces where user requirements are often
underspecified and emerge over time.
The benchmark reports Proactivity (PROC) and Completeness (COMP). PROC
measures whether an agent discovers or infers hidden intents early; COMP
measures whether final artifacts satisfy checklist requirements. Unlike
short-horizon, GUI/mobile, or memory-only benchmarks, π-BENCH focuses on
persistent artifact workflows with hidden intents, inter-task dependencies, and
cross-session continuity.
Model results are maintained on the online leaderboard, with overall and per-persona PROC/COMP scores:
https://simplified-reasoning.github.io/Pi-Bench/#results
- Create and activate a Python environment:

      conda create -n pi-bench python=3.11
      conda activate pi-bench

- Install local dependencies and prepare AppWorld data:

      pip install -e .
      pip install -e third_party/nanobot
      bash scripts/setup_appworld.sh

- Create a local environment file:

      cp env.example.sh env.sh

  Edit env.sh with your credentials, then run source env.sh. The default
  model configs read MODEL_BASE_URL, MODEL_API_KEY, USER_BASE_URL,
  USER_API_KEY, JUDGER_BASE_URL, JUDGER_API_KEY, and
  BRAVE_SEARCH_API_KEY.

- Pull the benchmark Docker image:

      docker pull zzzhr97/pi-bench:latest

- Optional: edit config/models/<model-id>.yaml for model-specific names, endpoints, proxy settings, or timeouts. The filename stem is the pibench model id; see config/models/example.full.yaml for the full schema.
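Before launching a run, it can help to confirm that the variables listed above are visible to the current process. The following is a minimal sketch, not part of the repository: the helper name `missing_env_vars` is hypothetical, and it assumes that env.sh exports all seven variables and that each one is needed by the config in use.

```python
import os

# Variable names copied from the setup step above. Whether every one is
# mandatory for a given model config is an assumption of this sketch.
REQUIRED_VARS = (
    "MODEL_BASE_URL",
    "MODEL_API_KEY",
    "USER_BASE_URL",
    "USER_API_KEY",
    "JUDGER_BASE_URL",
    "JUDGER_API_KEY",
    "BRAVE_SEARCH_API_KEY",
)


def missing_env_vars(env=None):
    """Return the names in REQUIRED_VARS that are unset or empty in env.

    env defaults to the current process environment (os.environ).
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_env_vars()` in a Python session started from the same shell after `source env.sh` returns an empty list when everything is set, and otherwise the names that still need a value.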
Run from the repository root. Use --run 3 for leaderboard-style reporting:

    pibench --model-id deepseek-v3.2 --run 3

Each repeat is written to a separate __runNN output directory. If a repeated
run is interrupted or fails, rerun only the missing/failed repeat with:

    pibench --model-id deepseek-v3.2 --run 3 --rerun-failed

Completed repeats are reused and are not launched again.
Additional examples:
| Goal | Command |
|---|---|
| Single trial | pibench --model-id deepseek-v3.2 |
| Specific user | pibench --user-id law_trainee --model-id deepseek-v3.2 |
| Multiple models | pibench --model-id deepseek-v3.2,MiniMax-M2.5 |
| Multiple users and models | pibench --user-id researcher,law_trainee --model-id deepseek-v3.2,MiniMax-M2.5 |
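When runs are scripted, the command variants above can be assembled programmatically. This is a minimal sketch under stated assumptions: `build_pibench_cmd` and `as_shell` are hypothetical helpers, not part of the `pibench` package, and they use only the flags shown in this README (`--model-id`, `--user-id`, `--run`, `--rerun-failed`). Treating `--model-id` as required is also an assumption, based on every example including it.

```python
import shlex


def build_pibench_cmd(model_ids, user_ids=None, runs=None, rerun_failed=False):
    """Assemble a pibench argument list from the flags shown in this README."""
    if not model_ids:
        # Assumption: every documented example passes at least one model id.
        raise ValueError("at least one model id is required")
    cmd = ["pibench"]
    if user_ids:
        # Multiple users are passed as one comma-separated value.
        cmd += ["--user-id", ",".join(user_ids)]
    # Multiple models are passed as one comma-separated value.
    cmd += ["--model-id", ",".join(model_ids)]
    if runs is not None:
        cmd += ["--run", str(runs)]
    if rerun_failed:
        cmd.append("--rerun-failed")
    return cmd


def as_shell(cmd):
    """Render the argument list as a single shell-safe command line."""
    return shlex.join(cmd)
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, or rendered with `as_shell` for logging.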
Main outputs:
outputs/<model-id>/<user-id>/
Container runtime logs:
outputs/<model-id>/<user-id>/run/<timestamp>-runtime/
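To locate runtime logs after a run, the directory layout above can be walked with a glob. This sketch assumes only the documented pattern `outputs/<model-id>/<user-id>/run/<timestamp>-runtime/`; the helper name `find_runtime_logs` is hypothetical, and how the `__runNN` repeat directories fit into this layout is not covered here.

```python
from pathlib import Path


def find_runtime_logs(outputs_root="outputs"):
    """Group runtime log directories by (model_id, user_id).

    Matches <outputs_root>/<model-id>/<user-id>/run/<timestamp>-runtime,
    the layout described in this README.
    """
    root = Path(outputs_root)
    found = {}
    for log_dir in sorted(root.glob("*/*/run/*-runtime")):
        if not log_dir.is_dir():
            continue  # skip stray files whose names happen to match
        model_id, user_id = log_dir.relative_to(root).parts[:2]
        found.setdefault((model_id, user_id), []).append(log_dir)
    return found
```

Run from the repository root, `find_runtime_logs()` returns a dictionary that maps each model and persona pair to its log directories in sorted order.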
Pi-Bench is built on AppWorld and NanoBot. We thank the contributors to these open-source projects.
@misc{zhang2026pibenchevaluatingproactivepersonal,
title={$\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows},
author={Haoran Zhang and Luxin Xu and Zhilin Wang and Runquan Gui and Shunkai Zhang and Haodi Lei and Zihao He and Bingsu He and Chicheng Qin and Tong Zhu and Xiaoye Qu and Yang Yang and Yu Cheng and Yafu Li},
year={2026},
eprint={2605.14678},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2605.14678}
}