π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

🧭 Introduction

π-BENCH evaluates proactive personal assistant agents in long-horizon, multi-session workflows. It contains 100 multi-turn tasks across 5 personas (researcher, marketer, pharmacist, law_trainee, financier) in persistent workspaces where user requirements are often underspecified and emerge over time.

The benchmark reports Proactivity (PROC) and Completeness (COMP). PROC measures whether an agent discovers or infers hidden intents early; COMP measures whether final artifacts satisfy checklist requirements. Unlike short-horizon, GUI/mobile, or memory-only benchmarks, π-BENCH focuses on persistent artifact workflows with hidden intents, inter-task dependencies, and cross-session continuity.
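
The paper defines how PROC and COMP are actually scored. Purely as an illustration of what checklist-based scoring can look like, the sketch below assumes that a judge marks each checklist item as satisfied or not, that a task's score is the fraction of satisfied items, and that tasks are averaged without weights. The function names, the task ids, and the aggregation rule are all assumptions made for this example, not the benchmark's implementation.

```python
from statistics import mean


def checklist_score(item_results: list[bool]) -> float:
    """Fraction of checklist items marked satisfied (assumed scoring rule)."""
    if not item_results:
        raise ValueError("a checklist needs at least one item")
    return sum(item_results) / len(item_results)


def average_score(tasks: dict[str, list[bool]]) -> float:
    """Unweighted mean of per-task checklist scores (assumed aggregation)."""
    return mean(checklist_score(items) for items in tasks.values())


# Hypothetical judge verdicts for two tasks.
verdicts = {
    "task_a": [True, True, False, True],  # 3 of 4 items satisfied
    "task_b": [True, False],              # 1 of 2 items satisfied
}
print(f"{average_score(verdicts):.3f}")  # 0.625
```

For reported numbers, rely on the official evaluation pipeline and the leaderboard rather than a re-implementation like this one.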

🏆 Leaderboard

Model results are maintained on the online leaderboard, with overall and per-persona PROC/COMP scores:

https://simplified-reasoning.github.io/Pi-Bench/#results

🧰 Setup

  1. Create and activate a Python environment:

     conda create -n pi-bench python=3.11
     conda activate pi-bench

  2. Install local dependencies and prepare AppWorld data:

     pip install -e .
     pip install -e third_party/nanobot
     bash scripts/setup_appworld.sh

  3. Create a local environment file:

     cp env.example.sh env.sh

     Edit env.sh with your credentials, then run source env.sh. The default model configs read MODEL_BASE_URL, MODEL_API_KEY, USER_BASE_URL, USER_API_KEY, JUDGER_BASE_URL, JUDGER_API_KEY, and BRAVE_SEARCH_API_KEY.

  4. Pull the benchmark Docker image:

     docker pull zzzhr97/pi-bench:latest

  5. Optional: edit config/models/<model-id>.yaml for model-specific names, endpoints, proxy settings, or timeouts. The filename stem is the pibench model id; see config/models/example.full.yaml for the full schema.
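
Before launching a long run, it can help to confirm that the credentials are visible to Python. The following is a minimal sketch, not part of the repository: the variable names come from the setup notes above, while the helper name `missing_env_vars` and the assumption that every variable is needed by your configuration are illustrative only.

```python
import os

# Variable names taken from the setup notes; whether all of them are
# required for a given model config is an assumption of this sketch.
REQUIRED_VARS = (
    "MODEL_BASE_URL", "MODEL_API_KEY",
    "USER_BASE_URL", "USER_API_KEY",
    "JUDGER_BASE_URL", "JUDGER_API_KEY",
    "BRAVE_SEARCH_API_KEY",
)


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All required variables are set.")
```

Run it in the same shell session after source env.sh. Only exported variables are inherited by child processes, so a variable that is assigned but not exported will show up as missing here.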

▶️ Run

Run from the repository root. Use --run 3 for leaderboard-style reporting:

pibench --model-id deepseek-v3.2 --run 3

Each repeat is written to a separate __runNN output directory. If a repeated run is interrupted or fails, rerun only the missing or failed repeats with:

pibench --model-id deepseek-v3.2 --run 3 --rerun-failed

Completed repeats are reused and are not launched again.

Additional examples:

| Goal | Command |
| --- | --- |
| Single trial | pibench --model-id deepseek-v3.2 |
| Specific user | pibench --user-id law_trainee --model-id deepseek-v3.2 |
| Multiple models | pibench --model-id deepseek-v3.2,MiniMax-M2.5 |
| Multiple users and models | pibench --user-id researcher,law_trainee --model-id deepseek-v3.2,MiniMax-M2.5 |
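
When sweeping several models or personas from a script, the commands above can be assembled programmatically. This is a minimal sketch that uses only the flags shown in this README (--model-id, --user-id, --run, --rerun-failed); the helper `build_pibench_command` is hypothetical, and flag combinations not shown above are assumed, not confirmed, to be accepted by the CLI.

```python
import shlex


def build_pibench_command(model_ids, user_ids=None, runs=None, rerun_failed=False):
    """Assemble a pibench argv list from the flags shown in this README."""
    if not model_ids:
        raise ValueError("at least one model id is required")
    argv = ["pibench"]
    if user_ids:
        # Multiple ids are passed as one comma-separated value.
        argv += ["--user-id", ",".join(user_ids)]
    argv += ["--model-id", ",".join(model_ids)]
    if runs is not None:
        argv += ["--run", str(runs)]
    if rerun_failed:
        argv.append("--rerun-failed")
    return argv


cmd = build_pibench_command(
    ["deepseek-v3.2", "MiniMax-M2.5"],
    user_ids=["researcher", "law_trainee"],
)
print(shlex.join(cmd))
# pibench --user-id researcher,law_trainee --model-id deepseek-v3.2,MiniMax-M2.5
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell quoting issues.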

📦 Outputs

Main outputs:

outputs/<model-id>/<user-id>/

Container runtime logs:

outputs/<model-id>/<user-id>/run/<timestamp>-runtime/
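
To see which repeats exist after a --run invocation, the output tree can be scanned for the __runNN suffix. The sketch below is an assumption-laden helper, not part of pibench: it assumes the suffix appears at the end of a directory name somewhere under the outputs root, and the function name `find_run_dirs` is hypothetical. Check the actual layout on disk before relying on it.

```python
import re
from pathlib import Path

# Assumed naming: a directory whose name ends in "__run" plus digits.
RUN_SUFFIX = re.compile(r"__run(\d+)$")


def find_run_dirs(outputs_root):
    """Map repeat index -> directories under outputs_root with a __runNN suffix."""
    found = {}
    for path in sorted(Path(outputs_root).rglob("*")):
        match = RUN_SUFFIX.search(path.name)
        if path.is_dir() and match:
            found.setdefault(int(match.group(1)), []).append(path)
    return found


if __name__ == "__main__":
    for index, paths in sorted(find_run_dirs("outputs").items()):
        for path in paths:
            print(f"repeat {index}: {path}")
```

This is only for inspection; rerunning missing or failed repeats is handled by --rerun-failed as described in the Run section.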

🙏 Acknowledgement

Pi-Bench is built on AppWorld and NanoBot. We thank the contributors to these open-source projects.

📚 Citation

@misc{zhang2026pibenchevaluatingproactivepersonal,
  title={$\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows},
  author={Haoran Zhang and Luxin Xu and Zhilin Wang and Runquan Gui and Shunkai Zhang and Haodi Lei and Zihao He and Bingsu He and Chicheng Qin and Tong Zhu and Xiaoye Qu and Yang Yang and Yu Cheng and Yafu Li},
  year={2026},
  eprint={2605.14678},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2605.14678}
}
