π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

📢 News • 🧭 Introduction • 🏆 Leaderboard • 🧰 Setup

▶️ Run • 📦 Outputs • 🙏 Acknowledgement • 📚 Citation

📢 News

[May 2026] π-BENCH is available on arXiv: 2605.14678.
[May 2026] Project page is online: https://simplified-reasoning.github.io/Pi-Bench/

🧭 Introduction

π-BENCH evaluates proactive personal assistant agents in long-horizon, multi-session workflows. It contains 100 multi-turn tasks across 5 personas (researcher, marketer, pharmacist, law_trainee, financier) in persistent workspaces where user requirements are often underspecified and emerge over time.

The benchmark reports Proactivity (PROC) and Completeness (COMP). PROC measures whether an agent discovers or infers hidden intents early; COMP measures whether final artifacts satisfy checklist requirements. Unlike short-horizon, GUI/mobile, or memory-only benchmarks, π-BENCH focuses on persistent artifact workflows with hidden intents, inter-task dependencies, and cross-session continuity.

🏆 Leaderboard

Model results are maintained on the online leaderboard, with overall and per-persona PROC/COMP scores:

https://simplified-reasoning.github.io/Pi-Bench/#results

🧰 Setup

Create and activate a Python environment:

conda create -n pi-bench python=3.11
conda activate pi-bench

Install local dependencies and prepare AppWorld data:

pip install -e .
pip install -e third_party/nanobot
bash scripts/setup_appworld.sh

Create a local environment file:

cp env.example.sh env.sh

Edit env.sh with your credentials, then run `source env.sh`. The default model configs read MODEL_BASE_URL, MODEL_API_KEY, USER_BASE_URL, USER_API_KEY, JUDGER_BASE_URL, JUDGER_API_KEY, and BRAVE_SEARCH_API_KEY.
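Before launching a run, it can help to confirm that every variable listed above is visible to child processes. The following is a minimal sketch, not part of pibench: the variable names come from the paragraph above, while the helper `missing_env_vars` is hypothetical, and the check only works from a separate Python process if env.sh exports the variables (an assumption).

```python
import os

# Variable names are taken from the setup text above; the check itself is an
# illustrative helper, not part of pibench.
REQUIRED_VARS = [
    "MODEL_BASE_URL",
    "MODEL_API_KEY",
    "USER_BASE_URL",
    "USER_API_KEY",
    "JUDGER_BASE_URL",
    "JUDGER_API_KEY",
    "BRAVE_SEARCH_API_KEY",
]


def missing_env_vars(env=None):
    """Return the required variables that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing variables: " + ", ".join(missing))
    else:
        print("All required variables are set.")
```

Run it in the same shell session where env.sh was sourced; an empty value is treated the same as an unset one.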

Pull the benchmark Docker image:

docker pull zzzhr97/pi-bench:latest

Optional: edit config/models/<model-id>.yaml for model-specific names, endpoints, proxy settings, or timeouts. The filename stem is the pibench model id; see config/models/example.full.yaml for the full schema.
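Because the model id is the filename stem, the ids available locally can be listed by scanning the config directory. This is a small sketch under that single assumption; `list_model_ids` is a hypothetical helper and does not read or validate the YAML contents.

```python
from pathlib import Path


def list_model_ids(config_dir="config/models"):
    """Return the sorted filename stems of the *.yaml files in `config_dir`."""
    return sorted(path.stem for path in Path(config_dir).glob("*.yaml"))


if __name__ == "__main__":
    for model_id in list_model_ids():
        print(model_id)
```

Note that a schema file such as example.full.yaml would also appear in this listing (as `example.full`), so filter it out if only runnable configs are wanted.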

▶️ Run

Run from the repository root. Use --run 3 for leaderboard-style reporting:

pibench --model-id deepseek-v3.2 --run 3

Each repeat is written to a separate __runNN output directory. If a repeated run is interrupted or fails, rerun only the missing or failed repeats with:

pibench --model-id deepseek-v3.2 --run 3 --rerun-failed

Completed repeats are reused and are not launched again.
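To see which repeats have not produced a directory yet, the __runNN suffix can be parsed from directory names. The sketch below rests on assumptions that the text above does not confirm: that NN is a decimal number, that numbering starts at 1, and that the suffix sits at the end of the directory name. `missing_repeats` is a hypothetical helper, and a directory that exists is not necessarily a completed repeat.

```python
import re

# Assumption: a repeat directory name ends in "__run" plus a number, e.g. "__run01".
RUN_SUFFIX = re.compile(r"__run(\d+)$")


def missing_repeats(dir_names, total_runs):
    """Return the 1-based repeat numbers that have no matching directory name."""
    found = set()
    for name in dir_names:
        match = RUN_SUFFIX.search(name)
        if match:
            found.add(int(match.group(1)))
    return [n for n in range(1, total_runs + 1) if n not in found]
```

In practice `--rerun-failed` already decides which repeats to launch, so a helper like this is only useful for a quick manual inspection.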

Additional examples:

| Goal | Command |
| --- | --- |
| Single trial | `pibench --model-id deepseek-v3.2` |
| Specific user | `pibench --user-id law_trainee --model-id deepseek-v3.2` |
| Multiple models | `pibench --model-id deepseek-v3.2,MiniMax-M2.5` |
| Multiple users and models | `pibench --user-id researcher,law_trainee --model-id deepseek-v3.2,MiniMax-M2.5` |
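When scripting several invocations, the commands in the table can be assembled programmatically. The sketch below uses only the flags that appear in this README (`--model-id`, `--user-id`, `--run`, `--rerun-failed`) and joins multiple values with commas as the table does; `build_pibench_command` is a hypothetical wrapper, not something pibench provides.

```python
import shlex


def build_pibench_command(model_ids, user_ids=None, runs=None, rerun_failed=False):
    """Build a pibench argument list from the flags shown in this README."""
    argv = ["pibench"]
    if user_ids:
        argv += ["--user-id", ",".join(user_ids)]
    argv += ["--model-id", ",".join(model_ids)]
    if runs is not None:
        argv += ["--run", str(runs)]
    if rerun_failed:
        argv.append("--rerun-failed")
    return argv


command = build_pibench_command(
    ["deepseek-v3.2", "MiniMax-M2.5"], user_ids=["researcher", "law_trainee"]
)
# Prints: pibench --user-id researcher,law_trainee --model-id deepseek-v3.2,MiniMax-M2.5
print(shlex.join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` from the repository root; building a list rather than a string avoids shell quoting issues.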

📦 Outputs

Main outputs:

outputs/<model-id>/<user-id>/

Container runtime logs:

outputs/<model-id>/<user-id>/run/<timestamp>-runtime/
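The two path templates above can be turned into small lookup helpers for post-run inspection. This is a sketch that assumes only the layout shown here; the function names are hypothetical, the timestamp format is not specified in this README (so results are sorted by name rather than parsed), and the sketch does not account for how __runNN repeat directories fit into this layout.

```python
from pathlib import Path


def output_dir(model_id, user_id, root="outputs"):
    """Return outputs/<model-id>/<user-id> as a Path."""
    return Path(root) / model_id / user_id


def runtime_log_dirs(model_id, user_id, root="outputs"):
    """Return the <timestamp>-runtime directories under run/, sorted by name."""
    run_dir = output_dir(model_id, user_id, root) / "run"
    return sorted(p for p in run_dir.glob("*-runtime") if p.is_dir())
```

For example, `runtime_log_dirs("deepseek-v3.2", "researcher")` returns an empty list if no run has written logs for that pair yet.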

🙏 Acknowledgement

π-BENCH is built on AppWorld and NanoBot. We thank the contributors to these open-source projects.

📚 Citation

@misc{zhang2026pibenchevaluatingproactivepersonal,
  title={$\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows},
  author={Haoran Zhang and Luxin Xu and Zhilin Wang and Runquan Gui and Shunkai Zhang and Haodi Lei and Zihao He and Bingsu He and Chicheng Qin and Tong Zhu and Xiaoye Qu and Yang Yang and Yu Cheng and Yafu Li},
  year={2026},
  eprint={2605.14678},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2605.14678}
}
