Agent-R1 is a unified, modular framework for Agentic Reinforcement Learning. It trains multi-step LLM agents through a step-native RL loop, where the model observes an environment, generates an action, receives tool or environment feedback, and continues until the task is solved or terminated.
Unlike single-turn RL pipelines that treat interaction as one growing prompt-response sequence, Agent-R1 models every turn as a step-level MDP transition. This makes tool use, environment state, context management, reward assignment, and policy optimization explicit parts of the same training substrate.
- [2026.05.29] Agent-R1 integrates StepPO, expands recipe coverage, and releases processed data. The framework now includes StepPO-style training support together with recipe integrations for HotpotQA, ALFWorld, WebShop, and academic paper search. Processed datasets are available on ModelScope.
- [2026.03.23] Agent-R1 v0.1.0 is the first official release of the refactored architecture. It introduces the Step-level MDP foundation and new Layered Abstractions. The previous implementation is archived on the
legacybranch. - [2026.03.04] Claw-R1 is released. It extends Agentic RL to general agents such as OpenClaw through a middleware-style design. See AgentR1/Claw-R1.
Earlier Updates
- [2026.01.10] PaperScout is released: an autonomous academic paper search agent trained with Agent-R1 and Proximal Sequence Policy Optimization. Read the paper here.
- [2025.11.18] The Agent-R1 technical report is released on arXiv.
- [2025.05.06] Tool environments are redesigned to support more flexible agent-tool interaction patterns.
- [2025.05.06] GRPO and REINFORCE training crashes caused by NaN values are fixed. See issue #30.
- [2025.04.01] Basic inference scripts and an interactive chat interface are added.
- [2025.03.18] Multi-modal support is added for vision-language model agents.
- [2025.03.18]
verlis moved to a git submodule and Agent-R1 extensions are separated from upstream code. - [2025.03.16] Process rewards are supported for per-tool-call feedback.
Modern LLM infrastructure already has strong serving systems such as vLLM and SGLang, and strong distributed training systems such as DeepSpeed, FSDP, and Megatron-LM. Agentic RL needs to reconnect these two sides into a rollout -> reward -> replay -> update loop where the model interacts with tools and environments over multiple turns.
Agent-R1 is built around three design goals:
- Step-level trajectory representation: each transition stores observation, action, environment feedback, reward, termination state, and next observation while preserving action boundaries and avoiding fragile
Token -> Text -> Tokenreconstruction. - Flexible context management: the environment decides what the model sees next, so history can be appended, truncated, summarized, rewritten, or augmented.
- Algorithm-system decoupling: task workflows, environments, rollout, rewards, advantage estimators, and policy objectives can evolve independently.
In multi-turn agent training, the model is not just continuing a token sequence. Each model output can invoke tools, change the environment state, receive external feedback, and shape the next observation. Agent-R1 therefore treats the agent step as the basic interaction unit: a step records what the model saw, what action it produced, what feedback and reward the environment returned, and what observation should be exposed next. This step-level trajectory representation keeps rollout, replay, context construction, and credit assignment aligned with real agent decisions, while still allowing token-level policy losses inside each generated action.
Agent-R1 uses layered abstractions so new tasks can reuse the same trainer without rewriting the full RL stack.
| Layer | Responsibility | When to Use |
|---|---|---|
AgentFlowBase |
Full control over prompt construction, model calls, branching, context management, and step assembly. | Complex custom agents that do not fit a standard environment loop. |
AgentEnvLoop |
Generic loop connecting model generation with an environment's reset() / step() interface. |
Agent tasks that can be modeled as environment interaction, including traditional RL-style environments. |
AgentEnv |
Task environment interface returning observations, rewards, termination, and metadata. | Implementing the full environment logic for AgentEnvLoop. |
ToolEnv |
Built-in environment for standard multi-turn tool calling. | Tool-augmented tasks where you only need to define tools. |
BaseTool |
Standard interface for registering executable tools. | Adding calculators, search tools, APIs, or task-specific checkers. |
The main loop is:
- Load a sample containing
prompt,agent_name,reward_model, and optionalenv_kwargs. - Create the configured
AgentFlowand environment. - Generate an action from the current observation.
- Parse the action, execute tools or update the environment, and return feedback.
- Record the step and continue until
done=Trueormax_stepsis reached. - Convert the structured trace into rewards, advantages, masks, and policy updates.
Agent-R1 uses the same environment setup as verl, and the current version requires verl==0.7.0. You only need to clone this repository; there is no separate Agent-R1 installation step.
The recommended path is:
- Read the Getting Started page for the minimal setup flow.
- Download the processed data release from ModelScope, then place or symlink each task's files to the paths expected by the corresponding recipe.
- Use
examples/gsm8k/run_steppo.shas a sanity check that the environment is wired correctly. - Move to the Agent Task Tutorial for the minimal GSM8K + Tool example based on
ToolEnv + BaseTool. - Read Recipes and Algorithms for the current task integrations and launch-script layout.
Download the processed data with either the ModelScope CLI or git:
pip install modelscope
modelscope download --dataset Melmaphother/Agent-R1-data --local_dir data/agent-r1-datagit lfs install
git clone https://www.modelscope.cn/datasets/Melmaphother/Agent-R1-data.git data/agent-r1-dataUse the processed GSM8K files from the data release, or regenerate a minimal GSM8K dataset locally, then run the single-step script:
python3 -m recipes.gsm8k.data_preprocess.process_gsm8k --local_save_dir ~/data/gsm8k
bash examples/gsm8k/run_steppo.shThis stage is only a setup check. It helps confirm that your environment, model path, dataset path, and training stack are wired correctly.
GSM8K + Tool is the simplest ToolEnv + BaseTool example. Use the processed GSM8K tool files from the data release, or regenerate the tool-augmented dataset locally, then launch the multi-step tool-calling script:
python3 -m recipes.gsm8k.data_preprocess.process_gsm8k_tool --local_save_dir ~/data/gsm8k_tool
bash examples/gsm8k/run_steppo_tool.shThis path uses the generic AgentEnvLoop with the built-in ToolEnv and recipe-local calc_gsm8k_reward tool. The plain GSM8K script remains a single-turn environment sanity check.
Core concepts:
The Agent-R1 report evaluates Qwen3-4B across representative agent scenarios. The table below summarizes the main results; see Experiments for the experimental setting, task coverage, optimizer comparison, and context-management analysis.
| Method | GSM8K Acc. (%) | HotpotQA Acc. (%) | ALFWorld SR Seen (%) | ALFWorld SR Unseen (%) | WebShop Score (%) | WebShop SR (%) |
|---|---|---|---|---|---|---|
| ReAct | 53.1 | 25.8 | 7.14 | 2.98 | 51.58 | 23.8 |
| GRPO | 83.3 | 59.4 | 81.29 | 74.58 | 65.83 | 44.2 |
| PPO | 78.1 | 56.7 | 76.42 | 72.38 | 70.18 | 46.0 |
| REINFORCE | 78.9 | 52.8 | 73.84 | 69.57 | 63.41 | 41.8 |
| RLOO | 81.6 | 55.2 | 79.08 | 73.46 | 68.02 | 45.1 |
For a new task, keep the trainer intact and implement the task-specific layers:
recipes/<task>/
base.yaml
data_preprocess/process_<task>.py
<task>_agent_flow.py
reward_fn.py
prompts.py
utils.py
env/ # optional environment service or wrappers
Typical migration checklist:
- Data: emit parquet rows with
prompt,reward_model,agent_name, andenv_kwargs. - Environment / tools: define how state updates, tool observations, rewards, and termination work.
- Agent flow: connect model actions to the environment loop and expose step records.
- Training script: set paths, rollout steps, batch sizes, estimator, and policy loss through Hydra overrides.
- Project homepage: https://agentr1.github.io/agent-r1
- Documentation: https://agentr1.github.io/agent-r1/docs/
maincontains the current v0.1.0 architecture based on Step-level MDP and layered abstractions.legacypreserves the previous implementation for reference.- Use a recent source checkout of
verlthat includes the AgentFlow / async rollout stack required by this repository.
- TableMind: an autonomous programmatic agent for tool-augmented table reasoning.
- PaperScout: an autonomous academic paper search agent trained with Agent-R1 and Proximal Sequence Policy Optimization.
- Cast-R1: an agentic framework that reformulates time-series forecasting as sequential decision making.
- StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning, a step-level Agentic RL method that treats the agent step as the action unit and aligns credit assignment with multi-turn agent decisions.
This work is conducted at the State Key Laboratory of Cognitive Intelligence, USTC. We gratefully acknowledge the ideas and infrastructure from DeepSeek-R1, veRL, and RAGEN. We also thank Prof. Qi Liu and Prof. Mingyue Cheng for their guidance and support.
If you find Agent-R1 useful in your research, please cite:
@misc{cheng2025agentr1trainingpowerfulllm,
title={Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning},
author={Mingyue Cheng and Jie Ouyang and Shuo Yu and Ruiran Yan and Yucong Luo and Zirui Liu and Daoyu Wang and Qi Liu and Enhong Chen},
year={2025},
eprint={2511.14460},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.14460}
}

