DreamPRM-Code is a coding-focused Process Reward Model (PRM) that enables reliable test-time scaling for LLM coding. It addresses the two main blockers for coding PRMs: (1) the lack of a natural definition of PRM steps and (2) noisy intermediate PRM training labels. For (1), it uses a chain-of-functions prompt to define PRM steps at the function level. For (2), it denoises Monte-Carlo-sampled PRM training labels with meta-learning guided by unit-test outcomes.
Inspired by Chain-of-Thought, DreamPRM-Code uses a Chain-of-Function (CoF) prompt to steer the LLM toward producing independent code blocks whose logic can be isolated and encapsulated into separate functions.
Example of generated code under CoF prompting:
(Step-1)
def main():
    '''
    Strategy: Use Dijkstra's algorithm to find the shortest path...
    '''
    # implementation
(Step-2)
def dijkstra(graph, start, end):
    '''
    Implements Dijkstra's algorithm with a min-heap priority queue...
    '''
    # implementation
(Step-3)
def build_graph(n, m):
    '''
    Build adjacency list from stdin input...
    '''
    # implementation

Noisy MC-sampled PRM training labels are treated as learnable variables and refined via a meta-learning scheme that is anchored by clean final step rewards, producing more faithful intermediate supervision.
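As a rough illustration of the label-refinement idea, the sketch below treats noisy step labels as learnable values and updates them with a one-step unrolled meta-gradient against clean final-step rewards. It is a toy under loud assumptions: the PRM is replaced by a logistic model over hand-made feature vectors, and `refine_labels`, its parameters, and the learning rates are hypothetical names chosen for this example, not the repository's training code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def bce_grad(w, xs, ys):
    """Gradient of the mean binary cross-entropy of sigmoid(w.x) w.r.t. w."""
    g = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = sigmoid(dot(w, x)) - y
        for k, xk in enumerate(x):
            g[k] += err * xk / len(xs)
    return g

def refine_labels(step_x, noisy_y, final_x, final_r,
                  iters=200, alpha=0.5, beta=2.0):
    """Toy bi-level loop (assumption: linear-logistic stand-in for the PRM).

    Inner step: one gradient step on the step-level loss using current labels.
    Outer step: move the labels so the updated model fits the clean rewards.
    """
    w = [0.0] * len(step_x[0])
    y = list(noisy_y)
    n = len(step_x)
    for _ in range(iters):
        # Inner update: w_new = w - alpha * grad_w L_step(w, y)
        w_new = [wk - alpha * gk for wk, gk in zip(w, bce_grad(w, step_x, y))]
        # Outer gradient on clean final-step rewards, evaluated at w_new
        g_out = bce_grad(w_new, final_x, final_r)
        # Since d w_new / d y_i = (alpha / n) * x_i, the chain rule gives
        # dL_out / dy_i = (alpha / n) * (g_out . x_i)
        y = [min(1.0, max(0.0, yi - beta * (alpha / n) * dot(g_out, x)))
             for yi, x in zip(y, step_x)]
        w = w_new
    return y, w

# Two steps share the feature pattern [1, 0]; one of them carries a wrong label.
step_x = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
noisy_y = [1.0, 0.0, 0.0]
final_x = [[1.0, 0.0], [0.0, 1.0]]
final_r = [1.0, 0.0]
refined, _ = refine_labels(step_x, noisy_y, final_x, final_r)
print([round(v, 3) for v in refined])
```

In this toy setup the second label, which conflicts with the clean reward for the same feature pattern, is pushed upward, while the other two labels stay where they were.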
We use LiveCodeBench (post-2025-02) as the test set, OpenAI o4-mini-high as the base LLM, and Qwen-2.5-Coder-3B as the PRM.
| Method | Easy | Medium | Hard | Overall |
|---|---|---|---|---|
| Gemini-2.5 | 100 | 82.1 | 52.5 | 72.5 |
| o3 | 100 | 71.8 | 57.4 | 71.8 |
| DeepSeek-R1 | 99.7 | 77.7 | 47.2 | 68.7 |
| o4-mini-high | 100 | 89.7 | 57.4 | 77.1 |
| ORM (o4-mini-high) | 100 | 89.7 | 62.3 | 79.4 |
| PRM-CoF (o4-mini-high) | 100 | 92.3 | 62.3 | 80.2 |
| DreamPRM-Code | 100 | 92.3 | 63.9 | 80.9 |
This section provides a minimal end-to-end guide for training and evaluating DreamPRM-Code on LiveCodeBench.
First, clone the repository and create the conda environment using the provided environment.yml file:
git clone https://github.com/ruz048/DreamPRM-Code.git
cd DreamPRM-Code
conda env create -f environment.yml
conda activate dreamprm-code
DreamPRM-Code relies on Chain-of-Functions (CoF)–structured code as PRM training data.
Use the following script to generate CoF-style code solutions from the base LLM:
bash gen_cof.sh
This step produces function-structured code that defines PRM reasoning steps.
At this stage, the generated data does not contain reward labels.
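To make the "function as step" idea concrete, here is a minimal sketch of how function-structured code could be split into PRM steps using Python's standard `ast` module. The helper name `split_into_steps` is hypothetical, and the sketch assumes the generated solution is valid Python with one top-level function per step; the repository's own parsing may differ.

```python
import ast
from typing import List

def split_into_steps(source: str) -> List[str]:
    """Return the source text of each top-level function, in file order."""
    tree = ast.parse(source)
    steps = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # get_source_segment recovers the exact text span of the node
            steps.append(ast.get_source_segment(source, node))
    return steps

example = '''
def main():
    """Strategy: read input, then solve."""
    print(solve(3))

def solve(n):
    """Return the sum 1..n."""
    return n * (n + 1) // 2

main()
'''

for i, step in enumerate(split_into_steps(example), start=1):
    print(f"(Step-{i})")
    print(step)
```

Top-level statements that are not function definitions, such as the final `main()` call, are ignored by this sketch.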
To obtain initial supervision for PRM training, we perform Monte-Carlo (MC) sampling to assign noisy correctness labels to intermediate CoF steps:
bash gen_cof_label.sh
These labels serve as the starting point for training and will later be automatically refined by the meta-learning–based label correction framework.
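The MC labeling idea can be sketched as follows: for each prefix of CoF steps, sample several completions and record how often the completed program passes the unit tests. This is only an illustration; `mc_step_labels`, `rollout`, and `passes_tests` are hypothetical names, and using the pass fraction as a soft label is an assumption rather than a description of `gen_cof_label.sh`.

```python
from typing import Callable, List, Sequence

def mc_step_labels(
    steps: Sequence[str],
    rollout: Callable[[Sequence[str]], str],
    passes_tests: Callable[[str], bool],
    num_rollouts: int = 8,
) -> List[float]:
    """Estimate a noisy label for every step prefix by Monte-Carlo rollouts.

    rollout(prefix)   -> a full program completed from the given steps
                         (in practice an LLM call; hypothetical here)
    passes_tests(src) -> True if the program passes the unit tests
    """
    labels = []
    for k in range(1, len(steps) + 1):
        prefix = steps[:k]
        passed = sum(
            1 for _ in range(num_rollouts) if passes_tests(rollout(prefix))
        )
        labels.append(passed / num_rollouts)
    return labels

# Deterministic stand-ins so the sketch runs without an LLM or a sandbox.
def fake_rollout(prefix):
    return "\n".join(prefix)

def fake_tests(program):
    return "BUG" not in program

demo_steps = ["def main(): ...", "def helper(): BUG", "def read_input(): ..."]
print(mc_step_labels(demo_steps, fake_rollout, fake_tests, num_rollouts=4))
# prints [1.0, 0.0, 0.0]
```

With a real sampler the estimates are noisy, because a correct prefix can still be completed badly and a small number of rollouts gives a coarse fraction.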
To generate multiple candidate LLM solutions for the trained PRM to select from, run:
bash gen_sol.sh
The script currently uses OpenAI o4-mini-high to generate solutions, matching the original LiveCodeBench settings.
With CoF data and MC-sampled labels prepared, you can start training DreamPRM-Code under the bi-level optimization framework:
bash run_train_eval.sh
This script:
- Trains the PRM on function-level steps
- Performs meta-learning–based label correction
- Automatically evaluates test-time scaling performance after training, if multiple LLM solutions have been generated in the previous step
If you already have a trained DreamPRM-Code checkpoint, you can directly run evaluation without retraining:
bash run_eval.sh
This evaluates the PRM under test-time scaling settings on the specified benchmark.
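For readers unfamiliar with PRM-based test-time scaling, the selection step can be pictured as best-of-N: score every candidate solution step by step and keep the highest-scoring one. The sketch below is an assumption-laden illustration; `select_best`, `solution_score`, and `score_prefix` are hypothetical names, and averaging step scores is one possible aggregation, not necessarily the one used by `run_eval.sh`.

```python
from typing import Callable, List, Sequence

def solution_score(
    steps: Sequence[str],
    score_prefix: Callable[[Sequence[str]], float],
) -> float:
    """Aggregate per-step PRM scores (assumption: simple mean)."""
    if not steps:
        return float("-inf")
    scores = [score_prefix(steps[:k]) for k in range(1, len(steps) + 1)]
    return sum(scores) / len(scores)

def select_best(
    candidates: Sequence[List[str]],
    score_prefix: Callable[[Sequence[str]], float],
) -> int:
    """Return the index of the candidate with the highest aggregated score."""
    return max(
        range(len(candidates)),
        key=lambda i: solution_score(candidates[i], score_prefix),
    )

# Stand-in scorer: a real PRM would score the latest step given the prefix.
def fake_prm(prefix):
    return 0.1 if "TODO" in prefix[-1] else 0.9

demo_candidates = [
    ["def main(): ...", "def solve(): TODO"],
    ["def main(): ...", "def solve(): ..."],
]
print(select_best(demo_candidates, fake_prm))  # prints 1
```

Other aggregations, such as the minimum step score or the score of the final step, fit the same interface by changing `solution_score` only.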
We provide our trained DreamPRM-Code checkpoint here: [DreamPRM-Code-ckpt]. Using this checkpoint together with the LLM-generated solutions, you can directly reproduce our results by following the instructions in step 6️⃣.
This repository is released under the Apache License 2.0.
If you find this work useful, please cite:
@article{zhang2025dreamprmcode,
title = {DreamPRM-Code: Function-as-Step Process Reward Model for LLM Coding},
author = {Zhang, Ruiyi and Qin, Peijia and Cao, Qi and Xie, Pengtao},
journal = {arXiv preprint},
year = {2025}
}
For questions or collaborations, please contact ruz048@ucsd.edu
