RewardBench: the first evaluation tool for reward models.
Updated Feb 16, 2026 - Python
Free and open source code of the https://tournesol.app platform. Meet the community on Discord https://discord.gg/WvcSG55Bf3
Explore concepts like Self-Correct, Self-Refine, Self-Improve, Self-Contradict, Self-Play, and Self-Knowledge, alongside o1-like reasoning elevation🍓 and hallucination alleviation🍄.
Multi-Agent DPO Data Synthesis Factory: a multi-agent framework for automatically synthesizing preference-training data | red-team attack → multi-persona review → final adjudication → DPO preference pairs
A Survey of Direct Preference Optimization (DPO)
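Several entries above center on Direct Preference Optimization. As a minimal illustration (not code from any repository listed here; the function name and toy log-probabilities are made up), the DPO objective reduces to a logistic loss over per-response log-probability margins against a frozen reference model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy (logp_*) or the frozen reference (ref_logp_*).
    """
    # Implicit rewards: how much more the policy favors each response
    # than the reference does, scaled by the temperature beta.
    chosen_margin = beta * (logp_chosen - ref_logp_chosen)
    rejected_margin = beta * (logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the margin gap (standard logistic loss).
    logits = chosen_margin - rejected_margin
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Policy already prefers the chosen response, so the loss is modest.
print(round(dpo_loss(-5.0, -9.0, -6.0, -8.0), 4))  # → 0.5981
```

Swapping the chosen and rejected arguments raises the loss, which is exactly the gradient signal that pushes probability mass toward preferred responses.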
The MAGICAL benchmark suite for robust imitation learning (NeurIPS 2020)
This repository contains the source code for our paper: "NaviSTAR: Socially Aware Robot Navigation with Hybrid Spatio-Temporal Graph Transformer and Preference Learning". For more details, please refer to our project website at https://sites.google.com/view/san-navistar.
Python-based GUI for collecting chemists' feedback on molecules
Accelerate LLM preference tuning via prefix sharing with a single line of code
Official implementation of Bootstrapping Language Models via DPO Implicit Rewards
Code for the paper "Aligning LLM Agents by Learning Latent Preference from User Edits".
Official code for ICML 2024 paper, "RIME: Robust Preference-based Reinforcement Learning with Noisy Preferences" (ICML 2024 Spotlight)
PyTorch implementations for Offline Preference-Based RL (PbRL) algorithms
[ICLR 2026] MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding
[ICLR 2025 Spotlight] Weak-to-strong preference optimization: stealing reward from weak aligned model
Data and models for the paper "Configurable Safety Tuning of Language Models with Synthetic Preference Data"
Official PyTorch implementation of "LPOI: Listwise Preference Optimization for Vision Language Models" (ACL 2025 Main)
Code for "Monocular Depth Estimation via Listwise Ranking using the Plackett-Luce Model" as published at CVPR 2021.
Official implementation of "Learning User Preferences for Image Generation Models"
This repository contains the source code for our paper: "Feedback-efficient Active Preference Learning for Socially Aware Robot Navigation", accepted to IROS-2022. For more details, please refer to our project website at https://sites.google.com/view/san-fapl.