Robometer: Scaling General-Purpose Robotic Reward Models
via Trajectory Comparisons

Anthony Liang*1, Yigit Korkmaz*1, Jiahui Zhang2, Minyoung Hwang3, Abrar Anwar1, Sidhant Kaushik4

Aditya Shah5, Alex S. Huang2, Luke Zettlemoyer5, Dieter Fox5,6, Yu Xiang2, Anqi Li7

Andreea Bobu3, Abhishek Gupta5, Stephen Tu†1, Erdem Bıyık†1, Jesse Zhang†5

1Univ. of Southern California  2UT Dallas  3MIT  4Indep. Researcher  5Univ. of Washington  6Ai2  7NVIDIA

*Equal contribution  †Equal advising

Robometer: Dense, General-Purpose, Video-Language Reward Model


Specific Contributions

Training Rewards on Different Expertise Data
Robometer scales reward training across both (1) reward-labeled expert demonstrations and (2) reward-unlabeled, suboptimal trajectories, yielding well-calibrated, dense rewards usable zero-shot for policy learning.
RBM-1M Dataset
We release RBM-1M, a large-scale reward-learning dataset comprising over one million trajectories spanning diverse robot embodiments and tasks, including substantial suboptimal and failure data.
Downstream Applications
Robometer significantly improves robot learning performance across a diverse set of downstream applications, including model-free online RL, model-based online RL, offline RL, failure detection, and data retrieval for IL.

Method

Robometer method overview

Dual-Objective Training

Frame-Level Progress and Success Prediction
Anchors the magnitude of predicted rewards by regressing per-frame progress and success values on expert demonstration data, providing dense supervision at the intra-trajectory level.
Trajectory-Comparison Preference Prediction
Imposes global ordering constraints across pairs of trajectories of the same task, enabling robust learning from suboptimal and failed data via a preference comparison objective.
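The two objectives can be sketched as a single combined loss. The NumPy implementation below is an illustrative assumption, not the paper's exact code: the function name, equal weighting, and the Bradley–Terry form of the preference term are all choices we make for the sketch.

```python
import numpy as np

def dual_objective_loss(progress_pred, progress_gt, success_logit, success_gt,
                        score_preferred, score_rejected):
    """Sketch of a two-part reward-model loss (names and weights illustrative).

    progress_pred/gt: (T,) per-frame progress in [0, 1] on an expert demo
    success_logit/gt: (T,) per-frame success logits and binary labels
    score_preferred/rejected: scalar trajectory scores for a comparison pair
    """
    # (1) Frame-level regression anchors reward magnitudes on expert data.
    progress_loss = np.mean((progress_pred - progress_gt) ** 2)
    p = 1.0 / (1.0 + np.exp(-success_logit))
    success_loss = -np.mean(success_gt * np.log(p) + (1 - success_gt) * np.log(1 - p))

    # (2) A Bradley-Terry style preference term imposes a global ordering
    # between the preferred and rejected trajectory of the same task.
    pref_loss = -np.log(1.0 / (1.0 + np.exp(-(score_preferred - score_rejected))))

    return progress_loss + success_loss + pref_loss
```

The regression term keeps reward magnitudes calibrated on expert data, while the preference term is what lets reward-unlabeled, suboptimal trajectories contribute supervision.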
Robometer Training Objectives and Types

Model Training Inputs and Objectives

Different Expertise
We pair together expert and suboptimal trajectories of the same task, training per-frame progress/success prediction on the expert trajectory and training the model to prefer the expert trajectory.
Different Tasks
We pair trajectories of different tasks together, training the model to prefer the video corresponding to the correct input language instruction. We can train per-frame progress/success on both types of trajectories.
Augmented Trajectory
We also augment a given trajectory to simulate failures, for example by using video rewind. The model is trained to prefer the original, unaugmented trajectory, and per-frame progress/success prediction can occur on either trajectory.
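The video-rewind augmentation mentioned above can be sketched as follows, assuming frames arrive as a NumPy array; the function name and the rewind-point sampling are illustrative assumptions.

```python
import numpy as np

def rewind_augment(frames, rewind_point=None, rng=None):
    """Simulate a failure by playing a trajectory forward, then in reverse
    ("video rewind"). Name and signature are illustrative, not the paper's API.

    frames: (T, H, W, C) array of video frames from a successful trajectory.
    Returns a clip that approaches the goal and then undoes its own progress;
    the model is trained to prefer the original clip over this one.
    """
    if rng is None:
        rng = np.random.default_rng()
    T = len(frames)
    if rewind_point is None:
        rewind_point = int(rng.integers(T // 2, T))  # rewind from a late frame
    forward = frames[:rewind_point]
    backward = frames[rewind_point - 1::-1]  # the same prefix, reversed
    return np.concatenate([forward, backward], axis=0)
```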

RBM-1M Dataset

1M+

Trajectories

21

Robot Embodiments

140,000+

Trajs. from Mixed Expertise Datasets

RBM-1M statistics
Distribution of different trajectory types in RBM-1M.
Example Trajectory 1
Example Trajectory 2
Example Trajectory 3
Example Trajectory 4
Example Trajectory 5
Example Trajectory 6
Example trajectories from RBM-1M. Dataset composition across robot embodiments, tasks, and trajectory quality levels. Suboptimal and failure data constitute a substantial fraction, enabling the preference-based learning component of Robometer.

Offline Reward Model Evaluations

Video-language Reward Confusion Matrix

Video-language reward confusion matrix
Video-language reward confusion matrix. For each task (unseen data), we score all video–instruction pairs. Robometer yields the most diagonal matrix, indicating strong alignment between demos and instructions. Values are column-normalized diagonal means (fraction of reward on aligned pairs).

Qualitative Reward Curves

Reward curves predicted by Robometer and ablated variants on example trajectories.

Qualitative reward plots
Qualitative reward curves across expert, suboptimal, and failed trajectories. Robometer predictions reflect subtle failures and best align with true progress.

Quantitative Reward Evaluations

Setting. Reward alignment is measured with Value Order Correlation (VOC Pearson $r$) between predicted rewards and ground-truth progress; trajectory ranking is measured with Kendall $\tau$ on RBM-EVAL (in-distribution and out-of-distribution). We compare baselines (GVL, VLAC, RoboDopamine), models trained on RoboReward data, and ReWiND / Robometer trained on the full RBM-1M dataset.
                     Baselines                        w/ RoboReward training data               w/ our RBM-1M data
Dataset              GVL     VLAC    RoboDopamine     RoboReward-4B  RoboReward-8B  Robometer  ReWiND  Robometer

(a) Reward alignment (VOC Pearson $r$) ↑
RBM-EVAL-ID          0.16    0.16    0.13             0.77           0.82           0.84       0.46    0.92
RBM-EVAL-OOD         0.21    0.17    0.08             0.88           0.88           0.93       0.51    0.95

(b) Trajectory ranking (Kendall $\tau$) ↑
RBM-EVAL-OOD         0.19    0.08    0.11             0.50           0.47           0.55       0.01    0.66

(a) Reward alignment (VOC Pearson $r$) and (b) trajectory ranking (Kendall $\tau$) on RBM-EVAL. Baselines by training data: GVL (w/ GPT-5-mini, unknown data), VLAC (300k trajectories), RoboDopamine (100k). We compare Robometer to RoboReward-4B/8B trained on their data, and ReWiND and Robometer trained on the full RBM-1M dataset.
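Both underlying statistics are available in SciPy. The snippet below illustrates how they are computed; the numeric values are hypothetical examples, not results from the table.

```python
from scipy.stats import pearsonr, kendalltau

# Reward alignment: Pearson r between per-frame predicted rewards and
# ground-truth progress along one trajectory (hypothetical values).
predicted_rewards = [0.05, 0.20, 0.45, 0.60, 0.90]
true_progress     = [0.00, 0.25, 0.50, 0.75, 1.00]
voc_r, _ = pearsonr(predicted_rewards, true_progress)

# Trajectory ranking: Kendall tau between predicted trajectory scores and
# ground-truth quality ranks for several trajectories of the same task.
pred_scores  = [0.9, 0.4, 0.7, 0.1]
true_quality = [4, 2, 3, 1]
tau, _ = kendalltau(pred_scores, true_quality)
```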

Setting. We evaluate on the RoboRewardBench benchmark: mean absolute error (MAE) between predicted rewards and human-annotated progress labels. Lower is better.
Model Type        Model                               RoboRewardBench MAE ↓
Qwen3-4B Models   Robometer (ours)                    0.72
                  Robometer (RoboReward data only)    0.75
                  RoboReward-4B                       0.85
                  Qwen3-VL-4B-Instr.                  1.03

Evaluation on the RoboRewardBench benchmark (mean absolute error, lower is better).

Setting. We compare Robometer to RL-VLM-F on pairwise preference accuracy over 500 comparisons per RBM-EVAL-OOD dataset: (i) Different Quality — same task, different trajectory quality; (ii) Different Task — different tasks, model must prefer the trajectory matching the instruction.

Different Quality

Dataset              RL-VLM-F (%)   Robometer (%)
USC Franka           52.1           75.0
USC Koch             54.4           79.4
USC Trossen          66.7           76.2
USC xArm             48.6           88.9
MIT Franka           54.4           85.4
UT Dallas SO-101     56.7           90.0
Average              55.5           82.5

Different quality trajectory pairwise preference accuracy.

Different Task

Dataset              RL-VLM-F (%)   Robometer (%)
USC Franka           70.7           100.0
USC Koch             54.7           89.8
USC Trossen          64.0           99.0
USC xArm             73.3           98.2
MIT Franka           55.3           98.4
UT Dallas SO-101     72.7           100.0
Average              65.1           97.6

Different task trajectory pairwise preference accuracy.
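Pairwise preference accuracy as reported in these tables reduces to a simple count over comparisons; a minimal sketch (data structures and identifiers are illustrative):

```python
def preference_accuracy(model_scores, pairs):
    """Fraction of pairs where the model scores the preferred trajectory
    higher. `model_scores` maps trajectory id -> scalar model score;
    `pairs` holds (preferred_id, other_id) tuples."""
    correct = sum(model_scores[a] > model_scores[b] for a, b in pairs)
    return correct / len(pairs)
```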

Reward curves by task

Select a task to view the reward curve video.

Policy Learning Experiments

Robometer is evaluated across five diverse downstream applications. In every setting Robometer is used as a plug-in reward function without task-specific fine-tuning.

Setting. Robometer rewards are used as dense reward signals for single-stage, online reinforcement learning with DSRL + $\pi_0$. Successes are *automatically* detected by the reward model.
RoboReward real-world training clip at 100× speed. RoboReward consistently and falsely predicts success (full reward) before the robot finishes the task, leading to poor evaluation performance.
Robometer real-world training clip at 100× speed.
Method                      Success Rate ↑
$\pi_0$ base (before RL)    0.20
RoboReward                  0.55
Robometer (ours) ✓          0.85
Setting. Robometer rewards are used as dense reward signals for multi-stage online policy learning with DSRL + $\pi_0$. We perform a long-horizon, multi-stage RL task by using a VLM to break the task into two sub-tasks: "put the corn in the pot on the dish rack" and "put the pot lid on the pot in the dish rack". Successes and stage advancement are *automatically* detected by the reward model.
RoboReward multi-stage RL at 100× speed. RoboReward often falsely predicts success for the first stage when the robot drops the corn next to, but not into, the pot, leading to poor downstream evaluation performance: the robot rarely puts the corn into the pot.
Robometer multi-stage RL at 100× speed. Robometer is robust to cases where the robot drops the corn next to, but not into, the pot, leading to few false stage advancements.
Method                      Success Rate ↑
$\pi_0$ base (before RL)    0.20
RoboReward                  0.20
Robometer (ours) ✓          0.70
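Automatic success detection and stage advancement can be sketched as a thin wrapper around the reward model. The interface and the success threshold below are assumptions for illustration, not the paper's exact implementation.

```python
def step_reward(reward_model, frames, subtasks, stage, success_threshold=0.9):
    """Score the current clip against the active sub-task instruction and
    advance the stage when the reward model predicts success.

    reward_model(frames, instruction) -> dense scalar reward in [0, 1]
    subtasks: ordered list of sub-task instructions from the VLM decomposition
    """
    reward = reward_model(frames, subtasks[stage])
    done = False
    if reward >= success_threshold:
        if stage < len(subtasks) - 1:
            stage += 1   # advance to the next sub-task
        else:
            done = True  # final sub-task succeeded: episode complete
    return reward, stage, done
```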

Robometer Demo

Try the reward model evaluation UI below, or visit it in a new tab at https://robometer-rewardeval-ui.hf.space.

Low-rank Finetuning of Robometer

Robometer serves as a strong initialization for domain-specific reward fine-tuning: practitioners can adapt it to new tasks or embodiments with limited data and compute (e.g., LoRA with a single GPU) instead of training a reward model from scratch. We demonstrate this on RoboFAC; fine-tuning from Robometer substantially outperforms fine-tuning the base VLM across reward and ranking metrics.

Method                      VOC $r$ ↑   Kendall $\tau$ ↑   Succ-Fail Diff ↑
Robometer-4B (Zero-shot)    0.652       0.436              0.141
Qwen3-VL (LoRA)             0.701       0.067              0.005
Qwen3-VL (Full FT)          0.727       0.102              0.008
Robometer-4B (LoRA)         0.875       0.786              0.271
Robometer-4B (Full FT)      0.884       0.802              0.302
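The low-rank update itself is simple. A minimal sketch of a LoRA-adapted linear layer follows the standard LoRA recipe (this is not Robometer's adapter code; shapes, scaling, and initialization convention are the usual assumptions):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Low-rank-adapted linear layer: y = x @ (W + (alpha / r) * A @ B).

    W: (d_in, d_out) frozen pretrained weight (e.g. from the Robometer backbone)
    A: (d_in, r), B: (r, d_out) with rank r << d_in, d_out; only A, B train.
    """
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A) @ B
```

With the standard initialization (A random, B zeros), the adapted layer starts out exactly equal to the pretrained one, which is why fine-tuning from a strong initialization like Robometer is cheap and stable.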

BibTeX

@article{liang2026robometer,
  title   = {Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons},
  author  = {Anthony Liang and Yigit Korkmaz and Jiahui Zhang and Minyoung Hwang and Abrar Anwar and Sidhant Kaushik and Aditya Shah and Alex S. Huang and Luke Zettlemoyer and Dieter Fox and Yu Xiang and Anqi Li and Andreea Bobu and Abhishek Gupta and Stephen Tu and Erdem Biyik and Jesse Zhang},
  journal = {arXiv preprint arXiv:2603.02115},
  year    = {2026}
}