Reward Learning through Ranking Mean Squared Error

Abstract

Reward design remains a significant bottleneck in applying reinforcement learning (RL) to real-world problems. A popular alternative is reward learning, where reward functions are inferred from human feedback rather than manually specified. Recent work has proposed learning reward functions from human ratings rather than traditional binary preferences, enabling richer and potentially less cognitively demanding supervision. Building on this paradigm, we introduce a new rating-based RL method, Ranked Return Regression for RL (R4). At its core, R4 uses a novel ranking mean squared error loss that learns from a dataset of trajectory-rating pairs, treating the human-provided discrete ratings (e.g., “bad,” “neutral,” “good”) as ordinal targets. Unlike prior rating-based approaches, R4 offers formal guarantees: its solution set is provably minimal and complete under mild assumptions. Empirically, using both human-provided and simulated ratings, we demonstrate that R4 consistently matches or outperforms existing rating-based and preference-based RL methods on robotic benchmarks from OpenAI Gym and the DeepMind Control Suite.

Method

TODO: describe your method here. You can include figures using standard Markdown or HTML.

Loss Function

The ranking mean squared error (RMSE) objective for a preference pair $(\sigma^+, \sigma^-)$ with preference label $y \in [0, 1]$ is:

TODO: add your equation here. The site config already includes KaTeX so standard $...$ and $$...$$ delimiters work.

Why Regression over Classification?

TODO: expand the intuition and theoretical analysis here.

Key insight: When preference labels are binarized, cross-entropy treats a 51%:49% preference identically to a 99%:1% preference in terms of the gradient signal. RMSE, which regresses onto the label itself, naturally scales the learning signal with the strength of the expressed preference.
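The gradient argument can be checked numerically. The sketch below is an illustration only and is not the R4 loss: it assumes a Bradley-Terry model in which the predicted preference is the sigmoid of the return difference `d`, a cross-entropy loss on a label thresholded to 0 or 1, and a plain squared error on the soft label. The function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ce_grad_hard_label(d, y_soft):
    """d/dd of cross-entropy when the soft label is first thresholded to 0/1.

    Assumption: predicted preference p = sigmoid(d), where d is the
    difference in predicted returns between the two segments.
    """
    y_hard = 1.0 if y_soft > 0.5 else 0.0
    return sigmoid(d) - y_hard


def se_grad_soft_label(d, y_soft):
    """d/dd of the squared error (p - y_soft)^2 with p = sigmoid(d)."""
    p = sigmoid(d)
    return 2.0 * (p - y_soft) * p * (1.0 - p)


if __name__ == "__main__":
    d = 0.0  # the model is initially indifferent between the two segments
    for y in (0.51, 0.99):
        print(
            f"label={y:.2f}  "
            f"CE grad (hard label)={ce_grad_hard_label(d, y):+.3f}  "
            f"SE grad (soft label)={se_grad_soft_label(d, y):+.3f}"
        )
```

Under these assumptions the thresholded cross-entropy gradient is the same for both labels, while the squared-error gradient is roughly 49 times larger for the 99%:1% preference than for the 51%:49% one.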

Experiments

TODO: describe your experimental setup, environments, and baselines.

Results

The table below reports TODO (metric) on TODO (benchmark). Values are averaged over TODO seeds ± standard deviation.