Notes on Reward Model Training
As I’m currently working on a project on RL with video generation models, my boss asked me to study reward model training practices, and how to turn a pairwise model into a ranking model at inference time. So here are my study notes.

Pairwise Reward Objectives
Bradley-Terry Model (Pairwise)
Given a dataset of preferences, the Bradley-Terry model defines the probability that response $y_w$ is preferred over $y_l$ given prompt $x$:
\[P(y_w \succ y_l \mid x) = \frac{\exp\big(r(x, y_w)\big)}{\exp\big(r(x, y_w)\big) + \exp\big(r(x, y_l)\big)} = \sigma\big(r(x, y_w) - r(x, y_l)\big)\]where $r(x, y)$ is the reward model and $\sigma(\cdot)$ is the sigmoid function; $y_w$ and $y_l$ denote the winning and losing responses, respectively.
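The two forms of the probability are algebraically equivalent (divide numerator and denominator by $\exp(r(x, y_w))$). A minimal sketch checking this numerically, with made-up scalar rewards standing in for $r(x, y_w)$ and $r(x, y_l)$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical scalar rewards for the winning and losing responses
r_w, r_l = 1.5, 0.3

# Softmax (two-class) form of the Bradley-Terry probability
p_softmax = math.exp(r_w) / (math.exp(r_w) + math.exp(r_l))

# Equivalent sigmoid-of-difference form
p_sigmoid = sigmoid(r_w - r_l)

assert abs(p_softmax - p_sigmoid) < 1e-12
```

Note that the probability depends only on the reward *difference*, so the reward model is identifiable only up to an additive constant per prompt.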
We want to train a reward model whose scores maximize this probability. In practice, this is done by minimizing the negative log-likelihood over a dataset $\mathcal{D}$ of preference pairs:
\[\mathcal{L}(r) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \big[ \log \sigma\big(r(x, y_w) - r(x, y_l)\big) \big]\]