
As I’m currently working on a project applying RL to video generation models, my boss asked me to study reward model training practices, and how a pairwise reward model can be turned into a ranking model at inference time. Here are my study notes.

Pairwise Reward Objectives

Bradley-Terry Model (Pairwise)

Say we have a dataset of preferences. The Bradley-Terry model defines the probability that response $y_w$ is preferred over $y_l$ given prompt $x$:

\[P(y_w \succ y_l \mid x) = \frac{\exp\big(r(x, y_w)\big)}{\exp\big(r(x, y_w)\big) + \exp\big(r(x, y_l)\big)} = \sigma\big(r(x, y_w) - r(x, y_l)\big)\]

where $r(x, y)$ is the reward model and $\sigma(\cdot)$ is the sigmoid function; $y_w$ and $y_l$ denote the winning (preferred) and losing (rejected) responses.
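The identity above can be checked directly in code: the Bradley-Terry probability is just a sigmoid applied to the reward gap. A minimal sketch (the function name is mine, not from any library):

```python
import math

def preference_prob(r_w: float, r_l: float) -> float:
    """Bradley-Terry probability that the response with reward r_w
    is preferred over the response with reward r_l."""
    # sigma(r_w - r_l): only the *difference* in rewards matters,
    # so the reward scale has an arbitrary additive offset
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

print(preference_prob(1.0, 1.0))  # 0.5 — equal rewards, no preference
print(preference_prob(2.0, 0.0))  # ~0.88 — a 2-point gap is a strong preference
```

Note that shifting both rewards by the same constant leaves the probability unchanged, which is why reward models trained this way are only identified up to an additive constant.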

We want to train a reward model whose rewards maximize this probability on the observed preferences. In practice, this is done by minimizing the negative log-likelihood over a dataset $\mathcal{D}$ of preference pairs:

\[\mathcal{L}(r) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \big[ \log \sigma\big(r(x, y_w) - r(x, y_l)\big) \big]\]
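A toy sketch of this objective, plus the inference-time trick the intro mentions: since the trained $r(x, y)$ assigns a scalar score to each response, ranking candidates is just sorting by reward. The function names and the length-based `toy_reward` below are placeholders I made up, standing in for a trained model:

```python
import math

def pairwise_loss(reward_fn, pairs):
    """Average negative log-likelihood of the Bradley-Terry model
    over (prompt, winner, loser) preference triples."""
    total = 0.0
    for x, y_w, y_l in pairs:
        margin = reward_fn(x, y_w) - reward_fn(x, y_l)
        # -log sigma(margin) = log(1 + exp(-margin)); log1p is numerically safer
        total += math.log1p(math.exp(-margin))
    return total / len(pairs)

def rank_responses(reward_fn, x, candidates):
    """The pairwise-trained scalar reward induces a full ranking:
    sort candidates by r(x, y), best first."""
    return sorted(candidates, key=lambda y: reward_fn(x, y), reverse=True)

# Placeholder reward: longer responses score higher (a real model would be learned)
toy_reward = lambda x, y: float(len(y))

pairs = [("q", "long answer", "short"), ("q", "detailed reply", "ok")]
print(pairwise_loss(toy_reward, pairs))            # small: winners already out-score losers
print(rank_responses(toy_reward, "q", ["ok", "detailed reply", "short"]))
```

The loss shrinks as the reward margin between winner and loser grows, so minimizing it pushes $r(x, y_w)$ above $r(x, y_l)$; no extra machinery is needed at inference beyond scoring each candidate once and sorting.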

List of Resources

  1. Umar Jamil’s YouTube video on DPO
  2. Hugging Face DPO Post