Overview

Topic Boundary

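This overview tracks the main line of preference learning. Any work whose core learning signal comes from pairwise comparison, ranking, human/judge preference, a reward model, or a direct preference objective should be located here first; work that is simply an RL algorithm under an ordinary numerical reward belongs under Classical & Deep RL.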

Reading Queue

Dueling Bandit

Dueling Bandit is a sub-queue under this topic, reserved for bandit problems with pairwise preference feedback. It no longer appears as a global top-level topic.

Paper List
