Paper List
Tag: rlhf
3 items with this tag.
May 02, 2026
From Shortcuts to Sabotage: Natural Emergent Misalignment from Reward Hacking
reward_hacking
emergent_misalignment
rlhf
May 01, 2026
MNPO: Multiplayer Nash Preference Optimization
preference_optimization
nash_learning
rlhf
May 01, 2026
TI-DPO: Token-Importance Guided Direct Preference Optimization
direct_preference_optimization
token_importance
rlhf