Paper List

Tag: rlhf

3 items with this tag.

May 02, 2026
From Shortcuts to Sabotage: Natural Emergent Misalignment from Reward Hacking
May 01, 2026
MNPO: Multiplayer Nash Preference Optimization
May 01, 2026
TI-DPO: Token-Importance Guided Direct Preference Optimization
