Paper List

Tag: rlhf

3 items with this tag.

  • May 02, 2026

    From Shortcuts to Sabotage: Natural Emergent Misalignment from Reward Hacking

    • reward_hacking
    • emergent_misalignment
    • rlhf
  • May 01, 2026

    MNPO: Multiplayer Nash Preference Optimization

    • preference_optimization
    • nash_learning
    • rlhf
  • May 01, 2026

    TI-DPO: Token-Importance Guided Direct Preference Optimization

    • direct_preference_optimization
    • token_importance
    • rlhf
