Paper List

Tag: reward_hacking

2 items with this tag.

  • May 02, 2026

    From Shortcuts to Sabotage: Natural Emergent Misalignment from Reward Hacking

    • reward_hacking
    • emergent_misalignment
    • rlhf
  • May 01, 2026

    TRACE: Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort

    • reward_hacking
    • scalable_oversight
    • chain_of_thought

