Paper List
Tag: reward_hacking
2 items with this tag.
May 02, 2026
From Shortcuts to Sabotage: Natural Emergent Misalignment from Reward Hacking
reward_hacking
emergent_misalignment
rlhf
May 01, 2026
TRACE: Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
reward_hacking
scalable_oversight
chain_of_thought