Classical & Deep RL

Scope

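This topic now refers specifically to Classical & Deep Reinforcement Learning: classical RL, deep RL, policy gradient, value-based methods, model-based RL, exploration, inverse RL (IRL), state abstraction, representation for control, and the related theory and models. The default objects of discussion here are MDPs/POMDPs, environment interaction, numerical rewards, returns, value functions, policy optimization, and sample efficiency.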

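It no longer covers every sense of "RL" used in LLM post-training. For the dynamics of RLVR, GRPO, and DPO/RLHF in LLM reasoning, look first to Textual Reasoning or Preference Learning; for multi-turn tool use and online RL for web/computer/code agents, see Agentic RL; oversight failure, reward hacking, and the safety boundaries of alignment objectives belong to Safety & Alignment. This topic can serve as algorithmic background for those areas, but it is not their parent category.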

Paper List

Overview of Classical and Deep Reinforcement Learning

DAE: Direct Advantage Estimation

DGSM: Data Generation as Sequential Decision Making

GAE: High-Dimensional Continuous Control Using Generalized Advantage Estimation

ICM: Curiosity-driven Exploration by Self-supervised Prediction

MaxEnt: Maximum Entropy Inverse Reinforcement Learning

TRPO: Trust Region Policy Optimization

Linear Approximation: Provably Efficient Reinforcement Learning with Linear Function Approximation

Policy Gradients Guide: The Definitive Guide to Policy Gradients

Scaling CRL Depth: 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities