Decisions & Dragons
Should we abandon RL? Is it the right approach?
Why is it better to subtract a baseline in REINFORCE?
Why does experience replay require off-policy learning, and how is it different from on-policy learning?
What is the "horizon" in reinforcement learning?
Why doesn't Q-learning work with continuous actions?
Why is the DDPG gradient the product of the Q-function gradient and the policy gradient?
If Q-learning is off-policy, why doesn't it require importance sampling?
What is the difference between V(s) and Q(s,a)?
Why does the policy gradient include a log probability term?
What is the difference between model-based and model-free RL?
About
Math Notation Cheatsheet
Revisions