Decisions & Dragons
Should we abandon RL? Is it the right approach?
Why is it better to subtract a baseline in REINFORCE?
Why does experience replay require off-policy learning, and how is it different from on-policy learning?
What is the "horizon" in reinforcement learning?
Why doesn't Q-learning work with continuous actions?
Why is the DDPG gradient the product of the Q-function gradient and the policy gradient?
If Q-learning is off-policy, why doesn't it require importance sampling?
What is the difference between V(s) and Q(s,a)?
Why does the policy gradient include a log probability term?
What is the difference between model-based and model-free RL?
About
Math Notation Cheatsheet
Revisions