What is the relation between Dynamic Programming and Reinforcement Learning?

Question

Please forgive the simplicity of the question; I only recently started studying Reinforcement Learning. I am supposed to study a system where the transition probabilities are known, and I have to use Reinforcement Learning.

I am trying to understand the relation between Dynamic Programming and Reinforcement Learning.

Although I have studied several lectures, read the corresponding chapters of Barto & Sutton's book and watched some videos, it is still not clear if Dynamic Programming (specifically value iteration, policy iteration) is distinct from Reinforcement Learning or if policy iteration and value iteration are considered model-based algorithms of Reinforcement Learning.

Apart from this, I wonder whether there is any sense in using Reinforcement Learning when the transition probabilities are known.

Thank you!

Dynamic Programming (DP) is not related to RL directly. However, policy iteration and value iteration are: they use DP methods, but DP can be used for all sorts of things, e.g. cloth simulation or smoothing animations. It's a general solution technique. Can you also clarify whether you are really asking about the relationship between DP and RL, or between policy iteration/value iteration and RL? I think it is the latter, but your wording asks about the former . . . — Neil Slater, Nov 13 '23 at 21:44
Thanks for your answer! I think I am confused. I thought that policy iteration and value iteration are Dynamic Programming algorithms, and that Dynamic Programming is somehow related to Reinforcement Learning. If I restrict the question to the relation between Reinforcement Learning and policy iteration/value iteration, would the answer be that both of them are model-based algorithms in RL? — Annaassymeon2, Nov 13 '23 at 22:34

score 4 · Answer 1 · edited Nov 14 '23 at 15:37

Dynamic programming is an algorithmic paradigm (i.e. a way to design algorithms) that can be applied to many problem domains, not just Markov decision processes (MDPs), as long as the problems satisfy certain conditions (optimal substructure and overlapping subproblems). Policy and value iteration are dynamic programming algorithms applied to problems that can be described as MDPs. If you want a general introduction to dynamic programming, I'd recommend the book Introduction to Algorithms by Cormen et al., or maybe start with the Wikipedia article.
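To make the "DP applied to an MDP" point concrete, here is a minimal value iteration sketch. The two-state MDP, its rewards, and the names `P`, `q_value` and `value_iteration` are made up for illustration; the only thing that matters is that the transition probabilities are known, so the algorithm never has to interact with an environment.

```python
# Hypothetical 2-state, 2-action MDP with a fully known model (illustrative only).
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)],
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)],
        1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9  # discount factor (assumed)


def q_value(P, V, s, a, gamma):
    """Expected return of taking action a in state s, then following V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma=GAMMA, tol=1e-10):
    """Sweep the Bellman optimality backup (in place) until values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(q_value(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


V = value_iteration(P)
# Greedy policy with respect to the converged values.
policy = {s: max(P[s], key=lambda a: q_value(P, V, s, a, GAMMA)) for s in P}
print(V, policy)
```

Each sweep reuses the previously computed values of successor states, which is the overlapping-subproblem structure mentioned above.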

Reinforcement learning is an approach to solving MDPs, typically when you don't know the true transition probabilities, although there are also RL algorithms that use or estimate these probabilities (see model-based RL). The difference is that RL is a trial-and-error approach guided by a reward signal/function: you take actions, often randomly at first, and progressively learn which ones give you more reward (aka reinforcement). The usual RL reference is Reinforcement Learning: An Introduction by Sutton and Barto.
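For contrast, here is a minimal trial-and-error sketch using tabular Q-learning, which is one common model-free choice and not the only one. The four-state chain environment and the names `step` and `q_learning` are hypothetical; the point is that the learner never reads the transition probabilities and only sees sampled transitions and rewards.

```python
import random

N_STATES = 4           # states 0..3; state 3 is terminal (assumed toy chain)
ACTIONS = (0, 1)       # 0 = left, 1 = right
GAMMA, ALPHA = 0.9, 0.5


def step(state, action):
    """Hypothetical environment: the learner only observes its sampled outcomes."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Purely random behaviour policy; Q-learning is off-policy,
            # so it can still learn the greedy values from this data.
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            target = reward if done else reward + GAMMA * max(Q[next_state])
            Q[state][action] += ALPHA * (target - Q[state][action])
            state = next_state
    return Q


Q = q_learning()
# Greedy action in each non-terminal state after learning.
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(greedy)
```

When the transition probabilities are known, as in the question, the DP backup can compute the expectation directly instead of estimating it from samples like this.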

Essentially, DP "lends" these methods to RL, and RL uses them as model-based algorithms. Right? Your answer is really helpful, thank you! — Annaassymeon2, Nov 15 '23 at 20:16
@Annaassymeon2 I wouldn't say that. I'd rather say that some DP algorithms (e.g. policy iteration) solve the same type of problem as the usual RL algorithms (MDPs), so both are used to find policies for uncertain environments. This answer may also be useful. — nbro, Nov 17 '23 at 00:31

cinch · Answer 2 · 2023-11-14T08:18:15.483

As already established in the comments, policy iteration and value iteration are applications of Dynamic Programming to reinforcement learning (RL). To address the remaining question from your comment: all value-based RL methods are well described as generalized policy iteration, as stated in your own reference, including model-free RL methods such as Sarsa and Q-learning.

We use the term generalized policy iteration (GPI) to refer to the general idea of letting policy-evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes. Almost all reinforcement learning methods are well described as GPI. That is, all have identifiable policies and value functions, with the policy always being improved with respect to the value function and the value function always being driven toward the value function for the policy, as suggested by the diagram to the right. If both the evaluation process and the improvement process stabilize, that is, no longer produce changes, then the value function and policy must be optimal.

Then we extend them to control following the general pattern of on-policy GPI, using $\epsilon$-greedy for action selection. We show results for n-step linear Sarsa on the Mountain Car problem.
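In its simplest model-based form, the evaluation/improvement interplay described in the first quote is plain policy iteration. The sketch below is only an illustration under assumed inputs: the two-state MDP and the names `P`, `backup`, `evaluate`, `improve` and `policy_iteration` are hypothetical. Methods such as Sarsa follow the same loop but replace the exact evaluation with sampled updates.

```python
# Hypothetical 2-state, 2-action MDP with a known model (illustrative only).
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)],
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)],
        1: [(1.0, 1, 2.0)]},
}


def backup(P, V, s, a, gamma):
    """One-step expected return of action a in state s under values V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def evaluate(P, policy, gamma, tol=1e-10):
    """Policy evaluation: drive V toward the value function of `policy`."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = backup(P, V, s, policy[s], gamma)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V


def improve(P, V, gamma):
    """Policy improvement: act greedily with respect to V."""
    return {s: max(P[s], key=lambda a: backup(P, V, s, a, gamma)) for s in P}


def policy_iteration(P, gamma=0.9):
    policy = {s: 0 for s in P}  # arbitrary starting policy
    while True:
        V = evaluate(P, policy, gamma)
        new_policy = improve(P, V, gamma)
        if new_policy == policy:  # both processes have stabilised
            return policy, V
        policy = new_policy


policy, V = policy_iteration(P)
print(policy, V)
```

The loop stops exactly when neither step produces a change, which is the stabilisation condition the quote describes.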

Caveat: all value-based methods are generalised policy iteration. Policy gradient is different, and common enough with plenty of variations, that it is not quite right IMO to say "almost all RL methods . . ." — Neil Slater, Nov 14 '23 at 08:05
@NeilSlater thanks for the correction. I intended to exclude policy gradient methods by saying 'almost all', but 'all value-based methods' is indeed much better. — cinch, Nov 14 '23 at 08:21
