
I'm trying to understand policy iteration in the context of RL. I read an article presenting it and, at some point, pseudocode of the algorithm is given: [image: pseudocode of the policy iteration algorithm]

What I can't understand is this line:

$$V(s) \leftarrow \sum_{s', r} p(s', r \mid s, \pi(s)) \, \bigl[ r + \gamma V(s') \bigr]$$

From what I understand, policy iteration is a model-free algorithm, which means that it doesn't need to know the environment's dynamics. But, in this line, we need $p(s',r \mid s, \pi(s))$ (which, in my understanding, is the transition function of the MDP, giving the probability of landing in state $s'$ with reward $r$, given the previous state $s$ and the action taken) in order to compute $V(s)$. So I don't understand how we can compute $V(s)$ with the quantity $p(s',r \mid s, \pi(s))$, since it is a property of the environment.
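For concreteness, here is a minimal sketch of that evaluation sweep in Python, assuming a hypothetical tabular representation in which `P[s][a]` is a list of `(prob, next_state, reward)` tuples encoding $p(s', r \mid s, a)$ (names and layout are illustrative, not from the article); the point is that this model has to be supplied explicitly:

```python
import numpy as np

def evaluate_policy(P, policy, num_states, gamma=0.9, theta=1e-8):
    # P[s][a] = [(prob, next_state, reward), ...] -- the MDP dynamics
    # p(s', r | s, a), passed in explicitly (hypothetical representation).
    V = np.zeros(num_states)
    while True:
        delta = 0.0
        for s in range(num_states):
            a = policy[s]  # deterministic policy: a = pi(s)
            # V(s) <- sum_{s', r} p(s', r | s, pi(s)) * [r + gamma * V(s')]
            new_v = sum(prob * (r + gamma * V[s_next])
                        for prob, s_next, r in P[s][a])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:  # stop once the value function has converged
            return V
```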

nbro

2 Answers


Everything you say in your post is correct, apart from the wrong assumption that policy iteration is model-free. Policy iteration (PI) is a model-based algorithm, for exactly the reason you mention: it needs the environment's dynamics $p(s', r \mid s, a)$ to perform its updates.

See my answer to the question What's the difference between model-free and model-based reinforcement learning?.
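To make this concrete, here is a rough sketch of the policy improvement step (using the same hypothetical tabular format, `P[s][a]` = list of `(prob, next_state, reward)` tuples, as in the sketch in the question): the greedy action is an argmax over expected returns computed from $p(s', r \mid s, a)$, so this step also requires the model.

```python
import numpy as np

def improve_policy(P, V, policy, num_actions, gamma=0.9):
    # Greedy improvement:
    # pi(s) <- argmax_a sum_{s', r} p(s', r | s, a) * [r + gamma * V(s')]
    # Again, the dynamics P are required -- this is what makes PI model-based.
    stable = True
    for s in range(len(policy)):
        q_values = [sum(prob * (r + gamma * V[s_next])
                        for prob, s_next, r in P[s][a])
                    for a in range(num_actions)]
        best_a = int(np.argmax(q_values))
        if best_a != policy[s]:
            stable = False
        policy[s] = best_a
    return policy, stable
```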

nbro

The Policy Iteration algorithm (given in the question) is model-based.

However, note that there exist methods that fall into the Generalized Policy Iteration category, such as SARSA, which are model-free.
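For contrast, here is a minimal sketch of the SARSA update (assuming a hypothetical NumPy table `Q[s, a]`): it uses only a sampled transition $(s, a, r, s', a')$, so the model $p(s', r \mid s, a)$ never appears.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # Model-free TD update: only the experienced transition (s, a, r, s', a')
    # is needed, not the transition probabilities p(s', r | s, a).
    # Q is assumed to be a NumPy array indexed by (state, action).
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```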

"From what I understand, policy iteration is a model-free algorithm"

Maybe this was referring to generalized policy iteration methods.


(Answer based on comments from @Neil Slater.)

dasWesen