
I'm trying to understand policy iteration in the context of RL. I read an article presenting it and, at some point, pseudocode of the algorithm is given: [image: pseudocode of the policy iteration algorithm]

What I can't understand is this line:

$$V(s) \leftarrow \sum_{s', r} p(s', r \mid s, \pi(s)) \, \bigl[ r + \gamma V(s') \bigr]$$

From what I understand, policy iteration is a model-free algorithm, which means that it doesn't need to know the environment's dynamics. But, in this line, we need $p(s',r \mid s, \pi(s))$ (which, in my understanding, is the transition function of the MDP, giving the probability of landing in state $s'$ with reward $r$, given the previous state $s$ and the action taken) in order to compute $V(s)$. So I don't understand how we can compute $V(s)$ with the quantity $p(s',r \mid s, \pi(s))$, since it is a property of the environment.
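For concreteness, here is a minimal sketch of that evaluation sweep in Python, assuming a hypothetical tabular representation in which `P[s][a]` is a list of `(prob, next_state, reward)` tuples encoding $p(s', r \mid s, a)$ (names and layout are illustrative, not from the article); the point is that this model has to be supplied explicitly:

```python
import numpy as np

def evaluate_policy(P, policy, num_states, gamma=0.9, theta=1e-8):
    # P[s][a] = [(prob, next_state, reward), ...] -- the MDP dynamics
    # p(s', r | s, a), passed in explicitly (hypothetical representation).
    V = np.zeros(num_states)
    while True:
        delta = 0.0
        for s in range(num_states):
            a = policy[s]  # deterministic policy: a = pi(s)
            # V(s) <- sum_{s', r} p(s', r | s, pi(s)) * [r + gamma * V(s')]
            new_v = sum(prob * (r + gamma * V[s_next])
                        for prob, s_next, r in P[s][a])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:  # stop once the value function has converged
            return V
```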

nbro

2 Answers


Everything you say in your post is correct, apart from the wrong assumption that policy iteration is model-free. Policy iteration (PI) is a model-based algorithm, for exactly the reason you mention: it needs the environment's dynamics $p(s', r \mid s, a)$ to perform its updates.

See my answer to the question What's the difference between model-free and model-based reinforcement learning?.
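To make this concrete, here is a rough sketch of the policy improvement step (using the same hypothetical tabular format, `P[s][a]` = list of `(prob, next_state, reward)` tuples, as in the sketch in the question): the greedy action is an argmax over expected returns computed from $p(s', r \mid s, a)$, so this step also requires the model.

```python
import numpy as np

def improve_policy(P, V, policy, num_actions, gamma=0.9):
    # Greedy improvement:
    # pi(s) <- argmax_a sum_{s', r} p(s', r | s, a) * [r + gamma * V(s')]
    # Again, the dynamics P are required -- this is what makes PI model-based.
    stable = True
    for s in range(len(policy)):
        q_values = [sum(prob * (r + gamma * V[s_next])
                        for prob, s_next, r in P[s][a])
                    for a in range(num_actions)]
        best_a = int(np.argmax(q_values))
        if best_a != policy[s]:
            stable = False
        policy[s] = best_a
    return policy, stable
```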

nbro

The Policy Iteration algorithm (given in the question) is model-based.

However, note that there exist methods that fall into the Generalized Policy Iteration category, such as SARSA, which are model-free.
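For contrast, here is a minimal sketch of the SARSA update (assuming a hypothetical NumPy table `Q[s, a]`): it uses only a sampled transition $(s, a, r, s', a')$, so the model $p(s', r \mid s, a)$ never appears.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # Model-free TD update: only the experienced transition (s, a, r, s', a')
    # is needed, not the transition probabilities p(s', r | s, a).
    # Q is assumed to be a NumPy array indexed by (state, action).
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```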

"From what I understand, policy iteration is a model-free algorithm"

Maybe this was referring to generalized policy iteration methods.


(Answer based on comments from @Neil Slater.)

dasWesen