
I have gone through this answer regarding the difference between Q-value and state value. My specific question is:

If the Q-value gives the immediate reward after taking a particular action and then the return from following a policy afterwards, how will the expected return change if the state-value function also follows the same policy? For example, in the image below, how do the Q-value and the state value change depending on the policy?
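For reference, the standard definitions I am comparing, for a policy $\pi$ and return $G_t$ (the cumulative, possibly discounted, reward from time $t$), are:

$$V_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right], \qquad Q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right],$$

$$\text{so that } V_\pi(s) = \sum_a \pi(a \mid s)\, Q_\pi(s, a), \text{ and for a deterministic policy } V_\pi(s) = Q_\pi(s, \pi(s)).$$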

[Image: a small MDP starting from state 0, with three candidate trajectories labelled $\pi^1$, $\pi^2$ and $\pi^3$]

  • One thing to note, and that may help you understand the difference, is that you have not shown actual policies for $\pi^1, \pi^2, \pi^3$ . . . you have instead shown trajectories. To be a complete policy for instance, $\pi^3$ needs to describe what choice would be made in state 1, even though it may not be reachable when starting from 0. But to calculate $Q_{\pi^3}(0, A)$ you need that information, even if $\pi^3(0) = B$ – Neil Slater Jun 07 '19 at 08:56
  • Hi @NeilSlater, thanks for taking the time to reply. My understanding of a policy was a sequence of actions, but I seem to have misinterpreted it. So, if a policy is a single action chosen at each state (please correct me if I am wrong), how many different policies can exist at the start of the episode? In the example you have mentioned, what would the Q-value be for the 3rd policy ($\pi^3$)? – kanishka Jun 07 '19 at 10:28
  • Your "policy" $\pi^3$ is not actually a policy. So you cannot ask what its Q-value function would be. There are an infinite number of policies if you allow for random choice. However, if you go with deterministic policies, you can see there are two places in your simple MDP where a choice between 2 actions can be made. So there are four different deterministic policies - there is one that you don't show, because the trajectory (list of states and actions taken) would be the same as you show for $\pi^3$ (in fact you cannot tell which of these policies your list for $\pi^3$ represents). – Neil Slater Jun 07 '19 at 10:38
  • To be clear, a policy is not "a sequence of actions". It is a function that chooses an action depending on the state. This is usually written $\pi(s)$ for a deterministic policy, or $\pi(a|s)$ for a stochastic policy. – Neil Slater Jun 07 '19 at 10:41
  • @NeilSlater, I understood the explanation for a "policy" not being a sequence of actions. I am trying to wrap my head around the difference between Q-value and state value, and understanding the policy w.r.t. the Q-value and the state value seems to be the differentiator. I was going through your answer to another question on this platform; the OP talked about following the policy in the comments to your answer. What would following a policy mean? – kanishka Jun 07 '19 at 13:07
  • "Following a policy" means at each time step, using the policy function to select the next action depending on the current state. – Neil Slater Jun 07 '19 at 13:19
  • To summarise: we will have a policy function/estimator that gives the action to take in a state. The state-value function calculates the cumulative reward from the actions taken by following that fixed policy through an episode. The Q-value function calculates the next-step reward of an arbitrarily taken action, and then follows the policy (i.e. takes the actions the policy function gives) to calculate the rest of the cumulative reward (see the sketch after these comments). Does that make sense? – kanishka Jun 07 '19 at 13:27
  • Yes. You are mixing up, in the description, what the value functions do with how they might be learned, but the relationship between Q and V looks correct. – Neil Slater Jun 07 '19 at 14:16
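To make the relationship summarised in the last two comments concrete, here is a minimal sketch in Python. The MDP, its rewards, and the policy `pi` below are hypothetical (they are not the MDP from the image); the point is only that $V_\pi$ follows the policy from the start, while $Q_\pi(s, a)$ fixes the first action and follows the policy afterwards, so computing $Q_\pi(0, A)$ needs the policy's choice at state 1 even when $\pi(0) = B$.

```python
# A minimal sketch (not the MDP from the image): a tiny deterministic,
# episodic MDP with made-up rewards, illustrating
#   Q_pi(s, a) = r(s, a) + V_pi(s')   and   V_pi(s) = Q_pi(s, pi(s))
# for a fixed deterministic policy pi (no discounting, for simplicity).

# Transition/reward table: (state, action) -> (next_state, reward).
# States "T1"/"T2" are terminal. All numbers are hypothetical.
MDP = {
    (0, "A"): (1, 1.0),
    (0, "B"): ("T1", 2.0),
    (1, "A"): ("T2", 5.0),
    (1, "B"): ("T2", 0.0),
}
TERMINAL = {"T1", "T2"}

def v(state, policy):
    """State value: follow the policy from `state` until termination."""
    if state in TERMINAL:
        return 0.0
    next_state, reward = MDP[(state, policy[state])]
    return reward + v(next_state, policy)

def q(state, action, policy):
    """Action value: take `action` now, then follow the policy afterwards."""
    next_state, reward = MDP[(state, action)]
    return reward + v(next_state, policy)

# A deterministic policy must specify an action for *every* non-terminal
# state, even states it never visits when starting from state 0.
pi = {0: "B", 1: "A"}

print(v(0, pi))        # V_pi(0) = 2.0 (policy chooses B at state 0)
print(q(0, "A", pi))   # Q_pi(0, A) = 1.0 + 5.0 = 6.0 (take A, then follow pi)
print(q(0, "B", pi))   # Q_pi(0, B) = V_pi(0) = 2.0
```

With these made-up numbers, $V_\pi(0) = 2$ while $Q_\pi(0, A) = 6$: the two quantities differ exactly when the first action queried is not the one the policy would have picked.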

0 Answers