68

It seems to me that the $V$ function can easily be expressed in terms of the $Q$ function, which would make the $V$ function superfluous. However, I'm new to reinforcement learning, so I guess I'm getting something wrong.

Definitions

Q- and V-learning are defined in the context of Markov Decision Processes (MDPs). An MDP is a 5-tuple $(S, A, P, R, \gamma)$, where

  • $S$ is a set of states (typically finite)
  • $A$ is a set of actions (typically finite)
  • $P(s, s', a) = P(s_{t+1} = s' | s_t = s, a_t = a)$ is the probability of going from state $s$ to state $s'$ by taking action $a$.
  • $R(s, s', a) \in \mathbb{R}$ is the immediate reward received after going from state $s$ to state $s'$ with action $a$. (It seems to me that usually only $s'$ matters.)
  • $\gamma \in [0, 1]$ is called the discount factor and determines whether one focuses on immediate rewards ($\gamma = 0$), the total reward ($\gamma = 1$), or some trade-off.

According to Reinforcement Learning: An Introduction by Sutton and Barto, a policy $\pi$ is a function $\pi: S \rightarrow A$ (it could also be probabilistic).

According to Mario Martin's slides, the $V$ function is $$V^\pi(s) = E_\pi \{R_t | s_t = s\} = E_\pi \{\sum_{k=0}^\infty \gamma^k r_{t+k+1} | s_t = s\}$$ and the $Q$ function is $$Q^\pi(s, a) = E_\pi \{R_t | s_t = s, a_t = a\} = E_\pi \{\sum_{k=0}^\infty \gamma^k r_{t+k+1} | s_t = s, a_t=a\}$$

My thoughts

The $V$ function states what the expected overall value (not reward!) of a state $s$ under the policy $\pi$ is.

The $Q$ function states what the value of a state $s$ and an action $a$ under the policy $\pi$ is.

This means, $$Q^\pi(s, \pi(s)) = V^\pi(s)$$

Right? So why do we have the value function at all? (I guess I mixed something up.)

Neil Slater
Martin Thoma

6 Answers

58

$V^\pi(s)$ is the "state" value function of an MDP (Markov Decision Process). It's the expected return starting from state $s$ following policy $\pi$:

$$V^\pi(s) = E_{\pi} \{G_t \vert s_t = s\} $$

$G_t$ is the total DISCOUNTED reward from time step $t$ onwards, as opposed to $R_t$, which is the immediate reward. Here you are taking the expectation over ALL actions according to the policy $\pi$.

$Q^\pi(s, a)$ is the "state action" value function, also known as the quality function. It is the expected return starting from state $s$, taking action $a$, then following policy $\pi$. It's focusing on the particular action at the particular state.

$$Q^\pi(s, a) = E_\pi \{G_t | s_t = s, a_t = a\}$$

The relationship between $Q^\pi$ and $V^\pi$ (the value of being in that state) is:

$$V^\pi(s) = \sum_{a \in A} \pi(a|s) \, Q^\pi(s,a)$$

You sum every action value multiplied by the probability of taking that action (given by the policy $\pi(a|s)$).

If you think of the grid-world example, you multiply the probability of each move (up/down/left/right) by the one-step-ahead state value that the move leads to.
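To make this sum concrete, here is a minimal sketch in Python/NumPy (the number of states and actions and all tabular values are made up for illustration) that recovers $V^\pi$ from $Q^\pi$ and a stochastic policy:

```python
import numpy as np

# Hypothetical tabular setup: 3 states, 2 actions (all numbers are made up).
Q = np.array([[1.0, 0.5],     # Q[s, a]: value of taking action a in state s,
              [0.2, 0.8],     #          then following pi
              [0.0, 0.3]])
pi = np.array([[0.9, 0.1],    # pi[s, a]: probability of action a in state s
               [0.5, 0.5],
               [0.2, 0.8]])

# V^pi(s) = sum_a pi(a|s) * Q^pi(s, a): a row-wise weighted sum.
V = (pi * Q).sum(axis=1)
print(V)  # e.g. V[0] = 0.9 * 1.0 + 0.1 * 0.5 = 0.95
```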

Shayan Shafiq
aerin
36

Q-values are a great way to make actions explicit, so you can deal with problems where the transition function is not available (model-free). However, when your action space is large, things are not so nice and Q-values are not so convenient. Think of a huge number of actions or even continuous action spaces.

From a sampling perspective, the dimensionality of $Q(s, a)$ is higher than that of $V(s)$, so it might get harder to collect enough $(s, a)$ samples compared with $(s)$ samples. If you have access to the transition function, sometimes $V$ is good enough.

There are also other uses where both are combined. For instance, the advantage function where $A(s, a) = Q(s, a) - V(s)$. If you are interested, you can find a recent example using advantage functions here:

Dueling Network Architectures for Deep Reinforcement Learning

by Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot and Nando de Freitas.
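As a rough sketch of the advantage function itself (not of the dueling architecture from the paper; all tabular values below are made up), $A(s, a)$ measures how much better an action is than the policy's average in that state:

```python
import numpy as np

# Made-up tabular values for a single state with 3 actions.
Q_s = np.array([2.0, 3.5, 1.5])   # Q(s, a) for each action a
pi_s = np.array([0.2, 0.5, 0.3])  # pi(a|s)

V_s = pi_s @ Q_s                  # V(s) = sum_a pi(a|s) Q(s, a)  -> 2.6
A_s = Q_s - V_s                   # advantage A(s, a) = Q(s, a) - V(s)
print(A_s)                        # positive advantage = better than the policy's average
```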

Juan Leni
19

You have it right: the $V$ function gives you the value of a state, and $Q$ gives you the value of an action in a state (following a given policy $\pi$). I found the clearest explanation of Q-learning and how it works in Tom Mitchell's book "Machine Learning" (1997), ch. 13, which is downloadable. $V$ is defined as the sum of an infinite series, but that's not important here. What matters is that the $Q$ function is defined as

$$ Q(s,a) = r(s,a) + \gamma V^{*}(\delta(s,a)) $$ where $V^*$ is the value of a state under an optimal policy, which you don't know. However, $V^*$ has a nice characterization in terms of $Q$: $$ V^{*}(s)= \max_{a'} Q(s,a') $$ Computing $Q$ is done by replacing $V^*$ in the first equation, which gives $$ Q(s, a) = r(s, a) + \gamma \max_{a'} Q(\delta(s, a), a') $$

This may seem an odd recursion at first, because it expresses the Q value of an action in the current state in terms of the best Q value of a successor state, but it makes sense when you look at how the backup process uses it: the exploration process stops when it reaches a goal state and collects the reward, which becomes that final transition's Q value. Now, in a subsequent training episode, when the exploration process reaches that predecessor state, the backup process uses the above equality to update the current Q value of the predecessor state. The next time its predecessor is visited, that state's Q value gets updated, and so on back down the line (Mitchell's book describes a more efficient way of doing this by storing all the computations and replaying them later). Provided every state-action pair is visited infinitely often, this process eventually computes the optimal $Q$.

Sometimes you will see a learning rate $\alpha$ applied to control how much Q actually gets updated: $$ Q(s, a) = (1-\alpha)Q(s, a) + \alpha(r(s, a) + \gamma \max_{a'} Q(s',a')) $$ $$ = Q(s, a) + \alpha(r(s, a) + \gamma \max_{a'} Q(s',a') - Q(s,a)) $$ Notice that the update to the Q value now does depend on the current Q value. Mitchell's book also explains why that is and why you need $\alpha$: it is for stochastic MDPs. Without $\alpha$, every time a state-action pair was attempted there would be a different reward, so the $\hat{Q}$ function would bounce all over the place and not converge. $\alpha$ is there so that new knowledge is only accepted in part. Initially $\alpha$ is set high so that the current (mostly random) values of Q are less influential. $\alpha$ is decreased as training progresses, so that new updates have less and less influence, and then Q-learning converges.
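As a minimal sketch of that update rule (tabular Q-learning in Python/NumPy; the environment object `env`, its `reset()`/`step()` interface, and all hyperparameters are assumptions for illustration, and $\alpha$ is kept fixed for brevity rather than decayed as described above):

```python
import numpy as np

def q_learning(env, n_states, n_actions,
               episodes=5000, alpha=0.5, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning. `env` is assumed to provide reset() -> s and
    step(a) -> (s_next, reward, done)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q  # the optimal state value is then V*(s) = max_a Q[s, a]
```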

Motorhead
8

Here is a more detailed explanation of the relationship between the state value and the action value in Aaron's answer. Let's first take a look at the definitions of the value function and the action value function under policy $\pi$:
\begin{align} &v_{\pi}(s)=E{\left[G_t|S_t=s\right]} \\ &q_{\pi}(s,a)=E{\left[G_t|S_t=s, A_t=a\right]} \end{align}
where $G_t=\sum_{k=0}^{\infty}\gamma^kR_{t+k+1}$ is the return at time $t$. The relationship between these two value functions can be derived as
\begin{align} v_{\pi}(s)&=E{\left[G_t|S_t=s\right]} \nonumber \\ &=\sum_{g_t} p(g_t|S_t=s)g_t \nonumber \\ &= \sum_{g_t}\sum_{a}p(g_t, a|S_t=s)g_t \nonumber \\ &= \sum_{a}p(a|S_t=s)\sum_{g_t}p(g_t|S_t=s, A_t=a)g_t \nonumber \\ &= \sum_{a}p(a|S_t=s)E{\left[G_t|S_t=s, A_t=a\right]} \nonumber \\ &= \sum_{a}p(a|S_t=s)q_{\pi}(s,a) \end{align}
The above equation is important. It describes the relationship between two fundamental value functions in reinforcement learning, and it is valid for any policy. Moreover, if we have a deterministic policy, then $v_{\pi}(s)=q_{\pi}(s,\pi(s))$. Hope this is helpful for you. (See the Bellman optimality equation for more on this.)
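As a quick numerical sanity check of the derived identity $v_\pi(s)=\sum_a \pi(a|s)\,q_\pi(s,a)$, here is a sketch on a made-up two-state, two-action MDP (the transition tensor `P`, rewards `R`, and policy `pi` are arbitrary illustrative values): it evaluates $v_\pi$ exactly, builds $q_\pi$ from it via the Bellman equation, and confirms the identity.

```python
import numpy as np

# Toy MDP with 2 states and 2 actions; all numbers are made up.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # P[a, s, s'] for action 0
              [[0.5, 0.5], [0.3, 0.7]]])   # ... and for action 1
R = np.array([[1.0, 0.0],                  # R[s, a]: expected immediate reward
              [0.5, 2.0]])
pi = np.array([[0.6, 0.4],                 # pi[s, a]: action probabilities
               [0.3, 0.7]])

# Exact policy evaluation: v_pi solves v = r_pi + gamma * P_pi v
r_pi = (pi * R).sum(axis=1)
P_pi = np.einsum('sa,ast->st', pi, P)
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Action values: q_pi(s, a) = R(s, a) + gamma * E[v_pi(s') | s, a]
q_pi = R + gamma * np.einsum('ast,t->sa', P, v_pi)

# The derived identity: v_pi(s) == sum_a pi(a|s) q_pi(s, a)
print(np.allclose(v_pi, (pi * q_pi).sum(axis=1)))  # True
```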

Zephyr
Jie Shi
  • For anyone confused about the initial expansion of $E[G_t | S_t =s]$, that is just Adam's Law / Law of Iterated Expectation for discrete variables, with an extra conditioning on the event $S_t =s$ (this extra conditioning does not change the essence of Adam's law). – Abhishek Divekar Oct 15 '23 at 10:48
0

Value function

  1. The value function estimates the expected cumulative reward of being in a particular state.
  2. It is a state function, meaning that it only takes the state as input.
  3. The value function can be used to evaluate different policies, and to find the optimal policy.

Q-function

  1. The Q function estimates the expected cumulative reward of taking a particular action in a given state.
  2. It is a state-action function, meaning that it takes both the state and the action as input.
  3. The Q function is used to learn an optimal policy, which is a policy that maximizes the expected cumulative reward.

Key Difference between Q-Function and Value Function

The Q function and the value function are both used to estimate the expected cumulative reward, but they condition on different things. The Q function takes both the state and the action as input, while the value function only takes the state as input. Because the Q function scores each individual action, a greedy policy can be read off it directly; the value function alone can evaluate a policy, but choosing actions from it requires a model of the transitions to look one step ahead (see the sketch below).
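Here is that contrast as a small sketch (all tabular values, and the model `P`/`R` used in the V-based case, are hypothetical): acting greedily from $Q$ is a direct argmax, while acting greedily from $V$ alone requires a transition model for a one-step lookahead.

```python
import numpy as np

# Hypothetical tabular quantities for a tiny MDP (2 states, 2 actions).
gamma = 0.9
Q = np.array([[0.5, 1.2], [0.9, 0.4]])      # Q(s, a)
V = np.array([1.2, 0.9])                    # V(s)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[a, s, s'] (model, needed only
              [[0.6, 0.4], [0.1, 0.9]]])    #  for the V-based case)
R = np.array([[0.0, 1.0], [0.5, 0.0]])      # R(s, a)

s = 0
# From Q: the greedy action is a direct argmax -- no model required.
a_from_Q = int(np.argmax(Q[s]))

# From V alone: use the model to evaluate each action with a one-step lookahead.
lookahead = R[s] + gamma * P[:, s, :] @ V   # one value per action
a_from_V = int(np.argmax(lookahead))

print(a_from_Q, a_from_V)
```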

Moreover, the concepts of the Q function and the value function are illustrated with a grid world in this video: https://www.youtube.com/watch?v=GzHvZ_sSvQE

yesman
0

The value function is an abstract formulation of utility, and the Q-function is the one used by the Q-learning algorithm.

emmanuel