Q-learning uses an exploratory policy, derived from the current estimate of the $Q$ function, such as the $\epsilon$-greedy policy, to select the action $a$ from the current state $s$. After having taken this action $a$ from $s$, the reward $r$ and the next state $s'$ are observed. At this point, to update the estimate of the $Q$ function, you use a target that assumes that the greedy action is taken from the next state $s'$. The greedy action is selected by the $\operatorname{max}$ operator, which can thus be thought of as an implicit policy (but this terminology isn't common, AFAIK), so, in this context, the greedy action is the action associated with the highest $Q$ value for the state $s'$.
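As a rough illustration (a minimal tabular sketch in Python, not taken from the original answer; the array `q` of shape `(n_states, n_actions)`, the helper `epsilon_greedy` and the hyperparameters `alpha`, `gamma`, `eps` are names I'm assuming here), the behaviour action $a$ comes from the $\epsilon$-greedy policy, while the target uses the $\max$ operator over the $Q$ values of $s'$:

```python
import numpy as np

def epsilon_greedy(q, s, n_actions, eps, rng):
    """Exploratory policy derived from the current estimate of Q."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))   # explore: random action
    return int(np.argmax(q[s]))               # exploit: greedy action for state s

def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    """Q-learning target: assumes the greedy action is taken from s'."""
    target = r + gamma * np.max(q[s_next])    # max over Q(s', .) = implicit greedy policy
    q[s, a] += alpha * (target - q[s, a])     # move Q(s, a) towards the target
```

In the interaction loop, you would call `epsilon_greedy` to pick $a$, step the environment to get $r$ and $s'$, and then call `q_learning_update`: note that $a'$ is never needed to compute the target.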
In SARSA, no $\operatorname{max}$ operator is used, and you derive a policy (e.g. the $\epsilon$-greedy policy) from the current estimate of the $Q$ function to select both $a$ (from $s$) and $a'$ (from $s'$).
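For comparison, here is the corresponding SARSA update under the same assumed tabular setup (reusing the `epsilon_greedy` helper from the sketch above); the only difference is that the target bootstraps on the action $a'$ that was actually selected by the $\epsilon$-greedy policy in $s'$:

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """SARSA target: uses the action a' actually selected (e.g. eps-greedily) in s'."""
    target = r + gamma * q[s_next, a_next]    # no max: bootstrap on the action you will take next
    q[s, a] += alpha * (target - q[s, a])     # move Q(s, a) towards the target
```

So, in SARSA, you must already have sampled $a'$ (and you then execute it) before you can update, whereas in Q-learning the update does not depend on the action you actually take in $s'$.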
To conclude, in all cases the policies are implicit, in the sense that they are derived from the current estimate of the $Q$ function, but this isn't common terminology. See also this answer, where I describe the differences between Q-learning and SARSA in more detail and also show the pseudocode of both algorithms, which you should read (multiple times) in order to fully understand their differences.
"$A'$ from policy"? Does it mean that for Q-learning I do need an explicit policy ($\max Q$, greedy), because I know which $A'$ I will take to compute the target (the one that gives me the $\max Q$, regardless of the action actually taken in the environment)? And for SARSA, I pick the action with some probability and compute the target based on that $A'$ (the one actually taken in the environment). Could you elaborate on that please? – Novak Jan 29 '20 at 09:36