In these lecture notes (http://web.stanford.edu/class/archive/ee/ee263/ee263.1082/lectures/ls.pdf), the derivation for least squares is as follows.
$\|Ax - y\|^2 = x^\intercal A^\intercal Ax - 2 y^\intercal Ax + y^\intercal y$
Set the gradient w.r.t. $x$ to 0:
$2A^\intercal Ax - 2A^\intercal y = 0$.
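
As a sanity check, the claimed gradient $2A^\intercal Ax - 2A^\intercal y$ can be compared against finite differences of $\|Ax - y\|^2$. This is only a minimal sketch in plain Python: the values of `A`, `y`, and `x` are made up, and the helper names `f`, `claimed_gradient`, and `numerical_gradient` are illustrative and do not come from the lecture notes.

```python
# Finite-difference check of the gradient formula.
# A, y, x below are arbitrary made-up values, not taken from the notes.

def f(A, y, x):
    """Return ||Ax - y||^2, which expands to x'A'Ax - 2y'Ax + y'y."""
    residual = [sum(a * xj for a, xj in zip(row, x)) - yi
                for row, yi in zip(A, y)]
    return sum(r * r for r in residual)

def claimed_gradient(A, y, x):
    """Return 2A'Ax - 2A'y, computed as 2A'(Ax - y)."""
    residual = [sum(a * xj for a, xj in zip(row, x)) - yi
                for row, yi in zip(A, y)]
    return [2 * sum(A[i][j] * residual[i] for i in range(len(A)))
            for j in range(len(x))]

def numerical_gradient(A, y, x, h=1e-6):
    """Central differences: one partial derivative per component of x."""
    grad = []
    for j in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += h
        x_minus[j] -= h
        grad.append((f(A, y, x_plus) - f(A, y, x_minus)) / (2 * h))
    return grad

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = [1.0, 0.0, -1.0]
x = [0.5, -0.25]

print(claimed_gradient(A, y, x))
print(numerical_gradient(A, y, x))
```

The two printed vectors should agree up to floating-point error. Each entry of the numerical gradient is a partial derivative of the scalar $\|Ax - y\|^2$ with respect to one component of $x$.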
How did this step come about? Does gradient mean derivative here? And what are some good resources for learning matrix calculus (beyond Wikipedia; I want to develop intuition)? Thanks!