Suppose I have a vector $y$ of dimension $N \times 1$, a matrix $X$ of dimension $N \times p$, and a vector $\beta$ of dimension $p \times 1$. I wish to differentiate the matrix expression
$RSS(\beta) = (y-X\beta)^T(y-X\beta)$ with respect to $\beta$.
I know that, in general, $\frac{d}{dx} x^Tx = 2x$ for a vector $x$, so (using the chain rule) the result should be something like:
$\frac {\partial RSS}{\partial \beta} = 2 (y-X\beta) \frac {\partial} {\partial \beta} (y-X\beta)$, and I know that the final answer is $-2 X^T(y-X\beta)$, but I am confused about how this is obtained. Why does $X^T$ appear on the left? I'm not very experienced with differentiating matrix expressions, so any general principle that explains this would be much appreciated.
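
For what it's worth, the stated answer does seem to check out numerically. Below is a minimal sketch in plain Python that compares $-2X^T(y-X\beta)$ against central finite differences of $RSS(\beta)$. The function names (`rss`, `grad_closed_form`, `grad_finite_diff`) and the random test data are my own, chosen purely for illustration.

```python
import random

def residual(y, X, beta):
    """Return the N residuals y - X beta as a list."""
    return [yi - sum(xij * bj for xij, bj in zip(row, beta))
            for yi, row in zip(y, X)]

def rss(y, X, beta):
    """Residual sum of squares (y - X beta)^T (y - X beta)."""
    return sum(r * r for r in residual(y, X, beta))

def grad_closed_form(y, X, beta):
    """The stated answer -2 X^T (y - X beta), a list of length p."""
    r = residual(y, X, beta)
    # Entry j of X^T r is the dot product of column j of X with r.
    return [-2.0 * sum(X[i][j] * r[i] for i in range(len(y)))
            for j in range(len(beta))]

def grad_finite_diff(y, X, beta, h=1e-5):
    """Approximate each partial derivative of RSS by central differences."""
    grad = []
    for j in range(len(beta)):
        up = list(beta)
        down = list(beta)
        up[j] += h
        down[j] -= h
        grad.append((rss(y, X, up) - rss(y, X, down)) / (2 * h))
    return grad

# Arbitrary illustrative data (not from any real problem).
random.seed(0)
N, p = 5, 3
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(N)]
y = [random.gauss(0, 1) for _ in range(N)]
beta = [random.gauss(0, 1) for _ in range(p)]

closed = grad_closed_form(y, X, beta)
numeric = grad_finite_diff(y, X, beta)
max_abs_diff = max(abs(a - b) for a, b in zip(closed, numeric))
print(max_abs_diff)  # should be tiny if the closed form is right
```

The closed form has $p$ entries, one per coordinate of $\beta$, which is at least consistent with $X^T$ (of dimension $p \times N$) multiplying the $N \times 1$ residual. So I trust the result; it is the derivation itself that I can't follow.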