I am currently following Andrew Ng's Stanford Online Machine Learning course and decided to prove the formulas on my own, since that's how I understand them better. I am a bit stuck on the linear regression formula.
$$\textrm{Let } X \textrm{ represent the matrix of inputs, where } m \textrm{ is the number of input features and } n \textrm{ is the number of inputs}$$ $$X = \begin{bmatrix} 1 & x_{1}^{(1)} & \cdots & x_{m}^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1}^{(n)} & \cdots & x_{m}^{(n)} \end{bmatrix} $$
$$\textrm{Y is the vector of the outputs}$$ $$Y = \begin{pmatrix} y^{(1)} & \cdots & y^{(n)} \end{pmatrix} ^{T}$$
$$\theta \textrm{ is the vector of the line parameters, including the intercept } \theta_0 \textrm{ matching the leading column of ones in } X$$ $$\theta = \begin{pmatrix} \theta_{0} & \theta_{1} & \cdots & \theta_{m} \end{pmatrix} ^{T}$$
$$\textrm{And the loss function } L(\theta) \textrm{ would be } ||X\theta - Y||^2$$
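To keep the definitions straight, here is a minimal NumPy sketch with made-up toy data (the shapes and the loss expression are the only things taken from the setup above) that builds $X$, $Y$, $\theta$ and evaluates $L(\theta) = \|X\theta - Y\|^2$:

```python
import numpy as np

# Toy data, purely illustrative: n = 4 examples, m = 2 features.
# X gets a leading column of ones, so it is n x (m + 1).
rng = np.random.default_rng(0)
n, m = 4, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])
Y = rng.normal(size=(n, 1))          # outputs, n x 1
theta = rng.normal(size=(m + 1, 1))  # parameters, (m + 1) x 1

# L(theta) = ||X theta - Y||^2 = (X theta - Y)^T (X theta - Y)
residual = X @ theta - Y
loss = float(residual.T @ residual)  # equivalently np.sum(residual ** 2)
```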
So to minimize this, $$ \frac{\partial{L(\theta)}}{\partial{\theta}} = \frac{\partial{((X\theta - Y)^T (X\theta - Y))}}{\partial{\theta}} $$ $$ =\frac{\partial{(\theta^T X^T X\theta - \theta^T X^T Y - Y^T X\theta + Y^T Y)}}{\partial{\theta}} $$ $$ =2X^T X \theta - X^T Y - Y^T X $$
But according to Wikipedia, it's supposed to be, $$ 2X^T X \theta - 2X^T Y $$
But $X^T Y \neq Y^T X$, since $X$ isn't a vector (the two products don't even have the same dimensions), so where did I make a mistake? I've been trying to find it for a while now.
Alternatively, using the chain rule, $$ \frac{\partial{((X\theta - Y)^T (X\theta - Y))}}{\partial{\theta}} = \left(\frac{\partial (X\theta - Y)}{\partial \theta}\right)^T \frac{\partial{((X\theta - Y)^T (X\theta - Y))}}{\partial{(X\theta - Y)}}$$ $$ = 2X^T (X\theta - Y)$$ which is the same as Wikipedia.
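As a numerical sanity check (a sketch with arbitrary random data, not part of the course material), the factored gradient $2X^T(X\theta - Y)$, which expands to Wikipedia's $2X^TX\theta - 2X^TY$, can be compared against a central finite-difference approximation of $L$:

```python
import numpy as np

# Arbitrary random problem: n = 5 examples, m = 3 features plus intercept.
rng = np.random.default_rng(1)
n, m = 5, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])
Y = rng.normal(size=(n, 1))
theta = rng.normal(size=(m + 1, 1))

def loss(t):
    r = X @ t - Y
    return float(r.T @ r)

# Closed-form gradient: 2 X^T (X theta - Y)
analytic = 2 * X.T @ (X @ theta - Y)

# Central finite differences, one coordinate of theta at a time.
eps = 1e-6
numeric = np.zeros_like(theta)
for i in range(theta.size):
    e = np.zeros_like(theta)
    e[i] = eps
    numeric[i] = (loss(theta + e) - loss(theta - e)) / (2 * eps)

# The two gradients should agree to within floating-point error.
max_diff = float(np.max(np.abs(analytic - numeric)))
```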