This answer shows that the weights of linear and polynomial models can be fitted using this matrix operation:
$w = (X^TX)^{-1}X^Ty$
Therefore, algorithms such as gradient descent are not necessary for these functions. As I understand it, gradient descent for a linear regression model uses the derivative of the cost function with respect to each weight to move the weights toward the values where the cost is minimal.
Before asking about the connection between gradient descent and the equation above, let's break the equation into smaller steps:
$X = [1,2,3,4,5,6,7,8,9,10,11,12]$
$y=[2.3,2.33,2.29,2.3,2.36,2.4,2.46,2.5,2.48,2.43,2.38,2.35]$
Let's turn the vector $X$ into a matrix by appending a column of 1's, which will be used to fit the bias value:
\begin{bmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \\ 4 & 1 \\ 5 & 1 \\ 6 & 1 \\ 7 & 1 \\ 8 & 1 \\ 9 & 1 \\ 10 & 1 \\ 11 & 1 \\ 12 & 1 \end{bmatrix}
Transpose $X^T$:
\begin{bmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix}
Matrix multiplication $X^TX$:
\begin{bmatrix} 650 & 78 \\ 78 & 12 \end{bmatrix}
Inverse matrix operation $(X^TX)^{-1}$:
\begin{bmatrix} 0.00699301 & -0.04545455 \\ -0.04545455 & 0.37878788 \end{bmatrix}
Matrix multiplication $(X^TX)^{-1}X^T$:
array([[-0.03846154, -0.03146853, -0.02447552, -0.01748252, -0.01048951, -0.0034965 , 0.0034965 , 0.01048951, 0.01748252, 0.02447552, 0.03146853, 0.03846154], [ 0.33333333, 0.28787879, 0.24242424, 0.1969697 , 0.15151515, 0.10606061, 0.06060606, 0.01515152, -0.03030303, -0.07575758, -0.12121212, -0.16666667]])
Matrix vector multiplication $(X^TX)^{-1}X^Ty$:
\begin{bmatrix} 0.01174825 & 2.30530303 \end{bmatrix}
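For anyone who wants to reproduce the steps above, here is a minimal NumPy sketch. The variable names (`XtX`, `XtX_inv`, `w_lstsq`) are my own, and the final `np.linalg.lstsq` call is only included as a cross-check; it is not part of the formula itself.

```python
import numpy as np

# Data from the question.
x = np.arange(1, 13, dtype=float)
y = np.array([2.3, 2.33, 2.29, 2.3, 2.36, 2.4,
              2.46, 2.5, 2.48, 2.43, 2.38, 2.35])

# Design matrix: first column is x, second column is 1's for the bias.
X = np.column_stack([x, np.ones_like(x)])

XtX = X.T @ X                 # 2x2 matrix X^T X
XtX_inv = np.linalg.inv(XtX)  # (X^T X)^{-1}
w = XtX_inv @ X.T @ y         # [slope, bias]

# Cross-check with NumPy's least-squares solver (same problem, no explicit inverse).
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(XtX)
print(w)
print(w_lstsq)
```

Both `w` and `w_lstsq` should agree with the slope and bias shown above up to floating-point rounding.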
This appears to be the slope and bias that minimize the cost, but I don't understand exactly how it works.
It seems that an algorithm such as gradient descent would have taken a very large number of iterations to estimate the same slope and bias values.
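To make the comparison concrete, here is a rough sketch of plain batch gradient descent on the same data. It assumes a mean squared error cost, and the learning rate (`0.01`), iteration count (`20000`) and the function name `gradient_descent` are my own illustrative choices, not values taken from any reference.

```python
import numpy as np

# Same data and design matrix as in the question.
x = np.arange(1, 13, dtype=float)
y = np.array([2.3, 2.33, 2.29, 2.3, 2.36, 2.4,
              2.46, 2.5, 2.48, 2.43, 2.38, 2.35])
X = np.column_stack([x, np.ones_like(x)])


def gradient_descent(X, y, lr=0.01, n_iters=20000):
    """Batch gradient descent on the MSE cost (1/n) * ||Xw - y||^2."""
    n = len(y)
    w = np.zeros(X.shape[1])  # start from slope = 0, bias = 0
    for _ in range(n_iters):
        residual = X @ w - y
        grad = (2.0 / n) * (X.T @ residual)  # gradient of the MSE w.r.t. w
        w -= lr * grad
    return w


w_gd = gradient_descent(X, y)

# Closed-form solution of (X^T X) w = X^T y for comparison.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

print(w_gd)
print(w_closed)
```

With these settings the iterative result should end up very close to the closed-form one, but only after thousands of updates, whereas the matrix expression gets there in one step.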
How and why exactly does this equation work? Is it somehow related to differentiation? Can it be compared to algorithms such as gradient descent? If so, how? Is it possible to use this equation with sigmoid functions?