How does "linear algebraic" weight training function work?

Question

This answer shows that linear and polynomial function weights can be trained using this matrix operation:

$w = (X^TX)^{-1}X^Ty$

Therefore, algorithms such as gradient descent are not necessary for these functions. As I understand it, gradient descent for a linear regression model uses the derivative of the cost with respect to each weight to find the weights at which the cost function is minimal.

Before asking about the connection between gradient descent and the equation above, let's break the equation into smaller steps:

$X = [1,2,3,4,5,6,7,8,9,10,11,12]$

$y=[2.3,2.33,2.29,2.3,2.36,2.4,2.46,2.5,2.48,2.43,2.38,2.35]$

Let's turn the $X$ vector into a matrix by adding a column of 1's, which will be used to train the bias value:

$$\begin{bmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \\ 4 & 1 \\ 5 & 1 \\ 6 & 1 \\ 7 & 1 \\ 8 & 1 \\ 9 & 1 \\ 10 & 1 \\ 11 & 1 \\ 12 & 1 \end{bmatrix}$$

Transpose $X^T$:

$$\begin{bmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix}$$

Matrix multiplication $X^TX$:

$$\begin{bmatrix} 650 & 78 \\ 78 & 12 \end{bmatrix}$$

Inverse matrix operation $(X^TX)^{-1}$:

$$\begin{bmatrix} 0.00699301 & -0.04545455 \\ -0.04545455 & 0.37878788 \end{bmatrix}$$

Matrix multiplication $(X^TX)^{-1}X^T$:

array([[-0.03846154, -0.03146853, -0.02447552, -0.01748252, -0.01048951,
        -0.0034965 ,  0.0034965 ,  0.01048951,  0.01748252,  0.02447552,
         0.03146853,  0.03846154],
       [ 0.33333333,  0.28787879,  0.24242424,  0.1969697 ,  0.15151515,
         0.10606061,  0.06060606,  0.01515152, -0.03030303, -0.07575758,
        -0.12121212, -0.16666667]])

Matrix vector multiplication $(X^TX)^{-1}X^Ty$:

$$\begin{bmatrix} 0.01174825 & 2.30530303 \end{bmatrix}$$

This seems to be the perfect slope and bias for minimizing the cost, but I'm unable to understand exactly how it works.

It seems that an algorithm such as gradient descent would have taken an extremely large number of iterations to estimate the perfect slope and bias values.
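The steps above can be reproduced with a short NumPy sketch. The variable names (`x`, `X`, `XtX`, `w`) are illustrative choices rather than part of the original post, and the explicit inverse is used only to mirror the walkthrough.

```python
import numpy as np

# Data from the question: inputs 1..12 and the observed targets.
x = np.arange(1, 13, dtype=float)
y = np.array([2.3, 2.33, 2.29, 2.3, 2.36, 2.4,
              2.46, 2.5, 2.48, 2.43, 2.38, 2.35])

# Design matrix: first column is x, second column is all ones (bias).
X = np.column_stack([x, np.ones_like(x)])

XtX = X.T @ X                   # 2x2 matrix X^T X
XtX_inv = np.linalg.inv(XtX)    # explicit inverse, mirroring the steps above
XtX_inv_Xt = XtX_inv @ X.T      # 2x12 matrix (X^T X)^{-1} X^T
w = XtX_inv_Xt @ y              # [slope, bias]

print(XtX)
print(w)  # approximately [0.01174825, 2.30530303]
```

The first entry of `w` is the slope and the second is the bias, matching the column order of `X`.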

How/why does this equation exactly work? Is it somehow related to differentiation? Could it be compared to algorithms such as gradient descent? If so, how? Is it possible to use this equation with sigmoid functions?

score 2 · Accepted Answer · answered Mar 25 '18 at 20:12

Your goal is to find a $w$ such that

$$Xw \approx y$$

and one way to model this problem is to minimize the objective function:

$$\min_w\|Xw-y\|^2.$$

Differentiating with respect to $w$ and setting the result equal to zero gives us

$$2X^T(Xw-y)=0$$

$$X^TXw-X^Ty=0$$

$$X^TXw=X^Ty$$

If $X^TX$ is invertible, then we have

$$w=(X^TX)^{-1}(X^Ty)$$
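For readers unsure about the differentiation step, it may help to expand the objective first. This is only a sketch, relying on two standard matrix-calculus identities: the gradient of $w^TAw$ is $2Aw$ for symmetric $A$, and the gradient of $b^Tw$ is $b$.

$$\|Xw-y\|^2 = (Xw-y)^T(Xw-y) = w^TX^TXw - 2y^TXw + y^Ty$$

$$\nabla_w \|Xw-y\|^2 = 2X^TXw - 2X^Ty = 2X^T(Xw-y)$$

Since this objective is convex in $w$, any point where the gradient vanishes is a global minimizer, which is why setting it to zero is enough.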

Remark:

We tend to avoid computing the inverse explicitly and prefer gradient-based methods. The complexity of the normal equation method is cubic in the number of features.
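As a hedged illustration of avoiding the explicit inverse, the sketch below solves the same system with `np.linalg.solve` and with `np.linalg.lstsq`, using the data from the question. The variable names are illustrative, and this is one possible approach rather than the method the answer prescribes.

```python
import numpy as np

x = np.arange(1, 13, dtype=float)
y = np.array([2.3, 2.33, 2.29, 2.3, 2.36, 2.4,
              2.46, 2.5, 2.48, 2.43, 2.38, 2.35])
X = np.column_stack([x, np.ones_like(x)])

# Solve the normal equations X^T X w = X^T y as a linear system,
# without forming (X^T X)^{-1}.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Alternatively, let a least-squares routine work on X directly.
w_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print(w_solve)
print(w_lstsq)
```

Both calls should agree with the result from the question up to floating-point error.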

Siong Thye Goh

Thank you for your response. My apologies, I couldn't exactly understand the part where we equated to zero. Is there some specific rule that I'm not aware of? It seems like the Euclidean norm. – ShellRox Mar 26 '18 at 07:46
Yes, the norm that I used is just the Euclidean norm. I just applied the chain rule. – Siong Thye Goh Mar 26 '18 at 15:44
Reference for matrix calculus – Siong Thye Goh Mar 26 '18 at 16:08
Thank you! I guess I need to study more about multivariable differentiation. – ShellRox Mar 26 '18 at 19:29
@SiongThyeGoh I've searched a lot but didn't find any valid solution. I'm aware that the complexity of gradient-based methods is $O(n^2)$. Is there any proven approach for showing that? – Green Falcon Oct 27 '18 at 17:53
Do you mean the complexity of computing the gradient? – Siong Thye Goh Oct 27 '18 at 18:12
Actually, I meant the complexity of gradient descent using batch steps, #epochs == #iterations. For updating weights, $w^{t+1} = w^{t} + X^Ty - X^TXw^{t}$ should be employed. From the complexity point of view, the normal equation just has an extra inverse. That is what I meant. – Green Falcon Oct 29 '18 at 06:12

score 1 · Answer 2 · answered Mar 25 '18 at 20:09

First of all, gradient descent is not guaranteed to find the global optimum in general. If your function has just one extremum, it can find it, but if it has many there is no guarantee that it finds the best one. If you are familiar with derivatives and slopes of functions, the normal equation finds the point at which the derivative is zero in every direction, that is, with respect to every variable. It is the result of setting those derivatives to zero.

You are partly right: for linear least-squares problems we know the optimal solution mathematically, but in practice the normal equation often does not scale. The reason is that as you increase the number of features, the matrix gets larger and larger. Most problems are not linearly separable. Consequently, you have to add higher-order polynomial terms, and you don't know in advance which polynomial to use. Even with only second-order terms, the matrix can become too large to fit in your computer's memory. Suppose that you have 100 features. If you add just the pairwise products of the variables, compute how many entries your matrix will have. And even after doing that, you have not added much complexity to your model, because you have not added the higher-order polynomials that you may need. ML algorithms like deep nets try to learn such complicated functions using a smaller number of parameters.
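To make the "compute how many entries" exercise concrete, here is a small sketch. It assumes dense `float64` storage and counts all monomials of total degree at most `degree`, including the constant term. The helper name `n_poly_terms` is hypothetical.

```python
from math import comb

def n_poly_terms(n_features: int, degree: int) -> int:
    """Number of monomials of total degree <= degree, constant included."""
    return comb(n_features + degree, degree)

for degree in (1, 2, 3):
    p = n_poly_terms(100, degree)      # columns of the expanded design matrix
    gram_entries = p * p               # entries of X^T X
    gib = gram_entries * 8 / 2**30     # dense float64 storage
    print(f"degree {degree}: {p} columns, {gram_entries} entries in X^T X, ~{gib:.2f} GiB")
```

With 100 features, degree 2 gives 5151 columns (1 constant, 100 linear terms, 100 squares, and 4950 pairwise products), so $X^TX$ has about 26.5 million entries. At degree 3 the column count is 176851, and a dense $X^TX$ would need on the order of hundreds of gigabytes.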

How/why does this equation exactly work? Could it be compared to algorithms such as gradient descent? If so, how? Is it possible to use this for more complex activation functions such as hyperbolic tangent?

It is the result of taking the multivariable derivative (the gradient) and setting it to zero. Yes, both aim to find a minimum, but they take different approaches. Gradient descent goes downhill step by step, whilst the normal equation finds the location directly by solving for where the derivative is zero for all features.

I didn't understand the last question well enough to address it.
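The comparison can be sketched on the data from the question: batch gradient descent on $\|Xw-y\|^2$ ends up at the same weights as the normal equation. The learning rate and step count below are assumptions chosen for this particular data set, not values from the original post.

```python
import numpy as np

x = np.arange(1, 13, dtype=float)
y = np.array([2.3, 2.33, 2.29, 2.3, 2.36, 2.4,
              2.46, 2.5, 2.48, 2.43, 2.38, 2.35])
X = np.column_stack([x, np.ones_like(x)])

# Closed-form solution from the normal equation, for comparison.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on ||Xw - y||^2, starting from zero weights.
w = np.zeros(2)
learning_rate = 1e-3   # assumed; must stay below 1 / lambda_max(X^T X), roughly 1.5e-3 here
n_steps = 20_000       # assumed iteration budget
for _ in range(n_steps):
    grad = 2 * X.T @ (X @ w - y)   # gradient of the squared error
    w -= learning_rate * grad      # step downhill

print(w_normal)
print(w)
```

The bias direction converges slowly here because $X^TX$ is poorly conditioned for this data, which is why thousands of steps are needed where the normal equation needs a single solve.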

Green Falcon

Thank you for your response. I'm familiar with differentiation to some extent, but couldn't exactly figure out the equation. Is it minimizing the cost function by setting the derivatives of all the terms to zero? Also, my last question was about other activation functions, such as hyperbolic tangent or logistic functions. I was interested in whether the equation above can be used for them, but considering that these functions are not directly differentiable, I don't think that the equation will work. – ShellRox Mar 25 '18 at 20:35
@ShellRox About the first part, you are facing an optimization problem which tries to reduce the difference between the real results, y, and the predicted ys. Then all you do is solve that optimization problem. About the second question, as far as I know we use activation functions, which add non-linearity, in gradient-based methods and not in the normal equation, because it is not needed there. Where do you want to use that? But keep going, you may invent a new path :) – Green Falcon Mar 26 '18 at 05:53
Thank you! I think I understood the first part. About the second one, to my knowledge inputs are weighted to minimize the cost function for some specific activation function, correct? But as far as I know, sigmoid activation functions (tanh, logistic, arctan, etc.) are not directly differentiable, thus they use backpropagation, which in turn uses gradient descent. But I wonder if there is some kind of equation for sigmoid functions as well. – ShellRox Mar 26 '18 at 09:57
@ShellRox Sigmoid-like activation functions, such as the sigmoid itself and the hyperbolic tangent, are all differentiable. Moreover, they are used to add non-linearity to the approximated functions. This is about the forward pass: if you don't use a non-linearity you will estimate just a single line, while using non-linearities helps you build complex functions. Moreover, nowadays people use ReLU, which is not differentiable at zero, but it has been shown to have a very good effect because its derivative is one or zero. It does not lead to vanishing / exploding gradients. – Green Falcon Mar 27 '18 at 03:47
Take a look here and here. – Green Falcon Mar 27 '18 at 03:52
Thank you! I understood the purpose of non-linearity. I guess I just need to study non-linear activation functions more before asking about them. – ShellRox Mar 27 '18 at 15:22
@ShellRox Happy to have helped :) – Green Falcon Mar 28 '18 at 15:43
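The differentiability point raised in the comments above can be checked numerically. The sketch below compares the analytic derivatives of the logistic sigmoid and tanh against central finite differences. The function names and the choice to return 0 for the ReLU derivative at exactly zero are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # d/dz sigmoid(z)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2    # d/dz tanh(z)

def relu_grad(z):
    # 0 for z < 0 and 1 for z > 0; the value at z == 0 is a convention (0 here).
    return (z > 0).astype(float)

z = np.linspace(-3.0, 3.0, 13)
h = 1e-5
num_sigmoid = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
num_tanh = (np.tanh(z + h) - np.tanh(z - h)) / (2 * h)

print(np.max(np.abs(num_sigmoid - sigmoid_grad(z))))  # small finite-difference error
print(np.max(np.abs(num_tanh - tanh_grad(z))))        # small finite-difference error
print(relu_grad(np.array([-1.0, 0.0, 2.0])))
```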
