
I watched the Risto Siilasmaa video on Machine Learning. It's very well explained, but it left me wondering at what stage we should use the activation function and why we need it at all. I know that by definition the activation function transforms the sum w*x + b into a number between some lower and upper limit.

In the video Risto Siilasmaa explains that in the training process there are the following steps:

  1. Start with random weights.
  2. Calculate the outcome (the sum w*x + b); we know what it should be because we know which image we gave to the system.
  3. Calculate the error.
  4. Nudge all weights to reduce the error.

But what about the activation? Where should it go in the list above? Before the error calculation? And what would happen if we omitted it altogether and just calculated the outcome and the error and nudged the weights? Is it because the error calculation doesn't work well when the outcome isn't transformed to lie between some lower and upper limit?
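Here is roughly how I picture the loop in code (a toy one-neuron sketch to frame my question; the data, the learning rate, and the sigmoid I put in as the activation are my own guesses, not from the video):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 examples with 3 features each, labels 0 or 1.
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

w = rng.normal(size=3)  # step 1: start with random weights
b = 0.0
lr = 0.1

for _ in range(1000):
    z = X @ w + b                   # step 2: the outcome, sum of w*x + b
    p = 1.0 / (1.0 + np.exp(-z))    # the activation -- is THIS where it belongs?
    error = p - y                   # step 3: calculate the error
    w -= lr * X.T @ error / len(y)  # step 4: nudge all weights to reduce the error
    b -= lr * error.mean()
```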

Jane Mänd

6 Answers


Generally the activation is part of the model and gets applied for each neuron, so definitely before the error calculation. Which activation function to use depends on the task you are solving and where the neuron of interest sits. In principle the activation function $f$ goes into the calculation of the outcome:

$$ y = f(Wx + b)$$
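For illustration, here is a minimal sketch of such a layer in NumPy (the shapes and values are made up):

```python
import numpy as np

def layer(W, b, x, f):
    # One neural-network layer: affine transform, then activation f.
    return f(W @ x + b)

W = np.array([[0.5, -1.0],
              [2.0,  0.3]])   # weight matrix: one row per neuron
b = np.array([0.1, -0.2])     # bias vector
x = np.array([1.0, 2.0])      # input

y = layer(W, b, x, np.tanh)   # activation applied after the affine part
```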

For output neurons, if you are doing classification, then $f$ should map into the range between 0 and 1, since you'll interpret the outcome as a probability. For regression, $f$ could just be the identity.
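Concretely, the usual choices would look like this (a sketch; the logistic sigmoid is one common option for binary classification, not the only one):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1), so it can be read as a probability.
    return 1.0 / (1.0 + np.exp(-z))

def identity(z):
    # For regression the output can stay unbounded.
    return z
```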

For hidden (i.e. non-output) neurons, you definitely want to use a non-linear $f$. The reason is that the neural network would otherwise be equivalent to a regular linear model, no matter how many layers it has. So the non-linear activations are needed to harness the expressive power of neural networks.
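You can verify this numerically: two stacked layers without an activation collapse into a single linear map (toy sketch with random weights):

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two linear layers applied in sequence...
deep = W2 @ (W1 @ x + b1) + b2
# ...are the same as one linear layer with combined weights.
shallow = (W2 @ W1) @ x + (W2 @ b1 + b2)

assert np.allclose(deep, shallow)
```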

For deep learning, the most popular $f$ for hidden neurons is probably the rectified linear unit (ReLU):

$$ f(x) = \max(0,x)$$
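In code this is just an elementwise maximum, e.g.:

```python
import numpy as np

def relu(z):
    # Zero for negative inputs, unchanged for positive ones.
    return np.maximum(0.0, z)

relu(np.array([-1.5, 0.0, 2.0]))  # -> array([0., 0., 2.])
```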

matthiaw91
  • But how does the model know which weights to use in prediction? For example, we have two classes, dogs and cats, with 1000 images of each. We train on them, and that means we have a lot of different weights. In prediction, if we choose some unknown dog image, then the model needs to know which weights to use on this image. Am I right? How is it done? – Jane Mänd Feb 19 '20 at 13:08
  • After training we have one set of weights, and the model uses all of them every time. There is no decision about which weights to use. For each layer you have a weight matrix and a bias vector: you multiply the input by the matrix, add the bias, and apply the activation function to each element of the result. – matthiaw91 Feb 19 '20 at 13:22
  • What do you mean by 'one set of weights'? If we have 2000 images, then is each of them trained separately? If the first dog image is trained, are the generated weights added to the next image's weights? Confusing. – Jane Mänd Feb 19 '20 at 13:43
  • You do not train on each image separately. The point is to get one model that can differentiate between cats and dogs. The model is parameterized by its weights and the training is supposed to find the weights that make the correct predictions on most images. – matthiaw91 Feb 19 '20 at 14:08
  • @JaneMänd there are no weights for each image; there are weights for the network, which does the classification task. – hobbs Feb 19 '20 at 17:42
  • Ok, thank you. But how are the weights for the whole network formed?
    1. Are all the images trained on simultaneously?
    2. Is the network trained one image at a time, with the weights updated at each iteration?
    3. Are weights first found for each image and then some common optimum calculated from them?

    How is this optimum found such that all the images can use one set of weights?

    – Jane Mänd Feb 20 '20 at 09:35
  • 1) Minimize the mean error over all images. 2) You can do both, but for practical reasons usually only one or a few images are considered at each iteration (see: stochastic gradient descent or mini-batch gradient descent). 3) Again, as @hobbs commented, there are no weights "for each image"; there are only weights for the network. 4) That is the key question of neural networks. Usually gradient-based methods are used, but beyond that it cannot be answered in the comments below a question that was about something else. There are many good introductions in books and online. – matthiaw91 Feb 20 '20 at 10:16
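To make the mini-batch idea from the last comment concrete, here is a minimal sketch of such an update loop (dummy data and made-up sizes; note there is only ever one shared weight vector, nudged a batch at a time):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 10))                  # e.g. 2000 images, flattened to 10 features
y = rng.integers(0, 2, size=2000).astype(float)  # dummy labels: cat = 0, dog = 1

w = rng.normal(size=10)  # ONE set of weights, shared by all images
b, lr, batch = 0.0, 0.1, 32

for epoch in range(5):
    order = rng.permutation(len(X))              # visit the images in random order
    for start in range(0, len(X), batch):
        idx = order[start:start + batch]
        xb, yb = X[idx], y[idx]
        p = 1.0 / (1.0 + np.exp(-(xb @ w + b)))  # predictions for this mini-batch
        grad = xb.T @ (p - yb) / len(idx)        # gradient of the mean error
        w -= lr * grad                           # the same shared weights get nudged
        b -= lr * (p - yb).mean()
```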