
I watched the Risto Siilasmaa video on Machine Learning. It's very well explained, but it left me wondering at what stage we should use the activation function and why we need it at all. I know that by definition the activation function transforms the sum w*x + b into a number between some lower and upper limit.

In the video Risto Siilasmaa explains that in the training process there are the following steps:

  1. Start with random weights.
  2. Calculate the outcome (the sum w*x + b); we know what it should be because we know which image we gave to the system.
  3. Calculate the error.
  4. Nudge all weights to reduce the error.

But what about the activation? Where should it go in the list above? Before the error calculation? And what would happen if we omitted it altogether and just calculated the outcome and the error and nudged the weights? Is it because the error calculation doesn't work well when the outcome isn't transformed to lie between some lower and upper limit?
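Here is roughly how I picture the loop in code (a toy one-neuron sketch to frame my question; the data, the learning rate, and the sigmoid I put in as the activation are my own guesses, not from the video):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 examples with 3 features each, labels 0 or 1.
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

w = rng.normal(size=3)  # step 1: start with random weights
b = 0.0
lr = 0.1

for _ in range(1000):
    z = X @ w + b                   # step 2: the outcome, sum of w*x + b
    p = 1.0 / (1.0 + np.exp(-z))    # the activation -- is THIS where it belongs?
    error = p - y                   # step 3: calculate the error
    w -= lr * X.T @ error / len(y)  # step 4: nudge all weights to reduce the error
    b -= lr * error.mean()
```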

Jane Mänd

6 Answers


Generally the activation is part of the model and gets applied for each neuron, so definitely before the error calculation. Which activation function to use depends on the task you are solving and where the neuron of interest sits. In principle the activation function $f$ goes into the calculation of the outcome:

$$ y = f(Wx + b)$$
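For illustration, here is a minimal sketch of such a layer in NumPy (the shapes and values are made up):

```python
import numpy as np

def layer(W, b, x, f):
    # One neural-network layer: affine transform, then activation f.
    return f(W @ x + b)

W = np.array([[0.5, -1.0],
              [2.0,  0.3]])   # weight matrix: one row per neuron
b = np.array([0.1, -0.2])     # bias vector
x = np.array([1.0, 2.0])      # input

y = layer(W, b, x, np.tanh)   # activation applied after the affine part
```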

For output neurons, if you are doing classification, then $f$ should map into the range between 0 and 1, since you'll interpret the outcome as a probability. For regression, $f$ could just be the identity.
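Concretely, the usual choices would look like this (a sketch; the logistic sigmoid is one common option for binary classification, not the only one):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1), so it can be read as a probability.
    return 1.0 / (1.0 + np.exp(-z))

def identity(z):
    # For regression the output can stay unbounded.
    return z
```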

For hidden (i.e. non-output) neurons, you definitely want to use a non-linear $f$. The reason is that the neural network would otherwise be equivalent to a regular linear model, no matter how many layers it has. So the non-linear activations are needed to harness the expressive power of neural networks.
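You can verify this numerically: two stacked layers without an activation collapse into a single linear map (toy sketch with random weights):

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two linear layers applied in sequence...
deep = W2 @ (W1 @ x + b1) + b2
# ...are the same as one linear layer with combined weights.
shallow = (W2 @ W1) @ x + (W2 @ b1 + b2)

assert np.allclose(deep, shallow)
```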

For deep learning, the most popular $f$ for hidden neurons is probably the rectified linear unit (ReLU):

$$ f(x) = \max(0,x)$$
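In code this is just an elementwise maximum, e.g.:

```python
import numpy as np

def relu(z):
    # Zero for negative inputs, unchanged for positive ones.
    return np.maximum(0.0, z)

relu(np.array([-1.5, 0.0, 2.0]))  # -> array([0., 0., 2.])
```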

matthiaw91
  • But how does the model know which weights to use in prediction? For example, we have two classes, dogs and cats, with 1000 images of each. We train on them, and that means we have a lot of different weights. In prediction, if we choose some unknown dog image, then the model needs to know which weights to use on this image. Am I right? How is it done? – Jane Mänd Feb 19 '20 at 13:08
  • After training we have one set of weights, and the model uses all of them every time. There is no decision about which weights to use. For each layer you have a weight matrix and a bias vector: you multiply the input by the matrix, add the bias, and apply the activation function to each element of the result. – matthiaw91 Feb 19 '20 at 13:22
  • What do you mean by 'one set of weights'? If we have 2000 images, then is each of them trained separately? If the first dog image is trained, are the generated weights added to the next image's weights? Confusing. – Jane Mänd Feb 19 '20 at 13:43
  • You do not train on each image separately. The point is to get one model that can differentiate between cats and dogs. The model is parameterized by its weights and the training is supposed to find the weights that make the correct predictions on most images. – matthiaw91 Feb 19 '20 at 14:08
  • @JaneMänd there are no weights for each image; there are weights for the network, which does the classification task. – hobbs Feb 19 '20 at 17:42
  • Ok, thank you. But how are the weights for the whole network formed?
    1. Are all the images trained on simultaneously?
    2. Is the network trained one image at a time, with the weights updated at each iteration?
    3. Are weights first found for each image and then some common optimum calculated from them?

    How is this optimum found such that all the images can use one set of weights?

    – Jane Mänd Feb 20 '20 at 09:35
  • 1) Minimize the mean error over all images. 2) You can do both, but for practical reasons usually only one or a few images are considered at each iteration (see: stochastic gradient descent or mini-batch gradient descent). 3) Again, as @hobbs commented, there are no weights "for each image"; there are only weights for the network. 4) That is the key question of neural networks. Usually gradient-based methods are used, but beyond that it cannot be answered in the comments below a question that was about something else. There are many good introductions in books and online. – matthiaw91 Feb 20 '20 at 10:16
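To make the mini-batch idea from the last comment concrete, here is a minimal sketch of such an update loop (dummy data and made-up sizes; note there is only ever one shared weight vector, nudged a batch at a time):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 10))                  # e.g. 2000 images, flattened to 10 features
y = rng.integers(0, 2, size=2000).astype(float)  # dummy labels: cat = 0, dog = 1

w = rng.normal(size=10)  # ONE set of weights, shared by all images
b, lr, batch = 0.0, 0.1, 32

for epoch in range(5):
    order = rng.permutation(len(X))              # visit the images in random order
    for start in range(0, len(X), batch):
        idx = order[start:start + batch]
        xb, yb = X[idx], y[idx]
        p = 1.0 / (1.0 + np.exp(-(xb @ w + b)))  # predictions for this mini-batch
        grad = xb.T @ (p - yb) / len(idx)        # gradient of the mean error
        w -= lr * grad                           # the same shared weights get nudged
        b -= lr * (p - yb).mean()
```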