The derivative of the ReLU function is 1 when the input is greater than 0 and 0 when the input is less than or equal to 0. In the backpropagation process it doesn't change the value of d(error)/d(weight) at all: the gradient is either multiplied by 1 or by 0. Which means all it does is discard the negative inputs. That feels like it works the same way as dropout. If we used dropout instead of ReLU, shouldn't it be almost the same?

We use a non-linear activation function to bring in non-linearity. But isn't ReLU also a linear transformation? Suppose a training dataset where all the inputs are positive and, in the initial model, all the weights are positive. Then ReLU(wx + b) is ultimately just wx + b. How is that bringing any non-linearity? I am hella confused about the whole thing.
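For example, here is a small numpy sketch of the case I mean (the layer sizes and the positive uniform initialization are just made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

x = rng.uniform(0.1, 1.0, size=(5, 3))   # all-positive inputs
W = rng.uniform(0.1, 1.0, size=(3, 4))   # all-positive weights
b = rng.uniform(0.1, 1.0, size=(4,))     # all-positive bias

pre = x @ W + b                           # every pre-activation is positive
print(np.allclose(relu(pre), pre))        # True: ReLU clips nothing, the layer is just x @ W + b
```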
1 Answer
When all the weights and all the inputs are positive, you don't have a non-linearity: every pre-activation wx + b is positive, so ReLU passes it through unchanged and the layer is just a linear map.
To make it work, the weights have to be randomly initialized, with some negative and some positive, so that some pre-activations fall below zero and get clipped.
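A minimal numpy sketch of that (the layer sizes and the random seed are arbitrary, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

x = rng.uniform(0.1, 1.0, size=(5, 3))        # still all-positive inputs

# Mixed-sign (standard random) weight initialization
W1 = rng.normal(size=(3, 4)); b1 = rng.normal(size=4)
W2 = rng.normal(size=(4, 2)); b2 = rng.normal(size=2)

two_layer = relu(x @ W1 + b1) @ W2 + b2       # two-layer ReLU network
collapsed = (x @ W1 + b1) @ W2 + b2           # what you'd get if ReLU did nothing

print(np.allclose(two_layer, collapsed))      # False: some pre-activations are negative,
                                              # ReLU zeroes them, so the network is no longer
                                              # equivalent to a single linear map
```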

Iya Lee
This might help. – Arpit Sisodia Mar 29 '23 at 02:09