DISCLAIMER: These are only my opinions and intuitions about how I understand ReLU.
ReLU has one point of "drastic" change and is "otherwise linear".
Exactly, and that makes all the difference!
Let's say a neuron in the first layer gets a negative pre-activation $z = w^Tx + b$. Then, with the ReLU activation $g(z) = \max(0, z)$, that neuron computes the constant line $g(z) = 0$.
Now suppose the corresponding neuron in the next layer gets a positive pre-activation $z$. Then the second neuron computes the identity line $g(z) = z$.
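Here is a tiny NumPy sketch of those two regimes (just my own toy code, the helper `relu` is a one-liner I wrote to make the two cases concrete):

```python
import numpy as np

def relu(z):
    # ReLU activation: max(0, z), applied element-wise
    return np.maximum(0, z)

z_negative = np.array([-2.0, -0.5])   # pre-activations below 0
z_positive = np.array([0.5, 2.0])     # pre-activations above 0

print(relu(z_negative))  # [0. 0.]    -> the neuron computes the line g(z) = 0
print(relu(z_positive))  # [0.5 2. ]  -> the neuron computes the line g(z) = z
```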
I hope the figure below illustrates how ReLU becomes useful.

With the ReLU activation and a proper adjustment of weights, the 2nd layer can compute a new feature (say, a triangle, as in the sketch below). If we add a 3rd layer, the corresponding neuron there can combine such features into something that looks like a quadrilateral, and so on... (This can even be imagined like playing a game of building blocks with a child!)
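To make the "triangle" idea concrete, here is a small sketch with hand-picked weights (my own toy example, not from any textbook): three ReLU units in one layer, combined linearly by the next layer, produce a triangular bump on $[0, 1]$.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def triangle(x):
    # A triangular "bump": three ReLU units in one layer,
    # combined with weights (1, -2, 1) by the next layer.
    return relu(x) - 2 * relu(x - 0.5) + relu(x - 1.0)

x = np.linspace(-0.5, 1.5, 9)
print(np.round(triangle(x), 3))
# [0. 0. 0. 0.25 0.5 0.25 0. 0. 0.] -> rises on [0, 0.5], falls on [0.5, 1]
```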
This property of ReLU, that it helps each layer act as building blocks for the next, makes it extremely powerful (largely regardless of the dataset we train on). ReLU is also much simpler and cheaper to compute than most other activation functions: the output is either 0 or just $z$ itself. Hence ReLU is the best default choice for most types of data (unless the data inherently has properties that are better modelled by other activation functions like the sigmoid or tanh).
In contrast, a completely linear activation function, one with no kink like the one ReLU has at 0 (ReLU is continuous there; it is the slope that changes abruptly), can only compute the same feature over and over - it just changes the magnitude or scale. Stacking any number of such linear layers collapses into a single linear layer.
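A short sketch of that collapse (arbitrary made-up weights, just for illustration): two purely linear layers are exactly one linear layer in disguise.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# two purely linear layers with arbitrary weights and biases
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
two_layers = W2 @ (W1 @ x + b1) + b2

# ...collapse into a single equivalent linear layer
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```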

Regarding your 2nd question - no, I don't think there is any significance to where the kink (the point where the slope suddenly changes) is. There just has to be at least one such point, any point, so that neurons in different layers compute different functions (here, lines). But 0 is a "good" value: it neatly divides the number line into a negative and a positive side (easy to imagine), and it also helps make the computations faster.
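In fact, a kink at any other point $c$ can be reproduced with the standard ReLU just by shifting the bias, since $\max(c, z) = \max(0, z - c) + c$ (and the extra $+c$ folds into the next layer's bias). A quick numerical check of that identity (my own sketch):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

c = 1.7                                  # an arbitrary kink location
z = np.linspace(-3, 3, 7)

kink_at_c = np.maximum(c, z)             # activation whose kink sits at c
via_relu  = relu(z - c) + c              # standard ReLU with a shifted bias

print(np.allclose(kink_at_c, via_relu))  # True
```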
But with ReLU there also comes the problem of the dying ReLU: when the pre-activation $z$ of a neuron is negative for (almost) every input, the neuron always outputs 0, its gradient is 0, and it effectively stops learning. (And if $z$ is always positive, the neuron is just linear, so we are back to the problem above.) In that case, we can use a variation of ReLU called the leaky ReLU, which computes a line with a small slope even when $z$ is negative.
(The leaky ReLU function is given by $g(z) = \max(0.01z, z)$. Usually we don't want the activations to grow large on the negative side, so we multiply $z$ by a very small factor, and 0.01 does just that. Again, I don't think there is any significance to the value 0.01 - it just has to be some small factor, and 0.01 is a common default.)
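For completeness, a minimal leaky ReLU sketch (again just my own toy code, not any particular library's API):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # max(alpha * z, z): identity for z > 0, a small slope alpha for z < 0,
    # so the gradient never becomes exactly 0 and the neuron cannot "die"
    return np.maximum(alpha * z, z)

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(leaky_relu(z))  # [-0.03 -0.01  0.    1.    3.  ]
```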