The $\operatorname{ReLU}(x) = \max(0, x)$ function is one of the most commonly used activation functions in neural networks. However, it has been shown to suffer from the dying ReLU problem (see also What is the "dying ReLU" problem in neural networks?).
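To make the problem concrete, here is a minimal NumPy sketch of a "dead" unit: if the pre-activation is negative for essentially every input (here forced by an illustrative large negative bias, which is my own assumption for the demo), the ReLU gradient is zero everywhere and the unit's weights can never recover.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative of ReLU: 1 for x > 0, 0 for x <= 0
    return (x > 0).astype(float)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))   # a batch of inputs
w = rng.normal(size=10)
b = -20.0                         # large negative bias pushes the unit into the "dead" regime (illustrative)

z = X @ w + b                     # pre-activations are negative for (almost) every input
print("fraction of inputs with zero gradient:", np.mean(relu_grad(z) == 0.0))
# ~1.0 -> no gradient flows back through this unit, so w and b never get updated
```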
Given this problem with the ReLU function, and the frequent suggestion to use a leaky ReLU instead, why does ReLU remain the most widely used activation function in modern deep learning architectures to this day? Is the dying ReLU simply a theoretical problem that rarely occurs in practice? And if so, why does it rarely occur? Is it because, as the width of a network grows, the probability of dead ReLUs becomes smaller (see Dying ReLU and Initialization: Theory and Numerical Examples)?
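For contrast, a short sketch of the leaky ReLU mentioned above (the slope $\alpha = 0.01$ is just a common default, not something specific from the references): because its gradient is $\alpha$ rather than zero for negative inputs, a unit stuck in the negative regime can still receive updates.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # Gradient is alpha (not zero) for x <= 0, so a "dead" unit can still be updated
    return np.where(x > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(z))       # [-0.03  -0.005  0.    0.5   3.  ]
print(leaky_relu_grad(z))  # [ 0.01   0.01   0.01  1.    1.  ]
```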
We moved away from sigmoid and tanh activation functions because of the vanishing gradient problem, and we tend to avoid plain RNNs because of exploding gradients, yet we haven't moved away from ReLUs and their dead gradients. I would like more insight into why.