The $\operatorname{ReLU}(x) = \max(0, x)$ function is one of the most commonly used activation functions in neural networks. However, it has been shown to suffer from the dying ReLU problem (see also What is the "dying ReLU" problem in neural networks?).
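To make the problem concrete, here is a minimal NumPy sketch of a "dead" unit: if the pre-activation is negative for essentially every input (here forced by an illustrative large negative bias, which is my own assumption for the demo), the ReLU gradient is zero everywhere and the unit's weights can never recover.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative of ReLU: 1 for x > 0, 0 for x <= 0
    return (x > 0).astype(float)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))   # a batch of inputs
w = rng.normal(size=10)
b = -20.0                         # large negative bias pushes the unit into the "dead" regime (illustrative)

z = X @ w + b                     # pre-activations are negative for (almost) every input
print("fraction of inputs with zero gradient:", np.mean(relu_grad(z) == 0.0))
# ~1.0 -> no gradient flows back through this unit, so w and b never get updated
```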
Given this problem with the ReLU function, and the frequent suggestion to use a leaky ReLU instead, why does ReLU remain the most widely used activation function in modern deep learning architectures to this day? Is the dying ReLU simply a theoretical problem that rarely occurs in practice? And if so, why does it rarely occur? Is it because, as the width of a network grows, the probability of dead ReLUs becomes smaller (see Dying ReLU and Initialization: Theory and Numerical Examples)?
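For contrast, a short sketch of the leaky ReLU mentioned above (the slope $\alpha = 0.01$ is just a common default, not something specific from the references): because its gradient is $\alpha$ rather than zero for negative inputs, a unit stuck in the negative regime can still receive updates.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # Gradient is alpha (not zero) for x <= 0, so a "dead" unit can still be updated
    return np.where(x > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(z))       # [-0.03  -0.005  0.    0.5   3.  ]
print(leaky_relu_grad(z))  # [ 0.01   0.01   0.01  1.    1.  ]
```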
We moved away from sigmoid and tanh activation functions because of the vanishing gradient problem, and we tend to avoid plain RNNs because of exploding gradients, yet we haven't moved away from ReLUs and their dead gradients. I would like more insight into why.