
Convolutional Neural Networks (CNNs) almost always use the rectified linear activation function (ReLU):

$$f(x) = \max(0, x)$$
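For concreteness, a minimal NumPy sketch of this activation (the function name and the sample input are just illustrative, not part of the question):

```python
import numpy as np

def relu(x):
    # Elementwise rectified linear unit: max(0, x)
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))  # [0.  0.  0.  0.5 2. ]
```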

However, the derivative of this function is

$$f'(x) = \begin{cases} 0 &\text{if } x \leq 0\\ 1&\text{otherwise}\end{cases}$$

(ignoring that it is not differentiable at $0$, as I think is done in practice). For inputs $> 0$ this is fine, but why doesn't it matter that the gradient is $0$ at every point $< 0$? Or does it matter? (Are there publications about this problem?)
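To make the gradient behaviour concrete, here is a small NumPy sketch of the derivative as written above (the helper name is made up for the example); the convention $f'(0) = 0$ is baked in via the strict inequality:

```python
import numpy as np

def relu_grad(x):
    # Derivative used in backprop: 0 for x <= 0, 1 for x > 0
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu_grad(x))  # [0. 0. 0. 1. 1.] -- no gradient flows back for x <= 0
```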

If a neuron outputs 0 for every sample of the training data, it is basically lost, correct? Its weights will never be adjusted again?
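As an illustration of that worry, here is a hypothetical NumPy sketch of a single ReLU neuron whose pre-activation is negative for every training sample (the data, weights, and bias are made up for the example); the chain rule then yields exactly zero gradient for its weights and bias:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))     # hypothetical training inputs
w = rng.normal(size=3)            # the neuron's weights
b = -100.0                        # bias so low that w @ x + b < 0 for every sample

z = X @ w + b                     # pre-activations, all negative here
a = np.maximum(0.0, z)            # the neuron outputs 0 for every sample

upstream = rng.normal(size=100)   # whatever gradient arrives from later layers
dz = upstream * (z > 0)           # multiplied by f'(z), which is 0 everywhere
dw = X.T @ dz                     # zero vector: gradient descent never changes w
db = dz.sum()                     # 0.0: nor b
print(a.max(), dw, db)            # 0.0 [0. 0. 0.] 0.0
```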

Ethan
Martin Thoma

1 Answer


> ignoring that it is not differentiable at $0$, as I think is done in practice

Yes, see ReLUs are not differentiable at zero.
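If it helps, the convention can be checked directly in an autodiff framework. A quick sketch assuming PyTorch is available; to my understanding it uses the subgradient $0$ at $x = 0$, which this snippet verifies:

```python
import torch

x = torch.tensor(0.0, requires_grad=True)
torch.relu(x).backward()
print(x.grad)  # tensor(0.) -- the value 0 is used for the derivative at x = 0
```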

> If a neuron outputs 0 for every sample of the training data, it is basically lost, correct? Its weights will never be adjusted again?

Yes, see What is the "dying ReLU" problem in neural networks?

Franck Dernoncourt