I'm wondering about the benefits of advanced activation layers such as LeakyReLU, Parametric ReLU, and Exponential Linear Unit (ELU). What are the differences between them and how do they benefit training?
- Partially helpful answer – Dawny33 Aug 23 '17 at 17:53
- Would you please explain the exact meaning of 'noise in deactivation results in different levels of absence'? What is 'levels of absence'? – Nina Jan 18 '18 at 11:05
1 Answer
ReLU
Simply rectifies the input, meaning positive inputs are retained but negatives give an output of zero. (Hahnloser et al. 2010)
$$ f(x) = \max(0, x) $$
Pros:
- Mitigates the vanishing gradient problem caused by saturating activations such as sigmoid and tanh. (true for all following as well)
- Sparse activation. (true for all following as well)
- Noise-robust deactivation state (i.e. does not attempt to encode the degree of absence).
Cons:
- Dying ReLU problem (many neurons end up in a state where they are inactive for most or all inputs).
- Not differentiable at zero. (true for all following as well)
- Because outputs are never negative, the mean unit activation is pushed away from zero, which slows down learning.
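A minimal NumPy sketch of the rectifier defined above (the function name and sample inputs are my own, for illustration only):

```python
import numpy as np

def relu(x):
    # Positive inputs pass through unchanged; negative inputs are clipped to zero.
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))  # [0.  0.  0.  1.5]
```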
Leaky ReLUs
Scales negative values by a small fixed coefficient ($<1$) instead of zeroing them. (Maas, Hannun, & Ng 2013)
$$ f(x) = \begin{cases} x & \text{if } x \geq 0 \\ 0.1 x & \text{otherwise} \end{cases} $$
Pros:
- Alleviates dying ReLU problem. (true for all following)
- Negative activations push mean unit activation closer to zero and thus speeds up learning. (true for all following)
Cons:
- Deactivation state is not noise-robust (i.e. noise in deactivation results in different levels of absence).
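A NumPy sketch of the leaky variant, using the $0.1$ slope from the equation above (names and values are illustrative):

```python
import numpy as np

def leaky_relu(x, negative_slope=0.1):
    # Negative inputs are scaled by a small fixed coefficient instead of being zeroed.
    return np.where(x >= 0, x, negative_slope * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))  # [-0.2  -0.05  0.    1.5 ]
```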
PReLUs
Just like Leaky ReLUs, but the coefficient is learnable. (Note that in the equation below a different $a$ can be learned for each channel.) (He et al. 2015)
$$ f(x) = \begin{cases} x & \text{if } x \geq 0 \\ a x & \text{otherwise} \end{cases} $$
Pros:
- Improved performance (lower error rate on benchmark tasks) compared to Leaky ReLUs.
Cons:
- Deactivation state is not noise-robust (i.e. noise in deactivation results in different levels of absence).
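A rough NumPy sketch of the PReLU forward pass; here $a$ is shown as a plain variable, but in practice it is a parameter updated by backpropagation (the helper names are my own, and the initial value $a = 0.25$ matches the init used by He et al. 2015):

```python
import numpy as np

def prelu(x, a):
    # Like Leaky ReLU, but the negative slope `a` is learned (per channel in He et al. 2015).
    return np.where(x >= 0, x, a * x)

def prelu_grad_a(x):
    # Gradient of the output with respect to `a`: x for negative inputs, 0 otherwise.
    # This is the extra gradient a framework would use to update `a`.
    return np.where(x >= 0, 0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
a = 0.25  # common initial value
print(prelu(x, a))      # [-0.5   -0.125  0.     1.5  ]
print(prelu_grad_a(x))  # [-2.  -0.5  0.   0. ]
```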
ELUs
Replaces the constant negative slope of Leaky ReLUs and PReLUs with an exponential that saturates at $-\alpha$, so the gradient smoothly vanishes for large negative inputs. (Clevert, Unterthiner, & Hochreiter 2016)
$$ f(x) = \begin{cases} x & \text{if } x \geq 0 \\ \alpha(\exp(x)-1) & \text{otherwise} \end{cases} $$
Pros:
- Improved performance (lower error and faster learning) compared to ReLUs.
- Deactivation state is noise-robust.
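And a NumPy sketch of ELU with $\alpha = 1$ (again, names and values are illustrative):

```python
import numpy as np

def elu(x, alpha=1.0):
    # Negative inputs saturate smoothly toward -alpha; clamping the exponent at 0
    # avoids overflow warnings from np.exp on large positive inputs.
    negative_part = alpha * (np.exp(np.minimum(x, 0.0)) - 1.0)
    return np.where(x >= 0, x, negative_part)

x = np.array([-5.0, -0.5, 0.0, 1.5])
print(elu(x))  # [-0.99326205 -0.39346934  0.          1.5       ]
```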

- Thanks, what do you mean by "Noise-robust deactivation state (i.e. does not attempt to encode the degree of absence)"? – nsaura Aug 07 '20 at 00:10