I'm wondering about the benefits of advanced activation layers such as LeakyReLU, Parametric ReLU, and Exponential Linear Unit (ELU). What are the differences between them and how do they benefit training?
- Partially helpful answer – Dawny33 Aug 23 '17 at 17:53
- Would you please explain the exact meaning of 'noise in deactivation results in different levels of absence'? What is 'levels of absence'? – Nina Jan 18 '18 at 11:05
1 Answer
ReLU
Simply rectifies the input, meaning positive inputs are retained but negatives give an output of zero. (Hahnloser et al. 2010)
$$ f(x) = \max(0, x) $$
Pros:
- Mitigates the vanishing gradient problem caused by saturating activations such as sigmoid and tanh. (true for all following as well)
- Sparse activation. (true for all following as well)
- Noise-robust deactivation state (i.e. does not attempt to encode the degree of absence).
Cons:
- Dying ReLU problem (many neurons end up in a state where they are inactive for most or all inputs).
- Not differentiable at zero. (true for all following as well)
- Because outputs are never negative, the mean unit activation is pushed away from zero, which slows down learning.
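A minimal NumPy sketch of the rectifier defined above (the function name and sample inputs are my own, for illustration only):

```python
import numpy as np

def relu(x):
    # Positive inputs pass through unchanged; negative inputs are clipped to zero.
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))  # [0.  0.  0.  1.5]
```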
Leaky ReLUs
Scales negative values by a small fixed coefficient ($<1$) instead of zeroing them. (Maas, Hannun, & Ng 2013)
$$ f(x) = \begin{cases} x & \text{if } x \geq 0 \\ 0.1 x & \text{otherwise} \end{cases} $$
Pros:
- Alleviates dying ReLU problem. (true for all following)
- Negative activations push mean unit activation closer to zero and thus speeds up learning. (true for all following)
Cons:
- Deactivation state is not noise-robust (i.e. noise in deactivation results in different levels of absence).
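A NumPy sketch of the leaky variant, using the $0.1$ slope from the equation above (names and values are illustrative):

```python
import numpy as np

def leaky_relu(x, negative_slope=0.1):
    # Negative inputs are scaled by a small fixed coefficient instead of being zeroed.
    return np.where(x >= 0, x, negative_slope * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))  # [-0.2  -0.05  0.    1.5 ]
```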
PReLUs
Just like Leaky ReLUs, but the coefficient is learnable. (Note that in the equation below a different $a$ can be learned for each channel.) (He et al. 2015)
$$ f(x) = \begin{cases} x & \text{if } x \geq 0 \\ a x & \text{otherwise} \end{cases} $$
Pros:
- Improved performance (lower error rate on benchmark tasks) compared to Leaky ReLUs.
Cons:
- Deactivation state is not noise-robust (i.e. noise in deactivation results in different levels of absence).
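A rough NumPy sketch of the PReLU forward pass; here $a$ is shown as a plain variable, but in practice it is a parameter updated by backpropagation (the helper names are my own, and the initial value $a = 0.25$ matches the init used by He et al. 2015):

```python
import numpy as np

def prelu(x, a):
    # Like Leaky ReLU, but the negative slope `a` is learned (per channel in He et al. 2015).
    return np.where(x >= 0, x, a * x)

def prelu_grad_a(x):
    # Gradient of the output with respect to `a`: x for negative inputs, 0 otherwise.
    # This is the extra gradient a framework would use to update `a`.
    return np.where(x >= 0, 0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
a = 0.25  # common initial value
print(prelu(x, a))      # [-0.5   -0.125  0.     1.5  ]
print(prelu_grad_a(x))  # [-2.  -0.5  0.   0. ]
```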
ELUs
Replaces the constant negative slope of Leaky ReLUs and PReLUs with an exponential that saturates at $-\alpha$, so the gradient smoothly vanishes for large negative inputs. (Clevert, Unterthiner, & Hochreiter 2016)
$$ f(x) = \begin{cases} x & \text{if } x \geq 0 \\ \alpha(\exp(x)-1) & \text{otherwise} \end{cases} $$
Pros:
- Improved performance (lower error and faster learning) compared to ReLUs.
- Deactivation state is noise-robust.
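And a NumPy sketch of ELU with $\alpha = 1$ (again, names and values are illustrative):

```python
import numpy as np

def elu(x, alpha=1.0):
    # Negative inputs saturate smoothly toward -alpha; clamping the exponent at 0
    # avoids overflow warnings from np.exp on large positive inputs.
    negative_part = alpha * (np.exp(np.minimum(x, 0.0)) - 1.0)
    return np.where(x >= 0, x, negative_part)

x = np.array([-5.0, -0.5, 0.0, 1.5])
print(elu(x))  # [-0.99326205 -0.39346934  0.          1.5       ]
```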

- Thanks, what do you mean by "Noise-robust deactivation state (i.e. does not attempt to encode the degree of absence)"? – nsaura Aug 07 '20 at 00:10