As far as I know the history, the Jordan network was proposed first, in 1986, as a form of RNN, with this diagram:
This seems like the natural solution when thinking about sequence data: the current output becomes an input at the next time step (with some weight and activation, as shown in the figure). However, in 1990 the Elman network was proposed, which feeds back the hidden state rather than the output, like this:
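To make the question concrete, here is my understanding of the two recurrences in my own notation (please correct me if I have this wrong), where $x_t$ is the input, $h_t$ the hidden state, and $y_t$ the output at time $t$:

$$
\begin{aligned}
\textbf{Jordan:}\quad & h_t = \sigma_h\!\left(W_h x_t + U_h\, y_{t-1} + b_h\right), \qquad y_t = \sigma_y\!\left(W_y h_t + b_y\right)\\[4pt]
\textbf{Elman:}\quad & h_t = \sigma_h\!\left(W_h x_t + U_h\, h_{t-1} + b_h\right), \qquad y_t = \sigma_y\!\left(W_y h_t + b_y\right)
\end{aligned}
$$

So, as I understand it, the only structural change is which vector is fed back into the hidden layer: the previous output $y_{t-1}$ (Jordan) versus the previous hidden state $h_{t-1}$ (Elman).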
What is the reason for, or benefit of, this modification? And how do these two types of network differ from the ordinary RNN (before LSTM and GRU) that we know, shown in this figure?
The RNN figure looks very similar to both of them (especially the Elman network), because the hidden state is fed back as an input again. So what is the difference between the ordinary RNN and the Elman and Jordan networks, and how do their uses differ in practice? Please note that I am talking about RNNs before LSTM and GRU; those are out of scope for this comparison.