As far as I know the history, the Jordan network was proposed first, in 1986, as a form of RNN, with this diagram:
This seems like the natural solution when thinking about sequence data: the current output becomes an input at the next time step (with some weight and activation, as shown in the figure). However, in 1990 the Elman network was proposed, which feeds back the hidden state rather than the output, like this:
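To make the question concrete, here is my understanding of the two recurrences in my own notation (please correct me if I have this wrong), where $x_t$ is the input, $h_t$ the hidden state, and $y_t$ the output at time $t$:

$$
\begin{aligned}
\textbf{Jordan:}\quad & h_t = \sigma_h\!\left(W_h x_t + U_h\, y_{t-1} + b_h\right), \qquad y_t = \sigma_y\!\left(W_y h_t + b_y\right)\\[4pt]
\textbf{Elman:}\quad & h_t = \sigma_h\!\left(W_h x_t + U_h\, h_{t-1} + b_h\right), \qquad y_t = \sigma_y\!\left(W_y h_t + b_y\right)
\end{aligned}
$$

So, as I understand it, the only structural change is which vector is fed back into the hidden layer: the previous output $y_{t-1}$ (Jordan) versus the previous hidden state $h_{t-1}$ (Elman).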
What is the reason for, or benefit of, this modification? And how do these two types of network differ from the ordinary RNN (before LSTM and GRU) that we know, shown in this figure?
The RNN figure looks very similar to both of them (especially the Elman network), because the hidden state is fed back as an input again. So what is the difference between the ordinary RNN and the Elman and Jordan networks, and how do their uses differ in practice? Please note that I am talking about RNNs before LSTM and GRU; those are out of scope for this comparison.