
Looking at the literature, there are two distinct approaches to the LSTM equations.


Some people use recurrent weights with Input, Forget, Output. Notice that their equations don't even mention a dataGate; they start by describing the $f$ or $i$ gate (1); Wikipedia: (2)

Like this:

$$
\begin{aligned}
f_t &= \sigma_g(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma_g(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma_g(W_o x_t + U_o h_{t-1} + b_o) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \sigma_c(W_c x_t + U_c h_{t-1} + b_c) \\
h_t &= o_t \circ \sigma_h(c_t)
\end{aligned}
$$


Other people use recurrent weights with a dataGate ($z$), Input, Forget, Output (1), (2), (3), (4)

Like this:

$$
\begin{aligned}
z_t &= \tanh(W_z x_t + R_z h_{t-1} + b_z) \\
i_t &= \sigma(W_i x_t + R_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + R_f h_{t-1} + b_f) \\
c_t &= z_t \odot i_t + c_{t-1} \odot f_t \\
o_t &= \sigma(W_o x_t + R_o h_{t-1} + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$
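For concreteness, here is a minimal NumPy sketch of one step of this second formulation (the dict layout and the gate keys `'zifo'` are my own choice for the example, not taken from any of the linked sources):

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, R, b):
    """One step of the second formulation: recurrent weights (R) on all
    four entry points -- dataGate z, input i, forget f, output o."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    z = np.tanh(W['z'] @ x + R['z'] @ h_prev + b['z'])   # dataGate / block input
    i = sigmoid(W['i'] @ x + R['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x + R['f'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x + R['o'] @ h_prev + b['o'])   # output gate

    c = f * c_prev + i * z      # new cell state
    h = o * np.tanh(c)          # new hidden state
    return h, c

# toy usage with random weights
n_in, n_h = 3, 4
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(n_h, n_in)) for g in 'zifo'}
R = {g: rng.normal(size=(n_h, n_h)) for g in 'zifo'}
b = {g: np.zeros(n_h) for g in 'zifo'}
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h), W, R, b)
```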

I have tried the first approach (I still used the usual weights on the dataGate, just no recurrent weights there). It seems fine but a little odd: the network converges well, but very rarely jumps over local minima, as if the learning rate were too high.

Personally I like the second approach (recurrent weights on all four entry points), but which one should I use?


Edit:

To make things worse, the "LSTM Peephole paper" (page 121) describes three peephole connections (no peephole for the dataGate).

This destroys the equal sizes of my matrices, because now three gates use the Usual, Recurrent, and Peephole weights, but the very first gate (the dataGate) lacks the Peephole weights.
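One way around the shape mismatch (my own workaround, not something the paper prescribes) is to allocate a peephole row for the dataGate anyway, keep it at zero, and mask its gradient, so all four per-gate parameter blocks have equal shapes while the math stays unchanged. A sketch of the shapes:

```python
import numpy as np

n_in, n_h = 3, 4  # toy sizes, just to show the shapes

# Usual and Recurrent weights exist for all four entry points (z, i, f, o):
W = np.zeros((4, n_h, n_in))
R = np.zeros((4, n_h, n_h))

# As the paper describes, peepholes exist only for i, f, o
# (they are diagonal, so each is stored as a vector):
P = np.zeros((3, n_h))            # no row for z -- the unequal case

# Workaround: pad a 4th row for z and freeze it at zero via a gradient mask.
P_padded = np.zeros((4, n_h))
P_mask = np.array([0., 1., 1., 1.])[:, None]  # z row is never updated
# during training: P_grad *= P_mask
```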

Kari

2 Answers


There are more than two or three LSTM variants.

A paper that explores these variants can be read here:

LSTM: A Search Space Odyssey

Tophat

Actually, both of the examples in the question are identical. It's just that in the first example, $\sigma_c$ denotes the $\tanh$ (on the 4th line),

and in the second example they write the $\tanh$ on a separate line. It's just notation.
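A quick numeric check of this (a sketch with random shared weights; I map the Wikipedia symbols $W_c, U_c, b_c$ onto $W_z, R_z, b_z$):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h = 3, 4
x, h0, c0 = rng.normal(size=n_in), rng.normal(size=n_h), rng.normal(size=n_h)
W = {g: rng.normal(size=(n_h, n_in)) for g in 'zifo'}
R = {g: rng.normal(size=(n_h, n_h)) for g in 'zifo'}
b = {g: rng.normal(size=n_h) for g in 'zifo'}
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

f = sigmoid(W['f'] @ x + R['f'] @ h0 + b['f'])
i = sigmoid(W['i'] @ x + R['i'] @ h0 + b['i'])
o = sigmoid(W['o'] @ x + R['o'] @ h0 + b['o'])

# First formulation: the tanh (sigma_c) sits inside the cell update.
c1 = f * c0 + i * np.tanh(W['z'] @ x + R['z'] @ h0 + b['z'])
h1 = o * np.tanh(c1)

# Second formulation: z gets its own line, then the cell update uses it.
z = np.tanh(W['z'] @ x + R['z'] @ h0 + b['z'])
c2 = z * i + c0 * f
h2 = np.tanh(c2) * o

assert np.allclose(c1, c2) and np.allclose(h1, h2)  # identical outputs
```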

Kari