
Looking at the literature, there are two distinct approaches to the LSTM equations.


Some people use recurrent weights with Input, Forget, Output. Notice that their equations don't even mention a dataGate; they start by describing the $f$ or $i$ gate (1); Wikipedia: (2)

Like this:

$$
\begin{aligned}
f_t &= \sigma_g(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma_g(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma_g(W_o x_t + U_o h_{t-1} + b_o) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \sigma_c(W_c x_t + U_c h_{t-1} + b_c) \\
h_t &= o_t \circ \sigma_h(c_t)
\end{aligned}
$$


Other people use recurrent weights with a dataGate ($z$), Input, Forget, Output (1), (2), (3), (4)

Like this:

$$
\begin{aligned}
z_t &= \tanh(W_z x_t + R_z h_{t-1} + b_z) \\
i_t &= \sigma(W_i x_t + R_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + R_f h_{t-1} + b_f) \\
c_t &= z_t \odot i_t + c_{t-1} \odot f_t \\
o_t &= \sigma(W_o x_t + R_o h_{t-1} + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$
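For concreteness, here is a minimal NumPy sketch of one step of this second formulation (the dict layout and the gate keys `'zifo'` are my own choice for the example, not taken from any of the linked sources):

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, R, b):
    """One step of the second formulation: recurrent weights (R) on all
    four entry points -- dataGate z, input i, forget f, output o."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    z = np.tanh(W['z'] @ x + R['z'] @ h_prev + b['z'])   # dataGate / block input
    i = sigmoid(W['i'] @ x + R['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x + R['f'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x + R['o'] @ h_prev + b['o'])   # output gate

    c = f * c_prev + i * z      # new cell state
    h = o * np.tanh(c)          # new hidden state
    return h, c

# toy usage with random weights
n_in, n_h = 3, 4
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(n_h, n_in)) for g in 'zifo'}
R = {g: rng.normal(size=(n_h, n_h)) for g in 'zifo'}
b = {g: np.zeros(n_h) for g in 'zifo'}
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h), W, R, b)
```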

I have tried the first approach (I still used the usual weights on the dataGate, just no recurrent weights there). It seems fine but a little odd: the network converges well, but very rarely jumps over local minima, as if the learning rate were too high.

Personally I like the second approach (recurrent weights on all four entry points), but which one should I use?


Edit:

To make things worse, the "LSTM Peephole paper" (page 121) describes three peephole connections (no peephole for the dataGate).

This destroys the equal sizes of my matrices, because now three gates use the Usual, Recurrent, and Peephole weights, but the very first gate (the dataGate) lacks the Peephole weights.
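One way around the shape mismatch (my own workaround, not something the paper prescribes) is to allocate a peephole row for the dataGate anyway, keep it at zero, and mask its gradient, so all four per-gate parameter blocks have equal shapes while the math stays unchanged. A sketch of the shapes:

```python
import numpy as np

n_in, n_h = 3, 4  # toy sizes, just to show the shapes

# Usual and Recurrent weights exist for all four entry points (z, i, f, o):
W = np.zeros((4, n_h, n_in))
R = np.zeros((4, n_h, n_h))

# As the paper describes, peepholes exist only for i, f, o
# (they are diagonal, so each is stored as a vector):
P = np.zeros((3, n_h))            # no row for z -- the unequal case

# Workaround: pad a 4th row for z and freeze it at zero via a gradient mask.
P_padded = np.zeros((4, n_h))
P_mask = np.array([0., 1., 1., 1.])[:, None]  # z row is never updated
# during training: P_grad *= P_mask
```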

Kari

2 Answers


There are more than two or three LSTM variants.

A paper that explores these variants can be read here:

LSTM: A Search Space Odyssey

Tophat

Actually, both of the examples in the question are identical. It's just that in the first example, $\sigma_c$ denotes the $\tanh$ (on the 4th line),

and in the second example they write the $\tanh$ on a separate line. It's just notation.
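A quick numeric check of this (a sketch with random shared weights; I map the Wikipedia symbols $W_c, U_c, b_c$ onto $W_z, R_z, b_z$):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h = 3, 4
x, h0, c0 = rng.normal(size=n_in), rng.normal(size=n_h), rng.normal(size=n_h)
W = {g: rng.normal(size=(n_h, n_in)) for g in 'zifo'}
R = {g: rng.normal(size=(n_h, n_h)) for g in 'zifo'}
b = {g: rng.normal(size=n_h) for g in 'zifo'}
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

f = sigmoid(W['f'] @ x + R['f'] @ h0 + b['f'])
i = sigmoid(W['i'] @ x + R['i'] @ h0 + b['i'])
o = sigmoid(W['o'] @ x + R['o'] @ h0 + b['o'])

# First formulation: the tanh (sigma_c) sits inside the cell update.
c1 = f * c0 + i * np.tanh(W['z'] @ x + R['z'] @ h0 + b['z'])
h1 = o * np.tanh(c1)

# Second formulation: z gets its own line, then the cell update uses it.
z = np.tanh(W['z'] @ x + R['z'] @ h0 + b['z'])
c2 = z * i + c0 * f
h2 = np.tanh(c2) * o

assert np.allclose(c1, c2) and np.allclose(h1, h2)  # identical outputs
```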

Kari