
I've been working on an implementation of TD-backgammon. The paper/project I'm basing my implementation on is here:

https://www.cs.cornell.edu/boom/2001sp/Tsinteris/gammon.htm

Everything makes sense to me up until the point where it describes the procedure for back-propagation. I haven't taken a lot of upper-level maths past Calc II, and I've never taken a formal course on ML/RL.

The description:

Backpropagation procedure:

Given an input vector V and a desired output O:

1. Calculate the error E between the network's output on V and the desired output O.

2. e(s) = (lambda)*e(s) + grad(V)

3. V = V + (alpha)*error(n)*e(s)

where error(n) is:

For the weight between hidden node i and the output node, error(i)=E*activation(i)*weight(i)

For the weight between input node j and hidden node i, error(j,i)=error(i)*activation(j)*weight(j,i)

The main points I'm confused about are:

What information is included in the "eligibility trace vector" e(s)?

What is "(lambda)" in step 2?

What is "grad(V)" in step 2. Does it stand for the gradient? and if so what does this mean?

What is meant by "alpha" in step 3?

Any help or resources would be greatly appreciated.

1 Answer


\alpha is the learning rate parameter. It controls the step size of each weight update in the optimization, i.e. how much new information overrides old information.

\lambda is the trace-decay parameter for the temporal differencing (the TD part of TD-Gammon). It controls how far back the difference between successive board evaluations is fed to earlier estimates. Setting \lambda to 0 gives TD(0), where each error only updates the estimate for the immediately preceding position; setting \lambda to 1 gives updates equivalent to Monte Carlo reinforcement learning, where every earlier estimate in the game receives the full error. Values in between control how quickly the credit given to earlier estimates decays.
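A quick numeric sketch (plain Python, nothing assumed about your network) of what that decay looks like for a single trace entry whose gradient is 1 on one step and 0 afterwards:

```python
# One eligibility-trace entry: gradient is 1.0 on the first step, 0.0 after that.
for lam in (0.0, 0.7, 1.0):
    e = 0.0
    trace = []
    for t in range(5):
        grad = 1.0 if t == 0 else 0.0
        e = lam * e + grad          # e(s) = lambda*e(s) + grad(V)
        trace.append(round(e, 3))
    print(lam, trace)

# lambda = 0.0 -> [1.0, 0.0, 0.0, 0.0, 0.0]    only the current step gets credit (TD(0))
# lambda = 0.7 -> [1.0, 0.7, 0.49, 0.343, 0.24]
# lambda = 1.0 -> [1.0, 1.0, 1.0, 1.0, 1.0]    credit never decays (Monte Carlo-like)
```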

grad(V) is the gradient of the network's output with respect to its weights, i.e. the vector of partial derivatives saying how much a small change in each weight would change the current evaluation. That is exactly what ordinary backpropagation computes; e(s) holds one trace entry per weight and accumulates these gradients over the course of the game.
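Putting the pieces together, here is a minimal sketch of one TD(lambda) update for a tiny one-hidden-layer network with sigmoid units, the same general shape as TD-Gammon's evaluator (Python/NumPy; the layer sizes, hyperparameter values, and function names are made up for illustration, not taken from the paper). The variables map onto the symbols in the question: e1/e2 are e(s), the dy_dW terms are grad(V), and alpha and lam are the learning rate and trace-decay parameter.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Tiny made-up sizes just for illustration (TD-Gammon itself used ~198 inputs).
n_in, n_hid = 8, 4
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(n_hid, n_in))   # input -> hidden weights
W2 = rng.normal(scale=0.1, size=n_hid)           # hidden -> output weights

# Eligibility traces: one entry per weight, reset to zero at the start of each game.
e1 = np.zeros_like(W1)
e2 = np.zeros_like(W2)

alpha, lam = 0.1, 0.7    # learning rate and trace-decay parameter

def evaluate(x):
    """Forward pass: scalar board evaluation plus the hidden activations."""
    h = sigmoid(W1 @ x)
    y = sigmoid(W2 @ h)
    return y, h

def td_update(x, delta):
    """One TD(lambda) step for board x, given TD error delta
    (reward + next evaluation - current evaluation)."""
    global W1, W2, e1, e2
    y, h = evaluate(x)

    # grad(V): partial derivatives of the output y w.r.t. every weight.
    dy_dW2 = y * (1.0 - y) * h                                  # shape (n_hid,)
    dy_dW1 = np.outer(y * (1.0 - y) * W2 * h * (1.0 - h), x)    # shape (n_hid, n_in)

    # e(s) = lambda * e(s) + grad(V)
    e2 = lam * e2 + dy_dW2
    e1 = lam * e1 + dy_dW1

    # V = V + alpha * error * e(s)   (here V means the weights)
    W2 = W2 + alpha * delta * e2
    W1 = W1 + alpha * delta * e1
```

During self-play you would call td_update on the position you just left, with delta computed as the difference between the evaluation of the new position (or, at the end of the game, the actual outcome) and the evaluation of the old one.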

The Wikipedia pages on TD-Gammon (https://en.wikipedia.org/wiki/TD-Gammon) and TD(\lambda) (https://en.wikipedia.org/wiki/Temporal_difference_learning) are both pretty helpful.