I've been reading the Sutton and Barto book and following David Silver's lectures on YouTube. The basic principles make a lot of sense to me, and I've been building a maze-solving agent (the maze is an arbitrary grid where the agent can move up, down, left, or right unless blocked by a wall). It learns by randomly sampling paths and, as it learns, it weights its 'random' choices by the amount of reward received.
Because the agent can only assign a value to a state (really, to its choice of direction when moving to the next square) once it has reached the goal, the reward is delayed. When the agent reaches the goal, I have the whole chain of choices that led to it.
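To make that concrete, here's a rough sketch of what I mean by the chain of choices (a tiny open grid with purely uniform random moves; the grid size, names, and helpers are just illustrative, not my actual code):

```python
import random

# Illustrative only: a tiny open grid (no walls) explored with uniform
# random moves; my real agent weights its choice by learned values.
MOVES = {'up': (0, -1), 'down': (0, 1), 'left': (-1, 0), 'right': (1, 0)}

def run_episode(size, start, goal):
    path, (x, y) = [], start
    while (x, y) != goal:
        action = random.choice(list(MOVES))
        dx, dy = MOVES[action]
        nx, ny = x + dx, y + dy
        if 0 <= nx < size and 0 <= ny < size:   # ignore moves off the grid
            path.append(((x, y), action))       # record the choice made here
            x, y = nx, ny
    return path  # every (state, action) choice that led to the goal
```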
My first thought was to discount the reward assigned to each square in that chain by multiplying it by a factor like 0.9 for each step away from the goal, so:
GOAL = 1
GOAL-1 = 1 * 0.9
GOAL-2 = 1 * 0.9^2
...
However, this results in rewards so small as to be meaningless for the earliest choices in a grid any larger than about 5x5.
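Just to show what I mean, a minimal sketch of that backup (the GAMMA constant and the value table are placeholders for my own code, and the loop at the end only illustrates how quickly the credit shrinks):

```python
GAMMA = 0.9

def backup(path, values, goal_reward=1.0):
    """Work backwards from the goal, giving each (state, action) on the
    chain an exponentially smaller share of the goal reward."""
    reward = goal_reward
    for state, action in reversed(path):
        reward *= GAMMA
        values[state, action] = values.get((state, action), 0.0) + reward
    return values

# e.g. values = backup(run_episode(10, (0, 0), (9, 9)), {})
# The credit assigned n steps from the goal is GAMMA ** n:
for n in (5, 10, 25, 50):
    print(n, round(GAMMA ** n, 4))   # 0.5905, 0.3487, 0.0718, 0.0052
```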
The rewards have to differ, of course, for choices that had less influence on reaching the goal, but I can't figure out how to assign them sensibly.