
I've been reading the Sutton and Barto book and following David Silver's lectures on YouTube. The basic principles make a lot of sense to me, and I've been building a maze-solving agent (an arbitrary grid where the agent can move up, down, left, or right unless blocked by a wall) that learns by randomly sampling paths and, as it learns, weights its 'random' choices by the amount of reward received.

Because the agent can only assign a value to a state (its choice of direction when moving to the next square) once it has reached the goal, the reward is delayed. When the agent reaches the goal, I have the chain of choices that led to it.

My first thought was to discount the reward assigned to each square in that chain by a percentage like 90% for each square away from the goal. So

GOAL   = 1
GOAL-1 = 1 * 0.9
GOAL-2 = 1 * 0.9^2
...

However, this results in a reward so small as to be meaningless in a grid any larger than about 5x5.
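
To see just how fast that shrinks, here's a quick Python sketch (the path length of 25 is only an example of a winding route through a 5x5 grid):

# How quickly a per-square discount of 0.9 decays along the chain of choices
gamma = 0.9
for i in (0, 1, 5, 10, 25):
    print(f"{i} squares from goal: reward = {gamma ** i:.4f}")
# at 25 squares from the goal the reward is already down to ~0.07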

Of course the rewards have to differ for choices that had less influence on reaching the goal, but I can't figure out how to assign them sensibly.

  • Shouldn't rewards be associated with state transitions and be awarded immediately after taking an action? The reward should be for taking the step along the shortest path towards the goal. I suppose you could consider the entire path the state, with the reward based solely on the second state of each transition, and run the solver against many, many mazes, but this doesn't seem in line with the concept of reinforcement learning (each single action would be too complex and would basically encompass the entire maze solution, rather than spreading the solution out over multiple actions). – outis Aug 03 '15 at 18:49
  • Ideally the reward would be assigned immediately after the transition, but the agent has no awareness of success until it actually reaches the end. I.e. it could move right, right, right and end up in a dead end that it has to reverse out of. Or the agent might be one step away from the end, but will not know until it actually reaches it, so has no knowledge of the value of its current state. – starfish Aug 03 '15 at 18:53
  • The agent doesn't need to (in fact, shouldn't) know the reward for any action beforehand. The reward rules should be part of whatever the agent informs of its actions (e.g. the environment, some sort of overseer). In addition to states, rewards &c, there should also be rules for what the agent observes. This separates out the agent's knowledge of the environment from things like knowledge of rewards. – outis Aug 04 '15 at 22:20
  • That's what I mean - but there are many state transitions in the maze that result in no reward, because the agent has just landed in another arbitrary place so there is no reason to return a reward for the state transition. Only when the overseer observes that the agent has moved to the goal position can it give reward. – starfish Aug 04 '15 at 22:28
  • The reason for the reward lies in reinforcement learning; it's what makes it what it is. The overseer/environment/&c. can give a reward depending on whether the agent will have to backtrack, and whether it's traveling on the shortest path. If the agent will have to move back to the previous space in order to reach the goal, the reward is 0. Having multiple paths through the maze makes the reward system trickier and somewhat arbitrary, but not impossible. The main (arbitrary) decision is what the reward should be if the step is along a path, but there are shorter paths that require backtracking. – outis Aug 04 '15 at 23:07
  • I think I understand - the overseer is all-knowing, so it can assign rewards for all state transitions even though the agent is naive? – starfish Aug 05 '15 at 00:05
  • Are there not RL problems where there is no overseer and rewards are still delayed? – starfish Aug 05 '15 at 00:05
  • Correct about who knows what. As for real-world problems, there definitely are problems where rewards are at least partially delayed, but that's different from only having rewards at the end, and this problem is neither. Instead, they're modeled by the state resulting from the actions. In other words, an action affects future rewards by limiting future states. I suppose you could model this for maze solving by having a reward of -1 for any movement action (the winning strategy would be the one that loses the least). Also, an overseer isn't a necessary part of reinforcement learning. – outis Aug 05 '15 at 03:45

1 Answer


Disclaimer - outis is right in his comments; this isn't a good approach. Still, I'll try to answer your question - I'm curious whether it will work.

OK, let's start with notation.

  • reward - the base reward; in your example, 1
  • modifier(i) - the modification of the reward for the i-th step; in your example, 0.9^i
  • total(i) - the total reward for step i; in your example, reward*modifier(i)
  • I - the length of your solution, i.e. the number of steps taken to solve the problem (it will vary with each solution, but that shouldn't be a problem)
  • total(I) - the minimal wanted reward

Your problem arises because the sequence modifier(i) converges to 0 as i goes to infinity.

Use something else. Decide which steps - the first or the last - have more impact on your solution. If most of the profit comes from the first steps, the sequence modifier(i) should be descending; otherwise it should be ascending. Let's assume that the first steps are more important. If not, just use modifier(I-i) instead of modifier(i).

If you want total(0)=1 and total(I)=0.5 with a geometric progression, then you know that:

reward=1
reward*(q^I)=0.5

where q is the scaling factor. Solving for q gives:

q = 0.5^(1/I)

Voila! No matter how long your solution is, each step will have a non-negligible impact on learning (the smallest total reward is 0.5).
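
A rough Python sketch of this scheme (geometric_rewards and its parameter names are just illustrative, nothing standard):

def geometric_rewards(path_length, base_reward=1.0, min_reward=0.5):
    # Pick q so that base_reward * q**path_length == min_reward,
    # i.e. q = (min_reward / base_reward)**(1/I)
    q = (min_reward / base_reward) ** (1.0 / path_length)
    return [base_reward * q ** i for i in range(path_length + 1)]

rewards = geometric_rewards(25)
print(rewards[0], rewards[-1])   # 1.0 and ~0.5, no matter the path length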

You can use other sequences too - for example linear:

total(i) = reward - i*a
reward - I*a = r

which, solved for a, gives a = (reward - r)/I, where r is the minimal wanted reward total(I).
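
And the linear version as a similar sketch (again, the names are only for illustration):

def linear_rewards(path_length, base_reward=1.0, min_reward=0.5):
    a = (base_reward - min_reward) / path_length   # from reward - I*a = r
    return [base_reward - i * a for i in range(path_length + 1)]

print(linear_rewards(4))   # [1.0, 0.875, 0.75, 0.625, 0.5]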

Of course these aren't your only options - you could use, for example, sqrt or log for an ascending modifier; use your imagination here. I think you'll figure out the math yourself.

  • Thanks, I will try that. What is a better method to try for solving this problem where reward discovery is delayed? I thought that only learning at the end of an episode was the basis of the Monte Carlo method. – starfish Aug 03 '15 at 19:44