When training an RL agent to play a game, there will be situations where the agent cannot perform certain actions lest it violate the game rules. That's easy to handle: I can set the network outputs for illegal actions to some large negative value so they won't be selected when doing an argmax. Or, if I use a softmax, I can set the probabilities of illegal actions to zero and then recalculate the softmax over the remaining legal actions. Indeed, I believe this is what David Silver was referring to when asked this question at a presentation/lecture on AlphaZero:
https://www.youtube.com/watch?v=Wujy7OzvdJk&t=2404s
But doing so changes the output of the network, and surely that changes things when performing backprop once a reward is known.
How does one handle that?
Would I set the illegal actions to the mean of the legal actions, or to zero...?
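For concreteness, here is a minimal NumPy sketch of the masking I have in mind (the names masked_policy and legal_mask are just illustrative):

    import numpy as np

    def masked_policy(logits, legal_mask):
        # Push illegal logits to a very large negative value so that both
        # argmax and softmax effectively ignore those actions.
        masked_logits = np.where(legal_mask, logits, -1e9)

        # Numerically stable softmax over the masked logits; illegal actions
        # end up with (effectively) zero probability and the rest renormalize.
        z = masked_logits - masked_logits.max()
        exp_z = np.exp(z)
        return exp_z / exp_z.sum()

    # Example: 4 actions, action 2 is illegal.
    logits = np.array([1.0, 0.5, 2.0, -0.3])
    legal = np.array([True, True, False, True])
    probs = masked_policy(logits, legal)
    greedy_action = np.argmax(probs)                         # never picks action 2
    sampled_action = np.random.choice(len(probs), p=probs)   # neither does sampling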
"How does one compute the gradient of the filtered version of Softmax? Seems like this would be necessary for backpropagation to work successfuly, yes?"
But can I ask, why would you need to rescale the softmax outputs after setting illegal moves to probability zero? If we just do an argmax/one-hot of the remaining legal actions anyway, then it'll be the same selected action after rescaling as before, so why bother rescaling over only the legal actions...? – BigBadMe Mar 13 '19 at 11:35
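(A quick numeric illustration of the point in that comment, with made-up numbers: zeroing illegal actions and renormalizing doesn't change which action an argmax picks, but the renormalization does matter once you sample, since np.random.choice expects p to sum to 1.)

    import numpy as np

    probs = np.array([0.10, 0.25, 0.40, 0.25])   # raw softmax output (made up)
    legal = np.array([1.0, 1.0, 0.0, 1.0])       # action 2 is illegal

    masked = probs * legal             # [0.10, 0.25, 0.00, 0.25] -- no longer sums to 1
    rescaled = masked / masked.sum()   # roughly [0.167, 0.417, 0.000, 0.417]

    assert np.argmax(masked) == np.argmax(rescaled)   # same greedy action either way
    action = np.random.choice(4, p=rescaled)          # but sampling needs the rescaled p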
probs = tf.nn.softmax(logits)
action = np.random.choice(self.action_size, p=probs.numpy()[0])
and then apply one-hot on the chosen action for the backpropagation. Perfect, thanks for your help. – BigBadMe Mar 14 '19 at 17:08
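Putting the comment thread together, here is a rough sketch of how those pieces might fit into a simple REINFORCE-style update in TensorFlow 2. The names model, state, legal_mask, advantage and action_size are assumptions for illustration; only the softmax / np.random.choice / one-hot steps come from the comments above.

    import numpy as np
    import tensorflow as tf

    def policy_gradient_step(model, state, legal_mask, advantage, action_size):
        # model: a Keras policy network producing one logit per action.
        # state: observation for this step; legal_mask: boolean tensor of
        # shape (1, action_size); advantage: scalar return/advantage estimate.
        # All of these names are illustrative, not from the original thread.
        with tf.GradientTape() as tape:
            logits = model(state[None, ...])              # shape (1, action_size)

            # Mask illegal actions with a large negative logit before the
            # softmax, so they receive (effectively) zero probability.
            neg_inf = tf.fill(tf.shape(logits), -1e9)
            masked_logits = tf.where(legal_mask, logits, neg_inf)
            probs = tf.nn.softmax(masked_logits)

            # Sample an action from the legal-only distribution.
            action = np.random.choice(action_size, p=probs.numpy()[0])

            # One-hot of the chosen action: the REINFORCE loss
            # -log pi(a|s) * advantage depends only on the chosen action's
            # (masked, renormalized) probability.
            action_one_hot = tf.one_hot(action, action_size)
            log_prob = tf.reduce_sum(action_one_hot * tf.math.log(probs[0] + 1e-8))
            loss = -log_prob * advantage

        grads = tape.gradient(loss, model.trainable_variables)
        return action, grads

Applying the gradients (for example with optimizer.apply_gradients) and accumulating the loss over an episode are left out to keep the sketch short.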