When training an RL agent to play a game, there will be situations where the agent cannot perform certain actions lest it violate the game rules. That's easy to handle: I can set the network outputs for illegal actions to some large negative value so they won't be selected when doing an argmax. Or, if I use a softmax, I can set the probabilities of illegal actions to zero and then recalculate the softmax over the remaining legal actions. Indeed, I believe this is what David Silver was referring to when asked this question at a presentation/lecture on AlphaZero:
https://www.youtube.com/watch?v=Wujy7OzvdJk&t=2404s
But doing so changes the output of the network, and surely that changes things when performing backprop once a reward is known.
How does one handle that?
Would I set the illegal actions to the mean of the legal actions, or to zero...?
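For concreteness, here is a minimal NumPy sketch of the masking I have in mind (the names masked_policy and legal_mask are just illustrative):

    import numpy as np

    def masked_policy(logits, legal_mask):
        # Push illegal logits to a very large negative value so that both
        # argmax and softmax effectively ignore those actions.
        masked_logits = np.where(legal_mask, logits, -1e9)

        # Numerically stable softmax over the masked logits; illegal actions
        # end up with (effectively) zero probability and the rest renormalize.
        z = masked_logits - masked_logits.max()
        exp_z = np.exp(z)
        return exp_z / exp_z.sum()

    # Example: 4 actions, action 2 is illegal.
    logits = np.array([1.0, 0.5, 2.0, -0.3])
    legal = np.array([True, True, False, True])
    probs = masked_policy(logits, legal)
    greedy_action = np.argmax(probs)                         # never picks action 2
    sampled_action = np.random.choice(len(probs), p=probs)   # neither does sampling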
"How does one compute the gradient of the filtered version of Softmax? Seems like this would be necessary for backpropagation to work successfuly, yes?"
But can I ask, why would you need to rescale the softmax outputs after setting illegal moves to probability zero? If we just do an argmax/one-hot of the remaining legal actions anyway, then it'll be the same selected action after rescaling as before, so why bother rescaling over only the legal actions...? – BigBadMe Mar 13 '19 at 11:35
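(A quick numeric illustration of the point in that comment, with made-up numbers: zeroing illegal actions and renormalizing doesn't change which action an argmax picks, but the renormalization does matter once you sample, since np.random.choice expects p to sum to 1.)

    import numpy as np

    probs = np.array([0.10, 0.25, 0.40, 0.25])   # raw softmax output (made up)
    legal = np.array([1.0, 1.0, 0.0, 1.0])       # action 2 is illegal

    masked = probs * legal             # [0.10, 0.25, 0.00, 0.25] -- no longer sums to 1
    rescaled = masked / masked.sum()   # roughly [0.167, 0.417, 0.000, 0.417]

    assert np.argmax(masked) == np.argmax(rescaled)   # same greedy action either way
    action = np.random.choice(4, p=rescaled)          # but sampling needs the rescaled p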
probs = tf.nn.softmax(logits)
action = np.random.choice(self.action_size, p=probs.numpy()[0])
and then apply one-hot on the chosen action for the backpropagation. Perfect, thanks for your help. – BigBadMe Mar 14 '19 at 17:08
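Putting the comment thread together, here is a rough sketch of how those pieces might fit into a simple REINFORCE-style update in TensorFlow 2. The names model, state, legal_mask, advantage and action_size are assumptions for illustration; only the softmax / np.random.choice / one-hot steps come from the comments above.

    import numpy as np
    import tensorflow as tf

    def policy_gradient_step(model, state, legal_mask, advantage, action_size):
        # model: a Keras policy network producing one logit per action.
        # state: observation for this step; legal_mask: boolean tensor of
        # shape (1, action_size); advantage: scalar return/advantage estimate.
        # All of these names are illustrative, not from the original thread.
        with tf.GradientTape() as tape:
            logits = model(state[None, ...])              # shape (1, action_size)

            # Mask illegal actions with a large negative logit before the
            # softmax, so they receive (effectively) zero probability.
            neg_inf = tf.fill(tf.shape(logits), -1e9)
            masked_logits = tf.where(legal_mask, logits, neg_inf)
            probs = tf.nn.softmax(masked_logits)

            # Sample an action from the legal-only distribution.
            action = np.random.choice(action_size, p=probs.numpy()[0])

            # One-hot of the chosen action: the REINFORCE loss
            # -log pi(a|s) * advantage depends only on the chosen action's
            # (masked, renormalized) probability.
            action_one_hot = tf.one_hot(action, action_size)
            log_prob = tf.reduce_sum(action_one_hot * tf.math.log(probs[0] + 1e-8))
            loss = -log_prob * advantage

        grads = tape.gradient(loss, model.trainable_variables)
        return action, grads

Applying the gradients (for example with optimizer.apply_gradients) and accumulating the loss over an episode are left out to keep the sketch short.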