
I want to implement this update rule (the SARSA update) in a voice-search application:

$$ Q(S, A) \leftarrow Q(S, A)+\alpha\left(R+\gamma Q\left(S^{\prime}, A^{\prime}\right)-Q(S, A)\right) $$

The agent is also restricted to an $\epsilon$-greedy policy based on a given Q-function and epsilon. In short, I need an $\epsilon$-greedy policy to use while updating my Q-table.

Milan

1 Answer


Try returning a function that takes the state as input and returns the probability of each action as a numpy array whose length is the size of the action space (the set of possible actions). Here is one attempt:

import numpy as np

def EpsilonGreedyPolicy(Q, epsilon, no_of_actions):
    def policy(state):
        # Every action gets a baseline probability of epsilon / |A| (exploration)...
        probabilities = np.ones(no_of_actions, dtype=float) * epsilon / no_of_actions
        # ...and the greedy action gets the remaining (1 - epsilon) mass (exploitation).
        best_action = np.argmax(Q[state])
        probabilities[best_action] += 1.0 - epsilon
        return probabilities
    return policy
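To tie this back to the update rule in the question, here is a self-contained sketch of one SARSA step driven by such a policy. The Q-table shape, reward, and next state are made-up placeholders, since no environment is shown in the question; the policy factory is repeated so the snippet runs on its own.

```python
import numpy as np

# Hypothetical tiny Q-table: 3 states x 2 actions, all zeros to start.
Q = {s: np.zeros(2) for s in range(3)}

def EpsilonGreedyPolicy(Q, epsilon, no_of_actions):
    def policy(state):
        # epsilon / |A| baseline for every action, extra mass on the greedy one.
        probabilities = np.ones(no_of_actions, dtype=float) * epsilon / no_of_actions
        probabilities[np.argmax(Q[state])] += 1.0 - epsilon
        return probabilities
    return policy

policy = EpsilonGreedyPolicy(Q, epsilon=0.1, no_of_actions=2)

# One SARSA step: sample A ~ pi(S), observe R and S', sample A' ~ pi(S'),
# then apply Q(S,A) <- Q(S,A) + alpha * (R + gamma * Q(S',A') - Q(S,A)).
alpha, gamma = 0.5, 0.9
rng = np.random.default_rng(0)

S = 0
A = rng.choice(2, p=policy(S))
R, S_next = 1.0, 1                        # placeholder environment feedback
A_next = rng.choice(2, p=policy(S_next))
Q[S][A] += alpha * (R + gamma * Q[S_next][A_next] - Q[S][A])
# With Q initialized to zeros, Q[0][A] is now 0.5 regardless of which A was drawn.
```

Sampling the action with `rng.choice(..., p=policy(state))` is what makes the behavior $\epsilon$-greedy: with probability $1-\epsilon$ plus its share of $\epsilon$, the greedy action is taken, and every other action keeps probability $\epsilon/|A|$.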

Rithik Banerjee