Q learning

Implementation of epsilon-greedy q-learning for robot on 2D grid


Q-learning

This post implements a Q-learning algorithm with an epsilon-greedy exploration strategy for a robot on a 2D grid. Q-learning is a reinforcement learning approach to solving the Bellman equations. Unlike value iteration, it does not require a known state-transition model: the Q values are learned solely from observed state transitions.
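To make this concrete, here is a minimal sketch of a single epsilon-greedy Q-learning step in MATLAB. The variable names (Q, alpha, gamma, epsilon, num_actions) and the helper next_state_reward are illustrative assumptions and do not necessarily match the names used in the repository.

```matlab
% One epsilon-greedy Q-learning step (illustrative sketch).
% Q            : num_states x num_actions matrix of action values
% s            : current state (linear index into the grid)
% alpha, gamma : learning rate and discount factor
% epsilon      : exploration probability

if rand() < epsilon
    a = randi(num_actions);            % explore: pick a random action
else
    [~, a] = max(Q(s, :));             % exploit: pick the greedy action
end

% next_state_reward is a hypothetical helper that applies action a in
% state s and returns the next state and the observed reward.
[s_next, r] = next_state_reward(s, a);

% Standard Q-learning update towards the bootstrapped target.
Q(s, a) = Q(s, a) + alpha * (r + gamma * max(Q(s_next, :)) - Q(s, a));
s = s_next;
```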

Our setting

Our setting is a 2D grid loaded with rewards. Q-learning would also work for a sparse reward grid, but to learn the reinforcement learning algorithms we advise starting with a simple grid like this one. The imaginary robot starts at state 0. The final state is state 100, where states 1-100 are the column-wise linear indices of the 10 by 10 grid. The file reward_normal.mat contains an example of such a reward grid.
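For reference, column-wise linear indexing is MATLAB's native convention, so converting between a state index and grid coordinates is a one-liner. A small sketch, assuming the 10 by 10 grid described above:

```matlab
% Column-wise linear indices on a 10 x 10 grid (assumed grid size).
n_rows = 10; n_cols = 10;

s = 23;                                       % example state
[row, col] = ind2sub([n_rows, n_cols], s)     % state 23 -> row 3, column 3
s_back = sub2ind([n_rows, n_cols], row, col)  % back to state 23
```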

Visualizing under the hood

Running the Q-learning function in RL_main_github.m only requires a few numerical values, and that's it. If you want more intuition for the learning procedure, many of the functions take a boolean flag that switches on visualization.

Visualization of entries of Q matrix

Monitor_Q: this figure monitors random entries of the Q matrix. The lower right plot shows the exponentially weighted moving average (EWMA) of the convergence criterion, Q_diff. With this diagram, we can visually inspect the convergence of the Q matrix, both via the random entries and via the convergence criterion itself.
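As a sketch of what the convergence criterion could look like: assuming Q_diff measures the size of the latest change to the Q matrix, its EWMA could be tracked as below. The smoothing factor 0.99 is an assumed value, not necessarily the one used in the repository.

```matlab
% Exponentially weighted moving average of the convergence criterion.
% Q_old and Q are the Q matrix before and after one update.
Q_diff = max(abs(Q(:) - Q_old(:)));   % size of the latest change

beta = 0.99;                          % smoothing factor (assumed value)
if ~exist('Q_diff_ewma', 'var')
    Q_diff_ewma = Q_diff;             % initialize on the first step
else
    Q_diff_ewma = beta * Q_diff_ewma + (1 - beta) * Q_diff;
end
```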

Visualization of epsilon and alpha decay

Monitor_ea: this figure plots the decay of epsilon and alpha over the number of steps, k. If you design your own decay function, this script might be a good starting point for visualizing that decay.
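If you want to experiment with your own schedule, a simple inverse decay over the step counter k is a common starting point. The functional form and constants below are assumptions for illustration; the repository's actual decay functions may differ.

```matlab
% Illustrative decay schedules for epsilon and alpha over steps k.
K = 1:10000;                          % step counter
epsilon = 1 ./ (1 + K / 1000);        % slow inverse decay of exploration
alpha   = 1 ./ (1 + K / 500);         % slightly faster decay of learning rate

plot(K, epsilon, K, alpha);
legend('epsilon', 'alpha');
xlabel('step k');
```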

Visualization of any matrix representing the grid

Monitor_grid: this figure is one example of the output of the visualization functions visQ.m and visQ_im.m. These functions visualize any matrix that represents values over the grid. You can use them to visualize the Q matrix, the N matrix or, as in this example, the reward matrix.
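At its core, such a grid visualization boils down to plotting a matrix as a colored image. This is a sketch of that idea, not the exact code in visQ.m or visQ_im.m:

```matlab
% Visualize any 10 x 10 matrix defined over the grid, e.g. the reward matrix.
R = randn(10, 10);                    % placeholder matrix over the grid

imagesc(R);                           % color-coded grid values
colorbar;                             % show the value scale
axis equal tight;
title('Values per grid cell');
```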

Visualization of the path at optimal policy

Vis_path: this figure visualizes the path that follows from the optimal policy. By the principle of optimality for the Bellman equations, we can generate a path associated with the optimal policy: at every state, we take the action with the highest Q value. The function plot_pol.m implements exactly that and also returns the associated reward and return.
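Extracting the greedy path amounts to repeatedly taking the argmax action from the Q matrix until the final state is reached. A minimal sketch, assuming a hypothetical transition helper next_state and a start at state 1; the actual implementation lives in plot_pol.m:

```matlab
% Follow the greedy policy from the start state to the final state.
s = 1;                       % start state (assumed)
goal = 100;                  % final state
path = s;

while s ~= goal && numel(path) < 1000      % guard against infinite loops
    [~, a] = max(Q(s, :));                 % greedy action at state s
    s = next_state(s, a);                  % hypothetical transition helper
    path(end + 1) = s;                     %#ok<AGROW>
end
```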

Further improvements

This code implements Q-learning in its most minimal form. For anyone new to Q-learning, the reward matrices and the 10x10 grid form a good first learning experience. Further improvements would be:

As always, I am curious about any comments and questions. Reach me at romijndersrob@gmail.com