Q learning

Implementation of epsilon-greedy q-learning for robot on 2D grid


Q-learning

This post implements a Q-learning algorithm with an epsilon-greedy exploration strategy for a robot on a 2D grid. Q-learning is a reinforcement learning approach to solving the Bellman equations. Unlike value iteration, it does not require a known state-transition model: the Q values are learned solely from observed state transitions.
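To make this concrete, here is a minimal sketch of a single epsilon-greedy Q-learning step in MATLAB. The variable names (Q, alpha, gamma, epsilon, num_actions) and the helper next_state_reward are illustrative assumptions and do not necessarily match the names used in the repository.

```matlab
% One epsilon-greedy Q-learning step (illustrative sketch).
% Q            : num_states x num_actions matrix of action values
% s            : current state (linear index into the grid)
% alpha, gamma : learning rate and discount factor
% epsilon      : exploration probability

if rand() < epsilon
    a = randi(num_actions);            % explore: pick a random action
else
    [~, a] = max(Q(s, :));             % exploit: pick the greedy action
end

% next_state_reward is a hypothetical helper that applies action a in
% state s and returns the next state and the observed reward.
[s_next, r] = next_state_reward(s, a);

% Standard Q-learning update towards the bootstrapped target.
Q(s, a) = Q(s, a) + alpha * (r + gamma * max(Q(s_next, :)) - Q(s, a));
s = s_next;
```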

Our setting

Our setting is a 2D grid loaded with rewards. Q-learning would also work for a sparse reward grid, but to learn the reinforcement learning algorithms we advise starting with a simple grid like this one. The imaginary robot starts at state 0. The final state is state 100, where states 1-100 are the column-wise linear indices of the 10 by 10 grid. The file reward_normal.mat contains an example of such a reward grid.
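For reference, column-wise linear indexing is MATLAB's native convention, so converting between a state index and grid coordinates is a one-liner. A small sketch, assuming the 10 by 10 grid described above:

```matlab
% Column-wise linear indices on a 10 x 10 grid (assumed grid size).
n_rows = 10; n_cols = 10;

s = 23;                                       % example state
[row, col] = ind2sub([n_rows, n_cols], s)     % state 23 -> row 3, column 3
s_back = sub2ind([n_rows, n_cols], row, col)  % back to state 23
```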

Visualizing under the hood

Running the Q-learning function in RL_main_github.m only requires a few numerical values, and that's it. If you want more intuition for the learning procedure, many of the functions take a boolean flag that switches on visualization.

Visualization of entries of Q matrix

Monitor_Q: this figure monitors random entries of the Q matrix. The lower right plot shows the exponentially weighted moving average (EWMA) of the convergence criterion, Q_diff. With this diagram, we can visually inspect the convergence of the Q matrix, both via the random entries and via the convergence criterion itself.
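As a sketch of what the convergence criterion could look like: assuming Q_diff measures the size of the latest change to the Q matrix, its EWMA could be tracked as below. The smoothing factor 0.99 is an assumed value, not necessarily the one used in the repository.

```matlab
% Exponentially weighted moving average of the convergence criterion.
% Q_old and Q are the Q matrix before and after one update.
Q_diff = max(abs(Q(:) - Q_old(:)));   % size of the latest change

beta = 0.99;                          % smoothing factor (assumed value)
if ~exist('Q_diff_ewma', 'var')
    Q_diff_ewma = Q_diff;             % initialize on the first step
else
    Q_diff_ewma = beta * Q_diff_ewma + (1 - beta) * Q_diff;
end
```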

Visualization of epsilon and alpha decay

Monitor_ea: this figure plots the decay of epsilon and alpha over the number of steps, k. If you design your own decay function, this script might be a good starting point for visualizing that decay.
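If you want to experiment with your own schedule, a simple inverse decay over the step counter k is a common starting point. The functional form and constants below are assumptions for illustration; the repository's actual decay functions may differ.

```matlab
% Illustrative decay schedules for epsilon and alpha over steps k.
K = 1:10000;                          % step counter
epsilon = 1 ./ (1 + K / 1000);        % slow inverse decay of exploration
alpha   = 1 ./ (1 + K / 500);         % slightly faster decay of learning rate

plot(K, epsilon, K, alpha);
legend('epsilon', 'alpha');
xlabel('step k');
```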

Visualization of any matrix representing the grid

Monitor_grid: this figure is one example of the output of the visualization functions visQ.m and visQ_im.m. These functions visualize any matrix that represents values over the grid. You can use them to visualize the Q matrix, the N matrix or, as in this example, the reward matrix.
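At its core, such a grid visualization boils down to plotting a matrix as a colored image. This is a sketch of that idea, not the exact code in visQ.m or visQ_im.m:

```matlab
% Visualize any 10 x 10 matrix defined over the grid, e.g. the reward matrix.
R = randn(10, 10);                    % placeholder matrix over the grid

imagesc(R);                           % color-coded grid values
colorbar;                             % show the value scale
axis equal tight;
title('Values per grid cell');
```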

Visualization of the path at optimal policy

Vis_path: this figure visualizes the path that follows from the optimal policy. By the principle of optimality for the Bellman equations, we can generate a path associated with the optimal policy: at every state, we take the action with the highest Q value. The function plot_pol.m implements exactly that and also returns the associated reward and return.
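Extracting the greedy path amounts to repeatedly taking the argmax action from the Q matrix until the final state is reached. A minimal sketch, assuming a hypothetical transition helper next_state and a start at state 1; the actual implementation lives in plot_pol.m:

```matlab
% Follow the greedy policy from the start state to the final state.
s = 1;                       % start state (assumed)
goal = 100;                  % final state
path = s;

while s ~= goal && numel(path) < 1000      % guard against infinite loops
    [~, a] = max(Q(s, :));                 % greedy action at state s
    s = next_state(s, a);                  % hypothetical transition helper
    path(end + 1) = s;                     %#ok<AGROW>
end
```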

Further improvements

This code implements Q-learning in its most minimal form. For anyone new to Q-learning, the reward matrices and the 10x10 grid form a good first learning experience. Further improvements would be:

As always, I am curious about any comments and questions. Reach me at romijndersrob@gmail.com