Implementation of epsilon-greedy Q-learning for a robot on a 2D grid
This post implements a Q-learning algorithm with an epsilon-greedy exploration strategy for a robot in a 2D grid. Q-learning is a reinforcement learning approach to solving the Bellman equations. Unlike value iteration, it does not need the state-transition model: learning is based solely on observed state transitions.
Our setting is a 2D grid loaded with rewards. Q-learning would also work for a grid with sparse rewards, but to learn the algorithm we advise starting with a simple grid like this. The imaginary robot starts at state 1. The final state is defined as state 100, where states 1-100 are the column-wise linear indices of the 10-by-10 grid.
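To make the procedure concrete, here is a minimal sketch of epsilon-greedy Q-learning on such a 10-by-10 grid. All numbers, the random reward grid, and the action layout are illustrative assumptions, not the settings used in RL_main_github.m.

```matlab
% Minimal epsilon-greedy Q-learning sketch. All values are illustrative
% placeholders, not the settings used in RL_main_github.m.
n_states  = 100;                % column-wise linear indices of the 10x10 grid
n_actions = 4;                  % 1=up, 2=down, 3=left, 4=right (assumed layout)
R = rand(10, 10);               % placeholder reward grid
Q = zeros(n_states, n_actions);
gamma = 0.9; alpha = 0.1; epsilon = 0.1;

for episode = 1:500
    s = 1;                              % start state
    for step = 1:1000                   % cap on the episode length
        if s == 100, break; end         % final state reached
        % Epsilon-greedy action selection
        if rand < epsilon
            a = randi(n_actions);
        else
            [~, a] = max(Q(s, :));
        end
        % Apply the action on the grid, clamped at the borders
        [row, col] = ind2sub([10 10], s);
        switch a
            case 1, row = max(row - 1, 1);
            case 2, row = min(row + 1, 10);
            case 3, col = max(col - 1, 1);
            case 4, col = min(col + 1, 10);
        end
        s_next = sub2ind([10 10], row, col);
        % Q-learning update: observed reward plus bootstrapped next-state value
        Q(s, a) = Q(s, a) + alpha * (R(row, col) + ...
            gamma * max(Q(s_next, :)) - Q(s, a));
        s = s_next;
    end
end
```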
The Q-learning function in RL_main_github.m only needs a few numerical parameters. That's it. If you want more intuition for the learning procedure, many of the functions take a boolean flag that turns on visualization.
This monitors random entries of the Q matrix. The lower-right panel shows the exponentially weighted moving average (EWMA) of the convergence criterion, Q_diff. With this diagram, we can visually inspect the convergence of the Q matrix from both the random entries and the convergence criterion itself.
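As an illustration, such a moving average could be tracked like this. The smoothing factor and the definition of Q_diff as the largest absolute change of any Q entry per step are assumptions for the sketch, not necessarily the repo's exact criterion.

```matlab
% Sketch: monitor convergence with an EWMA of Q_diff, assumed here to be
% the largest absolute change of any Q entry per learning step.
beta = 0.99;                           % smoothing factor (illustrative)
ewma_Q_diff = 0;
history = zeros(1, 1000);
Q = zeros(100, 4);
for k = 1:1000
    Q_old = Q;
    Q = Q + 0.01 * randn(size(Q));     % stand-in for one Q-learning update
    Q_diff = max(abs(Q(:) - Q_old(:)));
    ewma_Q_diff = beta * ewma_Q_diff + (1 - beta) * Q_diff;
    history(k) = ewma_Q_diff;
end
plot(history); xlabel('step k'); ylabel('EWMA of Q\_diff');
```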
This plots the decay of epsilon and alpha over the number of steps, k. If you design your own decay function, this script might be a good start for visualizing your decay.
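For reference, a simple starting point could look like the sketch below. The 1/k-style schedules and their constants are illustrative assumptions, not the decay functions used in the repo.

```matlab
% Sketch: illustrative decay schedules for epsilon and alpha over steps k.
% These particular functions are assumptions, not the repo's own decay.
k = 1:10000;
epsilon = max(0.05, 1 ./ (1 + 0.001 * k));   % slow 1/k-style decay with a floor
alpha   = 0.5 ./ (1 + 0.0005 * k);           % learning-rate decay
plot(k, epsilon, k, alpha);
legend('\epsilon', '\alpha'); xlabel('step k'); ylabel('value');
```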
This is one example of the visualization functions visQ.m and visQ_im.m. These functions visualize any matrix that represents values over the grid. You can use them to visualize the Q matrix, the N matrix or, in this example, the reward matrix.
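To get a feel for what such a plot shows, here is a stand-alone sketch that color-codes a 10-by-10 matrix over the grid. It is not the actual visQ.m or visQ_im.m code, whose calling conventions may differ.

```matlab
% Sketch: visualize any 10x10 grid-shaped matrix, e.g. a reward matrix.
% Illustrative stand-in, not the repo's visQ.m / visQ_im.m.
R = rand(10, 10);                 % placeholder reward matrix
imagesc(R);                       % color-coded value per grid cell
colorbar;
axis equal tight;
title('Reward per grid cell');
```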
This visualizes the path that follows from the optimal policy. By the Principle of Optimality for the Bellman equations, we can generate a path associated with the optimal policy: at every state, the chosen action is the one with the highest Q value. The function plot_pol.m implements exactly that and also returns the associated reward and return.
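A bare-bones version of that greedy rollout could look like the following. It assumes the 4-action layout and start/final states from the earlier sketch and uses a random stand-in Q matrix; it is not the actual plot_pol.m.

```matlab
% Sketch: follow the greedy policy (argmax over Q) from the start state
% until the final state or a step cap. Not the actual plot_pol.m.
Q = rand(100, 4);                        % stand-in for a learned Q matrix
s = 1;  traj = s;
for step = 1:200                         % cap in case the policy loops
    if s == 100, break; end
    [~, a] = max(Q(s, :));               % greedy action at state s
    [row, col] = ind2sub([10 10], s);
    switch a
        case 1, row = max(row - 1, 1);   % up (assumed action layout)
        case 2, row = min(row + 1, 10);  % down
        case 3, col = max(col - 1, 1);   % left
        case 4, col = min(col + 1, 10);  % right
    end
    s = sub2ind([10 10], row, col);
    traj(end + 1) = s;
end
[rows, cols] = ind2sub([10 10], traj);
plot(cols, rows, '-o'); axis([0 11 0 11]); set(gca, 'YDir', 'reverse');
```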
This code implements Q-learning in its minimal form. For anyone new to Q-learning, the reward matrices and a 10x10 grid form a good first learning experience. Further improvements could be:
As always, I am curious about any comments and questions. Reach me at romijndersrob@gmail.com