Implementation of Counterfactual Risk Minimization
This repo introduces the concepts of bandit feedback. In bandit feedback, the agent receives a reward for the action it took. However, it will never know what the rewards are for other actions. Compare this to supervised learning: there, the agent gets a reward for the action it took and also knows whether that action was the single best action to take.
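As a minimal sketch (the function name and setup are illustrative, not taken from this repo), bandit feedback can be simulated from labeled data by revealing only the reward of the chosen action:

```python
import random

def bandit_feedback(label, action):
    """Return the reward for the chosen action only.
    Under full supervision we would see the label itself;
    under bandit feedback we only see this single number."""
    return 1.0 if action == label else 0.0

random.seed(0)
label = 8                      # true digit, hidden from the agent
action = random.randrange(10)  # the agent picks one of ten digits
reward = bandit_feedback(label, action)
# The agent observes `reward`, but never learns the rewards
# of the nine actions it did not take.
```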
You can now imagine the wealth of applications:
We condense the problem to something we can visualize, explain and run on a laptop using only a CPU. The applications listed above involve major companies with big computational and engineering budgets. This project aims at introducing the concepts. We use the "Hello world!" of machine learning: MNIST. If we understand bandit algorithms on MNIST, we can readily extend this to other projects.
Think of recommendations:
We simulate this bandit-feedback setting on MNIST.
How does this differ from supervised learning?
Say the pixels in some image correspond to the digit 8.
When working with bandit problems, we’ll refer to your recommendations as samples from a policy. What’s a policy? A policy defines a distribution over possible actions given your current features. For example, a policy might be a conv-net whose input is an MNIST digit. The output would then be a softmax distribution over the possible recommendations to make.
You might think: why does the policy output a distribution and not just a single recommendation? Good question. We want to sample a recommendation, so that we can learn from the user. If, for a given image, we’d always recommend 7, then how would we ever learn to get the best reward?
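A minimal sketch of such a stochastic policy (pure Python, no conv-net; the hard-coded logits below stand in for a network's output and are my own illustration):

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution over actions."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng=random):
    """Sample a recommendation instead of always taking the argmax,
    so that every action keeps a chance of being explored."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Pretend these are the conv-net's outputs for one MNIST image:
logits = [0.1, 2.0, 0.3, 1.2, 0.0, 0.0, 0.0, 0.5, 3.0, 0.1]
probs = softmax(logits)
action = sample_action(probs)
```

Sampling (rather than always recommending the argmax) is exactly what lets us keep learning about actions the current policy considers unlikely.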
We work with logged training data. For example, your e-commerce website has been recommending books for years now. You want to use that information. But there are two catches here:
These catches come together in training our policy. The code in model/conv.py constructs our policy in TensorFlow. Near line 60, we have:

```python
importance = tf.divide(pol_y, pol_0)
importance_sampling = tf.multiply(importance, reward)
```
We are learning `pol_y`, yet the website used `pol_0` during logging. The intuition for this formula goes like this:
Say some log (features, action, reward) gives a reward of +8. Our old policy chose that action with p=0.1. Our new policy chooses the action with p=0.3. Effectively, we reward the system with +24. However, if the old p=0.1 and the new p=0.05, then we effectively reward the system with +4. So every training step scales rewards by the importance of that log.
- `main.py` loads the modules and trains the policy
- `conv.py` implements the convolutional neural network as our policy
- `intro_to_cfr/` implements chapter 3 of an intro to counterfactual learning. It helped me to understand counterfactual learning. That folder contains its own readme.

Some pointers for resources and further reading
Intuition on bandit feedback
Learning on counterfactual regret
Academic material
Material used for the implementation
As always, I am curious about any comments and questions. Reach me at romijndersrob@gmail.com