bandit

Implementation of Counterfactual Risk Minimization


Training Convolutional Neural Nets with Bandit Feedback

This repo introduces the concept of bandit feedback. With bandit feedback, the agent receives a reward only for the action it took; it never learns what the rewards for the other actions would have been. Compare this to supervised learning: there, the agent is shown the correct action for every input, so it knows which single action would have been the best one to take.

Examples

Applications

You can now imagine the wealth of applications:

The project

We condense the problem to something we can visualize, explain and run on a laptop using only a CPU. The applications listed above involve major companies with big computational and engineering budgets; this project aims to introduce the concepts. We use the "Hello world!" of machine learning: MNIST. If we understand bandit algorithms on MNIST, we can readily extend them to other projects.

The problem

Think of recommendations:

We simulate this recommendation setting using MNIST.

Technical details

How does this differ from supervised learning?

Let’s say the pixels in some image correspond to the digit 8. In supervised learning, we are told that the label is 8. With bandit feedback, we only observe the reward for the digit we recommended, and we never learn that 8 would have been the right answer.
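To make that concrete, here is a minimal sketch of the two feedback signals (plain Python/NumPy; the 0/1 reward and the variable names are illustrative, not taken from the repo):

import numpy as np

true_label = 8                     # the image actually shows an 8
n_actions = 10                     # digits 0-9

# Supervised learning: the correct label is revealed, so the reward
# for every possible prediction is known.
supervised_rewards = np.zeros(n_actions)
supervised_rewards[true_label] = 1.0

# Bandit feedback: we pick one digit and only observe its reward;
# the rewards of the other nine digits remain unknown.
chosen_action = 3                  # say the policy recommended a 3
observed_reward = 1.0 if chosen_action == true_label else 0.0   # 0.0, and we never learn the label was 8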

Why do we consider policies?

When working with bandit problems, we’ll refer to your recommendations as samples from a policy. What’s a policy? A policy defines a distribution over possible actions given your current features. For example, a policy might be a conv-net that takes an MNIST digit as input. Its output would then be a softmax distribution over the possible recommendations to make.
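As a rough sketch of such a policy network (written here with tf.keras for brevity; the actual graph in model/conv.py may be built differently):

import tensorflow as tf

# Policy network: maps a 28x28 MNIST image to a softmax distribution
# over the 10 possible recommendations (the digits 0-9).
policy = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),   # distribution over actions
])

probs = policy(tf.random.uniform((1, 28, 28, 1)))       # shape (1, 10), sums to 1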

You might think: why does the policy output a distribution and not just a single recommendation? Good question. We want to sample a recommendation so that we can keep learning from the user. If, for a given image, we always recommended 7, how would we ever learn which recommendation earns the best reward?
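To see the difference between sampling and always taking the most likely recommendation, consider this small NumPy sketch (the probabilities are made up):

import numpy as np

probs = np.array([0.05, 0.02, 0.03, 0.05, 0.05,
                  0.05, 0.05, 0.60, 0.04, 0.06])    # policy output for one image

greedy_action = np.argmax(probs)                    # always 7: we never explore
sampled_action = np.random.choice(10, p=probs)      # usually 7, sometimes another digit

# Sampling occasionally tries other recommendations, so we still collect
# rewards for actions the current policy thinks are worse.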

Technical detail on the policy

We work with logged training data. For example, your e-commerce website has been recommending books for years, and you want to use that information. But there are two catches here:

These catches come together in training our policy. The code in model/conv.py constructs our policy in TensorFlow. Near line 60, we have

importance = tf.divide(pol_y, pol_0)                    # ratio: new policy / logging policy
importance_sampling = tf.multiply(importance, reward)   # reweight the logged reward

We are learning pol_y, yet the website used pol_0 during logging. The intuition for this formula goes like this: say some logged example (features, action, reward) has a reward of +8. Our old policy chose that action with probability p=0.1, while our new policy chooses it with p=0.3. Effectively, we reward the system with +24. However, if the old p=0.1 and the new p=0.05, then we effectively reward the system with +4. So every training step scales the reward by the importance weight of that logged example.
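Plugging the numbers from that intuition into the formula (plain Python, purely illustrative):

reward = 8.0                       # reward observed in the log

# New policy likes the logged action more than the logging policy did.
p_0, p_y = 0.1, 0.3
effective = (p_y / p_0) * reward   # 3.0 * 8 = 24.0 -> effective reward +24

# New policy likes the logged action less.
p_0, p_y = 0.1, 0.05
effective = (p_y / p_0) * reward   # 0.5 * 8 = 4.0 -> effective reward +4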

The repo

Resources

Some pointers to resources and further reading:

As always, I am curious to hear any comments and questions. Reach me at romijndersrob@gmail.com