Page 62 - MSDN Magazine, February 2018
P. 62
TesT Run JAMES MCCAFFREY Thompson Sampling Using C#
Thompson sampling is an algorithm that can be used to find a solution to a multi-armed bandit problem, a term deriving from the fact that gambling slot machines are informally called “one-armed bandits.”
Suppose you’re standing in front of three slot machines. When you pull the arm on one of the machines, it will pay you either zero dollars or one dollar according to some probability distribu- tion that’s unknown to you. For example, the machine might pay with a mean probability of 0.5, so if you pulled the handle on that machine 100 times, you’d expect to be paid zero dollars about 50 times and one dollar about 50 times.
Your goal is to identify the best machine, as efficiently as pos- sible. The problem just described is an example of a Bernoulli bandit problem because the result is either 0 or 1. There are other types of bandit problems. For example, if each machine paid out a value, usually between zero and one dollar (such as 55.0 cents), according to a bell-shaped Gaussian distribution, you’d have a Gaussian process bandit problem. This article addresses only the Bernoulli bandit problem.
It’s unlikely you’ll be asked
by your boss to analyze slot machines, but multi-armed bandit problems appear in many important, practical scenarios.
There are several algorithms that can be used to try to find the best machine. Suppose you’re limited to a total of 100 pulls on the three machines. You could try 10 pulls on each machine, keeping track of how well each machine performed. Then you could use your remaining 70 pulls on the one machine that paid out the most money during your 30-pull explore phase. The danger with this approach is that you might incorrectly identify the best machine. And if you use many pulls during the explore phase, you won’t have many left during the exploit phase.
Thompson sampling is a clever algorithm that continuously adjusts the estimated probabilities of each machine’s payout. As you’ll see shortly, the key to Thompson sampling for a Bernoulli bandit problem is the mathematical beta distribution.
It’s unlikely you’ll be asked by your boss to analyze slot machines, but multi-armed bandit problems appear in many important, practical scenarios. For example, suppose you work for a drug manufacturer. You’ve just created four new experimental drugs for cancer and you want to establish which of the four drugs is most effective, with a minimum of testing on live subjects. Or perhaps you work for an e-commerce company and you want to determine which of several new advertising campaigns is the most successful.
A good way to see where this article is headed is to take a look at the demo program in Figure 1. The demo sets up three Bernoulli machines with probabilities of payout of (0.3, 0.5, 0.7), respectively. In a non-demo scenario, the probabilities are unknown to you, of course. You are allowed a total of 10 pulls. The goal is to determine the best machine (machine #3) and pull its handle the most times.
In the first trial, you assume that all three machines have a payout probability of 0.5. The demo uses the beta distribution to gener- ate three probabilities based on that assumption. These random probabilities are (0.7711, 0.1660, 0.1090). Because the probability associated with machine #1 is highest, its arm is pulled. In this case, machine #1 does not pay out.
In the second trial, behind the scenes, the demo believes that the first machine now has a payout probability that is less than 0.5. Beta sam- pling is used and this time the probabilities are (0.6696, 0.2250, 0.7654), so machine #3 is selected and played, and its estimated probability of payout is adjusted according to whether the machine pays out or not.
Over time, because machine #3 has the highest probability of payout, it will win more often than the other two machines, and each time machine #3 does pay out, the likelihood that it will be selected increases.
After the 10 trials, machine #1 was played three times and paid out just once so the simulation guesses its true probability of payout is approximately 1/3 = 0.33. Machine #2 was played three times and paid out twice (estimated p = 2/3 = .6667). Machine #3 was played four times and paid out three times (estimated p = 3/4 = 0.7500). So, in this example at least, the best machine was identified and was played the most often.
This article assumes you have intermediate or better program- ming skills, but doesn’t assume you know anything about Thompson sampling or beta distributions. The demo program is coded using
Code download available at msdn.com/magazine/0218magcode.
58 msdn magazine