Page 66 - MSDN Magazine, February 2018
used to sample the beta distribution. The seeds used, 2 and 4, were specified only because they provided a representative demo run.
The main simulation loop begins:
for (int trial = 0; trial < 10; ++trial) {
  Console.WriteLine("Trial " + trial);
  for (int i = 0; i < N; ++i)
    probs[i] = bs.Sample(S[i] + 1.0, F[i] + 1.0);
...
You might want to set the number of trials to a large number, like 1,000, to observe how quickly Thompson sampling zeros in on an optimal machine. The key to the entire demo is the call to the Sample method. The numbers of successes and failures are passed to the method. A constant 1.0 is added as something of a hack because the beta distribution requires the a and b parameters to be greater than 0. If you have some prior knowledge of the machines’ payout probabilities, you could adjust the constant to reflect that information.
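The +1.0 constant also has a clean Bayesian interpretation: Beta(S+1, F+1) is the posterior distribution of the payout probability after S wins and F losses, starting from a uniform Beta(1, 1) prior, and its mean settles near the observed win rate as counts accumulate. A minimal sketch (the counts and class name here are hypothetical, not from the demo):

```csharp
using System;

class BetaPriorDemo
{
  static void Main()
  {
    // Hypothetical counts for one machine: 3 wins, 7 losses
    int S = 3, F = 7;

    // With the +1.0 hack, the sampler draws from Beta(S+1, F+1),
    // the posterior under a uniform Beta(1,1) prior
    double a = S + 1.0;
    double b = F + 1.0;

    // The mean of Beta(a, b) is a / (a + b), so the distribution
    // is centered near the observed win rate
    double mean = a / (a + b);
    Console.WriteLine(mean.ToString("F4"));  // (3+1)/(3+7+2) = 0.3333
  }
}
```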
After displaying the updated sampling probabilities, the demo selects the machine with the greatest sampling probability:
int machine = 0;
double highProb = 0.0;
for (int i = 0; i < N; ++i) {
  if (probs[i] > highProb) {
    highProb = probs[i];
    machine = i;
  }
}
Because probabilities are sampled, even if a machine has zero wins and many failures, it could still be selected for play. This mechanism enhances the exploration capabilities of the algorithm.
The main iteration loop concludes by playing the selected machine:
...
Console.Write("Playing machine " + machine);
double p = rnd.NextDouble();
if (p < means[machine]) {
  Console.WriteLine("win");
  ++S[machine];
}
else {
  Console.WriteLine("lose");
  ++F[machine];
}
The NextDouble method returns a uniformly random value between 0.0 and 1.0 and is used to implement a Bernoulli process. The demo concludes by displaying the estimated probabilities of payout for each machine (not bothering to check for possible division by zero), and the number of times each machine was played.
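The estimate for machine i is just S[i] / (S[i] + F[i]), and the division-by-zero guard the demo skips is a one-liner. A sketch with hypothetical final counts (a never-played machine included to exercise the guard):

```csharp
using System;

class EstimateDemo
{
  static void Main()
  {
    // Hypothetical final counts; machine 2 was never played
    int[] S = { 12, 3, 0 };
    int[] F = { 28, 9, 0 };

    for (int i = 0; i < S.Length; ++i) {
      int plays = S[i] + F[i];
      // Guard against division by zero for never-played machines
      double est = (plays == 0) ? 0.0 : (double)S[i] / plays;
      Console.WriteLine("Machine " + i + " est p = " +
        est.ToString("F4") + " (" + plays + " plays)");
    }
  }
}
```

With these counts the output is "Machine 0 est p = 0.3000 (40 plays)", "Machine 1 est p = 0.2500 (12 plays)" and "Machine 2 est p = 0.0000 (0 plays)".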
Implementing the Beta Distribution
Somewhat surprisingly, to the best of my knowledge the .NET Framework doesn’t have a library method to sample from the beta distribution. Instead of taking a dependency on an external library, I decided to implement one from scratch. There are many algorithms to generate a beta sample. I used the “BA” algorithm, developed by R. C. H. Cheng in 1978. The entire code is presented in Figure 4.
Figure 4 Program-Defined Beta Distribution Sampler
public class BetaSampler {
  public Random rnd;

  public BetaSampler(int seed) {
    this.rnd = new Random(seed);
  }

  public double Sample(double a, double b) {
    double alpha = a + b;
    double beta = 0.0;
    double u1, u2, w, v = 0.0;

    if (Math.Min(a, b) <= 1.0)
      beta = Math.Max(1 / a, 1 / b);
    else
      beta = Math.Sqrt((alpha - 2.0) /
        (2 * a * b - alpha));

    double gamma = a + 1 / beta;

    while (true) {
      u1 = this.rnd.NextDouble();
      u2 = this.rnd.NextDouble();
      v = beta * Math.Log(u1 / (1 - u1));
      w = a * Math.Exp(v);
      double tmp = Math.Log(alpha / (b + w));
      if (alpha * tmp + (gamma * v) - 1.3862944 >=
          Math.Log(u1 * u1 * u2))
        break;
    }

    double x = w / (b + w);
    return x;
  }
}
Sampling from the beta distribution is a fascinating topic in its own right. A quick scan of the code should convince you there’s some clever, non-trivial math involved. The implementation follows the source research paper closely—the same variable names and so on. Notice the potentially infinite loop, which is common in research pseudo-code, but a definite no-no in production code. You might want to add a loop counter variable and throw an exception if its value exceeds some threshold.
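One way to add that guard, sketched here with a stand-in acceptance test rather than the Cheng condition from Figure 4 (the maxTries threshold and exception message are arbitrary choices):

```csharp
using System;

class BoundedRejectionDemo
{
  // Guard pattern: run the rejection loop at most maxTries times
  // and throw instead of looping forever
  static double SampleWithGuard(Random rnd, int maxTries)
  {
    for (int tries = 0; tries < maxTries; ++tries) {
      double u = rnd.NextDouble();
      // Stand-in acceptance test; in Figure 4 this would be the
      // Cheng acceptance condition on u1 and u2
      if (u < 0.5)
        return u;
    }
    throw new Exception("Rejection sampler exceeded " +
      maxTries + " iterations");
  }

  static void Main()
  {
    Random rnd = new Random(0);
    double x = SampleWithGuard(rnd, 1000);
    Console.WriteLine(x >= 0.0 && x < 0.5);  // True
  }
}
```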
Wrapping Up
This article and its code should give you all the information you need to experiment with Thompson sampling for multi-armed Bernoulli problems. It should also allow you to explore bandit problems with other kinds of payout functions. For example, if you have machines that pay out according to a Poisson likelihood function, instead of using the beta distribution you’d sample from the gamma distribution, which is the conjugate prior for Poisson.
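A gamma sampler would also have to be program-defined. The sketch below uses the well-known Marsaglia-Tsang method (not from this article; the class name, structure and seed are this sketch's own inventions, modeled on the BetaSampler class):

```csharp
using System;

// A sketch of a Gamma(shape, scale = 1) sampler using the
// Marsaglia-Tsang method
public class GammaSampler
{
  public Random rnd;

  public GammaSampler(int seed)
  {
    this.rnd = new Random(seed);
  }

  // Standard normal draw via the Box-Muller transform
  private double SampleNormal()
  {
    double u1 = 1.0 - this.rnd.NextDouble();  // keep Log() finite
    double u2 = this.rnd.NextDouble();
    return Math.Sqrt(-2.0 * Math.Log(u1)) *
      Math.Cos(2.0 * Math.PI * u2);
  }

  public double Sample(double shape)
  {
    if (shape < 1.0) {
      // Boost the shape, then correct with a uniform power
      double u = this.rnd.NextDouble();
      return this.Sample(shape + 1.0) * Math.Pow(u, 1.0 / shape);
    }
    double d = shape - 1.0 / 3.0;
    double c = 1.0 / Math.Sqrt(9.0 * d);
    while (true) {  // rejection loop; same caveat as Figure 4
      double x = this.SampleNormal();
      double v = 1.0 + c * x;
      v = v * v * v;
      if (v <= 0.0) continue;
      double u = this.rnd.NextDouble();
      if (u < 1.0 - 0.0331 * x * x * x * x)
        return d * v;
      if (Math.Log(u) < 0.5 * x * x + d * (1.0 - v + Math.Log(v)))
        return d * v;
    }
  }
}

public class GammaDemo
{
  public static void Main()
  {
    GammaSampler gs = new GammaSampler(0);
    double sum = 0.0;
    int n = 100000;
    for (int i = 0; i < n; ++i)
      sum += gs.Sample(4.0);
    // The sample mean should land near the shape parameter, 4.0
    Console.WriteLine((sum / n).ToString("F2"));
  }
}
```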
The multi-armed bandit problem is an example of what’s called reinforcement learning (RL). In RL machine learning, instead of creating a prediction system using labeled data that has known correct answers, you generate a prediction model on-the-fly, using some sort of reward function. RL is at the forefront of machine learning research.
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products, including Internet Explorer and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.
Thanks to the following Microsoft technical experts who reviewed this article: Chris Lee, Ricky Loynd, Adith Swaminathan
Test Run