Page 16 - MSDN Magazine, November 2018
Artificially Intelligent | Frank La Vigne
A Closer Look at Reinforcement Learning
In last month's column, I explored a few basic concepts of reinforcement learning (RL), trying both a strictly random approach to navigating a simple environment and then implementing a Q-Table to remember past actions and which actions led to which rewards. In that demo, an agent acting randomly reached the goal state approximately 1 percent of the time, while an agent using a Q-Table to remember previous actions reached it roughly half the time. However, that experiment only scratched the surface of the promising and expanding field of RL.
Recall that in the previous column (msdn.com/magazine/mt830356), an RL problem space consists of an environment, an agent, actions, states and rewards. An agent examines the state of an environment and takes an action. The action then changes the state of the agent and/or the environment, and the agent receives a reward and examines the updated state. The cycle then repeats for a number of iterations until the agent succeeds or fails at a predefined goal, at which point the simulation ends. With a Q-Table, an agent remembers which actions yielded positive rewards and references it when making decisions in subsequent simulations.
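For context, the Q-Table bookkeeping described above can be sketched roughly as follows. The state and action counts, learning rate and discount factor here are illustrative assumptions of mine, not the exact values from last month's demo:

```python
import numpy as np

number_of_states = 16   # assumed environment size, for illustration
number_of_actions = 4   # e.g., up/down/left/right

# The Q-Table: one row per state, one column per action.
Q = np.zeros((number_of_states, number_of_actions))

def update_q(state, action, reward, next_state,
             learning_rate=0.8, discount=0.95):
    # Standard Q-learning update: blend the old estimate with the
    # observed reward plus the discounted best value of the next state.
    best_future = np.max(Q[next_state])
    Q[state, action] += learning_rate * (
        reward + discount * best_future - Q[state, action])

# Example: in state 0 the agent takes action 2, earns a reward of 1,
# and lands in state 1. Starting from an all-zero table, the new
# estimate for Q[0, 2] becomes 0.8 * 1 = 0.8.
update_q(0, 2, 1, 1)
print(Q[0, 2])
```

In subsequent simulations, the agent consults the row for its current state and favors the action with the highest stored value.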
Multi-Armed Bandit Problem
One of the classical problems in RL is the tension between exploration and exploitation. Slot machines, often referred to as "one-armed bandits," are the inspiration for this problem; a bank of slot machines then creates a "multi-armed bandit." Each slot machine has some probability of paying out a jackpot. If the probability of a given pull resulting in a jackpot is P, then the probability of not paying out is 1 - P. If a machine has a jackpot probability (JP) of 0.5, then each pull of the lever has an equal chance of winning or losing. Conversely, a machine with a JP of 0.1 would yield a losing result 90 percent of the time.
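These probabilities are easy to verify empirically. The following sketch (using an assumed JP of 0.1 and a seed of my own choosing) simulates a large number of pulls and counts the fraction that land at or below P:

```python
import numpy as np

np.random.seed(0)
JP = 0.1              # a stingy machine
pulls = 100_000

# Each pull is a uniform draw on [0, 1); a draw at or below JP is a win.
wins = np.sum(np.random.uniform(0, 1, pulls) <= JP)
print(wins / pulls)   # close to 0.1, i.e., losing ~90 percent of the time
```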
Now, imagine a bank of five slot machines where the player (or agent) has a goal of maximizing winnings and minimizing losses. With no foreknowledge of any of the machines' jackpot probabilities, the agent must take some risks at first. With the first pull of the lever, the agent wins and receives a payout. However, subsequent tries reveal that this machine pays out about half of the time, a JP of 0.54. As slot machines go, this is quite generous. The agent must now decide whether to exploit the current known resource or explore a new machine: if the first slot machine's payout probability is this generous, is it worth trying the other machines in the bank to see if their odds are even better?
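One standard way to balance this trade-off, which the article's own code doesn't show, is an epsilon-greedy strategy: with probability epsilon the agent explores a machine at random, and otherwise it exploits the machine with the highest estimated JP. A minimal sketch, with an assumed epsilon of 0.1:

```python
import numpy as np

np.random.seed(1)

def choose_machine(estimated_JPs, epsilon=0.1):
    # With probability epsilon, explore a random machine;
    # otherwise exploit the machine with the best estimate so far.
    if np.random.uniform(0, 1) < epsilon:
        return int(np.random.randint(len(estimated_JPs)))
    return int(np.argmax(estimated_JPs))

# The agent so far only knows the first machine pays out ~54% of the time.
estimates = np.array([0.54, 0.0, 0.0, 0.0, 0.0])
print(choose_machine(estimates))  # usually 0, until exploration finds better
```

With epsilon at 0 the agent never explores and stays on machine 0 forever; with epsilon at 1 it never exploits what it has learned. Tuning that value is the heart of the trade-off.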
The best way to further explore this problem space is with some Python code in a Jupyter notebook. Create a Python 3 notebook on your preferred platform (I covered Jupyter notebooks in a previous article at msdn.com/magazine/mt829269). Create an empty cell, enter the following code and execute the cell:
import numpy as np
import matplotlib.pyplot as plt

number_of_slot_machines = 5
np.random.seed(100)
JPs = np.random.uniform(0, 1, number_of_slot_machines)
print(JPs)
plt.bar(np.arange(len(JPs)), JPs)
plt.show()
The output should read as follows and show a plot of the values, as shown in Figure 1:
[0.54340494 0.27836939 0.42451759 0.84477613 0.00471886]
The code creates an array of JP values for a series of five slot machines ranging from 0.004 to 0.844. However, the first machine the agent tried, while generous, is not the best. Clearly, the fourth slot machine with an 84.4 percent payout rate is the best paying machine in the environment. It is also worth noting that the final slot machine has the worst odds of paying out a jackpot. Remember that the agent has no prior knowledge of the payout rates and it must discover them on its own. Had the agent stayed on the first machine, choosing exploitation over exploration, the agent would never have found the best paying slot machine.
To represent what the agent knows at the start of a simulation, add the following code to a new cell:
known_JPs = np.zeros(number_of_slot_machines)
This creates an array of zeros, meaning that the agent assumes that the JP of each slot machine is zero. While this may not be the best initial value in all cases, it will suffice for our purposes here. To create a simulation of a slot machine, add the following code to a new cell and execute it:
def play_machine(slot_machine):
    x = np.random.uniform(0, 1)
    if x <= JPs[slot_machine]:
        return 10
    else:
        return -1
This code snippet simulates a slot machine, paying out a reward of 10 if the machine hits a jackpot and a reward of -1 if it does not. The odds of a payout are based on the likelihoods defined in the JPs NumPy array. To test the code, enter the following Python code into a new cell and execute it:
# Test Slot Machine 4
for machines in range(10):
    print(play_machine(3))
print("------")
# Test Slot Machine 5
for machines in range(10):
    print(play_machine(4))

Code download available at bit.ly/2IvCFlK.
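Given this +10/-1 reward scheme, each machine's expected reward per pull works out to 10 * P - (1 - P). The helper below is my own illustration, not part of the article's code: a machine with a JP of 0.5 averages 4.5 per pull, while the stingy JP-of-0.1 machine barely breaks even.

```python
def expected_reward(jp, win=10, loss=-1):
    # Expected value of a single pull: win * P + loss * (1 - P).
    return win * jp + loss * (1 - jp)

print(expected_reward(0.5))  # 4.5
print(expected_reward(0.1))  # close to 0.1 -- barely break-even
```

This is why discovering each machine's true JP matters so much: small differences in P translate directly into large differences in long-run winnings.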

