Page 19 - MSDN Magazine, November 2018
Figure 1 Jackpot Probabilities of the Five Slot Machines
This code pits the best-performing machine against the worst-performing machine. Because this is all based on chance, there's no guarantee of the output, but the results should reflect the odds, with a majority of 10 values for machine 4 and nearly all -1 values for machine 5. With the simulated slot machine code behaving as expected, it's now time to examine a common algorithm in RL: Epsilon Greedy.
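The head-to-head comparison described above can be sketched as follows. This is a minimal sketch, not the article's code: it assumes a play_machine helper that pays 10 on a jackpot and -1 otherwise, and uses illustrative payout probabilities (the article's actual values appear in Figure 1).

```python
import numpy as np

# Illustrative jackpot probabilities for five machines; machine 4
# (index 3) pays out best, machine 5 (index 4) worst. These are
# placeholders, not the values from Figure 1.
payout_probs = np.array([0.30, 0.45, 0.55, 0.80, 0.05])

def play_machine(machine):
    """One pull of the lever: a jackpot pays 10, a miss costs 1."""
    if np.random.uniform(0, 1) <= payout_probs[machine]:
        return 10
    return -1

# Pit the best machine against the worst over ten pulls each.
best = [play_machine(3) for _ in range(10)]
worst = [play_machine(4) for _ in range(10)]
print(best)   # usually mostly 10s
print(worst)  # usually mostly -1s
```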
The Epsilon Greedy Algorithm
The core dilemma the agent faces here is whether to prioritize greed, the desire to exploit a known resource, or curiosity, the desire to explore other slot machines in the hopes of a better chance of rewards. One of the simplest algorithms for solving this dilemma is known as the Epsilon Greedy algorithm, where the agent chooses at random between using the slot machine with the best odds of payout observed thus far, or trying out another machine in the hopes that it may provide a better payout. With a low value of Epsilon, this algorithm follows the greedy algorithm, but will occasionally try another slot machine. For instance, if the Epsilon value is .1, the algorithm will opt to exploit 90 percent of the time and explore only 10 percent of the time. Typically, default values of Epsilon tend to fall between .05 and .1. In short, the agent will primarily play the best slot machine it knows of and sometimes try a new machine. Remember that each pull of the lever comes at a cost and the agent doesn't know what we know: that slot 4 pays out the best.
This underscores the core idea of RL: the agent knows nothing about the environment initially, so it needs to explore first and exploit later.
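As a quick sanity check on the exploit/explore split (not from the article), drawing a uniform random number per pull and comparing it to Epsilon, as the algorithm does, should explore roughly 10 percent of the time when Epsilon is .1:

```python
import numpy as np

np.random.seed(42)
epsilon = 0.1

# Simulate the per-pull decision: explore when the draw falls at or
# below epsilon, exploit otherwise.
draws = np.random.uniform(0, 1, 10_000)
explore_rate = np.mean(draws <= epsilon)
print(f"explored {explore_rate:.1%} of the time")  # close to 10%
```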
Figure 2 Reinforcement Learning Code
def multi_armed_bandit(arms, iterations, epsilon):
  total_reward, optimal_action = [], []
  estimated_payout_odds = np.zeros(arms)
  count = np.zeros(arms)
  for i in range(0, iterations):
    epsilon_random = np.random.uniform(0, 1)
    if epsilon_random > epsilon:
      # exploit
      action = np.argmax(estimated_payout_odds)
    else:
      # explore
      action = np.random.choice(np.arange(arms))
    reward = play_machine(action)
    estimated_payout_odds[action] = estimated_payout_odds[action] + \
      (1 / (count[action] + 1)) * (reward - estimated_payout_odds[action])
    total_reward.append(reward)
    optimal_action.append(action == np.argmax(estimated_payout_odds))
    count[action] += 1
  return estimated_payout_odds, total_reward
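The update to estimated_payout_odds in Figure 2 is an incremental mean: each observed reward nudges the machine's estimate by a step of 1/(count+1), which works out to the plain average of all rewards seen for that machine. A quick check of that equivalence (not from the article):

```python
import numpy as np

rewards = [10, -1, -1, 10, -1]
estimate, count = 0.0, 0
for r in rewards:
    # Same update as in Figure 2: nudge the estimate toward the new reward.
    estimate = estimate + (1 / (count + 1)) * (r - estimate)
    count += 1

print(estimate)          # approximately 3.4
print(np.mean(rewards))  # the plain average, also 3.4
```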