Test Run
Deep Neural Network Training
By James McCaffrey

A regular feed-forward neural network (FNN) has a set of input nodes, a set of hidden processing nodes and a set of output nodes. For example, if you wanted to predict the political leaning of a person (conservative, moderate, liberal) based on their age and income, you could create an FNN with two input nodes, eight hidden nodes and three output nodes. The number of input and output nodes is determined by the structure of your data, but the number of hidden nodes is a free parameter that you have to determine using trial and error.
A deep neural network (DNN) has two or more hidden layers. In this article I'll explain how to train a DNN using the back-propagation algorithm and describe the associated "vanishing gradient" problem. After reading this article, you'll have code to experiment with, and a better understanding of what goes on behind the scenes when you use a neural network library such as the Microsoft Cognitive Toolkit (CNTK) or Google TensorFlow.
Take a look at the DNN shown in Figure 1. The DNN has three hidden layers that have four, two and two nodes. The input node values are 3.80 and 5.00, which you can imagine as the normalized age (38 years old) and income ($50,000 per year) of a person. Each of the eight hidden nodes has a value between -1.0 and +1.0 because the DNN uses tanh activation.
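For example, each hidden node value is computed by summing the products of the node's input values and their associated weights, adding the node's bias, and then applying tanh. A minimal C# sketch of the idea (the weight and bias values here are made up for illustration and aren't the values from Figure 1):

double[] inputs = new double[] { 3.80, 5.00 };
double[] weights = new double[] { 0.01, 0.02 }; // hypothetical values
double bias = 0.03; // hypothetical value
double sum = bias;
for (int i = 0; i < inputs.Length; ++i)
  sum += inputs[i] * weights[i];
double hiddenValue = Math.Tanh(sum); // always between -1.0 and +1.0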
The preliminary output node values are (-0.109, 3.015, 1.202). These values are converted to the final output values of (0.036, 0.829, 0.135) using the softmax function. The purpose of softmax is to coerce the output node values to sum to 1.0 so that they can be interpreted as probabilities. Assuming the output nodes represent the probabilities of "conservative," "moderate" and "liberal," respectively, the person in this example is predicted to be a political moderate.
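A minimal C# sketch of softmax (the demo's actual implementation may differ in details such as numerical scaling):

static double[] Softmax(double[] vals)
{
  double max = vals[0]; // scale by the max value to avoid arithmetic overflow
  for (int i = 1; i < vals.Length; ++i)
    if (vals[i] > max) max = vals[i];
  double sum = 0.0;
  double[] result = new double[vals.Length];
  for (int i = 0; i < vals.Length; ++i) {
    result[i] = Math.Exp(vals[i] - max);
    sum += result[i];
  }
  for (int i = 0; i < vals.Length; ++i)
    result[i] /= sum; // the results now sum to 1.0
  return result;
}

Calling Softmax on (-0.109, 3.015, 1.202) returns approximately (0.036, 0.829, 0.135), matching the final output values above.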
In Figure 1, each of the arrows connecting nodes represents a numeric constant called a weight. Weight values are typically between -10.0 and +10.0, but can be any value in principle. Each of the small arrows pointing into the hidden nodes and the output nodes is a numeric constant called a bias. The values of the weights and the biases determine the output node values.
Code download available at msdn.com/magazine/0917magcode.
Figure 1 An Example 2-(4,2,2)-3 Deep Neural Network
The process of finding the values of the weights and the biases is called training the network. The idea is to use a large set of training data, which has known input values and known correct output values, and then use an optimization algorithm to find values for the weights and biases so that the difference between computed output values and known correct output values is minimized. There are several algorithms that can be used to train a DNN, but by far the most common is the back-propagation algorithm.
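One common way to measure the difference between computed and known correct output values is mean squared error. A simple sketch (the demo program may use a different error metric):

static double Error(double[] computed, double[] target)
{
  // average of the squared differences between the two vectors
  double sum = 0.0;
  for (int i = 0; i < computed.Length; ++i)
    sum += (computed[i] - target[i]) * (computed[i] - target[i]);
  return sum / computed.Length;
}

For the Figure 1 example, if the correct output is (0, 1, 0), the error for computed output (0.036, 0.829, 0.135) is about 0.016. Back-propagation adjusts each weight and bias in the direction that reduces this error.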
This article assumes you have a basic understanding of the neural network input-output mechanism and at least intermediate-level programming skills. The demo program is coded using C# but you shouldn't have too much trouble refactoring the demo to another language such as Python or Java if you wish. The demo program is too long to present in its entirety in this article, but the complete program is available in the accompanying code download.
The Demo Program
The demo begins by generating 2,000 synthetic training data items. Each item has four input values between -4.0 and +4.0, followed by three output values, which can be (1, 0, 0) or (0, 1, 0) or (0, 0, 1), representing three possible categorical values. The first and last training items are:
[   0]  2.24  1.91  2.52  2.41  0.00  1.00  0.00
. . .
[1999]  1.30 -2.41 -3.18  0.11  1.00  0.00  0.00
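A minimal sketch of how such items could be generated, where the computeOutputs delegate stands in for a network with random weights, and assuming the largest output value determines the one-hot label (the demo's actual generation code, described next, may differ):

static double[][] MakeSyntheticData(int numItems,
  Func<double[], double[]> computeOutputs, Random rnd)
{
  double[][] result = new double[numItems][];
  for (int r = 0; r < numItems; ++r)
  {
    double[] inputs = new double[4];
    for (int i = 0; i < 4; ++i)
      inputs[i] = 8.0 * rnd.NextDouble() - 4.0; // in [-4.0, +4.0)
    double[] outputs = computeOutputs(inputs);
    int maxIdx = 0; // index of the largest output value
    for (int i = 1; i < outputs.Length; ++i)
      if (outputs[i] > outputs[maxIdx]) maxIdx = i;
    result[r] = new double[7]; // four inputs plus three one-hot outputs
    inputs.CopyTo(result[r], 0);
    result[r][4 + maxIdx] = 1.0;
  }
  return result;
}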
Behind the scenes, the dummy data is generated by creating a 4-(10,10,10)-3 DNN with random weight and bias values, and then feeding random input values to the network. After the