Test Run - MSDN Magazine, August 2017

Input parameter wts holds the values for the weights and biases, and is assumed to have the correct length. Variable ptr points into the wts array. The demo program has very little error checking in order to keep the main ideas as clear as possible. The input-to-first-hidden-layer weights are set like so:
for (int i = 0; i < nInput; ++i)
  for (int j = 0; j < hNodes[0].Length; ++j)
    ihWeights[i][j] = wts[ptr++];
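As a quick check on the "correct length" assumption, the number of cells that wts must hold can be computed directly from the network architecture. The following helper is a sketch of that arithmetic; the name NumWeights and the 3-(4,2)-2 architecture are illustrative assumptions, not taken from the demo:

```csharp
using System;

public class WeightCount
{
  // Total weights and biases for a fully connected network with
  // nInput inputs, hidden layer sizes in nHidden, and nOutput outputs.
  public static int NumWeights(int nInput, int[] nHidden, int nOutput)
  {
    int nLayers = nHidden.Length;
    int n = nInput * nHidden[0];             // input-to-first-hidden weights
    for (int h = 0; h < nLayers - 1; ++h)
      n += nHidden[h] * nHidden[h + 1];      // hidden-to-hidden weights
    n += nHidden[nLayers - 1] * nOutput;     // last-hidden-to-output weights
    for (int h = 0; h < nLayers; ++h)
      n += nHidden[h];                       // hidden node biases
    n += nOutput;                            // output node biases
    return n;
  }

  public static void Main()
  {
    // 3*4 + 4*2 + 2*2 + (4+2) + 2 = 32
    Console.WriteLine(NumWeights(3, new int[] { 4, 2 }, 2)); // 32
  }
}
```

A caller would typically compute this value once and verify wts.Length against it before unpacking.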
Next, the hidden-to-hidden weights are set:
for (int h = 0; h < nLayers - 1; ++h)
  for (int j = 0; j < nHidden[h]; ++j)        // From
    for (int jj = 0; jj < nHidden[h+1]; ++jj) // To
      hhWeights[h][j][jj] = wts[ptr++];
If you’re not accustomed to working with multi-dimensional arrays, the indexing can be quite tricky. A diagram of the weights and biases data structures is essential (well, for me, anyway). The last-hidden-layer-to-output weights are set like this:
int hi = this.nLayers - 1;
for (int j = 0; j < this.nHidden[hi]; ++j)
  for (int k = 0; k < this.nOutput; ++k)
    hoWeights[j][k] = wts[ptr++];
This code uses the fact that if there are nLayers hidden layers (3 in the demo), then the index of the last hidden layer is nLayers-1. Method SetWeights concludes by setting the hidden node biases and the output node biases:
...
for (int h = 0; h < nLayers; ++h)
  for (int j = 0; j < this.nHidden[h]; ++j)
    hBiases[h][j] = wts[ptr++];
for (int k = 0; k < nOutput; ++k)
  oBiases[k] = wts[ptr++];
}
Computing the Output Values
The definition of class method ComputeOutputs begins with:
public double[] ComputeOutputs(double[] xValues) {
  for (int i = 0; i < nInput; ++i)
    iNodes[i] = xValues[i];
...
The input values are in array parameter xValues. Class member nInput holds the number of input nodes and is set in the class constructor. The first nInput values in xValues are copied into the input nodes, so xValues is assumed to have at least nInput values in its first cells. Next, the current values in the hidden and output nodes are zeroed out:
for (int h = 0; h < nLayers; ++h)
  for (int j = 0; j < nHidden[h]; ++j)
    hNodes[h][j] = 0.0;
for (int k = 0; k < nOutput; ++k)
  oNodes[k] = 0.0;
The idea here is that the sum-of-products term will be accumulated directly into the hidden and output nodes, so these nodes must be explicitly reset to 0.0 for each method call. An alternative is to declare and use local arrays with names like hSums[][] and oSums[]. Next, the values of the nodes in the first hidden layer are calculated:
for (int j = 0; j < nHidden[0]; ++j) {
  for (int i = 0; i < nInput; ++i)
    hNodes[0][j] += ihWeights[i][j] * iNodes[i];
  hNodes[0][j] += hBiases[0][j];           // Add the bias
  hNodes[0][j] = Math.Tanh(hNodes[0][j]);  // Activation
}
The code is pretty much a one-to-one mapping of the mechanism described earlier. The built-in Math.Tanh is used for hidden node activation. As I mentioned, important alternatives are the logistic sigmoid function and the rectified linear unit (ReLU) function, which I'll explain in a future article. Next, the remaining hidden-layer nodes are calculated:
for (int h = 1; h < nLayers; ++h) {
  for (int j = 0; j < nHidden[h]; ++j) {
    for (int jj = 0; jj < nHidden[h-1]; ++jj)
      hNodes[h][j] += hhWeights[h-1][jj][j] * hNodes[h-1][jj];
    hNodes[h][j] += hBiases[h][j];
    hNodes[h][j] = Math.Tanh(hNodes[h][j]);
  }
}
This is the trickiest part of the demo program, mostly due to the multiple array indexes required. Next, the pre-activation sums of products are calculated for the output nodes:
for (int k = 0; k < nOutput; ++k) {
  for (int j = 0; j < nHidden[nLayers - 1]; ++j)
    oNodes[k] += hoWeights[j][k] * hNodes[nLayers - 1][j];
  oNodes[k] += oBiases[k];  // Add bias
}
Method ComputeOutputs concludes by applying the softmax activation function, returning the computed output values in a separate array:
...
double[] retResult = Softmax(oNodes);
for (int k = 0; k < nOutput; ++k)
  oNodes[k] = retResult[k];
return retResult;
}
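One way to sanity-check the mechanism is to compute a single first-hidden-layer node value by hand for made-up numbers. The values below (two inputs, one hidden node) are illustrative assumptions, not taken from the demo:

```csharp
using System;

public class NodeCheck
{
  // One hidden node: tanh of (sum of input*weight products plus bias).
  public static double HiddenNode(double[] x, double[] w, double b)
  {
    double sum = b;
    for (int i = 0; i < x.Length; ++i)
      sum += x[i] * w[i];
    return Math.Tanh(sum);
  }

  public static void Main()
  {
    double[] x = { 1.0, 2.0 };   // hypothetical input values
    double[] w = { 0.01, 0.02 }; // hypothetical ihWeights[0][0], ihWeights[1][0]
    double b = 0.03;             // hypothetical hBiases[0][0]

    // sum = (1.0)(0.01) + (2.0)(0.02) + 0.03 = 0.08
    double h = HiddenNode(x, w, b);
    Console.WriteLine(h.ToString("F4"));  // tanh(0.08) = 0.0798
  }
}
```

Working one node out on paper this way, then comparing against the program, is a quick confidence check that the weight indexing is correct.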
The Softmax method is a static helper. See the accompanying code download for details. Notice that because softmax activation requires all the values that will be activated (in the denominator term), it’s more efficient to compute all softmax values at once instead of separately. The final output values are stored into the output nodes and are also returned separately for calling convenience.
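A typical implementation of such a helper also subtracts the largest pre-activation value before exponentiating, to avoid arithmetic overflow. The sketch below is one plausible version under that assumption; it is not necessarily the exact code in the download:

```csharp
using System;

public class SoftmaxDemo
{
  // Softmax over all cells at once: exp of each (max-shifted) value,
  // divided by the shared denominator, so the results sum to 1.0.
  public static double[] Softmax(double[] oSums)
  {
    double max = oSums[0];
    for (int k = 1; k < oSums.Length; ++k)
      if (oSums[k] > max) max = oSums[k];   // shift for numerical stability

    double sum = 0.0;
    double[] result = new double[oSums.Length];
    for (int k = 0; k < oSums.Length; ++k) {
      result[k] = Math.Exp(oSums[k] - max);
      sum += result[k];
    }
    for (int k = 0; k < oSums.Length; ++k)
      result[k] /= sum;                     // now interpretable as probabilities
    return result;
  }

  public static void Main()
  {
    double[] probs = Softmax(new double[] { 1.0, 2.0, 3.0 });
    // Approximately 0.0900, 0.2447, 0.6652
    Console.WriteLine(probs[0].ToString("F4") + " " +
      probs[1].ToString("F4") + " " + probs[2].ToString("F4"));
  }
}
```

Because the shared denominator needs every value, computing the whole array in one call, as above, is cheaper than activating each output node separately.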
Wrapping Up
There has been enormous research activity and many breakthroughs related to deep neural networks over the past few years. Specialized DNNs such as convolutional neural networks, recurrent neural networks, LSTM neural networks and residual neural networks are very powerful but very complex. In my opinion, understanding how basic DNNs operate is essential for understanding the more complex variations.
In a future article, I'll explain in detail how to use the back-propagation algorithm (arguably the most famous and important algorithm in machine learning) to train a basic DNN. Back-propagation, or at least some form of it, is used to train most DNN variations, too. This explanation will introduce the concept of the vanishing gradient, which in turn will explain the design and motivation of many of the DNNs now being used for very sophisticated prediction systems.
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products, including Internet Explorer and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.
Thanks to the following Microsoft technical experts who reviewed this article: Li Deng, Pingjun Hu, Po-Sen Huang, Kirk Li, Alan Liu, Ricky Loynd, Baochen Sun, Henrik Turbell.