[Figure: demo output comparing actual and predicted passenger counts (columns: t, actual, predicted) for t = 5 through 10.]
Neural Networks for Time-Series Analyses
When you define a neural network, you must specify the activation functions used by the hidden-layer nodes and by the output-layer nodes. Briefly, I recommend using the hyperbolic tangent (tanh) function for hidden activation, and the identity function for output activation.
When using a neural network library or system such as Microsoft CNTK or Azure Machine Learning, you must explicitly specify the activation functions. The demo program hardcodes these activation functions. The key code occurs in method ComputeOutputs. The hidden node values are computed like so:
for (int j = 0; j < numHidden; ++j)
  for (int i = 0; i < numInput; ++i)
    hSums[j] += this.iNodes[i] * this.ihWeights[i][j];

for (int i = 0; i < numHidden; ++i)  // Add biases
  hSums[i] += this.hBiases[i];

for (int i = 0; i < numHidden; ++i)  // Apply activation
  this.hNodes[i] = HyperTan(hSums[i]);  // Hardcoded
Here, function HyperTan is program-defined to avoid extreme values:
private static double HyperTan(double x)
{
  if (x < -20.0) return -1.0;  // Correct to 30 decimals
  else if (x > 20.0) return 1.0;
  else return Math.Tanh(x);
}
A reasonable, and common, alternative to using tanh for hidden-node activation is to use the closely related logistic sigmoid function. For example:
private static double LogSig(double x)
{
  if (x < -20.0) return 0.0;  // Close approximation
  else if (x > 20.0) return 1.0;
  else return 1.0 / (1.0 + Math.Exp(-x));
}
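If you switch hidden activation to logistic sigmoid, the only change in ComputeOutputs is the activation call. A minimal sketch, assuming the same hSums and hNodes members as the demo:

for (int i = 0; i < numHidden; ++i)  // Apply activation
  this.hNodes[i] = LogSig(hSums[i]);  // Instead of HyperTan

If you do swap in LogSig, the derivative used in back-propagation must change to match, as described below.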
Because the identity function is just f(x) = x, using it for output-node activation is just a fancy way of saying don't use any explicit activation. The demo code in method ComputeOutputs is:
for (int j = 0; j < numOutput; ++j)
  for (int i = 0; i < numHidden; ++i)
    oSums[j] += hNodes[i] * hoWeights[i][j];

for (int i = 0; i < numOutput; ++i)  // Add biases
  oSums[i] += oBiases[i];

Array.Copy(oSums, this.oNodes, oSums.Length);
The sum of products for an output node is copied directly into the output node without applying an explicit activation. Note that the oNodes member of the NeuralNetwork class is an array with one cell, rather than a single variable.

The choice of activation functions affects the code in the back-propagation algorithm implemented in method Train. Method Train uses the calculus derivatives of each activation function. The derivative of y = tanh(x) is (1 + y) * (1 - y). In the demo code:

// Hidden node signals
for (int j = 0; j < numHidden; ++j) {
  derivative = (1 + hNodes[j]) * (1 - hNodes[j]);  // tanh
  double sum = 0.0;
  for (int k = 0; k < numOutput; ++k)
    sum += oSignals[k] * hoWeights[j][k];
  hSignals[j] = derivative * sum;
}
If you use logistic sigmoid activation, the derivative of y = logsig(x) is y * (1 - y). For output activation, the calculus derivative of y = x is just the constant 1. The relevant code in method Train is:
for (int k = 0; k < numOutput; ++k) {
  errorSignal = tValues[k] - oNodes[k];
  derivative = 1.0;  // For identity activation
  oSignals[k] = errorSignal * derivative;
}
Obviously, multiplying by 1 has no effect; I coded it this way to act as a form of documentation.
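The full Train method isn't shown here, but once the oSignals and hSignals values have been computed, they drive the weight and bias updates. The following is a minimal sketch, not the demo's exact code, assuming a learnRate variable and omitting any momentum term. Because errorSignal is computed as target minus output, gradient descent on squared error gives updates with +=:

// Update hidden-to-output weights and output biases
for (int j = 0; j < numHidden; ++j)
  for (int k = 0; k < numOutput; ++k)
    hoWeights[j][k] += learnRate * oSignals[k] * hNodes[j];
for (int k = 0; k < numOutput; ++k)
  oBiases[k] += learnRate * oSignals[k];

// Update input-to-hidden weights and hidden biases
for (int i = 0; i < numInput; ++i)
  for (int j = 0; j < numHidden; ++j)
    ihWeights[i][j] += learnRate * hSignals[j] * iNodes[i];
for (int j = 0; j < numHidden; ++j)
  hBiases[j] += learnRate * hSignals[j];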
The demo program finishes by using the last four passenger counts (t = 141 to 144) to predict the passenger count for the first time period beyond the range of the training data (t = 145 = January 1961):

double[] predictors = new double[] { 5.08, 4.61, 3.90, 4.32 };
double[] forecast = nn.ComputeOutputs(predictors);
Console.WriteLine("Predicted for January 1961 (t=145): ");
Console.WriteLine((forecast[0] * 100).ToString("F0"));
Console.WriteLine("End time series demo");

Notice that because the time-series model was trained using normalized data (divided by 100), the predictions will also be normalized, so the demo displays the predicted values times 100.
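Because each prediction has the same form as a training input, you can extend the forecast beyond t = 145 by sliding the window and feeding each prediction back in as a predictor. This is a hypothetical sketch, not part of the demo; it assumes the trained NeuralNetwork object nn, and note that prediction errors compound quickly with this kind of extrapolation:

double[] window = new double[] { 5.08, 4.61, 3.90, 4.32 };  // t = 141..144 (normalized)
for (int t = 145; t <= 148; ++t) {
  double pred = nn.ComputeOutputs(window)[0];
  Console.WriteLine("t = " + t + " : " + (pred * 100).ToString("F0"));
  Array.Copy(window, 1, window, 0, 3);  // Drop the oldest value
  window[3] = pred;                     // Append the prediction
}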
Wrapping Up
There are many different techniques you can use to perform time-series regression analyses. The Wikipedia article on the topic lists dozens of techniques, classified in many ways, such as parametric vs. non-parametric and linear vs. non-linear. In my opinion, the main advantage of using a neural network approach with rolling-window data is that the resulting model is often (but not always) more accurate than non-neural models. The main disadvantage of the neural network approach is that you must experiment with the learning rate to get good results.
Most time-series regression-analysis techniques use rolling-window data, or a similar scheme. However, there are advanced techniques that can use raw data, without windowing. In particular, a relatively new approach uses what's called a long short-term memory neural network. This approach often produces very accurate predictive models.
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products, including Internet Explorer and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.
Thanks to the following Microsoft technical experts who reviewed this article: John Krumm, Chris Lee and Adith Swaminathan