for (int j = 0; j < nHidden[lastLayer]; ++j) {
  for (int k = 0; k < nOutput; ++k) {
    hoGrads[j][k] = hNodes[lastLayer][j] * oSignals[k];
  }
}
In words, the gradient for a weight connecting a hidden node to an output node is the value of the hidden node times the output signal of the output node. After the gradient associated with a hidden-to-output weight has been computed, the weight can be updated:
for (int j = 0; j < nHidden[lastLayer]; ++j) {
  for (int k = 0; k < nOutput; ++k) {
    double delta = hoGrads[j][k] * learnRate;
    hoWeights[j][k] += delta;
    hoWeights[j][k] += hoPrevWeightsDelta[j][k] * momentum;
    hoPrevWeightsDelta[j][k] = delta;
  }
}
First the weight is incremented by delta, which is the value of the gradient times the learning rate. Then the weight is incremented by an additional amount—the product of the previous delta times the momentum factor. Note that using momentum is optional, but almost always done to increase training speed.
To recap, to update a hidden-to-output weight, you calculate an output node signal, which depends on the difference between target value and computed value, and the derivative of the output node activation function (usually softmax). Next, you use the output node signal and the hidden node value to compute the gradient. Then you use the gradient and the learning rate to compute a delta for the weight, and update the weight using the delta.
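The demo code that computes the output node signals isn't shown in this excerpt, but a minimal sketch might look like the following. It assumes softmax output activation paired with mean squared error, and a tValues array holding the target output values (both the derivative term and the tValues name are assumptions for illustration):

// Hypothetical sketch: compute the output node signals.
// Assumes softmax output activation with mean squared error,
// so the activation derivative is (1 - oNodes[k]) * oNodes[k].
// The tValues (target values) array is assumed for illustration.
for (int k = 0; k < nOutput; ++k) {
  double derivative = (1 - oNodes[k]) * oNodes[k];
  oSignals[k] = (tValues[k] - oNodes[k]) * derivative;
}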
Unfortunately, calculating the gradients for the input-to-hidden weights and the hidden-to-hidden weights is much more complicated. A thorough explanation would take pages and pages, but you can get a good idea of the process by examining one part of the code:
int lastLayer = nLayers - 1;
for (int j = 0; j < nHidden[lastLayer]; ++j) {
  double derivative = (1 + hNodes[lastLayer][j]) *
    (1 - hNodes[lastLayer][j]);  // For tanh
  double sum = 0.0;
  for (int k = 0; k < nOutput; ++k) {
    sum += oSignals[k] * hoWeights[j][k];
  }
  hSignals[lastLayer][j] = derivative * sum;
}
This code calculates the signals for the last hidden layer nodes—those just before the output nodes. The local variable derivative is the calculus derivative of the hidden layer activation function, tanh in this case. But the hidden signals depend on a sum of products that involves the output node signals. This leads to the “vanishing gradient” problem.
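For the hidden layers deeper in the network, the same pattern repeats, except that the weighted sum runs over the signals of the next hidden layer rather than over the output signals. Here's a minimal sketch, where the hhWeights array holding the hidden-to-hidden weights is an assumed name that doesn't appear in this excerpt:

// Hypothetical sketch: signals for the interior hidden layers.
// Assumes hhWeights[layer][j][k] connects node j in layer 'layer'
// to node k in layer 'layer + 1' (the array name is an assumption).
for (int layer = lastLayer - 1; layer >= 0; --layer) {
  for (int j = 0; j < nHidden[layer]; ++j) {
    double derivative = (1 + hNodes[layer][j]) *
      (1 - hNodes[layer][j]);  // For tanh
    double sum = 0.0;
    for (int k = 0; k < nHidden[layer + 1]; ++k) {
      sum += hSignals[layer + 1][k] * hhWeights[layer][j][k];
    }
    hSignals[layer][j] = derivative * sum;
  }
}

Each layer's signals are a derivative times a weighted sum of the signals one layer closer to the output, so every additional layer multiplies in more fractional terms.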
The Vanishing Gradient Problem
When you use the back-propagation algorithm to train a DNN, the gradient values associated with the hidden-to-hidden weights quickly become very small, or even zero. If a gradient value is zero, then the gradient times the learning rate is zero, the weight delta is zero, and the weight doesn't change. Even if a gradient doesn't go all the way to zero but just gets very small, the delta will be tiny and training will slow to a crawl.
The reason gradients quickly head toward zero should be clear if you carefully examine the demo code. Because output node values are coerced to probabilities, they're all between 0 and 1. This leads to output node signals that are between 0 and 1. The multiplication part of computing the hidden node signals therefore involves repeatedly multiplying values between 0 and 1, which will result in smaller and smaller gradients. For example, 0.5 * 0.5 * 0.5 * 0.5 = 0.0625. Additionally, the tanh hidden layer activation function introduces another fraction-times-fraction term.
The demo program illustrates the vanishing gradient problem by spying on the gradient associated with the weight from node 0 in the input layer to node 0 in the first hidden layer. The gradient for that weight decreases quickly:

epoch =  200  gradient = -0.002536
epoch =  400  gradient = -0.000551
epoch =  600  gradient = -0.000141
epoch =  800  gradient = -0.159148
epoch = 1000  gradient = -0.000009
...
The gradient temporarily jumps up at epoch 800 because the demo updates weights and biases after every training item is processed (this is called “stochastic” or “online” training, as opposed to “batch” or “mini-batch” training), and by pure chance the training item processed at epoch 800 led to a larger than normal gradient.
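For contrast, a batch-style approach accumulates the gradients over several training items and then applies one averaged update, which smooths out the occasional oversized gradient. Here's a minimal sketch, where ComputeGradients (a helper that runs the forward and backward passes for one training item) and batchSize are assumptions for illustration:

// Hypothetical sketch: mini-batch update of the hidden-to-output
// weights. ComputeGradients and batchSize are illustrative
// assumptions, not part of the demo program.
double[][] accGrads = new double[nHidden[lastLayer]][];
for (int j = 0; j < nHidden[lastLayer]; ++j)
  accGrads[j] = new double[nOutput];
for (int b = 0; b < batchSize; ++b) {
  ComputeGradients(trainData[b]);  // fills hoGrads for one item
  for (int j = 0; j < nHidden[lastLayer]; ++j)
    for (int k = 0; k < nOutput; ++k)
      accGrads[j][k] += hoGrads[j][k];
}
for (int j = 0; j < nHidden[lastLayer]; ++j)
  for (int k = 0; k < nOutput; ++k)
    hoWeights[j][k] += (accGrads[j][k] / batchSize) * learnRate;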
In the early days of DNNs, perhaps 25 to 30 years ago, the vanishing gradient problem was a show-stopper. As computing power increased, the vanishing gradient became less of a problem because training could afford to slow down a bit. But with the rise of very deep networks, with hundreds or even thousands of hidden layers, the problem resurfaced.
Many techniques have been developed to tackle the vanishing gradient problem. One approach is to use the rectified linear unit function (ReLU) instead of the tanh function for hidden layer activation. Another approach is to use different learning rates for different layers—larger rates for layers closer to the input layer. And the use of GPUs for deep learning is now the norm. A radical approach is to avoid back-propagation altogether, and instead use an optimization algorithm that doesn't require gradients, such as particle swarm optimization.
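For example, swapping ReLU in for tanh changes the derivative term used when computing the hidden node signals. The substitution below is an illustration, not code from the demo program:

// Hypothetical sketch: ReLU in place of tanh for hidden activation.
// Forward pass would use something like:
//   hNodes[layer][j] = Math.Max(0.0, preActivation);
// Backward pass replaces the tanh derivative, which is always a
// fraction, with the ReLU derivative:
double derivative = (hNodes[lastLayer][j] > 0.0) ? 1.0 : 0.0;  // ReLU
// For active nodes the derivative is exactly 1.0, so back-propagated
// signals aren't scaled down by a fraction at every hidden layer.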
Wrapping Up
The term deep neural network most often refers to the type of network described in this article—a fully connected network with multiple hidden layers. But there are many other types of deep neural networks. Convolutional neural networks are very good at image classification. Long short-term memory networks are extremely good at natural language processing. Most of the variations of deep neural networks use some form of back-propagation and are subject to the vanishing gradient problem.
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products, including Internet Explorer and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.
Thanks to the following Microsoft technical experts who reviewed this article: Chris Lee and Adith Swaminathan.