The equation for RBF is: K(x1, x2) = exp( -||x1 - x2||^2 / (2 * sigma^2) )
Here, K stands for kernel, x1 and x2 are two arrays that have the same length, sigma is a free parameter with a value like 1.0 or 1.5, the || indicates Euclidean distance, and exp means Euler’s number (e) raised to a power.
The RBF is best explained by example. Suppose x1 = (2.0, 1.0, 3.0) and x2 = (0.0, 1.0, 5.0), and sigma is 1.0. First, you compute the squared Euclidean distance:
||x1 - x2||^2 = (2.0 - 0.0)^2 + (1.0 - 1.0)^2 + (3.0 - 5.0)^2
             = 4.0 + 0.0 + 4.0
             = 8.0
Next, you divide the squared distance by 2 times sigma squared:
8.0 / (2 * (1.0)^2) = 8.0 / 2.0 = 4.0
Last, you take Euler’s number (e = approximately 2.71828) and raise it to the negative of the previous result:
K = e^(-4.0) = 0.0183
Notice that if x1 equals x2, the squared Euclidean distance will be 0, the result of dividing by 2 times sigma squared will also be 0, and e raised to the 0th power is 1.0. The less similar x1 and x2 are, the larger the squared distance will be, and e raised to a large negative value gets very small, approaching 0.0. The larger the value of sigma, the larger the value of K, but K still varies from 1.0 (equal vectors) down to, but not including, 0.0 (very dissimilar vectors). The order of the array parameters doesn’t matter, so K(x1, x2) = K(x2, x1).
The demo program defines the RBF kernel function as:
static double Kernel(double[] d1, double[] d2, double sigma)
{
  double num = 0.0;
  for (int i = 0; i < d1.Length - 1; ++i)
    num += (d1[i] - d2[i]) * (d1[i] - d2[i]);
  double denom = 2.0 * sigma * sigma;
  double z = num / denom;
  double result = Math.Exp(-z);
  return result;
}
The function assumes that the last cell of each array holds the class label (+1 or -1), so the last cell isn’t included in the calculation. There are several other kernel functions that can be used with a kernel perceptron, including the linear kernel, polynomial kernel, Fisher kernel, and sigmoid kernel. Calculating kernel functions is relatively easy. What’s not obvious, however, is that kernel functions have some remarkable mathematical properties that allow simple classifiers, like an ordinary perceptron, to be transformed into classifiers that can handle non-linearly separable data.
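For illustration, the Kernel function could be exercised on the earlier example vectors like this (a minimal sketch; the trailing -1.0 and +1.0 values are placeholder class labels added only because the function ignores the last cell):

double[] d1 = new double[] { 2.0, 1.0, 3.0, -1.0 };  // predictors plus a class label
double[] d2 = new double[] { 0.0, 1.0, 5.0, +1.0 };
double k = Kernel(d1, d2, 1.0);      // 0.0183, as computed by hand above
double kSame = Kernel(d1, d1, 1.0);  // identical vectors give 1.0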
Understanding Ordinary Perceptrons
An ordinary perceptron can perform binary classification for simple, linearly separable data. An ordinary perceptron consists of a set of weights, where the number of weights equals the number of predictor values, and an additional special weight called the bias. For example, suppose you have a binary classification problem where there are three predictor variables. And suppose the perceptron weights are w0 = 1.5, w1 = 1.1 and w2 = -3.0, the three predictor variables have values x0 = 4.0, x1 = 3.0, x2 = 5.0, and the bias is b = 2.5. The prediction is calculated as the sign (positive or negative) of the sum of the products of weight and input values, plus the bias:

sum = (w0)(x0) + (w1)(x1) + (w2)(x2) + b
    = (1.5)(4.0) + (1.1)(3.0) + (-3.0)(5.0) + 2.5
    = 6.0 + 3.3 + (-15.0) + 2.5
    = -3.2

prediction = sign(sum)
           = -1
It’s that simple. But where do the weights and bias values come from? You take training data that has known input and correct class values and then use a mathematical optimization algorithm to find values for the weights and bias so that the calculated class values closely match the known correct class values.
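For background, the classic perceptron training rule captures the idea: whenever a training item is misclassified, each weight is nudged toward the correct answer. The following is a minimal sketch of that standard formulation, not the demo program’s code; the method name, learning rate lr and maxEpochs are illustrative:

// Background sketch of the classic perceptron training rule -- not the article's demo code.
// Each trainData row holds the predictor values followed by a class label of +1 or -1.
static void TrainPerceptron(double[][] trainData, double[] w, ref double b,
  double lr, int maxEpochs)
{
  for (int epoch = 0; epoch < maxEpochs; ++epoch)
  {
    foreach (double[] item in trainData)
    {
      double sum = b;
      for (int i = 0; i < w.Length; ++i)
        sum += w[i] * item[i];
      int pred = (sum >= 0.0) ? +1 : -1;
      int label = (int)item[item.Length - 1];
      if (pred != label)  // misclassified, so nudge the weights toward the correct class
      {
        for (int i = 0; i < w.Length; ++i)
          w[i] += lr * label * item[i];
        b += lr * label;
      }
    }
  }
}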
Note that in perceptron literature, the bias value is often considered to be a regular weight associated with a dummy input value of 1.0. Using that scheme, the preceding example would have inputs x = (1.0, 4.0, 3.0, 5.0) and weights w = (2.5, 1.5, 1.1, -3.0), yielding the following (which is the same result as before):
sum = (2.5)(1.0) + (1.5)(4.0) + (1.1)(3.0) + (-3.0)(5.0) = -3.2
In other words, a perceptron bias value can be explicit or implicit. You should not underestimate the confusion this can cause.
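A brief sketch of both conventions may make the distinction concrete (the helper names here are illustrative, not from the demo program):

// explicit bias: separate weights w and bias b
static int PredictExplicit(double[] x, double[] w, double b)
{
  double sum = b;
  for (int i = 0; i < x.Length; ++i)
    sum += w[i] * x[i];
  return (sum >= 0.0) ? +1 : -1;
}

// implicit bias: wWithBias[0] is the bias and the input array starts with a dummy 1.0
static int PredictImplicit(double[] xWithDummy, double[] wWithBias)
{
  double sum = 0.0;
  for (int i = 0; i < xWithDummy.Length; ++i)
    sum += wWithBias[i] * xWithDummy[i];
  return (sum >= 0.0) ? +1 : -1;
}

With the example values, PredictExplicit(new[] { 4.0, 3.0, 5.0 }, new[] { 1.5, 1.1, -3.0 }, 2.5) and PredictImplicit(new[] { 1.0, 4.0, 3.0, 5.0 }, new[] { 2.5, 1.5, 1.1, -3.0 }) both return -1.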
Now, unfortunately, ordinary perceptrons can classify only linearly separable data, which makes them very limited in practice. However, by using a kernel function in conjunction with a perceptron, it’s possible to create a classifier that can work with non-linearly separable data, such as the demo data shown in Figure 2.
The Kernel Perceptron Algorithm
At first glance, the kernel perceptron algorithm appears unrelated to the ordinary perceptron algorithm, but in fact, the two are deeply related.
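As general background (this is the standard textbook formulation, not necessarily the exact form the demo program uses): instead of one weight per predictor, a kernel perceptron keeps a counter a[i] for each training item that records how often that item was misclassified during training, and it classifies a new item by a kernel-weighted vote over all the training items. A minimal sketch, reusing the Kernel function from above:

// Standard kernel perceptron decision function (background sketch, not the demo code).
// Each trainData row holds predictor values plus a class label (+1 or -1) in its last cell;
// unknown holds just the predictor values; a[i] is the mistake counter for training item i.
static int PredictKernel(double[] unknown, double[][] trainData, int[] a, double sigma)
{
  double sum = 0.0;
  for (int i = 0; i < trainData.Length; ++i)
  {
    double label = trainData[i][trainData[i].Length - 1];
    sum += a[i] * label * Kernel(trainData[i], unknown, sigma);
  }
  return (sum >= 0.0) ? +1 : -1;
}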
Figure 2 Kernel Perceptron Training Data (a scatter plot titled "Dummy Classification Training Data," with predictor x0 on the horizontal axis, x1 on the vertical axis, and the +1 and -1 classes plotted as separate series)