functions, but using Euclidean distance is simple and effective. The Euclidean distance between the two items is the square root of the sum of the squared differences of coordinates. For example, if a labeled data item is (0.20, 0.60) and the item to classify is (0.35, 0.38) then:
dist = sqrt( (0.20 - 0.35)^2 + (0.60 - 0.38)^2 )
     = sqrt( 0.0225 + 0.0484 )
     = 0.2663
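In C#, this computation can be expressed as a short helper method. The following is a minimal sketch that assumes both items are stored as plain coordinate arrays of equal length; the method name Distance is illustrative, not necessarily the name used in the demo listing:

static double Distance(double[] a, double[] b)
{
  // Euclidean distance: square root of the sum of squared differences
  double sum = 0.0;
  for (int i = 0; i < a.Length; ++i)
    sum += (a[i] - b[i]) * (a[i] - b[i]);
  return Math.Sqrt(sum);
}

// Example: Distance(new double[] { 0.20, 0.60 },
//   new double[] { 0.35, 0.38 }) returns 0.2663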
After computing all distances and finding the k-nearest distances, you must use a voting algorithm to determine the predicted class. There are many ways to compute a prediction from distances. The demo program uses the inverse weights technique, which is best explained by example. Suppose, as in the demo, the six closest distances are (0.0728, 0.0781, 0.1118, 0.1217, 0.1300, 0.1414) and the associated labeled classes are (0, 1, 0, 1, 1, 2). You compute the inverse of each distance, find the sum of the inverses, then divide each inverse by the sum. For the demo, the inverses are (13.7363, 12.8041, 8.9445, 8.2169, 7.6923, 7.0721). The sum of the inverses is 58.4663. Dividing each inverse distance by the sum gives the weights: (0.2349, 0.2190, 0.1530, 0.1405, 0.1316, 0.1210).
Each weight is a vote for its associated class. In the demo, the first and third closest items have class 0, so the vote for class 0 is the sum of the first and third weights: 0.2349 + 0.1530 = 0.3879. Similarly, the vote for class 1 is 0.2190 + 0.1405 + 0.1316 = 0.4911. And the vote for class 2 is just the sixth weight, 0.1210. The final prediction is the class that has the largest vote.
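The inverse-weights voting scheme can be sketched in code as follows. This sketch assumes the k nearest distances and their associated class labels have already been extracted into parallel arrays, and that no distance is exactly zero; the method name Vote and its signature are my own, not the demo's exact code:

static int Vote(double[] kDists, int[] kClasses, int numClasses)
{
  // compute the inverse of each distance and the sum of the inverses
  double[] inverses = new double[kDists.Length];
  double sum = 0.0;
  for (int i = 0; i < kDists.Length; ++i) {
    inverses[i] = 1.0 / kDists[i];
    sum += inverses[i];
  }
  // each normalized inverse is a weighted vote for its class
  double[] votes = new double[numClasses];
  for (int i = 0; i < kDists.Length; ++i)
    votes[kClasses[i]] += inverses[i] / sum;
  // return the class with the largest accumulated vote
  int result = 0;
  for (int c = 1; c < numClasses; ++c)
    if (votes[c] > votes[result]) result = c;
  return result;
}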
The Demo Program
The complete demo program, with a few minor edits to save space, is presented in Figure 3. To create the program, I launched Visual Studio and created a new console application named WeightedKNN. I used Visual Studio 2017 but the demo has no significant .NET Framework dependencies.
After the template code loaded, in the editor window I removed all unneeded namespace references, leaving just the reference to the top-level System namespace. In the Solution Explorer window, I right-clicked on file Program.cs, renamed it to the more descriptive WeightedKNNProgram.cs, and allowed Visual Studio to automatically rename class Program.
For simplicity, the demo program uses a static code approach rather than an object-oriented programming approach. Most of the work is done in the Analyze method.
[Figure 2 Weighted k-NN Data: scatter plot titled "k-NN Demo Data" showing class 0, class 1 and class 2 items plus the item to classify; x-axis: Person Age x 100 years; y-axis: Person Expenses x $100,000]
Loading Data into Memory
The demo program hardcodes the dummy data and the item-to-classify:
double[][] data = new double[33][];
data[0] = new double[] { 0, 0.25, 0.43, 0 }; // Etc.
data[32] = new double[] { 32, 0.46, 0.22, 2 };
double[] item = new double[] { 0.35, 0.38 };
In a non-demo scenario, you’d likely store your data in a text file or SQL table and write a helper method to load the data into an array-of-arrays-style matrix. I named the data as just “data,” rather than something like trainData because data used by k-NN isn’t really used to train a general model.
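Such a helper might resemble the following sketch, which assumes a comma-delimited text file with one data item per line and no header row; the method name LoadData and the file layout are hypothetical:

static double[][] LoadData(string fn, int numRows, int numCols)
{
  // read all lines, then parse each comma-delimited line into a row
  string[] lines = System.IO.File.ReadAllLines(fn);
  double[][] result = new double[numRows][];
  for (int i = 0; i < numRows; ++i) {
    string[] tokens = lines[i].Split(',');
    result[i] = new double[numCols];
    for (int j = 0; j < numCols; ++j)
      result[i][j] = double.Parse(tokens[j]);
  }
  return result;
}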
The first value in each data item is an arbitrary ID. I used consecutive integers from zero to 32 but, in many cases, you'd have meaningful IDs, such as a person's employee ID number. The k-NN algorithm doesn't need data IDs, so the ID column could've been omitted. The second and third values are the normalized predictor values. The fourth value is the class ID. Class IDs for k-NN don't need to be zero-based integers, but using this scheme is programmatically convenient.
Computing Distances
The first step in the k-NN algorithm is to compute the distance from each labeled data item to the item to be classified. Based on my experience and a few research results, the choice of distance function isn't as critical as you might think, and simple Euclidean distance usually works well.