dataPoint[i+1] is offset to account for the data ID value at position [0]. The point here is that there's usually a tradeoff between coding an ML algorithm for a specific problem and coding the algorithm with a view toward general-purpose use. Because k-NN is so simple and easy to implement, I usually prefer the code-for-specific-problem approach.
At first thought, the requirement of computing a distance function seems to impose the major restriction on k-NN that all predictor variables must be strictly numeric. And until recently this idea was true, for the most part. However, one of the reasons for the increased interest in k-NN is that it’s possible for k-NN to deal with mixed numeric and non-numeric data by using neural encoding of the predictor variables.
Briefly, the idea is to encode predictor variables using a neural autoencoder. An autoencoder can deal with non-numeric data using one-hot or 1-of-(N-1) encoding. An autoencoder predicts its output values to match its input values. The central hidden layer of an autoencoder is a completely numeric representation of the source numeric and non-numeric data.
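To make the encoding step concrete, here is a minimal sketch, not taken from the demo program, of how a categorical predictor such as a color with values red, blue and green might be encoded before being fed to an autoencoder. The method names are hypothetical, and the 1-of-(N-1) convention shown (last category mapped to all -1 values) is one common choice:

static double[] EncodeOneHot(string color)
{
  // one-hot: one slot per category value
  if (color == "red")   return new double[] { 1, 0, 0 };
  if (color == "blue")  return new double[] { 0, 1, 0 };
  if (color == "green") return new double[] { 0, 0, 1 };
  throw new ArgumentException("unknown category");
}

static double[] EncodeOneOfNMinusOne(string color)
{
  // 1-of-(N-1): N-1 slots, with the last category encoded as all -1 values
  if (color == "red")   return new double[] {  1,  0 };
  if (color == "blue")  return new double[] {  0,  1 };
  if (color == "green") return new double[] { -1, -1 };
  throw new ArgumentException("unknown category");
}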
Sorting or Ordering the Distances
After all distances have been computed, the k-NN algorithm must find the k-nearest (smallest) distances. One approach is to augment the entire labeled dataset with each distance value, then explicitly sort the augmented data. An alternative approach, used by the demo, is to use a neat overload of the .NET Array.Sort method to sort just the distances and the associated data indices in parallel. The key code is:
int[] ordering = new int[N];
for (int i = 0; i < N; ++i)
  ordering[i] = i;
double[] distancesCopy = new double[N];
Array.Copy(distances, distancesCopy, distances.Length);
Array.Sort(distancesCopy, ordering);
The array named ordering initially holds 0, 1, 2, . . 32. A copy of the 33 distances to the item-to-classify is created because you don't want to lose the distances in their original order. The statement Array.Sort(distancesCopy, ordering) sorts the values in the distancesCopy array from smallest to largest using the Quicksort algorithm, and simultaneously rearranges the index values in the ordering array. The result is that the first value in array ordering is the index to the data of the item with the smallest distance, the second value in ordering holds the index to the second-closest distance and so on. Nice!
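As a quick check of that behavior, here is a tiny stand-alone example using made-up distance values rather than the demo data:

using System;
class ParallelSortDemo
{
  static void Main()
  {
    double[] distances = new double[] { 0.40, 0.15, 0.25 };  // made-up values
    int N = distances.Length;
    int[] ordering = new int[N];  // 0, 1, 2
    for (int i = 0; i < N; ++i)
      ordering[i] = i;
    double[] distancesCopy = new double[N];
    Array.Copy(distances, distancesCopy, distances.Length);
    Array.Sort(distancesCopy, ordering);  // sorts the copy, rearranges the indices
    Console.WriteLine(string.Join(" ", ordering));  // displays: 1 2 0
  }
}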
Determining k-NN Weights and Voting
The most primitive way to use the k-nearest distances to predict a class is a simple majority rule approach. For example, if k = 4 and c = 3, and two of the four closest items are class 2, one is class 0 and one is class 1, then a majority rule approach predicts class 2. Note that weighted k-NN using uniform weights, each with value 1/k, is equivalent to the majority rule approach.
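A minimal sketch of this voting idea, expressed as weighted voting so that uniform weights of 1/k give majority rule, might look like the following. The method name and parameters are illustrative, not taken from the demo program:

static int Vote(int[] kNearestClasses, double[] weights, int c)
{
  // kNearestClasses[i] is the class of the i-th closest labeled item,
  // weights[i] is its weight; for majority rule each weight is 1.0 / k
  double[] votes = new double[c];
  for (int i = 0; i < kNearestClasses.Length; ++i)
    votes[kNearestClasses[i]] += weights[i];

  int result = 0;
  for (int j = 1; j < c; ++j)
    if (votes[j] > votes[result])  // ties resolve to the lower class index
      result = j;
  return result;
}

For the example above, Vote(new int[] { 2, 0, 2, 1 }, new double[] { 0.25, 0.25, 0.25, 0.25 }, 3) returns 2.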
The majority rule approach has two significant problems. First, ties are possible. Second, all distances are weighted equally. The most common weighting scheme for weighted k-NN is the inverse weights approach used by the demo program, but there are many other techniques. For example, the reverse distances technique sums all the distances, divides each distance by the sum, and then reverses the order of the resulting values so that the smallest distance receives the largest weight.
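Here are hedged sketches of the two weighting schemes just described, assuming the k distances have already been sorted from smallest to largest. The method names are illustrative, not the demo program's:

static double[] InverseWeights(double[] kDistances)
{
  int k = kDistances.Length;
  double[] result = new double[k];
  double sum = 0.0;
  for (int i = 0; i < k; ++i)
  {
    result[i] = 1.0 / kDistances[i];  // assumes no distance is exactly zero
    sum += result[i];
  }
  for (int i = 0; i < k; ++i)
    result[i] /= sum;  // normalize so the weights sum to 1
  return result;
}

static double[] ReverseWeights(double[] kDistances)
{
  int k = kDistances.Length;
  double sum = 0.0;
  for (int i = 0; i < k; ++i)
    sum += kDistances[i];
  double[] result = new double[k];
  for (int i = 0; i < k; ++i)
    result[i] = kDistances[k - 1 - i] / sum;  // smallest distance gets the largest weight
  return result;
}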
Another approach is to use the rank of the k-nearest distances (1, 2, . . 6) instead of the distances themselves. For k = 6, the rank order centroid weights would be calculated as (1 + 1/2 + 1/3 + 1/4 + 1/5 + 1/6) / 6 = 0.4083, (1/2 + 1/3 + 1/4 + 1/5 + 1/6) / 6 = 0.2417, and so on.
For the demo program, the inverse, uniform, reverse and centroids weights are:

Inverse   Uniform   Reverse   Centroids
---------------------------------------
0.2349    0.1667    0.2156    0.4083
0.2190    0.1667    0.1982    0.2417
0.1530    0.1667    0.1856    0.1583
0.1405    0.1667    0.1705    0.1028
0.1316    0.1667    0.1191    0.0611
0.1210    0.1667    0.1110    0.0278
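The Centroids column can be reproduced with a short rank order centroid sketch such as the following (the method name is illustrative):

static double[] CentroidWeights(int k)
{
  // weight for rank r (1-based) is (1/r + 1/(r+1) + . . + 1/k) / k
  double[] result = new double[k];
  for (int i = 0; i < k; ++i)
  {
    double sum = 0.0;
    for (int j = i + 1; j <= k; ++j)
      sum += 1.0 / j;
    result[i] = sum / k;
  }
  return result;  // for k = 6: 0.4083, 0.2417, 0.1583, 0.1028, 0.0611, 0.0278
}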
Wrapping Up
The weighted k-NN classification algorithm has received increased attention recently for two reasons. First, by using neural autoencoding, k-NN can deal with mixed numeric and non-numeric predictor values. Second, compared to many other classification algorithms, the results of weighted k-NN are relatively easy to interpret. Interpretability has become increasingly important due to legal requirements imposed on ML techniques by regulations such as the European Union's GDPR.
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several key Microsoft products including Azure and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.
Thanks to the following Microsoft technical experts who reviewed this article: Chris Lee, Ricky Loynd