Page 27 - MSDN Magazine, April 2018

the efficacy of the model. The best way to do that is to count the number of times the algorithm predicted correctly and how many times it was wrong. However, a simple “right and wrong” metric doesn’t always tell the full story. A better metric is the “confusion matrix,” which displays the number of true negatives and true positives along with the number of false positives and false negatives. Enter the following code into a blank cell and execute it to display the confusion matrix for this model:
# Each cell of the confusion matrix requires BOTH conditions to hold,
# so the column expressions are combined with & (and), not | (or).
true_negatives = predicted.where((predicted.prediction == '0.0') & (predicted.trueLabel == '0')).count()
true_positives = predicted.where((predicted.prediction == '1.0') & (predicted.trueLabel == '1')).count()
false_negatives = predicted.where((predicted.prediction == '0.0') & (predicted.trueLabel == '1')).count()
false_positives = predicted.where((predicted.prediction == '1.0') & (predicted.trueLabel == '0')).count()
print("True Positive: " + str(true_positives))
print("True Negative: " + str(true_negatives))
print("False Positive: " + str(false_positives))
print("False Negative: " + str(false_negatives))
The results are not encouraging. The model was wrong considerably more than it was right. All is not lost, however. When experiments fail, the best course of action is to analyze the results, tweak the model parameters and try again. This is why the field is called “data science.”
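One way to analyze the results is to reduce the four confusion-matrix counts to a few summary ratios. The following is a minimal sketch in plain Python, not part of the original notebook: the helper name summarize_confusion and the sample counts are hypothetical, chosen only for illustration. In the notebook, the values computed in the previous cell (true_positives, true_negatives and so on) could be passed in instead.

```python
def summarize_confusion(tp, tn, fp, fn):
    """Derive accuracy, precision and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    # Each ratio falls back to 0.0 when its denominator is zero,
    # for example when the model never predicts the positive class.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical counts, for illustration only (not results from the article).
metrics = summarize_confusion(tp=30, tn=50, fp=10, fn=10)
for name, value in metrics.items():
    print(name + ": " + format(value, ".2f"))
```

Accuracy is the share of all predictions that were correct, precision is the share of positive predictions that were correct, and recall is the share of actual positives the model found. Comparing these ratios from one run to the next can make it easier to tell whether a parameter change actually helped.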
Wrapping Up
Spark is a fast and powerful cluster computing environment with a modular architecture for parallel processing of data workloads. Two modules explored in this article were PySpark and Spark ML. PySpark provides a Python runtime for Spark and a high-level abstraction of Resilient Distributed Datasets (RDDs) in the form of the DataFrames API. The Spark ML library provides a machine learning API built on top of DataFrames.
Machine learning is a discipline within the larger field of data science. When an experiment doesn’t yield the desired results, finding the solution requires an iterative approach. Perhaps 10 minutes is too granular an interval. Maybe more than five input fields would help uncover a pattern. Quite possibly one month’s worth of flight data is not enough for the algorithm to establish a clear pattern. The only way to know is to keep experimenting, analyzing the results, and adjusting the input data and parameters.
Frank La Vigne leads the Data & Analytics practice at Wintellect and co-hosts the DataDriven podcast. He blogs regularly at FranksWorld.com and you can watch him on his YouTube channel, “Frank’s World TV” (FranksWorld.TV).
Thanks to the following technical expert for reviewing this article: Andy Leonard

































































































