Page 20 - MSDN Magazine, April 2018

Artificially Intelligent | Frank La Vigne

Introducing Apache Spark ML
Apache Spark ML is a machine learning library module that runs on top of Apache Spark. Spark itself is a cluster computing environment that gives data engineers and data scientists an interface for programming entire clusters of commodity computers with data parallelism and fault tolerance. Spark supports a number of languages, such as Java, Scala, Python and R. It also natively provides a Jupyter Notebook service. Please refer to my February Artificially Intelligent column (msdn.com/magazine/mt829269) for the fundamentals of Jupyter Notebooks if you’re not familiar with them. In this article, I’ll explore using Spark ML in a Jupyter Notebook on an HDInsight cluster on Azure.
Getting Started with Apache Spark on Azure
To work with Spark ML, the first step is to create a Spark cluster. Log into the Azure Portal and choose “Create a resource,” then choose HDInsight, as shown in Figure 1. The blades that appear walk through the process of creating an HDInsight cluster.
The first blade, labeled Basics, covers essential properties of the cluster, such as cluster name, administrator and SSH credentials, as well as resource group and location. The cluster name must be unique across the azurehdinsight.net domain. For reference, see Figure 2. Of particular importance is the Cluster Type option, which brings up another blade. In this blade, shown in Figure 3, set the cluster type to Apache Spark and the version to Spark 2.1.0 (HDI 3.6). Click Select to save the setting and close the blade.
The next step configures storage options for the cluster. Leave Primary Storage Type and Selection Method at their defaults. For the storage container, click Create New and name it “msdnsparkstorage,” then set the default container to “sparkstorage” in the Default Container textbox. Click Next to get to the Summary step of the setup process. This screen offers a review of, and an opportunity to modify, the cluster setup, along with the estimated hourly cost to run the cluster. Take special care to always delete clusters when not in use: unlike virtual machines in Azure, HDInsight clusters do not have an option to pause and stop billing. Click Create and take note of the notification that it can take up to 20 minutes to instantiate the cluster.
HDFS and Azure Blob Storage
Spark, like Hadoop, uses the Hadoop Distributed File System (HDFS) as its cluster-wide file store. HDFS is designed to reliably store large datasets and make those datasets rapidly accessible to applications running on the cluster. As the name implies, HDFS originated within Hadoop and is also supported by Spark.
When running HDInsight clusters, Azure Storage Blobs seamlessly map to HDFS. This makes it simple to upload and download data to and from the HDFS file store attached to the Spark cluster. In fact, the first step in getting the project started will be using Azure Storage Explorer to upload the data file to work with.
Figure 1 Creating a New HDInsight Resource