Azure Storage Explorer is a free, cross-platform utility to manage data stores in Azure. If you don’t already have it installed, please do so while waiting for the Spark cluster to initialize (azure.microsoft.com/features/storage-explorer).
Once the cluster is set up and Azure Storage Explorer is configured to access the storage account created for the cluster in Figure 2, open Azure Storage Explorer and browse to the “msdndemo” blob container using the tree control on the left (Figure 4). Click on “msdndemo” to reveal the contents at the root of the container, then click New Folder and in the Create New Virtual Directory dialog, enter the name for the new folder: flights. Click OK to create the folder. Next, click on the Upload button, choose Upload Files, click the ellipsis button, and browse to the CSV data file for this article, “06-2015.csv.” Click Upload to upload the file to the Blob store.
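If you prefer to script this step rather than click through Storage Explorer, a small Python sketch along these lines would also work. It assumes the azure-storage-blob package is installed and that you supply a connection string for the cluster’s storage account; the connection string below is a placeholder, not a value from this article:

from azure.storage.blob import BlobServiceClient

# Placeholder connection string for the cluster's storage account.
conn_str = '<storage-account-connection-string>'
service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client('msdndemo')

# Upload the CSV into the "flights" virtual directory.
with open('06-2015.csv', 'rb') as data:
    container.upload_blob(name='flights/06-2015.csv', data=data, overwrite=True)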
Now that the data file is uploaded, it’s time to start working with the file in a PySpark Notebook. The Spark Python API, commonly referred to as PySpark, exposes the Spark programming model to Python. For developers accustomed to Python, PySpark will feel very familiar. The Spark Web site provides a great introductory explanation to the environment and how it differs from standard Python (bit.ly/2oVBuCy).
Jupyter Notebooks in Spark
The HDInsight implementation of Apache Spark includes an instance of Jupyter Notebooks already running on the cluster. The easiest way to access the environment is to browse to the Spark cluster blade on the Azure Portal. On the overview tab, click either of the items labeled Cluster Dashboard (Figure 5). In the blade that appears, click on the Jupyter Notebook tile. If challenged for credentials, use the cluster login credentials created earlier.
Once the homepage for the cluster’s Jupyter Notebooks service loads, click New and then choose PySpark3 to create a new PySpark3 notebook, as depicted in Figure 6.
Doing this will create a new blank notebook with an empty cell. In this first cell, enter the following code to load the CSV file uploaded to the blob earlier. Before pressing Control+Enter on the keyboard to execute the code, examine the code first:
flights_df = spark.read.csv('wasb:///flights/06-2015.csv', inferSchema=True, header=True)
Note the “wasb” protocol handler in the URL. WASB stands for Windows Azure Storage Blobs and provides an interface between Hadoop and Azure Blob storage. For more information about how this was done and why this is significant, refer to the blog posts, “Why WASB Makes Hadoop on Azure So Very Cool” (bit.ly/2oUXptz), and “Understanding WASB and Hadoop Storage in Azure” (bit.ly/2ti43zu), both written by Microsoft developer Cindy Gross. For now, however, the key takeaway is that Azure Blobs can act as persistent file stores for data even when the Spark cluster isn’t running. Furthermore, the data stored here is accessible by applications that support either Azure Blob storage or HDFS.
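For reference, the wasb:/// shorthand used in the read resolves to the cluster’s default container. A fully qualified path names the container and storage account explicitly; the container and account names below are placeholders you would replace with your own values:

flights_df = spark.read.csv(
    'wasb://<container>@<account>.blob.core.windows.net/flights/06-2015.csv',
    inferSchema=True, header=True)

Either form can go in that first cell.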
With the focus still on the first cell, press Control+Enter. Immediately, the space beneath the cell will read “Starting Spark application.” After a moment, a table appears with some data about the job that just ran and a notification that the “SparkSession is available as ‘spark.’” The parameters passed to the spark.read.csv method told Spark to infer a schema and indicated that the file has a header row. The contents of the CSV file were loaded into a DataFrame object. To view the schema, enter the following code into the newly created blank cell and then execute the code by pressing Control+Enter:
flights_df.printSchema()
The schema appears and displays the name and datatype of each field. The names match the names of the fields from the header row of the CSV file.
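Schema inference is convenient, but it requires an extra pass over the file. If you already know the layout, you can hand spark.read.csv an explicit schema instead. The column names and types below are hypothetical stand-ins, so align them with the actual header row of your file:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# The schema should list every column in the file; only a few
# hypothetical ones are shown here as stand-ins.
flights_schema = StructType([
    StructField('DayofMonth', IntegerType(), True),
    StructField('Carrier', StringType(), True),
    StructField('DepDelay', IntegerType(), True)])

flights_df = spark.read.csv('wasb:///flights/06-2015.csv',
    schema=flights_schema, header=True)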
DataFrames in Detail
In Spark, DataFrames are a distributed collection of rows with named columns. Practically speaking, DataFrames provide an interface similar to tables in relational databases or an Excel worksheet with named column headers. Under the covers, DataFrames provide an API to a lower-level, fundamental data structure in Spark: the resilient distributed dataset (RDD). An RDD is a fault-tolerant, immutable, distributed collection of objects that can be worked on in parallel across the worker nodes in a Spark cluster.
RDDs themselves are divided into smaller pieces called partitions. Spark automatically determines the number of partitions into which to split an RDD. Partitions are distributed across the nodes on the cluster. When an action is performed on an RDD, each of its partitions launches a task and the action is executed in parallel. Happily, for those new to Spark, most of the architecture is abstracted away: first by the RDD data structure and Spark, then by the higher-level abstraction of the DataFrame API.
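Even though the DataFrame API hides most of those details, the backing RDD and its partitioning are easy to inspect from the notebook. A minimal sketch, continuing with the flights_df DataFrame loaded earlier:

# Number of partitions Spark chose for the RDD backing the DataFrame.
print(flights_df.rdd.getNumPartitions())

# repartition returns a new DataFrame spread across the requested number
# of partitions; the original DataFrame is left unchanged.
flights_df_repartitioned = flights_df.repartition(8)
print(flights_df_repartitioned.rdd.getNumPartitions())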
Figure 2 The Basics Step of the Quick Create Process for Setting Up an HDInsight Cluster
Figure 3 Choosing Spark as the Cluster Type for the HDInsight Cluster