Figure 1 Single Solution, Separate Processes, Separate Steps (Event Hubs, WebJobs, HDInsight Spark; Steps 0–3)
with Application Insights” (msdn.com/magazine/mt808502). Here, we’ll focus on organizing our notebooks and jobs to facilitate proper tracking of the operations, events and data we send from our Databricks jobs.
In Databricks, you can define a job as the execution of a notebook with certain parameters. Figure 2 illustrates a couple of basic approaches to organizing work in a Databricks notebook.
Figure 2 shows two simple possibilities: in one, a job is defined as a single notebook with a number of code blocks or functions that get called, while in the other, a control notebook orchestrates the execution of child notebooks, either in sequence or in parallel. This is not, by any means, the only organization that can be used, but it’s enough to help illustrate how to think about correlation. How you go about organizing the notebooks and code is certainly a worthwhile topic, and it’s highly variable depending on the size and nature of the job. For a little more depth on Databricks Notebook Workflow, take a look at the blog post, “Notebook Workflows: The Easiest Way to Implement Apache Spark Pipelines” (bit.ly/2HOqvTj).
Notice that the notebook organization has been aligned with discrete operations that can be used to group event reporting in Application Insights. In Application Insights, correlation is accomplished via two properties: Operation Id and Parent Operation Id. As seen in Figure 2, we wish to capture all of the discrete events and metrics within a code block or separate notebook under the context of a single operation, which is done by using a distinct operation Id for each section. Additionally, we’d like to see those separate large operation blocks as part of a whole, which we can do by setting the context’s parent operation Id to the same value for all metrics reported in each operation. The parent operation Id can also be passed in from an outside trigger for the job, which then provides a mechanism to link all of the discrete operations from the previous process and the Azure Databricks job as part of a single overall operation identified by the parent operation Id in Application Insights.
We’ve depicted a couple of scenarios here. The key point is that you should consider how you want to organize your operations, events and metrics as part of the overall job organization.
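To make the parent/child relationship concrete, here’s a minimal sketch of how the control notebook in the nested-notebooks option might hand the same parent operation Id to each of its child notebooks. The notebook paths, timeout and argument name are hypothetical, not part of the sample solution:

# Control notebook: create one parent operation Id for the whole job and
# pass it to each child notebook, which then reports under its own operation Id.
import uuid

parentOperationId = str(uuid.uuid4())
dbutils.notebook.run("./PrepareData", 3600, {"parentOperationId": parentOperationId})
dbutils.notebook.run("./TrainModel", 3600, {"parentOperationId": parentOperationId})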
Adding Application Insights to the Environment
In order to get the environment ready, you need to install the Python Application Insights library on the cluster, grab some configuration settings and add a bit of helper code. You can find Application Insights on PyPI (pypi.python.org/pypi/applicationinsights/0.1.0). To add it to Databricks, simply choose a location in your workspace (we created one named Lib), right-click it and choose Create, then Library. Once there, you can enter the PyPI package name and Databricks will download and install the package. The last thing you’ll have to decide is whether or not you want to attach the library to all clusters automatically.
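Once the library is attached, one quick way to confirm that the cluster can talk to Application Insights is to send a test trace from a notebook cell. This is just a sanity-check sketch; the instrumentation key shown is a placeholder:

from applicationinsights import TelemetryClient

# Replace with the instrumentation key from your Application Insights instance.
tc = TelemetryClient('<instrumentation key>')
tc.track_trace('applicationinsights library attached to the Databricks cluster')
tc.flush()  # push buffered telemetry before the cell finishes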
In an attempt to reduce the amount of code to add to each notebook, we’ve added an include file that has a couple of helper functions:
from applicationinsights import TelemetryClient

def NewTelemetryClient(applicationId, operationId="", parentOperationId=""):
    tc = TelemetryClient(instrumentationKey)
    tc.context.application.id = applicationId
    tc.context.application.ver = '0.0.1'
    tc.context.device.id = 'Databricks notebook'
    tc.context.operation.id = operationId
    tc.context.operation.parentId = parentOperationId
    return tc
This code contains a factory function named NewTelemetryClient to create a telemetry client object, set some of the properties and return the object to the caller. As you can see, it takes a parent operation Id and an operation Id. This initializes the object, but note that if you need to change the operation Id, you’ll have to do it in the job notebook directly. Also worth noting is that the TelemetryClient constructor takes an instrumentation key, which can be found in the properties blade of the Application Insights instance you wish to use. We’ve statically assigned a few values that are needed for the example, but the TelemetryClient context object has many child objects and properties that are available. If
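As a rough sketch of how a job notebook might use the helper, the following assumes the include file has been run, that the parent operation Id arrives from an outside trigger as a notebook argument named parentOperationId, and that the application, event and metric names are purely illustrative:

import uuid

# Pick up the parent operation Id passed in by the trigger (or control notebook).
dbutils.widgets.text("parentOperationId", "")
parentOperationId = dbutils.widgets.get("parentOperationId")

# Distinct operation Id for this notebook/section, shared parent for the whole job.
tc = NewTelemetryClient('DataEngineeringExample', str(uuid.uuid4()), parentOperationId)

tc.track_event('DataPrep started')
# ... work for this section of the job ...
tc.track_metric('rowsProcessed', 42000)
tc.track_event('DataPrep completed')
tc.flush()  # send the telemetry to Application Insights before the notebook exits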
Figure 2 Basic Organization Options for a Databricks Notebook Job (a Single Notebook Job: Job Scheduler → Parent Operation Id → Cells/Blocks 1–3 with Operation Ids A, B and C; and a Nested Notebooks Job: Job Scheduler → Control Notebook with Parent Operation Id → child notebooks with Operation Ids A, B and C)

































































































