Page 24 - MSDN Magazine, July 2017
Using Cognitive Functions in U-SQL
As described earlier, the assemblies and models that power the Cognitive Services have been integrated with U-SQL, allowing you to run simple query statements over millions of images and process them using cognitive functions. The overall method for using these cognitive capabilities at scale in U-SQL is simply:
• Use the REFERENCE ASSEMBLY statement to include the cognitive functions in the U-SQL script.
• Use the EXTRACT operation to load data into a rowset.
• Use the PROCESS operation to apply various cognitive functions.
• Use the SELECT operation to apply transformations to the predictions.
• Use the OUTPUT operation to store the result in a persistent store.
Let’s continue with the scenario described earlier, which involves processing a large number of images and analyzing the emotions of people when there are animals in the image. Figure 1 shows an example script for completing this scenario. In the example, we use the Vision cognitive functions, which enable us to understand what’s in an image and return a set of tags that identify objects. For the sample script in Figure 1, we’re using a subset of the images from the team’s 1 million images dataset.
In this simple U-SQL query, we’re doing some very powerful things. First, we’re extracting images into the byte array column using the system-provided ImageExtractor, and then loading them into rowsets. Next, we extract all the tags from those images using the built-in ImageTagger. Then we filter the images, finding those that have “cat” or “dog” tags. Using the system-provided EmotionAnalyzer, we next extract the faces and associated emotions from these images, then find all the images that have a human along with a dog or a cat. Finally, we output the distribution of human emotions in those images.
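To make the data flow of that query concrete, here is a minimal Python sketch of the same filter, join and aggregate steps on a tiny in-memory sample. It is not the U-SQL script from Figure 1 and does not call any cognitive function: the file names, tags, emotions and the emotion_distribution helper are all hypothetical stand-ins for the rowsets that the tagger and emotion analyzer would produce.

```python
from collections import Counter

# Hypothetical stand-ins for the rowsets the script produces; the file names,
# tags and emotions below are invented for illustration only.
image_tags = {
    "img1.jpg": ["dog", "person", "grass"],
    "img2.jpg": ["cat", "sofa"],
    "img3.jpg": ["car", "person"],
    "img4.jpg": ["cat", "person"],
}

# One (file name, emotion) row per detected face.
face_emotions = [
    ("img1.jpg", "happiness"),
    ("img3.jpg", "neutral"),
    ("img4.jpg", "happiness"),
    ("img4.jpg", "surprise"),
]


def emotion_distribution(image_tags, face_emotions, animals=("cat", "dog")):
    """Count emotions of faces that appear in images tagged with an animal."""
    # Step 1: keep only images whose tags include one of the animals.
    animal_images = {
        name
        for name, tags in image_tags.items()
        if any(tag in animals for tag in tags)
    }
    # Step 2: join with the face rows; an image without faces adds nothing.
    # Step 3: count how often each emotion occurs in the joined rows.
    return Counter(
        emotion for name, emotion in face_emotions if name in animal_images
    )


distribution = emotion_distribution(image_tags, face_emotions)
print(distribution.most_common())
```

In the sample, img2.jpg has an animal but no face and img3.jpg has a face but no animal, so neither contributes to the result, which mirrors the "human along with a dog or a cat" condition described above.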
To demonstrate the scalability of U-SQL, we executed the same script on the full data set with 1 million images. Within seconds of submitting the script, thousands of containers in ADLA sprang into action to start processing these images, as shown in Figure 2.
[Job execution graph. Labels recovered from the image: UsqlML.dbo.MegaFace; SV1 Extract (1,000 vertices, 3.84 min, 3,940 rows); SV3 Extract (1,000 vertices, 1.10 min, 1,028,061 rows); several aggregate stages; SV6 Cross (2 vertices, 1.99 s, 14 rows); SV7 PodAggregate (1 vertex, 0.20 s, 14 rows).]
You can easily extend this example to get other interesting insights, such as the most frequently occurring pairs of tags or the objects that appear together most often. You can also detect age, gender and landmarks in these images using other cognitive functions. For your reference, the code snippets in Figure 3 show how to use other built-in cognitive functions in U-SQL applications.
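As an illustration of the tag-pair idea, the following Python sketch counts how often two tags appear in the same image. It assumes the tags have already been extracted into one list per image; the top_tag_pairs helper and the sample tags are hypothetical and are not part of the U-SQL cognitive library.

```python
from collections import Counter
from itertools import combinations


def top_tag_pairs(tags_per_image, n=3):
    """Return the n tag pairs that co-occur in the most images."""
    pair_counts = Counter()
    for tags in tags_per_image:
        # Sort and de-duplicate so that ("cat", "sofa") and ("sofa", "cat")
        # are counted as the same pair, and only once per image.
        for pair in combinations(sorted(set(tags)), 2):
            pair_counts[pair] += 1
    return pair_counts.most_common(n)


# Invented sample: one list of tags per image.
sample = [
    ["dog", "person", "grass"],
    ["dog", "person"],
    ["cat", "sofa"],
    ["dog", "grass"],
]

pairs = top_tag_pairs(sample)
print(pairs)
```

At scale the same counting would be expressed as a self-join and a GROUP BY over the tag rowset rather than an in-memory loop.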
Using a Pre-Trained Model
Most traditional machine learning algorithms assume that the data processed to train a model isn’t too large to store in the RAM of one computer. Thus, most of the time users need only a single-box environment to train their models. Furthermore, it’s relatively common to have only a small amount of labeled data on which a model is trained. The R and Python languages have emerged as the industry standard for open source, as well as proprietary, predictive analytics. Together, R and Python provide many capabilities, such as flexibility, rich graphics and statistics-oriented features, along with an ecosystem of freely available packages that accounts for much of their growing popularity. Thus, many developers use R and Python to do single-box predictive analytics.
Figure 2 Job Execution Graph
Once trained, a model is applied to massive data sets that frequently eclipse the size of the training data by orders of magnitude. In the following section, we’ll describe how to take an existing model that was trained in a local R/Python environment and use it to do prediction on a massive amount of data, using the U-SQL extension on ADLA.
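The train-once, score-at-scale pattern can be sketched in plain Python. This is only an illustration under stated assumptions, not the U-SQL extension itself: the "model" is a made-up linear threshold whose parameters are serialized as JSON, and the score_in_batches and batches helpers are hypothetical. The point is that the small serialized model travels to the data, and each worker deserializes it once and scores its own partition batch by batch.

```python
import json
from itertools import islice

# Stand-in for a model trained on a single machine. The parameters are
# invented; in practice they would come from a local training run.
model_json = json.dumps({"weights": [0.5, -1.0], "bias": 0.1})


def predict(model, row):
    """Score one row with a linear threshold model (hypothetical)."""
    score = sum(w * x for w, x in zip(model["weights"], row)) + model["bias"]
    return 1 if score > 0 else 0


def batches(rows, size):
    """Yield lists of at most `size` rows without loading all rows at once."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def score_in_batches(model_json, rows, batch_size=1000):
    """Deserialize the model once, then score the rows batch by batch."""
    model = json.loads(model_json)
    for batch in batches(rows, batch_size):
        for row in batch:
            yield row, predict(model, row)


# A generator stands in for a partition that is too large to hold in memory.
rows = ([x, x % 3] for x in range(10))
predictions = list(score_in_batches(model_json, rows, batch_size=4))
print(predictions[:4])
```

Because scoring each row is independent of every other row, the work parallelizes across partitions with no coordination beyond distributing the serialized model.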