MSDN Magazine, July 2017


MACHINE LEARNING
Cognition at Scale with U-SQL on ADLA
Hiren Patel and Shravan Matthur Narayanamurthy
Companies providing cloud-scale services have an ever-growing need to store and analyze massive data sets, and the analysis that transforms such data into insight is becoming increasingly valuable. Examples include analyzing telemetry from a service to identify the investments that most improve service quality, analyzing usage patterns over time to detect changes in user behavior (engagement/churn), and analyzing sensor data to perform preventative maintenance. All of these are extremely important, as has become very apparent to us while running massive services like Bing and Cosmos. Most of these analyses involve feature engineering and modeling. With storage getting cheaper, the amount of data we can collect is no longer tightly constrained, which means we soon reach the limits of traditional single-node data processing engines and require a distributed processing platform to do machine learning tasks on massive data sets. Furthermore, machine learning models are usually built or used in applications or pipelines that process raw data: deserializing it, filtering out unnecessary rows and columns, extracting features, and transforming them into a form amenable to modeling. To express such operations easily, users need a programming model that offers the degree of compositional freedom typical of declarative languages.
U-SQL is a declarative language that provides the expressibility necessary for advanced analytics tasks such as machine learning, and it operates seamlessly on cloud-scale data. It also offers the following advantages:
• The resemblance of U-SQL to SQL reduces the learning curve for users. It offers easy extensibility with user-defined operators, the ability to reuse existing libraries and the flexibility to choose different languages (C#, Python or R) to develop custom algorithms.
• Users can focus on business logic while the system takes care of data distribution and task parallelism, along with execution plan complexities.
• U-SQL has built-in support for machine learning.
Machine Learning Using U-SQL in ADLA
Building intelligent features into applications requires some form of prediction capability. There are two ways to go:
Build your own model: You first preprocess the telemetry or other raw data into a shape suitable for modeling, then train a model on the preprocessed data and use the trained model for prediction in applications. The Azure Data Lake Analytics (ADLA) engine makes all of this preprocessing possible in an efficient manner. Through R and Python extensions, it lets you build machine learning models that cover a wide variety of scenarios, from regression to image classification, and it also lets you build models using built-in, efficient, massively parallelizable distributed machine learning algorithms. (We’ll discuss how to train a model using U-SQL in a future article.)
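To make the preprocess, train and predict flow concrete, here is a minimal sketch in plain Python. It is not U-SQL and does not use any ADLA API or built-in algorithm. The telemetry records, the field names (latency_ms, errors) and the one-feature least-squares model are all hypothetical, chosen only to show the shape of the three steps on a single node.

```python
# Illustrative only: plain Python, not U-SQL and not an ADLA API.
# The records and field names below are hypothetical.
import json

# Hypothetical raw telemetry, one JSON record per line.
RAW_LINES = [
    '{"latency_ms": 100, "errors": 1, "region": "eu"}',
    '{"latency_ms": 200, "errors": 2, "region": "us"}',
    '{"latency_ms": null, "errors": 2, "region": "us"}',
    '{"latency_ms": 300, "errors": 3, "region": "eu"}',
]


def preprocess(lines):
    """Deserialize, drop unusable rows, and keep only the modeling columns."""
    rows = []
    for line in lines:
        record = json.loads(line)
        # Filter out rows that are missing the feature or the label.
        if record.get("latency_ms") is None or record.get("errors") is None:
            continue
        # Drop unneeded columns (such as "region") by keeping two fields.
        rows.append((float(record["latency_ms"]), float(record["errors"])))
    return rows


def train(rows):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Score one feature value with the trained (slope, intercept) pair."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    model = train(preprocess(RAW_LINES))
    print(predict(model, 400.0))
```

At cloud scale the same three steps apply, but the data no longer fits on one machine, which is where a distributed engine takes over data distribution and parallelism.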
Using a pre-trained model for scoring: Suppose you have a pre-trained model but want to score large amounts of data efficiently. U-SQL can handle this pleasingly parallel task very well. U-SQL allows user-defined operators (UDOs) where you provide
This article discusses:
• Machine learning using U-SQL in ADLA
• Cognition with U-SQL
• Using a pre-trained model
Technologies discussed:
U-SQL, Azure Data Lake Analytics, U-SQL Extension-Cognitive/R/Python