churn models to predict which customers are most likely to churn and to understand the key factors that lead to churn.
Reducing Hospital Readmission: Reducing readmission rates for various medical conditions (heart attack, pneumonia and coronary artery bypass surgery, to name a few) is important to hospitals. In the United States, hospitals face penalties if the readmission rate is too high. Hospitals leverage predictive models to identify patients who are more susceptible to being readmitted within 30 days. This helps them understand the root causes of 30-day readmissions and work toward addressing them.
This presents an exciting opportunity for database professionals and developers to either work with data scientists or put on a data scientist hat to build predictive models that can help assess credit loan risk, manage customer churn, reduce hospital readmissions and more. Turning all these raw data assets sitting in the database into golden, strategic insights is an exciting prospect for developers.
This article shows how you can do data science and artificial intelligence (AI) with SQL Server. You’ll also learn how to jump-start your journey of using R and Python with SQL Server.
Why Doing Data Science with SQL Server Matters
What does doing data science mean and why does it matter to the database person? Today, most data scientists first figure out how to connect to many data sources (databases included), bring the data out from the database, and use the historical data to train and subsequently test the machine learning models that they’ve built.
A typical approach used by data scientists is to read the data from the database into the client they’re using for building the model. Once the data is read, they combine it with other data sources. For data scientists developing models in R, packages like dplyr are commonly used for performing aggregations, joins and filtering. With the data transformed into the right shape, data scientists continue the data modeling process and start doing feature engineering. As part of feature engineering, new features (such as new columns in a SQL Server table) might get created, and existing features might get transformed (scaling to -1 to 1 or 0 to 1, applying a logarithmic transformation, computing the z-score, binning the data and so on) or removed. Feature engineering plays a very important role in laying the groundwork needed for a good predictive model. Once all these steps are completed, the data scientist develops the models and validates them using test data, before figuring out an operationalization plan for deploying the models to production so that applications can consume them.
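It’s worth noting that many of these transformations can be expressed directly in T-SQL, a point that matters for the argument that follows. Here’s a minimal sketch, assuming a hypothetical dbo.CustomerFeatures table and columns, that computes a z-score for one numeric feature and bins another into quartiles using window functions:

-- Hypothetical table: dbo.CustomerFeatures(CustomerID, TenureMonths, TotalSpend)
-- z-score one feature and bin another into quartiles, entirely in the database.
SELECT
    CustomerID,
    (TenureMonths - AVG(TenureMonths * 1.0) OVER ())
        / NULLIF(STDEV(TenureMonths) OVER (), 0) AS TenureZScore,
    NTILE(4) OVER (ORDER BY TotalSpend)          AS SpendQuartile
FROM dbo.CustomerFeatures;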
At this point, as a database person, you might ask, “Why do I need to move the data out of the database to do data science? Shouldn’t we push the processing of joins and aggregations (SUM, MIN, MAX and so on) into the database?”
Why does it make sense to do this in the database? First, data movement is expensive. If the machine learning models can run where the data is stored, there’s no need to move data between the database and the client application. Second, when a new working copy of the data is extracted from the database and stored external to it, many of the security policies and audits that apply to data stored in the database can no longer be enforced. Third, if the computation of joins and aggregations can be done where the data is located, you can leverage decades of database innovations (clustered and non-clustered indexes, in-memory tables, columnstore indexes, high availability and so on). If training the model can be done where the data is stored, it can lead to performance improvements.
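As a sketch of what this push-down looks like (the table and column names here are made up for illustration), the following query performs the join and aggregation inside SQL Server, so only the summarized result crosses the wire:

-- Hypothetical tables: dbo.Customers(CustomerID, Segment),
-- dbo.Transactions(CustomerID, Amount, TransactionDate).
-- The join and aggregation run in the database; the client
-- receives only one summary row per customer.
SELECT
    c.CustomerID,
    c.Segment,
    COUNT(*)      AS TransactionCount,
    SUM(t.Amount) AS TotalSpend,
    MIN(t.Amount) AS MinSpend,
    MAX(t.Amount) AS MaxSpend
FROM dbo.Customers c
JOIN dbo.Transactions t ON t.CustomerID = c.CustomerID
GROUP BY c.CustomerID, c.Segment;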
In addition, if the data science project involves working with spatial data, temporal data or semi-structured data, you can leverage SQL Server capabilities that let you handle each of these efficiently. For example, if you’re working on a data science project (say, a land-use classification problem) where you must manipulate spatial data, the geography and geometry data types in SQL Server provide a good way to store it. Once it’s stored as spatial data in SQL Server, you can leverage SQL Server spatial functions to query for nearest neighbors, compute the spatial distance using different spatial reference systems and more. The spatial data can be indexed to facilitate efficient query processing.
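A nearest-neighbor lookup over the geography type might look like the following sketch. The geography::Point constructor and STDistance method are built in; the dbo.Landmarks table and the sample coordinates are hypothetical:

-- Hypothetical table: dbo.Landmarks(Name, Location geography).
-- Find the five landmarks nearest a given point (SRID 4326 = WGS 84).
DECLARE @p geography = geography::Point(47.6062, -122.3321, 4326);

SELECT TOP (5)
    Name,
    Location.STDistance(@p) AS DistanceInMeters
FROM dbo.Landmarks
WHERE Location.STDistance(@p) IS NOT NULL  -- pattern that enables spatial index use
ORDER BY Location.STDistance(@p);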
As a database professional and developer, you have tremendous knowledge and value to bring to a data science project. Doing data science and AI where the data resides brings many benefits, including the enterprise-grade performance, scale, security and reliability that you’ve come to expect from SQL Server over the years. More important, you eliminate the need for expensive data movement.
Figure 1 illustrates the difference between doing data science and AI outside of the database and doing it where the data is stored. From the application perspective, a developer doesn’t need to learn new methods to tap the power of AI teeming in SQL Server. The application connects the same way it connects to a database today and invokes SQL Server stored procedures that encapsulate the R or Python code. The stored procedure has just become an intelligent AI stored procedure.
Another important consideration in data science projects is operationalization. The predictive model that has been developed by data scientists needs to be deployed into a production environment for it to be used by applications. With the release of SQL Server 2016 R Services, you can wrap R code in stored procedures. After training is done, the predictive models are stored as varbinary(max) in a database table.
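A minimal sketch of that pattern follows. The sp_execute_external_script mechanism and its parameters are the actual SQL Server feature; the dbo.Models and dbo.ChurnData tables, their columns and the model formula are hypothetical:

-- Hypothetical table to hold serialized models.
CREATE TABLE dbo.Models (
    ModelName sysname PRIMARY KEY,
    Model     varbinary(max) NOT NULL
);
GO
CREATE PROCEDURE dbo.TrainChurnModel
AS
BEGIN
    DECLARE @model varbinary(max);

    -- The R script runs next to the data; only the serialized model
    -- travels back, via an output parameter. Churned is assumed 0/1.
    EXEC sp_execute_external_script
        @language = N'R',
        @script = N'
fit <- glm(Churned ~ TenureMonths + MonthlyCharges,
           data = InputDataSet, family = binomial)
model <- serialize(fit, connection = NULL)',
        @input_data_1 = N'SELECT Churned, TenureMonths, MonthlyCharges
                          FROM dbo.ChurnData',
        @params = N'@model varbinary(max) OUTPUT',
        @model = @model OUTPUT;

    -- Persist the trained model as varbinary(max).
    DELETE dbo.Models WHERE ModelName = N'ChurnModel';
    INSERT dbo.Models (ModelName, Model) VALUES (N'ChurnModel', @model);
END;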
An application (.NET, Java, Node.js and more) would connect to SQL Server and invoke the stored procedures to use the predictive model for making predictions on new instances of data. Continuing the momentum, SQL Server 2017 CTP2 added Python support. You now have the best of multiple worlds: the ability to write code in R or Python, leverage the rich set of R and Python libraries for machine learning and deep learning, and consume the predictive models in any application.
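A scoring sketch in the same spirit, this time using the Python support added in SQL Server 2017 CTP2. Again, the table, column and model names are hypothetical, and it assumes dbo.Models also holds a scikit-learn classifier trained and pickled by a corresponding Python training script (Python can’t deserialize an R model):

-- Hypothetical scoring procedure using in-database Python.
CREATE PROCEDURE dbo.PredictChurn
AS
BEGIN
    DECLARE @model varbinary(max) =
        (SELECT Model FROM dbo.Models WHERE ModelName = N'ChurnModelPy');

    EXEC sp_execute_external_script
        @language = N'Python',
        @script = N'
import pickle
clf = pickle.loads(model)  # varbinary arrives as Python bytes
probs = clf.predict_proba(InputDataSet[["TenureMonths", "MonthlyCharges"]])[:, 1]
InputDataSet["ChurnProbability"] = probs
OutputDataSet = InputDataSet[["CustomerID", "ChurnProbability"]]',
        @input_data_1 = N'SELECT CustomerID, TenureMonths, MonthlyCharges
                          FROM dbo.ChurnData',
        @params = N'@model varbinary(max)',
        @model = @model
    WITH RESULT SETS ((CustomerID int, ChurnProbability float));
END;

From the application’s perspective, scoring is then just EXEC dbo.PredictChurn over the same SQL Server connection it already uses; no client-side machine learning runtime is required.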
[Figure 1: Doing Data Science and AI Where the Data Is Stored. The diagram contrasts the two architectures: an application plus AI outside the database (regular DB + app) vs. AI in the database (AI DB + app).]