Page 18 - MSDN Magazine, June 2019
Artificially Intelligent
Exploring Data with R
FRANK LA VIGNE
Since the very first Artificially Intelligent column, all the code samples I’ve provided have been in Python. That’s because Python currently reigns as the language of data science and AI. But it’s not alone—languages like Scala and R hold a place of prominence in this field. For developers wondering why they must learn yet another programming language, R has unique aspects that I’ve not encountered elsewhere in a career that’s spanned Java, C#, Visual Basic, Python and Perl. With R being a language that readers are likely to encounter in the data science field, I think it’s worth exploring here.
R itself is an implementation of the S programming language, which was created in the 1970s for statistical processing at Bell Labs. S was designed to provide an interactive experience for developers who at the time worked with Fortran for statistical processing. While we take interactive programming environments for granted today, it was revolutionary at the time.
R was conceived in 1992 by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and derives its name from the shared first initial of its creators’ names, while also playing on the name of S. Version 1.0.0 of R was released in 2000, and the language has since enjoyed wide adoption in research departments, thanks in part to its wide array of built-in statistical algorithms. It’s also easily extensible via functions and extension packages.
A robust developer community has emerged around R, with the most popular repository for R packages being the Comprehensive R Archive Network (CRAN). CRAN has various packages that cover anything from Bayesian Accrual Prediction to Spectral Processing for High Resolution Flow Infusion Mass Spectrometry. A complete list of R packages available in CRAN is online at bit.ly/2DGjuEJ. Suffice it to say that R and CRAN provide robust tools for any data science or scientific research project.
Getting Started with R
Perhaps the fastest way to run R code is through a Jupyter Notebook on the Azure Notebook service. For details on Jupyter Notebooks, refer to my February 2018 article on the topic at msdn.com/magazine/mt829269. However, this time make sure to choose R as the language when creating a new notebook. The R logo should appear on the top right of the browser window. In a blank cell, enter the following code and execute it:
# My first R code
print("hello world")
x <- 3.14
y = 1.21
x
y
The output should read the traditional “hello world” greeting, as well as the values 3.14 and 1.21. None of this should come as novel or unique to any software developer. Note that R’s assignment operator can be “<-” as well as the equals sign familiar from other languages. For simple top-level assignments like these, the two are interchangeable. Also take note that the # character introduces a comment and applies to the rest of the line.
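The two operators do part ways inside a function call. The following is a minimal sketch of that difference, not code from the original column; the variable name v is chosen purely for illustration:

```r
x <- 3.14                 # top-level assignment, as in the first cell

# Inside a call, `=` only names the argument; the variable x is untouched.
mean(x = c(1, 2, 3))      # returns 2
x                         # still 3.14

# `<-` inside a call assigns in the calling environment, then passes the value.
mean(v <- c(1, 2, 3))     # returns 2 and creates v
v                         # 1 2 3
```

This is one reason R code conventionally uses “<-” for assignment and reserves the equals sign for naming function arguments.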
Vectors are one-dimensional arrays that can hold numeric data, character data or logical data. They’re created with the c function. The c stands for “combine.” Enter the following into a new cell and execute it:
num_vec <- c(1,2,3.14) # numeric vector
char_vec <- c("blog","podcast","livestream") # character vector
bool_vec <- c(TRUE,TRUE,FALSE) # logical vector
# print out values
num_vec
char_vec
bool_vec
The values displayed should match the values set in the code. You may now be wondering if vectors can contain mixed types. Enter the following code into a new cell:
mix_vec <- c(1,"lorem ispum",FALSE)
mix_vec
While the code does run, sharp-eyed readers will notice single quotes around each element in the vector. This indicates that the values were converted to character values: a vector can hold only one type, so R converts every element to a type that can represent them all. R has the typeof function to check the type of any given variable. Enter the following code to inspect the vectors already created:
typeof(num_vec)
typeof(char_vec)
typeof(bool_vec)
typeof(mix_vec)
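The conversion seen in mix_vec follows a fixed order, from logical to integer to double to character. The sketch below illustrates that order; the variable names are hypothetical and not part of the original column:

```r
# c() converts every element to the most general type present,
# in the order logical -> integer -> double -> character.
only_lgl <- c(TRUE, FALSE)
with_int <- c(TRUE, 2L)
with_dbl <- c(TRUE, 2L, 3.14)
with_chr <- c(TRUE, 2L, 3.14, "a")

typeof(only_lgl)   # "logical"
typeof(with_int)   # "integer"
typeof(with_dbl)   # "double"
typeof(with_chr)   # "character"

# When combined with numbers, TRUE becomes 1 and FALSE becomes 0.
lgl_as_num <- c(TRUE, FALSE, 3.14)
lgl_as_num
```

By the same rule, typeof(mix_vec) reports "character", because the string element forces the number and the logical value to be stored as text.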
One other useful function to know is ls, which displays all the objects in the current working environment. Enter “ls()” into a new cell, execute it, and observe that the output contains the four vectors just defined, along with the x and y variables defined in the first cell.
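As a small illustration of how ls behaves, the sketch below uses a separate environment as a stand-in for the notebook's workspace, so that it doesn't disturb any existing variables. The name demo_env and the use of rm are assumptions added for the example, not part of the original column:

```r
# A scratch environment stands in for the working environment.
demo_env <- new.env()
assign("x", 3.14, envir = demo_env)
assign("num_vec", c(1, 2, 3.14), envir = demo_env)

ls(demo_env)                 # "num_vec" "x" (names come back sorted)

rm("x", envir = demo_env)    # remove one object by name
ls(demo_env)                 # "num_vec"
```

Called with no arguments, as in the text, ls() and rm() operate on the current environment instead.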
Working with Data
The best way to experience the true power and elegance of the R language is by using it to explore and manipulate data. R makes it easy to load datasets and quickly get an understanding of their dimensions, structure and statistical properties. For the next few examples, I’ll use a dataset that’s near and dear to me: basic statistics on my blogging activity. I’ve run and maintained a technology blog
Code download available at bit.ly/2vwC0L0.