Page 21 - MSDN Magazine, June 2019
Now, I would like to see a graph of blog posts over the entire 15-year span and see if a pattern emerges over a longer period of time. Enter the following code to graph the entire timespan:
plot(postData[,2], xlab="Month Index", ylab="Posts", main="All Posts")
The results, shown in Figure 2, do show a clear trend, if not a well-defined pattern. Blogging activity started out fairly strong but declined steadily, picking up again around 30 months ago. The trend of late is decidedly upward. There’s also the one significant outlier.
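A trend like this can be easier to judge with a smoothed line drawn over the scatter plot. The following is a minimal sketch in base R, not part of the original analysis: it uses made-up monthly counts (the names monthIndex, posts, and trend are hypothetical) because the article's postData file isn't reproduced here, and it assumes a lowess smoother is an acceptable way to summarize the trend.

```r
# Synthetic stand-in for 180 months (15 years) of post counts.
# These numbers are invented for illustration; they are not the article's data.
set.seed(42)
monthIndex <- 1:180
posts <- pmax(0, round(60 - 0.4 * monthIndex +
                       2 * pmax(0, monthIndex - 150) +
                       rnorm(180, sd = 8)))

# Same style of scatter plot as in the article.
plot(monthIndex, posts, xlab = "Month Index", ylab = "Posts", main = "All Posts")

# lowess() returns a list with x and y components describing a smoothed curve.
# f controls the smoother span; 0.3 is an arbitrary choice for this sketch.
trend <- lowess(monthIndex, posts, f = 0.3)

# Overlay the smoothed curve on the existing plot.
lines(trend, col = "red", lwd = 2)
```

With the real data, the same two calls would presumably be lowess(postData[,2]) followed by lines(), but that depends on the column layout of postData.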
Correlation Matrix
Earlier, I noted a correlation between the Posts and PPD columns. R has a built-in function to display a correlation matrix, which is a table displaying correlation coefficients between variables. Each cell in the table shows the correlation between two variables.
A correlation matrix quickly summarizes data and reveals relationships between variables. Coefficients range from -1 to 1. Values close to 1 indicate a strong positive correlation, while those close to 0 indicate little or no linear correlation. Negative values indicate a negative correlation, which grows stronger as the value approaches -1. To view the correlation matrix for the postData DataFrame, it’s first necessary to isolate the numeric fields into their own DataFrame and then call the cor function. Enter the following code into a new cell and execute it:
postsCor <- postData[, c(2, 3, 4)]
cor(postsCor)
The output reveals a near-perfect correlation between Posts and PPD, while Days.In.Month has a slightly negative correlation to PPD.
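To experiment with cor without the original data file, here is a small self-contained sketch. The data frame demo and its values are invented for illustration, and it assumes that PPD means posts per day, computed as Posts divided by Days.In.Month; the article's actual column definitions may differ. It selects columns by name instead of by position, which is one way to keep the code readable if the column order changes.

```r
# Hypothetical stand-in for postData; all values are made up.
set.seed(1)
daysInMonth <- rep(c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31), times = 3)
posts <- rpois(36, lambda = 40)

demo <- data.frame(
  Month = 1:36,
  Posts = posts,
  Days.In.Month = daysInMonth,
  PPD = posts / daysInMonth  # assumed definition: posts per day
)

# Isolate the numeric columns of interest by name, then compute the matrix.
demoCor <- cor(demo[, c("Posts", "Days.In.Month", "PPD")])
print(round(demoCor, 2))

# A single coefficient can be read out by row and column name.
demoCor["Posts", "PPD"]
```

The diagonal of the matrix is always 1, because every variable correlates perfectly with itself, and the matrix is symmetric, so each pair of variables appears twice.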
Wrapping Up
While R’s syntax and approach may differ from traditional programming languages, I find it an elegant solution for data wrangling and mathematical processing. For software engineers serious about building a career in data science, R is an important skill to develop. In this article, I explored some of the fundamentals of the R programming language. I showed how to use built-in functions to load and explore data within DataFrames, to gain insights through statistics, and to plot graphs. In fact, everything in this article was written in what would be referred to as “base” R, as it doesn’t rely on any third-party packages. However, some R users prefer the “tidyverse” suite of packages, which uses a different style. I’ll explore that in an upcoming column.
Frank La Vigne works at Microsoft as an AI Technology Solutions Professional where he helps companies achieve more by getting the most out of their data with analytics and AI. He also co-hosts the DataDriven podcast. He blogs regularly at FranksWorld.com and you can watch him on his YouTube channel, “Frank’s World TV” (FranksWorld.TV).
Thanks to the following technical experts for reviewing this article: Andy Leonard, David Smith
Figure 2 Posts Over 15 Years (plot title: All Posts; x-axis: Month Index; y-axis: Posts)