MSDN Magazine, October 2019


Artificially Intelligent
Exploring the tidyverse
FRANK LA VIGNE
In my last article (msdn.com/magazine/mt833459), I explored the fundamentals of the R programming language, which is widely used in the data science space. At the end of that article, I pointed out that all the code written for it was in “base R.” While base R is capable of loading, exploring and visualizing data, it’s not the only way to perform data analysis in R. I also briefly mentioned the tidyverse (tidyverse.org), a collection of packages for R that align to common design principles and are designed to work together seamlessly. Package developers who would like to add to the tidyverse must adhere to the tidyverse style guide (style.tidyverse.org). This enables a consistent experience for developers and ease of interoperability between packages.
Tidyverse packages are designed to simplify and streamline the data science process of load, prep, train and visualize by providing a more consistent development experience across the various libraries. A good analogy would be how jQuery simplified Web development by creating a more consistent programming surface across the DOM, event handling and more. While not a language per se, jQuery made JavaScript developers more productive by reducing the friction in their most common tasks.
The tidyverse libraries are open source and available on GitHub (github.com/tidyverse). The core tidyverse modules include packages needed for everyday data analyses and exploration. As of tidyverse 1.2.0, the following packages are included in the core distribution: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr and forcats. Dozens of other useful packages are also included in the tidyverse, but aren’t loaded automatically with library(tidyverse). See tidyverse.org/packages for details. This article will explore the basics of how to load, filter and visualize data the “tidyverse way.”
Recall that last month’s column used post count data from my blog as a sample dataset. This dataset is simple, with enough variation to demonstrate the power and ease of tidyverse packages. It also helps to use the same sample dataset to facilitate comparisons between base R and tidyverse R.
Loading Data with readr
The readr package provides a fast and easy way to read rectangular data files, such as .csv files. It can flexibly parse many types of data files, while handling errors robustly. To get started, create a new R language Jupyter Notebook. For details on Jupyter Notebooks, refer to my February 2018 article on the topic at msdn.com/magazine/mt829269. In the first blank cell, enter the following code to load the .csv file data and display it:
library(readr)
fwposts <- read_csv("franksworldposts.csv")
fwposts
Note that above the tabular output with the contents of the .csv file is text that shows how each column was parsed, and that the output is a tibble with 183 rows and four columns. Base R uses data frames to store tabular data. In the tidyverse, a tibble is the equivalent structure. In fact, tibbles are data frames, but they modify some default data frame behaviors to meet the needs of modern data analytics. More information about tibbles can be found at tibble.tidyverse.org and in-depth documentation resides at r4ds.had.co.nz/tibbles.html.
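To make the relationship between tibbles and data frames concrete, here is a minimal sketch. The `posts` tibble and its values are hypothetical stand-ins, not the blog dataset, and the subsetting comparison shows just one of the default behaviors that tibbles change.

```r
library(tibble)

# Hypothetical sample values, not the real blog post counts.
posts <- tibble(
  Month = c("Jan-19", "Feb-19"),
  Posts = c(12L, 9L)
)

# A tibble inherits from data.frame, so base R functions still accept it.
is.data.frame(posts)
class(posts)

# One behavioral difference: selecting a single column with [ , ].
df <- as.data.frame(posts)
class(df[, "Posts"])     # base data frame simplifies to a plain vector
class(posts[, "Posts"])  # tibble keeps returning a tibble
```

Because a tibble never silently simplifies to a vector here, code that subsets columns tends to behave more predictably than with a base data frame.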
Look at the column specification message that read_csv printed:
Parsed with column specification:
cols(
  Month = col_character(),
  Posts = col_integer(),
  `Days in Month` = col_integer(),
  PPD = col_double()
)
While the read_csv function did properly load and parse the data, it didn’t automatically detect that the Month column was a date field and labeled it a character field instead. I would like the schema to record Month as a date. To do this, I need to pass along a col_types parameter to the read_csv function that explicitly defines the column schema. As readr correctly guessed all the column data types except one, I can use the existing schema as a guide.
To see the current schema or specification of the tibble, enter the following code into a blank cell and execute it:
spec(fwposts)
Enter the following code, which takes the original inferred schema and adjusts how the Month column is parsed. The “%b-%y” format string matches the format of the column with the three-letter month abbreviation and a two-digit year separated by a dash, like so:
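As a self-contained illustration of this kind of col_types override, the sketch below writes a few hypothetical rows to a temporary file and reads them back with Month declared as a date. The file contents, the `csv_path` variable and the `sample_posts` name are assumptions made for the example, not the article's dataset or listing.

```r
library(readr)

# Hypothetical sample rows that mimic the column layout described above;
# these are not the real blog post counts.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Month,Posts,Days in Month,PPD",
  "Jan-19,31,31,1.0",
  "Feb-19,14,28,0.5"
), csv_path)

# Same column types readr guessed, except Month is now parsed as a date.
# "%b" is the abbreviated month name and "%y" is the two-digit year.
sample_posts <- read_csv(
  csv_path,
  col_types = cols(
    Month = col_date(format = "%b-%y"),
    Posts = col_integer(),
    `Days in Month` = col_integer(),
    PPD = col_double()
  )
)

# Month is now a Date column; with no day in the source text,
# readr uses the first day of each month.
sample_posts$Month
```

Because every column is listed explicitly in col_types, readr skips type guessing for this file.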
