Chapter 2 Tidyverse
This statement from the official tidyverse website describes what the tidyverse is:
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
Further background information on what the tidyverse is can be found in the R Views blog: What is the tidyverse?. In addition, Hadley Wickham has laid out his principles for tidyverse packages within his tidy tools manifesto.
At this point, you should watch this video to get an introduction to data wrangling with the tidyverse.
What is data wrangling? Intro, Motivation, Outline, Setup – Pt. 1 Data Wrangling Introduction
2.1 Installation
The package can be installed with install.packages("tidyverse") and loaded with library(tidyverse).
This loads the core set of tidyverse packages – see here for more details.
2.2 Getting Data into R
The tidyverse installs a number of packages that help with getting your data into R. However, only the readr package is loaded as part of the core set of packages, so the other data import packages (readxl and haven) have to be loaded with their own library() calls.
- readr: for reading in data formats such as csv, tsv and fwf.
- readxl: for reading in Excel data formats (.xls and .xlsx).
- haven: for reading in SPSS, Stata and SAS data formats.
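As a minimal sketch of the import step, assuming readr is installed, the example below writes a small made-up CSV to a temporary file and reads it back. The file contents and the commented-out paths are hypothetical and only there for illustration.

```r
library(readr)

# Write a small CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,Ann,90.5", "2,Bob,82"), csv_path)

# read_csv() guesses the column types and returns a tibble
scores <- read_csv(csv_path, show_col_types = FALSE)
scores

# readxl and haven are not core packages, so they need their own library() calls:
# library(readxl); read_excel("path/to/file.xlsx")  # hypothetical path
# library(haven);  read_sav("path/to/file.sav")     # hypothetical path
```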
RStudio also provides an ‘Import Dataset’ button to help with this. The RStudio support pages describe how to import data with RStudio. If you decide to use this to help you import data into R, my recommendation is to copy over the code generated in the ‘Code Preview’ box into your R script. This helps you to keep a record of the data import step in your script.
Besides these data formats, you should familiarise yourself with R’s own data formats: .rds and .RData. The .rds format can be used when saving a single object or dataset, while .RData should only be used when you want to save multiple objects or datasets from your R session. These formats load quickly into R, and an important advantage is that they keep data exactly as they were when they were created in R (i.e. all information about variable types and metadata is retained after saving).
- .rds: read in data with readRDS(); save data with saveRDS()
- .RData: read in data with load(); save data with save()
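A short round trip through both formats can be sketched with base R alone. The temporary file paths and the objects x and y are illustrative choices, not part of the original text.

```r
# .rds: one object per file; readRDS() returns it so you choose the name
rds_path <- tempfile(fileext = ".rds")
saveRDS(iris, rds_path)
iris_copy <- readRDS(rds_path)
identical(iris, iris_copy)  # column types and factor levels are preserved

# .RData: several objects per file; load() restores them under their original names
rdata_path <- tempfile(fileext = ".RData")
x <- 1:3
y <- letters[1:3]
save(x, y, file = rdata_path)
rm(x, y)
load(rdata_path)
```

Note that load() silently overwrites any existing objects with the same names, which is one reason to prefer .rds for single datasets.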
2.3 Data Manipulation with dplyr
dplyr provides a set of functions (often described as verbs) to help with data manipulation. Most common tasks can be done using only a handful of functions: select(), filter(), mutate(), rename(), arrange(), summarise() and group_by() – note that rename() is not usually included in this list but I have included it as renaming variables comes up a lot during data cleaning/processing. Out of these functions, group_by() is the most complex and should probably be tackled after understanding the others; in particular, it is often used together with summarise(). Furthermore, two functions that often come in handy when used together with mutate() are ifelse() (or the stricter dplyr equivalent if_else()) and case_when() – these help to create new variables depending on certain conditions being met.
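To make the verbs concrete, here is a minimal sketch that applies each one on its own to the built-in mtcars dataset, assuming dplyr is installed. The cut-off values and new variable names are arbitrary and chosen only for illustration.

```r
library(dplyr)

# select(): keep only some columns
cars <- select(mtcars, mpg, cyl, wt)

# rename(): new_name = old_name
cars <- rename(cars, weight = wt)

# filter(): keep rows that meet a condition
heavy <- filter(cars, weight > 3.5)

# mutate(): create new variables, here with case_when() and if_else()
cars <- mutate(
  cars,
  efficiency = case_when(
    mpg >= 25 ~ "high",
    mpg >= 18 ~ "medium",
    TRUE      ~ "low"
  ),
  big_engine = if_else(cyl >= 8, "yes", "no")
)

# arrange(): sort rows, here from highest to lowest mpg
cars <- arrange(cars, desc(mpg))
head(cars)
```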
It is a good idea to understand what each of these functions does on its own before thinking about how to combine multiple manipulation steps using pipes (%>%). Once you are comfortable with each of these functions, it should be a relatively small step to learn how to combine them with the pipe operator; the ability to do this in a relatively intuitive way is one of the most powerful design features of the tidyverse. This is more or less the learning order used in the dplyr chapter of the R Programming for Data Science book, which is worth a read through (there is also an accompanying video for that chapter which may help). I would definitely recommend spending a good amount of time learning dplyr, as you will want to reach the point where you can use these functions without much thought.
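As a sketch of how the pipe chains these steps together, the example below filters, groups, summarises and sorts mtcars in one pipeline. It assumes dplyr is installed, and the weight threshold is an arbitrary illustrative choice.

```r
library(dplyr)

mpg_by_cyl <- mtcars %>%
  filter(wt < 5) %>%                          # drop the heaviest cars
  group_by(cyl) %>%                           # one group per cylinder count
  summarise(n = n(), mean_mpg = mean(mpg)) %>% # one summary row per group
  arrange(desc(mean_mpg))                     # highest average mpg first

mpg_by_cyl
```

Each line passes its result to the first argument of the next function, so the pipeline reads top to bottom in the order the steps are carried out.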
A good way to get started on learning about these functions is to watch these videos.
Data Manipulation Tools: dplyr – Pt 3 Intro to the Grammar of Data Manipulation with R
Hands-on dplyr tutorial for faster data manipulation in R – note that this tutorial is a little old (2014) so some parts of it are out of date, but I think it is still useful. Only minor changes are needed to make the code current, and most of the old dplyr functions mentioned should still be usable. For example, the video uses tbl_df(), which has since been deprecated in favour of as_tibble(). You should try running the R commands yourself while watching this video to aid the learning process.
Once you have watched these videos, you can work through all the tutorials in the Work with Data section on RStudio Cloud. Then you can further consolidate this knowledge by reading through the data transformation chapter of the R for Data Science book and working through the exercises at the end. After this, a set of four tutorials by Suzan Baert provides much more depth on what some of the basic dplyr functions can do. Again, running the R code yourself while following along with the tutorials will be very beneficial.
Data Wrangling Part 1: Basic to Advanced Ways to Select Columns
Data Wrangling Part 2: Transforming your columns into the right shape
Data Wrangling Part 3: Basic and more advanced ways to filter rows
Data Wrangling Part 4: Summarizing and slicing your data
2.3.1 Working with Multiple Datasets
It is very likely that you will need to perform data linkage at some point so this makes it more or less essential to learn how to work with the various joining functions within dplyr. To get a quick overview on this, you can watch this short video.
Working with Two Datasets: Binds, Set Operations, and Joins – Pt 4 Intro to Data Manipulation
These neat animations then offer a great illustration of how these joins work and a cheatsheet by stat545 summarises this further. After this, I recommend reading three short sections of the relational data chapter in the R for Data Science book: Understanding joins, Inner join, Outer joins.
Then you can follow this video tutorial – the parts about joins start from around 25 minutes into the video but if you want to further solidify your dplyr knowledge, you can watch it from the beginning. Again, note that this video is a little old (2015) so some aspects may be out-of-date but it is still useful nevertheless.
Going deeper with dplyr: New features in 0.3 and 0.4 (tutorial)
There is also a Join Data Sets tutorial on RStudio Cloud that you can try. Finally, you will want to read and work through all of the relational data chapter in the R for Data Science book to understand more thoroughly how to work with multiple related datasets. By the end, you should be familiar with just about everything on this Data Transformation Cheat Sheet.
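The main join types can be compared on two tiny made-up tables. This is only a sketch, assuming dplyr is installed; the patients and visits tables and their columns are hypothetical.

```r
library(dplyr)

patients <- tibble(id = c(1, 2, 3), name = c("Ann", "Bob", "Cat"))
visits   <- tibble(id = c(1, 1, 3, 4), visit_year = c(2019, 2020, 2020, 2021))

# inner_join(): only ids present in both tables
inner <- inner_join(patients, visits, by = "id")

# left_join(): every patient, with NA where there is no matching visit
left <- left_join(patients, visits, by = "id")

# full_join(): every id from either table
full <- full_join(patients, visits, by = "id")

# anti_join(): patients with no visit at all (a filtering join)
unmatched <- anti_join(patients, visits, by = "id")
```

Because patient 1 has two visits, the joined tables contain two rows for that patient, which is worth checking for whenever the key is not unique in both tables.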
2.4 Reshaping Data with tidyr
Sometimes you will need to format your data into a specific shape to make it work for a certain purpose – tidyr is a package that has functions that help with this. For example, for output tables, you may want to have data on different years in separate columns while for analytical operations within R, it may be easier to work with ‘year’ stored in only one column (or variable); the latter format here can be considered as tidy data. In particular, the ggplot2 package often requires data to be in the long or tidy format. To get a better idea on this, watch this video.
Tidy Data and tidyr – Pt 2 Intro to Data Wrangling with R and the Tidyverse – however, please note that the gather() and spread() functions mentioned have been superseded by the pivoting functions pivot_longer() and pivot_wider(), so you should mainly be concerned with the concepts here rather than learning about these functions specifically.
Then you can try the Reshape Data tutorial on RStudio Cloud. The main thing you need to be able to do with tidyr is to be confident in knowing how to transform your data from wide-to-long format or from long-to-wide format. However, if you want to know more about what tidyr can do or find out more about tidy data in general, please refer to the tidy data chapter in the R for Data Science book (keeping in mind the caveat made previously about the change to using pivoting functions).
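The wide-to-long and long-to-wide conversions can be sketched with the pivoting functions as follows, assuming tidyr (version 1.0.0 or later) is installed. The small table of yearly values is made up for illustration.

```r
library(tidyr)

# Wide format: one column per year
wide <- data.frame(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(11, 22),
  check.names = FALSE  # keep the year column names as they are
)

# Wide to long: the year columns become a single 'year' variable
long <- pivot_longer(
  wide,
  cols = c(`2019`, `2020`),
  names_to = "year",
  values_to = "value"
)

# Long to wide: spread 'year' back out into separate columns
wide_again <- pivot_wider(long, names_from = year, values_from = value)
```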
2.5 Other Data Manipulation Tips
As you progress further with learning to manipulate data using the tidyverse, you will inevitably encounter situations where you need to know more about how to deal with specific types of data such as dates/times and string variables. In addition, there will be times when you wish there was a convenient and quick way to process data – luckily, in R there usually is. Please note that some of the material mentioned in this section is probably optional until you specifically have a need for it, but I recommend that you go through some of it anyway to build an awareness of where to look for solutions when you need them.
2.5.1 Dates/Times
The main package within the tidyverse for dealing with dates and times is the lubridate package. This is not loaded with the core set of tidyverse packages so has to be loaded separately by typing library(lubridate). The main reference text for learning about how to work with dates/times using lubridate is the dates and times chapter in R for Data Science. Dates/times can sometimes be very complex (e.g. when thinking about time calculations across timezones) but it is likely that you will usually only need to perform fairly simple data processing operations involving dates. In this case, you will only really need to know a small number of functions from lubridate:
- dmy(), mdy(), ymd() to convert text to a Date format in R, e.g. ymd("2001/05/20") would convert the text “2001/05/20” so that R can recognise this as 20 May 2001.
- year(), month(), day() to extract the year, month, day from a date object, e.g. year(ymd("2001/05/20")) would extract 2001.
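These functions can be tried out with the same example date, assuming lubridate is installed. The alternative text layouts passed to dmy() and mdy() are added here for illustration.

```r
library(lubridate)

# The parsing function is named after the order of the components in the text
d <- ymd("2001/05/20")
d_from_dmy <- dmy("20/05/2001")
d_from_mdy <- mdy("05/20/2001")

# All three give the same Date
d == d_from_dmy
d == d_from_mdy

# Extract components from the date
year(d)
month(d)
day(d)
```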
2.5.2 Strings
The stringr package is loaded within the core set of tidyverse packages and provides a set of consistent functions for working with string (or text) data. Again, the main reference is the strings chapter in R for Data Science. You will likely not need to know everything contained in that chapter but I would recommend learning some basics of using regular expressions. These can help to, for instance, filter data or create new variables according to some text pattern found in a string variable. This can sometimes be much quicker than filtering using a list of keywords, particularly if your data may contain variations on a common item (e.g. ‘Male’, ‘M’, ‘Man’) or misspellings (e.g. ‘Mal’).
A useful place to practise and learn how to use regular expressions is regexr. There is also an RStudio addin called regexplain, which was inspired by regexr and has the advantage that it can be used within RStudio. Once you gain some familiarity with regular expressions, many of the stringr functions should be relatively easy to understand and use. In particular, str_detect() is useful for filtering and creating variables.
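As a sketch of the ‘Male’, ‘M’, ‘Man’, ‘Mal’ case, the example below uses str_detect() with one possible regular expression, assuming stringr is installed. The vector of values and the pattern itself are illustrative, and real data would probably need a pattern tailored to the variations actually present.

```r
library(stringr)

sex <- c("Male", "M", "Man", "Mal", "Female", "F", "woman")

# ^ and $ anchor the whole string; the pattern accepts m, mal, male or man,
# ignoring case, so 'Female' and 'woman' are not matched
male_pattern <- regex("^m(ale?|an)?$", ignore_case = TRUE)
is_male <- str_detect(sex, male_pattern)

# The logical vector can be used to filter or to build a cleaned variable
sex_clean <- ifelse(is_male, "Male", "Other")
```

The same call works inside filter() or mutate() when sex is a column of a data frame.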
If you are already familiar with regular expressions using base R functions and want to find out about their stringr equivalents to take advantage of the consistencies in their design, the from base R to stringr vignette is a handy resource. Furthermore, this blog post on Demystifying Regular Expressions in R gives a very clear and descriptive guide to how the different base R regular expression functions work.
2.5.3 janitor
The janitor package is an R package that has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimised for user-friendliness.
A really neat function from janitor is clean_names(). This function has various options for cleaning variable names but the defaults usually work quite well – they transform variable names into lower case and snake case. We can use the iris dataset to see how this looks (note that :: can be used before a function name to specify the package that the function comes from).
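A minimal sketch of that comparison, assuming the janitor package is installed; the first call prints the original variable names and the second prints the cleaned ones:

```r
# Original variable names of the built-in iris dataset
names(iris)

# Names after cleaning with the default options of clean_names()
names(janitor::clean_names(iris))
```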
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
## [1] "sepal_length" "sepal_width" "petal_length" "petal_width" "species"
This variable name cleaning can also be applied at the same time as reading in a data file. For example, when reading in an Excel file, you can use something like readxl::read_excel("excel_file", .name_repair = janitor::make_clean_names).
Another function from janitor that is quite useful is tabyl(). This creates tables of counts, including cross-tabulations, and outputs them in dataframe format – this is the main benefit over using something like table() from base R, as dataframes are often easier to manipulate, particularly with the help of the myriad tidyverse functions.
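A minimal sketch of a cross-tabulation, assuming janitor is installed. It counts the cars in the built-in mtcars dataset by number of cylinders (rows) and number of gears (columns); the choice of dataset and variables is an assumption made for illustration.

```r
# Cross-tabulate cylinders against gears; the result is a data frame
cyl_by_gear <- janitor::tabyl(mtcars, cyl, gear)
cyl_by_gear
```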
## cyl 3 4 5
## 4 1 8 2
## 6 2 4 1
## 8 12 0 2