Introduction To Tidyverse

Library Download

#install.packages("palmerpenguins")

Load Library

Loading the packages required for subsequent analyses.

library(palmerpenguins)

Tidyverse

The Tidyverse is a suite of integrated packages designed to work together to make common data science operations more user friendly. The packages have functions for data wrangling, tidying, reading/writing, parsing, and visualizing, among others. There is a freely available book, R for Data Science, with detailed descriptions and practical examples of the tools available and how they work together. We will explore the basic syntax for working with these packages, as well as specific functions for data wrangling with the ‘dplyr’ package and data visualization with the ‘ggplot2’ package.

RStudio 4-pane layout with .R file open

Today’s Lesson

Questions

How do I access my data in R?
How do I visualize my data with ggplot2?
How do I subset my data with dplyr?

Objectives

To be able to use ggplot2 to generate publication quality graphics.
To understand the basic grammar of graphics, including the aesthetics and geometry layers, adding statistics, transforming scales, and coloring or paneling by groups.
To be able to subset data using dplyr.

Keypoints

Read data into R
Use ggplot2 to create different types of plots
Use dplyr to subset data

About the dataset

Palmer penguins. Data were collected on 344 penguins living on three islands (Torgersen, Biscoe, and Dream) in the Palmer Archipelago, Antarctica by Dr. Kristen Gorman. In addition to which island each penguin lived on, the data contains information on the species of the penguin (Adelie, Chinstrap, and Gentoo), its bill length, bill depth, flipper length, its body mass, and sex of the penguin (male or female).

The palmerpenguins package contains two datasets: penguins and penguins_raw. We will be using penguins.

data(package = 'palmerpenguins')

head(penguins)

## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>

?penguins

penguins <- penguins

You should see an object called “penguins” in your environment: a data set with 344 observations and 8 variables. In the column names, mm stands for millimeters and g for grams.

The dataset contains the following fields:

species: penguin species
island: island of observation
bill_length_mm: bill length in millimetres
bill_depth_mm: bill depth in millimetres
flipper_length_mm: flipper length in millimetres
body_mass_g: body mass in grams
sex: penguin sex
year: year of observation
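
A quick way to confirm these numbers is to inspect the object directly. The following is a minimal sketch, assuming the palmerpenguins and dplyr packages are installed; glimpse() is made available by dplyr.

```r
library(palmerpenguins)
library(dplyr)

# Number of rows (observations) and columns (variables)
dim(penguins)

# Column names, in the order they appear in the data set
names(penguins)

# Compact overview: one line per column with its type and first few values
glimpse(penguins)
```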

Introduction to ggplot2

ggplot2 is a core member of the tidyverse family of packages. Installing and loading the tidyverse package will load all of the packages we will need for today’s lesson.

# install.packages("tidyverse")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Here’s a question that we would like to answer using penguins data: Do penguins with deep beaks also have long beaks? This might seem like a silly question, but it gets us exploring our data.


We will begin by using the ggplot() function to initialize the basic graph structure. The basic idea is that you can specify different parts of the plot and add them together using the + operator. These parts are often referred to as layers.

As input ggplot() requires a data frame.

Let’s start:

#ggplot(penguins) #what happens?

Notice that you get a blank plot: ggplot() only sets up the canvas, and the layers that draw the data are added with the + operator. The + sign goes at the end of a line, not at the beginning of the next one.

One layer is called geometric objects. Examples include:

points (geom_point, geom_jitter for scatter plots, dot plots, etc)
lines (geom_line, for time series, trend lines, etc)
boxplot (geom_boxplot, for boxplots)

Any plot created with ggplot() must have at least one geom. You can add a geom to a plot using the + operator. Functions starting with the prefix geom create a visual representation of data.

#ggplot(penguins) + 
  #geom_point() # note what happens here

You will find that even though we have added a layer with geom_point(), we get an error. This is because each type of geom requires a certain set of aesthetics. Aesthetic mappings are created with the aes() function and can be supplied inside geom_point(). Examples of aesthetics include:

position (i.e., on the x and y axis)
color (“outside” color)
fill (“inside” color)
shape (of points)
line type
size

We will begin by specifying the x- and y-axis since geom_point() requires this information for a scatter plot. The following code will put bill_depth_mm on the x-axis and bill_length_mm on the y-axis:

ggplot(data = penguins) +
  geom_point(mapping = aes(x = bill_depth_mm,
                           y = bill_length_mm))
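
Because layers are combined with +, a plot can also be stored in an object and extended step by step. The sketch below illustrates this idea; the object name p is only an example.

```r
library(palmerpenguins)
library(ggplot2)

# ggplot() on its own only sets up an empty canvas
p <- ggplot(data = penguins)

# Add a layer; note that the + ends the line it continues from
p <- p +
  geom_point(mapping = aes(x = bill_depth_mm,
                           y = bill_length_mm))

# Printing the object draws the plot
p
```

Building a plot this way makes it easy to try out a new layer without retyping the parts that already work.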

Mapping data

What if we want to show the relationship between three variables in the same graph? We can employ color!

Let’s take a look:

ggplot(data = penguins) + 
  geom_jitter(mapping = aes(x = bill_depth_mm, 
                            y = bill_length_mm, 
                            color = island))

Island is a categorical variable (a factor) with a discrete set of possible values, so ggplot2 gives each island its own color. Body mass, by contrast, is a continuous numeric variable in which any number of potential values can exist between known values. When color is mapped to a continuous variable, ggplot2 uses a color bar with a continuous gradient instead.
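
To see the continuous case, color can be mapped to body_mass_g instead of island. This is a small sketch of that variation on the plot above.

```r
library(palmerpenguins)
library(ggplot2)

# island is a factor, body_mass_g is numeric
class(penguins$island)
class(penguins$body_mass_g)

# A numeric variable mapped to color produces a gradient color bar
ggplot(data = penguins) +
  geom_jitter(mapping = aes(x = bill_depth_mm,
                            y = bill_length_mm,
                            color = body_mass_g))
```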

To answer the question: Which species of penguin is found on all three islands?

We will add another type of aesthetic called shape, for categorical data like species.

ggplot(data = penguins) + 
  geom_jitter(mapping = aes(x = bill_depth_mm, 
                            y = bill_length_mm, 
                            color = island,
                            shape = species))

Can you spot the difference?

ggplot(data = penguins) + 
  geom_point(mapping = aes(x = bill_depth_mm, 
                           y = bill_length_mm),
             color = "blue")

Important point: Values set outside aesthetics will apply to the entire geom or plot!
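
A common slip is to put the fixed value inside aes(). In the sketch below, "blue" is then treated as data (a variable with a single level called "blue"), so with the default settings the points are drawn in the first color of the default palette and a legend appears. The object name p_mapped is only an example.

```r
library(palmerpenguins)
library(ggplot2)

# "blue" inside aes() is mapped like a data column, not used as a color
p_mapped <- ggplot(data = penguins) +
  geom_point(mapping = aes(x = bill_depth_mm,
                           y = bill_length_mm,
                           color = "blue"))

p_mapped
```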

Geometrical objects

Next, we will consider different options for geoms. By using different geom_ functions, the user can highlight different aspects of the data.

A useful geom function is geom_boxplot(). It adds a layer with a “box and whiskers” plot illustrating the distribution of values within categories. The following chart breaks down bill length by species: the box spans the first and third quartiles (the 25th and 75th percentiles), the middle bar marks the median, and each whisker extends to the most extreme value within 1.5 times the interquartile range of the box. Values beyond the whiskers are treated as outliers and shown as separate points.

ggplot(data = penguins) + 
  geom_boxplot(mapping = aes(x = species, 
                             y = bill_length_mm))
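
The numbers behind one of these boxes can be reproduced by hand. The sketch below uses the default rule documented for geom_boxplot(), where whiskers reach the most extreme observation within 1.5 times the interquartile range of the box. The variable names are illustrative only.

```r
library(palmerpenguins)

# Bill lengths for one species, dropping missing values
adelie <- na.omit(penguins$bill_length_mm[penguins$species == "Adelie"])

# Box edges (25th and 75th percentiles) and the median
q <- unname(quantile(adelie, probs = c(0.25, 0.5, 0.75)))
iqr <- q[3] - q[1]  # interquartile range

# Whiskers reach the most extreme observations inside these fences
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr
range(adelie[adelie >= lower_fence & adelie <= upper_fence])

# Anything outside the fences is drawn as an individual outlier point
adelie[adelie < lower_fence | adelie > upper_fence]
```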

Layers can be added on top of each other. In the following graph we will draw the boxplots over jittered points to see the distribution of the individual observations, including outliers, more clearly. We can map two aesthetic properties to the same variable: here species is mapped both to the x-axis and to the color of the points.

ggplot(data = penguins) + 
  geom_jitter(mapping = aes(x = species, 
                            y = bill_length_mm, 
                            color = species)) +
  geom_boxplot(mapping = aes(x = species,
                             y = bill_length_mm))

Now, this was slightly inefficient due to duplication of code - we had to specify the same mappings for two layers. To avoid it, you can move common arguments of geom_ functions to the main ggplot() function. In this case every layer will “inherit” the same arguments, specified in the “parent” function.

ggplot(data = penguins,
       mapping = aes(x = island, 
                     y = bill_length_mm)) + 
  geom_jitter(aes(color = island)) +
  geom_boxplot(alpha = .6)

#alpha takes a value from 0 (transparent) to 1 (solid).

You can still add layer-specific mappings or other arguments by specifying them within individual geoms. Here, we’ve set the transparency of the boxplot to .6, so we can see the points behind it, and also mapped color to island in the points. It is recommended to build each layer separately and then move common arguments up to the “parent” function.

We can use linear models to highlight the relationship between bill depth and bill length. Notice that we added a separate argument to the geom_smooth() function to specify the type of model we want ggplot2 to build from the data (a linear model). The geom_smooth() function has also helpfully drawn a confidence interval around the fitted line (the shaded gray area), which gives a sense of how precisely the line is estimated. For more information on statistical models, please refer to the help page (by typing ?geom_smooth).

ggplot(data = penguins, 
       mapping = aes(x = bill_depth_mm, 
                     y = bill_length_mm)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm")

Class Exercise:

Modify the plot so that the points are colored by species, but there is a single regression line.

ggplot(data = penguins, 
       mapping = aes(x = bill_depth_mm, 
                     y = bill_length_mm)) +
  geom_point(mapping = aes(color = species),
             alpha = 0.5) +
  geom_smooth(method = "lm")

In the graph above, the color aesthetic is mapped only inside geom_point(), so geom_smooth() inherits just x and y and fits a single linear model. Note that because we want the color to be mapped to the species variable, it needs to be wrapped in the aes() function and supplied to the mapping argument. If we instead move color up to the “parent” ggplot() function, every geom inherits all three mappings (x, y and color), and a separate linear model is fitted for each species:

ggplot(data = penguins, 
       mapping = aes(x = bill_depth_mm, 
                     y = bill_length_mm,
                     color = species)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm")

The data actually reveal something called “Simpson’s paradox”: a relationship appears to go in one direction overall, but within groups of the data the relationship goes the opposite way. Here, the overall relationship between bill length and bill depth looks negative, but once we take into account that there are different species, the relationship within each species is positive.
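
The change of sign can also be checked numerically. This is a minimal sketch that fits the same kind of linear model with lm(), once on all penguins and once per species. The object names overall and by_species are illustrative.

```r
library(palmerpenguins)

# Slope of bill length on bill depth, ignoring species
overall <- coef(lm(bill_length_mm ~ bill_depth_mm,
                   data = penguins))[["bill_depth_mm"]]

# The same slope, estimated separately within each species
by_species <- sapply(split(penguins, penguins$species), function(d) {
  coef(lm(bill_length_mm ~ bill_depth_mm, data = d))[["bill_depth_mm"]]
})

overall     # negative: the pooled line slopes downwards
by_species  # positive for every species
```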

Sub-plots (plot panels)

Often, we’d like to create the same set of plots, but as distinctly different subplots. This way, we don’t need to map so many aesthetics (it can end up being really messy).

Let’s say that, for the last plot we made, we also want to understand whether there are differences between male and female penguins.

In ggplot2, this is called a “facet”, and the function we use is called either facet_wrap or facet_grid.

ggplot(penguins, 
      aes(x = bill_depth_mm, 
          y = bill_length_mm,
          color = species)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  facet_wrap(~ sex)

The facet functions take formula arguments, meaning they contain the tilde (~).

This plot looks a little crazy though, as we have penguins with missing sex information getting their own panel, and really, it makes more sense to compare the sexes within each species rather than the other way around.
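
One way to avoid the extra panel is to drop the rows with missing sex before plotting. The sketch below uses base R's subset() for this (filtering with dplyr is covered next class) and also shows facet_grid(), which takes a rows ~ columns formula. The name penguins_sexed is only an example.

```r
library(palmerpenguins)
library(ggplot2)

# Keep only rows where sex was recorded, so no NA panel is drawn
penguins_sexed <- subset(penguins, !is.na(sex))

ggplot(penguins_sexed,
       aes(x = bill_depth_mm,
           y = bill_length_mm,
           color = species)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  facet_grid(species ~ sex)  # one row per species, one column per sex
```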

Class Exercise:

Swap the places of sex and species. Then add another variable to facet by so that you are faceting by both species and island.

ggplot(penguins, 
      aes(x = bill_depth_mm, 
          y = bill_length_mm,
          color = sex)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  facet_wrap(~ species + island)

Subsetting data with `dplyr`

Now we will move forward with answering two of the most common questions from graduate students:

Q1: How can I subset the number of columns in my data set? (this class)
Q2: How can I reduce the number of rows in my data set? (next class)

In many cases, we are working with data sets that contain more data than we need, or we want to inspect certain parts of the data set before we continue.

The {dplyr} package

The {dplyr} package provides a number of very useful functions for manipulating data sets in a way that will reduce the probability of making errors, and even save you some typing time.

We’re going to cover 6 of the most commonly used functions as well as using pipes (|>) to combine them.

select() (covered in this class)
filter() (covered next class)
arrange() (covered next class)
mutate() (covered next class)
group_by() (covered next class)
summarize() (covered next class)

Selecting columns

Let us first talk about selecting columns. In {dplyr}, the function for selecting columns is select()! Most {tidyverse} function names are inspired by English grammar, which will help us when we are writing our code.

To select data, we must first tell select() which data set we are selecting from, and then give it our selection. Here, we are asking R to select() from the penguins data set the island, species and sex columns.

select(penguins, island, species, sex)

## # A tibble: 344 × 3
##    island    species sex   
##    <fct>     <fct>   <fct> 
##  1 Torgersen Adelie  male  
##  2 Torgersen Adelie  female
##  3 Torgersen Adelie  female
##  4 Torgersen Adelie  <NA>  
##  5 Torgersen Adelie  female
##  6 Torgersen Adelie  male  
##  7 Torgersen Adelie  female
##  8 Torgersen Adelie  male  
##  9 Torgersen Adelie  <NA>  
## 10 Torgersen Adelie  <NA>  
## # ℹ 334 more rows

When we use select(), we don’t need to put the column names in quotation marks; we write the names in directly.
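
The same selection can be written with the pipe (|>) mentioned earlier, which passes the object on its left as the first argument of the function on its right. This sketch assumes R 4.1 or later, where the native pipe is available.

```r
library(palmerpenguins)
library(dplyr)

# Without a pipe: the data set is the first argument
select(penguins, island, species, sex)

# With the pipe: penguins is passed in as the first argument of select()
penguins |>
  select(island, species, sex)
```

Both calls return the same tibble; the piped form becomes more useful once several steps are chained together.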

We can also use the numeric indexes for the column, if we are 100% certain of the order of the columns:

select(penguins, 1:3, 6)

## # A tibble: 344 × 4
##    species island    bill_length_mm body_mass_g
##    <fct>   <fct>              <dbl>       <int>
##  1 Adelie  Torgersen           39.1        3750
##  2 Adelie  Torgersen           39.5        3800
##  3 Adelie  Torgersen           40.3        3250
##  4 Adelie  Torgersen           NA            NA
##  5 Adelie  Torgersen           36.7        3450
##  6 Adelie  Torgersen           39.3        3650
##  7 Adelie  Torgersen           38.9        3625
##  8 Adelie  Torgersen           39.2        4675
##  9 Adelie  Torgersen           34.1        3475
## 10 Adelie  Torgersen           42          4250
## # ℹ 334 more rows

In some cases, we want to remove columns without having to state every column we want to keep. select() also allows for this: add a minus (-) sign in front of each column name you don’t want.

select(penguins, -bill_length_mm, -bill_depth_mm)

## # A tibble: 344 × 6
##    species island    flipper_length_mm body_mass_g sex     year
##    <fct>   <fct>                 <int>       <int> <fct>  <int>
##  1 Adelie  Torgersen               181        3750 male    2007
##  2 Adelie  Torgersen               186        3800 female  2007
##  3 Adelie  Torgersen               195        3250 female  2007
##  4 Adelie  Torgersen                NA          NA <NA>    2007
##  5 Adelie  Torgersen               193        3450 female  2007
##  6 Adelie  Torgersen               190        3650 male    2007
##  7 Adelie  Torgersen               181        3625 female  2007
##  8 Adelie  Torgersen               195        4675 male    2007
##  9 Adelie  Torgersen               193        3475 <NA>    2007
## 10 Adelie  Torgersen               190        4250 <NA>    2007
## # ℹ 334 more rows

Class Exercise:

Select the columns sex, year, and species. Make sure that species comes before sex in the output.

select(penguins, species, sex, year)

## # A tibble: 344 × 3
##    species sex     year
##    <fct>   <fct>  <int>
##  1 Adelie  male    2007
##  2 Adelie  female  2007
##  3 Adelie  female  2007
##  4 Adelie  <NA>    2007
##  5 Adelie  female  2007
##  6 Adelie  male    2007
##  7 Adelie  female  2007
##  8 Adelie  male    2007
##  9 Adelie  <NA>    2007
## 10 Adelie  <NA>    2007
## # ℹ 334 more rows

select() does not only subset columns; it can also re-arrange them. The columns appear in the order in which you specify them.
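
Re-arranging does not require typing every column name. As a small sketch, the everything() helper stands for all columns not already named, so the call below moves year to the front and keeps the rest in their original order.

```r
library(palmerpenguins)
library(dplyr)

# year first, then every remaining column in its original order
select(penguins, year, everything())
```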

Tidy selections

These selections are quite convenient and fast! But they can be even better.

For instance, what if we want to choose all the columns with millimeter measurements? That could be quite convenient, making sure the variables we are working with have the same measurement scale.

We could of course type them all out, but the penguins data set has names that make it even easier for us, using something called tidy-selectors.

Here, we use a tidy-selector called ends_with(). Can you guess what it does? Yes, it looks for columns whose names end with the string you provide, here "mm".

select(penguins, ends_with("mm"))

## # A tibble: 344 × 3
##    bill_length_mm bill_depth_mm flipper_length_mm
##             <dbl>         <dbl>             <int>
##  1           39.1          18.7               181
##  2           39.5          17.4               186
##  3           40.3          18                 195
##  4           NA            NA                  NA
##  5           36.7          19.3               193
##  6           39.3          20.6               190
##  7           38.9          17.8               181
##  8           39.2          19.6               195
##  9           34.1          18.1               193
## 10           42            20.2               190
## # ℹ 334 more rows

So convenient! There are several other tidy-selectors you can choose, which you can find here, but often people resort to three specific ones:

ends_with() - column names ending with a character string
starts_with() - column names starting with a character string
contains() - column names containing a character string
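
The other two selectors work the same way as ends_with(). The sketch below tries each on the penguins data, and also shows that a selector can be combined with the minus sign to drop the matching columns.

```r
library(palmerpenguins)
library(dplyr)

# Columns whose names start with "bill"
select(penguins, starts_with("bill"))

# Columns whose names contain "length" anywhere
select(penguins, contains("length"))

# Everything except the millimeter measurements
select(penguins, -ends_with("mm"))
```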

Summary

We learned about different parameters of ggplot functions, and how to combine different geoms into more complex charts. We also learned a little bit about subsetting our data, to create data sets that suit our needs!
