Introduction To Tidyverse
Tidyverse
The Tidyverse is a suite of integrated packages designed to work together to make common data science operations more user friendly. The packages have functions for data wrangling, tidying, reading/writing, parsing, and visualizing, among others. There is a freely available book, R for Data Science, with detailed descriptions and practical examples of the tools available and how they work together. We will explore the basic syntax for working with these packages, as well as specific functions for data wrangling with the ‘dplyr’ package and data visualization with the ‘ggplot2’ package.
Today’s Lesson
Questions
- How do I access my data in R?
- How do I visualize my data with ggplot2?
- How do I subset my data with dplyr?
Objectives
- To be able to use ggplot2 to generate publication-quality graphics.
- To understand the basic grammar of graphics, including the aesthetics and geometry layers, adding statistics, transforming scales, and coloring or paneling by groups.
- To be able to subset data using dplyr.
Keypoints
- Read data into R
- Use ggplot2 to create different types of plots
- Use dplyr to subset data
About the dataset
Palmer penguins. Data were collected on 344 penguins living on three islands (Torgersen, Biscoe, and Dream) in the Palmer Archipelago, Antarctica by Dr. Kristen Gorman. In addition to which island each penguin lived on, the data contains information on the species of the penguin (Adelie, Chinstrap, and Gentoo), its bill length, bill depth, flipper length, its body mass, and sex of the penguin (male or female).
The palmerpenguins package contains two datasets: penguins and penguins_raw. We will be using penguins.
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## # ℹ 2 more variables: sex <fct>, year <int>
You should see an object called “penguins”, which is a data set with 344 observations and 8 variables. In the column names, mm stands for millimeters and g for grams.
The dataset contains the following fields:
- species: penguin species
- island: island of observation
- bill_length_mm: bill length in millimetres
- bill_depth_mm: bill depth in millimetres
- flipper_length_mm: flipper length in millimetres
- body_mass_g: body mass in grams
- sex: penguin sex
- year: year of observation
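The preview shown earlier is the kind of output you get from looking at the first rows of the data. A minimal sketch of loading and inspecting it, assuming the palmerpenguins package is installed from CRAN (the install line is commented out, as elsewhere in this lesson):

```r
# install.packages("palmerpenguins")
library(palmerpenguins)

# First six rows of the data set
head(penguins)

# Number of rows (observations) and columns (variables)
dim(penguins)

# Column names, matching the field list above
names(penguins)
```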
Introduction to ggplot2
ggplot2 is a core member of the tidyverse family of packages. Installing and loading the tidyverse package will load all of the plotting and data-manipulation packages we need for today’s lesson (the penguins data itself comes from the separate palmerpenguins package).
# install.packages("tidyverse")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Here’s a question that we would like to answer using the penguins data: Do penguins with deep beaks also have long beaks? This might seem like a silly question, but it gets us exploring our data.
We will begin by using the ggplot() function to initialize the basic graph structure. The basic idea is that you can specify different parts of the plot and add them together using the + operator. These parts are often referred to as layers.
As input, ggplot() requires a data frame.
Let’s start:
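The code for this first step is not shown in the text. A minimal sketch of what it presumably looks like, assuming the penguins data comes from the palmerpenguins package:

```r
library(ggplot2)
library(palmerpenguins)

# Initialize a plot with the data only: no layers have been added yet,
# so the result is an empty canvas.
p_empty <- ggplot(data = penguins)
p_empty
```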
Notice that you get a blank plot, because ggplot2 requires the user to specify layers using the + operator.
The + sign goes at the end of the line, not at the beginning.
One type of layer is the geometric object (geom). Examples include:
- points (geom_point, geom_jitter, for scatter plots, dot plots, etc.)
- lines (geom_line, for time series, trend lines, etc.)
- boxplots (geom_boxplot, for boxplots)
Any plot created with ggplot() must have at least one geom. You can add a geom to a plot using the + operator. Functions starting with the prefix geom_ create a visual representation of data.
You will find that even though we have added a layer by specifying geom_point(), we still get an error. This is because each type of geom has a required set of aesthetics that must be set. Aesthetic mappings are set with the aes() function and can be set inside geom_point(). Examples of aesthetics include:
- position (i.e., on the x and y axis)
- color (“outside” color)
- fill (“inside” color)
- shape (of points)
- line type
- size
We will begin by specifying the x- and y-axis, since geom_point() requires this information for a scatter plot. The following code will put bill_depth_mm on the x-axis and bill_length_mm on the y-axis:
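The code chunk for this step is missing from the text. A minimal sketch that follows the description above, assuming the penguins data from the palmerpenguins package (ggplot2 will warn that the rows with missing bill measurements were dropped):

```r
library(ggplot2)
library(palmerpenguins)

# Scatter plot: bill depth on the x-axis, bill length on the y-axis.
p_scatter <- ggplot(data = penguins) +
  geom_point(mapping = aes(x = bill_depth_mm,
                           y = bill_length_mm))
p_scatter
```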
Mapping data
What if we want to show the relationship between three variables in the same graph? We can employ color!
Let’s take a look:
ggplot(data = penguins) +
geom_jitter(mapping = aes(x = bill_depth_mm,
y = bill_length_mm,
color = island))
Island is a categorical variable (a factor) with a discrete set of possible values, whereas body mass is a continuous numeric variable in which any number of values can exist between known values. When a continuous variable such as body_mass_g is mapped to color, ggplot2 represents it with a color bar showing a continuous gradient instead of a discrete legend.
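To see the contrast with the island plot, here is a sketch of the same scatter plot with color mapped to a continuous variable. The choice of body_mass_g follows the text; the object name p_mass is only for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Mapping a numeric column to color gives a gradient color bar
# rather than one legend entry per category.
p_mass <- ggplot(data = penguins) +
  geom_jitter(mapping = aes(x = bill_depth_mm,
                            y = bill_length_mm,
                            color = body_mass_g))
p_mass
```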
To answer the question “Which species of penguin is found on all three islands?”, we will add another aesthetic, shape, which works well for categorical data like species.
ggplot(data = penguins) +
geom_jitter(mapping = aes(x = bill_depth_mm,
y = bill_length_mm,
color = island,
shape = species))
Can you spot the difference?
ggplot(data = penguins) +
geom_point(mapping = aes(x = bill_depth_mm,
y = bill_length_mm),
color = "blue")
Important point: Values set outside aesthetics will apply to the entire geom or plot!
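A short sketch of the difference between setting a value and mapping it, using the same plot. The object names p_set and p_map are hypothetical. Placing "blue" inside aes() is a common slip: ggplot2 then treats it as a data value with a single category rather than as a color name.

```r
library(ggplot2)
library(palmerpenguins)

# Setting: color is outside aes(), so every point is drawn in blue.
p_set <- ggplot(data = penguins) +
  geom_point(mapping = aes(x = bill_depth_mm,
                           y = bill_length_mm),
             color = "blue")

# Mapping: color is inside aes(), so "blue" is treated as a category label.
# ggplot2 picks a color from its default palette and adds a legend.
p_map <- ggplot(data = penguins) +
  geom_point(mapping = aes(x = bill_depth_mm,
                           y = bill_length_mm,
                           color = "blue"))
```

Printing p_set and p_map side by side makes the distinction visible.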
Geometrical objects
Next, we will consider different options for geoms. Using different geom_ functions, the user can highlight different aspects of the data.
A useful geom function is geom_boxplot(). It adds a layer with the “box and whiskers” plot illustrating the distribution of values within categories. The following chart breaks down bill length by island, where the box spans the first and third quartiles (the 25th and 75th percentiles), the middle bar marks the median, and each whisker extends to the most extreme value within 1.5 times the interquartile range of the box. Points beyond the whiskers are treated as outliers and shown separately.
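The chart described above is not included in the text. A minimal sketch of it, assuming island on the x-axis and bill length on the y-axis as described:

```r
library(ggplot2)
library(palmerpenguins)

# One box per island, summarizing the distribution of bill length.
p_box <- ggplot(data = penguins) +
  geom_boxplot(mapping = aes(x = island,
                             y = bill_length_mm))
p_box
```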
Layers can be added on top of each other. In the following graph we will place the boxplots over jittered points to see the distribution of outliers more clearly. We can map two aesthetic properties to the same variable: here island determines both the x position and the color of the points.
ggplot(data = penguins) +
geom_jitter(mapping = aes(x = island,
y = bill_length_mm,
color = island)) +
geom_boxplot(mapping = aes(x = island,
y = bill_length_mm))
Now, this was slightly inefficient due to duplication of code: we had to specify the same mappings for two layers. To avoid it, you can move common arguments of geom_ functions to the main ggplot() function. In this case every layer will “inherit” the same arguments, specified in the “parent” function.
ggplot(data = penguins,
mapping = aes(x = island,
y = bill_length_mm)) +
geom_jitter(aes(color = island)) +
geom_boxplot(alpha = .6)
You can still add layer-specific mappings or other arguments by specifying them within individual geoms. Here, we’ve set the transparency of the boxplot to .6, so we can see the points behind it, and also mapped color to island in the points. It is recommended to build each layer separately and then move common arguments up to the “parent” function.
We can use linear models to highlight the relationship between bill depth and bill length. Notice that we added a separate argument to the geom_smooth() function to specify the type of model we want ggplot2 to build using the data (linear model). The geom_smooth() function has also helpfully provided a confidence interval around the fitted line (shaded gray area), which gives a sense of how uncertain the fit is. For more information on statistical models, please refer to the help page (by typing ?geom_smooth).
ggplot(data = penguins,
mapping = aes(x = bill_depth_mm,
y = bill_length_mm)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm")
Class Exercise:
Modify the plot so that the points are colored by species, but there is a single regression line.
ggplot(data = penguins,
mapping = aes(x = bill_depth_mm,
y = bill_length_mm)) +
geom_point(mapping = aes(color = species),
alpha = 0.5) +
geom_smooth(method = "lm")
In the solution above, color is mapped inside geom_point() only, so geom_smooth() does not inherit it and a single linear model is fitted to all the points. Because the color still needs to follow the species variable, it has to be wrapped in the aes() function and supplied to the mapping argument. If we instead move color to the “parent” function, each geom inherits all three mappings (x, y and color), and a separate linear model is built for each species:
ggplot(data = penguins,
mapping = aes(x = bill_depth_mm,
y = bill_length_mm,
color = species)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm")
The data actually reveal something called “Simpson’s paradox”: a relationship appears to go in one direction overall, but within groups of the data it goes in the opposite direction. Here, the overall relationship between bill length and bill depth looks negative, but when we take into account that there are different species, the relationship within each species is actually positive.
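The reversal can also be checked numerically. This sketch fits the same linear model that geom_smooth(method = "lm") draws, once for all penguins and once per species. The helper bill_slope() is a hypothetical name introduced only for this illustration, and it uses base R rather than dplyr.

```r
library(palmerpenguins)

# Slope of a simple linear model of bill length on bill depth.
# lm() drops rows with missing values by default.
bill_slope <- function(df) {
  coef(lm(bill_length_mm ~ bill_depth_mm, data = df))[["bill_depth_mm"]]
}

# One model for all penguins together
overall_slope <- bill_slope(penguins)

# One model per species
species_slopes <- sapply(split(penguins, penguins$species), bill_slope)

overall_slope
species_slopes
```

A negative overall slope alongside positive slopes for every species is the pattern the plots show.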
Sub-plots (plot panels)
Often, we’d like to create the same set of plots, but as distinctly different subplots. This way, we don’t need to map so many aesthetics (it can end up being really messy).
Let’s say that, for the last plot we made, we want to understand whether there are also differences between male and female penguins.
In ggplot2, each subplot is called a “facet”, and the function we use is either facet_wrap() or facet_grid().
ggplot(penguins,
aes(x = bill_depth_mm,
y = bill_length_mm,
color = species)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
facet_wrap(~ sex)
The facet functions take formula arguments, meaning they contain the tilde (~).
This plot looks a little crazy though, as we have penguins with missing sex information getting their own panel, and really, it makes more sense to compare the sexes within each species rather than the other way around.
Class Exercise:
Swap the places of sex and species. Then add another variable to facet by so that you are faceting by both species and island.
ggplot(penguins,
aes(x = bill_depth_mm,
y = bill_length_mm,
color = sex)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
facet_wrap(~ species + island)
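The text mentions facet_grid() but only shows facet_wrap(). As a sketch of the alternative, facet_grid() lays the panels out as a table, with the variable to the left of the tilde in rows and the one to the right in columns. The choice of sex and species here is just an illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Rows are split by sex, columns by species.
p_grid <- ggplot(penguins,
                 aes(x = bill_depth_mm,
                     y = bill_length_mm)) +
  geom_point(alpha = 0.5) +
  facet_grid(sex ~ species)
p_grid
```

Penguins with missing sex information still get their own row of panels in this layout.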
Subsetting data with dplyr
Now we will move forward with answering two of the most common questions from graduate students,
Q1: How can I subset the number of columns in my data set? (this class)
Q2: How can I reduce the number of rows in my data set? (next class)
In many cases, we are working with data sets that contain more data than we need, or we want to inspect certain parts of the data set before we continue.
The {dplyr} package
The {dplyr} package provides a number of very useful functions for manipulating data sets in a way that will reduce the probability of making errors, and even save you some typing time.
We’re going to cover 6 of the most commonly used functions, as well as using pipes (|>) to combine them.
- select() (covered in this class)
- filter() (covered next class)
- arrange() (covered next class)
- mutate() (covered next class)
- group_by() (covered next class)
- summarize() (covered next class)
Selecting columns
Let us first talk about selecting columns. In {dplyr}, the function name for selecting columns is select()! Most {tidyverse} function names are inspired by English grammar, which will help us when we are writing our code.
To select data, we must first tell select() which data set we are selecting from, and then give it our selection. Here, we are asking R to select() from the penguins data set the island, species and sex columns:
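The call itself is missing from the text. Based on the description, it was presumably along these lines (a sketch, assuming dplyr and palmerpenguins are installed; the object name is only for illustration):

```r
library(dplyr)
library(palmerpenguins)

# First argument: the data set. Remaining arguments: the columns to keep.
island_species_sex <- select(penguins, island, species, sex)
island_species_sex
```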
## # A tibble: 344 × 3
## island species sex
## <fct> <fct> <fct>
## 1 Torgersen Adelie male
## 2 Torgersen Adelie female
## 3 Torgersen Adelie female
## 4 Torgersen Adelie <NA>
## 5 Torgersen Adelie female
## 6 Torgersen Adelie male
## 7 Torgersen Adelie female
## 8 Torgersen Adelie male
## 9 Torgersen Adelie <NA>
## 10 Torgersen Adelie <NA>
## # ℹ 334 more rows
When we use select(), we don’t need to use quotation marks; we write the column names directly.
We can also use the numeric indexes for the column, if we are 100% certain of the order of the columns:
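The code for this example is not shown. Judging from the columns in the output that follows (positions 1, 2, 3 and 6 of penguins), a call like this sketch would produce it:

```r
library(dplyr)
library(palmerpenguins)

# Columns 1 to 3 (species, island, bill_length_mm) and column 6 (body_mass_g)
by_position <- select(penguins, 1:3, 6)
by_position
```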
## # A tibble: 344 × 4
## species island bill_length_mm body_mass_g
## <fct> <fct> <dbl> <int>
## 1 Adelie Torgersen 39.1 3750
## 2 Adelie Torgersen 39.5 3800
## 3 Adelie Torgersen 40.3 3250
## 4 Adelie Torgersen NA NA
## 5 Adelie Torgersen 36.7 3450
## 6 Adelie Torgersen 39.3 3650
## 7 Adelie Torgersen 38.9 3625
## 8 Adelie Torgersen 39.2 4675
## 9 Adelie Torgersen 34.1 3475
## 10 Adelie Torgersen 42 4250
## # ℹ 334 more rows
In some cases, we want to remove columns, and not necessarily state all columns we want to keep. select() also allows for this by adding a minus (-) sign in front of the name of each column you don’t want.
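The code is again missing here. The output that follows lacks the two bill columns, so a sketch of the call, under that assumption, is:

```r
library(dplyr)
library(palmerpenguins)

# Drop the two bill measurements and keep everything else.
without_bills <- select(penguins, -bill_length_mm, -bill_depth_mm)
without_bills
```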
## # A tibble: 344 × 6
## species island flipper_length_mm body_mass_g sex year
## <fct> <fct> <int> <int> <fct> <int>
## 1 Adelie Torgersen 181 3750 male 2007
## 2 Adelie Torgersen 186 3800 female 2007
## 3 Adelie Torgersen 195 3250 female 2007
## 4 Adelie Torgersen NA NA <NA> 2007
## 5 Adelie Torgersen 193 3450 female 2007
## 6 Adelie Torgersen 190 3650 male 2007
## 7 Adelie Torgersen 181 3625 female 2007
## 8 Adelie Torgersen 195 4675 male 2007
## 9 Adelie Torgersen 193 3475 <NA> 2007
## 10 Adelie Torgersen 190 4250 <NA> 2007
## # ℹ 334 more rows
Class Exercise:
Select the columns sex, year, and species. Make sure that species comes before sex in the output.
## # A tibble: 344 × 3
## species sex year
## <fct> <fct> <int>
## 1 Adelie male 2007
## 2 Adelie female 2007
## 3 Adelie female 2007
## 4 Adelie <NA> 2007
## 5 Adelie female 2007
## 6 Adelie male 2007
## 7 Adelie female 2007
## 8 Adelie male 2007
## 9 Adelie <NA> 2007
## 10 Adelie <NA> 2007
## # ℹ 334 more rows
select() not only subsets columns, it can also rearrange them. The columns appear in the order in which your selection is specified.
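One possible solution to the exercise, as a sketch: listing species first is enough to place it before sex in the result.

```r
library(dplyr)
library(palmerpenguins)

# Columns come out in the order they are listed.
exercise_cols <- select(penguins, species, sex, year)
exercise_cols
```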
Tidy selections
These selections are quite convenient and fast! But they can be even better.
For instance, what if we want to choose all the columns with millimeter measurements? That could be quite convenient, making sure the variables we are working with have the same measurement scale.
We could of course type them all out, but the penguins data set has names that make it even easier for us, using something called tidy-selectors.
Here, we use the tidy-selector ends_with(). Can you guess what it does? Yes, it looks for columns that end with the string you provide, here "mm".
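The call that produces the output below is not shown in the text. It was presumably similar to this sketch:

```r
library(dplyr)
library(palmerpenguins)

# Keep every column whose name ends in "mm".
mm_cols <- select(penguins, ends_with("mm"))
mm_cols
```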
## # A tibble: 344 × 3
## bill_length_mm bill_depth_mm flipper_length_mm
## <dbl> <dbl> <int>
## 1 39.1 18.7 181
## 2 39.5 17.4 186
## 3 40.3 18 195
## 4 NA NA NA
## 5 36.7 19.3 193
## 6 39.3 20.6 190
## 7 38.9 17.8 181
## 8 39.2 19.6 195
## 9 34.1 18.1 193
## 10 42 20.2 190
## # ℹ 334 more rows
So convenient! There are several other tidy-selectors to choose from, which are listed on the select() help page (?select), but often people resort to three specific ones:
- ends_with() - column names ending with a character string
- starts_with() - column names starting with a character string
- contains() - column names containing a character string
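A short sketch of the other two selectors, written with the native pipe (|>) mentioned earlier, which requires R 4.1 or later. The strings "bill" and "length" are example choices, not taken from the lesson.

```r
library(dplyr)
library(palmerpenguins)

# Columns whose names start with "bill"
bill_cols <- penguins |>
  select(starts_with("bill"))

# Columns whose names contain "length" anywhere
length_cols <- penguins |>
  select(contains("length"))

names(bill_cols)
names(length_cols)
```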
Summary
We learned about different parameters of ggplot functions, and how to combine different geoms into more complex charts. We also learned a little bit about subsetting our data, to create data sets that suit our needs!