Introduction To Tidyverse Part II
Load Library
Load the package required for subsequent analyses.
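A minimal sketch of the loading step. It assumes the penguins data comes from the palmerpenguins package, which matches the column names and the 344 rows shown in the outputs in this lesson.

```r
# Assumption: tidyverse provides dplyr and ggplot2,
# and palmerpenguins provides the penguins data set.
library(tidyverse)
library(palmerpenguins)

# Quick check of the data: number of rows and columns
dim(penguins)
```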
Today’s Lesson
Questions
- How can I subset the number of columns in my data set?
- How can I reduce the number of rows in my data set?
Objectives
- Use select() to reduce columns
- Use tidyselectors like starts_with() within select() to reduce columns
- Use filter() to reduce rows
- Understand the difference between & and |
- Use the pipe |> to chain commands together
Challenge 1
Change the geometric object from geom_jitter to geom_point. What is the difference in the plots? Can you explain why?
ggplot(data = penguins,
mapping = aes(x = island,
y = bill_length_mm)) +
geom_jitter(aes(color = island)) +
geom_boxplot(alpha = .6)
Both geom_point() and geom_jitter() are used to create scatterplots, but they behave slightly differently when plotting data points. With a categorical x variable such as island, geom_point() stacks all points for an island on a single vertical line, while geom_jitter() spreads them out sideways.
geom_point()
- What it does: Plots the exact coordinates of your data points. Each point is placed exactly at its x and y values.
- When to use: When you want to visualize the precise relationship between two continuous variables. Suitable when data points do not overlap or crowd.
geom_jitter()
- What it does: Adds random noise (jittering) to the points. This prevents overplotting by slightly shifting points horizontally and/or vertically.
- When to use: When you have overlapping points that need to be spread out to avoid clutter.
Selecting columns with the {dplyr} package
The {dplyr} package provides a number of very useful functions for manipulating data sets in a way that will reduce the probability of making errors.
We’re going to cover 6 of the most commonly used functions, as well as using pipes (|>) to combine them.
select()
filter()
arrange()
mutate()
group_by()
summarize()
Challenge 2
Using select(), select the first four columns of the penguins data. Assign this new dataset to penguins_new.
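One possible solution, as a sketch. It assumes penguins comes from the palmerpenguins package and that "first four" means the first four columns in the order listed in the reminder under Tidy selectors.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# Select the first four columns by name
penguins_new <- select(penguins, species, island, bill_length_mm, bill_depth_mm)

# The same columns can also be selected by position
penguins_new_by_position <- select(penguins, 1:4)

names(penguins_new)
```

Selecting by name is usually safer than selecting by position, because the code keeps working if the column order changes.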
Tidy selectors
Just a reminder, here are the column names:
- species: penguin species
- island: island of observation
- bill_length_mm: bill length in millimetres
- bill_depth_mm: bill depth in millimetres
- flipper_length_mm: flipper length in millimetres
- body_mass_g: body mass in grams
- sex: penguin sex
- year: year of observation
What if we want to choose all the columns with millimeter measurements? We could use something called tidy-selectors.
Here, we use the tidy-selector ends_with(). It looks for columns that end with the string you provide; here we are specifying "mm".
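The call described above can be sketched as follows, assuming penguins comes from the palmerpenguins package. The name mm_columns is only used for illustration.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# Keep only the columns whose names end with "mm"
mm_columns <- select(penguins, ends_with("mm"))
mm_columns
```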
## # A tibble: 344 × 3
## bill_length_mm bill_depth_mm flipper_length_mm
## <dbl> <dbl> <int>
## 1 39.1 18.7 181
## 2 39.5 17.4 186
## 3 40.3 18 195
## 4 NA NA NA
## 5 36.7 19.3 193
## 6 39.3 20.6 190
## 7 38.9 17.8 181
## 8 39.2 19.6 195
## 9 34.1 18.1 193
## 10 42 20.2 190
## # ℹ 334 more rows
So convenient! There are several other tidy-selectors to choose from, but people often resort to three specific ones:
- ends_with(): column names ending with a character string
- starts_with(): column names starting with a character string
- contains(): column names containing a character string
Challenge 3
Using one of the tidy-selectors, select all columns containing an underscore ("_").
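One possible solution, sketched with contains() and assuming penguins comes from the palmerpenguins package. The name underscore_columns is illustrative.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# Keep the columns whose names contain an underscore
underscore_columns <- select(penguins, contains("_"))
underscore_columns
```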
## # A tibble: 344 × 4
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <dbl> <dbl> <int> <int>
## 1 39.1 18.7 181 3750
## 2 39.5 17.4 186 3800
## 3 40.3 18 195 3250
## 4 NA NA NA NA
## 5 36.7 19.3 193 3450
## 6 39.3 20.6 190 3650
## 7 38.9 17.8 181 3625
## 8 39.2 19.6 195 4675
## 9 34.1 18.1 193 3475
## 10 42 20.2 190 4250
## # ℹ 334 more rows
Challenge 4
Select the species and sex columns, in addition to all columns ending with “mm”
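One possible solution, as a sketch assuming penguins comes from the palmerpenguins package. Column names and tidy-selectors can be mixed in the same select() call, and the columns come back in the order they are requested.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# Named columns first, then every column ending with "mm"
species_sex_mm <- select(penguins, species, sex, ends_with("mm"))
species_sex_mm
```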
## # A tibble: 344 × 5
## species sex bill_length_mm bill_depth_mm flipper_length_mm
## <fct> <fct> <dbl> <dbl> <int>
## 1 Adelie male 39.1 18.7 181
## 2 Adelie female 39.5 17.4 186
## 3 Adelie female 40.3 18 195
## 4 Adelie <NA> NA NA NA
## 5 Adelie female 36.7 19.3 193
## 6 Adelie male 39.3 20.6 190
## 7 Adelie female 38.9 17.8 181
## 8 Adelie male 39.2 19.6 195
## 9 Adelie <NA> 34.1 18.1 193
## 10 Adelie <NA> 42 20.2 190
## # ℹ 334 more rows
Filtering rows
Now that we know how to select the columns we want, we should take a look at how we filter the rows.
The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions. To be retained, the row must produce a value of TRUE for all conditions.
Let’s use filter() to keep any penguin with a body mass of less than 3000 grams.
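A sketch of that filter, assuming penguins comes from the palmerpenguins package. The name light_penguins is illustrative.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# Keep rows where body mass is below 3000 grams.
# Rows with a missing body mass are dropped, because NA < 3000 is NA, not TRUE.
light_penguins <- filter(penguins, body_mass_g < 3000)
light_penguins
```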
## # A tibble: 9 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Dream 37.5 18.9 179 2975
## 2 Adelie Biscoe 34.5 18.1 187 2900
## 3 Adelie Biscoe 36.5 16.6 181 2850
## 4 Adelie Biscoe 36.4 17.1 184 2850
## 5 Adelie Dream 33.1 16.1 178 2900
## 6 Adelie Biscoe 37.9 18.6 193 2925
## 7 Adelie Torgersen 38.6 17 188 2900
## 8 Chinstrap Dream 43.2 16.6 187 2900
## 9 Chinstrap Dream 46.9 16.6 192 2700
## # ℹ 2 more variables: sex <fct>, year <int>
Above, we have filtered so that only observations with a body mass of less than 3 kilos are kept.
The output shows several rows where the body mass is exactly 2900. What if we want only those rows?
We need to use double equals (==).
In R, = and == have different meanings:
- = is used to assign values to arguments in functions.
- == is used to compare values.
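A minimal sketch of the mistake, assuming the incorrect call passes body_mass_g = 2900 to filter() and that penguins comes from the palmerpenguins package. The call is wrapped in tryCatch() so the script keeps running and the error message can be inspected.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# Incorrect: a single = names an argument instead of comparing values
wrong_result <- tryCatch(
  filter(penguins, body_mass_g = 2900),
  error = function(e) conditionMessage(e)  # capture the error text
)
wrong_result
```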
What happens above?
This is incorrect usage because = is trying to assign 2900 to body_mass_g instead of comparing it. filter() expects a logical condition to subset rows but instead gets an assignment, which results in unexpected behavior or an error.
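The corrected call can be sketched like this, assuming penguins comes from the palmerpenguins package. The name exactly_2900 is illustrative.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# Correct: == compares each body mass with 2900
exactly_2900 <- filter(penguins, body_mass_g == 2900)
exactly_2900
```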
## # A tibble: 4 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Biscoe 34.5 18.1 187 2900
## 2 Adelie Dream 33.1 16.1 178 2900
## 3 Adelie Torgersen 38.6 17 188 2900
## 4 Chinstrap Dream 43.2 16.6 187 2900
## # ℹ 2 more variables: sex <fct>, year <int>
What happens above?
This correctly filters the penguins dataset to only keep rows where body_mass_g is equal to 2900.
== compares each row’s body_mass_g value with 2900 and returns only the rows that satisfy this condition. In other words, R will check if the values in body_mass_g are the same as 2900 (TRUE) or not (FALSE), and will do this for every row in the data set. Then at the end, it will discard all those that are FALSE, and keep those that are TRUE.
Challenge 5
Filter the dataset so you only keep observations from the “Dream” island.
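One possible solution, as a sketch assuming penguins comes from the palmerpenguins package. The name dream_penguins is illustrative.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# "Dream" is text, so it is written inside quotation marks
dream_penguins <- filter(penguins, island == "Dream")
dream_penguins
```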
## # A tibble: 124 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Dream 39.5 16.7 178 3250
## 2 Adelie Dream 37.2 18.1 178 3900
## 3 Adelie Dream 39.5 17.8 188 3300
## 4 Adelie Dream 40.9 18.9 184 3900
## 5 Adelie Dream 36.4 17 195 3325
## 6 Adelie Dream 39.2 21.1 196 4150
## 7 Adelie Dream 38.8 20 190 3950
## 8 Adelie Dream 42.2 18.5 180 3550
## 9 Adelie Dream 37.6 19.3 181 3300
## 10 Adelie Dream 39.8 19.1 184 4650
## # ℹ 114 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
# text (character) values must be wrapped in quotation marks to be read as text
# numbers can be read directly in R and do not require quotation marks
Multiple filters
Often, we want to apply several filters at once. What if you only want Adelie penguins that are below 3 kilos?
filter() can take as many statements as you want. Combine them by adding commas (,) between each statement, and that will work as ‘and’.
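A sketch of the comma form, assuming penguins comes from the palmerpenguins package. The name light_adelie is illustrative.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# Both conditions must be TRUE for a row to be kept
light_adelie <- filter(penguins, species == "Adelie", body_mass_g < 3000)
light_adelie
```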
## # A tibble: 7 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Dream 37.5 18.9 179 2975
## 2 Adelie Biscoe 34.5 18.1 187 2900
## 3 Adelie Biscoe 36.5 16.6 181 2850
## 4 Adelie Biscoe 36.4 17.1 184 2850
## 5 Adelie Dream 33.1 16.1 178 2900
## 6 Adelie Biscoe 37.9 18.6 193 2925
## 7 Adelie Torgersen 38.6 17 188 2900
## # ℹ 2 more variables: sex <fct>, year <int>
You can also use the & sign:
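The same filter written with &, as a sketch assuming penguins comes from the palmerpenguins package. The name light_adelie_and is illustrative.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# & joins the two conditions into one logical expression
light_adelie_and <- filter(penguins, species == "Adelie" & body_mass_g < 3000)
light_adelie_and
```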
## # A tibble: 7 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Dream 37.5 18.9 179 2975
## 2 Adelie Biscoe 34.5 18.1 187 2900
## 3 Adelie Biscoe 36.5 16.6 181 2850
## 4 Adelie Biscoe 36.4 17.1 184 2850
## 5 Adelie Dream 33.1 16.1 178 2900
## 6 Adelie Biscoe 37.9 18.6 193 2925
## 7 Adelie Torgersen 38.6 17 188 2900
## # ℹ 2 more variables: sex <fct>, year <int>
Challenge 6
Filter the data so you only have observations of male penguins of the Chinstrap species
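One possible solution, as a sketch assuming penguins comes from the palmerpenguins package. The name male_chinstraps is illustrative.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# Keep male penguins of the Chinstrap species
male_chinstraps <- filter(penguins, species == "Chinstrap", sex == "male")
male_chinstraps
```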
## # A tibble: 34 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Chinstrap Dream 50 19.5 196 3900
## 2 Chinstrap Dream 51.3 19.2 193 3650
## 3 Chinstrap Dream 52.7 19.8 197 3725
## 4 Chinstrap Dream 51.3 18.2 197 3750
## 5 Chinstrap Dream 51.3 19.9 198 3700
## 6 Chinstrap Dream 51.7 20.3 194 3775
## 7 Chinstrap Dream 52 18.1 201 4050
## 8 Chinstrap Dream 50.5 19.6 201 4050
## 9 Chinstrap Dream 50.3 20 197 3300
## 10 Chinstrap Dream 49.2 18.2 195 4400
## # ℹ 24 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
A tidy dataset follows specific principles that make it easier to manipulate, visualize, and analyze using tidyverse packages like dplyr, ggplot2, and tidyr:
- Each variable is in its own column.
- Each observation is in its own row.
- Each value is in its own cell.
Challenge 7
Filter the data so you only keep observations from the year 2008 or later and from Biscoe island
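One possible solution, as a sketch assuming penguins comes from the palmerpenguins package. The name biscoe_recent is illustrative.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# year is numeric, so 2008 needs no quotation marks; "Biscoe" is text and does
biscoe_recent <- filter(penguins, year >= 2008, island == "Biscoe")
biscoe_recent
```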
## # A tibble: 124 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Biscoe 39.6 17.7 186 3500
## 2 Adelie Biscoe 40.1 18.9 188 4300
## 3 Adelie Biscoe 35 17.9 190 3450
## 4 Adelie Biscoe 42 19.5 200 4050
## 5 Adelie Biscoe 34.5 18.1 187 2900
## 6 Adelie Biscoe 41.4 18.6 191 3700
## 7 Adelie Biscoe 39 17.5 186 3550
## 8 Adelie Biscoe 40.6 18.8 193 3800
## 9 Adelie Biscoe 36.5 16.6 181 2850
## 10 Adelie Biscoe 37.6 19.1 194 3750
## # ℹ 114 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
Understanding the difference between & (and) and | (or)
When using filter(), you often need to apply multiple conditions. The behavior of these conditions depends on whether you use:
- & (AND): Both conditions must be TRUE for a row to be included.
- | (OR): At least one condition must be TRUE for a row to be included.
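An AND filter can be sketched as follows, assuming penguins comes from the palmerpenguins package. The name light_chinstraps is illustrative.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# AND: a row is kept only if it is a Chinstrap and weighs less than 3000 grams
light_chinstraps <- filter(penguins, species == "Chinstrap" & body_mass_g < 3000)
light_chinstraps
```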
## # A tibble: 2 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Chinstrap Dream 43.2 16.6 187 2900
## 2 Chinstrap Dream 46.9 16.6 192 2700
## # ℹ 2 more variables: sex <fct>, year <int>
The above statement keeps rows where:
- The species is Chinstrap AND
- The body_mass_g is less than 3000 grams.
species     body_mass_g
Chinstrap   2800   (included)
Chinstrap   3200   (not included)
Adelie      2900   (not included)
But what if we want all the Chinstrap penguins, or any penguin with a body mass below 3 kilos? In this case, we will use the or character |.
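The OR version can be sketched like this, assuming penguins comes from the palmerpenguins package. The name chinstrap_or_light is illustrative.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# OR: a row is kept if it is a Chinstrap, or weighs less than 3000 grams, or both
chinstrap_or_light <- filter(penguins, species == "Chinstrap" | body_mass_g < 3000)
chinstrap_or_light
```

Because only one condition has to be TRUE, the result contains every Chinstrap penguin plus the light penguins of other species.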
## # A tibble: 75 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Dream 37.5 18.9 179 2975
## 2 Adelie Biscoe 34.5 18.1 187 2900
## 3 Adelie Biscoe 36.5 16.6 181 2850
## 4 Adelie Biscoe 36.4 17.1 184 2850
## 5 Adelie Dream 33.1 16.1 178 2900
## 6 Adelie Biscoe 37.9 18.6 193 2925
## 7 Adelie Torgers… 38.6 17 188 2900
## 8 Chinstrap Dream 46.5 17.9 192 3500
## 9 Chinstrap Dream 50 19.5 196 3900
## 10 Chinstrap Dream 51.3 19.2 193 3650
## # ℹ 65 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
The pipe |>
When you need to apply multiple functions in sequence (e.g., filter() followed by select()), you can use the pipe operator |>.
The pipe operator |> takes the result from the expression on the left and passes it as the first argument to the function on the right. This makes your code cleaner and easier to read, especially when you have several functions to apply.
You can enable the pipe in RStudio by going to Tools -> Global Options -> Code -> Use native pipe operator.
The shortcut to insert the pipe operator is Ctrl+Shift+M for Windows/Linux, and Cmd+Shift+M for Mac.
In the chinstraps example, we had the following code to filter the rows and then select our columns.
# Step 1: keep only the rows for Chinstrap penguins
chinstraps <- filter(penguins, species == "Chinstrap")
# Step 2: remove the columns whose names start with "bill"
chinstraps <- select(chinstraps, -starts_with("bill"))
Instead this can be rewritten as:
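A minimal sketch of the piped version, assuming penguins comes from the palmerpenguins package and R 4.1 or later for the native pipe:

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# Take penguins, and then keep Chinstrap rows, and then drop the "bill" columns
chinstraps <- penguins |>
  filter(species == "Chinstrap") |>
  select(-starts_with("bill"))

chinstraps
```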
You can read the pipe operator as “and then”.
So if we translate the code above to human language we could read it as:
“take the penguins data set, and then keep only rows for the chinstrap penguins, and then remove the columns starting with bill and assign the end result to chinstraps.”
Learning to read pipes is a great skill. R is not the only programming language that can do this: the operator differs between languages, but the functionality exists in many.