Introduction To Tidyverse Part II

Load Library

Load the packages required for subsequent analyses.

library(palmerpenguins)
library(tidyverse)

Today’s Lesson

Questions

  • How can I subset the number of columns in my data set?
  • How can I reduce the number of rows in my data set?

Objectives

  • Use select() to reduce columns
  • Use tidyselectors like starts_with() within select() to reduce columns
  • Use filter() to reduce rows
  • Understand the difference between & and |
  • Use the pipe |> to chain commands together

Challenge 1

Change the geometric object from geom_jitter to geom_point - what is the difference in the plots? Can you explain why?

ggplot(data = penguins,
       mapping = aes(x = island, 
                     y = bill_length_mm)) + 
  geom_jitter(aes(color = island)) +
  geom_boxplot(alpha = .6) 

#alpha takes a value from 0 (transparent) to 1 (solid).

Both geom_point() and geom_jitter() are used to create scatterplots, but they behave slightly differently when plotting data points.

geom_point() - What it does:

  • Plots the exact coordinates of your data points.

  • Each point is placed exactly at its x and y values.

When to use:

  • When you want to visualize the precise relationship between two continuous variables.

  • Suitable when data points do not overlap or crowd.


geom_jitter() - What it does:

  • Adds random noise (jittering) to the points.

  • Prevents overplotting by slightly shifting points horizontally and/or vertically.

When to use:

  • When you have overlapping points that need to be spread out to avoid clutter.
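To see the difference side by side, here is a minimal sketch that draws the same data with each geom. The object names p_point and p_jitter, and the width and height values, are illustrative choices rather than part of the original challenge.

```r
library(ggplot2)
library(palmerpenguins)

# Exact positions: all penguins from one island sit on a single vertical line,
# so many points are drawn on top of each other
p_point <- ggplot(data = penguins,
                  mapping = aes(x = island, y = bill_length_mm)) +
  geom_point(aes(color = island))

# Jittered positions: points are nudged sideways by a small random amount.
# height = 0 switches off vertical jitter, so bill lengths are still exact.
p_jitter <- ggplot(data = penguins,
                   mapping = aes(x = island, y = bill_length_mm)) +
  geom_jitter(aes(color = island), width = 0.2, height = 0)

p_point
p_jitter
```

Because the jitter is random, the jittered plot looks slightly different each time it is drawn.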

Selecting columns with the {dplyr} package

The {dplyr} package provides a number of very useful functions for manipulating data sets in a way that will reduce the probability of making errors.

We’re going to cover 6 of the most commonly used functions as well as using pipes (|>) to combine them.

  1. select()
  2. filter()
  3. arrange()
  4. mutate()
  5. group_by()
  6. summarize()

Challenge 2

Using select(), select the first four columns of the penguins data. Assign this new dataset to penguins_new.

penguins <- penguins
penguins_new <- select(penguins, 1:4)
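Selecting by position is short, but it breaks silently if the column order ever changes. As a sketch, the same four columns can also be selected by name or as a range of names; the object names by_position, by_name, and by_range are made up for this example.

```r
library(dplyr)
library(palmerpenguins)

# By position: columns 1 to 4
by_position <- select(penguins, 1:4)

# By name: list each column you want to keep
by_name <- select(penguins, species, island, bill_length_mm, bill_depth_mm)

# By range: every column from species through bill_depth_mm
by_range <- select(penguins, species:bill_depth_mm)

names(by_position)
identical(names(by_position), names(by_name))   # TRUE
identical(names(by_position), names(by_range))  # TRUE
```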

Tidy selectors

As a reminder, here are the column names:

  • species: penguin species
  • island: island of observation
  • bill_length_mm: bill length in millimetres
  • bill_depth_mm: bill depth in millimetres
  • flipper_length_mm: flipper length in millimetres
  • body_mass_g: body mass in grams
  • sex: penguin sex
  • year: year of observation

What if we want to choose all the columns with millimeter measurements? We could use something called tidy-selectors.

Here, we use the tidy-selector ends_with(). It looks for columns whose names end with the string you provide. Here we specify "mm".

select(penguins, ends_with("mm"))
## # A tibble: 344 × 3
##    bill_length_mm bill_depth_mm flipper_length_mm
##             <dbl>         <dbl>             <int>
##  1           39.1          18.7               181
##  2           39.5          17.4               186
##  3           40.3          18                 195
##  4           NA            NA                  NA
##  5           36.7          19.3               193
##  6           39.3          20.6               190
##  7           38.9          17.8               181
##  8           39.2          19.6               195
##  9           34.1          18.1               193
## 10           42            20.2               190
## # ℹ 334 more rows

So convenient! There are several other tidy-selectors you can choose from (they are listed in the tidyselect package documentation), but people most often use three:

  • ends_with() - column names ending with a character string
  • starts_with() - column names starting with a character string
  • contains() - column names containing a character string
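The other two selectors work the same way as ends_with(). A short sketch, using the penguins columns listed above (the object names bill_cols and length_cols are illustrative):

```r
library(dplyr)
library(palmerpenguins)

# Columns whose names start with "bill"
bill_cols <- select(penguins, starts_with("bill"))
names(bill_cols)    # "bill_length_mm" "bill_depth_mm"

# Columns whose names contain "length" anywhere
length_cols <- select(penguins, contains("length"))
names(length_cols)  # "bill_length_mm" "flipper_length_mm"
```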

Challenge 3

Using one of the tidy-selectors, select all columns containing an underscore ("_").

select(penguins, contains("_"))
## # A tibble: 344 × 4
##    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##             <dbl>         <dbl>             <int>       <int>
##  1           39.1          18.7               181        3750
##  2           39.5          17.4               186        3800
##  3           40.3          18                 195        3250
##  4           NA            NA                  NA          NA
##  5           36.7          19.3               193        3450
##  6           39.3          20.6               190        3650
##  7           38.9          17.8               181        3625
##  8           39.2          19.6               195        4675
##  9           34.1          18.1               193        3475
## 10           42            20.2               190        4250
## # ℹ 334 more rows

Challenge 4

Select the species and sex columns, in addition to all columns ending with “mm”.

select(penguins, species, sex, ends_with("mm"))
## # A tibble: 344 × 5
##    species sex    bill_length_mm bill_depth_mm flipper_length_mm
##    <fct>   <fct>           <dbl>         <dbl>             <int>
##  1 Adelie  male             39.1          18.7               181
##  2 Adelie  female           39.5          17.4               186
##  3 Adelie  female           40.3          18                 195
##  4 Adelie  <NA>             NA            NA                  NA
##  5 Adelie  female           36.7          19.3               193
##  6 Adelie  male             39.3          20.6               190
##  7 Adelie  female           38.9          17.8               181
##  8 Adelie  male             39.2          19.6               195
##  9 Adelie  <NA>             34.1          18.1               193
## 10 Adelie  <NA>             42            20.2               190
## # ℹ 334 more rows

Filtering rows

Now that we know how to select the columns we want, we should take a look at how we filter the rows.

The filter() function is used to subset a data frame retaining all rows that satisfy your conditions. To be retained, the row must produce a value of TRUE for all conditions.

Let’s use filter() to keep any penguin with a body mass of less than 3000 grams.

filter(penguins, body_mass_g < 3000)
## # A tibble: 9 × 8
##   species   island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>     <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie    Dream               37.5          18.9               179        2975
## 2 Adelie    Biscoe              34.5          18.1               187        2900
## 3 Adelie    Biscoe              36.5          16.6               181        2850
## 4 Adelie    Biscoe              36.4          17.1               184        2850
## 5 Adelie    Dream               33.1          16.1               178        2900
## 6 Adelie    Biscoe              37.9          18.6               193        2925
## 7 Adelie    Torgersen           38.6          17                 188        2900
## 8 Chinstrap Dream               43.2          16.6               187        2900
## 9 Chinstrap Dream               46.9          16.6               192        2700
## # ℹ 2 more variables: sex <fct>, year <int>

Above, we have filtered the data so that only observations with a body mass of less than 3 kilos are kept.
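Because a row must produce TRUE to be retained, rows where the condition cannot be evaluated (the value is NA) are dropped as well. The sketch below checks this on body_mass_g, which has missing values in this data set; the object names light, heavy, and missing_mass are only illustrative.

```r
library(dplyr)
library(palmerpenguins)

light <- filter(penguins, body_mass_g < 3000)
heavy <- filter(penguins, body_mass_g >= 3000)

# Rows with a missing body mass match neither condition,
# so they have to be asked for explicitly with is.na()
missing_mass <- filter(penguins, is.na(body_mass_g))

nrow(penguins)      # 344
nrow(light)         # 9
nrow(heavy)         # 333
nrow(missing_mass)  # 2
```

The light and heavy groups together contain 342 rows, not 344: the two penguins without a recorded body mass appear in neither.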

The output shows several penguins with a body mass of exactly 2900. What if we want only those rows?

We need to use double equals (==).

In R, = and == have different meanings.

  • = is used to assign values to arguments in functions.

  • == is used to compare values.

#filter(penguins, body_mass_g = 2900)

What happens above?

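  • This is incorrect usage because = does not compare anything. Inside a function call, = gives an argument a name, so filter() receives an argument named body_mass_g with the value 2900.

  • filter() expects a logical condition to subset rows but instead gets a named argument, so it stops with an error. Recent versions of dplyr even suggest using == in the error message.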

filter(penguins, body_mass_g == 2900)
## # A tibble: 4 × 8
##   species   island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>     <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie    Biscoe              34.5          18.1               187        2900
## 2 Adelie    Dream               33.1          16.1               178        2900
## 3 Adelie    Torgersen           38.6          17                 188        2900
## 4 Chinstrap Dream               43.2          16.6               187        2900
## # ℹ 2 more variables: sex <fct>, year <int>

What happens above?

  • This correctly filters the penguins dataset to only keep rows where body_mass_g is equal to 2900.

  • == compares each row’s body_mass_g value with 2900 and returns only the rows that satisfy this condition.

  • In other words, R will check if the values in body_mass_g are the same as 2900 (TRUE) or not (FALSE), and will do this for every row in the data set. Then at the end, it will discard all those that are FALSE, and keep those that are TRUE.
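The row-by-row check can be made visible by running the comparison on its own, outside filter(). This is a small sketch; is_2900 and kept are illustrative names.

```r
library(dplyr)
library(palmerpenguins)

# One TRUE/FALSE (or NA, if the body mass is missing) per row
is_2900 <- penguins$body_mass_g == 2900
head(is_2900)

# Counting the TRUE values gives the number of rows filter() will keep
sum(is_2900, na.rm = TRUE)  # 4

kept <- filter(penguins, body_mass_g == 2900)
nrow(kept)                  # 4
```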


Challenge 5

Filter the dataset so you only keep observations from the “Dream” island.

filter(penguins, island == "Dream")
## # A tibble: 124 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Dream            39.5          16.7               178        3250
##  2 Adelie  Dream            37.2          18.1               178        3900
##  3 Adelie  Dream            39.5          17.8               188        3300
##  4 Adelie  Dream            40.9          18.9               184        3900
##  5 Adelie  Dream            36.4          17                 195        3325
##  6 Adelie  Dream            39.2          21.1               196        4150
##  7 Adelie  Dream            38.8          20                 190        3950
##  8 Adelie  Dream            42.2          18.5               180        3550
##  9 Adelie  Dream            37.6          19.3               181        3300
## 10 Adelie  Dream            39.8          19.1               184        4650
## # ℹ 114 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
# text (character) values must be wrapped in quotation marks
# numbers can be written directly and do not need quotation marks

Multiple filters

Often we want to apply several filters at once. What if you only want Adelie penguins that weigh less than 3 kilos?

filter() can take as many conditions as you want. Separate them with commas (,), and filter() combines them with ‘and’: a row is kept only if every condition is TRUE.

filter(penguins, 
       species == "Adelie",
       body_mass_g < 3000)
## # A tibble: 7 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Dream               37.5          18.9               179        2975
## 2 Adelie  Biscoe              34.5          18.1               187        2900
## 3 Adelie  Biscoe              36.5          16.6               181        2850
## 4 Adelie  Biscoe              36.4          17.1               184        2850
## 5 Adelie  Dream               33.1          16.1               178        2900
## 6 Adelie  Biscoe              37.9          18.6               193        2925
## 7 Adelie  Torgersen           38.6          17                 188        2900
## # ℹ 2 more variables: sex <fct>, year <int>

You can also use the & sign:

filter(penguins, 
       species == "Adelie" &
         body_mass_g < 3000)
## # A tibble: 7 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Dream               37.5          18.9               179        2975
## 2 Adelie  Biscoe              34.5          18.1               187        2900
## 3 Adelie  Biscoe              36.5          16.6               181        2850
## 4 Adelie  Biscoe              36.4          17.1               184        2850
## 5 Adelie  Dream               33.1          16.1               178        2900
## 6 Adelie  Biscoe              37.9          18.6               193        2925
## 7 Adelie  Torgersen           38.6          17                 188        2900
## # ℹ 2 more variables: sex <fct>, year <int>
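To confirm that the comma and & give the same result, the two outputs can be stored and compared. A minimal sketch, with with_comma and with_and as made-up object names:

```r
library(dplyr)
library(palmerpenguins)

with_comma <- filter(penguins,
                     species == "Adelie",
                     body_mass_g < 3000)

with_and <- filter(penguins,
                   species == "Adelie" &
                     body_mass_g < 3000)

# Both versions keep exactly the same 7 rows
nrow(with_comma)
identical(with_comma, with_and)
```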

Challenge 6

Filter the data so you only have observations of male penguins of the Chinstrap species.

filter(penguins, 
       sex == "male",
       species == "Chinstrap")
## # A tibble: 34 × 8
##    species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Chinstrap Dream            50            19.5               196        3900
##  2 Chinstrap Dream            51.3          19.2               193        3650
##  3 Chinstrap Dream            52.7          19.8               197        3725
##  4 Chinstrap Dream            51.3          18.2               197        3750
##  5 Chinstrap Dream            51.3          19.9               198        3700
##  6 Chinstrap Dream            51.7          20.3               194        3775
##  7 Chinstrap Dream            52            18.1               201        4050
##  8 Chinstrap Dream            50.5          19.6               201        4050
##  9 Chinstrap Dream            50.3          20                 197        3300
## 10 Chinstrap Dream            49.2          18.2               195        4400
## # ℹ 24 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

A tidy dataset follows specific principles that make it easier to manipulate, visualize, and analyze with tidyverse packages such as dplyr, ggplot2, and tidyr:

  1. Each variable is in its own column.

  2. Each observation is in its own row.

  3. Each value is in its own cell.

Challenge 7

Filter the data so you only keep observations from the year 2008 or later and from Biscoe island.

filter(penguins, 
       year >= 2008,
       island == "Biscoe")
## # A tibble: 124 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Biscoe           39.6          17.7               186        3500
##  2 Adelie  Biscoe           40.1          18.9               188        4300
##  3 Adelie  Biscoe           35            17.9               190        3450
##  4 Adelie  Biscoe           42            19.5               200        4050
##  5 Adelie  Biscoe           34.5          18.1               187        2900
##  6 Adelie  Biscoe           41.4          18.6               191        3700
##  7 Adelie  Biscoe           39            17.5               186        3550
##  8 Adelie  Biscoe           40.6          18.8               193        3800
##  9 Adelie  Biscoe           36.5          16.6               181        2850
## 10 Adelie  Biscoe           37.6          19.1               194        3750
## # ℹ 114 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
# >= means "greater than or equal to", so this keeps rows from 2008 or later; the second condition keeps only Biscoe island

Understanding the difference between & (and) and | (or)

When using filter(), you often need to apply multiple conditions. The behavior of these conditions depends on whether you use:

  • & – AND: Both conditions must be TRUE for a row to be included.

  • | – OR: At least one condition must be TRUE for a row to be included.

filter(penguins, 
       species == "Chinstrap" & 
         body_mass_g < 3000)
## # A tibble: 2 × 8
##   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
## 1 Chinstrap Dream            43.2          16.6               187        2900
## 2 Chinstrap Dream            46.9          16.6               192        2700
## # ℹ 2 more variables: sex <fct>, year <int>

The above statement keeps rows where:

  • The species is Chinstrap AND

  • The body_mass_g is less than 3000 grams.

  species      body_mass_g
  Chinstrap    2800 (included)
  Chinstrap    3200 (not included)
  Adelie       2900 (not included)

But what if we want all the Chinstrap penguins, plus any penguin with a body mass below 3 kilos? In this case, we use the or operator |.

filter(penguins, 
       species == "Chinstrap" | 
         body_mass_g < 3000)
## # A tibble: 75 × 8
##    species   island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>     <fct>             <dbl>         <dbl>             <int>       <int>
##  1 Adelie    Dream              37.5          18.9               179        2975
##  2 Adelie    Biscoe             34.5          18.1               187        2900
##  3 Adelie    Biscoe             36.5          16.6               181        2850
##  4 Adelie    Biscoe             36.4          17.1               184        2850
##  5 Adelie    Dream              33.1          16.1               178        2900
##  6 Adelie    Biscoe             37.9          18.6               193        2925
##  7 Adelie    Torgers…           38.6          17                 188        2900
##  8 Chinstrap Dream              46.5          17.9               192        3500
##  9 Chinstrap Dream              50            19.5               196        3900
## 10 Chinstrap Dream              51.3          19.2               193        3650
## # ℹ 65 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
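The row counts show how & and | relate. With |, a penguin that satisfies both conditions is still kept only once, so the 'or' count equals the two separate counts added together minus the 'and' count. The sketch below checks this; all object names are illustrative.

```r
library(dplyr)
library(palmerpenguins)

n_chinstrap <- nrow(filter(penguins, species == "Chinstrap"))  # 68
n_light     <- nrow(filter(penguins, body_mass_g < 3000))      # 9

# AND: both conditions must be TRUE
n_both <- nrow(filter(penguins,
                      species == "Chinstrap" & body_mass_g < 3000))    # 2

# OR: at least one condition must be TRUE
n_either <- nrow(filter(penguins,
                        species == "Chinstrap" | body_mass_g < 3000))  # 75

# 68 + 9 - 2 = 75
n_either == n_chinstrap + n_light - n_both
```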

Challenge 8

Filter the data so you only have observations of either male penguins or the Chinstrap species and assign this to the object chinstrap_males.

chinstrap_males <- filter(penguins, 
                          sex == "male" |
                            species == "Chinstrap")

The pipe |>

When you need to apply multiple functions in sequence (e.g., filter() followed by select()), you can use the pipe operator |>.

  • The pipe operator |> takes the result from the expression on the left and passes it as the first argument to the function on the right.

  • This makes your code cleaner and easier to read, especially when you have several functions to apply.

  • The native pipe |> requires R 4.1.0 or later. In RStudio, you can make the keyboard shortcut insert |> by going to Tools -> Global Options -> Code -> Use native pipe operator.

The shortcut to insert the pipe operator is Ctrl+Shift+M for Windows/Linux, and Cmd+Shift+M for Mac.
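As a first small sketch of what the pipe does, here is the same calculation written with nested calls and with pipes. It uses only base R functions; the names values, nested, and piped are made up for this example.

```r
values <- c(1, 4, 9)

# Nested calls are read from the inside out: sqrt() runs first, then sum()
nested <- sum(sqrt(values))

# Piped calls are read from left to right: values goes into sqrt(),
# and the result of sqrt() goes into sum()
piped <- values |> sqrt() |> sum()

nested                    # 6
identical(nested, piped)  # TRUE
```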

#try it here!

In the chinstraps example, we had the following code to filter the rows and then select our columns.

# Step 1
chinstraps <- filter(penguins, species == "Chinstrap")

# Step 2
chinstraps <- select(chinstraps, -starts_with("bill")) # drop the columns that start with "bill"

Instead this can be rewritten as:

chinstraps <- penguins |> 
  filter(species == "Chinstrap") |> 
  select(-starts_with("bill"))

You can read the pipe operator as “and then”.

So if we translate the code above to human language we could read it as:

“take the penguins data set, and then keep only rows for the chinstrap penguins, and then remove the columns starting with bill and assign the end result to chinstraps.”

Learning to read pipes is a great skill. R is not the only programming language that can do this: the operator differs between languages, but the functionality exists in many.

Challenge 9

Translate the following sentence into code:

“take the penguins data set, and then keep only data from the Biscoe island, and then select only the first 4 columns and assign the end result to biscoe.”

biscoe <- penguins |> 
  filter(island == "Biscoe") |> 
  select(1:4)