Lesson 3: Manipulating data with dplyr (Part II)

Introduction

Last time we started using dplyr and tidyr to manipulate data.
Today we will:
- Describe the purpose of the pipe operator (%>% or |>) and explain how it improves code readability.
- Explain the difference between modifying data (mutate) and reshaping data (pivot_longer, pivot_wider).
- More importantly, begin integrating or merging datasets with these functions.

Getting set up

Load any required packages:

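library(dplyr)     # filter(), select(), mutate(), group_by(), summarise()
library(tidyverse) # also attaches readr (read_csv) and tidyr (pivot_wider); dplyr is already part of it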

Go ahead and read in the rnaseq.csv data table:

rna <- read_csv("rnaseq.csv") # comma separated values

Use the glimpse command to check the structure of the dataset:

glimpse(rna)

Notice:

Number of rows
Column names
The sex column consists of both “Male” and “Female” values

Below, filter the dataset to include only samples where the sex column is “Male”. Then redirect the output to a new object called rna2:

rna2 <- filter(rna, sex == "Male")

Check the contents of rna2 below:

glimpse(rna2)

What changed?

Now select only the following columns from rna2:

gene
sample
tissue
expression

Redirect the output to rna3:

rna3 <- select(rna2, gene, sample, tissue, expression)

Instead of creating multiple intermediate objects, we can chain operations together using pipes.

Pipes

Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset.

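Pipes in R look like %>% (from the magrittr package, made available when you load dplyr or tidyverse) or |> (built into base R since version 4.1.0). If you use RStudio, you can type the pipe with Ctrl + Shift + M if you have a PC or Cmd + Shift + M if you have a Mac. By default this shortcut inserts %>%; RStudio has a setting to insert |> instead.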

rna |>
  filter(sex == "Male") |>
  select(gene, sample, tissue, expression)

Some may find it helpful to read the pipe like the word “then”. For instance, in the above example, we took the data frame rna, then we filtered for rows with sex == "Male", then we selected columns gene, sample, tissue, and expression.
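
To see why the “then” reading helps, here is a small sketch comparing the nested form with the piped form. The toy table is made up for illustration and only stands in for the real rna data; both versions should give the same result.

```r
library(dplyr)

# Hypothetical toy data, standing in for the real rna table
toy <- tibble(
  gene       = c("A", "B", "A", "B"),
  sample     = c("s1", "s1", "s2", "s2"),
  sex        = c("Male", "Male", "Female", "Female"),
  expression = c(10, 20, 30, 40)
)

# Nested form: has to be read from the inside out
nested <- select(filter(toy, sex == "Male"), gene, expression)

# Piped form: reads top to bottom, "take toy, then filter, then select"
piped <- toy |>
  filter(sex == "Male") |>
  select(gene, expression)

# Check that both forms produce the same table
identical(nested, piped)
```

With only two steps the nested form is still readable, but each extra step adds another layer of parentheses, while the piped form just adds another line.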

This brings us to an important concept in R.

R does not modify objects automatically. It follows a principle that can be summarized as “objects are not changed unless you explicitly reassign them”.

In the code above, R takes rna, filters it, selects columns, and then prints the result to the console. That’s it. R creates a temporary result in memory and displays it. It does not overwrite rna because you never told it to.

If you want to keep the result, you must assign it to an object. R borrows heavily from functional programming: functions such as filter() return a new object rather than modifying their input, which prevents accidental data destruction. You can confirm this by checking the number of rows in rna before and after running a pipeline:

nrow(rna)

rna |>
  filter(sex == "Male")

nrow(rna)
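
The same idea can be tried on a tiny table, where the row counts are easy to check by eye. This is only a sketch with made-up data (toy and toy_male are hypothetical names); it contrasts printing a result with storing it.

```r
library(dplyr)

# Hypothetical toy table, not the real rna data
toy <- tibble(
  sample = c("s1", "s2", "s3"),
  sex    = c("Male", "Female", "Male")
)

# Without assignment: the filtered result is printed and then discarded
toy |>
  filter(sex == "Male")
nrow(toy)       # toy itself still has all of its rows

# With assignment: the filtered result is stored under a new name
toy_male <- toy |>
  filter(sex == "Male")
nrow(toy_male)  # only the rows that passed the filter
```

Storing the result under a new name (rather than overwriting toy) keeps the original available in case a later step needs it.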

Challenge

Take the next few minutes to:

Subset the rna dataset to include only observations from female mice at time 0, where the gene expression value is greater than 50,000.

Retain only the following columns:

gene
sample
time
expression
age

Redirect the resulting dataset to rna4.

rna4 <- rna |>
  filter(expression > 50000,
         sex == "Female",
         time == 0 ) |>
  select(gene, sample, time, expression, age)

Mutate

So far we have:

Filtered rows (kept certain observations)
Selected columns (kept certain variables)

But what if we don’t want to remove anything? What if we want to add new information?

Now we will learn how to create new columns based on existing columns using mutate().

What does mutate() do?

Keeps all existing rows and columns
Adds or modifies columns

It does not remove anything unless you explicitly overwrite a column.
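
A minimal sketch of the difference between adding and overwriting a column, using a made-up table (the column name expression_k is hypothetical and chosen only for this illustration):

```r
library(dplyr)

# Hypothetical toy table, not the real rna data
toy <- tibble(sample = c("s1", "s2"), expression = c(1500, 52000))

# New name on the left of =: a column is added, the original column stays
added <- toy |>
  mutate(expression_k = expression / 1000)

# Existing name on the left of =: that column is overwritten in the result
overwritten <- toy |>
  mutate(expression = expression / 1000)

names(added)        # original columns plus the new one
names(overwritten)  # same columns as toy, but expression now holds new values
```

In both cases toy itself is unchanged, because neither result was assigned back to toy.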

Creating a Time Column in Hours

We have a column called time that is measured in days. We would like to create a new column that shows the time in hours. To do so, we multiply time by 24.

rna |>
  mutate(time_hours = time * 24)

So far, there are three operators to keep track of:

<- : assigns a value to an object in your R environment
= : most commonly used inside function calls to give an argument (or a new column) a value
== : does not assign anything; it performs a comparison and returns TRUE or FALSE
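
The three operators can be seen side by side in a short sketch. The table and object names here (toy, toy_hours, toy_male) are made up for illustration.

```r
library(dplyr)

# Hypothetical toy table, not the real rna data
toy <- tibble(sex = c("Male", "Female"), time = c(0, 8))

# <- stores the result as an object in the environment;
# =  names the new column inside mutate()
toy_hours <- toy |>
  mutate(time_hours = time * 24)

# == compares each value and returns TRUE or FALSE, without assigning anything
toy$sex == "Male"

# filter() keeps the rows where that comparison is TRUE
toy_male <- filter(toy, sex == "Male")
```

A common slip is writing filter(toy, sex = "Male") with a single =, which dplyr reports as an error rather than treating it as a comparison.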

Also note:

We didn’t remove anything
We added a new column called time_hours at the end

rna |>
  mutate(time_hours = time * 24) |>
  select(time, time_hours)

Grouped Operations: Split-apply-combine

R calculated time * 24 for every row independently, across the entire dataset.

But what if we don’t want to treat all rows the same?

What if we want to calculate something within categories? Then we would no longer be transforming individual rows, but analyzing a group of rows.

Below we begin by telling R: “Take all rows that belong to the same gene, and analyze them together.”

This changes how subsequent functions behave! The group_by() function does not calculate anything; it just prepares the data for grouped operations.

rna |>
  group_by(gene)

rna |>
  group_by(gene) |>
  glimpse()

group_by(gene) organizes the data so that all rows for the same gene are treated as a group.

What the output tells you:

There are 32,428 rows (observations)
There are 19 columns (variables)
The data is grouped into 1,474 genes, meaning there are 1,474 unique gene groups
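
Instead of reading these numbers off the printed output, the grouping can also be inspected directly. The sketch below uses a made-up table and the dplyr helpers n_groups(), group_vars(), is_grouped_df(), and ungroup().

```r
library(dplyr)

# Hypothetical toy table, not the real rna data
toy <- tibble(
  gene       = c("A", "A", "B", "B", "C"),
  expression = c(1, 2, 3, 4, 5)
)

grouped <- toy |>
  group_by(gene)

nrow(grouped)        # grouping does not add or remove rows
n_groups(grouped)    # number of distinct gene groups
group_vars(grouped)  # which column(s) define the groups

# ungroup() removes the grouping again
is_grouped_df(grouped)
is_grouped_df(ungroup(grouped))
```

Calling ungroup() once a grouped calculation is finished is a useful habit, since a leftover grouping silently affects later steps.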

Grouping is usually followed by a summary step. Together, these steps are known as the split-apply-combine paradigm:

Split the data into groups using group_by()
Apply a summary calculation to each group
Combine the results

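The data has been grouped by gene; now we can collapse each group into a single-row summary using summarize(). (summarise() is the same function under its British spelling, and is the form used in the code below.)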

mean_expression is a new column containing the average expression for each gene
The result has one row per gene, instead of one row per sample
Other columns are automatically dropped unless included in the grouping

rna |>
  group_by(gene) |>
  summarise(mean_expression = mean(expression))

What is this code doing? This code calculates the average expression level for each gene across all samples in the dataset. For each gene, we are collapsing all measurements into a single number that represents its overall average activity in the experiment. This average is taken across everything in the dataset (e.g., different tissues, time points, sexes, treatments).
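
To make the split, apply, and combine steps visible, here is a sketch that performs them one at a time with base R on a made-up table and then compares the result with the dplyr version. The base R route is shown only for illustration; it is not how dplyr works internally.

```r
library(dplyr)

# Hypothetical toy table, not the real rna data
toy <- tibble(
  gene       = c("A", "A", "B", "B"),
  expression = c(10, 30, 100, 300)
)

# Split: one vector of expression values per gene
pieces <- split(toy$expression, toy$gene)

# Apply: compute the mean of each piece
means <- sapply(pieces, mean)

# Combine: put the per-gene results back into one table
by_hand <- tibble(gene = names(means), mean_expression = unname(means))

# The dplyr version expresses all three steps in one pipeline
by_dplyr <- toy |>
  group_by(gene) |>
  summarise(mean_expression = mean(expression))

by_hand
by_dplyr
```

Both tables have one row per gene, which is the “collapse” described above: four measurements go in, two averages come out.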

Challenge

Take the next five minutes to add multiple summaries at once.

Above you created a new column with the mean expression. Now add max_expression and min_expression columns.

rna |>
  group_by(gene) |>
  summarise(
    mean_expression = mean(expression),
    max_expression = max(expression),
    min_expression = min(expression)
  )

Counting

So far, we’ve learned how to:

Split data into groups using group_by()
Collapse each group into a summary using summarize()

But sometimes, we don’t need to calculate a mean, max, or min. Sometimes we just want to know: How many observations are in each group?

In other words, we just want to count rows per category.

rna |> 
  count(infection)

This output shows the number of observations in each infection category. There are 22,110 rows corresponding to samples infected with Influenza A, and 10,318 rows corresponding to non-infected samples. This tells us that the dataset contains substantially more observations from infected samples than from non-infected ones. In other words, the data are not evenly balanced between the two infection groups.
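
count() can be thought of as a shortcut for a grouped summary that uses n(), the dplyr function that returns the number of rows in the current group. The sketch below checks this on a made-up table; the infection labels are placeholders and may not match the spelling in rnaseq.csv.

```r
library(dplyr)

# Hypothetical toy table; labels are placeholders, not the real values
toy <- tibble(
  infection = c("InfluenzaA", "InfluenzaA", "NonInfected", "InfluenzaA"),
  sample    = c("s1", "s2", "s3", "s4")
)

# Short form
short <- toy |>
  count(infection)

# Long form: group, then count the rows in each group with n()
long <- toy |>
  group_by(infection) |>
  summarise(n = n())

short
long
```

The long form is worth knowing because n() can sit next to other summaries, such as mean(), inside the same summarise() call.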

Challenge

Modify the code above so that you count the number of observations for each combination of infection and time. Then pipe the output into arrange() to sort the table by time.

Based on this new output:

Does the distribution of samples look more or less balanced than when we ignored time?

rna |>
  count(infection, time) |>
  arrange(time)

Reshaping data

Take the next five minutes to carefully compare the two pieces of code below.

rna_exp <- rna |>
  select(gene, sample, expression)
rna_exp

rna_wide <- rna_exp |>
  pivot_wider(names_from = sample,
              values_from = expression)
rna_wide

Discuss the following:

What does the first code chunk change about the rna dataset?

It reduces the dataset to only three columns: gene, sample, and expression.
It does not change the structure of the data.
Each row represents one measurement of expression for one gene in one sample.

What does the second code chunk change about the rna dataset?

It reshapes the data from long to wide format.
The values in the sample column become new column names.
The number of rows decreases (one row per gene).

How does the structure of the data differ between rna_exp and rna_wide? Run the code below!

str(rna_exp)

str(rna_wide)

In rna_exp, what does each row represent?

Each row represents a single observation: the expression of one gene in one specific sample. As a result, there are multiple rows for each gene, one for each sample.

In rna_wide, what does each row represent?

Each row represents one gene, with all its expression values spread across multiple columns. There is exactly one row per gene, regardless of how many samples there are.

When might the wide format be more useful than the long format?

Preparing data for modeling or matrix operations. For example, a gene expression matrix for principal component analysis (PCA) or clustering is typically built with one row per gene and one column per sample.
Computing correlations between samples, or modeling them
By contrast, long format is preferred when plotting with ggplot2 or using tidy workflows.
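
Because plotting usually calls for the long format, it helps to know the way back. pivot_longer() is the counterpart of pivot_wider(); the sketch below does a round trip on a made-up table (toy_long, toy_wide, and toy_back are hypothetical names).

```r
library(dplyr)
library(tidyr)

# Hypothetical toy table in long format, not the real rna data
toy_long <- tibble(
  gene       = c("A", "A", "B", "B"),
  sample     = c("s1", "s2", "s1", "s2"),
  expression = c(10, 20, 30, 40)
)

# Long to wide: one row per gene, one column per sample
toy_wide <- toy_long |>
  pivot_wider(names_from = sample, values_from = expression)

# Wide to long: every column except gene is stacked back into
# a sample column (old column names) and an expression column (old values)
toy_back <- toy_wide |>
  pivot_longer(cols = -gene, names_to = "sample", values_to = "expression")

toy_wide
toy_back
```

Note that names_from and values_from take bare column names, while names_to and values_to take quoted strings, because those columns do not exist yet.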