Lesson 3: Manipulating data with dplyr (Part II)
Introduction
Last time we started using dplyr and tidyr to manipulate data. Today we will:
- Describe the purpose of the pipe operator (%>% or |>) and explain how it improves code readability.
- Explain the difference between modifying data (mutate) and reshaping data (pivot_longer, pivot_wider).
- More importantly, integrate (merge) datasets using these functions.
Getting set up
Load any required packages:
Go ahead and read in the rnaseq.csv data table:
Use the glimpse command to check the structure of the
dataset:
Notice:
- Number of rows
- Column names
- The sex column contains both “Male” and “Female” values
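As a rough sketch of the inspection step above, the code below builds a small hand-made data frame in place of rnaseq.csv and passes it to glimpse(). The gene names, sample names, and values are invented for illustration and are not taken from the real file.

```r
library(dplyr)

# Toy stand-in for rnaseq.csv; all values are invented for illustration
rna <- data.frame(
  gene       = c("Asl", "Asl", "Apod", "Apod"),
  sample     = c("S1", "S2", "S1", "S2"),
  tissue     = "Cerebellum",
  sex        = c("Male", "Female", "Male", "Female"),
  expression = c(1000, 2000, 30000, 10000)
)

# glimpse() prints the number of rows and columns, then one line per column
glimpse(rna)

# unique() lists the distinct values stored in a column
unique(rna$sex)
```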
Below, filter the dataset to include only samples where the
sex column is “Male”. Then redirect the output to a new
object called rna2
Check the contents of rna2 below:
What changed?
Now select only the following columns from rna2:
- gene
- sample
- tissue
- expression.
Redirect the output to rna3:
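The two steps above might look like the following sketch. It runs on a toy data frame with invented values rather than the real rnaseq.csv table, so the row counts will differ from what you see in class.

```r
library(dplyr)

# Toy stand-in for rnaseq.csv; all values are invented for illustration
rna <- data.frame(
  gene       = c("Asl", "Asl", "Apod", "Apod"),
  sample     = c("S1", "S2", "S1", "S2"),
  tissue     = "Cerebellum",
  sex        = c("Male", "Female", "Male", "Female"),
  expression = c(1000, 2000, 30000, 10000)
)

# Step 1: keep only the rows where sex is "Male"
rna2 <- filter(rna, sex == "Male")

# Step 2: keep only four of the columns
rna3 <- select(rna2, gene, sample, tissue, expression)
rna3
```

Note that rna2 exists only to be handed to the next step.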
Instead of creating multiple intermediate objects, we can chain operations together using pipes.
Pipes
Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset.
Pipes in R look like %>% (made available via the
magrittr package) or |> (through base R). If you use
RStudio, you can type the pipe with Ctrl +
Shift + M if you have a PC or Cmd
+ Shift + M if you have a Mac.
Some may find it helpful to read the pipe like the word “then”. For
instance, in the above example, we took the data frame rna,
then we filtered for rows with sex == "Male", then we
selected columns gene, sample,
tissue, and expression.
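A minimal sketch of that chain, written with both pipe operators, is shown here. It assumes a toy data frame with invented values in place of the real rna table, and the base R pipe requires R 4.1 or later.

```r
library(dplyr)

# Toy stand-in for the rna table; all values are invented for illustration
rna <- data.frame(
  gene       = c("Asl", "Asl", "Apod", "Apod"),
  sample     = c("S1", "S2", "S1", "S2"),
  tissue     = "Cerebellum",
  sex        = c("Male", "Female", "Male", "Female"),
  expression = c(1000, 2000, 30000, 10000)
)

# Read each pipe as "then": take rna, then filter, then select
rna %>%
  filter(sex == "Male") %>%
  select(gene, sample, tissue, expression)

# The same chain with the base R pipe (requires R 4.1 or later)
rna |>
  filter(sex == "Male") |>
  select(gene, sample, tissue, expression)
```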
An important concept to keep in mind when using R: R does not modify objects automatically. It follows the principle that objects are not changed unless you explicitly reassign them.
R is taking rna, filtering it, selecting columns, and
then printing the results to the console. That’s it. R creates a
temporary result in memory and displays it. It does not overwrite
rna because you never told it to.
If you want to keep the result, you must save it by assigning it to an object. R behaves like a functional programming language here: functions such as filter() and select() return a new object and leave their input unchanged, which prevents accidental data destruction.
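The difference between printing a result and saving it can be checked with a small sketch. The data frame below is a toy with invented values, and rna_male is a hypothetical object name chosen for this example.

```r
library(dplyr)

# Toy stand-in for the rna table; all values are invented for illustration
rna <- data.frame(
  gene       = c("Asl", "Asl", "Apod", "Apod"),
  sex        = c("Male", "Female", "Male", "Female"),
  expression = c(1000, 2000, 30000, 10000)
)

# Not assigned: the filtered result is printed and then discarded
rna %>% filter(sex == "Male")

# rna itself is untouched and still has all of its rows
nrow(rna)

# Assigned: the filtered result is stored under a new name
rna_male <- rna %>% filter(sex == "Male")
nrow(rna_male)
```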
Challenge
Take the next few minutes to:
Subset the rna dataset to include only observations from
female mice at time 0, where the gene
expression value is greater than 50,000.
Retain only the following columns:
- gene
- sample
- time
- expression
- age
Redirect the resulting dataset to rna4.
Mutate
So far we have:
- Filtered rows (kept certain observations)
- Selected columns (kept certain variables)
But what if we don’t want to remove anything? What if we want to add new information?
Now we will learn how to create new columns based on existing columns
using mutate().
What does mutate() do?
- Keeps all existing rows and columns
- Adds or modifies columns
It does not remove anything unless you explicitly overwrite a column.
Creating a Time Column in Hours
We have a column called time that is measured in days.
We would like to create a new column that shows the time in hours. So we
can multiply time by 24.
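One way to write this conversion is sketched below on a toy data frame. The values are invented, and rna_hours is a hypothetical name used only so the result can be inspected afterwards.

```r
library(dplyr)

# Toy stand-in for the rna table; time is in days, values are invented
rna <- data.frame(
  gene       = c("Asl", "Asl", "Apod"),
  time       = c(0, 4, 8),
  expression = c(1000, 2000, 30000)
)

# mutate() adds time_hours as a new last column; every row is kept
rna_hours <- rna %>%
  mutate(time_hours = time * 24)

rna_hours
```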
So far, there are three key syntaxes to keep track of:
- <-: used to assign objects in your R environment
- =: most commonly used inside functions to name arguments or assign them a value
- ==: does not assign anything; it performs a comparison and returns TRUE or FALSE
Also note:
- We didn’t remove anything
- We added a new column called time_hours at the end
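The three operators listed above can be seen side by side in a short base R sketch. The object names days, label, and is_day_four are hypothetical and exist only for this illustration.

```r
# <- assigns a value to an object in the environment
days <- 4

# = names an argument inside a function call (here, the sep argument of paste)
label <- paste("time", "hours", sep = "_")

# == compares two values and returns TRUE or FALSE; nothing is assigned by ==
is_day_four <- days == 4

label
is_day_four
```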
Grouped Operations: Split-apply-combine
R calculated time * 24 for every row independently,
across the entire dataset.
But what if we don’t want to treat all rows the same?
What if we want to calculate something within categories? Then we would no longer be transforming individual rows, but analyzing a group of rows.
Below we begin by telling R: “Take all rows that belong to the same gene, and analyze them together.”
This changes how subsequent functions behave! The
group_by function does not calculate anything; it just
prepares the data for grouped operations.
group_by(gene) organizes the data so that all rows for the same gene are treated as a group.
What the output tells you:
- There are 32,428 rows (observations)
- There are 19 columns (variables)
- The data is grouped into 1,474 genes, meaning there are 1,474 unique gene groups
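To see the same kind of grouping information on a small scale, here is a sketch using a toy data frame with invented values, so its counts are far smaller than the ones listed above. The name rna_grouped is hypothetical.

```r
library(dplyr)

# Toy stand-in for the rna table; all values are invented for illustration
rna <- data.frame(
  gene       = c("Asl", "Asl", "Apod", "Apod"),
  sample     = c("S1", "S2", "S1", "S2"),
  expression = c(1000, 2000, 30000, 10000)
)

# group_by() only records the grouping; no values are changed or removed
rna_grouped <- rna %>% group_by(gene)

# Printing shows the same rows plus a "Groups:" line in the header
rna_grouped

# n_groups() reports how many distinct gene groups there are
n_groups(rna_grouped)
```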
Now is a good time to introduce the next common step: summarizing each group. Together, these steps
are known as the split-apply-combine paradigm:
- Split the data into groups using group_by()
- Apply a summary calculation to each group
- Combine the results
The data has been grouped by gene; now we can collapse each group
into a single-row summary using summarize():
- mean_expression is a new column containing the average expression for each gene
- The result has one row per gene, instead of one row per sample
- Other columns are automatically dropped unless included in the grouping
What is this code doing? This code calculates the average expression level for each gene across all samples in the dataset. For each gene, we are collapsing all measurements into a single number that represents its overall average activity in your experiment. This average is taken across everything in your dataset (e.g., different tissues, time points, sexes, treatments).
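The grouped mean described above can be sketched on toy data as follows. The values are invented so that the averages are easy to check by hand, and gene_means is a hypothetical name.

```r
library(dplyr)

# Toy stand-in for the rna table; all values are invented for illustration
rna <- data.frame(
  gene       = c("Asl", "Asl", "Apod", "Apod"),
  sample     = c("S1", "S2", "S1", "S2"),
  expression = c(1000, 2000, 30000, 10000)
)

# Split by gene, apply mean() to each group, combine into one row per gene
gene_means <- rna %>%
  group_by(gene) %>%
  summarize(mean_expression = mean(expression))

# Asl averages (1000 + 2000) / 2 and Apod averages (30000 + 10000) / 2
gene_means
```

The sample column is absent from the result because it was neither a grouping variable nor part of the summary.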
Challenge
Take the next five minutes to add multiple summaries at once.
Above you created a new column with the mean expression. Now
add max_expression and min_expression
columns.
Counting
So far, we’ve learned how to:
- Split data into groups using group_by()
- Collapse each group into a summary using summarize()
But sometimes, we don’t need to calculate a mean, max, or min. Sometimes we just want to know: How many observations are in each group?
- In other words, we just want to count rows per category.
This output shows the number of observations in each infection category. There are 22,110 rows corresponding to samples infected with Influenza A, and 10,318 rows corresponding to non-infected samples. This tells us that the dataset contains substantially more observations from infected samples than from non-infected ones. In other words, the data are not evenly balanced between the two infection groups.
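A toy version of this counting step is sketched below. The data frame and its category labels are invented for illustration, so the counts are not the ones reported above, and infection_counts and grouped_counts are hypothetical names.

```r
library(dplyr)

# Toy stand-in for the rna table; labels and values are invented
rna <- data.frame(
  gene      = c("Asl", "Apod", "Asl", "Apod"),
  infection = c("InfluenzaA", "InfluenzaA", "InfluenzaA", "NonInfected")
)

# count() returns one row per category with the number of rows in column n
infection_counts <- rna %>% count(infection)
infection_counts

# The same numbers via the split-apply-combine steps; n() counts rows per group
grouped_counts <- rna %>%
  group_by(infection) %>%
  summarize(n = n())
```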
Challenge
Modify the code above, so that you count the number of observations
for each combination of infection and time.
Then pipe the output into arrange() to sort the table by
time.
Based on this new output:
- Does the distribution of samples look more or less balanced than
when we ignored
time?
Reshaping data
Take the next five minutes to carefully compare the two pieces of code below.
Discuss the following:
- What does the first code chunk change about the rna dataset?
- It reduces the dataset to only three columns: gene, sample, and expression.
- It does not change the structure of the data.
- Each row represents one measurement of expression for one gene in one sample.
- What does the second code chunk change about the rna dataset?
- It reshapes the data from long to wide format.
- The values in sample become new column names.
- The number of rows decreases (one row per gene).
- How does the structure of the data differ between rna_exp and rna_wide? Run the code!
In rna_exp, what does each row represent?
Each row represents a single observation: the expression of one gene in one specific sample. As a result, there are multiple rows for each gene, one for each sample.
In rna_wide, what does each row represent?
Each row represents one gene, with all its expression values spread across multiple columns. There is exactly one row per gene, regardless of how many samples there are.
- When might the wide format be more useful than the long format?
Preparing data for modeling or matrix operations. For example, a gene expression matrix for principal component analysis (PCA) or clustering needs one row per gene and one column per sample.
Performing correlations or modeling of samples
Long format is preferred when plotting with ggplot2 or using tidy workflows.
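The long and wide layouts discussed above can be compared with a sketch like the following. It assumes a toy data frame with invented values, and the exact code used in the lesson may differ.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the rna table; all values are invented for illustration
rna <- data.frame(
  gene       = c("Asl", "Asl", "Apod", "Apod"),
  sample     = c("S1", "S2", "S1", "S2"),
  tissue     = "Cerebellum",
  expression = c(1000, 2000, 30000, 10000)
)

# Long format: one row per gene-sample measurement, three columns
rna_exp <- rna %>%
  select(gene, sample, expression)

# Wide format: one row per gene, one column per sample
rna_wide <- rna_exp %>%
  pivot_wider(names_from = sample, values_from = expression)

rna_exp
rna_wide
```

Here names_from picks the column whose values become the new column names, and values_from picks the column that fills the cells.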