Introduction
- Last time we finished using
dplyrandtidyversefor the purpose of manipulating data. - Today we will produce scatter plots, boxplots, etc using
ggplot. - Understand universal plot settings
- Review faceting and apply it
- Modify the aesthetics of an existing ggplot plot
Plotting with ggplot2
ggplot2is a core member of tidyverse family of packages. Therefore upon loading tidyverse, we can view under the packages tab that it has been loaded.ggplot2is a plotting package that makes it simple to create complex plots from data in a data frame.ggplot2functions like data in the ‘long’ format, i.e., a column for every dimension, and a row for every observation.Well-structured data will save you lots of time when making figures with
ggplot2.ggplotgraphics are built step by step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.
To build a ggplot, we will use the following basic template
#ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + #<GEOM_FUNCTION>()
# (1) data set
# (2) a coordinate system
# (3) geomsWe will begin by using the ggplot() function to
initialize the basic graph structure. The basic idea is that you can
specify different parts of the plot and add them together using the
+ operator. These parts are often referred to as
layers.
Let’s start:
So what we are doing is binding the plot function to a specific
data-frame using the data argument.
#?ggplot
# We can see that the first argument that ggplot requires is the data argument. Now let's continue to build this. Now lets define the mapping argument, using the aesthetic function. Notice, that we are selecting what is to be plotted on the x axis below.
Notice, that we will still get a blank plot. This is because
ggplot2 requires the user to specify an important
layer called geometric objects. These are graphical
representations of the data in plots. Examples include:
points (
geom_point,geom_jitterfor scatter plots, dot plots, etc)lines (
geom_line, for time series, trend lines, etc)boxplot (
geom_boxplot, for boxplots)histogram (
geom_histogram)Any plot created with
ggplot()must have at least one geom.You can add a geom to a plot using the
+operator.
How would I interpret this plot?
The majority of the counts are of low-expressed genes. We can see that the majority of the rows are of those which have less than 1,000 reads.
Raw RNA-seq count data do not follow a normal distribution. Instead,
they follow a count-based distribution (commonly modeled as negative
binomial). This is why, DESeq2 and edgeR use
negative binomial models.
What does bins = 15 vs bins = 30 do?
It controls how many bars the histogram is divided into.
A histogram works by:
- Taking the range of your data
- Dividing that range into equal-width intervals (bins)
- Counting how many observations fall inside each interval
bins = 15; fewer binds & wider intervals
Class Exercise 1
We observed that the data are skewed to the right. We can apply log2 transformation to have a more symmetric distribution. By doing so, large values get compressed and small values are less affected.
If we tried to take the log2(expression), knowing that it contains zeros, it will give us an output that says log(0) is undefined. To avoid this, you will need to add a small constant value (+1).
You will need to:
Create a new column called
expression_logthat equals tolog2(expression +1).- Which
dplyrfunction allows us to create new columns?
- Which
Redirect this entire data-frame to a new R object called
rna_logFinally, create a histogram using
rna_logand specifyx = expression_log
Note that the + sign used to add new layers must be
placed at the end of the line containing the previous layer. If,
instead, the + sign is added at the beginning of the line
containing the new layer, ggplot2 will not add the new layer and will
return an error message.
Building your plots iteratively
Goal: Compare Expression Changes Across Time
Next, we want to compare how gene expression changes over time. Specifically,
- “How much does gene expression change from 0 to 4 days?
- “How much does gene expression change from 0 to 8 days?
We will make a scatter plot where:
- x-axis = log2 fold change (8 vs 0)
- y-axis = log2 fold change (4 vs 0 )
Step 1: Identifying what data we need
Which columns are required?
- gene
- time
- gene_biotype
- expression_log
Step 2: Compute the Mean Expression Per Gene Per Time
If we want the mean expression for each gene at each time point, what do we need to group by?
- gene
- time (and gene_biotype)
Performing the commands below tells you:
- what
timepoints exist - how many categories are in
gene_biotype
## [1] 8 0 4
## [1] "protein_coding" "TEC"
## [3] "lncRNA" "transcribed_unitary_pseudogene"
## [5] "transcribed_processed_pseudogene" "unprocessed_pseudogene"
## [7] "processed_pseudogene" "transcribed_unprocessed_pseudogene"
## [9] "miRNA" "IG_C_gene"
## [11] "snoRNA" "scaRNA"
## [13] "polymorphic_pseudogene"
Take a look at the rna_small dataframe and sort manually
to show that right now we have data frame that contains multiple
replicates for gene A at time 0 and at time 4.
rna_mean <- rna_small |>
group_by(gene, time, gene_biotype) |>
summarize(mean_exp = mean(expression_log))- We collapsed the replicates
- Each row is now one gene at one time point
Take a look at rna_mean.
Our goal is to compare how gene expression changes over time. So if we want to compute time 8 - time 0, are these values currently on the same row?
- No, they are in separate rows.
We need one row per gene. This means we need to reshape the data.
Step 3: Reshape the data
Step 4: Compute Fold Change
This data has already been log-transformed, so log2FC = log2 (time 8) - log2 (time 0)
Class Exercise 2
Create a scatter plot that
- Places
sampleon the x-axis - Places
expression_logon the y-axis - Uses different colors to represent time - use
aes(color = time)
Please note that:
- Any aesthetics defined inside
ggplot()are global mappings. These mappings are inherited by all geoms added to the plot. This includes variables assigned insideaes()(e.g., x, y, color).
However:
You can also define aesthetics inside a specific geom().
These are local mappings and apply only to that layer.
Moving from a basic boxplot to a layered plot
Above, we are showing the distribution of expression for each sample. Now we are going to build on that plot, not replace it.
Step 1: Add Jitter
Problem, is that the boxplot summarizes the data. It hides individual data points. The solution is to add a layer that shows individual observations.
Step 2: Set order & transparency
Whats the problem now? The points are “on top of” the boxplot. I want the black box plot to be “on top”. So let’s switch the order.
- What is alpha doing? Making the points transparent.
ggplot(data = rna_log,
mapping = aes(y = expression_log, x = sample)) +
geom_jitter(alpha = 0.2, color = "tomato") +
geom_boxplot() Now let’s make the boxplot transparent.
Step 3: Rotate axis labels
ggplot(data = rna_log,
mapping = aes(y = expression_log, x = sample)) +
geom_jitter(alpha = 0.2, color = "tomato") +
geom_boxplot(alpha = 0) +
theme(axis.text.x = element_text(angle = 90,
hjust = 0.5,
vjust = 0.5))What to understand:
- Plots are built incrementally
- Each
+adds a layer or a visual/styling adjustment
Class Exercise 4
Add color to the data points on your boxplot according to the duration of the infection (time).
# time as integer
ggplot(data = rna_log,
mapping = aes(y = expression_log,
x = sample)) +
geom_jitter(alpha = 0.2, aes(color = time)) +
geom_boxplot(alpha = 0) +
theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5))Now, re-create the image. Here, use color = as.factor()
for time instead.
ggplot(data = rna_log,
mapping = aes(y = expression_log,
x = sample)) +
geom_jitter(alpha = 0.2, aes(color = as.factor(time))) +
geom_boxplot(alpha = 0) +
theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5))Finally, replace the box plot with a violin plot
geom_violin(). Fill in the violins according to time with
the argument fill = sex.
Line plots
Alright lets regroup. ggplot builds plots from three
core components: 1) dataset, 2) mapping variables, 3) a geometric object
(points, bars, lines, etc).
Everything from this lesson builds from this core concept.
So our next question is, “For the top 10 expressed genes, how does their expression change from time 0 and time 8 (i.e. over time)?”
Step 1: Identify which genes changed the most
In rna_fc, each row represents one gene with its fold
changes already computed. We care about time_8_vs_0
representing how much expression changed from time 0 to time 8.
- Put the genes that increased the most at the top of the list
desc()means descending order
Step 2: Extract the Top 10 Gene Names
What this is doing is actually looking at the first 10 rows, pulling out their gene names, and storing these names as a character vector. Notice, that these are simply values. We will come back to how these are being used.
Step 3: Go back to rna dataframe
So the rna_fc gave us the vectored gene list. Now, we
want mean expression at each time point. The original dataframe contains
this information.
This is keeping only rows where the gene is one of the top 10. It’s tossing all the other genes. We still have multiple time points, replicates, but for only 10 genes.
Next, we need to:
- group by gene and time
- compute the mean expression for each group.
Perform these steps below using sub_rna as the starting
input. Redirect the output back to mean_exp_by_time
Hint: we did this during Step 2: Compute the Mean Expression Per Gene Per Time
## `summarise()` has grouped output by 'gene'. You can override using the
## `.groups` argument.
Step 4: Plotting a line plot
Now, we can build a line plot with duration of the infection on the x-axis and the mean expression on the y-axis
We can see that this alone does not solve the problem. This is
because we plotted data for all the genes together. We need to tell
ggplot to draw a line for each gene by modifying the
aesthetic function to include color = gene.
ggplot(data = mean_exp_by_time,
mapping = aes(x = time, y = mean_exp, color = gene)) +
geom_line() +
geom_point() +
labs(
title = "Mean Gene Expression Over Time",
x = "Time",
y = "Mean Expression",
color = "Gene"
) +
theme_minimal()