Lesson 4: Data Visualization

2026-03-25

Introduction

  • Last time we finished using dplyr and tidyverse for the purpose of manipulating data.
  • Today we will produce scatter plots, boxplots, etc using ggplot.
  • Understand universal plot settings
  • Review faceting and apply it
  • Modify the aesthetics of an existing ggplot plot

Getting set up

Load any required packages:

library(dplyr)
library(tidyverse)

Go ahead and read in the rnaseq.csv data table. Redirect output to rna.

rna <- read_csv("rnaseq.csv") # comma separated values

Plotting with ggplot2

  • ggplot2 is a core member of tidyverse family of packages. Therefore upon loading tidyverse, we can view under the packages tab that it has been loaded.

  • ggplot2 is a plotting package that makes it simple to create complex plots from data in a data frame.

  • ggplot2 functions like data in the ‘long’ format, i.e., a column for every dimension, and a row for every observation.

  • Well-structured data will save you lots of time when making figures with ggplot2.

  • ggplot graphics are built step by step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.

To build a ggplot, we will use the following basic template

#ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +  #<GEOM_FUNCTION>()

# (1) data set
# (2) a coordinate system
# (3) geoms

We will begin by using the ggplot() function to initialize the basic graph structure. The basic idea is that you can specify different parts of the plot and add them together using the + operator. These parts are often referred to as layers.

Let’s start:

#ggplot(data = rna) #what happens? 

So what we are doing is binding the plot function to a specific data-frame using the data argument.

#?ggplot

# We can see that the first argument that ggplot requires is the data argument. Now let's continue to build this. 

Now lets define the mapping argument, using the aesthetic function. Notice, that we are selecting what is to be plotted on the x axis below.

#ggplot(data = rna,
       #mapping = aes(x = expression))

Notice, that we will still get a blank plot. This is because ggplot2 requires the user to specify an important layer called geometric objects. These are graphical representations of the data in plots. Examples include:

  • points (geom_point, geom_jitter for scatter plots, dot plots, etc)

  • lines (geom_line, for time series, trend lines, etc)

  • boxplot (geom_boxplot, for boxplots)

  • histogram (geom_histogram)

  • Any plot created with ggplot() must have at least one geom.

  • You can add a geom to a plot using the + operator.

ggplot(data = rna,
       mapping = aes(x = expression)) +
  geom_histogram()

How would I interpret this plot?

The majority of the counts are of low-expressed genes. We can see that the majority of the rows are of those which have less than 1,000 reads.

Raw RNA-seq count data do not follow a normal distribution. Instead, they follow a count-based distribution (commonly modeled as negative binomial). This is why, DESeq2 and edgeR use negative binomial models.

ggplot(data = rna,
       mapping = aes(x = expression)) +
  geom_histogram(bins = 15)

What does bins = 15 vs bins = 30 do?

It controls how many bars the histogram is divided into.

A histogram works by:

  • Taking the range of your data
  • Dividing that range into equal-width intervals (bins)
  • Counting how many observations fall inside each interval

bins = 15; fewer binds & wider intervals

Class Exercise 1

We observed that the data are skewed to the right. We can apply log2 transformation to have a more symmetric distribution. By doing so, large values get compressed and small values are less affected.

If we tried to take the log2(expression), knowing that it contains zeros, it will give us an output that says log(0) is undefined. To avoid this, you will need to add a small constant value (+1).

#log2(0)

You will need to:

  • Create a new column called expression_log that equals to log2(expression +1).

    • Which dplyr function allows us to create new columns?
  • Redirect this entire data-frame to a new R object called rna_log

  • Finally, create a histogram using rna_log and specify x = expression_log

rna_log <- rna |>
  mutate(expression_log = log2(expression + 1))
ggplot(rna_log, aes(x = expression_log)) + 
  geom_histogram()

Note that the + sign used to add new layers must be placed at the end of the line containing the previous layer. If, instead, the + sign is added at the beginning of the line containing the new layer, ggplot2 will not add the new layer and will return an error message.

Building your plots iteratively

Goal: Compare Expression Changes Across Time

Next, we want to compare how gene expression changes over time. Specifically,

  • “How much does gene expression change from 0 to 4 days?
  • “How much does gene expression change from 0 to 8 days?

We will make a scatter plot where:

  • x-axis = log2 fold change (8 vs 0)
  • y-axis = log2 fold change (4 vs 0 )

Step 1: Identifying what data we need

Which columns are required?

  • gene
  • time
  • gene_biotype
  • expression_log
rna_small <- rna_log |>
  select(gene, time, gene_biotype, expression_log)

Step 2: Compute the Mean Expression Per Gene Per Time

If we want the mean expression for each gene at each time point, what do we need to group by?

  • gene
  • time (and gene_biotype)

Performing the commands below tells you:

  • what time points exist
  • how many categories are in gene_biotype
unique(rna$time)
## [1] 8 0 4
unique(rna$gene_biotype)
##  [1] "protein_coding"                     "TEC"                               
##  [3] "lncRNA"                             "transcribed_unitary_pseudogene"    
##  [5] "transcribed_processed_pseudogene"   "unprocessed_pseudogene"            
##  [7] "processed_pseudogene"               "transcribed_unprocessed_pseudogene"
##  [9] "miRNA"                              "IG_C_gene"                         
## [11] "snoRNA"                             "scaRNA"                            
## [13] "polymorphic_pseudogene"

Take a look at the rna_small dataframe and sort manually to show that right now we have data frame that contains multiple replicates for gene A at time 0 and at time 4.

rna_mean <- rna_small |>
  group_by(gene, time, gene_biotype) |>
  summarize(mean_exp = mean(expression_log))
  • We collapsed the replicates
  • Each row is now one gene at one time point

Take a look at rna_mean.

Our goal is to compare how gene expression changes over time. So if we want to compute time 8 - time 0, are these values currently on the same row?

  • No, they are in separate rows.

We need one row per gene. This means we need to reshape the data.

Step 3: Reshape the data

rna_wide <- rna_mean |>
  pivot_wider(names_from = time,
              values_from = mean_exp)

Step 4: Compute Fold Change

This data has already been log-transformed, so log2FC = log2 (time 8) - log2 (time 0)

rna_fc <- rna_wide |>
  mutate(
    time_8_vs_0 = `8` - `0`,
    time_4_vs_0 = `4` - `0`
  )

Step 5: Now, we can make a scatter plot …

ggplot(rna_fc, aes(x = time_8_vs_0,
                   y = time_4_vs_0)) +
  geom_point()

Class Exercise 2

Create a scatter plot that

  • Places sample on the x-axis
  • Places expression_log on the y-axis
  • Uses different colors to represent time - use aes(color = time)
ggplot(data = rna_log, 
       mapping = aes(y = expression_log, 
                     x = sample,
                     color = time)) +
    geom_point()

Please note that:

  • Any aesthetics defined inside ggplot() are global mappings. These mappings are inherited by all geoms added to the plot. This includes variables assigned inside aes() (e.g., x, y, color).

However:

You can also define aesthetics inside a specific geom(). These are local mappings and apply only to that layer.

Class Exercise 3

Now, create a plot of expression_log vs sample. Instead of points, display the distribution using a boxplot.

ggplot(data = rna_log,
         mapping = aes(y = expression_log, x = sample)) +
  geom_boxplot()

Moving from a basic boxplot to a layered plot

Above, we are showing the distribution of expression for each sample. Now we are going to build on that plot, not replace it.

Step 1: Add Jitter

Problem, is that the boxplot summarizes the data. It hides individual data points. The solution is to add a layer that shows individual observations.

ggplot(data = rna_log,
         mapping = aes(y = expression_log, x = sample)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.2, color = "tomato")

Step 2: Set order & transparency

Whats the problem now? The points are “on top of” the boxplot. I want the black box plot to be “on top”. So let’s switch the order.

  • What is alpha doing? Making the points transparent.
ggplot(data = rna_log,
         mapping = aes(y = expression_log, x = sample)) +
  geom_jitter(alpha = 0.2, color = "tomato") +
  geom_boxplot() 

Now let’s make the boxplot transparent.

ggplot(data = rna_log,
         mapping = aes(y = expression_log, x = sample)) +
  geom_jitter(alpha = 0.2, color = "tomato") +
  geom_boxplot(alpha = 0) 

Step 3: Rotate axis labels

ggplot(data = rna_log,
         mapping = aes(y = expression_log, x = sample)) +
  geom_jitter(alpha = 0.2, color = "tomato") +
  geom_boxplot(alpha = 0) +
  theme(axis.text.x = element_text(angle = 90,
                                   hjust = 0.5,
                                   vjust = 0.5))

What to understand:

  • Plots are built incrementally
  • Each + adds a layer or a visual/styling adjustment

Class Exercise 4

Add color to the data points on your boxplot according to the duration of the infection (time).

# time as integer
ggplot(data = rna_log,
         mapping = aes(y = expression_log,
                       x = sample)) +
  geom_jitter(alpha = 0.2, aes(color = time)) +
  geom_boxplot(alpha = 0) +
  theme(axis.text.x = element_text(angle = 90,  hjust = 0.5, vjust = 0.5))

Now, re-create the image. Here, use color = as.factor() for time instead.

ggplot(data = rna_log,
         mapping = aes(y = expression_log,
                       x = sample)) +
  geom_jitter(alpha = 0.2, aes(color = as.factor(time))) +
  geom_boxplot(alpha = 0) +
  theme(axis.text.x = element_text(angle = 90,  hjust = 0.5, vjust = 0.5))

Finally, replace the box plot with a violin plot geom_violin(). Fill in the violins according to time with the argument fill = sex.

ggplot(data = rna_log,
         mapping = aes(y = expression_log, x = sample)) +
  geom_violin(aes(fill = sex)) +
  theme(axis.text.x = element_text(angle = 90, 
                                   hjust = 0.5, vjust = 0.5))

Line plots

Alright lets regroup. ggplot builds plots from three core components: 1) dataset, 2) mapping variables, 3) a geometric object (points, bars, lines, etc).

Everything from this lesson builds from this core concept.

So our next question is, “For the top 10 expressed genes, how does their expression change from time 0 and time 8 (i.e. over time)?”

Step 1: Identify which genes changed the most

In rna_fc, each row represents one gene with its fold changes already computed. We care about time_8_vs_0 representing how much expression changed from time 0 to time 8.

rna_fc <- rna_fc |> 
  arrange(desc(time_8_vs_0))
  • Put the genes that increased the most at the top of the list
  • desc() means descending order

Step 2: Extract the Top 10 Gene Names

genes_selected <- rna_fc$gene[1:10]

What this is doing is actually looking at the first 10 rows, pulling out their gene names, and storing these names as a character vector. Notice, that these are simply values. We will come back to how these are being used.

Step 3: Go back to rna dataframe

So the rna_fc gave us the vectored gene list. Now, we want mean expression at each time point. The original dataframe contains this information.

sub_rna <- rna_log |>
    filter(gene %in% genes_selected)

This is keeping only rows where the gene is one of the top 10. It’s tossing all the other genes. We still have multiple time points, replicates, but for only 10 genes.

Next, we need to:

  • group by gene and time
  • compute the mean expression for each group.

Perform these steps below using sub_rna as the starting input. Redirect the output back to mean_exp_by_time

Hint: we did this during Step 2: Compute the Mean Expression Per Gene Per Time

mean_exp_by_time <- sub_rna |>
  group_by(gene, time) |>
  summarize(mean_exp = mean(expression_log))
## `summarise()` has grouped output by 'gene'. You can override using the
## `.groups` argument.

Step 4: Plotting a line plot

Now, we can build a line plot with duration of the infection on the x-axis and the mean expression on the y-axis

ggplot(data = mean_exp_by_time, 
       mapping = aes(x = time, y = mean_exp)) +
  geom_line()

We can see that this alone does not solve the problem. This is because we plotted data for all the genes together. We need to tell ggplot to draw a line for each gene by modifying the aesthetic function to include color = gene.

ggplot(data = mean_exp_by_time,
       mapping = aes(x = time, y = mean_exp, color = gene)) +
  geom_line()

ggplot(data = mean_exp_by_time,
       mapping = aes(x = time, y = mean_exp, color = gene)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Mean Gene Expression Over Time",
    x = "Time",
    y = "Mean Expression",
    color = "Gene"
  ) +
  theme_minimal()