Analyzing RNA-Seq Part II

Getting Started

Code chunk options

Install the required R libraries

# Install CRAN packages
#install.packages(c("knitr", "RColorBrewer", "pheatmap"))
#BiocManager::install("apeglm") # Update all/some/none? [a/s/n] - type n 
#install.packages("pacman")

Libraries loaded

# loading required packages 
pacman::p_load(DESeq2, dplyr, tidyr, ggplot2, genekitr, pheatmap, 
               RColorBrewer, EnhancedVolcano, clusterProfiler, DOSE, 
               ggnewscale, enrichplot, knitr, cowplot, msigdbr, pathview, 
               patchwork, ggplotify)

About the Data

For today’s lesson we will be working with RNA-seq data from another recent publication in ImmunoHorizons by Sabikunnahar et al. (2025).

Summary: Innate immune cells rapidly respond to microbial threats, but if unchecked, these responses can harm the host, as seen in septic shock. To explore how genetic variation influences these responses, researchers compared standard lab mice (C57BL/6) with genetically diverse wild-derived mice (PWD).

Experimental Design: The authors isolated bone marrow derived dendritic cells from B6, c11.2, and PWD mice and then either left them unstimulated or stimulated with LPS.

This experiment has: + Two treatments: untreated, LPS + Three genotypes: B6, c11.2, and PWD

The goal: + To test how LPS stimulation affects each genotype + Whether the response to LPS differs between genotype

`ENSM_counts` folder contents

Contents of the ENSM_counts folder:

The raw counts matrix called Sabikunnahar.et.al.2025_raw_counts_matrix.csv
No metatable table: we will create this

Loading `countdata`

Write a line of R code that reads in the counts matrix using read.csv(). + Hint: Include header=TRUE and specify the first column as row names. + After your data is in the global environment, convert it to a matrix.

#load data
countdata <- read.csv("Sabikunnahar.et.al.2025_raw_counts_matrix.csv", 
                      header=TRUE, 
                      row.names=1)

countdata <- as.matrix(countdata)

colnames(countdata)

##  [1] "Unstim_B6_Rep1"    "Unstim_B6_Rep2"    "Unstim_B6_Rep3"   
##  [4] "Unstim_B6_Rep4"    "Unstim_c11.2_rep1" "Unstim_c11.2_rep2"
##  [7] "Unstim_c11.2_rep3" "Unstim_c11.2_rep4" "Unstim_PWD_rep1"  
## [10] "Unstim_PWD_rep2"   "Unstim_PWD_rep3"   "Unstim_PWD_rep4"  
## [13] "LPS_B6_rep1"       "LPS_B6_rep2"       "LPS_B6_rep3"      
## [16] "LPS_B6_rep4"       "LPS_c11.2_Rep1"    "LPS_c11.2_Rep2"   
## [19] "LPS_c11.2_Rep3"    "LPS_c11.2_Rep4"    "LPS_PWD_Rep1"     
## [22] "LPS_PWD_Rep2"      "LPS_PWD_Rep3"      "LPS_PWD_Rep4"

Now that we’ve loaded our count data, let’s build the metadata that describes each sample — this includes treatment condition and genotype. This table is essential for many types of downstream analyses like differential expression.

sample_names <- colnames(countdata)
treatments <- rep(c("Untreated", "LPS"), each = 12)
genotypes <- rep(c("B6", "c11.2", "PWD"), each = 4, times = 2)

Let’s check our work we should have 24 sample names, 24 treatments, and 24 genotypes.

length(sample_names)

## [1] 24

length(treatments)

## [1] 24

length(genotypes)

## [1] 24

Now that we’ve created vectors for our sample names, genotypes, and treatments, we’re ready to organize them into a metadata table — often called a coldata object in RNA-seq workflows.

coldata <- data.frame(
  row.names = sample_names,
  Genotype = factor(genotypes, levels = c("B6", "c11.2", "PWD")),
  Treatment = factor(treatments, levels = c("Untreated", "LPS"))
)

coldata$Group <- factor(paste(coldata$Genotype, coldata$Treatment, sep = "_"))

Here, we’re building a data.frame with three key things:

row.names = sample_names — This makes sure each row is labeled with the correct sample ID from our count data.

Genotype = factor(...) — We’re telling R that genotype is a categorical variable with a specific order: B6, c11.2, then PWD. This helps when plotting or running models.

Treatment = factor(...) — Same idea: treatment is also a categorical variable, ordered as Untreated, then LPS.

coldata$Group Line creates a new column called Group, which combines the genotype and treatment into one label.

So now, coldata is a complete and tidy metadata table with rows named by sample, columns for Genotype and Treatment, and a convenient combined Group column.

all(colnames(countdata) == rownames(coldata))

## [1] TRUE

# should return TRUE

Class Exercise: DESeq2 with counts matrix

From htseq-count files - DESeqDataSetFromHTSeqCount
From a count matrix - DESeqDataSetFromMatrix

Use the DESeqDataSetFromMatrix() function to create a dataset object called dds. Set the design to test the ~ Group variable.

dds <- DESeqDataSetFromMatrix(countData = countdata,
                              colData = coldata,
                              design = ~ Group)

Pre-filtering

keep <- rowSums(counts(dds)) >= 20
dds <- dds[keep,]

Class Exercise: Run Differential expression analysis

Now that you’ve defined the dataset, run the DESeq analysis with DESeq().

dds <- DESeq(dds)

Understanding Pairwise Comparisons with `results()` in DESeq2

Once you’ve run your differential expression analysis using DESeq(), you can extract specific results using the results() function.

This is where the contrast argument becomes very important — especially if you’re making multiple pairwise comparisons within your experimental design.

What is contrast doing?

The contrast = c(“Group”, “A”, “B”) argument tells DESeq2:

“Compare condition A against condition B, within the variable called Group.”

This means DESeq2 will test for genes that are significantly different between condition A and condition B — calculating log2 fold changes, p-values, and adjusted p-values.

# Compares B6 mice treated with LPS to untreated B6 mice
B6_LPS_res <- results(dds, contrast = c("Group", "B6_LPS", "B6_Untreated"))

# How do c11.2 mice respond to LPS, compared to their untreated controls
c11.2_LPS_res <- results(dds, contrast = c("Group", "c11.2_LPS", "c11.2_Untreated"))

# How do PWD mice respond to LPS, compared to their untreated controls
PWD_LPS_res <- results(dds, contrast = c("Group", "PWD_LPS", "PWD_Untreated"))

Below write a line to compare, PWD vs. B6 both under LPS treatment. Call the object LPS_compare_res.

# Compare LPS response between PWD and B6 (under LPS)
LPS_compare_res <- results(dds, contrast = c("Group", "PWD_LPS", "B6_LPS"))

To explore the output of the results() function we can also use the head() function.

head(B6_LPS_res)

## log2 fold change (MLE): Group B6_LPS vs B6_Untreated 
## Wald test p-value: Group B6_LPS vs B6_Untreated 
## DataFrame with 6 rows and 6 columns
##                         baseMean log2FoldChange     lfcSE      stat      pvalue
##                        <numeric>      <numeric> <numeric> <numeric>   <numeric>
## ENSMUSG00000020826.9  3482.13439       8.836127 0.2132265 41.440103 0.00000e+00
## ENSMUSG00000098104.1     1.07948       0.563411 1.4252563  0.395305 6.92618e-01
## ENSMUSG00000103922.1     9.71912       0.210842 0.7620138  0.276691 7.82018e-01
## ENSMUSG00000033845.13  353.21357      -0.187633 0.0811147 -2.313180 2.07128e-02
## ENSMUSG00000102275.1     7.50138      -0.113853 0.4232646 -0.268988 7.87939e-01
## ENSMUSG00000025903.14  834.78800       0.402286 0.0842057  4.777418 1.77560e-06
##                              padj
##                         <numeric>
## ENSMUSG00000020826.9  0.00000e+00
## ENSMUSG00000098104.1  7.96790e-01
## ENSMUSG00000103922.1  8.62258e-01
## ENSMUSG00000033845.13 4.34703e-02
## ENSMUSG00000102275.1  8.66964e-01
## ENSMUSG00000025903.14 6.29707e-06

head(LPS_compare_res)

## log2 fold change (MLE): Group PWD_LPS vs B6_LPS 
## Wald test p-value: Group PWD LPS vs B6 LPS 
## DataFrame with 6 rows and 6 columns
##                         baseMean log2FoldChange     lfcSE       stat
##                        <numeric>      <numeric> <numeric>  <numeric>
## ENSMUSG00000020826.9  3482.13439      -3.812370 0.1757376 -21.693534
## ENSMUSG00000098104.1     1.07948       0.378988 1.2991089   0.291729
## ENSMUSG00000103922.1     9.71912       3.712774 0.6034209   6.152876
## ENSMUSG00000033845.13  353.21357      -0.640256 0.0862566  -7.422692
## ENSMUSG00000102275.1     7.50138      -0.681252 0.4570126  -1.490663
## ENSMUSG00000025903.14  834.78800      -0.188839 0.0837277  -2.255400
##                             pvalue         padj
##                          <numeric>    <numeric>
## ENSMUSG00000020826.9  2.36133e-104 9.03706e-103
## ENSMUSG00000098104.1   7.70494e-01  8.23113e-01
## ENSMUSG00000103922.1   7.60901e-10  2.86879e-09
## ENSMUSG00000033845.13  1.14764e-13  5.38103e-13
## ENSMUSG00000102275.1   1.36050e-01  1.95847e-01
## ENSMUSG00000025903.14  2.41082e-02  4.17033e-02

Replicating Results: Figure 5, panel B

# Principal components analysis
vstcounts <- vst(dds, blind=TRUE)
plotPCA(vstcounts, intgroup=c("Treatment", "Genotype"))

pcaData <- plotPCA(vstcounts, 
                   intgroup = c("Treatment", "Genotype"), 
                   returnData = TRUE)

What is this doing? + plotPCA is the DESeq2 function that performs PCA on the VST-transformed counts matrix + intgroup = c("Treatment", "Genotype"): Tells DESeq2 to pull these columns from the colData(dds) and include them in the output for grouping + returnData = TRUE: Instead of drawing the plot directly, this tells the function to return the underlying PCA coordinates (PC1, PC2) as a data frame.

percentVar <- round(100 * attr(pcaData, "percentVar"))
# returns a numeric vector that is used in axis labeling

#library(ggplot2)

ggplot(pcaData, aes(x = PC1, y = PC2, color = Genotype, shape = Treatment)) +
  geom_point(size = 4, alpha = 0.8) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  labs(title = "PCA of VST-normalized Counts",
       subtitle = "Color = Genotype, Shape = Treatment") +
  theme_grey(base_size = 14) +
  theme(
    legend.position = "right",
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5)
  ) +
  scale_color_brewer(palette = "Set1")

display.brewer.all(type="qual")

display.brewer.all(type="div")

display.brewer.all(type="seq")

Independent Filtering and log-fold shrinkage

shrunk_LPS_compare <- lfcShrink(dds,
                                contrast = c("Group", "PWD_LPS", "B6_LPS"), 
                                type = "ashr")  # or "apeglm" if available

resultsNames(dds)

## [1] "Intercept"                       "Group_B6_Untreated_vs_B6_LPS"   
## [3] "Group_c11.2_LPS_vs_B6_LPS"       "Group_c11.2_Untreated_vs_B6_LPS"
## [5] "Group_PWD_LPS_vs_B6_LPS"         "Group_PWD_Untreated_vs_B6_LPS"

plotMA(shrunk_LPS_compare)

result_LPS_compare <- shrunk_LPS_compare[order(shrunk_LPS_compare$padj), ]

table(result_LPS_compare$padj<0.05)

## 
## FALSE  TRUE 
##  8253 11801

Merged Results

We will now output our results out of R to have a stand-alone file. First, we will merge the stats (contained within result object) with the normalized counts (contained within dds object). The final object created is called resdata and now since it is a data frame we can create output a tabular file with write.csv.

result_LPS_compare <- merge(as.data.frame(result_LPS_compare), 
                 as.data.frame(counts(dds, normalized=TRUE)), 
                 by="row.names", 
                 sort=FALSE)

names(result_LPS_compare)[1] <- "Gene_id"

str(result_LPS_compare)

## 'data.frame':    20055 obs. of  30 variables:
##  $ Gene_id          : 'AsIs' chr  "ENSMUSG00000026158.11" "ENSMUSG00000090272.9" "ENSMUSG00000054203.7" "ENSMUSG00000083563.1" ...
##  $ baseMean         : num  656 3535 2431 9917 3421 ...
##  $ log2FoldChange   : num  -3.93 -6.5 -5.94 9.51 1.85 ...
##  $ lfcSE            : num  0.1025 0.1331 0.1254 0.1512 0.0439 ...
##  $ pvalue           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ padj             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Unstim_B6_Rep1   : num  774.9 1845.4 232.8 26.6 1338.3 ...
##  $ Unstim_B6_Rep2   : num  677.4 1437.2 204.8 38.9 1314.9 ...
##  $ Unstim_B6_Rep3   : num  753.5 2004.2 255.4 38.9 1401 ...
##  $ Unstim_B6_Rep4   : num  920.4 2084.4 304.2 31.1 1238.2 ...
##  $ Unstim_c11.2_rep1: num  363.6 1151.8 106.3 33.4 1235.8 ...
##  $ Unstim_c11.2_rep2: num  428.7 1494.4 179.8 56.9 1439.5 ...
##  $ Unstim_c11.2_rep3: num  338 1170.3 96.7 31.6 1334.8 ...
##  $ Unstim_c11.2_rep4: num  360.7 1128.3 112.8 37.3 1396.6 ...
##  $ Unstim_PWD_rep1  : num  228.71 16.91 3.22 29593.14 3998.42 ...
##  $ Unstim_PWD_rep2  : num  257.67 13.47 3.59 27922.65 3793.23 ...
##  $ Unstim_PWD_rep3  : num  248.2 18.57 2.53 24473.99 3738.21 ...
##  $ Unstim_PWD_rep4  : num  266.1 19.2 4.8 24191.7 3983.6 ...
##  $ LPS_B6_rep1      : num  1857.7 11213.7 10906.2 31.5 2481.3 ...
##  $ LPS_B6_rep2      : num  1719 10948 11126 51 2340 ...
##  $ LPS_B6_rep3      : num  1890.4 11767.5 11143 46.7 2543.2 ...
##  $ LPS_B6_rep4      : num  2013.3 11543.7 11151 46.9 2386 ...
##  $ LPS_c11.2_Rep1   : num  571.1 6564.8 2777.7 70.6 2771.6 ...
##  $ LPS_c11.2_Rep2   : num  573.4 7085.5 3316.9 66.6 2700.9 ...
##  $ LPS_c11.2_Rep3   : num  533 6615 3000 48 2689 ...
##  $ LPS_c11.2_Rep4   : num  485.3 6238.4 2716.9 63.4 2857.5 ...
##  $ LPS_PWD_Rep1     : num  119.5 90.9 153.2 35293.4 8966.2 ...
##  $ LPS_PWD_Rep2     : num  133 164 207 30379 8644 ...
##  $ LPS_PWD_Rep3     : num  107 135 183 31117 8691 ...
##  $ LPS_PWD_Rep4     : num  123 97 165 34315 8812 ...

#Add a column to dataframe to specify up or down regulated
result_LPS_compare$diffexpressed <- "Unchanged"

#if log2FC > 1 and padj < 0.05 set as upregulated
result_LPS_compare$diffexpressed[result_LPS_compare$padj < 0.05 & result_LPS_compare$log2FoldChange > 1] <- "Upregulated"

#if log2FC > -1 and padj < 0.05 set as downregulated
result_LPS_compare$diffexpressed[result_LPS_compare$padj < 0.05 & result_LPS_compare$log2FoldChange < -1] <- "Downregulated"

str(result_LPS_compare)

## 'data.frame':    20055 obs. of  31 variables:
##  $ Gene_id          : 'AsIs' chr  "ENSMUSG00000026158.11" "ENSMUSG00000090272.9" "ENSMUSG00000054203.7" "ENSMUSG00000083563.1" ...
##  $ baseMean         : num  656 3535 2431 9917 3421 ...
##  $ log2FoldChange   : num  -3.93 -6.5 -5.94 9.51 1.85 ...
##  $ lfcSE            : num  0.1025 0.1331 0.1254 0.1512 0.0439 ...
##  $ pvalue           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ padj             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Unstim_B6_Rep1   : num  774.9 1845.4 232.8 26.6 1338.3 ...
##  $ Unstim_B6_Rep2   : num  677.4 1437.2 204.8 38.9 1314.9 ...
##  $ Unstim_B6_Rep3   : num  753.5 2004.2 255.4 38.9 1401 ...
##  $ Unstim_B6_Rep4   : num  920.4 2084.4 304.2 31.1 1238.2 ...
##  $ Unstim_c11.2_rep1: num  363.6 1151.8 106.3 33.4 1235.8 ...
##  $ Unstim_c11.2_rep2: num  428.7 1494.4 179.8 56.9 1439.5 ...
##  $ Unstim_c11.2_rep3: num  338 1170.3 96.7 31.6 1334.8 ...
##  $ Unstim_c11.2_rep4: num  360.7 1128.3 112.8 37.3 1396.6 ...
##  $ Unstim_PWD_rep1  : num  228.71 16.91 3.22 29593.14 3998.42 ...
##  $ Unstim_PWD_rep2  : num  257.67 13.47 3.59 27922.65 3793.23 ...
##  $ Unstim_PWD_rep3  : num  248.2 18.57 2.53 24473.99 3738.21 ...
##  $ Unstim_PWD_rep4  : num  266.1 19.2 4.8 24191.7 3983.6 ...
##  $ LPS_B6_rep1      : num  1857.7 11213.7 10906.2 31.5 2481.3 ...
##  $ LPS_B6_rep2      : num  1719 10948 11126 51 2340 ...
##  $ LPS_B6_rep3      : num  1890.4 11767.5 11143 46.7 2543.2 ...
##  $ LPS_B6_rep4      : num  2013.3 11543.7 11151 46.9 2386 ...
##  $ LPS_c11.2_Rep1   : num  571.1 6564.8 2777.7 70.6 2771.6 ...
##  $ LPS_c11.2_Rep2   : num  573.4 7085.5 3316.9 66.6 2700.9 ...
##  $ LPS_c11.2_Rep3   : num  533 6615 3000 48 2689 ...
##  $ LPS_c11.2_Rep4   : num  485.3 6238.4 2716.9 63.4 2857.5 ...
##  $ LPS_PWD_Rep1     : num  119.5 90.9 153.2 35293.4 8966.2 ...
##  $ LPS_PWD_Rep2     : num  133 164 207 30379 8644 ...
##  $ LPS_PWD_Rep3     : num  107 135 183 31117 8691 ...
##  $ LPS_PWD_Rep4     : num  123 97 165 34315 8812 ...
##  $ diffexpressed    : chr  "Downregulated" "Downregulated" "Downregulated" "Upregulated" ...

GeneID conversion

Our objective now is to convert from one gene id to another.

There are many Gene ID types, below are some of the most common:

Entrez Gene - 1457, 2002, 1950
Gene Symbol - Ikzf1, Tcf1, Cd19
Ensembl ID - ENSMUSG00000018654

# Remove the version from Ensembl gene IDs
result_LPS_compare$Ensembl_id <- sub("\\.\\d+$", "", result_LPS_compare$Gene_id)

# Extract the cleaned Ensembl IDs
ids <- result_LPS_compare$Ensembl_id

# Convert to gene symbols and Entrez IDs
gene_symbol_map <- transId(ids, 
                           transTo = c("symbol", "entrez"), 
                           org = "mouse",   # switch to mouse
                           unique = TRUE)   # ensures one-to-one mapping

result_LPS_compare <- merge(result_LPS_compare, 
                            gene_symbol_map, 
                            by.x = "Ensembl_id", 
                            by.y = "input_id")
str(result_LPS_compare)

## 'data.frame':    17835 obs. of  34 variables:
##  $ Ensembl_id       : 'AsIs' chr  "ENSMUSG00000000001" "ENSMUSG00000000028" "ENSMUSG00000000037" "ENSMUSG00000000056" ...
##  $ Gene_id          : 'AsIs' chr  "ENSMUSG00000000001.4" "ENSMUSG00000000028.15" "ENSMUSG00000000037.17" "ENSMUSG00000000056.7" ...
##  $ baseMean         : num  5932.73 145.14 8.27 234.74 1506.78 ...
##  $ log2FoldChange   : num  -0.0709 0.4631 2.2703 0.156 -1.2649 ...
##  $ lfcSE            : num  0.0426 0.1313 0.6674 0.1024 0.085 ...
##  $ pvalue           : num  9.35e-02 1.73e-04 3.17e-05 1.14e-01 6.48e-51 ...
##  $ padj             : num  1.41e-01 4.16e-04 8.29e-05 1.68e-01 9.20e-50 ...
##  $ Unstim_B6_Rep1   : num  5016.28 120.12 2.13 263.62 1119.34 ...
##  $ Unstim_B6_Rep2   : num  5044.72 109.35 2.78 263.17 1237.09 ...
##  $ Unstim_B6_Rep3   : num  5146.78 103.24 3.62 220.98 1315 ...
##  $ Unstim_B6_Rep4   : num  5307.09 112.56 3.99 234.7 1363.5 ...
##  $ Unstim_c11.2_rep1: num  4824.158 224.699 0.858 346.482 1594.331 ...
##  $ Unstim_c11.2_rep2: num  4932.14 231.62 3.05 330.16 1659.96 ...
##  $ Unstim_c11.2_rep3: num  4922.69 253.95 4.52 338.9 1887.92 ...
##  $ Unstim_c11.2_rep4: num  4920.9 237.21 5.33 357.14 1739.5 ...
##  $ Unstim_PWD_rep1  : num  4885.9 181.2 20.9 223.1 641 ...
##  $ Unstim_PWD_rep2  : num  4964 178.7 19.8 242.4 670.7 ...
##  $ Unstim_PWD_rep3  : num  5007.9 172.2 26.2 207.7 526.8 ...
##  $ Unstim_PWD_rep4  : num  5281.4 165.2 22.1 221.9 591.7 ...
##  $ LPS_B6_rep1      : num  7681.26 82.95 1.43 183.06 2092.29 ...
##  $ LPS_B6_rep2      : num  6798.14 89.26 2.55 163.23 2113.02 ...
##  $ LPS_B6_rep3      : num  7887.46 75.05 4.25 131.69 2315.26 ...
##  $ LPS_B6_rep4      : num  7722.19 72.38 1.34 151.47 2265.32 ...
##  $ LPS_c11.2_Rep1   : num  5981.53 142.48 1.22 259.38 2630.31 ...
##  $ LPS_c11.2_Rep2   : num  5983.17 168.64 3.55 240.53 2248.24 ...
##  $ LPS_c11.2_Rep3   : num  5713.96 128.65 4.36 250.76 2286.24 ...
##  $ LPS_c11.2_Rep4   : num  5733.52 184.43 8.07 293.94 2237.39 ...
##  $ LPS_PWD_Rep1     : num  6980.5 106.5 14.3 170.1 988.3 ...
##  $ LPS_PWD_Rep2     : num  6936.6 112.9 15.3 166.8 1008.5 ...
##  $ LPS_PWD_Rep3     : num  7636.31 112.25 9.71 173.77 814.9 ...
##  $ LPS_PWD_Rep4     : num  7076.9 117.9 17.1 198.8 816 ...
##  $ diffexpressed    : chr  "Unchanged" "Unchanged" "Upregulated" "Unchanged" ...
##  $ symbol           : chr  "Gnai3" "Cdc45" "Scml2" "Narf" ...
##  $ entrezid         : chr  "14679" "12544" "107815" "67608" ...

#write results
#treatment groups being compared + full matrix + date
#treatment groups being compared + _log2fc1_fdr05_DEGlist + date

write.csv(result_LPS_compare, 
          "PWD-LPSvsB6-LPS_full_matrix_040625.csv", 
          row.names = FALSE)

ggplot(result_LPS_compare) +
        geom_point(aes(x=log2FoldChange, 
                       y=-log10(padj), 
                       color=diffexpressed)) + 
  theme_classic() + 
        ggtitle("LPS treated (PWD vs B6)") +
        xlab("Log2 fold change") + 
        ylab("-Log10(padj)") +
        theme(legend.position = "right",
              plot.title = element_text(size = rel(1.25), hjust = 0.5), 
              axis.title = element_text(size = rel(1.25))) +
  scale_color_manual(values = c("black", "grey", "red"),
                     labels = c("Downregulated", "Not Significant", "Upregulated")) +
  scale_x_continuous(limits = c(-15, 15)) + 
  coord_cartesian(ylim=c(0, 300), clip = "off")

#install.packages("ggrepel")  # only if you haven't installed it
library(ggrepel)

genes_to_label <- c("Mucl1", "Apol7c", "Ptx3", "Cfh", "Ifi202b", "Arnt2")  # example gene symbols

result_LPS_compare$label <- ifelse(result_LPS_compare$symbol %in% genes_to_label,
                                   result_LPS_compare$symbol,
                                   NA)
#Adds a label column to your result table

ggplot(result_LPS_compare) +
  geom_point(aes(x = log2FoldChange, y = -log10(padj), color = diffexpressed, alpha = 0.8)) +
  geom_text_repel(aes(x = log2FoldChange, 
                      y = -log10(padj), 
                      label = label),
                  size = 3, 
                  max.overlaps = Inf) +
  theme_classic() + 
  ggtitle("LPS treated (PWD vs B6)") +
  xlab("Log2 fold change") + 
  ylab("-Log10(padj)") +
  theme(legend.position = "right",
        plot.title = element_text(size = rel(1.25), hjust = 0.5), 
        axis.title = element_text(size = rel(1.25))) +
  scale_color_manual(values = c("black", "grey", "red"),
                     labels = c("Downregulated", "Not Significant", "Upregulated")) +
  scale_x_continuous(limits = c(-15, 15)) + 
  coord_cartesian(ylim = c(0, 300), clip = "off")

Why does a smiley face volcano plot happen?

We are plotting:

X-axis: log2 fold change (shrunk, using lfcShrink(type = “ashr”))
Y-axis: -log10(pvalue) or -log10(padj)

What causes the smile?

Very small p-values for small log2FCs
Many genes with very small raw p-values

Session info

sessionInfo()

## R version 4.4.3 (2025-02-28)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sequoia 15.3
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] fstcore_0.10.0              ggplotify_0.1.2            
##  [3] patchwork_1.3.0             pathview_1.46.0            
##  [5] msigdbr_10.0.1              cowplot_1.1.3              
##  [7] knitr_1.50                  enrichplot_1.26.6          
##  [9] ggnewscale_0.5.1            DOSE_4.0.1                 
## [11] clusterProfiler_4.14.6      EnhancedVolcano_1.24.0     
## [13] ggrepel_0.9.6               RColorBrewer_1.1-3         
## [15] pheatmap_1.0.12             genekitr_1.2.8             
## [17] ggplot2_3.5.1               tidyr_1.3.1                
## [19] dplyr_1.1.4                 DESeq2_1.46.0              
## [21] SummarizedExperiment_1.36.0 Biobase_2.66.0             
## [23] MatrixGenerics_1.18.1       matrixStats_1.5.0          
## [25] GenomicRanges_1.58.0        GenomeInfoDb_1.42.3        
## [27] IRanges_2.40.1              S4Vectors_0.44.0           
## [29] BiocGenerics_0.52.0        
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.4.3           later_1.4.1             bitops_1.0-9           
##   [4] urltools_1.7.3          tibble_3.2.1            R.oo_1.27.0            
##   [7] triebeard_0.4.1         polyclip_1.10-7         graph_1.84.1           
##  [10] XML_3.99-0.18           lifecycle_1.0.4         mixsqp_0.3-54          
##  [13] lattice_0.22-6          MASS_7.3-65             magrittr_2.0.3         
##  [16] openxlsx_4.2.8          sass_0.4.9              rmarkdown_2.29         
##  [19] jquerylib_0.1.4         yaml_2.3.10             remotes_2.5.0          
##  [22] httpuv_1.6.15           ggtangle_0.0.6          zip_2.3.2              
##  [25] sessioninfo_1.2.3       pkgbuild_1.4.7          ggvenn_0.1.10          
##  [28] DBI_1.2.3               abind_1.4-8             pkgload_1.4.0          
##  [31] zlibbioc_1.52.0         purrr_1.0.4             R.utils_2.13.0         
##  [34] RCurl_1.98-1.17         ggraph_2.2.1            yulab.utils_0.2.0      
##  [37] tweenr_2.0.3            GenomeInfoDbData_1.2.13 irlba_2.3.5.1          
##  [40] tidytree_0.4.6          codetools_0.2-20        DelayedArray_0.32.0    
##  [43] xml2_1.3.8              ggforce_0.4.2           tidyselect_1.2.1       
##  [46] aplot_0.2.5             UCSC.utils_1.2.0        farver_2.1.2           
##  [49] viridis_0.6.5           rmdformats_1.0.4        jsonlite_2.0.0         
##  [52] fst_0.9.8               ellipsis_0.3.2          tidygraph_1.3.1        
##  [55] tools_4.4.3             progress_1.2.3          treeio_1.30.0          
##  [58] Rcpp_1.0.14             glue_1.8.0              gridExtra_2.3          
##  [61] SparseArray_1.6.2       xfun_0.51               qvalue_2.38.0          
##  [64] usethis_3.1.0           withr_3.0.2             fastmap_1.2.0          
##  [67] truncnorm_1.0-9         digest_0.6.37           R6_2.6.1               
##  [70] mime_0.13               gridGraphics_0.5-1      colorspace_2.1-1       
##  [73] GO.db_3.20.0            RSQLite_2.3.9           R.methodsS3_1.8.2      
##  [76] generics_0.1.3          data.table_1.17.0       prettyunits_1.2.0      
##  [79] graphlayouts_1.2.2      httr_1.4.7              htmlwidgets_1.6.4      
##  [82] S4Arrays_1.6.0          pkgconfig_2.0.3         gtable_0.3.6           
##  [85] blob_1.2.4              XVector_0.46.0          htmltools_0.5.8.1      
##  [88] geneset_0.2.7           profvis_0.4.0           bookdown_0.42          
##  [91] fgsea_1.32.0            scales_1.3.0            png_0.1-8              
##  [94] ashr_2.2-63             ggfun_0.1.8             rstudioapi_0.17.1      
##  [97] reshape2_1.4.4          nlme_3.1-168            org.Hs.eg.db_3.20.0    
## [100] cachem_1.1.0            stringr_1.5.1           parallel_4.4.3         
## [103] miniUI_0.1.1.1          AnnotationDbi_1.68.0    pillar_1.10.1          
## [106] grid_4.4.3              vctrs_0.6.5             urlchecker_1.0.1       
## [109] promises_1.3.2          xtable_1.8-4            Rgraphviz_2.50.0       
## [112] evaluate_1.0.3          KEGGgraph_1.66.0        invgamma_1.1           
## [115] cli_3.6.4               locfit_1.5-9.12         compiler_4.4.3         
## [118] rlang_1.1.5             crayon_1.5.3            SQUAREM_2021.1         
## [121] labeling_0.4.3          plyr_1.8.9              fs_1.6.5               
## [124] stringi_1.8.7           viridisLite_0.4.2       BiocParallel_1.40.0    
## [127] assertthat_0.2.1        babelgene_22.9          munsell_0.5.1          
## [130] Biostrings_2.74.1       lazyeval_0.2.2          devtools_2.4.5         
## [133] pacman_0.5.1            GOSemSim_2.32.0         Matrix_1.7-3           
## [136] europepmc_0.4.3         hms_1.1.3               bit64_4.6.0-1          
## [139] KEGGREST_1.46.0         shiny_1.10.0            igraph_2.1.4           
## [142] memoise_2.0.1           bslib_0.9.0             ggtree_3.14.0          
## [145] fastmatch_1.1-6         bit_4.6.0               ape_5.8-1              
## [148] gson_0.1.0

Citation

title: “Analyzing RNA-Seq with DESeq2” author: “Michael I. Love, Simon Anders, and Wolfgang Huber”

Note: if you use DESeq2 in published research, please cite:

Love, M.I., Huber, W., Anders, S. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15:550. 10.1186/s13059-014-0550-8

Sabikunnahar, B., Snyder, J. P., Rodriguez, P. D., Sessions, K. J., Amiel, E., Frietze, S. E., & Krementsov, D. N. (2025). Natural genetic variation in wild-derived mice controls host survival and transcriptional responses during endotoxic shock. ImmunoHorizons, 9(5), vlaf007. https://doi.org/10.1093/immhor/vlaf007

Analyzing RNA-Seq Part II