Analyzing RNA-Seq with DESeq2
Getting Started
Code chunk options
The knitr package provides a lot of options for customizing nearly
all components of code chunk, including the text outputs, source codes,
and plots. Under the text outputs, you can find echo
. When
echo = TRUE
this means to display the source code in the
output documents. When warning = FALSE
this means that all
warnings will be suppressed.
Install the required R libraries
Libraries loaded
Loading the packages required for subsequent analyses. In the future,
we will be using pacman::p_load
. Please look to have these
packages installed for use by next week.
Introduction to DESeq2 with HTSeq-counts files
As a reminder, we have previously assigned all the sequencing reads
to a given gene by directly counting reads overlapping with gene loci
using HTSeq-count. Now, the next step is to analyze the counts data for
the detection of differentially expressed genes (DEGs). The package
DESeq2
provides methods to test for differential expression
by use of negative binomial generalized linear models.
Overview
To use DESeq2
it requires the following:
counts matrix
table of sample information
design indicates how to model the samples
Why un-normalized counts?
As input, the DESeq2
package expects count data to be in
the form of a matrix of integer values. The values in the matrix should
be un-normalized counts. The DESeq2 model internally corrects for
library size, so transformed or normalized values should not be used as
input. To normalize for sequencing depth and RNA composition,
DESeq2
uses the median of ratios method. On the user-end
there is only one step, but on the back-end there are multiple steps
involved, including:
Step 1: creating a pseudo-reference sample (row-wise geometric mean)
Step 2: calculates ratio of each sample to the reference
- For every gene in a sample, the ratios (sample/ref) are calculated (as shown below). This is performed for each sample in the dataset. Since the majority of genes are not differentially expressed, the majority of genes in each sample should have similar ratios within the sample.
Step 3: calculate the normalization factor for each sample (size factor)
- The median value (column-wise for the above table) of all ratios for a given sample is taken as the normalization factor (size factor) for that sample, as calculated below.
Step 4: calculate the normalized count values using the normalization factor
The DESeqDataSet
The object used by the DESeq2
package to store
the (1) read counts, (2) table of sample information , and (3) design is
the DESeqDataSet, this is usually represented as
dds
.
dds = 1 + 2 + 3
dds = readcounts, table of sample, design
There are multiple ways of constructing a DESeqDataSet,
depending on what pipeline was used upstream of DESeq2 to generated
counts or estimated counts. For example, if you used Salmon to generate
read-counts, you would use DESeqDataSetFromTximport()
- From transcript abundance files and tximport
-
DESeqDataSetFromTximport()
- From a count matrix -
DESeqDataSetFromMatrix
- From htseq-count files -
DESeqDataSetFromHTSeqCount
We will be discussing how to input files output from
htseq-count
program.
About the Data
Our end product was the creation of counts files.So what does the count data actually represent?
The count data used for differential expression analysis represents the number of sequence reads that originated from a particular gene. The higher the number of counts, the more reads associated with that gene, and the assumption that there was a higher level of expression of that gene in the sample. Our overall goal is to identify a list of genes that are statistically distinct between the groups being compared.
For today’s lesson we will be working with RNA-seq data from a recent publication in Nature Communications by Shan et al. (2021).
T cell identity is established during thymic development, but how it is maintained in the periphery remains unknown.
The authors in this paper discover that by ablating Tcf1 and Lef1 transcription factors in mature CD8+ T cells this induces the expression of genes from non-T cell lineages. We will focus on analyzing the Tcf7 dataset. Specifically, Tcf7 fl/fl mice were crossed with hCD2-Cre mice to create mature CD8+ T cells that lacked Tcf7. The protein coded by Tcf7 is TCF1.
htseq-count input
Let’s take a closer look at the contents of the
ENSG_counts
folder:
img
folder contains imagescounts
folder contains outputs from HTSeq2 in the form of a.count
file- The metadata table called
GSE164713_tcf1_anno.csv
To input our metadata file:
## samplename filename condition replicate
## 1 WT_CD8_rep1 SRR13423162.ENSG.count WT_CD8 1
## 2 WT_CD8_rep2 SRR13423163.ENSG.count WT_CD8 2
## 3 WT_CD8_rep3 SRR13423164.ENSG.count WT_CD8 3
## 4 TCF1_KO_CD8_rep1 SRR13423165.ENSG.count TCF1_KO_CD8 1
## 5 TCF1_KO_CD8_rep2 SRR13423166.ENSG.count TCF1_KO_CD8 2
## 6 TCF1_KO_CD8_rep3 SRR13423167.ENSG.count TCF1_KO_CD8 3
The order here actually matters. Let’s check why by turning to the
DESeqDataSetFromHTSeqCount
manual:
## samplename filename condition replicate
## 1 WT_CD8_rep1 SRR13423162.ENSG.count WT_CD8 1
## 2 WT_CD8_rep2 SRR13423163.ENSG.count WT_CD8 2
## 3 WT_CD8_rep3 SRR13423164.ENSG.count WT_CD8 3
## 4 TCF1_KO_CD8_rep1 SRR13423165.ENSG.count TCF1_KO_CD8 1
## 5 TCF1_KO_CD8_rep2 SRR13423166.ENSG.count TCF1_KO_CD8 2
## 6 TCF1_KO_CD8_rep3 SRR13423167.ENSG.count TCF1_KO_CD8 3
## 'data.frame': 6 obs. of 4 variables:
## $ samplename: chr "WT_CD8_rep1" "WT_CD8_rep2" "WT_CD8_rep3" "TCF1_KO_CD8_rep1" ...
## $ filename : chr "SRR13423162.ENSG.count" "SRR13423163.ENSG.count" "SRR13423164.ENSG.count" "SRR13423165.ENSG.count" ...
## $ condition : chr "WT_CD8" "WT_CD8" "WT_CD8" "TCF1_KO_CD8" ...
## $ replicate : int 1 2 3 1 2 3
Next, we want to specify the condition column as a factor. By doing so, this will allow us to categorize the data and store it as a level.
Running DESeq2 requires DESeqDataSet
dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
directory = "counts/",
design = ~ condition)
dds
## class: DESeqDataSet
## dim: 57010 6
## metadata(1): version
## assays(1): counts
## rownames(57010): ENSMUSG00000000001.5 ENSMUSG00000000003.16 ...
## ENSMUSG00002076991.1 ENSMUSG00002076992.1
## rowData names(0):
## colnames(6): WT_CD8_rep1 WT_CD8_rep2 ... TCF1_KO_CD8_rep2
## TCF1_KO_CD8_rep3
## colData names(2): condition replicate
The DESeqDataSet object must have an associated design formula. The design formula expresses the variables which will be used in modeling. The formula should include a tilde (~) followed by the variables with plus signs between them if required.
Note: In order to benefit from the default settings of the package, one should put the variable of interest at the end of the formula and make sure the control level is the first level.
In other words, since we want to measure the effect of the condition between the two treatments this should go at the end and if we wanted to control for potential batch differences due to how these replicates were made we would add this as the first variable(~batch+condition).
Pre-filtering
While it is not necessary to pre-filter low count genes before
running the DESeq2 functions, there are two reasons which make
pre-filtering useful: by removing rows in which there are very few
reads, we reduce the memory size of the dds
data object,
and we increase the speed of the transformation and testing functions
within DESeq2. It can also improve visualizations, as features with no
information for differential expression are not plotted.
Below we perform a minimal pre-filtering to keep only rows that have at least 10 reads total.
Note on factor levels
By default, R will choose a reference level for factors based on alphabetical order. Therefore, if you never tell the DESeq2 functions which level you want to compare against (e.g. which level represents the control group), the comparisons will be based on the alphabetical order of the levels.
use relevel to specify the reference level:
Library size differences
With DESeq2
, size factors are calculated using the
estimateSizeFactors()
function. The size factors estimated
by this function combines an adjustment for differences in library sizes
with an adjustment for differences in the RNA composition of the
samples. The latter is important due to the compositional nature of
RNA-seq data. There is a fixed number of reads to distribute between the
genes, and if a single (or a few) very highly expressed gene consume a
large part of the reads, all other genes will consequently receive very
low counts.
dds <- estimateSizeFactors(dds)
# Plot the size factors against library size
# and look for any patterns by group:
ggplot(data.frame(libSize = colSums(assay(dds)),
sizeFactor = sizeFactors(dds),
Group = dds$condition),
aes(x = libSize, y = sizeFactor, col = condition)) +
geom_point(size = 5) + theme_bw() +
labs(x = "Library size", y = "Size factor")
Running Differential expression analysis
The standard differential expression analysis steps are wrapped into
a single function, DESeq()
. The estimation steps performed
by this function are described in the manual page for
?DESeq
.
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
: Dispersion estimates in DESeq2 quantify the biological and technical variability in RNA-seq count data.dds <- nbinomWaldTest(dds)
: fit a generalized linear model (GLM) and compute log2 fold changes corresponding to the variables of the design matrix.
Explore results for specific contrasts
Using the results()
function, the user can specify the
comparison to build a results table for with the contrast argument.
Details about the comparison are printed to the console, directly
above the results table. The text, condition KO vs WT
,
tells you that the estimates are of the logarithmic fold change
log2(KO/WT) i.e. log2(treated/untreated).
## log2 fold change (MLE): condition TCF1_KO_CD8 vs WT_CD8
## Wald test p-value: condition TCF1 KO CD8 vs WT CD8
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000000001.5 3904.95911 0.0746508 0.0569362 1.31113 0.18981359
## ENSMUSG00000000028.16 66.29971 0.3183793 0.2498710 1.27417 0.20260163
## ENSMUSG00000000037.18 4.93848 -0.2251786 0.9064074 -0.24843 0.80380188
## ENSMUSG00000000056.8 833.77631 -0.2260724 0.0819141 -2.75987 0.00578239
## ENSMUSG00000000078.8 25622.39656 0.3500977 0.1706392 2.05168 0.04020041
## ENSMUSG00000000085.17 377.59854 -0.1435168 0.1126148 -1.27440 0.20252037
## padj
## <numeric>
## ENSMUSG00000000001.5 0.3837944
## ENSMUSG00000000028.16 0.4010438
## ENSMUSG00000000037.18 0.9056941
## ENSMUSG00000000056.8 0.0241046
## ENSMUSG00000000078.8 0.1190318
## ENSMUSG00000000085.17 0.4010438
## [1] 19401 6
_baseMean
- Mean of the normalized counts for all samples in the comparison, for a given gene_log2FoldChange
- log2 fold change between Tcf1 KO and WT_pvalue
- Wald test P value_padj
- Benjamini-Hochberg adjusted Wald test P value (P-value after multiple test correction)
To explore the output of the results() function we can also use the
summary()
function.
##
## out of 19401 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 2713, 14%
## LFC < 0 (down) : 2716, 14%
## outliers [1] : 6, 0.031%
## low counts [2] : 2633, 14%
## (mean count < 4)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Independent Filtering and log-fold shrinkage
A good check is to explore the relationship between log2fold changes,
significant DE genes and the genes mean count. DESeq2 provides a useful
function to do so, plotMA()
.
We can see that genes with a low mean count tend to have larger log fold changes. This is caused by counts from lowly expressed genes tending to be very noisy. We can shrink the log fold changes of these genes with low mean and high dispersion, as they contain little information.
##
## FALSE TRUE
## 12110 4652
Output Results
We will now output our results out of R to have a stand-alone file.
First, we will merge the stats (contained within result
object) with the normalized counts (contained within dds
object). The final object created is called resdata
and now
since it is a dataframe we can create output a tabular file with
write.csv
.
resdata <- merge(as.data.frame(result),
as.data.frame(counts(dds, normalized=TRUE)),
by="row.names",
sort=FALSE)
names(resdata)[1] <- "Gene_id"
head(resdata)
## Gene_id baseMean log2FoldChange lfcSE pvalue padj
## 1 ENSMUSG00000015143.16 6770.207 -2.944392 0.05837777 0 0
## 2 ENSMUSG00000015437.6 1365.097 6.205747 0.15795883 0 0
## 3 ENSMUSG00000021756.13 5340.423 -1.929679 0.04968811 0 0
## 4 ENSMUSG00000025163.7 4445.134 2.450195 0.06249997 0 0
## 5 ENSMUSG00000035042.3 2816.472 6.873397 0.14335657 0 0
## 6 ENSMUSG00000039521.14 1506.650 6.551260 0.16941081 0 0
## WT_CD8_rep1 WT_CD8_rep2 WT_CD8_rep3 TCF1_KO_CD8_rep1 TCF1_KO_CD8_rep2
## 1 12433.34924 11759.60918 11762.92037 1518.717 1477.525
## 2 30.97347 42.89393 34.62547 2565.046 2960.111
## 3 8406.79850 8492.04492 8490.81374 2237.728 2187.952
## 4 1421.78206 1263.94113 1434.79279 7804.848 7529.308
## 5 48.95806 40.03433 54.10229 5114.678 5337.308
## 6 32.97176 32.40875 29.21524 2819.828 3377.057
## TCF1_KO_CD8_rep3
## 1 1669.121
## 2 2556.930
## 3 2227.202
## 4 7216.133
## 5 6303.749
## 6 2748.418
#write results
write.csv(resdata,
file="output/TCF1KOvsWT_unfiltered_matrix.csv",
row.names = FALSE)
The next code chunk makes use of the tidyverse
package.
We are using this package as a way of getting a list of differentially
expressed genes (DEGs) only. We are filtering the 19,000+ rows to only
rows which meet the criteria of having an absolute log2FC of 1 and padj
of less than 0.05.
In the code below, output the DE results.
Label the file TCF1KOvsWT_log2fc1_fdr05_DEGlist.csv
.
#Get DEG up/down lists
resdata <- as.data.frame(result) |>
dplyr::filter(abs(log2FoldChange) > 1 & padj < 0.05) |>
tibble::rownames_to_column("Gene_id")
#write results
write.csv(resdata,
"output/TCF1KOvsWT_log2fc1_fdr05_DEGlist.csv",
row.names = FALSE)
head(resdata)
## Gene_id baseMean log2FoldChange lfcSE pvalue padj
## 1 ENSMUSG00000015143.16 6770.207 -2.944392 0.05837777 0 0
## 2 ENSMUSG00000015437.6 1365.097 6.205747 0.15795883 0 0
## 3 ENSMUSG00000021756.13 5340.423 -1.929679 0.04968811 0 0
## 4 ENSMUSG00000025163.7 4445.134 2.450195 0.06249997 0 0
## 5 ENSMUSG00000035042.3 2816.472 6.873397 0.14335657 0 0
## 6 ENSMUSG00000039521.14 1506.650 6.551260 0.16941081 0 0
Data transformation and visualization
Principal components analysis (PCA)
In order to test for differential analysis, we used raw counts. These matrices are so large that we need convenient ways to extract the important information from these large data matrices therefore it is useful to transform the data.
In using principal component analysis (PCA):
- The data is reduced to smaller matrices so they can more easily be examined, plotted and interpreted.
- The most important factors are extracted (principal components).
Below we are using an variance stabilizing transformation (vst) (Tibshirani 1988; Huber et al. 2003; Anders and Huber 2010) or regularized logarithm (rlog) (Love, Huber, and Anders 2014). Both transformations will produce transformed data on the log2 scale which has been normalized with respect to library size or other normalization factors.
Using the transformed data we will generate a few outputs.
- PCA plot: The PCA plot will show the samples in a 2D plane spanned by the first two principal components. This type of plot is useful for visualizing the overall effect of experimental covariates (ex. age) and batch effects (ex. day sample was prepared).
It is also possible to customize the PCA plot using the ggplot function. Below we are adding shapes to represent the different days the samples were created.
pcaData <- plotPCA(vstcounts, intgroup=c("condition"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=condition)) +
geom_point(size=2) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
coord_fixed()
Now, we are making sure the x and y axis are similar using
ylim
argument
#pdf(file = "output/PCA_plot_TCF1_RNAseq.pdf", height = 6, width = 6)
pcaData <- plotPCA(vstcounts, intgroup=c("condition"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
pca <- ggplot(pcaData, aes(PC1, PC2, color=condition, shape = condition)) +
geom_point(size=4, alpha=0.8) + # adding transparency
ylim(-10,10) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
coord_fixed() +
ggtitle("PCA of TCF1 RNASeq data") +
scale_color_brewer(palette = "Set1") +
theme_classic(base_size = 10) +
theme(legend.position = "right", legend.title = element_blank(),
plot.title = element_text(hjust=0.5, face = "bold"))
#dev.off()
Now that we got it looking the way we want it, we can take some time to interpret the plot. Each data point represents one sample. The x-axis typically PC1, explains the largest source of variation in the dataset. We can see that PC1 is primarily explained by knocking out the gene Tcf7 (encodes Tcf1) in CD8 T cells. The next largest source of variance within the dataset, or PC2, is explained by one sample created on day1 in the WT group. Overall, the groups are clustering as expected.
Finally, we can generate a correlation heatmap. This is a type of plot used to visualize the strength of relationships between samples.
#pdf(file = "output/correlation_plot_TCF1_RNAseq.pdf", height = 3, width = 4)
vst_mat <- assay(vstcounts)
vst_cor <- cor(vst_mat)
annotation_heatmap <- data.frame(row.names=colnames(vst_mat), condition)
cor_heatmap <- as.ggplot(pheatmap(vst_cor,
annotation_col = annotation_heatmap,
main = "sample correlation",
show_rownames = F,
show_colnames = F,
fontsize = 6, # Decrease overall font size
fontsize_col = 6,
fontsize_row = 6,
width = 3,
height = 4)) # Decrease legend size
library(ggplot2)
library(pheatmap)
library(grid)
library(cowplot)
library(magick)
# Save heatmap to a PDF with controlled size
pdf("temp_heatmap.pdf", width = 4, height = 2)
pheatmap(vst_cor,
annotation_col = annotation_heatmap,
show_rownames = FALSE,
show_colnames = FALSE,
fontsize = 8,
fontsize_col = 6,
fontsize_row = 6,
legend_cex = 0.6)
dev.off()
# Convert PDF to image
heatmap_img <- image_read_pdf("temp_heatmap.pdf")
# Convert image to grob
heatmap_grob <- rasterGrob(as.raster(heatmap_img), interpolate = TRUE)
# Combine PCA plot and heatmap, ensuring heatmap is half the height
FigAB <- plot_grid(pca,
heatmap_grob,
labels = c("(A)", "(B)"),
ncol = 2,
rel_widths = c(1, 1),
rel_heights = c(1, 0.5), # Heatmap is half the height
align = "hv")
# Save final figure
ggsave("QC_figure.pdf", FigAB, height = 6, width = 8, dpi = 300, units = "in", scale = 1)
Session info
## R version 4.4.3 (2025-02-28)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sequoia 15.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggpubr_0.6.0 ggplotify_0.1.2
## [3] patchwork_1.3.0 lubridate_1.9.4
## [5] forcats_1.0.0 stringr_1.5.1
## [7] dplyr_1.1.4 purrr_1.0.4
## [9] readr_2.1.5 tidyr_1.3.1
## [11] tibble_3.2.1 tidyverse_2.0.0
## [13] ggplot2_3.5.1 pheatmap_1.0.12
## [15] RColorBrewer_1.1-3 DESeq2_1.46.0
## [17] SummarizedExperiment_1.36.0 Biobase_2.66.0
## [19] MatrixGenerics_1.18.1 matrixStats_1.5.0
## [21] GenomicRanges_1.58.0 GenomeInfoDb_1.42.3
## [23] IRanges_2.40.1 S4Vectors_0.44.0
## [25] BiocGenerics_0.52.0 knitr_1.50
##
## loaded via a namespace (and not attached):
## [1] rlang_1.1.5 magrittr_2.0.3 compiler_4.4.3
## [4] vctrs_0.6.5 pkgconfig_2.0.3 crayon_1.5.3
## [7] fastmap_1.2.0 backports_1.5.0 XVector_0.46.0
## [10] rmdformats_1.0.4 labeling_0.4.3 rmarkdown_2.29
## [13] tzdb_0.5.0 UCSC.utils_1.2.0 xfun_0.51
## [16] zlibbioc_1.52.0 cachem_1.1.0 jsonlite_2.0.0
## [19] DelayedArray_0.32.0 BiocParallel_1.40.0 broom_1.0.8
## [22] parallel_4.4.3 R6_2.6.1 bslib_0.9.0
## [25] stringi_1.8.7 car_3.1-3 jquerylib_0.1.4
## [28] numDeriv_2016.8-1.1 Rcpp_1.0.14 bookdown_0.42
## [31] Matrix_1.7-3 timechange_0.3.0 tidyselect_1.2.1
## [34] rstudioapi_0.17.1 abind_1.4-8 yaml_2.3.10
## [37] codetools_0.2-20 lattice_0.22-6 plyr_1.8.9
## [40] withr_3.0.2 coda_0.19-4.1 evaluate_1.0.3
## [43] gridGraphics_0.5-1 pillar_1.10.1 carData_3.0-5
## [46] generics_0.1.3 emdbook_1.3.13 hms_1.1.3
## [49] munsell_0.5.1 scales_1.3.0 glue_1.8.0
## [52] tools_4.4.3 apeglm_1.28.0 locfit_1.5-9.12
## [55] ggsignif_0.6.4 fs_1.6.5 mvtnorm_1.3-3
## [58] grid_4.4.3 bbmle_1.0.25.1 bdsmatrix_1.3-7
## [61] colorspace_2.1-1 GenomeInfoDbData_1.2.13 Formula_1.2-5
## [64] cli_3.6.4 S4Arrays_1.6.0 gtable_0.3.6
## [67] rstatix_0.7.2 yulab.utils_0.2.0 sass_0.4.9
## [70] digest_0.6.37 SparseArray_1.6.2 farver_2.1.2
## [73] htmltools_0.5.8.1 lifecycle_1.0.4 httr_1.4.7
## [76] MASS_7.3-65
Citation
title: “Analyzing RNA-Seq with DESeq2” author: “Michael I. Love, Simon Anders, and Wolfgang Huber”
Note: if you use DESeq2 in published research, please cite:
Love, M.I., Huber, W., Anders, S. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15:550. 10.1186/s13059-014-0550-8