# CRAN
#install.packages(c("knitr", "RColorBrewer", "pheatmap", "remotes", "pacman", #"ggupset"))
# Bioconductor
#BiocManager::install(c("apeglm", "org.Mm.eg.db", "pathview", "GO.db"), ask = FALSE)
# GitHub
#remotes::install_github(c(
# "YuLab-SMU/clusterProfiler",
# "eliocamp/ggnewscale"
#))
# loading required packages
pacman::p_load(DESeq2, dplyr, tidyr, ggplot2, genekitr, pheatmap,
RColorBrewer, EnhancedVolcano, clusterProfiler, DOSE,
ggnewscale, enrichplot, knitr, cowplot, msigdbr, pathview,
patchwork, ggplotify)
Other packages you might not have installed but are required for this tutorial. Check to see if these are installed, if not, go ahead and install.
library(org.Mm.eg.db)
library(ggrepel)
library(ggplot2)
library(pathview)
library(GO.db) #takes a while to install
For today’s lesson we will be working with RNA-seq data from another recent publication in ImmunoHorizons by Sabikunnahar et al. (2025).
Summary: Innate immune cells rapidly respond to microbial threats, but if unchecked, these responses can harm the host, as seen in septic shock. To explore how genetic variation influences these responses, researchers compared standard lab mice (C57BL/6) with genetically diverse wild-derived mice (PWD).
Experimental Design: The authors isolated bone marrow derived dendritic cells from B6, c11.2, and PWD mice and then either left them unstimulated or stimulated with LPS.
This experiment has:
untreated, LPSB6, c11.2, and
PWDThe goal:
ENSM_partIII folder contentsContents of the folder:
PWD-LPSvsB6-LPS_normalized_matrix.csvPWD-LPSvsB6-LPS_DEG_list.csvPWD_annotation.csv file we will use later for the
heatmap sectiondds.RDS file that contains the results of the DESeq2
analysis performed on Monday.We have discovered that ggplot2 has incredible
functionality and versatility; however, it is not always the best choice
for all graphics. There are a plethora of different packages that
specialize in particular types of figures. Today we will concentrate on
creating hierarchical heatmaps using one of these
packages called pheatmap.
pheatmap is a package used for drawing pretty heatmaps
in R. This function has several advantages including the ability to
control heatmap dimensions and appearance.
#install.packages(pheatmap) #install pheatmap
library(pheatmap)
dplyr to heatmap generationdplyr basicsdpylrfilter() allows you to keep rows based on the values of
the columns.summarise() reduces multiple values down to a single
summary.arrange() changes the ordering of the rows.slice_head() and slice_tail() select the
first or last rows.mutate() adds new variables that are functions of
existing variablesselect() picks variables based on their names.Step 1 Load the DE matrix. containing list of DEGs
into an object called deg_matrix. After loading, make sure
that deg_matrix is being read as data frame.
deg_matrix <- read.csv("PWD-LPSvsB6-LPS_DEG_list.csv", header = T)
deg_matrix <- data.frame(deg_matrix)
Step 2 Selecting columns of interest.
heatmap_matrix <- dplyr::select(deg_matrix,
symbol,
Unstim_B6_Rep1:LPS_PWD_Rep4)
heatmap_matrix <- data.frame(heatmap_matrix, row.names = 1)
Step 3 Read in annotation file called
PWD_annotation.csv.
heatmap_anno <- read.csv("PWD_annotation.csv", header = T)
heatmap_anno
## sample_id genotype treatment
## 1 Unstim_B6_Rep1 B6 Unstimulated
## 2 Unstim_B6_Rep2 B6 Unstimulated
## 3 Unstim_B6_Rep3 B6 Unstimulated
## 4 Unstim_B6_Rep4 B6 Unstimulated
## 5 Unstim_c11.2_rep1 c11.2 Unstimulated
## 6 Unstim_c11.2_rep2 c11.2 Unstimulated
## 7 Unstim_c11.2_rep3 c11.2 Unstimulated
## 8 Unstim_c11.2_rep4 c11.2 Unstimulated
## 9 Unstim_PWD_rep1 PWD Unstimulated
## 10 Unstim_PWD_rep2 PWD Unstimulated
## 11 Unstim_PWD_rep3 PWD Unstimulated
## 12 Unstim_PWD_rep4 PWD Unstimulated
## 13 LPS_B6_rep1 B6 LPS
## 14 LPS_B6_rep2 B6 LPS
## 15 LPS_B6_rep3 B6 LPS
## 16 LPS_B6_rep4 B6 LPS
## 17 LPS_c11.2_Rep1 c11.2 LPS
## 18 LPS_c11.2_Rep2 c11.2 LPS
## 19 LPS_c11.2_Rep3 c11.2 LPS
## 20 LPS_c11.2_Rep4 c11.2 LPS
## 21 LPS_PWD_Rep1 PWD LPS
## 22 LPS_PWD_Rep2 PWD LPS
## 23 LPS_PWD_Rep3 PWD LPS
## 24 LPS_PWD_Rep4 PWD LPS
Step 4 Assign the sample names to be the row names.
The row names in the annotation file need to be read as sample names and also match the column names of our counts data. Then we will “select” the genotype and treatment column.
# Assign the sample names to be the row names
rownames(heatmap_anno) <- heatmap_anno$sample_id
heatmap_anno <- dplyr::select(heatmap_anno, genotype, treatment)
The object heatmap_anno will be the data frame for
the column annotations.
If you wanted to add more column annotations, this is the point during the workflow that you would do it.
Step 5 Assign factor levels
Factor levels will determine the color automatically assigned to the
group. We can manually specify the levels using the
factor() function.
# Re-level factors
heatmap_anno$genotype <- factor(heatmap_anno$genotype,
levels = c("B6", "c11.2", "PWD"))
heatmap_anno$treatment <- factor(heatmap_anno$treatment,
levels = c("Unstimulated", "LPS"))
Step 5 Provide customized colors
You can define customized colors to display the heatmap using the
brewer.pal() function from the RColorBrewer package. For
example:
# Colors for heatmap
heatmap_colors <- rev(brewer.pal(10, "RdBu"))
This creates a red-to-blue color gradient with 10 shades, and rev() reverses the order so that red represents higher values and blue represents lower values (or vice versa, depending on your data).
You can also specify custom colors for the annotations (e.g., for sample groups or conditions). These colors need to be provided as a named list, where each element matches the annotation values.
# Colors for denoting annotations
ann_colors <- list(genotype = c(B6 = "#7fc97f",
c11.2 = "#beaed4",
PWD = "#fdc086" ),
treatment = c(Unstimulated = "#a6cee3",
LPS = "#1f78b4"))
# Based on the colorbrewer2
# https://colorbrewer2.org/#type=qualitative&scheme=Paired&n=3
pheatmapBelow, are the major arguments required to plot with
pheatmap.
# BASIC USAGE
## DO NOT RUN
#pheatmap(numeric_matrix,
#color = ,
#annotation_col = ,
#annotation_colors = ,
#show_colnames = ,
#show_rownames = )
?pheatmap() #if you have questions
# We would like to remove column names and row names
# color = A custom color palette to represent heatmap values
# annotation_col = Adds metadata about the columns (samples), such as treatment group, time point, etc.
# annotation_color = Tells pheatmap how to color those annotations (e.g., red for "treated", blue for "control").
#show_colnames = shows/hides column names.
#show_rownames = shows/hides column names.
Using the information outlined above and in your global environment, fill in the (6) arguments required.
pheatmap(heatmap_matrix,
color = heatmap_colors,
annotation_col = heatmap_anno,
annotation_colors = ann_colors,
show_colnames = F,
show_rownames = F)
When you add scale = "row" in pheatmap(),
the heatmap displays Z-scores instead of the original normalized count
values.
Z-scores are calculated gene by gene by subtracting the mean and dividing by the standard deviation for each row.
This transformation is applied after clustering, so it doesn’t affect how rows or columns are grouped. Instead, it improves the visual contrast by standardizing each gene’s range of expression.
This is especially helpful when you want to highlight relative differences within each gene across samples, rather than comparing absolute expression values between genes.
pheatmap(heatmap_matrix,
color = heatmap_colors,
annotation_col = heatmap_anno,
annotation_colors = ann_colors,
show_colnames = F,
show_rownames = F,
scale = "row")
Major Take-away Some genes or features have naturally higher values than others—but that doesn’t mean they’re more important!
If you don’t scale, the clustering might group samples based on just the genes with the biggest raw numbers, rather than patterns of change.
Let’s add a few more arguments:
pheatmap(heatmap_matrix,
color = heatmap_colors,
annotation_col = heatmap_anno,
annotation_colors = ann_colors,
show_colnames = F,
show_rownames = F,
scale = "row",
annotation_names_col = F,
clustering_distance_cols = "euclidean",
clustering_method = "complete")
#cutree_rows = 4
Hierarchical clustering is a method for grouping similar “things” together based on how close they are to each other.
clustering_distance_cols = "euclidean"
This sets the distance metric for clustering the columns. It is asking, “How similar are these samples based on their overall expression patterns?”
Good default; works well with scaled data.
clustering_method = "complete"
This defines the method used to link clusters together in hierarchical clustering.
Tends to produce compact, tight clusters; default and reliable choice.
Next, we will convert this plot into a ggplot object and save it into
heatmap_final.
Go ahead and save heatmap_final as a PDF, width = 5,
height =6.
#pdf(heatmap_final, file = "heatmap_PWD-LPSvsB6-LPS.pdf", width = 5, height = 6)
heatmap_final <- as.ggplot(pheatmap(heatmap_matrix,
color = heatmap_colors,
annotation_col = heatmap_anno,
annotation_colors = ann_colors,
show_colnames = F,
show_rownames = F,
scale = "row",
annotation_names_col = F,
clustering_distance_cols = "euclidean",
clustering_method = "complete",
cutree_rows = 4))
#dev.off()
Scenario: Sam, another undergraduate in the Kremenstov lab, wants to visualize only the top 20 differentially expressed (DE) genes. Right now, the heatmap includes all 5,112 genes, which makes it hard to focus on specific genes of interest. Sam is especially interested in seeing whether Apobec3 is among the top 20.
Sam’s hypothesis: Apobec3 expression and function in dendritic cells differ between PWD and B6 mice, contributing to the differential immune responses observed in these strains.
Help Sam create a new heatmap that only displays the top 20 DE genes.
Hint: To create the heatmap for the top 20 genes,
first, you need to arrange the rows by the adjusted p-value (padj).
Then, select the top 20 rows of interest before creating the new
top20_heatmap_matrix.
top20_heatmap_matrix <- deg_matrix |>
arrange(padj) |>
slice_head(n = 20)
top20_heatmap_matrix <- dplyr::select(top20_heatmap_matrix,
symbol,
Unstim_B6_Rep1:LPS_PWD_Rep4)
top20_heatmap_matrix <- data.frame(top20_heatmap_matrix, row.names = 1)
pheatmap(top20_heatmap_matrix,
color = heatmap_colors,
annotation_col = heatmap_anno,
annotation_colors = ann_colors,
show_colnames = F,
show_rownames = T,
scale = "row",
annotation_names_col = F,
clustering_distance_cols = "euclidean",
clustering_method = "complete",
fontsize_row = 8)
The goal of functional analysis is to provide biological insight, so it’s important to interpret our results in the context of our experimental hypothesis: Genetically diverse, wild-derived mice (PWD) are expected to have an enhanced ability to suppress inflammation-related cell death compared to standard laboratory mice (B6). Therefore, we anticipate a down-regulation of processes or pathways related to cell death.
The Gene Ontology (GO) knowledgebase is the world’s largest source of information on the function of genes. GO terms are organized into three independent controlled vocabularies (ontologies) in a species-independent manner:
Biological process: refers to the biological role involving the gene or gene product, and could include “transcription”, “signal transduction”, and “apoptosis”. A biological process generally involves a chemical or physical change of the starting material or input.
Molecular function: represents the biochemical activity of the gene product, such activities could include “ligand”, “GTPase”, and “transporter”.
Cellular component: refers to the location in the cell of the gene product. Cellular components could include “nucleus”, “lysosome”, and “plasma membrane”.
Each GO term has a term name (e.g. DNA repair) and a unique term accession number (GO:0005125), and a single gene product can be associated with many GO terms.
We will be using clusterProfiler to perform over-representation analysis on GO terms associated with our list of significant genes.
Over-representation (or enrichment) analysis is a statistical method that determines whether genes from pre-defined sets (ex: those beloging to a specific GO term or KEGG pathway) are present more than would be expected (over-represented) in a subset of your data. In this case, the subset is your set of under or over expressed genes.
Clusterprofiler will take as input a significant gene list and a background gene list and performs statistical enrichment analysis using hypergeometric testing. The basic arguments allow the user to select the appropriate organism and GO ontology (BP, CC, MF) to test.
PWDvB6_LPS_data <- read.csv("PWD-LPSvsB6-LPS_normalized_matrix.csv", header = T)
head(PWDvB6_LPS_data)
## Ensembl_id Gene_id baseMean log2FoldChange
## 1 ENSMUSG00000000001 ENSMUSG00000000001.4 5932.726707 -0.07093008
## 2 ENSMUSG00000000028 ENSMUSG00000000028.15 145.143927 0.46312330
## 3 ENSMUSG00000000037 ENSMUSG00000000037.17 8.268551 2.27033900
## 4 ENSMUSG00000000056 ENSMUSG00000000056.7 234.738253 0.15596467
## 5 ENSMUSG00000000058 ENSMUSG00000000058.6 1506.777887 -1.26487297
## 6 ENSMUSG00000000078 ENSMUSG00000000078.7 10025.556928 -0.92963543
## lfcSE pvalue padj Unstim_B6_Rep1 Unstim_B6_Rep2
## 1 0.04256565 9.350730e-02 1.411407e-01 5016.280801 5044.716516
## 2 0.13131617 1.732092e-04 4.163415e-04 120.118612 109.345435
## 3 0.66736121 3.171750e-05 8.285304e-05 2.125993 2.779969
## 4 0.10241008 1.139347e-01 1.675967e-01 263.623149 263.170369
## 5 0.08496567 6.480031e-51 9.196782e-50 1119.335385 1237.086067
## 6 0.04415007 4.385270e-99 1.532094e-97 8222.278448 8505.777536
## Unstim_B6_Rep3 Unstim_B6_Rep4 Unstim_c11.2_rep1 Unstim_c11.2_rep2
## 1 5146.778634 5307.094517 4824.157778 4932.135450
## 2 103.243492 112.560218 224.698549 231.622427
## 3 3.622579 3.991497 0.857628 3.047664
## 4 220.977298 234.700028 346.481732 330.163547
## 5 1314.996054 1363.495402 1594.330544 1659.960726
## 6 8847.242738 8900.240188 7385.892762 8012.307373
## Unstim_c11.2_rep3 Unstim_c11.2_rep4 Unstim_PWD_rep1 Unstim_PWD_rep2
## 1 4922.688289 4920.896484 4885.88407 4963.96770
## 2 253.951792 237.205156 181.19728 178.66333
## 3 4.518715 5.330453 20.93835 19.75173
## 4 338.903637 357.140348 223.07399 242.40754
## 5 1887.919191 1739.504480 641.03572 670.66086
## 6 7999.029565 7403.110742 6473.17227 6218.20225
## Unstim_PWD_rep3 Unstim_PWD_rep4 LPS_B6_rep1 LPS_B6_rep2 LPS_B6_rep3
## 1 5007.92324 5281.35992 7681.263762 6798.135779 7887.455511
## 2 172.22123 165.22261 82.947924 89.264585 75.051192
## 3 26.17087 22.09372 1.430137 2.550417 4.248181
## 4 207.67854 221.89781 183.057487 163.226670 131.693602
## 5 526.79435 591.72748 2092.289868 2113.020256 2315.258485
## 6 5956.82845 6003.72853 17274.620178 15964.333486 17801.293219
## LPS_B6_rep4 LPS_c11.2_Rep1 LPS_c11.2_Rep2 LPS_c11.2_Rep3 LPS_c11.2_Rep4
## 1 7722.190804 5981.528448 5983.173773 5713.96041 5733.518874
## 2 72.382972 142.475332 168.640115 128.64860 184.431649
## 3 1.340425 1.217738 3.550318 4.36097 8.068885
## 4 151.468072 259.378168 240.534059 250.75575 293.937940
## 5 2265.318948 2630.313813 2248.239010 2286.23831 2237.386436
## 6 16992.572959 11585.558155 12460.729358 11675.40584 11282.606099
## LPS_PWD_Rep1 LPS_PWD_Rep2 LPS_PWD_Rep3 LPS_PWD_Rep4 diffexpressed symbol
## 1 6980.49968 6936.64081 7636.313135 7076.87660 Unchanged Gnai3
## 2 106.49320 112.88533 112.251105 117.93209 Unchanged Cdc45
## 3 14.28567 15.33011 9.714038 17.11917 Upregulated Scml2
## 4 170.12939 166.77297 173.773345 198.77264 Unchanged Narf
## 5 988.30888 1008.53517 814.899847 816.01399 Downregulated Cav2
## 6 8983.09140 9050.33728 9125.798948 8489.20851 Unchanged Klf6
## entrezid
## 1 14679
## 2 12544
## 3 107815
## 4 67608
## 5 12390
## 6 23849
PWDvB6_LPS_background <- dplyr::select(PWDvB6_LPS_data,
Ensembl_id,
log2FoldChange,
padj,
symbol,
entrezid)
To perform the over-representation analysis, we need a list of
background genes and a list of significant genes. For our background
dataset we will use all genes and log2FC from the
PWDvB6_LPS_background results table.
For the significant gene list we will use only genes with p-adjusted values less than 0.05 and absolute log2FC of 1.
# we want the log2 fold change
original_gene_list <- PWDvB6_LPS_background$log2FoldChange
head(original_gene_list)
## [1] -0.07093008 0.46312330 2.27033900 0.15596467 -1.26487297 -0.92963543
# name the vector
names(original_gene_list) <- PWDvB6_LPS_background$Ensembl_id
head(original_gene_list)
## ENSMUSG00000000001 ENSMUSG00000000028 ENSMUSG00000000037 ENSMUSG00000000056
## -0.07093008 0.46312330 2.27033900 0.15596467
## ENSMUSG00000000058 ENSMUSG00000000078
## -1.26487297 -0.92963543
#original_gene_list
# omit any NA values
background_gene_list<-na.omit(original_gene_list)
# sort the list in decreasing order (required for clusterProfiler)
background_gene_list = sort(background_gene_list, decreasing = TRUE)
head(background_gene_list)
## ENSMUSG00000096449 ENSMUSG00000082062 ENSMUSG00000082179 ENSMUSG00000101431
## 11.118480 10.172383 9.956474 9.854910
## ENSMUSG00000091383 ENSMUSG00000006586
## 9.650647 9.615596
# Extract significant results (padj < 0.05) using subset()
sig_genes_df = subset(PWDvB6_LPS_background, padj < 0.05)
# From significant results, we want to filter on log2fold change
sig_genes <- sig_genes_df$log2FoldChange
# Name the vector
names(sig_genes) <- sig_genes_df$Ensembl_id
# omit NA values
sig_genes <- na.omit(sig_genes)
# filter on min log2fold change (log2FoldChange > 1)
sig_genes <- names(sig_genes)[abs(sig_genes) > 1]
head(sig_genes)
## [1] "ENSMUSG00000000037" "ENSMUSG00000000058" "ENSMUSG00000000127"
## [4] "ENSMUSG00000000142" "ENSMUSG00000000149" "ENSMUSG00000000157"
enrichGO()From the clusterProfiler package, enrichGO() performs
Gene Ontology (GO) enrichment analysis to identify overrepresented
biological terms among a list of genes.
There are ontology options: [“BP”, “MF”, “CC”]. The
ont argument can accept either “BP” (Biological Process),
“MF” (Molecular Function), and “CC” (Cellular Component) subontologies,
or “ALL” for all three.
There are keytypes options: This is the source of the annotation (gene ids). The options vary for each annotation. In the example of org.Mm.eg.db, the options are:
keytypes(org.Mm.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MGI"
## [16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UNIPROT"
OrgDb argument can be found below here: https://bioconductor.org/packages/release/BiocViews.html#___Organism#go_enrich <- enrichGO(gene = sig genes,
#universe = background genes,
#OrgDb = organism Db,
#keyType = 'ENSEMBL',
#readable = T,
#ont = "BP",
#pvalueCutoff = 0.05,
#qvalueCutoff = 0.05)
Together, we will run this next part. It will take a few minutes to run.
What is it doing?
It tests whether GO Biological Process (BP) terms are significantly enriched in your significant genes (sig_genes) compared to a background of all tested genes (background_gene_list), using mouse ENSEMBL gene IDs.
#saveRDS(go_enrich, file = "enrichGO_PWD.rds")
#go_enrich <- readRDS("enrichGO_PWD.rds")
Output results:
go_enrich will be an object containing a table of
significantly enriched GO terms, along with stats like:
## Output results from GO analysis to a table
cluster_summary <- data.frame(go_enrich)
#write.csv(cluster_summary, "clusterProfiler_PWDvB6_LPS.csv")
head(cluster_summary)
## ID Description GeneRatio
## GO:0001906 GO:0001906 cell killing 121/3855
## GO:0042742 GO:0042742 defense response to bacterium 129/3855
## GO:0002250 GO:0002250 adaptive immune response 199/3855
## GO:0001910 GO:0001910 regulation of leukocyte mediated cytotoxicity 68/3855
## GO:0031341 GO:0031341 regulation of cell killing 73/3855
## GO:0006935 GO:0006935 chemotaxis 174/3855
## BgRatio RichFactor FoldEnrichment zScore pvalue
## GO:0001906 215/14054 0.5627907 2.051741 9.554181 3.189878e-19
## GO:0042742 241/14054 0.5352697 1.951409 9.159031 6.199270e-18
## GO:0002250 432/14054 0.4606481 1.679364 8.817430 3.467682e-17
## GO:0001910 105/14054 0.6476190 2.360996 8.605929 1.337880e-15
## GO:0031341 119/14054 0.6134454 2.236410 8.327229 8.493663e-15
## GO:0006935 385/14054 0.4519481 1.647647 7.921694 3.098536e-14
## p.adjust qvalue
## GO:0001906 1.831628e-15 1.479096e-15
## GO:0042742 1.779810e-14 1.437252e-14
## GO:0002250 6.637144e-14 5.359698e-14
## GO:0001910 1.920527e-12 1.550884e-12
## GO:0031341 9.754123e-12 7.876755e-12
## GO:0006935 2.541685e-11 2.052489e-11
## geneID
## GO:0001906 Il4/Lgals9/Il7r/Stat5a/Il12b/Nectin4/Fcgr1/H2-M3/H2-M2/Lyst/Arg1/Il13/Havcr2/Serpinb9b/Cxcl14/Clu/Emp2/Pim1/C3/Tap2/Pcyox1l/Dnase1l3/Il23a/Il18rap/Ccl20/Cfh/Cd55/Mr1/Fcnb/Rasgrp1/Il12a/Cd2/Cd1d1/Gbp3/Gbp2/Prdx1/Rnf19b/Masp2/Stap1/Cxcl5/Ppbp/Cxcl9/P2rx7/Lag3/Klrk1/Klrb1f/Clec2d/Klrd1/Klrb1c/Klrb1a/Itgam/Ccl22/Ccl17/Cadm1/Cd226/Cxcl10/Ccl2/H2-Q4/Tap1/Ninj1/Serpinb9f/Camp/C1s1/Il18/Fgl2/Gbp7/Gbp2b/Pvr/Rps19/Pik3r1/Trem3/Pglyrp4/Trem1/Gimap5/Klri2/Serpinb9/Sfn/Klre1/Lgals3/Hamp/Cx3cr1/Klrc2/Raet1e/C1ra/H2-T22/Hamp2/Gapdh/Serpinb9g/Rpl30/Fcgr4/Arrb2/H2-Q7/B2m/H2-K1/Serpinb9e/H2-T23/H2-Q10/Klri1/Apol9b/Lyz1/Irgm2/Serpinb9h/Tigit/H2-T26/H2-Bl/H2-Q6/H2-D1/Sh2d1b2/Ccl27a/Ceacam1/Ccl28/H60b/Igtp/Klrb1b/H2-T27/H2-Q1/Ulbp1/H2-Q2/C1rb/Gbp5/Pnp
## GO:0042742 Wfdc18/Lgals9/Oas1c/Acp5/Il7r/Ikbkg/Il12b/Aqp1/Il27ra/Fcgr1/H2-M3/Il10/Lyst/Tbk1/Havcr2/Nos2/Arg2/Naip1/Rnase10/Rnase6/Gsdmd/Adamts5/Slc15a2/Abcc1/Cd4/Tfeb/Trem2/Pim1/Tslp/Lta/Pcyox1l/Peli3/Casp7/Il23a/Casp1/Mr1/Optn/Prg2/Rbck1/Pld1/Tlr2/Gbp3/Gbp2/Tnfsf8/Unc13b/Emilin1/Gbp9/Plac8/Cxcl5/P2rx7/Npy/Klrk1/Adm/Pycard/Cd209d/Oas3/Oas2/Casp4/Gbp8/Adgrb1/Ssc5d/Isg15/Colec12/Scd1/Serpine1/Mavs/Camp/Rbpj/Il18/Znfx1/Gbp7/Gbp2b/Aicda/Rps19/Ripk2/Irf8/Trem3/Il17f/Pglyrp4/Trem1/Tnfrsf14/Pla2g6/Tlr9/Serpinb9/Irgm1/Rnf31/Rnase2a/Sfn/Gimap6/Ifnb1/Nlrp10/Hamp/Epx/Oas1a/Slamf8/Lgals4/Iigp1/Spink5/Hamp2/Krt6a/Rpl30/B2m/H2-K1/Ear6/Muc5b/Oas1g/Jchain/Lyz1/Wfdc17/Nlrp1a/Irgm2/Rnf213/Naip5/Ang/Ear2/Ear1/Wfdc3/Ighg2b/Ighg3/Ighm/Igtp/Naip6/Naip2/Gbp4/Pglyrp2/Igha/Gpr15lg/Gbp6/Gbp5
## GO:0002250 Loxl3/Il12rb1/Il4/Il17ra/Cd79a/Il7r/Stat3/Stat5a/Il12b/Ly9/Cd244a/Il27ra/Mef2c/Cd247/Tnfrsf13b/Slamf1/Gata3/Fcgr1/H2-M3/H2-M2/Pdcd1lg2/Cd40/Ada/Irf1/Alox15/Cd70/Lyst/Traf3ip2/Arg1/Vsir/Havcr2/Myo1g/Dbnl/Rsad2/Arg2/Irf4/Serpinb9b/Erap1/Emp2/Bcl6/Cd86/Parp3/Cd4/Clec4n/Vegfa/Tfeb/Trem2/C3/H2-Oa/Tap2/Lta/Cd74/Jak2/Cd7/Sectm1a/Il23a/Irf7/6030468B19Rik/Il6/Smad7/Il18rap/Il1rl1/Il18r1/Il1r1/Pdcd1/Cd55/Mr1/Prkcq/Zbtb46/Il12a/Sema4a/Cd1d1/Rnf19b/Masp2/Pf4/P2rx7/Ung/Lag3/Klrk1/Cd69/Klrd1/Siglecg/Cd19/Lat/Pycard/Hpx/Swap70/Tnfsf13b/Nedd4/Rnf125/Zbtb1/Cd226/Dpp4/H2-Q4/Fut7/C1qa/C1qc/C1qb/Zdhhc9/Tap1/Arid5a/H2-DMb2/H2-DMa/Ccr7/Camk4/Slamf7/C1s1/C1rl/Susd4/Il18/Prdm16/Bach2/Pvr/C3ar1/Cd79b/Aicda/Lat2/Ccr6/Ripk2/Mfsd6/Serpina3g/H2-Ob/Il17f/Trem1/Tnfrsf14/Fcmr/Klhl6/Gimap5/Il27/Serpinb9/Cd24a/Ifnb1/Clec4a1/Il23r/Ccr2/Nlrp10/Gpr183/Lax1/Btla/Cx3cr1/Klrc2/Raet1e/Tarm1/Cd8a/C1ra/H2-T22/Dennd1b/Pirb/Fcgr4/Clec4a4/H2-Q7/H2-Eb1/B2m/H2-K1/Stat4/Jchain/H2-T23/H2-Q10/Tnfrsf13c/Il2rb/Il18bp/Fcrlb/H2-T26/H2-Bl/H2-Q6/H2-D1/C4b/H2-Ab1/Sh2d1b2/Il4i1/Ceacam1/Cd80/H60b/Igkc/Ighg2b/Ighg3/Ighm/Iglv1/Iglc2/Slfn1/H2-T27/H2-Q1/H2-DMb1/Ulbp1/Tnfsf13/H2-Q2/Igkv1-110/Ighv1-53/Igkv4-59/Igkv1-117/Igha/Igkv14-111/Ighv9-3/C1rb/Ighd/Iglc3/Iglc1/Lilrb4a/Pnp
## GO:0001910 Lgals9/Il7r/Stat5a/Il12b/Nectin4/H2-M3/H2-M2/Arg1/Havcr2/Serpinb9b/Tap2/Dnase1l3/Il23a/Il18rap/Mr1/Rasgrp1/Il12a/Cd1d1/Stap1/Cxcl5/P2rx7/Lag3/Klrk1/Klrb1f/Clec2d/Klrd1/Klrb1c/Klrb1a/Itgam/Cadm1/Cd226/Ccl2/H2-Q4/Tap1/Serpinb9f/Pvr/Gimap5/Klri2/Serpinb9/Klre1/Cx3cr1/Klrc2/Raet1e/H2-T22/Serpinb9g/Arrb2/H2-Q7/B2m/H2-K1/Serpinb9e/H2-T23/H2-Q10/Klri1/Serpinb9h/Tigit/H2-T26/H2-Bl/H2-Q6/H2-D1/Sh2d1b2/Ceacam1/H60b/Klrb1b/H2-T27/H2-Q1/Ulbp1/H2-Q2/Pnp
## GO:0031341 Il4/Lgals9/Il7r/Stat5a/Il12b/Nectin4/H2-M3/H2-M2/Arg1/Il13/Havcr2/Serpinb9b/Clu/Tap2/Dnase1l3/Il23a/Il18rap/Cfh/Cd55/Mr1/Rasgrp1/Il12a/Cd1d1/Stap1/Cxcl5/P2rx7/Lag3/Klrk1/Klrb1f/Clec2d/Klrd1/Klrb1c/Klrb1a/Itgam/Cadm1/Cd226/Ccl2/H2-Q4/Tap1/Serpinb9f/Pvr/Gimap5/Klri2/Serpinb9/Klre1/Cx3cr1/Klrc2/Raet1e/H2-T22/Serpinb9g/Arrb2/H2-Q7/B2m/H2-K1/Serpinb9e/H2-T23/H2-Q10/Klri1/Serpinb9h/Tigit/H2-T26/H2-Bl/H2-Q6/H2-D1/Sh2d1b2/Ceacam1/H60b/Klrb1b/H2-T27/H2-Q1/Ulbp1/H2-Q2/Pnp
## GO:0006935 Fer/Itgb2l/Il4/S100a4/Lgals9/Sema6b/Il16/Tiam1/Il17ra/Pgf/Ccl24/Kit/Tnfsf14/Trpm2/Slamf1/Itga2/Mmp9/Rac3/Lsp1/Cxcl16/Lyst/Perp/Dock2/Itgb3/Lgmn/Nedd9/Edn1/Sema4d/Cxcl14/Thbs4/Fezf2/Tnfsf11/Kif21a/Cxadr/Robo1/Nr4a1/Abcc1/Pla2g7/Vegfa/Trem2/C3/Dusp1/Aif1/Hbegf/Lox/Slc12a2/Cd74/Pdgfrb/Vegfb/Sbds/Il23a/Adam8/Alox5/Ccr1/Nrp1/Sema4c/Ccl20/Cxcr2/Gpr35/Prkcq/Fcnb/Stk39/Mdk/Il12a/Rab13/Vcam1/Sema4a/F3/Ccn1/Epha7/Sema3c/Padi2/Bst1/Pdgfra/Stap1/Cxcl5/Ppbp/Pf4/Cxcl9/Ccr9/Plxna4/Prok2/Klrk1/Ptpro/Itgam/Swap70/Cttn/Nox1/Plxnb3/F7/Adgra2/Slit2/Fgfr1/Ednra/Gab1/Cx3cl1/Ccl22/Ccl17/Bcar1/Robo3/Dapk2/Smad3/Ryk/Mst1/Mycbp2/Arhgef5/Dysf/Sema3f/Cxcl10/Dpp4/Ccl5/Ccl12/Ccl7/Ccl2/Ccr3/Ripor2/Fgf1/Ccn3/Serpine1/Ccr7/Ninj1/Sema7a/Sema6c/Itga9/Cyp7b1/Prex1/Thbs1/C3ar1/Ccr6/Rps19/Nbl1/Trem3/Cmklr1/Trem1/Itga1/Pla2g6/Rtn4r/Ackr3/Tubb2b/Mtus1/Wnk1/Rnase2a/Cxcr5/Cxcr1/Jaml/Efna5/Ccr2/Lgr4/Dscam/Lgals3/Ch25h/Gpr183/Ffar2/Flrt3/Casr/Cx3cr1/Cysltr1/Slamf8/Arrb2/Xcr1/Ear6/Cnr2/Kdr/Nrg1/Ano6/Ear2/Ear1/Srp54a/Ccl27a/Rpl13a/Ccl28/Grem1/Fpr3/Gpr15lg
## Count
## GO:0001906 121
## GO:0042742 129
## GO:0002250 199
## GO:0001910 68
## GO:0031341 73
## GO:0006935 174
clusterProfiler has a variety of options for viewing the
over-represented GO terms.
The Bar plot is the most widely used method to visualize enriched terms. It depicts the enrichment scores (e.g. p values) and gene count or ratio as bar height and color. Users can specify the number of terms (most significant) or selected terms to display via the showCategory parameter.
barplot <- barplot(go_enrich,
showCategory = 10,
title = "GO Biological Pathways",
font.size = 8)
barplot
dot <- dotplot(go_enrich,
showCategory = 10,
font.size = 8)
dot
In order to consider if a gene may belong to multiple “pathway”
categories, we can use the cnetplot() function. The
cnetplot() depicts the linkages of genes and biological
concepts (e.g. GO terms or KEGG pathways) as a network.
cnetplot(go_enrich,
showCategory = 3,
foldChange=background_gene_list)
Select which labels to be displayed. Options include ‘category’, ‘gene’, ‘all’(the default) and ‘none’.
cnetplot(go_enrich,
showCategory = 3,
foldChange=background_gene_list,
node_label="category")
#library(ggplot2)
plot_color <- cnetplot(go_enrich,
showCategory = 3,
foldChange=background_gene_list,
node_label="all") +
scale_color_gradient2(name='fold change',
low='darkgreen',
high='firebrick')
plot_color
enrichKEGGFrom the clusterProfiler package, enrichKEGG() performs
KEGG pathway enrichment analysis — identifying biological pathways that
are overrepresented in a list of genes.
# Create vector of log2FoldChange values
kegg_gene_list <- PWDvB6_LPS_background$log2FoldChange
# Assign ENTREZ IDs as names
names(kegg_gene_list) <- PWDvB6_LPS_background$entrezid
# Remove NA values
kegg_gene_list <- na.omit(kegg_gene_list)
# Remove duplicates in ENTREZ IDs (keep the first occurrence)
kegg_gene_list <- kegg_gene_list[!duplicated(names(kegg_gene_list))]
# sort the list in decreasing order (required for clusterProfiler)
kegg_gene_list = sort(kegg_gene_list, decreasing = TRUE)
head(kegg_gene_list)
## 100042862 14337 100502983 666036 667728 12395
## 11.118480 10.172383 9.956474 9.854910 9.650647 9.615596
We just created the background list, similar to what we did above. Now, we will move on to create the significant list.
# Extract significant results
kegg_sig_genes_df = subset(PWDvB6_LPS_background, padj < 0.05)
# From significant results, we want to filter on log2fold change
kegg_sig_genes <- kegg_sig_genes_df$log2FoldChange
# Name the vector with the ENTREZID
names(kegg_sig_genes) <- kegg_sig_genes_df$entrezid
# omit NA values
kegg_sig_genes <- na.omit(kegg_sig_genes)
# filter on log2fold change
kegg_sig_genes <- names(kegg_sig_genes)[abs(kegg_sig_genes) > 1]
head(kegg_sig_genes)
## [1] "107815" "12390" "14158" "12006" "14673" "16415"
BTW, the clusterProfiler package provides
search_kegg_organism() function to help searching supported
organisms.
sal_ent <- search_kegg_organism('Salmonella enterica',
by='scientific_name')
dim(sal_ent)
## [1] 50 3
#sal_ent
What’s the enrichKEGG function
doing?
It tests whether KEGG pathways are significantly enriched in your significant genes (kegg_sig_genes) compared to a background (kegg_gene_list), using NCBI gene IDs for mouse (“mmu”).
organism KEGG Organism Code: The full list is here: https://www.genome.jp/kegg/catalog/org_list.html (need the 3 letter code).
keyType one of ‘kegg’, ‘ncbi-geneid’, ‘ncib-proteinid’ or ‘uniprot’.
head(kk)
## category
## mmu04060 Environmental Information Processing
## mmu04514 Environmental Information Processing
## mmu05330 Human Diseases
## mmu04940 Human Diseases
## mmu05320 Human Diseases
## mmu04061 Environmental Information Processing
## subcategory ID
## mmu04060 Signaling molecules and interaction mmu04060
## mmu04514 Signaling molecules and interaction mmu04514
## mmu05330 Immune disease mmu05330
## mmu04940 Endocrine and metabolic disease mmu04940
## mmu05320 Immune disease mmu05320
## mmu04061 Signaling molecules and interaction mmu04061
## Description
## mmu04060 Cytokine-cytokine receptor interaction
## mmu04514 Cell adhesion molecule (CAM) interaction
## mmu05330 Allograft rejection
## mmu04940 Type I diabetes mellitus
## mmu05320 Autoimmune thyroid disease
## mmu04061 Viral protein interaction with cytokine and cytokine receptor
## GeneRatio BgRatio RichFactor FoldEnrichment zScore pvalue
## mmu04060 102/1848 197/6493 0.5177665 1.819187 7.364297 2.274010e-12
## mmu04514 67/1848 122/6493 0.5491803 1.929561 6.537341 5.701111e-10
## mmu05330 30/1848 40/6493 0.7500000 2.635146 6.542641 1.271254e-09
## mmu04940 30/1848 43/6493 0.6976744 2.451299 6.022247 2.122136e-08
## mmu05320 28/1848 39/6493 0.7179487 2.522533 6.014949 2.328973e-08
## mmu04061 43/1848 72/6493 0.5972222 2.098357 5.910930 2.632294e-08
## p.adjust qvalue
## mmu04060 7.890816e-10 6.439040e-10
## mmu04514 9.891428e-08 8.071573e-08
## mmu05330 1.470417e-07 1.199886e-07
## mmu04940 1.522343e-06 1.242258e-06
## mmu05320 1.522343e-06 1.242258e-06
## mmu04061 1.522343e-06 1.242258e-06
## geneID
## mmu04060 11482/16161/16189/16170/16172/16197/16160/56221/50931/50930/57916/16153/16164/21939/12981/66102/21948/16163/57266/12166/21943/21933/18414/16180/16169/12504/53603/16994/16992/83430/16193/12768/16174/17082/16182/16177/16178/20297/12765/16184/269275/54450/54448/12156/16159/18049/21949/21941/21942/22163/20311/57349/56744/17329/12769/12154/14562/60504/16186/24099/16168/20312/20299/20295/21813/12163/15945/20304/20293/20306/20296/12771/16324/12775/16173/22035/12458/257630/21936/230979/225392/215257/12778/246779/12145/227288/15977/209590/12772/218624/13051/16880/16847/56066/23832/242700/72049/16185/20301/56838/69583/21944
## mmu04514 16415/16421/94332/14991/14990/58205/20971/21939/74048/319504/18613/13003/58187/17968/13052/12524/12504/15001/12511/20970/54167/18566/241226/12481/22329/12483/16409/16408/12555/54725/18007/19274/18190/225825/171171/102657/15015/15000/14998/16456/104099/17967/52118/15002/270152/216856/67374/12525/71908/15039/15018/14969/14972/15040/15007/100043314/667977/69717/14963/110557/14964/14961/12519/100529082/15006/14999/15013
## mmu05330 16189/16160/14991/14990/16153/21939/12524/15001/16159/15015/15000/14998/15002/15039/15018/14969/14972/15040/15007/667977/69717/14963/110557/14964/14961/12519/100529082/15006/14999/15013
## mmu04940 16160/14991/14990/12524/15001/16992/16159/15015/15000/14998/12876/15002/15039/15018/14969/14972/15893/15040/15007/667977/69717/14963/110557/14964/14961/12519/100529082/15006/14999/15013
## mmu05320 16189/14991/14990/16153/21939/12524/15001/15015/15000/14998/15002/15039/15018/14969/14972/15040/15007/667977/69717/14963/110557/14964/14961/12519/100529082/15006/14999/15013
## mmu04061 56221/50930/16153/57266/21933/16992/16193/12768/16174/16182/20297/12765/16184/20311/57349/56744/17329/12769/16186/20312/20299/20295/15945/20304/20293/20306/20296/12771/12775/16173/22035/12458/230979/12778/12145/227288/12772/13051/56066/23832/16185/20301/56838
## Count
## mmu04060 102
## mmu04514 67
## mmu05330 30
## mmu04940 30
## mmu05320 28
## mmu04061 43
#saveRDS(kk, file = "enrichKEGG_PWD.rds")
#kk <- readRDS("enrichKEGG_PWD.rds")
library("pathview")
mmu04217 <- pathview(gene.data = kegg_gene_list,
pathway.id = "mmu04217",
species = "mmu",
limit = list(gene=max(abs(kegg_gene_list)), cpd=1))
knitr::include_graphics("mmu04217.pathview.png")
cowplot::plot_grid(dot, plot_color, labels=LETTERS[1:2])
Before you go try knitting the document!!
sessionInfo()
## R version 4.5.3 (2026-03-11)
## Platform: x86_64-apple-darwin20
## Running under: macOS Tahoe 26.3.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] GO.db_3.22.0 org.Mm.eg.db_3.22.0
## [3] AnnotationDbi_1.72.0 ggplotify_0.1.3
## [5] patchwork_1.3.2 pathview_1.50.0
## [7] msigdbr_25.1.1 cowplot_1.2.0
## [9] knitr_1.51 enrichplot_1.30.4
## [11] ggnewscale_0.5.2 DOSE_4.4.0
## [13] clusterProfiler_4.18.4 EnhancedVolcano_1.28.2
## [15] ggrepel_0.9.6 RColorBrewer_1.1-3
## [17] pheatmap_1.0.13 genekitr_1.2.8
## [19] ggplot2_4.0.1 tidyr_1.3.2
## [21] dplyr_1.1.4 DESeq2_1.50.2
## [23] SummarizedExperiment_1.40.0 Biobase_2.70.0
## [25] MatrixGenerics_1.22.0 matrixStats_1.5.0
## [27] GenomicRanges_1.62.1 Seqinfo_1.0.0
## [29] IRanges_2.44.0 S4Vectors_0.48.0
## [31] BiocGenerics_0.56.0 generics_0.1.4
##
## loaded via a namespace (and not attached):
## [1] splines_4.5.3 bitops_1.0-9 urltools_1.7.3.1
## [4] tibble_3.3.1 R.oo_1.27.1 triebeard_0.4.1
## [7] polyclip_1.10-7 graph_1.88.1 XML_3.99-0.20
## [10] lifecycle_1.0.5 lattice_0.22-9 MASS_7.3-65
## [13] magrittr_2.0.4 openxlsx_4.2.8.1 sass_0.4.10
## [16] rmarkdown_2.30 jquerylib_0.1.4 yaml_2.3.12
## [19] remotes_2.5.0 otel_0.2.0 ggtangle_0.1.1
## [22] zip_2.3.3 sessioninfo_1.2.3 pkgbuild_1.4.8
## [25] ggvenn_0.1.19 DBI_1.2.3 abind_1.4-8
## [28] pkgload_1.4.1 purrr_1.2.1 R.utils_2.13.0
## [31] RCurl_1.98-1.17 ggraph_2.2.2 yulab.utils_0.2.3
## [34] tweenr_2.0.3 rappdirs_0.3.4 gdtools_0.4.4
## [37] tidytree_0.4.7 codetools_0.2-20 DelayedArray_0.36.0
## [40] xml2_1.5.2 ggforce_0.5.0 tidyselect_1.2.1
## [43] aplot_0.2.9 farver_2.1.2 viridis_0.6.5
## [46] jsonlite_2.0.0 ellipsis_0.3.2 tidygraph_1.3.1
## [49] systemfonts_1.3.1 tools_4.5.3 progress_1.2.3
## [52] treeio_1.34.0 Rcpp_1.1.1 glue_1.8.0
## [55] gridExtra_2.3 SparseArray_1.10.8 xfun_0.56
## [58] qvalue_2.42.0 usethis_3.2.1 withr_3.0.2
## [61] fastmap_1.2.0 digest_0.6.39 R6_2.6.1
## [64] gridGraphics_0.5-1 RSQLite_2.4.5 R.methodsS3_1.8.2
## [67] fontLiberation_0.1.0 data.table_1.18.0 prettyunits_1.2.0
## [70] graphlayouts_1.2.2 httr_1.4.7 htmlwidgets_1.6.4
## [73] S4Arrays_1.10.1 scatterpie_0.2.6 pkgconfig_2.0.3
## [76] gtable_0.3.6 blob_1.3.0 S7_0.2.1
## [79] XVector_0.50.0 htmltools_0.5.9 fontBitstreamVera_0.1.1
## [82] geneset_0.2.7 fgsea_1.36.2 scales_1.4.0
## [85] png_0.1-8 ggfun_0.2.0 rstudioapi_0.18.0
## [88] reshape2_1.4.5 nlme_3.1-168 curl_7.0.0
## [91] org.Hs.eg.db_3.22.0 cachem_1.1.0 stringr_1.6.0
## [94] parallel_4.5.3 pillar_1.11.1 grid_4.5.3
## [97] vctrs_0.7.0 tidydr_0.0.6 cluster_2.1.8.2
## [100] Rgraphviz_2.54.0 KEGGgraph_1.70.0 evaluate_1.0.5
## [103] cli_3.6.5 locfit_1.5-9.12 compiler_4.5.3
## [106] rlang_1.1.7 crayon_1.5.3 labeling_0.4.3
## [109] plyr_1.8.9 fs_1.6.6 ggiraph_0.9.3
## [112] stringi_1.8.7 viridisLite_0.4.2 BiocParallel_1.44.0
## [115] assertthat_0.2.1 babelgene_22.9 Biostrings_2.78.0
## [118] lazyeval_0.2.2 devtools_2.4.6 pacman_0.5.1
## [121] GOSemSim_2.36.0 fontquiver_0.2.1 Matrix_1.7-4
## [124] europepmc_0.4.3 hms_1.1.4 bit64_4.6.0-1
## [127] KEGGREST_1.50.0 igraph_2.2.1 memoise_2.0.1
## [130] bslib_0.9.0 ggtree_4.0.4 fastmatch_1.1-8
## [133] bit_4.6.0 ape_5.8-1 gson_0.1.0
title: “Analyzing RNA-Seq with DESeq2” author: “Michael I. Love, Simon Anders, and Wolfgang Huber”
Note: if you use DESeq2 in published research, please cite:
Love, M.I., Huber, W., Anders, S. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15:550. 10.1186/s13059-014-0550-8
Sabikunnahar, B., Snyder, J. P., Rodriguez, P. D., Sessions, K. J., Amiel, E., Frietze, S. E., & Krementsov, D. N. (2025). Natural genetic variation in wild-derived mice controls host survival and transcriptional responses during endotoxic shock. ImmunoHorizons, 9(5), vlaf007. https://doi.org/10.1093/immhor/vlaf007