Castoldi A, Monteiro LB, van Teijlingen Bakker N, et al. Triacylglycerol synthesis enhances macrophage inflammatory function. Nat Commun. 2020;11(1):4107. doi:10.1038/s41467-020-17881-3
Toll-like receptor (TLR) agonists activate naive/inactive macrophages (M0) to become Inflammatory macrophages(M1). M1s increase the synthesis of triacylglycerol (TG) which is needed for lipid droplet (LD) formation. This LD production is critical for the macrophage immune effector functions, as LD helps prevent from ER stress, toxic lipid build-up, and provide lipid donors for various immuno-cellular activities, including chemokine and cytokine productions. Macrophages with large LDs are typically observed in patients with atherosclerosis, tuberculosis, specific types of cancers, and obesity. However, detailed explanation for TG synthesis process and its functional significance in activated macrophages is unclear. Castoldi et al (2020) hypothesize that TG synthesis is essential for proper inflammatory activation in LPS+IFN-g stimulated macrophages. Therefore, the authors treated LPS+IFN-g stimulated-M1 with selective DGAT1 inhibitor (T863) to block the TG synthesis from diacylglycerol (DG). The authors found that T863 treatment caused attenuation of: 1) proinflammatory cytokines (Il-1b, IL-6), 2) overall intracellular TG contents, 3) Lipid droplet presence, 4) macrophage phagocytosis function, 5) sigaling molecule/lipid mediator molecule prostaglandin E2 (PGE2) formation.
PGE2 is a type of lipid mediators that can elicit either pro- or anti-inflammatory signaling, depending on the type of Prostaglandin receptors (EPR1-4). PGE2 is produced from arachidonic acid (AA) by cyclooxygenase1/2 (COX1/2) as the rate-limiting enzyme. Many studies found that inflammatory stimuli, such as LPS and IFNg, co-induce Ptgs-2 (codes for COX2) and Nos2 (codes for iNOS; inducible nitric oxide synthase) in innate immune cells such as Macrophages and Dendritic cells. In addition, Macrophages upregulate EPR2 and EPR4 which promote pro-inflammatory signaling cascade. I am very interested to know whether this DGAT1 inhibition-driven PGE2 drop affects Nos2 expression & downstream NO cellular effects.
Thus, my hypotheses for the RNAseq analysis are as follows: 1. IFNg+LPS activated M1 show significantly increased expression of Nos2 and other pro-inflammatory genes, comparing to inactive/naive M0. 2. T863 treatment in M1 alters the gene expression pattern in comparison to M1 without drug (based on the Authors’ claim that TG synthesis is critical for proper M1 inflmmatory activation).
For the analysis .Rmd script for hypothesis [1], please refer to a different .Rmd file located in /R_Project_Soyeon_Gullickson/M0_VS_M1.
In this script, I will be analyzing/comparing the data sets between unstimulated macrophages (M0) and stimulated macrophages (M1: including T863 treated group!) to test my hypothesis[2]: “T863 treatment in M1 alters the gene expression pattern in comparison to M1 without drug (based on the Authors’ claim that TG synthesis is critical for proper M1 inflmmatory activation)”.
A. DESeq2 analysis B. Volcano Plot C. Heatmap D. Enrichment Analysis Note: I am most interested in Section B and C output files in this Part II script. –> These output files will be used in my final project presentation!
# Install CRAN packages
#install.packages(c("knitr", "RColorBrewer", "pheatmap"))
The knitr package provides a lot of options for customizing nearly
all components of code chunk, including the text outputs, source codes,
and plots. Under the text outputs, you can find echo
. When
echo = TRUE
this means to display the source code in the
output documents. When warning = FALSE
this means that all
warnings will be suppressed.
library(knitr)
library(DESeq2)
library(RColorBrewer)
library(pheatmap)
library(ggplot2)
library(tidyverse)
As a reminder, we have previously assigned all the sequencing reads to a given gene by directly counting reads overlapping with gene loci using HTSeq-count. Now, the next step is to analyze the counts data for the detection of differentially expressed genes (DEGs). The package DESeq2 provides methods to test for differential expression by use of negative binomial generalized linear models.
To use DESeq2 it requires the following: 1. counts matrix
2. table of sample information
3. design indicates how to model the samples
First, we will input our annotation file. This file was created in
Microsoft Excel, you could use nano
or vim
, or
create a data frame. But essentially this annotation file should look
similar to below:
sampleTable <- read.csv("SG_GSE145523_M0_VS_M1andM1T368_anno.csv", header = T)
head(sampleTable)
## sample filename condition replicate treatment
## 1 MO_1 SRR11111648.ENSG.count M0 1 none
## 2 MO_2 SRR11111649.ENSG.count M0 2 none
## 3 MO_3 SRR11111650.ENSG.count M0 3 none
## 4 M1_1 SRR11111651.ENSG.count M1 1 IFNg+LPS
## 5 M1_2 SRR11111652.ENSG.count M1 2 IFNg+LPS
## 6 M1_3 SRR11111653.ENSG.count M1 3 IFNg+LPS
tail(sampleTable)
## sample filename condition replicate treatment
## 4 M1_1 SRR11111651.ENSG.count M1 1 IFNg+LPS
## 5 M1_2 SRR11111652.ENSG.count M1 2 IFNg+LPS
## 6 M1_3 SRR11111653.ENSG.count M1 3 IFNg+LPS
## 7 M1_T368_1 SRR11111657.ENSG.count M1 1 IFNg+LPS+T863
## 8 M1_T368_2 SRR11111658.ENSG.count M1 2 IFNg+LPS+T863
## 9 M1_T368_3 SRR11111659.ENSG.count M1 3 IFNg+LPS+T863
#Convert to data frame
sampleTable <- data.frame(sampleTable)
head(sampleTable)
## sample filename condition replicate treatment
## 1 MO_1 SRR11111648.ENSG.count M0 1 none
## 2 MO_2 SRR11111649.ENSG.count M0 2 none
## 3 MO_3 SRR11111650.ENSG.count M0 3 none
## 4 M1_1 SRR11111651.ENSG.count M1 1 IFNg+LPS
## 5 M1_2 SRR11111652.ENSG.count M1 2 IFNg+LPS
## 6 M1_3 SRR11111653.ENSG.count M1 3 IFNg+LPS
str(sampleTable)
## 'data.frame': 9 obs. of 5 variables:
## $ sample : chr "MO_1" "MO_2" "MO_3" "M1_1" ...
## $ filename : chr "SRR11111648.ENSG.count" "SRR11111649.ENSG.count" "SRR11111650.ENSG.count" "SRR11111651.ENSG.count" ...
## $ condition: chr "M0" "M0" "M0" "M1" ...
## $ replicate: int 1 2 3 1 2 3 1 2 3
## $ treatment: chr "none" "none" "none" "IFNg+LPS" ...
Next, we want to specify the “condition” and “treatment” column as factors. By doing so, this will allow us to categorize the data and store it as levels.
#specify as factor
condition <- factor(sampleTable$condition)
replicate <- factor(sampleTable$replicate)
treatment <- factor(sampleTable$treatment)
dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
directory = ".",
design = ~replicate + condition)
dds
## class: DESeqDataSet
## dim: 57010 9
## metadata(1): version
## assays(1): counts
## rownames(57010): ENSMUSG00000000001.5 ENSMUSG00000000003.16 ...
## ENSMUSG00002076991.1 ENSMUSG00002076992.1
## rowData names(0):
## colnames(9): MO_1 MO_2 ... M1_T368_2 M1_T368_3
## colData names(3): condition replicate treatment
The DESeqDataSet object must have an associated design formula. The design formula expresses the variables which will be used in modeling. The formula should be a tilde (~) followed by the variables with plus signs between them if required.
Note: The design can be changed later, however then all differential analysis steps should be repeated, as the design formula is used to estimate the log2 fold changes of the model.
Note: In order to benefit from the default settings of the package, one should put the variable of interest at the end of the formula and make sure the control level is the first level.
In other words, since we want to measure the effect of the condition between the two treatments this should go at the end and if we wanted to control for potential batch differences due to how these replicates were made we would add this as the first variable(~batch+condition).
While it is not necessary to pre-filter low count genes before
running the DESeq2 functions, there are two reasons which make
pre-filtering useful: by removing rows in which there are very few
reads, we reduce the memory size of the dds
data object,
and we increase the speed of the transformation and testing functions
within DESeq2. It can also improve visualizations, as features with no
information for differential expression are not plotted.
Below we perform a minimal pre-filtering to keep only rows that have at least 10 reads total.
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
By default, R will choose a reference level for factors based on alphabetical order. Therefore, if you never tell the DESeq2 functions which level you want to compare against (e.g. which level represents the control group), the comparisons will be based on the alphabetical order of the levels.
use relevel to specify the reference level:
dds$condition <- relevel(dds$condition, ref = "M0")
The standard differential expression analysis steps are wrapped into
a single function, DESeq. The estimation steps performed by
this function are described in the manual page for
?DESeq
.
Results table will be generated using the function results, which extracts a results table with log2 fold changes, p values and adjusted p values.
Using the results function, the user can specify the
comparison to build a results table for, using the contrast
argument.
Details about the comparison are printed to the console, directly
above the results table. The text, condition KO vs WT
,
tells you that the estimates are of the logarithmic fold change
log2(KO/WT) i.e. log2(treated/untreated).
dds <- DESeq(dds)
res <- results(dds, contrast=c('condition', 'M1', 'M0'))
head(res)
## log2 fold change (MLE): condition M1 vs M0
## Wald test p-value: condition M1 vs M0
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat
## <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000000001.5 1925.70090 0.254874 0.0680719 3.74418
## ENSMUSG00000000028.16 90.53512 -5.106565 0.7575109 -6.74124
## ENSMUSG00000000037.18 2.75172 -4.716234 1.3373231 -3.52662
## ENSMUSG00000000056.8 475.90870 0.606575 0.1086542 5.58262
## ENSMUSG00000000058.7 544.94952 -1.882684 0.2592969 -7.26073
## ENSMUSG00000000078.8 11894.37397 2.277699 0.0614485 37.06679
## pvalue padj
## <numeric> <numeric>
## ENSMUSG00000000001.5 1.80983e-04 5.15166e-04
## ENSMUSG00000000028.16 1.57037e-11 8.36804e-11
## ENSMUSG00000000037.18 4.20896e-04 1.13702e-03
## ENSMUSG00000000056.8 2.36920e-08 1.00226e-07
## ENSMUSG00000000058.7 3.85015e-13 2.27624e-12
## ENSMUSG00000000078.8 9.63341e-301 2.31215e-298
dim(res)
## [1] 17281 6
_baseMean
- Mean of the normalized counts for all
samples in the comparison, for a given gene_log2FoldChange
- log2 fold change between ‘untreated’
and ‘treated’ conditions_pvalue
- Wald test P value_padj
- Benjamini-Hochberg adjusted Wald test
P value (P-value after multiple test correction)We can now order our results table by the smallest padj value:
result <- res[order(res$padj), ]
head(result)
## log2 fold change (MLE): condition M1 vs M0
## Wald test p-value: condition M1 vs M0
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000000386.19 15235.74 8.11971 0.1783980 45.5146 0
## ENSMUSG00000006818.6 7758.10 4.15912 0.0884191 47.0387 0
## ENSMUSG00000015837.16 21140.90 2.79545 0.0601563 46.4698 0
## ENSMUSG00000016496.8 24244.10 8.13432 0.1392793 58.4030 0
## ENSMUSG00000017652.17 5173.87 6.05825 0.1369732 44.2295 0
## ENSMUSG00000018899.18 15003.97 5.53825 0.0760465 72.8272 0
## padj
## <numeric>
## ENSMUSG00000000386.19 0
## ENSMUSG00000006818.6 0
## ENSMUSG00000015837.16 0
## ENSMUSG00000016496.8 0
## ENSMUSG00000017652.17 0
## ENSMUSG00000018899.18 0
table(result$padj<0.05)
##
## FALSE TRUE
## 8415 8866
Now, we can take this output and merge with the normalized counts data. This output will contain 19,000+ rows still we simply just ordered it by padj value. We will label this file with “unfiltered_matrix”
resdata <- merge(as.data.frame(result), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene_id"
head(resdata)
## Gene_id baseMean log2FoldChange lfcSE stat pvalue
## 1 ENSMUSG00000000386.19 15235.742 8.119711 0.17839804 45.51457 0
## 2 ENSMUSG00000006818.6 7758.105 4.159120 0.08841911 47.03870 0
## 3 ENSMUSG00000015837.16 21140.896 2.795451 0.06015630 46.46980 0
## 4 ENSMUSG00000016496.8 24244.104 8.134324 0.13927930 58.40296 0
## 5 ENSMUSG00000017652.17 5173.867 6.058249 0.13697318 44.22945 0
## 6 ENSMUSG00000018899.18 15003.974 5.538247 0.07604645 72.82716 0
## padj MO_1 MO_2 MO_3 M1_1 M1_2 M1_3 M1_T368_1
## 1 0 87.45999 86.78768 71.80221 26227.415 27253.492 26498.443 17434.796
## 2 0 588.58211 695.03078 613.86945 12330.056 11839.744 11089.540 11696.676
## 3 0 4258.75008 4119.13310 4398.87168 30474.816 28893.605 28054.684 30529.450
## 4 0 121.34089 107.93762 160.17416 32976.318 36028.969 32225.833 35137.266
## 5 0 114.24954 121.79448 110.46494 9026.013 8623.723 8081.866 6895.416
## 6 0 494.03077 508.32786 434.75844 22535.267 22614.290 22597.254 21142.867
## M1_T368_2 M1_T368_3
## 1 19974.740 19486.738
## 2 10580.185 10389.257
## 3 30429.633 29109.123
## 4 43332.355 38106.745
## 5 7008.213 6583.063
## 6 22140.661 22568.308
#write results
write.csv(resdata, file="M0_VS_M1andM1T368_unfiltered_matrix_Final.csv")
The next code chunk makes use of the tidyverse
package.
+ dplyr
lets you manipulate tabular data. +
tidyr
provides additional functions that let you reshape
data to get it into a tidy format. + ggplot2
offers a
grammar of graphics approach to data visualization. +
stringr
provides fast and consistent functions for dealing
with string data. + broom
takes many different kinds of R
objects and turns them into tidy data frames. + magrittr
provide the pipe function (%>%) much used by other R packages and in
this book.
We are using this package as a way of getting a list of differentially expressed genes (DEGs) only. Now, we are filtering the 19,000+ rows to only rows which meet the criteria of having an absolute log2FC of 1 and padj of less than 0.05. We will label this file with “log2fc1_fdr05_DEGlist”
#Get DEG up/down lists
resdata <- as.data.frame(result) %>% dplyr::filter(abs(log2FoldChange) > 1 & padj < 0.05) %>% tibble::rownames_to_column("Gene_id")
#write results
write.csv(resdata, "M0_VS_M1andM1T368_log2fc1_fdr05_DEGlist_Final.csv", row.names = TRUE)
head(resdata)
## Gene_id baseMean log2FoldChange lfcSE stat pvalue
## 1 ENSMUSG00000000386.19 15235.742 8.119711 0.17839804 45.51457 0
## 2 ENSMUSG00000006818.6 7758.105 4.159120 0.08841911 47.03870 0
## 3 ENSMUSG00000015837.16 21140.896 2.795451 0.06015630 46.46980 0
## 4 ENSMUSG00000016496.8 24244.104 8.134324 0.13927930 58.40296 0
## 5 ENSMUSG00000017652.17 5173.867 6.058249 0.13697318 44.22945 0
## 6 ENSMUSG00000018899.18 15003.974 5.538247 0.07604645 72.82716 0
## padj
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
In order to test for differential analysis, we used raw counts. These matrices are so large that we need convenient ways to extract the important information from these large data matrices - it is useful to transform the data.
In using principal component analysis (PCA): + The data is reduced to smaller matrices so they can more easily be examined, plotted and interpreted. + The most important factors are extracted (principal components).
Below we are using either variance stabilizing transformation (vst) (Tibshirani 1988; Huber et al. 2003; Anders and Huber 2010) or regularized logarithm (rlog) (Love, Huber, and Anders 2014). Both transformations will produce transformed data on the log2 scale which has been normalized with respect to library size or other normalization factors.
While using either function, one can specify the argument blind for whether the transformation should be blind to the sample information specified by the design formula. A blind dispersion estimate is not appropriate if one expects that many or the majority of genes (rows) will have large differences in counts which are explained by the experimental design. These differences due to experimental design will be interpreted as unwanted noise, and will result in overly shrinking the transformed values towards each other. By setting blind to FALSE, the dispersions already estimated will be used to perform transformations, or if not present, they will be estimated using the current design formula.
Now using the transformed data we will generate a PCA plot. The PCA plot will show the samples in a 2D plane spanned by the first two principal components. This type of plot is useful for visualizing the overall effect of experimental covariates (ex. age) and batch effects (ex. day sample was prepared).
vstcounts <- vst(dds, blind = FALSE)
plotPCA(vstcounts, intgroup=c("condition"))
It is also possible to customize the PCA plot using the ggplot function. Below we are adding shapes to represent the different days the samples were created.
Now, we are making sure the x and y axis are similar using
ylim
argument
pcaData <- plotPCA(vstcounts, intgroup=c("condition", "treatment"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, shape=condition, color=treatment)) +
geom_point(size=4) +
ylim(-10,15) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
coord_fixed()
ggsave("M0_VS_M1andM1T368_DESeq2_PCA_Final.png")
Now that we got it looking the way we want it, we can take some time to interpret the plot. Each data point represents one sample. The x-axis typically PC1, explains the largest source of variation in the dataset. We can see that PC1 is primarily explained by stimulating macrophages (M1) by IFNg+LPS. The next largest source of variance within the dataset, or PC2, is explained by one sample created by DGAT1 inhibition (T863 drug treatment) in the M1 group. I expected to see higher variance % in PC2.
Finally, we can generate a correlation heatmap. This is a type of plot used to visualize the strength of relationships between samples.
library(pheatmap)
rld <- rlog(dds, blind=FALSE)
rld_mat <- assay(rld)
rld_cor <- cor(rld_mat)
pheatmap(rld_cor)
Using some arguments, we can make the plot look pretty
pheatmap(rld_cor, clustering_distance_rows = "euclidean",
fontsize = 10, cellwidth = 35, show_rownames = F, color =
colorRampPalette(brewer.pal(10,"PiYG"))(400))
ggsave("M0_VS_M1amdT368_DESeq_heatmap_Final.png")
According to this correlation heatmap, M1(including M1+T863) and M0 are highly distinguishable, but M1 and M1+T863 appeared to be highly correlated.
Previously, I generated plots (PCA/correlation heatmap) to visualize the quality of the samples. I will continue to visualize the RNA-Seq dataset of Macrophages that are stimulated with LPS+IFNg (with and without DGAT1 inhibitor, T863). In the data set, I will use untreated group (M0) as my WT (reference) and all stimulated groups (M1) as testing group. The goal of this volcano plot is to visualize significantly differentially expressed genes between reference and testing groups.
A common plot used to represent a global view of the data is a volcano plot. Here, log transformed adjusted p-values are plotted on the y-axis while the log2 fold change is plotted on the x-axis.
We will be generating volcano plots using:
ggplots2
: an data visualization package that is
offered via the tidyverse
collection
EnhancedVolcano: a package created by Kevin Blighe, Sharmila Rana, and Myles Lewis to generate publication-ready volcano plots.
Remember to first check if you require a package for this tutorial
using library()
. Once the package is installed there is no
need to reinstall it over and over again.
#BiocManager::install('EnhancedVolcano') #if you need to install
#BiocManager::install("biomaRt") #if you need to install
library(EnhancedVolcano)
library(biomaRt)
library(dplyr)
library(tidyr)
library(ggplot2)
resdata <- read.csv("M0_VS_M1andM1T368_unfiltered_matrix_Final.csv", header = T)
resdata <- resdata[, c(2:17)]
head(resdata)
## Gene_id baseMean log2FoldChange lfcSE stat pvalue
## 1 ENSMUSG00000000386.19 15235.742 8.119711 0.17839804 45.51457 0
## 2 ENSMUSG00000006818.6 7758.105 4.159120 0.08841911 47.03870 0
## 3 ENSMUSG00000015837.16 21140.896 2.795451 0.06015630 46.46980 0
## 4 ENSMUSG00000016496.8 24244.104 8.134324 0.13927930 58.40296 0
## 5 ENSMUSG00000017652.17 5173.867 6.058249 0.13697318 44.22945 0
## 6 ENSMUSG00000018899.18 15003.974 5.538247 0.07604645 72.82716 0
## padj MO_1 MO_2 MO_3 M1_1 M1_2 M1_3 M1_T368_1
## 1 0 87.45999 86.78768 71.80221 26227.415 27253.492 26498.443 17434.796
## 2 0 588.58211 695.03078 613.86945 12330.056 11839.744 11089.540 11696.676
## 3 0 4258.75008 4119.13310 4398.87168 30474.816 28893.605 28054.684 30529.450
## 4 0 121.34089 107.93762 160.17416 32976.318 36028.969 32225.833 35137.266
## 5 0 114.24954 121.79448 110.46494 9026.013 8623.723 8081.866 6895.416
## 6 0 494.03077 508.32786 434.75844 22535.267 22614.290 22597.254 21142.867
## M1_T368_2 M1_T368_3
## 1 19974.740 19486.738
## 2 10580.185 10389.257
## 3 30429.633 29109.123
## 4 43332.355 38106.745
## 5 7008.213 6583.063
## 6 22140.661 22568.308
Notice, that under the Gene_id column are Ensembl gene ids with versions included. This was due to the GTF file used to count reads. This is a tricky part with downloading reference files from public repositories as they may differ slightly depending on which repository was used.
Either way moving forward, we would like to create a volcano plot
with gene symbols not ENSEMBL gene ids, so this is a good opportunity
for us to use biomaRt
to perform some id conversion!
The biomaRt
program provides an interface to a growing
collection of databases implementing the BioMart software suite. This
package enables retrival of large amounts of data in a uniform way. The
most prominent examples of BioMart databases are maintained by Ensembl,
which provides users access to a diverse set of data and enables users
to perform gene annotation and data mining.
To begin, a first step in using the biomaRt
package is
to select a BioMart database and dataset to use. To know which BioMart
databases are available we will use listEnsembl()
. Please
note we have already loaded the biomaRt
package library
above!
listEnsembl()
## biomart version
## 1 genes Ensembl Genes 109
## 2 mouse_strains Mouse strains 109
## 3 snps Ensembl Variation 109
## 4 regulation Ensembl Regulation 109
The useEnsembl
function enables one to connect to a
specified BioMart database and dataset hosted by Ensembl without having
to specify the Ensembl URL. We will select genes
.
The mirror
specifies an Ensembl mirror to connect to.
The valid options include www
, uswest
,
useast
, and asia
.
ensembl <- useEnsembl(biomart = "genes", mirror="useast")
Next, we will be selecting the dataset we would like to use. We can view all the species datasets available by using:
#View(listDatasets(ensembl))
This dataset was created using Mouse bone marrow-derived macrophages (BMDMs). Therefore, we will be selecting Mus musculus.
ensembl <- useEnsembl(biomart = "genes", dataset = "mmusculus_gene_ensembl")
Overall our objective is to convert from one gene id to another.
There are many Gene ID types, below are some of the most common: + Entrez Gene - 1457, 2002, 1950 + Gene Symbol - Ikzf1, Tcf1, Cd19 + Ensembl ID - ENSMUSG00000018654
First, we will need to find the “type” of gene id of our query and target, respectively. To do this we can use Filters, which are used as inputs for a biomaRt query. Then we will need to use Attributes, which are the outputs or the information we want to retrieve based on our biomaRt target.
#View(listFilters(ensembl)) #ensembl_gene_id_version
#View(listAttributes(ensembl)) #external_gene_name
From this we selected that our gene id is in the form of
ensembl_gene_id_version
and we would like
external_gene_name
.
Now we will be using the getBM
function. Given a set of
filters, it will retrieve the user specified attribute fromt he BioMart
database.
biomart.output <- getBM(mart = ensembl, values = resdata$Gene_id,
filters = "ensembl_gene_id_version",
attributes = c("external_gene_name",
"ensembl_gene_id_version"))
head(biomart.output)
## external_gene_name ensembl_gene_id_version
## 1 Cdc45 ENSMUSG00000000028.16
## 2 Narf ENSMUSG00000000056.8
## 3 Cav2 ENSMUSG00000000058.7
## 4 Klf6 ENSMUSG00000000078.8
## 5 Cox5a ENSMUSG00000000088.8
## 6 Fer ENSMUSG00000000127.16
Let’s look at the structure of resdata:
str(resdata)
## 'data.frame': 17281 obs. of 16 variables:
## $ Gene_id : chr "ENSMUSG00000000386.19" "ENSMUSG00000006818.6" "ENSMUSG00000015837.16" "ENSMUSG00000016496.8" ...
## $ baseMean : num 15236 7758 21141 24244 5174 ...
## $ log2FoldChange: num 8.12 4.16 2.8 8.13 6.06 ...
## $ lfcSE : num 0.1784 0.0884 0.0602 0.1393 0.137 ...
## $ stat : num 45.5 47 46.5 58.4 44.2 ...
## $ pvalue : num 0 0 0 0 0 0 0 0 0 0 ...
## $ padj : num 0 0 0 0 0 0 0 0 0 0 ...
## $ MO_1 : num 87.5 588.6 4258.8 121.3 114.2 ...
## $ MO_2 : num 86.8 695 4119.1 107.9 121.8 ...
## $ MO_3 : num 71.8 613.9 4398.9 160.2 110.5 ...
## $ M1_1 : num 26227 12330 30475 32976 9026 ...
## $ M1_2 : num 27253 11840 28894 36029 8624 ...
## $ M1_3 : num 26498 11090 28055 32226 8082 ...
## $ M1_T368_1 : num 17435 11697 30529 35137 6895 ...
## $ M1_T368_2 : num 19975 10580 30430 43332 7008 ...
## $ M1_T368_3 : num 19487 10389 29109 38107 6583 ...
We would like to add the gene names to this data.frame. To do this we
will use dplyr::left_join
.
resdata_gene <- dplyr::left_join(resdata, biomart.output,
by=c('Gene_id'='ensembl_gene_id_version'))
resdata_gene <- resdata_gene[, c(1:17)]
str(resdata_gene)
## 'data.frame': 17281 obs. of 17 variables:
## $ Gene_id : chr "ENSMUSG00000000386.19" "ENSMUSG00000006818.6" "ENSMUSG00000015837.16" "ENSMUSG00000016496.8" ...
## $ baseMean : num 15236 7758 21141 24244 5174 ...
## $ log2FoldChange : num 8.12 4.16 2.8 8.13 6.06 ...
## $ lfcSE : num 0.1784 0.0884 0.0602 0.1393 0.137 ...
## $ stat : num 45.5 47 46.5 58.4 44.2 ...
## $ pvalue : num 0 0 0 0 0 0 0 0 0 0 ...
## $ padj : num 0 0 0 0 0 0 0 0 0 0 ...
## $ MO_1 : num 87.5 588.6 4258.8 121.3 114.2 ...
## $ MO_2 : num 86.8 695 4119.1 107.9 121.8 ...
## $ MO_3 : num 71.8 613.9 4398.9 160.2 110.5 ...
## $ M1_1 : num 26227 12330 30475 32976 9026 ...
## $ M1_2 : num 27253 11840 28894 36029 8624 ...
## $ M1_3 : num 26498 11090 28055 32226 8082 ...
## $ M1_T368_1 : num 17435 11697 30529 35137 6895 ...
## $ M1_T368_2 : num 19975 10580 30430 43332 7008 ...
## $ M1_T368_3 : num 19487 10389 29109 38107 6583 ...
## $ external_gene_name: chr "Mx1" "Sod2" "Sqstm1" "Cd274" ...
In its most basic usage, EnhancedVolcano
requires: +
data-frame (i.e. resdata_gene
), + a column of rownames
(i.e. lab =
), + a column name containing log2 fold changes
(i.e. x =log2FoldChange
) + a column name containing nominal
or adjusted pvalues (i.e. y = padj
)
EnhancedVolcano(resdata_gene,
lab = as.character(resdata_gene$external_gene_name),
x = 'log2FoldChange',
y = 'padj')
Now, there are numerous arguments that can be used to customize the
volcano plot with EnhancedVolcano
. We will begin with the
most fundamental and most important of these options which includes
visualizing sections of the volcano plot based off p adjusted cutoff of
< 0.05 and absolute log2 of 1. For both, vertical and horizontal
lines will be drawn to show these cutoffs exactly where we want them.
Note, this was done in the plot above, however were you confident where
these lines were drawn?
EnhancedVolcano(resdata_gene,
lab = as.character(resdata_gene$external_gene_name),
x = 'log2FoldChange',
y = 'padj',
title = "M1 (+/- T863) VS M0",
FCcutoff = 1,
pCutoff = 0.05)
Now, lets convert our colors to red and black like the article.
Naturally, EnhancedVolcano will color based on four quadrants. These
quadrants are shown (in order):
1) NS - shown in black 2) Log2FC - shown in green 3) p-value - shown in
blue 4) p-value & Log2FC - shown in red
In addition, let’s move the legend position to the right of the plot.
EnhancedVolcano(resdata_gene,
lab = as.character(resdata_gene$external_gene_name),
x = 'log2FoldChange',
y = 'padj',
title = "Stimulated VS Unstimulated",
subtitle = "",
FCcutoff = 1,
pCutoff = 0.05,
labSize = 4,
axisLabSize = 12,
col=c('black', 'black', 'black', 'red3'),
legendLabels=c('Not sig.','Log2FC','padj', 'padj & Log2FC'),
legendPosition = 'right',
legendLabSize = 12,
legendIconSize = 3.0)
There might be too many genes labeled here making the plot look cluttered. Let’s label a few select DE genes. First, lets get a complete of list of our DEGs using:
resdata_sig <- as.data.frame(resdata_gene) %>% dplyr::filter(abs(log2FoldChange) > 1 & padj < 0.05)
head(resdata_sig)
## Gene_id baseMean log2FoldChange lfcSE stat pvalue
## 1 ENSMUSG00000000386.19 15235.742 8.119711 0.17839804 45.51457 0
## 2 ENSMUSG00000006818.6 7758.105 4.159120 0.08841911 47.03870 0
## 3 ENSMUSG00000015837.16 21140.896 2.795451 0.06015630 46.46980 0
## 4 ENSMUSG00000016496.8 24244.104 8.134324 0.13927930 58.40296 0
## 5 ENSMUSG00000017652.17 5173.867 6.058249 0.13697318 44.22945 0
## 6 ENSMUSG00000018899.18 15003.974 5.538247 0.07604645 72.82716 0
## padj MO_1 MO_2 MO_3 M1_1 M1_2 M1_3 M1_T368_1
## 1 0 87.45999 86.78768 71.80221 26227.415 27253.492 26498.443 17434.796
## 2 0 588.58211 695.03078 613.86945 12330.056 11839.744 11089.540 11696.676
## 3 0 4258.75008 4119.13310 4398.87168 30474.816 28893.605 28054.684 30529.450
## 4 0 121.34089 107.93762 160.17416 32976.318 36028.969 32225.833 35137.266
## 5 0 114.24954 121.79448 110.46494 9026.013 8623.723 8081.866 6895.416
## 6 0 494.03077 508.32786 434.75844 22535.267 22614.290 22597.254 21142.867
## M1_T368_2 M1_T368_3 external_gene_name
## 1 19974.740 19486.738 Mx1
## 2 10580.185 10389.257 Sod2
## 3 30429.633 29109.123 Sqstm1
## 4 43332.355 38106.745 Cd274
## 5 7008.213 6583.063 Cd40
## 6 22140.661 22568.308 Irf1
write.csv(resdata_sig, file="M1 and M1T863 VS M0_DEG_forHeatmap_Final.csv")
Activated M1s are known to highly produce pro-inflammatory molecules, including Prostaglandine E2 (PGE2), TNF, IL-1b, IL-6, Nitric Oxide (N0). In order to visualize differential expressions of genes that are associated with pro-inflammation, I am selecting ‘Ptgs2’,‘Ptgs’,‘Ptges’ (involved in PGE2 synthesis),‘Acod1’ (production of proinflammatory metabolite called ‘itaconate’),‘Nfkb’ (transcription factor - as a postive control),‘Gbp2’,‘Gbp3’ (used as M1 marker genes - as a positive control),‘Dgat1’,‘Dgat2’ (TG synthesis proteins, Author’s genes of interest),‘Il6’,‘Il1b’,‘Tnf’ (pro-inflammatory cytokines),‘Sod2’ (a build up of nitric oxide is known to induce ROS-response genes such as Sod2 which codes for superoxide dismutase),‘Cpt1a’ (involved in fatty acid oxidation),‘Ccr7’ (involved in Macrophage chemoxasis), and’Fasn’ (fatty acid synthesis).
EnhancedVolcano(resdata_gene,
lab = as.character(resdata_gene$external_gene_name),
x = 'log2FoldChange',
y = 'padj',
selectLab = c('Nos2', 'Ptgs2','Ptgs','Ptges','Acod1','Nfkb','Gbp2','Gbp3','Dgat1','Dgat2','Il6','Il1b','Tnf','Sod2','Cpt1a','Ccr7','Fasn'),
title = "M1 (+/- T863) VS M0",
subtitle = "",
FCcutoff = 1,
pCutoff = 0.05,
labSize = 4,
axisLabSize = 12,
col=c('black', 'black', 'black', 'red3'),
legendLabels=c('Not sig.','Log2FC','padj', 'padj & Log2FC'),
legendPosition = 'right',
legendLabSize = 16,
legendIconSize = 5.0,
drawConnectors = TRUE,
widthConnectors = 1.0,
colConnectors = 'black')
It looks like a few genes are hard to read. We can fix this by adding a boxed label. This will make the gene name pop out and will be given a white background. I am also choosing to bold the gene name. In addition, lets get rid of the major and minor grid lines as they are taking away from the image aesthetics. 3
EnhancedVolcano(resdata_gene,
lab = as.character(resdata_gene$external_gene_name),
x = 'log2FoldChange',
y = 'padj',
selectLab = c('Nos2', 'Ptgs2','Ptgs','Ptges','Acod1','Nfkb','Gbp2','Gbp3','Dgat1','Dgat2','Il6','Il1b','Tnf','Sod2','Cpt1a','Ccr7','Fasn'),
title = "M1 (+/- T863) VS M0",
subtitle = "",
FCcutoff = 1,
pCutoff = 0.05,
labSize = 6,
axisLabSize = 15,
col=c('black', 'black', 'black', 'indianred2'),
labCol = 'black',
labFace = 'plain',
boxedLabels = TRUE,
legendLabels=c('Not sig.','Log2FC','padj', 'padj & Log2FC'),
legendPosition = 'right',
legendLabSize = 14,
legendIconSize = 5.0,
drawConnectors = TRUE,
widthConnectors = 0.5,
colConnectors = 'black',
gridlines.major = FALSE,
gridlines.minor = FALSE)
ggsave("M0 VS M1_enhanced Volcano_run2.png")
If you wanted to play around with colors (up versus down regulated) below is a script to help you get started.
keyvals <- ifelse(
resdata_gene$log2FoldChange < -2, 'purple2',
ifelse(resdata_gene$log2FoldChange > 2, 'seagreen3',
'black'))
keyvals[is.na(keyvals)] <- 'black'
names(keyvals)[keyvals == 'purple2'] <- 'Downregulated'
names(keyvals)[keyvals == 'black'] <- 'NA'
names(keyvals)[keyvals == 'seagreen3'] <- 'Upregulated'
EnhancedVolcano(resdata_gene,
lab = as.character(resdata_gene$external_gene_name),
x = 'log2FoldChange',
y = 'padj',
selectLab = c('Nos2', 'Ptgs2','Ptgs','Ptges','Acod1','Nfkb','Gbp2','Gbp3','Dgat1','Dgat2','Il6','Il1b','Tnf','Sod2','Cpt1a','Ccr7','Fasn'),
boxedLabels = TRUE,
drawConnectors = TRUE,
widthConnectors = 1.0,
colConnectors = 'black',
xlab = bquote(~Log[1]~ 'fold change'),
colCustom = keyvals,
title = "M1 (+/- T863) VS M0",
subtitle = "",
FCcutoff = 2,
pCutoff = 0.05,
labSize = 5.5,
labFace = 'plain',
axisLabSize = 13,
legendLabels=c('Not sig.','Log2FC','padj', 'padj & Log2FC'),
legendPosition = 'right',
legendLabSize = 14,
legendIconSize = 4.5,
gridlines.major = FALSE,
gridlines.minor = FALSE)
ggsave("M0 VS M1_enhance volcano_run2.png")
For my visualization preference, I would like for my enhanced volcano plot to be sitting sideways. I would like the unregulated genes to be first top half of the screen. So, I will add ‘+coord_flip()’ to the previous r chunk.
keyvals <- ifelse(
resdata_gene$log2FoldChange < -2, 'purple2',
ifelse(resdata_gene$log2FoldChange > 2, 'seagreen3',
'black'))
keyvals[is.na(keyvals)] <- 'black'
names(keyvals)[keyvals == 'purple2'] <- 'Downregulated'
names(keyvals)[keyvals == 'black'] <- 'NA'
names(keyvals)[keyvals == 'seagreen3'] <- 'Upregulated'
EnhancedVolcano(resdata_gene,
lab = as.character(resdata_gene$external_gene_name),
x = 'log2FoldChange',
y = 'padj',
selectLab = c('Nos2', 'Ptgs2','Ptgs','Ptges','Acod1','Nfkb','Gbp2','Gbp3','Dgat1','Dgat2','Il6','Il1b','Tnf','Sod2','Cpt1a','Ccr7','Fasn'),
boxedLabels = TRUE,
drawConnectors = TRUE,
widthConnectors = 0.8,
colConnectors = 'red2',
parseLabels = TRUE,
xlab = bquote(~Log[1]~ 'fold change'),
colCustom = keyvals,
title = "M1 (+/- T863) VS M0",
subtitle = "",
FCcutoff = 2,
pCutoff = 0.05,
labSize = 5.5,
labFace = 'plain',
axisLabSize = 13,
legendLabels=c('Not sig.','Log2FC','padj', 'padj & Log2FC'),
legendPosition = 'right',
legendLabSize = 14,
legendIconSize = 4.5,
gridlines.major = FALSE,
gridlines.minor = FALSE) + coord_flip()
ggsave("M0 VS M1andM1T863_Volcano_Side.png")
This enhanced volcano plot shown above will be used for my presentation. Regardless of T863 treatment, M1 appears to upregulate the genes that are associated with inflammation. Interestingly, Dgat1 and 2 expression levels were not too different in M1 in comparison to M0. This low level Dgat1 expression difference is different than what the authors found in their data analysis. This data analysis discrepancy may due to sequencing library analyzing method - Authors’ used SnakePipe pipeline program for the analysis. ***
We have discovered that ggplot2
has incredible
functionality and versatility; however, it is not always the best choice
for all graphics. There are a plethora of different packages that
specialize in particular types of figures. In this section, we will
concentrate on creating hierarchical heatmaps using one of these
packages called pheatmap
.
pheatmap
is a package used for drawing pretty heatmaps
in R. This function has several advantages including the ability to
control heatmap dimensions and appearance.
#install.packages(pheatmap) #install pheatmap
library(pheatmap)
library(RColorBrewer)
library(ggplot2)
library(cowplot)
We will continue using our counts matrix generated identifying differentially expressed genes between M1 (with and without T863) versus M0.
Load the ..DEG.csv
file and the ..anno.csv
and place them into the object called heatmap_normDEG
and
heatmap_meta
, respectively.
heatmap_normDEG <- read.csv("M0_VS_M1andM1T863_DEG_Trimmed_final.csv", header = T, row.names = 1)
#heatmap_normDEG
heatmap_meta <- read.csv("SG_GSE145523_M0_VS_M1andM1T368_anno_simple_heatmap.csv", header = T)
heatmap_meta
## sample condition treatment
## 1 MO_1 M0 none
## 2 MO_2 M0 none
## 3 MO_3 M0 none
## 4 M1_1 M1 IFNg+LPS
## 5 M1_2 M1 IFNg+LPS
## 6 M1_3 M1 IFNg+LPS
## 7 M1_T368_1 M1 IFNg+LPS+T863
## 8 M1_T368_2 M1 IFNg+LPS+T863
## 9 M1_T368_3 M1 IFNg+LPS+T863
pheatmap
There are few steps that need to be performed prior to using the
pheatmap()
function.
# Assign the sample names to be the row names
rownames(heatmap_meta) <- heatmap_meta$sample
factor()
function.# Re-level factors
heatmap_meta$condition <- factor(heatmap_meta$condition, levels = c("M1", "M0"))
colorRampPalette()
function.# Colors for heatmap
heatmap_colors <- colorRampPalette(c("blue", "white", "red"))(6)
# Colors for denoting annotations
ann_colors <- list(condition = c(M1 = "#D11313", M0 = "#6DB0F8"))
Now we are ready to draw the heatmap using the
pheatmap()
function. Go ahead a fill it out!
# BASIC USAGE
## DO NOT RUN
#pheatmap(numeric_matrix,
#color = ,
#annotation_col = ,
#annotation_colors = ,
#show_colnames = ,
#show_rownames = )
?pheatmap() #if you have questions
# Plot heatmap
pheatmap(heatmap_normDEG,
color = heatmap_colors,
annotation_col = heatmap_meta,
annotation_colors = ann_colors,
show_colnames = F,
show_rownames = F)
***
pheatmap(heatmap_normDEG,
color = heatmap_colors,
annotation_col = heatmap_meta,
annotation_colors = ann_colors,
show_colnames = F,
show_rownames = F,
cellwidth = 20,
cellheight = 0.1,
scale = "row")
NOTE: Upon addition of
scale="row"
Z-scores are plotted, rather than the actual normalized count value. Z-scores are computed on a gene-by-gene basis by subtracting the mean and then dividing by the standard deviation. The Z-scores are computed after the clustering, so that it only affects the graphical aesthetics and the color visualization is improved.
Now we are only missing the text annotations that can be added using
the cowplot
package. However, the pheatmap
image is stored as a list rather than a graphics object. The
ggplotify
package allows us to manipulate this list object
into a ggplot object with the as.ggplot()
function,
facilitating further modifications with ggplot2
or
cowplot
.
# Load library
#install.packages("ggplotify")
library(ggplotify)
# Turn into a ggplot object
heatmap <- as.ggplot(pheatmap(heatmap_normDEG,
color = heatmap_colors,
cluster_rows = T,
annotation_col = heatmap_meta,
annotation_colors = ann_colors,
border_color="white",
cluster_cols = T,
show_colnames = F,
show_rownames = F,
scale="row",
treeheight_col = 10,
treeheight_row = 25,
cellwidth = 25,
cellheight = 0.1,
fontsize = 9,
annotation_names_col = F))
To add the annotations to the left sides of the plot, we will use the
theme()
function from ggplot2 to add whitespace. With the
plot.margin argument we can specify the amount of space to add to the
top, right, bottom and left margins (in that order).
# Adding white space to add annotations to the top, right, bottom, and left margins
heatmap <- heatmap +
theme(plot.margin = unit(c(1,0.5,0.5,0.5), "cm"))
Now we can use cowplot to add in our desired annotations to match the published figure below.
# Adding in annotations with cowplot
heatmap_figure <- ggdraw(heatmap) +
draw_label("Macrophages",
x = 0.31,
y = 0.89,
size = 17,
hjust = 0,
vjust = 0,
fontface = "bold") +
draw_label("M0",
x = 0.25,
y = 0.85,
size = 13,
hjust = 0,
vjust = 0,
fontface = "bold") +
draw_label("M1",
x = 0.38,
y = 0.85,
size = 13,
hjust = 0,
vjust = 0,
fontface = "bold") +
draw_label("M1+T863",
x = 0.48,
y = 0.85,
size = 13,
hjust = 0,
vjust = 0,
fontface = "bold") +
draw_label("Down-regulated (2443)",
angle = 90,
size = 14,
x = 0.1,
y = 0.55,
hjust = 0,
vjust = 0) +
draw_label("Up-regulated (2301)",
angle = 90,
size = 14,
x = 0.1,
y = 0.25,
hjust = 0,
vjust = 0)
ggsave("M0 VS M1 and M1T368_heatmap_Final.png")
heatmap_figure
As it was reflected by the correlation heatmap (Part A. DESeq2), both M1 (M1 with and without T863) showed distinguishable gene epression pattern than M0. However, according to this annotated heatmap, not many genes appeared to be significantly different between M1 and M1+T863. Further narrow down analysis between M1 and M1+T863 will be done using ‘Gullickson_M1_VS_M1T863.Rmd’ located in /R_project_Soyeon_Gullickson_M1_VS_M1T863/.
library(biomaRt)
library(clusterProfiler) #BiocManager::install("clusterProfiler")
library(org.Mm.eg.db) #BiocManager::install("org.Mm.eg.db") ## this is mouse! not a human or bear!
library(DOSE) #BiocManager::install("DOSE")
library(ggnewscale) #install.packages("ggnewscale")
library(pathview) #BiocManager::install("pathview")
library(ggupset) #install.packages("ggupset")
library(enrichplot) #BiocManager::install("enrichplot")
The goal of functional analysis is provide biological insight, so it’s necessary to analyze our results in the context of our experimental hypothesis: Stimulation of macrophages using LPS+IFNg enhances macrophages pro-inflammatory immune effector functions. Therefore, we may expect an enrichment of processes/pathways related to myeloid/innate immune responses.
The Gene Ontology (GO) knowledgebase is the world’s largest source of information on the function of genes. GO terms are organized into three independent controlled vocabularies (ontologies) in a species-independent manner:
Biological process: refers to the biological role involving the gene or gene product, and could include “transcription”, “signal transduction”, and “apoptosis”. A biological process generally involves a chemical or physical change of the starting material or input.
Molecular function: represents the biochemical activity of the gene product, such activities could include “ligand”, “GTPase”, and “transporter”.
Cellular component: refers to the location in the cell of the gene product. Cellular components could include “nucleus”, “lysosome”, and “plasma membrane”.
Each GO term has a term name (e.g. DNA repair) and a unique term accession number (GO:0005125), and a single gene product can be associated with many GO terms.
We will be using clusterProfiler to perform over-representation analysis on GO terms associated with our list of significant genes. The tool takes as input a significant gene list and a background gene list and performs statistical enrichment analysis using hypergeometric testing. The basic arguments allow the user to select the appropriate organism and GO ontology (BP, CC, MF) to test.
resdata <- read.csv("M0_VS_M1andM1T368_unfiltered_matrix_Final.csv", header = T)
resdata <- resdata[, c(2:17)]
head(resdata)
## Gene_id baseMean log2FoldChange lfcSE stat pvalue
## 1 ENSMUSG00000000386.19 15235.742 8.119711 0.17839804 45.51457 0
## 2 ENSMUSG00000006818.6 7758.105 4.159120 0.08841911 47.03870 0
## 3 ENSMUSG00000015837.16 21140.896 2.795451 0.06015630 46.46980 0
## 4 ENSMUSG00000016496.8 24244.104 8.134324 0.13927930 58.40296 0
## 5 ENSMUSG00000017652.17 5173.867 6.058249 0.13697318 44.22945 0
## 6 ENSMUSG00000018899.18 15003.974 5.538247 0.07604645 72.82716 0
## padj MO_1 MO_2 MO_3 M1_1 M1_2 M1_3 M1_T368_1
## 1 0 87.45999 86.78768 71.80221 26227.415 27253.492 26498.443 17434.796
## 2 0 588.58211 695.03078 613.86945 12330.056 11839.744 11089.540 11696.676
## 3 0 4258.75008 4119.13310 4398.87168 30474.816 28893.605 28054.684 30529.450
## 4 0 121.34089 107.93762 160.17416 32976.318 36028.969 32225.833 35137.266
## 5 0 114.24954 121.79448 110.46494 9026.013 8623.723 8081.866 6895.416
## 6 0 494.03077 508.32786 434.75844 22535.267 22614.290 22597.254 21142.867
## M1_T368_2 M1_T368_3
## 1 19974.740 19486.738
## 2 10580.185 10389.257
## 3 30429.633 29109.123
## 4 43332.355 38106.745
## 5 7008.213 6583.063
## 6 22140.661 22568.308
It is helpful if we align all of our ducks in a row. Therefore, we
need to do some gene_id conversion. For clusterprofiler, we will use
both ensembl id and Entrez id. Below is the basic format for the
useEnsembl()
function:
#listEnsembl() #list databases available
## DO NOT RUN
#ensembl <- useEnsembl(biomart = ,
#dataset = ,
#mirror=)
ensembl <- useEnsembl(biomart = "genes",
dataset = "mmusculus_gene_ensembl",
mirror="useast")
Now, go ahead and fill in the getBM()
function to
perform the gene_id match of ensembl gene ids to entrez gene ids. Output
this function to an object called biomart.output
.
#View(listFilters(ensembl))
#View(listAttributes(ensembl))
biomart.output <- getBM(mart = ensembl, values = resdata$Gene_id,
filters = "ensembl_gene_id_version",
attributes = c("ensembl_gene_id_version",
"entrezgene_id"))
head(biomart.output)
## ensembl_gene_id_version entrezgene_id
## 1 ENSMUSG00000000028.16 12544
## 2 ENSMUSG00000000056.8 67608
## 3 ENSMUSG00000000058.7 12390
## 4 ENSMUSG00000000078.8 23849
## 5 ENSMUSG00000000088.8 12858
## 6 ENSMUSG00000000127.16 14158
We are ready to merge our new entrez annotations with
resdata
names(biomart.output)[1] <- "Gene_id"
#biomart.output
resdata_gene <- dplyr::left_join(resdata, biomart.output,
by=c('Gene_id'='Gene_id'))
#resdata_gene
To perform the over-representation analysis, we need a list of background genes and a list of significant genes. For our background dataset we will use all genes in our results table.. For our significant gene list we will use genes with p-adjusted values less than 0.05 and absolute log2FC of 1.
# we want the log2 fold change
original_gene_list <- resdata_gene$log2FoldChange
head(original_gene_list)
## [1] 8.119711 4.159120 2.795451 8.134324 6.058249 5.538247
# name the vector
names(original_gene_list) <- resdata_gene$entrezgene_id
head(original_gene_list)
## <NA> 20656 18412 60533 21939 16362
## 8.119711 4.159120 2.795451 8.134324 6.058249 5.538247
#original_gene_list
# omit any NA values
gene_list<-na.omit(original_gene_list)
# sort the list in decreasing order (will be required so good to do this now)
gene_list = sort(gene_list, decreasing = TRUE)
# Extract significant results (padj < 0.05) using subset()
sig_genes_df = subset(resdata_gene, padj < 0.05)
# From significant results, we want to filter on log2fold change
genes <- sig_genes_df$log2FoldChange
# Name the vector i.e. add entrez gene id
names(genes) <- sig_genes_df$entrezgene_id
# omit NA values
genes <- na.omit(genes)
# filter on min log2fold change (log2FoldChange > 1)
genes <- names(genes)[abs(genes) > 1]
Ok, now we are ready to perform the GO enrichment analysis! Go ahead and fill in the rest!
NOTE: The different organisms with annotation databases available to use with for the
OrgDb
argument can be found below:Also, the
keyType()
argument can be used once you have filled in the OrgDb.Finally, the
ont
argument can accept either “BP” (Biological Process), “MF” (Molecular Function), and “CC” (Cellular Component) subontologies, or “ALL” for all three. For this example, set it to “BP”.
#go_enrich <- enrichGO(gene = ,
#universe = ,
#OrgDb = ,
#keyType = ,
#readable = T,
#ont = ,
#pvalueCutoff = 0.05,
#qvalueCutoff = 0.05)
keytypes(org.Mm.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MGI"
## [16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UNIPROT"
go_enrich <- enrichGO(gene = genes,
universe = names(gene_list),
OrgDb = org.Mm.eg.db,
keyType = 'ENTREZID',
readable = T,
ont = "BP",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05)
## Output results from GO analysis to a table
cluster_summary <- data.frame(go_enrich)
write.csv(cluster_summary, "clusterProfiler_M1andM1T863 vs M0.csv")
head(cluster_summary)
## ID
## GO:0009617 GO:0009617
## GO:0031347 GO:0031347
## GO:0002460 GO:0002460
## GO:0006935 GO:0006935
## GO:0042330 GO:0042330
## GO:0001819 GO:0001819
## Description
## GO:0009617 response to bacterium
## GO:0031347 regulation of defense response
## GO:0002460 adaptive immune response based on somatic recombination of immune receptors built from immunoglobulin superfamily domains
## GO:0006935 chemotaxis
## GO:0042330 taxis
## GO:0001819 positive regulation of cytokine production
## GeneRatio BgRatio pvalue p.adjust qvalue
## GO:0009617 250/3841 462/12418 5.238966e-26 2.845282e-22 2.001285e-22
## GO:0031347 254/3841 475/12418 2.130683e-25 5.785869e-22 4.069604e-22
## GO:0002460 151/3841 264/12418 4.321770e-19 7.823844e-16 5.503054e-16
## GO:0006935 192/3841 365/12418 2.114030e-18 2.870325e-15 2.018899e-15
## GO:0042330 192/3841 366/12418 3.111990e-18 3.380244e-15 2.377561e-15
## GO:0001819 197/3841 379/12418 4.169712e-18 3.774284e-15 2.654716e-15
## geneID
## GO:0009617 Cd274/Cmpk2/Cd86/C3/Tap2/Mrc1/Gbp3/Gbp2/Herc6/Clec4e/Trim30a/Ifit1/Isg15/Nod1/Gbp7/Irgm1/Slamf8/Trim30d/H2-K1/Irgm2/Slfn2/Ifi204/Igtp/Ifi211/Serpinb9/Ccl2/Nlrc5/Irf8/Cxcl16/Cd52/Ifit3/Iigp1/Mndal/Tnfrsf1b/Abr/Jak2/Rnf213/Oas3/Gbp9/Usp18/Gbp5/Casp4/Slfn4/Ccl5/Cflar/B2m/Stat1/C5ar1/Dhx58/Znfx1/Fcgr1/Irf5/Nfkbia/Oas2/Casp1/Cd300lb/Tlr9/Arid5a/Psmb9/Ifi44/Malt1/Anxa3/Plaat3/Lgals9/Cxcl10/Cd47/Fcgr4/Tnfaip3/Ptgs2/Ripk2/Slc11a1/Bcl3/AA467197/Mx2/Trim12a/Rnf31/Lyz2/Sbno2/Peli1/Ifi205/Cx3cr1/Acod1/Gsdmd/Eif2ak2/Enpp1/Ptafr/Vim/Tnf/Tbk1/Wfdc17/Stap1/Syk/Lcn2/Sp110/Irak2/Axl/Mpeg1/Cebpb/Pde4b/Ednrb/Nfkbib/Tnfrsf1a/Fkbp5/Gch1/Rnase4/Ly6a/Nos2/Zc3h12a/Gbp4/Lpl/Pf4/Cyba/Nod2/Abcd2/Oas1a/Tnfrsf14/Hmgb2/Stab1/Trf/Mt2/Cxcl9/Oas1g/Acp5/Tnip1/Saa3/Gbp8/Trim12c/Naip2/Colec12/Cd84/Hck/Bnip3/Trem2/Cdk4/Tgtp1/Plscr4/Fcer1g/Mef2c/Notch2/Gbp6/Scarb1/Slc17a5/Chd7/Tent5a/Ticam1/Il12b/Serpina3f/Irak3/Foxp1/Pmp22/Spi1/Resf1/Adgrl2/Nlrc3/Zfp36/Pdcd4/Nlrc4/Tlr6/Ang/Serpine1/Il1a/Il1b/Trim5/Rela/Ptger2/Cfb/Sirpa/Ccl12/Spn/Hmgb1/H2-M3/Il18/Ncf1/Gbp2b/Cd36/Naip1/Mapkbp1/Trim6/Tnip3/Noct/Ptpn22/Prkce/Oas1c/Il6/Gbp10/Gbp11/Myd88/Klrk1/Cd80/Trib1/Cfh/Dusp10/Tnfrsf11a/Il36g/Pdcd1lg2/Tmem229b/Ccr7/Gpr31b/Slpi/Ptgir/Stab2/Klhl6/Ociad2/Bank1/Scimp/Cd24a/Tnfsf4/Cxcl2/Rps6ka3/Fgr/Adm/Cd79b/Ppard/Plac8/Raet1a/Raet1b/Raet1c/Mlh1/Bcr/Il23r/Mgst1/Nr1h3/Arg2/Zfp983/Rab34/Hamp/Gstp1/Lsm5/Car5b/Mir29a/Mir146/Mir23b/Ifnb1/Lta/Cav1/Ido1/Il12rb2/Pde4d/Cd4/Ssc5d/Dab2ip/Cx3cl1/Mir30b/Mir210/Tlr5/Mpo/S100a8/Pde2a/Hmcn1/Fabp4
## GO:0031347 Irf1/C3/Tap2/Irf7/Samhd1/Trim30a/Slc7a2/Parp14/Isg15/Nod1/Stat2/Trafd1/Ifi209/Irgm1/Slamf8/Trim30d/Ifi204/Igtp/Ifi211/Serpinb9/Daglb/Dtx3l/Cd74/Nlrc5/Socs3/Socs1/Ifi206/Ifi203/Tap1/Ninj1/Mndal/Tnfrsf1b/Abr/Jak2/Oas3/Gbp5/Casp4/Ccl5/Stat1/Bst1/Dhx58/N4bp1/Znfx1/Fcgr1/Zbp1/Pml/Nfkbia/Casp1/Tlr9/Trim56/Ctsc/Rigi/Zc3hav1/Ifi35/Adar/Il15/Aoah/Riok3/Anxa1/Lgals9/Fgl2/Cd47/Tnfaip3/Cst7/Ptgs2/Ripk2/Il27/Apobec3/Ifi208/Trim12a/Pik3cg/Rnf31/Mmp12/Sbno2/Arel1/Ifi205/Igf1/Cx3cr1/Cd200/Acod1/Nupr1/Clec2d/Parp9/Fancd2/Tnf/Tbk1/Pla2g4a/Stap1/Syk/Rnf135/Ptges/Abhd12/Cd28/Mefv/Ilrun/Fanca/Cebpb/Ednrb/Tnfrsf1a/Zc3h12a/Gbp4/Lpl/Nfe2l1/Cyba/Nod2/Oas1a/Traf3/Hmgb2/Serpinb1a/Rabgef1/Il16/Pvr/Clec4n/Oas1g/Acp5/Tnip1/Casp8/Creb3/Trim12c/Ier3/Il12rb1/Adrb2/Xrcc6/Calcrl/Nmi/Trim21/Ifi214/Trem2/Pbk/Smpdl3b/Fcer1g/Mef2c/Ets1/Hpse/Cdk19/Pparg/Il12b/Prkdc/Aim2/Irak3/Foxp1/Pmp22/C1qtnf12/Spi1/Mul1/Ccr2/Dnase1l3/Tlr8/Nlrc3/Zfp36/Pdcd4/Il17ra/Nlrc4/Ifi207/Fgfr1/Tlr6/Metrnl/Ptpn2/Serpine1/Uaca/Il1b/Trim5/Rela/Spn/Hmgb1/H2-M3/Cd200r4/Adora2b/Il18/Ncf1/Rasgrp1/Tlr3/Mapkbp1/Trim6/Ptpn22/Ttll12/Oas1c/Esr1/Il6/Adora2a/Dagla/Slamf6/Cebpa/Myd88/C1qbp/Klrk1/Treml4/Il1rl1/Cadm1/Ccr1/Pik3ap1/Cfh/Dusp10/Tnfrsf11a/Ccr7/Clcf1/Ctla2a/Lgals1/Hsp90aa1/Polr3g/Gpr31b/Serpinb9b/Pspc1/Serping1/Tnc/Osm/Sema7a/Banf1/Lacc1/Cd24a/Mcph1/Tnfsf4/Hgf/Cd81/Fgr/Ppard/Mmp8/Ceacam1/Nfkbiz/Il33/Raet1b/Casp12/Bcr/Il23r/Nr1h3/Arg2/Htra1/Lrrk2/Rab34/Xrcc5/Gprc5b/Hamp/Gstp1/Ikbke/Sphk1/Npas2/Il2ra/Ccl24/Alox5/Fut7/Lta/Bcl6b/Ido1/Cd276/Cx3cl1/Ada/Stk39/Pla2g3/Lag3/S100a8/Pde2a/Arg1/Fabp4/Il1rl2
## GO:0002460 Cd274/Cd40/Irf1/Rsad2/C3/Tap2/Irf7/Ctsh/H2-T24/H2-T22/H2-K1/H2-T23/Pnp/Serpinb9/H2-Ab1/Cd74/Il18bp/H2-Aa/H2-D1/Icam1/Tnfrsf1b/Jak2/H2-Eb1/B2m/Stx11/H2-DMa/Fcgr1/Pirb/Il4ra/Arid5a/Ctsc/Malt1/Sema4a/Parp3/Anxa1/Fgl2/H2-Q4/Tnfaip3/Ripk2/Slc11a1/Il27/Bcl3/Unc13d/Fas/Vsir/Enpp1/Stat3/H2-DMb1/Tnf/C1ra/Pla2g4a/Csf2rb/Msh2/Vegfa/Emp2/Cd28/C1rl/Atad5/Msh6/Zc3h12a/Nod2/Hfe/Traf2/Il21r/Ly9/Relb/Pvr/Tcirg1/Clec4n/Nfkb2/Il12rb1/Jak3/Trem2/Hprt/Zbtb7b/Cd1d1/Fcer1g/Exo1/Mef2c/Nsd2/Rif1/Tyk2/Il12b/Csf2rb2/Swap70/Ccr2/Ung/Il17ra/Il1b/Hspa8/Batf/Spn/Hmgb1/H2-M3/Gadd45g/Il18/Slamf1/H2-Q6/Unc93b1/Il6/Slamf6/Myd88/C1qbp/H2-M2/Treml4/Il1rl1/Cfh/Jag1/Ccr7/Clcf1/Stard7/Pms2/Serpinb9b/Ptpn6/Klhl6/Serping1/C4b/Ephb6/H2-Oa/Cracr2a/Tfrc/Hmces/H2-Q7/H2-Q9/Cd24a/Tnfsf4/Cd81/C1s1/Nfkbiz/H2-Q1/Il33/Mlh1/Il23r/Irf4/Tnfsf13/Gm8909/Aplf/Raet1e/Smad7/Cd46/Gapt/Ifnb1/Gm7030/Rab27a/Fut7/Lta/Cd4/H2-DMb2/Ada/H2-Q2/Arg1
## GO:0006935 Gas6/Slamf8/Cd74/Ccl2/Cxcl16/Pla2g7/Ninj1/Ccl5/Bst1/C5ar1/Slc8b1/St6gal1/Sema4a/Stx3/Itgb2/Ccl7/Anxa1/Lgals9/Cxcl10/Lrp1/Rhoh/Ccrl2/Prex1/Flrt2/Nrp2/Pik3cg/Met/Cxcr4/Cx3cr1/Gpr183/Coro1b/Ptafr/Fpr2/Stap1/Sema4d/Csf3r/Syk/Vegfa/Pde4b/Ednrb/Lama3/Plxna1/Dock4/Pf4/Nod2/Hmgb2/Padi2/Cxcl9/Il16/Mtus1/Ccl8/Saa3/Creb3/Ccl6/Ccr3/Lpar1/Plxna2/Csf1/Coro1a/Trem2/Cmklr1/Lhx2/Nrcam/Fcer1g/Mef2c/Apbb2/C5ar2/Bin2/Cysltr1/Plxnd1/Sbds/Nrp1/Gpr18/Lamb2/Swap70/Foxp1/Eng/Spi1/Egr2/Ccr2/Dusp1/Ccl22/Il17ra/Nup85/Fgfr1/Etv1/Ptpn2/Serpine1/Thbs1/Aif1/Il1b/Sema3c/Fn1/Ccl12/Flot1/Ccl9/Kif5c/Hmgb1/P2rx4/Smo/Rin3/Lsp1/Slamf1/Ptk2/Pdgfa/Itga9/Sell/Cxcr3/Plxna4/Unc5b/Plgrkt/Vasp/F7/Nr4a1/Vav3/Ptch1/Kalrn/Ch25h/Kdr/Sema6b/C1qbp/Fpr1/Celsr3/Amot/Vegfb/Ccr1/Mcu/Rac2/Emb/Plxnc1/Gpr35/L1cam/Agrn/Rtn4rl1/Cttn/Ccr7/Matn2/Sema4c/S1pr1/Ptpro/Ephb6/Sema7a/Kif5a/Bcar1/Plxna3/Runx3/Hgf/B3gnt2/Cxcl2/Ryk/Plekhg5/Egr3/Flrt3/Sema6d/Unc5a/Hsd3b7/Cxcl14/Ntn1/Rtn4r/Klf7/Thbs4/Cdk5r1/Gstp1/Itga1/Xcr1/Tnfsf14/Nectin1/Neo1/Cxcr6/Ccl4/Apbb1/Pgf/Ccl24/Alox5/Crppa/Tpbg/Pde4d/Slit1/Edn1/Nrg1/Cx3cl1/Hoxa1/Lox/Stk39/P2ry12/Fpr3/Vcam1/Sema6c/S100a8/Ccr1l1/Plxnb1/Sema3f
## GO:0042330 Gas6/Slamf8/Cd74/Ccl2/Cxcl16/Pla2g7/Ninj1/Ccl5/Bst1/C5ar1/Slc8b1/St6gal1/Sema4a/Stx3/Itgb2/Ccl7/Anxa1/Lgals9/Cxcl10/Lrp1/Rhoh/Ccrl2/Prex1/Flrt2/Nrp2/Pik3cg/Met/Cxcr4/Cx3cr1/Gpr183/Coro1b/Ptafr/Fpr2/Stap1/Sema4d/Csf3r/Syk/Vegfa/Pde4b/Ednrb/Lama3/Plxna1/Dock4/Pf4/Nod2/Hmgb2/Padi2/Cxcl9/Il16/Mtus1/Ccl8/Saa3/Creb3/Ccl6/Ccr3/Lpar1/Plxna2/Csf1/Coro1a/Trem2/Cmklr1/Lhx2/Nrcam/Fcer1g/Mef2c/Apbb2/C5ar2/Bin2/Cysltr1/Plxnd1/Sbds/Nrp1/Gpr18/Lamb2/Swap70/Foxp1/Eng/Spi1/Egr2/Ccr2/Dusp1/Ccl22/Il17ra/Nup85/Fgfr1/Etv1/Ptpn2/Serpine1/Thbs1/Aif1/Il1b/Sema3c/Fn1/Ccl12/Flot1/Ccl9/Kif5c/Hmgb1/P2rx4/Smo/Rin3/Lsp1/Slamf1/Ptk2/Pdgfa/Itga9/Sell/Cxcr3/Plxna4/Unc5b/Plgrkt/Vasp/F7/Nr4a1/Vav3/Ptch1/Kalrn/Ch25h/Kdr/Sema6b/C1qbp/Fpr1/Celsr3/Amot/Vegfb/Ccr1/Mcu/Rac2/Emb/Plxnc1/Gpr35/L1cam/Agrn/Rtn4rl1/Cttn/Ccr7/Matn2/Sema4c/S1pr1/Ptpro/Ephb6/Sema7a/Kif5a/Bcar1/Plxna3/Runx3/Hgf/B3gnt2/Cxcl2/Ryk/Plekhg5/Egr3/Flrt3/Sema6d/Unc5a/Hsd3b7/Cxcl14/Ntn1/Rtn4r/Klf7/Thbs4/Cdk5r1/Gstp1/Itga1/Xcr1/Tnfsf14/Nectin1/Neo1/Cxcr6/Ccl4/Apbb1/Pgf/Ccl24/Alox5/Crppa/Tpbg/Pde4d/Slit1/Edn1/Nrg1/Cx3cl1/Hoxa1/Lox/Stk39/P2ry12/Fpr3/Vcam1/Sema6c/S100a8/Ccr1l1/Plxnb1/Sema3f
## GO:0001819 Cd274/Cd40/Irf1/Rsad2/C3/Irf7/Clec4e/Isg15/Nod1/Ifi204/Pnp/Ifi211/Cd74/Ccl2/Irf8/Ifi206/Mndal/Jak2/Oas3/Gbp5/Casp4/Ccl5/B2m/Hif1a/Stat1/C5ar1/Dhx58/Irf5/Il4ra/Sulf2/Oas2/Casp1/Tlr9/Arid5a/Trim56/Rigi/Malt1/Ifih1/Zc3hav1/Il15/Riok3/Anxa1/Lgals9/Ptgs2/Lrp1/Ripk2/Slc11a1/Il27/Bcl3/Ifngr1/Mmp12/Peli1/Ifi205/Cd200/Gsdmd/Eif2ak2/Stat3/Ptafr/Tnf/Tbk1/Syk/Rnf135/Cd28/Mefv/Cebpb/Pde4b/Scamp5/Tnfrsf1a/Mapk9/Nos2/Ticam2/Pnp2/Lpl/Cyba/Nod2/Oas1a/Brca1/Tnfrsf14/Src/Sphk2/Traf2/Hmgb2/Ly9/Il16/Gapdh/Tarm1/Clec4n/Oas1g/Casp8/Il12rb1/Nfam1/Rtn4/Cd84/Trem2/Atf4/Zbtb7b/Cd1d1/Fcer1g/Ccdc88b/Hpse/Psen1/Ticam1/Tyk2/Il12b/Aim2/Irak3/Foxp1/Hk1/Mbp/Ccr2/Tlr8/Il17ra/Nlrc4/Ifi207/Rgcc/Tlr6/Serpine1/Il1a/Thbs1/Mcoln2/Aif1/Il1b/Rela/Batf/Flot1/Spn/Hmgb1/H2-M3/Adora2b/Il18/Slamf1/Rasgrp1/Cd36/Hilpda/Tlr3/Trim6/Ddit3/Unc93b1/Ptpn22/Oas1c/Tnfsf15/Panx1/Il6/Tnfrsf8/Slamf6/Myd88/Egr1/Klrk1/Il1rl1/Cadm1/Htr2b/Flt4/Il36g/Ccr7/Hsp90aa1/Polr3g/Eif2ak3/Il1rap/Osm/Clec9a/Sema7a/Scimp/Lacc1/H2-Q9/Tnfsf4/Hgf/Cd81/Fgr/Mmp8/Il33/Raet1b/Il23r/Irf4/Chil3/Lrrk2/Gprc5b/Cd244a/Il7/Sphk1/Cd34/Card11/Cd46/Ccl4/Akap12/Cyp2j6/Mapk11/Lta/Ido1/Il12rb2/Cd276/Pde4d/Cx3cl1/Tlr5/Pla2g3/Postn/F3/Il1rl2
## Count
## GO:0009617 250
## GO:0031347 254
## GO:0002460 151
## GO:0006935 192
## GO:0042330 192
## GO:0001819 197
clusterProfiler
has a variety of options for viewing the
over-represented GO terms. We will explore the barplot()
,
dotplot()
firtst.
barplot <- barplot(go_enrich, showCategory = 10,
title = "GO Biological Pathways",
font.size = 8)
barplot
Let’s make y-axis labels readable!
barplot <- barplot(go_enrich, showCategory = 15,
title = "GO Biological Pathways",
font.size = 12,
label_format = 50)
barplot
Next, we will represent the pathways as dotplot to display the top 10 GO terms by gene ratio (# genes related to GO term / total number of sig genes).
dot <- dotplot(go_enrich)
dot
Again, let’s make the y-axis label readable!
dot <- dotplot(go_enrich, label_format = 60, font.size = 12)
dot
pathway <- goplot(go_enrich, showCategory = 5)
pathway
Both the barplot()
and dotplot()
only
displayed most significant terms, but some users may want to know which
genes are involved in these significant terms.
In order to consider the potentially biological complexities in which
a gene may belong to multiple categories, we can use the
cnetplot()
function. The cnetplot()
depicts
the linkages of genes and biological concepts (e.g. GO terms or KEGG
pathways) as a network.
cnetplot(go_enrich, showCategory = 3,
categorySize="pvalue", foldChange=gene_list,
vertex.label.font=15, cex_label_gene = 0.6, cex_label_category= 0.9)
cnetplot(go_enrich, showCategory = 3,
categorySize="pvalue", foldChange=gene_list,
vertex.label.font=6, cex_label_category = 0.9,
node_label="category")
> Select which labels to be displayed. Options include ‘category’,
‘gene’, ‘all’(the default) and ‘none’.
cnetplot(go_enrich, showCategory = 2,
categorySize="pvalue", foldChange=gene_list,
circular = TRUE, colorEdge = TRUE,
cex_label_gene = 0.55, vertex.label.font=12)
Circular plots can be quite powerful, if the list was smaller, here things get a bit crowded - overall makes it hard to interpret. Moral of the story, just because you can make it doesn’t mean you should!
library(ggplot2)
plot_color <- cnetplot(go_enrich, showCategory = 5,
categorySize="pvalue", foldChange=gene_list,
vertex.label.font=10, cex_label_gene = 0.55, cex_label_category= 0.75) +
scale_color_gradient2(name='fold change', low='darkgreen', high='purple1')
plot_color
Within clusterprofiler we can take advantage of the
gseGO
function which will allow us to perform Gene Set
Enrichment Analysis. This package is great, because you can select which
pathways you would like to represent specifically. GSEA works by
aggregating the per gene statistics across genes within the given gene
set, therefore making it possible to detect situations where all genes
in a predefined set change in a small but coordinated way.
Some GSEA statistics:
Calculation of an Enrichment Score. The primary result of the gene set enrichment analysis is the enrichment score (ES), which reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes. GSEA calculates the ES by walking down the ranked list of genes, increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not. The magnitude of the increment depends on the correlation of the gene with the phenotype. The ES is the maximum deviation from zero encountered in walking the list. A positive ES indicates gene set enrichment at the top of the ranked list; a negative ES indicates gene set enrichment at the bottom of the ranked list.
Normalized Enrichment Score. The normalized enrichment score (NES) is the primary statistic for examining gene set enrichment results. By normalizing the enrichment score, GSEA accounts for differences in gene set size and in correlations between gene sets and the expression dataset; therefore, the normalized enrichment scores (NES) can be used to compare analysis results across gene sets.
False Discovery Rate (FDR) The false discovery rate (FDR) is the estimated probability that a gene set with a given NES represents a false positive finding. An FDR cutoff of 5% is standard.
Nominal P Value The nominal p value estimates the statistical significance of the enrichment score for a single gene set.
organism = get("org.Mm.eg.db") #load genome
gsea <- gseGO(geneList=gene_list,
ont ="BP",
keyType = "ENTREZID",
nPerm = 10000,
minGSSize = 3,
maxGSSize = 800,
pvalueCutoff = 0.05,
verbose = TRUE,
OrgDb = organism,
pAdjustMethod = "BH")
write.csv(gsea,"gsea_BP_M0 VS M1andM1T863.csv")
head(gsea)
## ID Description setSize
## GO:0051304 GO:0051304 chromosome separation 120
## GO:0034341 GO:0034341 response to interferon-gamma 122
## GO:0051983 GO:0051983 regulation of chromosome segregation 115
## GO:1905818 GO:1905818 regulation of chromosome separation 99
## GO:0035456 GO:0035456 response to interferon-beta 57
## GO:0045087 GO:0045087 innate immune response 573
## enrichmentScore NES pvalue p.adjust qvalue rank
## GO:0051304 -0.7519754 -2.725219 0.0002513826 0.005815937 0.004906671 1003
## GO:0034341 0.7913420 2.701156 0.0001652346 0.005815937 0.004906671 1610
## GO:0051983 -0.7480332 -2.690859 0.0002511301 0.005815937 0.004906671 1486
## GO:1905818 -0.7558384 -2.662508 0.0002474635 0.005815937 0.004906671 1486
## GO:0035456 0.8685041 2.641859 0.0001748863 0.005815937 0.004906671 1375
## GO:0045087 0.6596275 2.632016 0.0001370802 0.005815937 0.004906671 2090
## leading_edge
## GO:0051304 tags=34%, list=6%, signal=32%
## GO:0034341 tags=51%, list=9%, signal=46%
## GO:0051983 tags=34%, list=9%, signal=31%
## GO:1905818 tags=34%, list=9%, signal=32%
## GO:0035456 tags=70%, list=8%, signal=65%
## GO:0045087 tags=36%, list=12%, signal=33%
## core_enrichment
## GO:0051304 75430/70099/78658/21335/104806/52276/69716/56150/74335/14211/76044/77045/69928/68014/83558/66442/67141/268465/68298/209334/23834/67052/229841/12236/20877/105988/11799/12704/215387/22137/268697/21973/12235/54392/208628/67629/18817/107995/66977/234396/68612
## GO:0034341 18126/16365/17472/14468/21822/100702/20304/229898/16160/14469/24108/16161/626578/20307/634650/12265/12703/70789/55932/229900/76074/209590/16145/54483/20299/21939/20293/14961/14969/16362/14960/15944/54396/56221/104252/236573/16149/15018/110558/66102/547253/434341/74732/20296/20846/20306/16452/22793/19332/69550/80285/12362/64685/20912/68713/15900/66141/20684/20821/69834/16423/18173
## GO:0051983 75430/70099/78658/58186/21335/52276/69716/56150/74335/14211/76044/77045/68014/66442/67141/214901/68298/209334/23834/67052/229841/12236/20877/17345/11799/12704/215387/108912/22137/268697/12235/54392/208628/67629/18817/107995/66977/73804/68612
## GO:1905818 75430/70099/78658/21335/52276/69716/56150/74335/14211/76044/77045/68014/66442/67141/68298/209334/23834/67052/229841/12236/20877/11799/12704/215387/22137/268697/12235/54392/208628/67629/18817/107995/66977/68612
## GO:0035456 16365/14468/21822/100702/60440/225594/14469/100039796/226695/240328/545384/55932/102639543/623121/100033459/15957/240327/229900/15959/620913/236312/16145/15953/15977/16362/15944/54396/381308/327959/20846/15951/15950/432555/100040462/69550/94088/23960/68713/66141/246730
## GO:0045087 18126/16365/17472/14468/21822/100702/60440/20304/229898/16160/19419/17167/14962/14469/24108/16161/626578/27007/20307/12266/226695/634650/12265/12703/15958/70789/231655/545384/55932/102639543/100033459/58185/245126/384059/15957/434219/12258/229900/56619/76074/100038882/20811/22164/15959/215257/209590/244183/50908/54123/246779/236312/16145/27218/11847/58203/54483/20299/74748/21939/56620/57444/20293/14961/14969/15977/16818/16362/17858/16666/14960/81897/20723/243659/107607/15944/20459/225825/54396/56221/381308/104252/12642/56631/236573/67138/16149/232371/56045/15018/110558/75345/214855/21354/66102/20568/75234/246727/23962/547253/12977/434341/30925/12363/74732/503550/209387/224840/16819/20296/20846/71586/50909/21355/13482/20847/20779/16768/15951/20706/15950/20128/20306/257632/192656/18477/16452/239081/209200/100040462/22793/19332/230979/14129/21929/66107/69550/65221/231712/80750/94088/230073/80287/80285/109032/23960/12362/16859/434218/66222/64685/12479/20912/68713/80861/70110/76681/98999/17174/246728/12010/18854/15900/73670/71956/66141/20684/14191/246730/269799/100340/13058/56417/94094/240354/63953/20821/69834/19698/210808/68279/69146/15162/16423/18173/217069/26362/14528/384309/14991/286940/27056/52118/100034251/319236/11846/70450/16173/224647/93694/19106/56489/269855/78781/142980/225471/12263/19255/211550/27361
gsea1 <- gseaplot2(gsea, title = gsea$Description[10], geneSetID = 1)
gsea1
gsea2 <- gseaplot2(gsea, title = gsea$Description[20], geneSetID = 1)
gsea2
final_gsea <- gseaplot2(gsea, geneSetID = 1:5)
final_gsea
sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] enrichplot_1.18.4 ggupset_0.3.0
## [3] pathview_1.38.0 ggnewscale_0.4.8
## [5] DOSE_3.24.2 org.Mm.eg.db_3.16.0
## [7] AnnotationDbi_1.60.2 clusterProfiler_4.6.2
## [9] ggplotify_0.1.0 cowplot_1.1.1
## [11] biomaRt_2.54.1 EnhancedVolcano_1.16.0
## [13] ggrepel_0.9.3 lubridate_1.9.2
## [15] forcats_1.0.0 stringr_1.5.0
## [17] dplyr_1.1.1 purrr_1.0.1
## [19] readr_2.1.4 tidyr_1.3.0
## [21] tibble_3.2.1 tidyverse_2.0.0
## [23] ggplot2_3.4.2 pheatmap_1.0.12
## [25] RColorBrewer_1.1-3 DESeq2_1.38.3
## [27] SummarizedExperiment_1.28.0 Biobase_2.58.0
## [29] MatrixGenerics_1.10.0 matrixStats_0.63.0
## [31] GenomicRanges_1.50.2 GenomeInfoDb_1.34.9
## [33] IRanges_2.32.0 S4Vectors_0.36.2
## [35] BiocGenerics_0.44.0 knitr_1.42
##
## loaded via a namespace (and not attached):
## [1] shadowtext_0.1.2 fastmatch_1.1-3 BiocFileCache_2.6.1
## [4] systemfonts_1.0.4 plyr_1.8.8 igraph_1.4.2
## [7] lazyeval_0.2.2 splines_4.2.2 BiocParallel_1.32.6
## [10] digest_0.6.31 yulab.utils_0.0.6 htmltools_0.5.5
## [13] GOSemSim_2.24.0 viridis_0.6.2 GO.db_3.16.0
## [16] fansi_1.0.4 magrittr_2.0.3 memoise_2.0.1
## [19] tzdb_0.3.0 Biostrings_2.66.0 annotate_1.76.0
## [22] graphlayouts_0.8.4 timechange_0.2.0 prettyunits_1.1.1
## [25] colorspace_2.1-0 blob_1.2.4 rappdirs_0.3.3
## [28] textshaping_0.3.6 xfun_0.38 crayon_1.5.2
## [31] RCurl_1.98-1.12 jsonlite_1.8.4 graph_1.76.0
## [34] scatterpie_0.1.8 ape_5.7-1 glue_1.6.2
## [37] polyclip_1.10-4 gtable_0.3.3 zlibbioc_1.44.0
## [40] XVector_0.38.0 DelayedArray_0.24.0 Rgraphviz_2.42.0
## [43] scales_1.2.1 DBI_1.1.3 Rcpp_1.0.10
## [46] viridisLite_0.4.1 xtable_1.8-4 progress_1.2.2
## [49] tidytree_0.4.2 gridGraphics_0.5-1 bit_4.0.5
## [52] httr_1.4.5 fgsea_1.24.0 pkgconfig_2.0.3
## [55] XML_3.99-0.14 farver_2.1.1 sass_0.4.5
## [58] dbplyr_2.3.2 locfit_1.5-9.7 utf8_1.2.3
## [61] tidyselect_1.2.0 labeling_0.4.2 rlang_1.1.0
## [64] reshape2_1.4.4 munsell_0.5.0 tools_4.2.2
## [67] cachem_1.0.7 downloader_0.4 cli_3.6.1
## [70] generics_0.1.3 RSQLite_2.3.1 gson_0.1.0
## [73] evaluate_0.20 fastmap_1.1.1 yaml_2.3.7
## [76] ragg_1.2.5 ggtree_3.6.2 org.Hs.eg.db_3.16.0
## [79] bit64_4.0.5 tidygraph_1.2.3 KEGGREST_1.38.0
## [82] ggraph_2.1.0 nlme_3.1-160 KEGGgraph_1.58.3
## [85] aplot_0.1.10 xml2_1.3.3 compiler_4.2.2
## [88] rstudioapi_0.14 filelock_1.0.2 curl_5.0.0
## [91] png_0.1-8 treeio_1.22.0 tweenr_2.0.2
## [94] geneplotter_1.76.0 bslib_0.4.2 stringi_1.7.12
## [97] highr_0.10 lattice_0.20-45 Matrix_1.5-4
## [100] vctrs_0.6.1 pillar_1.9.0 lifecycle_1.0.3
## [103] jquerylib_0.1.4 data.table_1.14.8 bitops_1.0-7
## [106] patchwork_1.1.2 qvalue_2.30.0 R6_2.5.1
## [109] gridExtra_2.3 codetools_0.2-18 MASS_7.3-58.1
## [112] withr_2.5.0 GenomeInfoDbData_1.2.9 parallel_4.2.2
## [115] hms_1.1.3 grid_4.2.2 ggfun_0.0.9
## [118] HDO.db_0.99.1 rmarkdown_2.21 ggforce_0.4.1