Please note that There are total three .Rmd scripts for my Final Project (Part I, Part II, and Part III).

GSE145523

Original article

Castoldi A, Monteiro LB, van Teijlingen Bakker N, et al. Triacylglycerol synthesis enhances macrophage inflammatory function. Nat Commun. 2020;11(1):4107. doi:10.1038/s41467-020-17881-3

Introduction

Toll-like receptor (TLR) agonists activate naive/inactive macrophages (M0) to become Inflammatory macrophages(M1). M1s increase the synthesis of triacylglycerol (TG) which is needed for lipid droplet (LD) formation. This LD production is critical for the macrophage immune effector functions, as LD helps prevent from ER stress, toxic lipid build-up, and provide lipid donors for various immuno-cellular activities, including chemokine and cytokine productions. Macrophages with large LDs are typically observed in patients with atherosclerosis, tuberculosis, specific types of cancers, and obesity. However, detailed explanation for TG synthesis process and its functional significance in activated macrophages is unclear. Castoldi et al (2020) hypothesize that TG synthesis is essential for proper inflammatory activation in LPS+IFN-g stimulated macrophages. Therefore, the authors treated LPS+IFN-g stimulated-M1 with selective DGAT1 inhibitor (T863) to block the TG synthesis from diacylglycerol (DG). The authors found that T863 treatment caused attenuation of: 1) proinflammatory cytokines (Il-1b, IL-6), 2) overall intracellular TG contents, 3) Lipid droplet presence, 4) macrophage phagocytosis function, 5) sigaling molecule/lipid mediator molecule prostaglandin E2 (PGE2) formation.

PGE2 is a type of lipid mediators that can elicit either pro- or anti-inflammatory signaling, depending on the type of Prostaglandin receptors (EPR1-4). PGE2 is produced from arachidonic acid (AA) by cyclooxygenase1/2 (COX1/2) as the rate-limiting enzyme. Many studies found that inflammatory stimuli, such as LPS and IFNg, co-induce Ptgs-2 (codes for COX2) and Nos2 (codes for iNOS; inducible nitric oxide synthase) in innate immune cells such as Macrophages and Dendritic cells. In addition, Macrophages upregulate EPR2 and EPR4 which promote pro-inflammatory signaling cascade. I am very interested to know whether this DGAT1 inhibition-driven PGE2 drop affects Nos2 expression & downstream NO cellular effects.

Thus, my hypotheses for the RNAseq analysis are as follows: 1. IFNg+LPS activated M1 show significantly increased expression of Nos2 and other pro-inflammatory genes, comparing to inactive/naive M0. 2. T863 treatment in M1 alters the gene expression pattern in comparison to M1 without drug (based on the Authors’ claim that TG synthesis is critical for proper M1 inflmmatory activation).

For the analysis .Rmd script for hypothesis [1], please refer to a different .Rmd file located in /R_Project_Soyeon_Gullickson/M0_VS_M1.

Goal of this .Rmd script

In this script, I will be analyzing/comparing the data sets between unstimulated macrophages (M0) and stimulated macrophages (M1: including T863 treated group!) to test my hypothesis[2]: “T863 treatment in M1 alters the gene expression pattern in comparison to M1 without drug (based on the Authors’ claim that TG synthesis is critical for proper M1 inflmmatory activation)”.

Analysis process plan:

A. DESeq2 analysis B. Volcano Plot C. Heatmap D. Enrichment Analysis Note: I am most interested in Section B and C output files in this Part II script. –> These output files will be used in my final project presentation!

A. DESeq2 Analysis

Install the required R libraries

# Install CRAN packages
#install.packages(c("knitr", "RColorBrewer", "pheatmap"))

Code chunk options

The knitr package provides a lot of options for customizing nearly all components of code chunk, including the text outputs, source codes, and plots. Under the text outputs, you can find echo. When echo = TRUE this means to display the source code in the output documents. When warning = FALSE this means that all warnings will be suppressed.

Load the required libraries for DESeq2

library(knitr)
library(DESeq2) 
library(RColorBrewer)
library(pheatmap)
library(ggplot2)
library(tidyverse)

DESeq2 Introduction

As a reminder, we have previously assigned all the sequencing reads to a given gene by directly counting reads overlapping with gene loci using HTSeq-count. Now, the next step is to analyze the counts data for the detection of differentially expressed genes (DEGs). The package DESeq2 provides methods to test for differential expression by use of negative binomial generalized linear models.

DESeq2 Overview

To use DESeq2 it requires the following: 1. counts matrix
2. table of sample information
3. design indicates how to model the samples

htseq-count input

First, we will input our annotation file. This file was created in Microsoft Excel, you could use nano or vim, or create a data frame. But essentially this annotation file should look similar to below:

sampleTable <- read.csv("SG_GSE145523_M0_VS_M1andM1T368_anno.csv", header = T)
head(sampleTable)

##   sample               filename condition replicate treatment
## 1   MO_1 SRR11111648.ENSG.count        M0         1      none
## 2   MO_2 SRR11111649.ENSG.count        M0         2      none
## 3   MO_3 SRR11111650.ENSG.count        M0         3      none
## 4   M1_1 SRR11111651.ENSG.count        M1         1  IFNg+LPS
## 5   M1_2 SRR11111652.ENSG.count        M1         2  IFNg+LPS
## 6   M1_3 SRR11111653.ENSG.count        M1         3  IFNg+LPS

tail(sampleTable)

##      sample               filename condition replicate     treatment
## 4      M1_1 SRR11111651.ENSG.count        M1         1      IFNg+LPS
## 5      M1_2 SRR11111652.ENSG.count        M1         2      IFNg+LPS
## 6      M1_3 SRR11111653.ENSG.count        M1         3      IFNg+LPS
## 7 M1_T368_1 SRR11111657.ENSG.count        M1         1 IFNg+LPS+T863
## 8 M1_T368_2 SRR11111658.ENSG.count        M1         2 IFNg+LPS+T863
## 9 M1_T368_3 SRR11111659.ENSG.count        M1         3 IFNg+LPS+T863

#Convert to data frame 
sampleTable <- data.frame(sampleTable)
head(sampleTable)

##   sample               filename condition replicate treatment
## 1   MO_1 SRR11111648.ENSG.count        M0         1      none
## 2   MO_2 SRR11111649.ENSG.count        M0         2      none
## 3   MO_3 SRR11111650.ENSG.count        M0         3      none
## 4   M1_1 SRR11111651.ENSG.count        M1         1  IFNg+LPS
## 5   M1_2 SRR11111652.ENSG.count        M1         2  IFNg+LPS
## 6   M1_3 SRR11111653.ENSG.count        M1         3  IFNg+LPS

str(sampleTable)

## 'data.frame':    9 obs. of  5 variables:
##  $ sample   : chr  "MO_1" "MO_2" "MO_3" "M1_1" ...
##  $ filename : chr  "SRR11111648.ENSG.count" "SRR11111649.ENSG.count" "SRR11111650.ENSG.count" "SRR11111651.ENSG.count" ...
##  $ condition: chr  "M0" "M0" "M0" "M1" ...
##  $ replicate: int  1 2 3 1 2 3 1 2 3
##  $ treatment: chr  "none" "none" "none" "IFNg+LPS" ...

Next, we want to specify the “condition” and “treatment” column as factors. By doing so, this will allow us to categorize the data and store it as levels.

#specify as factor 
condition <- factor(sampleTable$condition)
replicate <- factor(sampleTable$replicate)
treatment <- factor(sampleTable$treatment)

Running DESeq2 requires DESeqDataSet

dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                  directory = ".",
                                  design = ~replicate + condition)

dds

## class: DESeqDataSet 
## dim: 57010 9 
## metadata(1): version
## assays(1): counts
## rownames(57010): ENSMUSG00000000001.5 ENSMUSG00000000003.16 ...
##   ENSMUSG00002076991.1 ENSMUSG00002076992.1
## rowData names(0):
## colnames(9): MO_1 MO_2 ... M1_T368_2 M1_T368_3
## colData names(3): condition replicate treatment

The DESeqDataSet object must have an associated design formula. The design formula expresses the variables which will be used in modeling. The formula should be a tilde (~) followed by the variables with plus signs between them if required.

Note: The design can be changed later, however then all differential analysis steps should be repeated, as the design formula is used to estimate the log2 fold changes of the model.

Note: In order to benefit from the default settings of the package, one should put the variable of interest at the end of the formula and make sure the control level is the first level.

In other words, since we want to measure the effect of the condition between the two treatments this should go at the end and if we wanted to control for potential batch differences due to how these replicates were made we would add this as the first variable(~batch+condition).

Pre-filtering

While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are very few reads, we reduce the memory size of the dds data object, and we increase the speed of the transformation and testing functions within DESeq2. It can also improve visualizations, as features with no information for differential expression are not plotted.

Below we perform a minimal pre-filtering to keep only rows that have at least 10 reads total.

keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

Note on factor levels

By default, R will choose a reference level for factors based on alphabetical order. Therefore, if you never tell the DESeq2 functions which level you want to compare against (e.g. which level represents the control group), the comparisons will be based on the alphabetical order of the levels.

use relevel to specify the reference level:

dds$condition <- relevel(dds$condition, ref = "M0")

Now let’s run the Differential expression analysis

The standard differential expression analysis steps are wrapped into a single function, DESeq. The estimation steps performed by this function are described in the manual page for ?DESeq.

Results table will be generated using the function results, which extracts a results table with log2 fold changes, p values and adjusted p values.

Using the results function, the user can specify the comparison to build a results table for, using the contrast argument.

Details about the comparison are printed to the console, directly above the results table. The text, condition KO vs WT, tells you that the estimates are of the logarithmic fold change log2(KO/WT) i.e. log2(treated/untreated).

dds <- DESeq(dds)
res <- results(dds, contrast=c('condition', 'M1', 'M0')) 
head(res)

## log2 fold change (MLE): condition M1 vs M0 
## Wald test p-value: condition M1 vs M0 
## DataFrame with 6 rows and 6 columns
##                          baseMean log2FoldChange     lfcSE      stat
##                         <numeric>      <numeric> <numeric> <numeric>
## ENSMUSG00000000001.5   1925.70090       0.254874 0.0680719   3.74418
## ENSMUSG00000000028.16    90.53512      -5.106565 0.7575109  -6.74124
## ENSMUSG00000000037.18     2.75172      -4.716234 1.3373231  -3.52662
## ENSMUSG00000000056.8    475.90870       0.606575 0.1086542   5.58262
## ENSMUSG00000000058.7    544.94952      -1.882684 0.2592969  -7.26073
## ENSMUSG00000000078.8  11894.37397       2.277699 0.0614485  37.06679
##                             pvalue         padj
##                          <numeric>    <numeric>
## ENSMUSG00000000001.5   1.80983e-04  5.15166e-04
## ENSMUSG00000000028.16  1.57037e-11  8.36804e-11
## ENSMUSG00000000037.18  4.20896e-04  1.13702e-03
## ENSMUSG00000000056.8   2.36920e-08  1.00226e-07
## ENSMUSG00000000058.7   3.85015e-13  2.27624e-12
## ENSMUSG00000000078.8  9.63341e-301 2.31215e-298

dim(res)

## [1] 17281     6

_baseMean - Mean of the normalized counts for all samples in the comparison, for a given gene
_log2FoldChange - log2 fold change between ‘untreated’ and ‘treated’ conditions
_pvalue - Wald test P value
_padj - Benjamini-Hochberg adjusted Wald test P value (P-value after multiple test correction)

We can now order our results table by the smallest padj value:

result <- res[order(res$padj), ]
head(result)

## log2 fold change (MLE): condition M1 vs M0 
## Wald test p-value: condition M1 vs M0 
## DataFrame with 6 rows and 6 columns
##                        baseMean log2FoldChange     lfcSE      stat    pvalue
##                       <numeric>      <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000000386.19  15235.74        8.11971 0.1783980   45.5146         0
## ENSMUSG00000006818.6    7758.10        4.15912 0.0884191   47.0387         0
## ENSMUSG00000015837.16  21140.90        2.79545 0.0601563   46.4698         0
## ENSMUSG00000016496.8   24244.10        8.13432 0.1392793   58.4030         0
## ENSMUSG00000017652.17   5173.87        6.05825 0.1369732   44.2295         0
## ENSMUSG00000018899.18  15003.97        5.53825 0.0760465   72.8272         0
##                            padj
##                       <numeric>
## ENSMUSG00000000386.19         0
## ENSMUSG00000006818.6          0
## ENSMUSG00000015837.16         0
## ENSMUSG00000016496.8          0
## ENSMUSG00000017652.17         0
## ENSMUSG00000018899.18         0

table(result$padj<0.05)

## 
## FALSE  TRUE 
##  8415  8866

Now, we can take this output and merge with the normalized counts data. This output will contain 19,000+ rows still we simply just ordered it by padj value. We will label this file with “unfiltered_matrix”

resdata <- merge(as.data.frame(result), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene_id"
head(resdata)

##                 Gene_id  baseMean log2FoldChange      lfcSE     stat pvalue
## 1 ENSMUSG00000000386.19 15235.742       8.119711 0.17839804 45.51457      0
## 2  ENSMUSG00000006818.6  7758.105       4.159120 0.08841911 47.03870      0
## 3 ENSMUSG00000015837.16 21140.896       2.795451 0.06015630 46.46980      0
## 4  ENSMUSG00000016496.8 24244.104       8.134324 0.13927930 58.40296      0
## 5 ENSMUSG00000017652.17  5173.867       6.058249 0.13697318 44.22945      0
## 6 ENSMUSG00000018899.18 15003.974       5.538247 0.07604645 72.82716      0
##   padj       MO_1       MO_2       MO_3      M1_1      M1_2      M1_3 M1_T368_1
## 1    0   87.45999   86.78768   71.80221 26227.415 27253.492 26498.443 17434.796
## 2    0  588.58211  695.03078  613.86945 12330.056 11839.744 11089.540 11696.676
## 3    0 4258.75008 4119.13310 4398.87168 30474.816 28893.605 28054.684 30529.450
## 4    0  121.34089  107.93762  160.17416 32976.318 36028.969 32225.833 35137.266
## 5    0  114.24954  121.79448  110.46494  9026.013  8623.723  8081.866  6895.416
## 6    0  494.03077  508.32786  434.75844 22535.267 22614.290 22597.254 21142.867
##   M1_T368_2 M1_T368_3
## 1 19974.740 19486.738
## 2 10580.185 10389.257
## 3 30429.633 29109.123
## 4 43332.355 38106.745
## 5  7008.213  6583.063
## 6 22140.661 22568.308

#write results
write.csv(resdata, file="M0_VS_M1andM1T368_unfiltered_matrix_Final.csv")

The next code chunk makes use of the tidyverse package. + dplyr lets you manipulate tabular data. + tidyr provides additional functions that let you reshape data to get it into a tidy format. + ggplot2 offers a grammar of graphics approach to data visualization. + stringr provides fast and consistent functions for dealing with string data. + broom takes many different kinds of R objects and turns them into tidy data frames. + magrittr provide the pipe function (%>%) much used by other R packages and in this book.

We are using this package as a way of getting a list of differentially expressed genes (DEGs) only. Now, we are filtering the 19,000+ rows to only rows which meet the criteria of having an absolute log2FC of 1 and padj of less than 0.05. We will label this file with “log2fc1_fdr05_DEGlist”

#Get DEG up/down lists
resdata <- as.data.frame(result) %>% dplyr::filter(abs(log2FoldChange) > 1 & padj < 0.05) %>% tibble::rownames_to_column("Gene_id")

#write results
write.csv(resdata, "M0_VS_M1andM1T368_log2fc1_fdr05_DEGlist_Final.csv", row.names = TRUE)

head(resdata)

##                 Gene_id  baseMean log2FoldChange      lfcSE     stat pvalue
## 1 ENSMUSG00000000386.19 15235.742       8.119711 0.17839804 45.51457      0
## 2  ENSMUSG00000006818.6  7758.105       4.159120 0.08841911 47.03870      0
## 3 ENSMUSG00000015837.16 21140.896       2.795451 0.06015630 46.46980      0
## 4  ENSMUSG00000016496.8 24244.104       8.134324 0.13927930 58.40296      0
## 5 ENSMUSG00000017652.17  5173.867       6.058249 0.13697318 44.22945      0
## 6 ENSMUSG00000018899.18 15003.974       5.538247 0.07604645 72.82716      0
##   padj
## 1    0
## 2    0
## 3    0
## 4    0
## 5    0
## 6    0

Data transformation and visualization

Principal components analysis (PCA)

In order to test for differential analysis, we used raw counts. These matrices are so large that we need convenient ways to extract the important information from these large data matrices - it is useful to transform the data.

In using principal component analysis (PCA): + The data is reduced to smaller matrices so they can more easily be examined, plotted and interpreted. + The most important factors are extracted (principal components).

Below we are using either variance stabilizing transformation (vst) (Tibshirani 1988; Huber et al. 2003; Anders and Huber 2010) or regularized logarithm (rlog) (Love, Huber, and Anders 2014). Both transformations will produce transformed data on the log2 scale which has been normalized with respect to library size or other normalization factors.

While using either function, one can specify the argument blind for whether the transformation should be blind to the sample information specified by the design formula. A blind dispersion estimate is not appropriate if one expects that many or the majority of genes (rows) will have large differences in counts which are explained by the experimental design. These differences due to experimental design will be interpreted as unwanted noise, and will result in overly shrinking the transformed values towards each other. By setting blind to FALSE, the dispersions already estimated will be used to perform transformations, or if not present, they will be estimated using the current design formula.

Now using the transformed data we will generate a PCA plot. The PCA plot will show the samples in a 2D plane spanned by the first two principal components. This type of plot is useful for visualizing the overall effect of experimental covariates (ex. age) and batch effects (ex. day sample was prepared).

vstcounts <- vst(dds, blind = FALSE)
plotPCA(vstcounts, intgroup=c("condition"))

It is also possible to customize the PCA plot using the ggplot function. Below we are adding shapes to represent the different days the samples were created.

Now, we are making sure the x and y axis are similar using ylim argument

pcaData <- plotPCA(vstcounts, intgroup=c("condition", "treatment"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, shape=condition, color=treatment)) +
  geom_point(size=4) +
  ylim(-10,15) + 
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed()

ggsave("M0_VS_M1andM1T368_DESeq2_PCA_Final.png")

Now that we got it looking the way we want it, we can take some time to interpret the plot. Each data point represents one sample. The x-axis typically PC1, explains the largest source of variation in the dataset. We can see that PC1 is primarily explained by stimulating macrophages (M1) by IFNg+LPS. The next largest source of variance within the dataset, or PC2, is explained by one sample created by DGAT1 inhibition (T863 drug treatment) in the M1 group. I expected to see higher variance % in PC2.

Finally, we can generate a correlation heatmap. This is a type of plot used to visualize the strength of relationships between samples.

library(pheatmap)
rld <- rlog(dds, blind=FALSE)
rld_mat <- assay(rld)
rld_cor <- cor(rld_mat)
pheatmap(rld_cor)

Using some arguments, we can make the plot look pretty

pheatmap(rld_cor, clustering_distance_rows = "euclidean", 
         fontsize = 10,  cellwidth = 35, show_rownames = F, color =
           colorRampPalette(brewer.pal(10,"PiYG"))(400))

ggsave("M0_VS_M1amdT368_DESeq_heatmap_Final.png")

According to this correlation heatmap, M1(including M1+T863) and M0 are highly distinguishable, but M1 and M1+T863 appeared to be highly correlated.

B. Volcano plot

Volcano Plot Introduction

Previously, I generated plots (PCA/correlation heatmap) to visualize the quality of the samples. I will continue to visualize the RNA-Seq dataset of Macrophages that are stimulated with LPS+IFNg (with and without DGAT1 inhibitor, T863). In the data set, I will use untreated group (M0) as my WT (reference) and all stimulated groups (M1) as testing group. The goal of this volcano plot is to visualize significantly differentially expressed genes between reference and testing groups.

A common plot used to represent a global view of the data is a volcano plot. Here, log transformed adjusted p-values are plotted on the y-axis while the log2 fold change is plotted on the x-axis.

We will be generating volcano plots using:

ggplots2: an data visualization package that is offered via the tidyverse collection
EnhancedVolcano: a package created by Kevin Blighe, Sharmila Rana, and Myles Lewis to generate publication-ready volcano plots.

Load the required libraries for Volcano Plot

Remember to first check if you require a package for this tutorial using library(). Once the package is installed there is no need to reinstall it over and over again.

#BiocManager::install('EnhancedVolcano') #if you need to install
#BiocManager::install("biomaRt") #if you need to install

library(EnhancedVolcano)
library(biomaRt)
library(dplyr)
library(tidyr)
library(ggplot2)

Volcano plot w/ EnhancedVolcano

Data input

resdata <- read.csv("M0_VS_M1andM1T368_unfiltered_matrix_Final.csv", header = T)
resdata <- resdata[, c(2:17)]
head(resdata)

##                 Gene_id  baseMean log2FoldChange      lfcSE     stat pvalue
## 1 ENSMUSG00000000386.19 15235.742       8.119711 0.17839804 45.51457      0
## 2  ENSMUSG00000006818.6  7758.105       4.159120 0.08841911 47.03870      0
## 3 ENSMUSG00000015837.16 21140.896       2.795451 0.06015630 46.46980      0
## 4  ENSMUSG00000016496.8 24244.104       8.134324 0.13927930 58.40296      0
## 5 ENSMUSG00000017652.17  5173.867       6.058249 0.13697318 44.22945      0
## 6 ENSMUSG00000018899.18 15003.974       5.538247 0.07604645 72.82716      0
##   padj       MO_1       MO_2       MO_3      M1_1      M1_2      M1_3 M1_T368_1
## 1    0   87.45999   86.78768   71.80221 26227.415 27253.492 26498.443 17434.796
## 2    0  588.58211  695.03078  613.86945 12330.056 11839.744 11089.540 11696.676
## 3    0 4258.75008 4119.13310 4398.87168 30474.816 28893.605 28054.684 30529.450
## 4    0  121.34089  107.93762  160.17416 32976.318 36028.969 32225.833 35137.266
## 5    0  114.24954  121.79448  110.46494  9026.013  8623.723  8081.866  6895.416
## 6    0  494.03077  508.32786  434.75844 22535.267 22614.290 22597.254 21142.867
##   M1_T368_2 M1_T368_3
## 1 19974.740 19486.738
## 2 10580.185 10389.257
## 3 30429.633 29109.123
## 4 43332.355 38106.745
## 5  7008.213  6583.063
## 6 22140.661 22568.308

Notice, that under the Gene_id column are Ensembl gene ids with versions included. This was due to the GTF file used to count reads. This is a tricky part with downloading reference files from public repositories as they may differ slightly depending on which repository was used.

In the GENCODE_GTF the gene_id contains information with versions included (i.e. ENSMUSG00000015143.16).
In the ENSEMBL_GTF the gene_id does not contain version. (i.e. ENSMUSG00000015143).

Either way moving forward, we would like to create a volcano plot with gene symbols not ENSEMBL gene ids, so this is a good opportunity for us to use biomaRt to perform some id conversion!

BioMart id conversion

The biomaRt program provides an interface to a growing collection of databases implementing the BioMart software suite. This package enables retrival of large amounts of data in a uniform way. The most prominent examples of BioMart databases are maintained by Ensembl, which provides users access to a diverse set of data and enables users to perform gene annotation and data mining.

To begin, a first step in using the biomaRt package is to select a BioMart database and dataset to use. To know which BioMart databases are available we will use listEnsembl(). Please note we have already loaded the biomaRt package library above!

listEnsembl()

##         biomart                version
## 1         genes      Ensembl Genes 109
## 2 mouse_strains      Mouse strains 109
## 3          snps  Ensembl Variation 109
## 4    regulation Ensembl Regulation 109

The useEnsembl function enables one to connect to a specified BioMart database and dataset hosted by Ensembl without having to specify the Ensembl URL. We will select genes.

The mirror specifies an Ensembl mirror to connect to. The valid options include www, uswest, useast, and asia.

ensembl <- useEnsembl(biomart = "genes", mirror="useast")

Next, we will be selecting the dataset we would like to use. We can view all the species datasets available by using:

#View(listDatasets(ensembl))

This dataset was created using Mouse bone marrow-derived macrophages (BMDMs). Therefore, we will be selecting Mus musculus.

ensembl <- useEnsembl(biomart = "genes", dataset = "mmusculus_gene_ensembl")

Overall our objective is to convert from one gene id to another.

There are many Gene ID types, below are some of the most common: + Entrez Gene - 1457, 2002, 1950 + Gene Symbol - Ikzf1, Tcf1, Cd19 + Ensembl ID - ENSMUSG00000018654

First, we will need to find the “type” of gene id of our query and target, respectively. To do this we can use Filters, which are used as inputs for a biomaRt query. Then we will need to use Attributes, which are the outputs or the information we want to retrieve based on our biomaRt target.

#View(listFilters(ensembl)) #ensembl_gene_id_version
#View(listAttributes(ensembl)) #external_gene_name

From this we selected that our gene id is in the form of ensembl_gene_id_version and we would like external_gene_name.

Now we will be using the getBM function. Given a set of filters, it will retrieve the user specified attribute fromt he BioMart database.

biomart.output <- getBM(mart = ensembl, values = resdata$Gene_id, 
                      filters = "ensembl_gene_id_version",
                      attributes = c("external_gene_name",
                                     "ensembl_gene_id_version"))
                
head(biomart.output)

##   external_gene_name ensembl_gene_id_version
## 1              Cdc45   ENSMUSG00000000028.16
## 2               Narf    ENSMUSG00000000056.8
## 3               Cav2    ENSMUSG00000000058.7
## 4               Klf6    ENSMUSG00000000078.8
## 5              Cox5a    ENSMUSG00000000088.8
## 6                Fer   ENSMUSG00000000127.16

Let’s look at the structure of resdata:

str(resdata)

## 'data.frame':    17281 obs. of  16 variables:
##  $ Gene_id       : chr  "ENSMUSG00000000386.19" "ENSMUSG00000006818.6" "ENSMUSG00000015837.16" "ENSMUSG00000016496.8" ...
##  $ baseMean      : num  15236 7758 21141 24244 5174 ...
##  $ log2FoldChange: num  8.12 4.16 2.8 8.13 6.06 ...
##  $ lfcSE         : num  0.1784 0.0884 0.0602 0.1393 0.137 ...
##  $ stat          : num  45.5 47 46.5 58.4 44.2 ...
##  $ pvalue        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ padj          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ MO_1          : num  87.5 588.6 4258.8 121.3 114.2 ...
##  $ MO_2          : num  86.8 695 4119.1 107.9 121.8 ...
##  $ MO_3          : num  71.8 613.9 4398.9 160.2 110.5 ...
##  $ M1_1          : num  26227 12330 30475 32976 9026 ...
##  $ M1_2          : num  27253 11840 28894 36029 8624 ...
##  $ M1_3          : num  26498 11090 28055 32226 8082 ...
##  $ M1_T368_1     : num  17435 11697 30529 35137 6895 ...
##  $ M1_T368_2     : num  19975 10580 30430 43332 7008 ...
##  $ M1_T368_3     : num  19487 10389 29109 38107 6583 ...

We would like to add the gene names to this data.frame. To do this we will use dplyr::left_join.

resdata_gene <- dplyr::left_join(resdata, biomart.output,
                        by=c('Gene_id'='ensembl_gene_id_version'))
resdata_gene <- resdata_gene[, c(1:17)]
str(resdata_gene)

## 'data.frame':    17281 obs. of  17 variables:
##  $ Gene_id           : chr  "ENSMUSG00000000386.19" "ENSMUSG00000006818.6" "ENSMUSG00000015837.16" "ENSMUSG00000016496.8" ...
##  $ baseMean          : num  15236 7758 21141 24244 5174 ...
##  $ log2FoldChange    : num  8.12 4.16 2.8 8.13 6.06 ...
##  $ lfcSE             : num  0.1784 0.0884 0.0602 0.1393 0.137 ...
##  $ stat              : num  45.5 47 46.5 58.4 44.2 ...
##  $ pvalue            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ padj              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ MO_1              : num  87.5 588.6 4258.8 121.3 114.2 ...
##  $ MO_2              : num  86.8 695 4119.1 107.9 121.8 ...
##  $ MO_3              : num  71.8 613.9 4398.9 160.2 110.5 ...
##  $ M1_1              : num  26227 12330 30475 32976 9026 ...
##  $ M1_2              : num  27253 11840 28894 36029 8624 ...
##  $ M1_3              : num  26498 11090 28055 32226 8082 ...
##  $ M1_T368_1         : num  17435 11697 30529 35137 6895 ...
##  $ M1_T368_2         : num  19975 10580 30430 43332 7008 ...
##  $ M1_T368_3         : num  19487 10389 29109 38107 6583 ...
##  $ external_gene_name: chr  "Mx1" "Sod2" "Sqstm1" "Cd274" ...

In its most basic usage, EnhancedVolcano requires: + data-frame (i.e. resdata_gene), + a column of rownames (i.e. lab =), + a column name containing log2 fold changes (i.e. x =log2FoldChange) + a column name containing nominal or adjusted pvalues (i.e. y = padj)

EnhancedVolcano(resdata_gene,
                lab = as.character(resdata_gene$external_gene_name),
                x = 'log2FoldChange',
                y = 'padj')

Now, there are numerous arguments that can be used to customize the volcano plot with EnhancedVolcano. We will begin with the most fundamental and most important of these options which includes visualizing sections of the volcano plot based off p adjusted cutoff of < 0.05 and absolute log2 of 1. For both, vertical and horizontal lines will be drawn to show these cutoffs exactly where we want them. Note, this was done in the plot above, however were you confident where these lines were drawn?

EnhancedVolcano(resdata_gene,
                lab = as.character(resdata_gene$external_gene_name),
                x = 'log2FoldChange',
                y = 'padj', 
                title = "M1 (+/- T863) VS M0",
                FCcutoff = 1,
                pCutoff = 0.05)

Now, lets convert our colors to red and black like the article. Naturally, EnhancedVolcano will color based on four quadrants. These quadrants are shown (in order):
1) NS - shown in black 2) Log2FC - shown in green 3) p-value - shown in blue 4) p-value & Log2FC - shown in red

In addition, let’s move the legend position to the right of the plot.

EnhancedVolcano(resdata_gene,
                lab = as.character(resdata_gene$external_gene_name),
                x = 'log2FoldChange',
                y = 'padj', 
                title = "Stimulated VS Unstimulated",
                subtitle = "",
                FCcutoff = 1,
                pCutoff = 0.05,
                labSize = 4, 
                axisLabSize = 12, 
                col=c('black', 'black', 'black', 'red3'),
                legendLabels=c('Not sig.','Log2FC','padj', 'padj & Log2FC'),
                legendPosition = 'right',
                legendLabSize = 12,
                legendIconSize = 3.0)

There might be too many genes labeled here making the plot look cluttered. Let’s label a few select DE genes. First, lets get a complete of list of our DEGs using:

resdata_sig <- as.data.frame(resdata_gene) %>% dplyr::filter(abs(log2FoldChange) > 1 & padj < 0.05)
head(resdata_sig)

##                 Gene_id  baseMean log2FoldChange      lfcSE     stat pvalue
## 1 ENSMUSG00000000386.19 15235.742       8.119711 0.17839804 45.51457      0
## 2  ENSMUSG00000006818.6  7758.105       4.159120 0.08841911 47.03870      0
## 3 ENSMUSG00000015837.16 21140.896       2.795451 0.06015630 46.46980      0
## 4  ENSMUSG00000016496.8 24244.104       8.134324 0.13927930 58.40296      0
## 5 ENSMUSG00000017652.17  5173.867       6.058249 0.13697318 44.22945      0
## 6 ENSMUSG00000018899.18 15003.974       5.538247 0.07604645 72.82716      0
##   padj       MO_1       MO_2       MO_3      M1_1      M1_2      M1_3 M1_T368_1
## 1    0   87.45999   86.78768   71.80221 26227.415 27253.492 26498.443 17434.796
## 2    0  588.58211  695.03078  613.86945 12330.056 11839.744 11089.540 11696.676
## 3    0 4258.75008 4119.13310 4398.87168 30474.816 28893.605 28054.684 30529.450
## 4    0  121.34089  107.93762  160.17416 32976.318 36028.969 32225.833 35137.266
## 5    0  114.24954  121.79448  110.46494  9026.013  8623.723  8081.866  6895.416
## 6    0  494.03077  508.32786  434.75844 22535.267 22614.290 22597.254 21142.867
##   M1_T368_2 M1_T368_3 external_gene_name
## 1 19974.740 19486.738                Mx1
## 2 10580.185 10389.257               Sod2
## 3 30429.633 29109.123             Sqstm1
## 4 43332.355 38106.745              Cd274
## 5  7008.213  6583.063               Cd40
## 6 22140.661 22568.308               Irf1

write.csv(resdata_sig, file="M1 and M1T863 VS M0_DEG_forHeatmap_Final.csv")

Activated M1s are known to highly produce pro-inflammatory molecules, including Prostaglandine E2 (PGE2), TNF, IL-1b, IL-6, Nitric Oxide (N0). In order to visualize differential expressions of genes that are associated with pro-inflammation, I am selecting ‘Ptgs2’,‘Ptgs’,‘Ptges’ (involved in PGE2 synthesis),‘Acod1’ (production of proinflammatory metabolite called ‘itaconate’),‘Nfkb’ (transcription factor - as a postive control),‘Gbp2’,‘Gbp3’ (used as M1 marker genes - as a positive control),‘Dgat1’,‘Dgat2’ (TG synthesis proteins, Author’s genes of interest),‘Il6’,‘Il1b’,‘Tnf’ (pro-inflammatory cytokines),‘Sod2’ (a build up of nitric oxide is known to induce ROS-response genes such as Sod2 which codes for superoxide dismutase),‘Cpt1a’ (involved in fatty acid oxidation),‘Ccr7’ (involved in Macrophage chemoxasis), and’Fasn’ (fatty acid synthesis).

EnhancedVolcano(resdata_gene,
                lab = as.character(resdata_gene$external_gene_name),
                x = 'log2FoldChange',
                y = 'padj', 
                selectLab = c('Nos2', 'Ptgs2','Ptgs','Ptges','Acod1','Nfkb','Gbp2','Gbp3','Dgat1','Dgat2','Il6','Il1b','Tnf','Sod2','Cpt1a','Ccr7','Fasn'),
                title = "M1 (+/- T863) VS M0",
                subtitle = "", 
                FCcutoff = 1,
                pCutoff = 0.05,
                labSize = 4, 
                axisLabSize = 12, 
                col=c('black', 'black', 'black', 'red3'),
                legendLabels=c('Not sig.','Log2FC','padj', 'padj & Log2FC'),
                legendPosition = 'right',
                legendLabSize = 16,
                legendIconSize = 5.0,
                drawConnectors = TRUE,
                widthConnectors = 1.0,
                colConnectors = 'black')

It looks like a few genes are hard to read. We can fix this by adding a boxed label. This will make the gene name pop out and will be given a white background. I am also choosing to bold the gene name. In addition, lets get rid of the major and minor grid lines as they are taking away from the image aesthetics. 3

EnhancedVolcano(resdata_gene,
                lab = as.character(resdata_gene$external_gene_name),
                x = 'log2FoldChange',
                y = 'padj', 
                selectLab = c('Nos2', 'Ptgs2','Ptgs','Ptges','Acod1','Nfkb','Gbp2','Gbp3','Dgat1','Dgat2','Il6','Il1b','Tnf','Sod2','Cpt1a','Ccr7','Fasn'),
                title = "M1 (+/- T863) VS M0",
                subtitle = "", 
                FCcutoff = 1,
                pCutoff = 0.05,
                labSize = 6, 
                axisLabSize = 15, 
                col=c('black', 'black', 'black', 'indianred2'),
                labCol = 'black',
                labFace = 'plain',
                boxedLabels = TRUE,
                legendLabels=c('Not sig.','Log2FC','padj', 'padj & Log2FC'),
                legendPosition = 'right',
                legendLabSize = 14,
                legendIconSize = 5.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black', 
                gridlines.major = FALSE,
                gridlines.minor = FALSE)

ggsave("M0 VS M1_enhanced Volcano_run2.png")

If you wanted to play around with colors (up versus down regulated) below is a script to help you get started.

keyvals <- ifelse(
  resdata_gene$log2FoldChange < -2, 'purple2',
  ifelse(resdata_gene$log2FoldChange > 2, 'seagreen3',
         'black'))
keyvals[is.na(keyvals)] <- 'black'
names(keyvals)[keyvals == 'purple2'] <- 'Downregulated'
names(keyvals)[keyvals == 'black'] <- 'NA'
names(keyvals)[keyvals == 'seagreen3'] <- 'Upregulated'

EnhancedVolcano(resdata_gene,
  lab = as.character(resdata_gene$external_gene_name),
  x = 'log2FoldChange',
  y = 'padj', 
  selectLab = c('Nos2', 'Ptgs2','Ptgs','Ptges','Acod1','Nfkb','Gbp2','Gbp3','Dgat1','Dgat2','Il6','Il1b','Tnf','Sod2','Cpt1a','Ccr7','Fasn'),
  boxedLabels = TRUE,
                drawConnectors = TRUE,
                widthConnectors = 1.0,
                colConnectors = 'black', 
  xlab = bquote(~Log[1]~ 'fold change'),
  colCustom = keyvals,
  title = "M1 (+/- T863) VS M0",
  subtitle = "", 
  FCcutoff = 2,
  pCutoff = 0.05,
  labSize = 5.5,
  labFace = 'plain',
  axisLabSize = 13,
  legendLabels=c('Not sig.','Log2FC','padj', 'padj & Log2FC'),
  legendPosition = 'right',
  legendLabSize = 14,
  legendIconSize = 4.5,
  gridlines.major = FALSE,
  gridlines.minor = FALSE)

ggsave("M0 VS M1_enhance volcano_run2.png")

For my visualization preference, I would like for my enhanced volcano plot to be sitting sideways. I would like the unregulated genes to be first top half of the screen. So, I will add ‘+coord_flip()’ to the previous r chunk.

  keyvals <- ifelse(
  resdata_gene$log2FoldChange < -2, 'purple2',
  ifelse(resdata_gene$log2FoldChange > 2, 'seagreen3',
         'black'))
keyvals[is.na(keyvals)] <- 'black'
names(keyvals)[keyvals == 'purple2'] <- 'Downregulated'
names(keyvals)[keyvals == 'black'] <- 'NA'
names(keyvals)[keyvals == 'seagreen3'] <- 'Upregulated'

EnhancedVolcano(resdata_gene,
  lab = as.character(resdata_gene$external_gene_name),
  x = 'log2FoldChange',
  y = 'padj', 
  selectLab = c('Nos2', 'Ptgs2','Ptgs','Ptges','Acod1','Nfkb','Gbp2','Gbp3','Dgat1','Dgat2','Il6','Il1b','Tnf','Sod2','Cpt1a','Ccr7','Fasn'),
  boxedLabels = TRUE,
                drawConnectors = TRUE,
                widthConnectors = 0.8,
                colConnectors = 'red2',
  parseLabels = TRUE,
  xlab = bquote(~Log[1]~ 'fold change'),
  colCustom = keyvals,
  title = "M1 (+/- T863) VS M0",
  subtitle = "", 
  FCcutoff = 2,
  pCutoff = 0.05,
  labSize = 5.5,
  labFace = 'plain',
  axisLabSize = 13,
  legendLabels=c('Not sig.','Log2FC','padj', 'padj & Log2FC'),
  legendPosition = 'right',
  legendLabSize = 14,
  legendIconSize = 4.5,
  gridlines.major = FALSE,
  gridlines.minor = FALSE) + coord_flip()

  ggsave("M0 VS M1andM1T863_Volcano_Side.png")

This enhanced volcano plot shown above will be used for my presentation. Regardless of T863 treatment, M1 appears to upregulate the genes that are associated with inflammation. Interestingly, Dgat1 and 2 expression levels were not too different in M1 in comparison to M0. This low level Dgat1 expression difference is different than what the authors found in their data analysis. This data analysis discrepancy may due to sequencing library analyzing method - Authors’ used SnakePipe pipeline program for the analysis. ***

C.Heatmap

Introduction to Heatmaps

We have discovered that ggplot2 has incredible functionality and versatility; however, it is not always the best choice for all graphics. There are a plethora of different packages that specialize in particular types of figures. In this section, we will concentrate on creating hierarchical heatmaps using one of these packages called pheatmap.

pheatmap is a package used for drawing pretty heatmaps in R. This function has several advantages including the ability to control heatmap dimensions and appearance.

Loading packages for Heatmap

#install.packages(pheatmap) #install pheatmap 
library(pheatmap)
library(RColorBrewer)
library(ggplot2)
library(cowplot)

Loading data

We will continue using our counts matrix generated identifying differentially expressed genes between M1 (with and without T863) versus M0.

Load the ..DEG.csv file and the ..anno.csv and place them into the object called heatmap_normDEG and heatmap_meta, respectively.

heatmap_normDEG <- read.csv("M0_VS_M1andM1T863_DEG_Trimmed_final.csv", header = T, row.names = 1)
#heatmap_normDEG

heatmap_meta <- read.csv("SG_GSE145523_M0_VS_M1andM1T368_anno_simple_heatmap.csv", header = T)
heatmap_meta

##      sample condition     treatment
## 1      MO_1        M0          none
## 2      MO_2        M0          none
## 3      MO_3        M0          none
## 4      M1_1        M1      IFNg+LPS
## 5      M1_2        M1      IFNg+LPS
## 6      M1_3        M1      IFNg+LPS
## 7 M1_T368_1        M1 IFNg+LPS+T863
## 8 M1_T368_2        M1 IFNg+LPS+T863
## 9 M1_T368_3        M1 IFNg+LPS+T863

Set up prior to using `pheatmap`

There are few steps that need to be performed prior to using the pheatmap() function.

The row names of our metadata file need to be assigned and match the column names of our counts (numeric) data.

# Assign the sample names to be the row names
rownames(heatmap_meta) <- heatmap_meta$sample

The factor levels specified in the genotype column will determine the color automatically assigned to the group. We can manually specify the levels using the factor() function.

# Re-level factors
heatmap_meta$condition <- factor(heatmap_meta$condition, levels = c("M1", "M0"))

We can provide any customized colors to use in the heatmap to denote the groups using the colorRampPalette() function.

# Colors for heatmap
heatmap_colors <- colorRampPalette(c("blue", "white", "red"))(6)

We can also specify the colors to use to denote the annotations. The colors for the annotations need to be provided as a list:

# Colors for denoting annotations
ann_colors <- list(condition = c(M1 = "#D11313", M0 = "#6DB0F8"))

Now we are ready to draw the heatmap using the pheatmap() function. Go ahead a fill it out!

# BASIC USAGE 
## DO NOT RUN 
#pheatmap(numeric_matrix, 
         #color = ,
         #annotation_col = ,
         #annotation_colors = ,
         #show_colnames = , 
         #show_rownames = )

?pheatmap() #if you have questions

# Plot heatmap  
pheatmap(heatmap_normDEG, 
         color = heatmap_colors,
         annotation_col = heatmap_meta,
         annotation_colors = ann_colors,
         show_colnames = F, 
         show_rownames = F)

***

Z score

pheatmap(heatmap_normDEG, 
         color = heatmap_colors,
         annotation_col = heatmap_meta,
         annotation_colors = ann_colors,
         show_colnames = F, 
         show_rownames = F,
         cellwidth = 20,
         cellheight = 0.1,
         scale = "row")

NOTE: Upon addition of scale="row" Z-scores are plotted, rather than the actual normalized count value. Z-scores are computed on a gene-by-gene basis by subtracting the mean and then dividing by the standard deviation. The Z-scores are computed after the clustering, so that it only affects the graphical aesthetics and the color visualization is improved.

Adding annotation

Now we are only missing the text annotations that can be added using the cowplot package. However, the pheatmap image is stored as a list rather than a graphics object. The ggplotify package allows us to manipulate this list object into a ggplot object with the as.ggplot() function, facilitating further modifications with ggplot2 or cowplot.

# Load library
#install.packages("ggplotify")
library(ggplotify)
  
# Turn into a ggplot object
heatmap <- as.ggplot(pheatmap(heatmap_normDEG, 
                              color = heatmap_colors,
                              cluster_rows = T, 
                              annotation_col = heatmap_meta,
                              annotation_colors = ann_colors,
                              border_color="white", 
                              cluster_cols = T, 
                              show_colnames = F, 
                              show_rownames = F,  
                              scale="row",
                              treeheight_col = 10,
                              treeheight_row = 25,
                              cellwidth = 25,
                              cellheight = 0.1,
                              fontsize = 9,
                              annotation_names_col = F))

To add the annotations to the left sides of the plot, we will use the theme() function from ggplot2 to add whitespace. With the plot.margin argument we can specify the amount of space to add to the top, right, bottom and left margins (in that order).

# Adding white space to add annotations to the top, right, bottom, and left margins
heatmap <- heatmap + 
  theme(plot.margin = unit(c(1,0.5,0.5,0.5), "cm"))

Now we can use cowplot to add in our desired annotations to match the published figure below.

# Adding in annotations with cowplot 
heatmap_figure <- ggdraw(heatmap) +  
  draw_label("Macrophages", 
             x = 0.31, 
             y = 0.89,
             size = 17,
             hjust = 0,
             vjust = 0,
             fontface = "bold") +
  draw_label("M0", 
             x = 0.25, 
             y = 0.85,
             size = 13,
             hjust = 0,
             vjust = 0,
             fontface = "bold") +
  draw_label("M1", 
             x = 0.38, 
             y = 0.85,
             size = 13,
             hjust = 0,
             vjust = 0,
             fontface = "bold") +
  draw_label("M1+T863", 
             x = 0.48, 
             y = 0.85,
             size = 13,
             hjust = 0,
             vjust = 0,
             fontface = "bold") +
  draw_label("Down-regulated (2443)",
             angle = 90,
             size = 14,
             x = 0.1,
             y = 0.55,
             hjust = 0,
             vjust = 0) +
  draw_label("Up-regulated (2301)",
             angle = 90,
             size = 14,
             x = 0.1,
             y = 0.25,
             hjust = 0,
             vjust = 0) 
ggsave("M0 VS M1 and M1T368_heatmap_Final.png")
heatmap_figure

As it was reflected by the correlation heatmap (Part A. DESeq2), both M1 (M1 with and without T863) showed distinguishable gene epression pattern than M0. However, according to this annotated heatmap, not many genes appeared to be significantly different between M1 and M1+T863. Further narrow down analysis between M1 and M1+T863 will be done using ‘Gullickson_M1_VS_M1T863.Rmd’ located in /R_project_Soyeon_Gullickson_M1_VS_M1T863/.

D. Functional Analysis

Install and load required ackages for Functional Analysis

library(biomaRt)
library(clusterProfiler) #BiocManager::install("clusterProfiler")
library(org.Mm.eg.db) #BiocManager::install("org.Mm.eg.db") ## this is mouse! not a human or bear!
library(DOSE) #BiocManager::install("DOSE")
library(ggnewscale) #install.packages("ggnewscale")
library(pathview) #BiocManager::install("pathview")
library(ggupset) #install.packages("ggupset")
library(enrichplot) #BiocManager::install("enrichplot")

Functional analysis

The goal of functional analysis is provide biological insight, so it’s necessary to analyze our results in the context of our experimental hypothesis: Stimulation of macrophages using LPS+IFNg enhances macrophages pro-inflammatory immune effector functions. Therefore, we may expect an enrichment of processes/pathways related to myeloid/innate immune responses.

GO Ontologies

The Gene Ontology (GO) knowledgebase is the world’s largest source of information on the function of genes. GO terms are organized into three independent controlled vocabularies (ontologies) in a species-independent manner:

Biological process: refers to the biological role involving the gene or gene product, and could include “transcription”, “signal transduction”, and “apoptosis”. A biological process generally involves a chemical or physical change of the starting material or input.
Molecular function: represents the biochemical activity of the gene product, such activities could include “ligand”, “GTPase”, and “transporter”.
Cellular component: refers to the location in the cell of the gene product. Cellular components could include “nucleus”, “lysosome”, and “plasma membrane”.

Each GO term has a term name (e.g. DNA repair) and a unique term accession number (GO:0005125), and a single gene product can be associated with many GO terms.

clusterProfiler

We will be using clusterProfiler to perform over-representation analysis on GO terms associated with our list of significant genes. The tool takes as input a significant gene list and a background gene list and performs statistical enrichment analysis using hypergeometric testing. The basic arguments allow the user to select the appropriate organism and GO ontology (BP, CC, MF) to test.

Prepping the data

resdata <- read.csv("M0_VS_M1andM1T368_unfiltered_matrix_Final.csv", header = T)
resdata <- resdata[, c(2:17)]
head(resdata)

##                 Gene_id  baseMean log2FoldChange      lfcSE     stat pvalue
## 1 ENSMUSG00000000386.19 15235.742       8.119711 0.17839804 45.51457      0
## 2  ENSMUSG00000006818.6  7758.105       4.159120 0.08841911 47.03870      0
## 3 ENSMUSG00000015837.16 21140.896       2.795451 0.06015630 46.46980      0
## 4  ENSMUSG00000016496.8 24244.104       8.134324 0.13927930 58.40296      0
## 5 ENSMUSG00000017652.17  5173.867       6.058249 0.13697318 44.22945      0
## 6 ENSMUSG00000018899.18 15003.974       5.538247 0.07604645 72.82716      0
##   padj       MO_1       MO_2       MO_3      M1_1      M1_2      M1_3 M1_T368_1
## 1    0   87.45999   86.78768   71.80221 26227.415 27253.492 26498.443 17434.796
## 2    0  588.58211  695.03078  613.86945 12330.056 11839.744 11089.540 11696.676
## 3    0 4258.75008 4119.13310 4398.87168 30474.816 28893.605 28054.684 30529.450
## 4    0  121.34089  107.93762  160.17416 32976.318 36028.969 32225.833 35137.266
## 5    0  114.24954  121.79448  110.46494  9026.013  8623.723  8081.866  6895.416
## 6    0  494.03077  508.32786  434.75844 22535.267 22614.290 22597.254 21142.867
##   M1_T368_2 M1_T368_3
## 1 19974.740 19486.738
## 2 10580.185 10389.257
## 3 30429.633 29109.123
## 4 43332.355 38106.745
## 5  7008.213  6583.063
## 6 22140.661 22568.308

BioMart id conversion

It is helpful if we align all of our ducks in a row. Therefore, we need to do some gene_id conversion. For clusterprofiler, we will use both ensembl id and Entrez id. Below is the basic format for the useEnsembl() function:

#listEnsembl() #list databases available 

## DO NOT RUN 

#ensembl <- useEnsembl(biomart = , 
                      #dataset = , 
                      #mirror=)

ensembl <- useEnsembl(biomart = "genes", 
                      dataset = "mmusculus_gene_ensembl", 
                      mirror="useast")

Now, go ahead and fill in the getBM() function to perform the gene_id match of ensembl gene ids to entrez gene ids. Output this function to an object called biomart.output.

#View(listFilters(ensembl))
#View(listAttributes(ensembl))

biomart.output <- getBM(mart = ensembl, values = resdata$Gene_id, 
                      filters = "ensembl_gene_id_version",
                      attributes = c("ensembl_gene_id_version", 
                                     "entrezgene_id"))
                                     
                
head(biomart.output)

##   ensembl_gene_id_version entrezgene_id
## 1   ENSMUSG00000000028.16         12544
## 2    ENSMUSG00000000056.8         67608
## 3    ENSMUSG00000000058.7         12390
## 4    ENSMUSG00000000078.8         23849
## 5    ENSMUSG00000000088.8         12858
## 6   ENSMUSG00000000127.16         14158

We are ready to merge our new entrez annotations with resdata

names(biomart.output)[1] <- "Gene_id"
#biomart.output

resdata_gene <- dplyr::left_join(resdata, biomart.output,
                        by=c('Gene_id'='Gene_id'))
#resdata_gene

Create Gene lists

To perform the over-representation analysis, we need a list of background genes and a list of significant genes. For our background dataset we will use all genes in our results table.. For our significant gene list we will use genes with p-adjusted values less than 0.05 and absolute log2FC of 1.

first lets prep background gene set

# we want the log2 fold change
original_gene_list <- resdata_gene$log2FoldChange
head(original_gene_list)

## [1] 8.119711 4.159120 2.795451 8.134324 6.058249 5.538247

# name the vector
names(original_gene_list) <- resdata_gene$entrezgene_id
head(original_gene_list)

##     <NA>    20656    18412    60533    21939    16362 
## 8.119711 4.159120 2.795451 8.134324 6.058249 5.538247

#original_gene_list

# omit any NA values
gene_list<-na.omit(original_gene_list)

# sort the list in decreasing order (will be required so good to do this now)
gene_list = sort(gene_list, decreasing = TRUE)

now lets prep significant gene set

# Extract significant results (padj < 0.05) using subset()
sig_genes_df = subset(resdata_gene, padj < 0.05)

# From significant results, we want to filter on log2fold change
genes <- sig_genes_df$log2FoldChange

# Name the vector i.e. add entrez gene id 
names(genes) <- sig_genes_df$entrezgene_id

# omit NA values
genes <- na.omit(genes)

# filter on min log2fold change (log2FoldChange > 1)
genes <- names(genes)[abs(genes) > 1]

Ok, now we are ready to perform the GO enrichment analysis! Go ahead and fill in the rest!

NOTE: The different organisms with annotation databases available to use with for the OrgDb argument can be found below:

Also, the keyType() argument can be used once you have filled in the OrgDb.

Finally, the ont argument can accept either “BP” (Biological Process), “MF” (Molecular Function), and “CC” (Cellular Component) subontologies, or “ALL” for all three. For this example, set it to “BP”.

#go_enrich <- enrichGO(gene = ,
                      #universe = ,
                      #OrgDb = ,
                      #keyType = ,
                      #readable = T,
                      #ont = ,
                      #pvalueCutoff = 0.05,
                      #qvalueCutoff = 0.05)

keytypes(org.Mm.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MGI"         
## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIPROT"

go_enrich <- enrichGO(gene = genes,
                      universe = names(gene_list),
                      OrgDb = org.Mm.eg.db,
                      keyType = 'ENTREZID',
                      readable = T,
                      ont = "BP",
                      pvalueCutoff = 0.05,
                      qvalueCutoff = 0.05)

## Output results from GO analysis to a table
cluster_summary <- data.frame(go_enrich)

write.csv(cluster_summary, "clusterProfiler_M1andM1T863 vs M0.csv")

head(cluster_summary)

##                    ID
## GO:0009617 GO:0009617
## GO:0031347 GO:0031347
## GO:0002460 GO:0002460
## GO:0006935 GO:0006935
## GO:0042330 GO:0042330
## GO:0001819 GO:0001819
##                                                                                                                          Description
## GO:0009617                                                                                                     response to bacterium
## GO:0031347                                                                                            regulation of defense response
## GO:0002460 adaptive immune response based on somatic recombination of immune receptors built from immunoglobulin superfamily domains
## GO:0006935                                                                                                                chemotaxis
## GO:0042330                                                                                                                     taxis
## GO:0001819                                                                                positive regulation of cytokine production
##            GeneRatio   BgRatio       pvalue     p.adjust       qvalue
## GO:0009617  250/3841 462/12418 5.238966e-26 2.845282e-22 2.001285e-22
## GO:0031347  254/3841 475/12418 2.130683e-25 5.785869e-22 4.069604e-22
## GO:0002460  151/3841 264/12418 4.321770e-19 7.823844e-16 5.503054e-16
## GO:0006935  192/3841 365/12418 2.114030e-18 2.870325e-15 2.018899e-15
## GO:0042330  192/3841 366/12418 3.111990e-18 3.380244e-15 2.377561e-15
## GO:0001819  197/3841 379/12418 4.169712e-18 3.774284e-15 2.654716e-15
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   geneID
## GO:0009617                                                       Cd274/Cmpk2/Cd86/C3/Tap2/Mrc1/Gbp3/Gbp2/Herc6/Clec4e/Trim30a/Ifit1/Isg15/Nod1/Gbp7/Irgm1/Slamf8/Trim30d/H2-K1/Irgm2/Slfn2/Ifi204/Igtp/Ifi211/Serpinb9/Ccl2/Nlrc5/Irf8/Cxcl16/Cd52/Ifit3/Iigp1/Mndal/Tnfrsf1b/Abr/Jak2/Rnf213/Oas3/Gbp9/Usp18/Gbp5/Casp4/Slfn4/Ccl5/Cflar/B2m/Stat1/C5ar1/Dhx58/Znfx1/Fcgr1/Irf5/Nfkbia/Oas2/Casp1/Cd300lb/Tlr9/Arid5a/Psmb9/Ifi44/Malt1/Anxa3/Plaat3/Lgals9/Cxcl10/Cd47/Fcgr4/Tnfaip3/Ptgs2/Ripk2/Slc11a1/Bcl3/AA467197/Mx2/Trim12a/Rnf31/Lyz2/Sbno2/Peli1/Ifi205/Cx3cr1/Acod1/Gsdmd/Eif2ak2/Enpp1/Ptafr/Vim/Tnf/Tbk1/Wfdc17/Stap1/Syk/Lcn2/Sp110/Irak2/Axl/Mpeg1/Cebpb/Pde4b/Ednrb/Nfkbib/Tnfrsf1a/Fkbp5/Gch1/Rnase4/Ly6a/Nos2/Zc3h12a/Gbp4/Lpl/Pf4/Cyba/Nod2/Abcd2/Oas1a/Tnfrsf14/Hmgb2/Stab1/Trf/Mt2/Cxcl9/Oas1g/Acp5/Tnip1/Saa3/Gbp8/Trim12c/Naip2/Colec12/Cd84/Hck/Bnip3/Trem2/Cdk4/Tgtp1/Plscr4/Fcer1g/Mef2c/Notch2/Gbp6/Scarb1/Slc17a5/Chd7/Tent5a/Ticam1/Il12b/Serpina3f/Irak3/Foxp1/Pmp22/Spi1/Resf1/Adgrl2/Nlrc3/Zfp36/Pdcd4/Nlrc4/Tlr6/Ang/Serpine1/Il1a/Il1b/Trim5/Rela/Ptger2/Cfb/Sirpa/Ccl12/Spn/Hmgb1/H2-M3/Il18/Ncf1/Gbp2b/Cd36/Naip1/Mapkbp1/Trim6/Tnip3/Noct/Ptpn22/Prkce/Oas1c/Il6/Gbp10/Gbp11/Myd88/Klrk1/Cd80/Trib1/Cfh/Dusp10/Tnfrsf11a/Il36g/Pdcd1lg2/Tmem229b/Ccr7/Gpr31b/Slpi/Ptgir/Stab2/Klhl6/Ociad2/Bank1/Scimp/Cd24a/Tnfsf4/Cxcl2/Rps6ka3/Fgr/Adm/Cd79b/Ppard/Plac8/Raet1a/Raet1b/Raet1c/Mlh1/Bcr/Il23r/Mgst1/Nr1h3/Arg2/Zfp983/Rab34/Hamp/Gstp1/Lsm5/Car5b/Mir29a/Mir146/Mir23b/Ifnb1/Lta/Cav1/Ido1/Il12rb2/Pde4d/Cd4/Ssc5d/Dab2ip/Cx3cl1/Mir30b/Mir210/Tlr5/Mpo/S100a8/Pde2a/Hmcn1/Fabp4
## GO:0031347 Irf1/C3/Tap2/Irf7/Samhd1/Trim30a/Slc7a2/Parp14/Isg15/Nod1/Stat2/Trafd1/Ifi209/Irgm1/Slamf8/Trim30d/Ifi204/Igtp/Ifi211/Serpinb9/Daglb/Dtx3l/Cd74/Nlrc5/Socs3/Socs1/Ifi206/Ifi203/Tap1/Ninj1/Mndal/Tnfrsf1b/Abr/Jak2/Oas3/Gbp5/Casp4/Ccl5/Stat1/Bst1/Dhx58/N4bp1/Znfx1/Fcgr1/Zbp1/Pml/Nfkbia/Casp1/Tlr9/Trim56/Ctsc/Rigi/Zc3hav1/Ifi35/Adar/Il15/Aoah/Riok3/Anxa1/Lgals9/Fgl2/Cd47/Tnfaip3/Cst7/Ptgs2/Ripk2/Il27/Apobec3/Ifi208/Trim12a/Pik3cg/Rnf31/Mmp12/Sbno2/Arel1/Ifi205/Igf1/Cx3cr1/Cd200/Acod1/Nupr1/Clec2d/Parp9/Fancd2/Tnf/Tbk1/Pla2g4a/Stap1/Syk/Rnf135/Ptges/Abhd12/Cd28/Mefv/Ilrun/Fanca/Cebpb/Ednrb/Tnfrsf1a/Zc3h12a/Gbp4/Lpl/Nfe2l1/Cyba/Nod2/Oas1a/Traf3/Hmgb2/Serpinb1a/Rabgef1/Il16/Pvr/Clec4n/Oas1g/Acp5/Tnip1/Casp8/Creb3/Trim12c/Ier3/Il12rb1/Adrb2/Xrcc6/Calcrl/Nmi/Trim21/Ifi214/Trem2/Pbk/Smpdl3b/Fcer1g/Mef2c/Ets1/Hpse/Cdk19/Pparg/Il12b/Prkdc/Aim2/Irak3/Foxp1/Pmp22/C1qtnf12/Spi1/Mul1/Ccr2/Dnase1l3/Tlr8/Nlrc3/Zfp36/Pdcd4/Il17ra/Nlrc4/Ifi207/Fgfr1/Tlr6/Metrnl/Ptpn2/Serpine1/Uaca/Il1b/Trim5/Rela/Spn/Hmgb1/H2-M3/Cd200r4/Adora2b/Il18/Ncf1/Rasgrp1/Tlr3/Mapkbp1/Trim6/Ptpn22/Ttll12/Oas1c/Esr1/Il6/Adora2a/Dagla/Slamf6/Cebpa/Myd88/C1qbp/Klrk1/Treml4/Il1rl1/Cadm1/Ccr1/Pik3ap1/Cfh/Dusp10/Tnfrsf11a/Ccr7/Clcf1/Ctla2a/Lgals1/Hsp90aa1/Polr3g/Gpr31b/Serpinb9b/Pspc1/Serping1/Tnc/Osm/Sema7a/Banf1/Lacc1/Cd24a/Mcph1/Tnfsf4/Hgf/Cd81/Fgr/Ppard/Mmp8/Ceacam1/Nfkbiz/Il33/Raet1b/Casp12/Bcr/Il23r/Nr1h3/Arg2/Htra1/Lrrk2/Rab34/Xrcc5/Gprc5b/Hamp/Gstp1/Ikbke/Sphk1/Npas2/Il2ra/Ccl24/Alox5/Fut7/Lta/Bcl6b/Ido1/Cd276/Cx3cl1/Ada/Stk39/Pla2g3/Lag3/S100a8/Pde2a/Arg1/Fabp4/Il1rl2
## GO:0002460                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Cd274/Cd40/Irf1/Rsad2/C3/Tap2/Irf7/Ctsh/H2-T24/H2-T22/H2-K1/H2-T23/Pnp/Serpinb9/H2-Ab1/Cd74/Il18bp/H2-Aa/H2-D1/Icam1/Tnfrsf1b/Jak2/H2-Eb1/B2m/Stx11/H2-DMa/Fcgr1/Pirb/Il4ra/Arid5a/Ctsc/Malt1/Sema4a/Parp3/Anxa1/Fgl2/H2-Q4/Tnfaip3/Ripk2/Slc11a1/Il27/Bcl3/Unc13d/Fas/Vsir/Enpp1/Stat3/H2-DMb1/Tnf/C1ra/Pla2g4a/Csf2rb/Msh2/Vegfa/Emp2/Cd28/C1rl/Atad5/Msh6/Zc3h12a/Nod2/Hfe/Traf2/Il21r/Ly9/Relb/Pvr/Tcirg1/Clec4n/Nfkb2/Il12rb1/Jak3/Trem2/Hprt/Zbtb7b/Cd1d1/Fcer1g/Exo1/Mef2c/Nsd2/Rif1/Tyk2/Il12b/Csf2rb2/Swap70/Ccr2/Ung/Il17ra/Il1b/Hspa8/Batf/Spn/Hmgb1/H2-M3/Gadd45g/Il18/Slamf1/H2-Q6/Unc93b1/Il6/Slamf6/Myd88/C1qbp/H2-M2/Treml4/Il1rl1/Cfh/Jag1/Ccr7/Clcf1/Stard7/Pms2/Serpinb9b/Ptpn6/Klhl6/Serping1/C4b/Ephb6/H2-Oa/Cracr2a/Tfrc/Hmces/H2-Q7/H2-Q9/Cd24a/Tnfsf4/Cd81/C1s1/Nfkbiz/H2-Q1/Il33/Mlh1/Il23r/Irf4/Tnfsf13/Gm8909/Aplf/Raet1e/Smad7/Cd46/Gapt/Ifnb1/Gm7030/Rab27a/Fut7/Lta/Cd4/H2-DMb2/Ada/H2-Q2/Arg1
## GO:0006935                                                                                                                                                                                                                                                                                                                                                                                                                                       Gas6/Slamf8/Cd74/Ccl2/Cxcl16/Pla2g7/Ninj1/Ccl5/Bst1/C5ar1/Slc8b1/St6gal1/Sema4a/Stx3/Itgb2/Ccl7/Anxa1/Lgals9/Cxcl10/Lrp1/Rhoh/Ccrl2/Prex1/Flrt2/Nrp2/Pik3cg/Met/Cxcr4/Cx3cr1/Gpr183/Coro1b/Ptafr/Fpr2/Stap1/Sema4d/Csf3r/Syk/Vegfa/Pde4b/Ednrb/Lama3/Plxna1/Dock4/Pf4/Nod2/Hmgb2/Padi2/Cxcl9/Il16/Mtus1/Ccl8/Saa3/Creb3/Ccl6/Ccr3/Lpar1/Plxna2/Csf1/Coro1a/Trem2/Cmklr1/Lhx2/Nrcam/Fcer1g/Mef2c/Apbb2/C5ar2/Bin2/Cysltr1/Plxnd1/Sbds/Nrp1/Gpr18/Lamb2/Swap70/Foxp1/Eng/Spi1/Egr2/Ccr2/Dusp1/Ccl22/Il17ra/Nup85/Fgfr1/Etv1/Ptpn2/Serpine1/Thbs1/Aif1/Il1b/Sema3c/Fn1/Ccl12/Flot1/Ccl9/Kif5c/Hmgb1/P2rx4/Smo/Rin3/Lsp1/Slamf1/Ptk2/Pdgfa/Itga9/Sell/Cxcr3/Plxna4/Unc5b/Plgrkt/Vasp/F7/Nr4a1/Vav3/Ptch1/Kalrn/Ch25h/Kdr/Sema6b/C1qbp/Fpr1/Celsr3/Amot/Vegfb/Ccr1/Mcu/Rac2/Emb/Plxnc1/Gpr35/L1cam/Agrn/Rtn4rl1/Cttn/Ccr7/Matn2/Sema4c/S1pr1/Ptpro/Ephb6/Sema7a/Kif5a/Bcar1/Plxna3/Runx3/Hgf/B3gnt2/Cxcl2/Ryk/Plekhg5/Egr3/Flrt3/Sema6d/Unc5a/Hsd3b7/Cxcl14/Ntn1/Rtn4r/Klf7/Thbs4/Cdk5r1/Gstp1/Itga1/Xcr1/Tnfsf14/Nectin1/Neo1/Cxcr6/Ccl4/Apbb1/Pgf/Ccl24/Alox5/Crppa/Tpbg/Pde4d/Slit1/Edn1/Nrg1/Cx3cl1/Hoxa1/Lox/Stk39/P2ry12/Fpr3/Vcam1/Sema6c/S100a8/Ccr1l1/Plxnb1/Sema3f
## GO:0042330                                                                                                                                                                                                                                                                                                                                                                                                                                       Gas6/Slamf8/Cd74/Ccl2/Cxcl16/Pla2g7/Ninj1/Ccl5/Bst1/C5ar1/Slc8b1/St6gal1/Sema4a/Stx3/Itgb2/Ccl7/Anxa1/Lgals9/Cxcl10/Lrp1/Rhoh/Ccrl2/Prex1/Flrt2/Nrp2/Pik3cg/Met/Cxcr4/Cx3cr1/Gpr183/Coro1b/Ptafr/Fpr2/Stap1/Sema4d/Csf3r/Syk/Vegfa/Pde4b/Ednrb/Lama3/Plxna1/Dock4/Pf4/Nod2/Hmgb2/Padi2/Cxcl9/Il16/Mtus1/Ccl8/Saa3/Creb3/Ccl6/Ccr3/Lpar1/Plxna2/Csf1/Coro1a/Trem2/Cmklr1/Lhx2/Nrcam/Fcer1g/Mef2c/Apbb2/C5ar2/Bin2/Cysltr1/Plxnd1/Sbds/Nrp1/Gpr18/Lamb2/Swap70/Foxp1/Eng/Spi1/Egr2/Ccr2/Dusp1/Ccl22/Il17ra/Nup85/Fgfr1/Etv1/Ptpn2/Serpine1/Thbs1/Aif1/Il1b/Sema3c/Fn1/Ccl12/Flot1/Ccl9/Kif5c/Hmgb1/P2rx4/Smo/Rin3/Lsp1/Slamf1/Ptk2/Pdgfa/Itga9/Sell/Cxcr3/Plxna4/Unc5b/Plgrkt/Vasp/F7/Nr4a1/Vav3/Ptch1/Kalrn/Ch25h/Kdr/Sema6b/C1qbp/Fpr1/Celsr3/Amot/Vegfb/Ccr1/Mcu/Rac2/Emb/Plxnc1/Gpr35/L1cam/Agrn/Rtn4rl1/Cttn/Ccr7/Matn2/Sema4c/S1pr1/Ptpro/Ephb6/Sema7a/Kif5a/Bcar1/Plxna3/Runx3/Hgf/B3gnt2/Cxcl2/Ryk/Plekhg5/Egr3/Flrt3/Sema6d/Unc5a/Hsd3b7/Cxcl14/Ntn1/Rtn4r/Klf7/Thbs4/Cdk5r1/Gstp1/Itga1/Xcr1/Tnfsf14/Nectin1/Neo1/Cxcr6/Ccl4/Apbb1/Pgf/Ccl24/Alox5/Crppa/Tpbg/Pde4d/Slit1/Edn1/Nrg1/Cx3cl1/Hoxa1/Lox/Stk39/P2ry12/Fpr3/Vcam1/Sema6c/S100a8/Ccr1l1/Plxnb1/Sema3f
## GO:0001819                                                                                                                                                                                                                                                                                                                                                                                                     Cd274/Cd40/Irf1/Rsad2/C3/Irf7/Clec4e/Isg15/Nod1/Ifi204/Pnp/Ifi211/Cd74/Ccl2/Irf8/Ifi206/Mndal/Jak2/Oas3/Gbp5/Casp4/Ccl5/B2m/Hif1a/Stat1/C5ar1/Dhx58/Irf5/Il4ra/Sulf2/Oas2/Casp1/Tlr9/Arid5a/Trim56/Rigi/Malt1/Ifih1/Zc3hav1/Il15/Riok3/Anxa1/Lgals9/Ptgs2/Lrp1/Ripk2/Slc11a1/Il27/Bcl3/Ifngr1/Mmp12/Peli1/Ifi205/Cd200/Gsdmd/Eif2ak2/Stat3/Ptafr/Tnf/Tbk1/Syk/Rnf135/Cd28/Mefv/Cebpb/Pde4b/Scamp5/Tnfrsf1a/Mapk9/Nos2/Ticam2/Pnp2/Lpl/Cyba/Nod2/Oas1a/Brca1/Tnfrsf14/Src/Sphk2/Traf2/Hmgb2/Ly9/Il16/Gapdh/Tarm1/Clec4n/Oas1g/Casp8/Il12rb1/Nfam1/Rtn4/Cd84/Trem2/Atf4/Zbtb7b/Cd1d1/Fcer1g/Ccdc88b/Hpse/Psen1/Ticam1/Tyk2/Il12b/Aim2/Irak3/Foxp1/Hk1/Mbp/Ccr2/Tlr8/Il17ra/Nlrc4/Ifi207/Rgcc/Tlr6/Serpine1/Il1a/Thbs1/Mcoln2/Aif1/Il1b/Rela/Batf/Flot1/Spn/Hmgb1/H2-M3/Adora2b/Il18/Slamf1/Rasgrp1/Cd36/Hilpda/Tlr3/Trim6/Ddit3/Unc93b1/Ptpn22/Oas1c/Tnfsf15/Panx1/Il6/Tnfrsf8/Slamf6/Myd88/Egr1/Klrk1/Il1rl1/Cadm1/Htr2b/Flt4/Il36g/Ccr7/Hsp90aa1/Polr3g/Eif2ak3/Il1rap/Osm/Clec9a/Sema7a/Scimp/Lacc1/H2-Q9/Tnfsf4/Hgf/Cd81/Fgr/Mmp8/Il33/Raet1b/Il23r/Irf4/Chil3/Lrrk2/Gprc5b/Cd244a/Il7/Sphk1/Cd34/Card11/Cd46/Ccl4/Akap12/Cyp2j6/Mapk11/Lta/Ido1/Il12rb2/Cd276/Pde4d/Cx3cl1/Tlr5/Pla2g3/Postn/F3/Il1rl2
##            Count
## GO:0009617   250
## GO:0031347   254
## GO:0002460   151
## GO:0006935   192
## GO:0042330   192
## GO:0001819   197

Visualizing clusterProfiler results

clusterProfiler has a variety of options for viewing the over-represented GO terms. We will explore the barplot(), dotplot() firtst.

barplot <- barplot(go_enrich, showCategory = 10,
                   title = "GO Biological Pathways",
                   font.size = 8)

barplot

Let’s make y-axis labels readable!

barplot <- barplot(go_enrich, showCategory = 15,
                   title = "GO Biological Pathways",
                   font.size = 12,
                   label_format = 50)
barplot

Next, we will represent the pathways as dotplot to display the top 10 GO terms by gene ratio (# genes related to GO term / total number of sig genes).

dot <- dotplot(go_enrich)
dot

Again, let’s make the y-axis label readable!

dot <- dotplot(go_enrich, label_format = 60, font.size = 12)
dot

pathway <- goplot(go_enrich, showCategory = 5)
pathway

Gene Concept Network

Both the barplot() and dotplot() only displayed most significant terms, but some users may want to know which genes are involved in these significant terms.

In order to consider the potentially biological complexities in which a gene may belong to multiple categories, we can use the cnetplot() function. The cnetplot() depicts the linkages of genes and biological concepts (e.g. GO terms or KEGG pathways) as a network.

cnetplot(go_enrich, showCategory = 3, 
         categorySize="pvalue", foldChange=gene_list, 
         vertex.label.font=15, cex_label_gene = 0.6, cex_label_category= 0.9)

cnetplot(go_enrich, showCategory = 3, 
         categorySize="pvalue", foldChange=gene_list, 
         vertex.label.font=6, cex_label_category = 0.9,
         node_label="category")

> Select which labels to be displayed. Options include ‘category’, ‘gene’, ‘all’(the default) and ‘none’.

cnetplot(go_enrich, showCategory = 2, 
         categorySize="pvalue", foldChange=gene_list, 
         circular = TRUE, colorEdge = TRUE,
         cex_label_gene = 0.55, vertex.label.font=12)

Circular plots can be quite powerful, if the list was smaller, here things get a bit crowded - overall makes it hard to interpret. Moral of the story, just because you can make it doesn’t mean you should!

library(ggplot2)
plot_color <- cnetplot(go_enrich, showCategory = 5, 
         categorySize="pvalue", foldChange=gene_list, 
         vertex.label.font=10, cex_label_gene = 0.55, cex_label_category= 0.75) + 
  scale_color_gradient2(name='fold change', low='darkgreen', high='purple1')

plot_color

Gene Set Enrichment Analysis (GSEA)

Within clusterprofiler we can take advantage of the gseGO function which will allow us to perform Gene Set Enrichment Analysis. This package is great, because you can select which pathways you would like to represent specifically. GSEA works by aggregating the per gene statistics across genes within the given gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way.

Some GSEA statistics:

Calculation of an Enrichment Score. The primary result of the gene set enrichment analysis is the enrichment score (ES), which reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes. GSEA calculates the ES by walking down the ranked list of genes, increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not. The magnitude of the increment depends on the correlation of the gene with the phenotype. The ES is the maximum deviation from zero encountered in walking the list. A positive ES indicates gene set enrichment at the top of the ranked list; a negative ES indicates gene set enrichment at the bottom of the ranked list.
Normalized Enrichment Score. The normalized enrichment score (NES) is the primary statistic for examining gene set enrichment results. By normalizing the enrichment score, GSEA accounts for differences in gene set size and in correlations between gene sets and the expression dataset; therefore, the normalized enrichment scores (NES) can be used to compare analysis results across gene sets.
False Discovery Rate (FDR) The false discovery rate (FDR) is the estimated probability that a gene set with a given NES represents a false positive finding. An FDR cutoff of 5% is standard.
Nominal P Value The nominal p value estimates the statistical significance of the enrichment score for a single gene set.

organism = get("org.Mm.eg.db") #load genome 

gsea <- gseGO(geneList=gene_list, 
             ont ="BP", 
             keyType = "ENTREZID", 
             nPerm = 10000, 
             minGSSize = 3, 
             maxGSSize = 800, 
             pvalueCutoff = 0.05, 
             verbose = TRUE, 
             OrgDb = organism, 
             pAdjustMethod = "BH")

write.csv(gsea,"gsea_BP_M0 VS M1andM1T863.csv")
head(gsea)

##                    ID                          Description setSize
## GO:0051304 GO:0051304                chromosome separation     120
## GO:0034341 GO:0034341         response to interferon-gamma     122
## GO:0051983 GO:0051983 regulation of chromosome segregation     115
## GO:1905818 GO:1905818  regulation of chromosome separation      99
## GO:0035456 GO:0035456          response to interferon-beta      57
## GO:0045087 GO:0045087               innate immune response     573
##            enrichmentScore       NES       pvalue    p.adjust      qvalue rank
## GO:0051304      -0.7519754 -2.725219 0.0002513826 0.005815937 0.004906671 1003
## GO:0034341       0.7913420  2.701156 0.0001652346 0.005815937 0.004906671 1610
## GO:0051983      -0.7480332 -2.690859 0.0002511301 0.005815937 0.004906671 1486
## GO:1905818      -0.7558384 -2.662508 0.0002474635 0.005815937 0.004906671 1486
## GO:0035456       0.8685041  2.641859 0.0001748863 0.005815937 0.004906671 1375
## GO:0045087       0.6596275  2.632016 0.0001370802 0.005815937 0.004906671 2090
##                              leading_edge
## GO:0051304  tags=34%, list=6%, signal=32%
## GO:0034341  tags=51%, list=9%, signal=46%
## GO:0051983  tags=34%, list=9%, signal=31%
## GO:1905818  tags=34%, list=9%, signal=32%
## GO:0035456  tags=70%, list=8%, signal=65%
## GO:0045087 tags=36%, list=12%, signal=33%
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             core_enrichment
## GO:0051304                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  75430/70099/78658/21335/104806/52276/69716/56150/74335/14211/76044/77045/69928/68014/83558/66442/67141/268465/68298/209334/23834/67052/229841/12236/20877/105988/11799/12704/215387/22137/268697/21973/12235/54392/208628/67629/18817/107995/66977/234396/68612
## GO:0034341                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   18126/16365/17472/14468/21822/100702/20304/229898/16160/14469/24108/16161/626578/20307/634650/12265/12703/70789/55932/229900/76074/209590/16145/54483/20299/21939/20293/14961/14969/16362/14960/15944/54396/56221/104252/236573/16149/15018/110558/66102/547253/434341/74732/20296/20846/20306/16452/22793/19332/69550/80285/12362/64685/20912/68713/15900/66141/20684/20821/69834/16423/18173
## GO:0051983                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                75430/70099/78658/58186/21335/52276/69716/56150/74335/14211/76044/77045/68014/66442/67141/214901/68298/209334/23834/67052/229841/12236/20877/17345/11799/12704/215387/108912/22137/268697/12235/54392/208628/67629/18817/107995/66977/73804/68612
## GO:1905818                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                75430/70099/78658/21335/52276/69716/56150/74335/14211/76044/77045/68014/66442/67141/68298/209334/23834/67052/229841/12236/20877/11799/12704/215387/22137/268697/12235/54392/208628/67629/18817/107995/66977/68612
## GO:0035456                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    16365/14468/21822/100702/60440/225594/14469/100039796/226695/240328/545384/55932/102639543/623121/100033459/15957/240327/229900/15959/620913/236312/16145/15953/15977/16362/15944/54396/381308/327959/20846/15951/15950/432555/100040462/69550/94088/23960/68713/66141/246730
## GO:0045087 18126/16365/17472/14468/21822/100702/60440/20304/229898/16160/19419/17167/14962/14469/24108/16161/626578/27007/20307/12266/226695/634650/12265/12703/15958/70789/231655/545384/55932/102639543/100033459/58185/245126/384059/15957/434219/12258/229900/56619/76074/100038882/20811/22164/15959/215257/209590/244183/50908/54123/246779/236312/16145/27218/11847/58203/54483/20299/74748/21939/56620/57444/20293/14961/14969/15977/16818/16362/17858/16666/14960/81897/20723/243659/107607/15944/20459/225825/54396/56221/381308/104252/12642/56631/236573/67138/16149/232371/56045/15018/110558/75345/214855/21354/66102/20568/75234/246727/23962/547253/12977/434341/30925/12363/74732/503550/209387/224840/16819/20296/20846/71586/50909/21355/13482/20847/20779/16768/15951/20706/15950/20128/20306/257632/192656/18477/16452/239081/209200/100040462/22793/19332/230979/14129/21929/66107/69550/65221/231712/80750/94088/230073/80287/80285/109032/23960/12362/16859/434218/66222/64685/12479/20912/68713/80861/70110/76681/98999/17174/246728/12010/18854/15900/73670/71956/66141/20684/14191/246730/269799/100340/13058/56417/94094/240354/63953/20821/69834/19698/210808/68279/69146/15162/16423/18173/217069/26362/14528/384309/14991/286940/27056/52118/100034251/319236/11846/70450/16173/224647/93694/19106/56489/269855/78781/142980/225471/12263/19255/211550/27361

gsea1 <- gseaplot2(gsea, title = gsea$Description[10], geneSetID = 1)
gsea1

gsea2 <- gseaplot2(gsea, title = gsea$Description[20], geneSetID = 1)
gsea2

final_gsea <- gseaplot2(gsea, geneSetID = 1:5)
final_gsea

Session info

sessionInfo()

## R version 4.2.2 (2022-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] enrichplot_1.18.4           ggupset_0.3.0              
##  [3] pathview_1.38.0             ggnewscale_0.4.8           
##  [5] DOSE_3.24.2                 org.Mm.eg.db_3.16.0        
##  [7] AnnotationDbi_1.60.2        clusterProfiler_4.6.2      
##  [9] ggplotify_0.1.0             cowplot_1.1.1              
## [11] biomaRt_2.54.1              EnhancedVolcano_1.16.0     
## [13] ggrepel_0.9.3               lubridate_1.9.2            
## [15] forcats_1.0.0               stringr_1.5.0              
## [17] dplyr_1.1.1                 purrr_1.0.1                
## [19] readr_2.1.4                 tidyr_1.3.0                
## [21] tibble_3.2.1                tidyverse_2.0.0            
## [23] ggplot2_3.4.2               pheatmap_1.0.12            
## [25] RColorBrewer_1.1-3          DESeq2_1.38.3              
## [27] SummarizedExperiment_1.28.0 Biobase_2.58.0             
## [29] MatrixGenerics_1.10.0       matrixStats_0.63.0         
## [31] GenomicRanges_1.50.2        GenomeInfoDb_1.34.9        
## [33] IRanges_2.32.0              S4Vectors_0.36.2           
## [35] BiocGenerics_0.44.0         knitr_1.42                 
## 
## loaded via a namespace (and not attached):
##   [1] shadowtext_0.1.2       fastmatch_1.1-3        BiocFileCache_2.6.1   
##   [4] systemfonts_1.0.4      plyr_1.8.8             igraph_1.4.2          
##   [7] lazyeval_0.2.2         splines_4.2.2          BiocParallel_1.32.6   
##  [10] digest_0.6.31          yulab.utils_0.0.6      htmltools_0.5.5       
##  [13] GOSemSim_2.24.0        viridis_0.6.2          GO.db_3.16.0          
##  [16] fansi_1.0.4            magrittr_2.0.3         memoise_2.0.1         
##  [19] tzdb_0.3.0             Biostrings_2.66.0      annotate_1.76.0       
##  [22] graphlayouts_0.8.4     timechange_0.2.0       prettyunits_1.1.1     
##  [25] colorspace_2.1-0       blob_1.2.4             rappdirs_0.3.3        
##  [28] textshaping_0.3.6      xfun_0.38              crayon_1.5.2          
##  [31] RCurl_1.98-1.12        jsonlite_1.8.4         graph_1.76.0          
##  [34] scatterpie_0.1.8       ape_5.7-1              glue_1.6.2            
##  [37] polyclip_1.10-4        gtable_0.3.3           zlibbioc_1.44.0       
##  [40] XVector_0.38.0         DelayedArray_0.24.0    Rgraphviz_2.42.0      
##  [43] scales_1.2.1           DBI_1.1.3              Rcpp_1.0.10           
##  [46] viridisLite_0.4.1      xtable_1.8-4           progress_1.2.2        
##  [49] tidytree_0.4.2         gridGraphics_0.5-1     bit_4.0.5             
##  [52] httr_1.4.5             fgsea_1.24.0           pkgconfig_2.0.3       
##  [55] XML_3.99-0.14          farver_2.1.1           sass_0.4.5            
##  [58] dbplyr_2.3.2           locfit_1.5-9.7         utf8_1.2.3            
##  [61] tidyselect_1.2.0       labeling_0.4.2         rlang_1.1.0           
##  [64] reshape2_1.4.4         munsell_0.5.0          tools_4.2.2           
##  [67] cachem_1.0.7           downloader_0.4         cli_3.6.1             
##  [70] generics_0.1.3         RSQLite_2.3.1          gson_0.1.0            
##  [73] evaluate_0.20          fastmap_1.1.1          yaml_2.3.7            
##  [76] ragg_1.2.5             ggtree_3.6.2           org.Hs.eg.db_3.16.0   
##  [79] bit64_4.0.5            tidygraph_1.2.3        KEGGREST_1.38.0       
##  [82] ggraph_2.1.0           nlme_3.1-160           KEGGgraph_1.58.3      
##  [85] aplot_0.1.10           xml2_1.3.3             compiler_4.2.2        
##  [88] rstudioapi_0.14        filelock_1.0.2         curl_5.0.0            
##  [91] png_0.1-8              treeio_1.22.0          tweenr_2.0.2          
##  [94] geneplotter_1.76.0     bslib_0.4.2            stringi_1.7.12        
##  [97] highr_0.10             lattice_0.20-45        Matrix_1.5-4          
## [100] vctrs_0.6.1            pillar_1.9.0           lifecycle_1.0.3       
## [103] jquerylib_0.1.4        data.table_1.14.8      bitops_1.0-7          
## [106] patchwork_1.1.2        qvalue_2.30.0          R6_2.5.1              
## [109] gridExtra_2.3          codetools_0.2-18       MASS_7.3-58.1         
## [112] withr_2.5.0            GenomeInfoDbData_1.2.9 parallel_4.2.2        
## [115] hms_1.1.3              grid_4.2.2             ggfun_0.0.9           
## [118] HDO.db_0.99.1          rmarkdown_2.21         ggforce_0.4.1

Part II : Hypothesis testing #2 (M1 with/without T863 VS M0)

Soyeon Gullickson

05/02/2023

Please note that There are total three .Rmd scripts for my Final Project (Part I, Part II, and Part III).

GSE145523

Original article

Introduction

Goal of this .Rmd script

Analysis process plan:

A. DESeq2 Analysis

Install the required R libraries

Code chunk options

Load the required libraries for DESeq2

DESeq2 Introduction

DESeq2 Overview

htseq-count input

Running DESeq2 requires DESeqDataSet

Pre-filtering

Note on factor levels

Now let’s run the Differential expression analysis

Data transformation and visualization

Principal components analysis (PCA)

B. Volcano plot

Volcano Plot Introduction

Load the required libraries for Volcano Plot

Volcano plot w/ EnhancedVolcano

Data input

BioMart id conversion

C.Heatmap

Introduction to Heatmaps

Loading packages for Heatmap

Loading data

Set up prior to using pheatmap

Z score

Adding annotation

D. Functional Analysis

Install and load required ackages for Functional Analysis

Functional analysis

GO Ontologies

clusterProfiler

Prepping the data

BioMart id conversion

Create Gene lists

first lets prep background gene set

now lets prep significant gene set

Visualizing clusterProfiler results

Gene Concept Network

Gene Set Enrichment Analysis (GSEA)

Session info

Set up prior to using `pheatmap`