The data being analyzed in this project is a bulk RNA-seq of 8 different human tumor samples.These samples were surgically removed from various patients with hepatocellular carcinoma (HCC) primary liver cancer. These samples are divided into 2 groups, tumors that were able to successfully form tumor organoids and those that could not, with 4 biological replicates for each. The goal of this analysis is to determine what molecular mechanisms differ between successful and failed tumor samples
Raw RNA-seq data can be found with the SRA accession PRJNA841062 and is sourced from the following scientific paper:
Xian, L., Zhao, P., Chen, X., Wei, Z., Ji, H., Zhao, J., Liu, W., Li, Z., Liu, D., Han, X., Qian, Y., Dong, H., Zhou, X., Fan, J., Zhu, X., Yin, J., Tan, X., Jiang, D., Yu, H., & Cao, G. (2022). Heterogeneity, inherent and acquired drug resistance in patient-derived organoid models of primary liver cancer. Cellular oncology (Dordrecht), 45(5), 10191036. https://doi.org/10.1007/s13402-022-00707-3
The following libraries are necessary for data analysis and visualization.
library(knitr)
library(DESeq2)
library(RColorBrewer)
library(pheatmap)
library(ggplot2)
library(tidyverse)
library(cowplot)
library(tidyr)
library(dplyr)
library(patchwork)
library(ggplotify)
library(clusterProfiler)
library(msigdbr)
library(pathview)
library(enrichplot)
library(msigdbr)
library(GSEABase)
library(genekitr)
library(org.Hs.eg.db)
library(DOSE)
library(ggnewscale)
The file ‘PRJNA841062_HCC_anno.csv’ is loaded as the input annotation file into an object called ‘annotationTable’. This object is converted into a data frame format.
#load annotation file
sampleTable <- read.csv("PRJNA841062_HCC_anno.csv", header = TRUE)
sampleTable <- data.frame(sampleTable)
head(sampleTable)
## samplename filename presentation
## 1 HCC_1 HCC_1.ENSG.count Failure
## 2 HCC_125 HCC_125.ENSG.count Failure
## 3 HCC_101 HCC_101.ENSG.count Failure
## 4 HCC_109 HCC_109.ENSG.count Failure
## 5 HCC_118 HCC_118.ENSG.count Success
## 6 HCC_25 HCC_25.ENSG.count Success
The “presentation” column is specified as a factor by first creating a ‘presentation’ object (allowing categorization of data).
#specify the presentation column as factor
presentation <- factor(sampleTable$presentation)
The design is set as ‘~presentation’, where the presentation is identified as the distinguishing parameter in data comparison. ‘dds’ is created as the DESEQ2 object
# perform DESeqDataSet
dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
directory = ".",
design = ~presentation)
dds
## class: DESeqDataSet
## dim: 69222 8
## metadata(1): version
## assays(1): counts
## rownames(69222): ENSG00000000003.15 ENSG00000000005.6 ...
## ENSG00000291316.1 ENSG00000291317.1
## rowData names(0):
## colnames(8): HCC_1 HCC_125 ... HCC_10 HCC_52
## colData names(1): presentation
Prefiltering is performed to filter out any insignificant rows that have less than 10 reads. A new object is created for this filtering.
#filter out insignificant data
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
The reference level is identified as the ‘Failure’ presentation group, which will be considered the control group by R.
dds$presentation <- relevel(dds$presentation, ref = "Failure")
The DESEQ2 Object is saved/loaded as RDS file
# saveRDS(dds, file = "dds.rds")
# load RDS file of dds if necessary
dds <- readRDS("dds.rds")
Differential expression analysis is conducted using ‘DESeq()’.
#perform differentioal expression analysis
DESeq_dds <- DESeq(dds)
Differential expression analysis output is saved/loaded as RDS file
#saveRDS(DESeq_dds, file = "DESeq_dds.rds")
# load RDS file if necessary
DESeq_dds <- readRDS("DESeq_dds.rds")
The results of the analysis are extracted in a tabular format into a new object, with a comparison between the two presentation groups. Interpretations of the table columns are as follows: ‘baseMean’ = average of normalized counts of specific gene across all samples ‘log2FoldChange’ = log 2 fold change between the Success and Failure groups of HCC organoids ‘pvalue’ = P value (from Wald test) ‘padj’ = P adjusted value (Benjamini-Hochberg adjusted)
#extract results from DESeq
DESeq_result <- results(DESeq_dds, contrast = c("presentation", "Success", "Failure"))
head(DESeq_result)
## log2 fold change (MLE): presentation Success vs Failure
## Wald test p-value: presentation Success vs Failure
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003.15 2624.8871 -0.415806 0.491689 -0.845669 0.397737253
## ENSG00000000005.6 2.3538 -3.122146 2.335452 -1.336849 0.181272018
## ENSG00000000419.14 1550.9591 0.497281 0.383389 1.297067 0.194608247
## ENSG00000000457.14 618.2154 -0.364450 0.543778 -0.670218 0.502718570
## ENSG00000000460.17 353.3706 0.447382 0.382170 1.170638 0.241744226
## ENSG00000000938.13 1077.7274 2.421342 0.626762 3.863259 0.000111884
## padj
## <numeric>
## ENSG00000000003.15 0.74283710
## ENSG00000000005.6 NA
## ENSG00000000419.14 0.54942984
## ENSG00000000457.14 0.80791547
## ENSG00000000460.17 0.60408150
## ENSG00000000938.13 0.00677819
The results are sorted by P adjusted values and placed in a table
#sort by padj
DESeq_sorted_result <- DESeq_result [order(DESeq_result $padj), ]
head(DESeq_sorted_result)
## log2 fold change (MLE): presentation Success vs Failure
## Wald test p-value: presentation Success vs Failure
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000266964.6 137.612 -6.96926 0.671256 -10.38243 2.98102e-25
## ENSG00000174358.16 1066.969 12.08246 1.289632 9.36892 7.32776e-21
## ENSG00000019169.11 592.666 6.89127 0.778820 8.84836 8.88171e-19
## ENSG00000073734.10 541.499 -5.87127 0.683452 -8.59062 8.65002e-18
## ENSG00000183549.10 1676.780 -4.81096 0.561967 -8.56093 1.11960e-17
## ENSG00000106541.12 5203.983 10.16415 1.279555 7.94350 1.96554e-15
## padj
## <numeric>
## ENSG00000266964.6 6.26670e-21
## ENSG00000174358.16 7.70220e-17
## ENSG00000019169.11 6.22371e-15
## ENSG00000073734.10 4.54602e-14
## ENSG00000183549.10 4.70724e-14
## ENSG00000106541.12 6.88658e-12
table(DESeq_sorted_result$padj<0.05)
##
## FALSE TRUE
## 19943 1079
The sorted results are saved/loaded as RDS file
#saveRDS(DESeq_sorted_result, file = "DESeq_sorted_result.rds")
# load RDS file if necessary
DESeq_sorted_result <- readRDS("DESeq_sorted_result.rds")
The DESeq results are merged with the counts data. The first column is also renamed with the Gene IDs.
#merge counts and DESeq
resdata <- merge(as.data.frame(DESeq_sorted_result), as.data.frame(counts(DESeq_dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene_id"
head(resdata)
## Gene_id baseMean log2FoldChange lfcSE stat pvalue
## 1 ENSG00000266964.6 137.6119 -6.969261 0.6712556 -10.382426 2.981022e-25
## 2 ENSG00000174358.16 1066.9688 12.082459 1.2896319 9.368921 7.327756e-21
## 3 ENSG00000019169.11 592.6665 6.891274 0.7788197 8.848357 8.881707e-19
## 4 ENSG00000073734.10 541.4995 -5.871274 0.6834516 -8.590621 8.650017e-18
## 5 ENSG00000183549.10 1676.7798 -4.810957 0.5619666 -8.560931 1.119598e-17
## 6 ENSG00000106541.12 5203.9835 10.164150 1.2795554 7.943501 1.965536e-15
## padj HCC_1 HCC_125 HCC_101 HCC_109 HCC_118
## 1 6.266704e-21 206.358710 336.43623 195.103552 353.728185 0.811056
## 2 7.702204e-17 0.000000 0.00000 0.000000 1.618893 1231.182993
## 3 6.223708e-15 6.094983 11.50966 3.363854 17.807826 466.357194
## 4 4.546017e-14 622.558977 1158.93427 1283.310433 1193.124358 13.787952
## 5 4.707238e-14 1749.260117 2362.13646 4305.733562 4534.520119 48.663359
## 6 6.886584e-12 4.353559 27.44611 1.681927 2.428340 1180.897522
## HCC_25 HCC_10 HCC_52
## 1 2.870737 3.917407 1.669219
## 2 4330.028455 425.691546 2547.228247
## 3 656.441883 680.322992 2899.433463
## 4 9.569124 45.703080 5.007657
## 5 116.743309 203.705157 93.476266
## 6 162.675102 14652.407463 25599.977727
The results are written as a new file.
write.csv(resdata, file = "HCCSuccessvsFailure_ unfiltered_matrix.csv")
A list of differentially expressed genes are acquired from the previous data. These DEGs are filtered by a p adjusted value of < 0.05 and a log 2 fold change of 1.
#filter out insignificant data by padj and log2FC
resdata <- as.data.frame(resdata) %>% dplyr::filter(abs(log2FoldChange) > 1 & padj < 0.05)
head(resdata)
## Gene_id baseMean log2FoldChange lfcSE stat pvalue
## 1 ENSG00000266964.6 137.6119 -6.969261 0.6712556 -10.382426 2.981022e-25
## 2 ENSG00000174358.16 1066.9688 12.082459 1.2896319 9.368921 7.327756e-21
## 3 ENSG00000019169.11 592.6665 6.891274 0.7788197 8.848357 8.881707e-19
## 4 ENSG00000073734.10 541.4995 -5.871274 0.6834516 -8.590621 8.650017e-18
## 5 ENSG00000183549.10 1676.7798 -4.810957 0.5619666 -8.560931 1.119598e-17
## 6 ENSG00000106541.12 5203.9835 10.164150 1.2795554 7.943501 1.965536e-15
## padj HCC_1 HCC_125 HCC_101 HCC_109 HCC_118
## 1 6.266704e-21 206.358710 336.43623 195.103552 353.728185 0.811056
## 2 7.702204e-17 0.000000 0.00000 0.000000 1.618893 1231.182993
## 3 6.223708e-15 6.094983 11.50966 3.363854 17.807826 466.357194
## 4 4.546017e-14 622.558977 1158.93427 1283.310433 1193.124358 13.787952
## 5 4.707238e-14 1749.260117 2362.13646 4305.733562 4534.520119 48.663359
## 6 6.886584e-12 4.353559 27.44611 1.681927 2.428340 1180.897522
## HCC_25 HCC_10 HCC_52
## 1 2.870737 3.917407 1.669219
## 2 4330.028455 425.691546 2547.228247
## 3 656.441883 680.322992 2899.433463
## 4 9.569124 45.703080 5.007657
## 5 116.743309 203.705157 93.476266
## 6 162.675102 14652.407463 25599.977727
The results are written as a new file.
write.csv(resdata, file = "HCCSuccessvsFailure_ filtered_matrix.csv")
Padj threshold and vector are added to results table.
#significant padj adjustment
threshold_padj <- resdata$padj < 0.05
length(which(threshold_padj))
## [1] 1053
resdata$threshold <- threshold_padj
str(resdata)
## 'data.frame': 1053 obs. of 16 variables:
## $ Gene_id : 'AsIs' chr "ENSG00000266964.6" "ENSG00000174358.16" "ENSG00000019169.11" "ENSG00000073734.10" ...
## $ baseMean : num 138 1067 593 541 1677 ...
## $ log2FoldChange: num -6.97 12.08 6.89 -5.87 -4.81 ...
## $ lfcSE : num 0.671 1.29 0.779 0.683 0.562 ...
## $ stat : num -10.38 9.37 8.85 -8.59 -8.56 ...
## $ pvalue : num 2.98e-25 7.33e-21 8.88e-19 8.65e-18 1.12e-17 ...
## $ padj : num 6.27e-21 7.70e-17 6.22e-15 4.55e-14 4.71e-14 ...
## $ HCC_1 : num 206.36 0 6.09 622.56 1749.26 ...
## $ HCC_125 : num 336.4 0 11.5 1158.9 2362.1 ...
## $ HCC_101 : num 195.1 0 3.36 1283.31 4305.73 ...
## $ HCC_109 : num 353.73 1.62 17.81 1193.12 4534.52 ...
## $ HCC_118 : num 0.811 1231.183 466.357 13.788 48.663 ...
## $ HCC_25 : num 2.87 4330.03 656.44 9.57 116.74 ...
## $ HCC_10 : num 3.92 425.69 680.32 45.7 203.71 ...
## $ HCC_52 : num 1.67 2547.23 2899.43 5.01 93.48 ...
## $ threshold : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
The dataframe is altered to add gene symbols and upregulation/downregulation based on padj and log2FC
#add column that includes gene symbols
resdata$Gene_id <- sub("\\.\\d+", "", resdata$Gene_id)
ids <- resdata$Gene_id
gene_symbol_map <- transId(ids, transTo = "symbol", org = "human")
names(resdata)[1] = "input_id"
resdata_gene <- merge(resdata, gene_symbol_map, by.x = "input_id", by.y = "input_id")
#specify up or downregulation based on padj and log2FC values, and add as new column
resdata_gene$diffexpressed <- "Unchanged"
resdata_gene$diffexpressed[resdata_gene$padj < 0.05 & resdata_gene$log2FoldChange > 1] <- "Upregulated"
resdata_gene$diffexpressed[resdata_gene$padj < 0.05 & resdata_gene$log2FoldChange < -1] <- "Downregulated"
str(resdata_gene)
## 'data.frame': 980 obs. of 18 variables:
## $ input_id : 'AsIs' chr "ENSG00000000938" "ENSG00000001036" "ENSG00000001626" "ENSG00000005059" ...
## $ baseMean : num 1077.7 4643 66.3 425.3 1420.5 ...
## $ log2FoldChange: num 2.42 1.34 5.86 1.8 -1.82 ...
## $ lfcSE : num 0.627 0.442 1.757 0.576 0.42 ...
## $ stat : num 3.86 3.02 3.33 3.13 -4.33 ...
## $ pvalue : num 1.12e-04 2.49e-03 8.54e-04 1.76e-03 1.46e-05 ...
## $ padj : num 0.00678 0.04885 0.02495 0.03866 0.00161 ...
## $ HCC_1 : num 311.71 3630 4.35 111.45 2133.24 ...
## $ HCC_125 : num 432.94 2100.96 4.43 202.75 1781.34 ...
## $ HCC_101 : num 496 2472 0 215 2486 ...
## $ HCC_109 : num 117 2330 0 229 2453 ...
## $ HCC_118 : num 1581 3830 204 444 759 ...
## $ HCC_25 : num 1171 5018 0 272 506 ...
## $ HCC_10 : num 912.76 6954.7 5.22 588.92 919.28 ...
## $ HCC_52 : num 3600 10808 313 1339 326 ...
## $ threshold : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ symbol : chr "FGR" "FUCA2" "CFTR" "MCUB" ...
## $ diffexpressed : chr "Upregulated" "Upregulated" "Upregulated" "Upregulated" ...
The results are written as a new file.
write.csv(resdata_gene, "HCC_Success_vs_Failure_DEG.csv", row.names = TRUE)
Differential expression analysis data is used to create a PCA plot.
vstcounts <- vst(DESeq_dds, blind = FALSE)
PCAplot <- plotPCA(vstcounts, intgroup=c("presentation"), returnData=TRUE)
percentVar <- round(100 * attr(PCAplot, "percentVar"))
options(bitmapType="cairo")
#use ggplot to edit
HCCpca <- ggplot(PCAplot, aes(PC1, PC2, color=presentation)) +
geom_point(size=2) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
coord_fixed()
HCCpca
The PCA plot is saved as RDS file.
#saveRDS(HCCpca, file = "PCAplot.rds")
The previously written .csv files are loaded as newly named objects.
#load DEG file
all_matrix <- read.csv("HCC_Success_vs_Failure_DEG.csv", header = TRUE)
all_matrix <- data_frame(all_matrix)
Dataframe for the data required for heatmap visualization is loaded. Gene symbols are altered in consideration of multiple gene IDs matching to the same symbol.
#create dataframe with only sample columns from DEG dataframe
heatmap_matrix <- dplyr:::select.data.frame(all_matrix,symbol, HCC_1:HCC_52)
#change duplicated gene symbols
uq_symbol_names <- make.names(heatmap_matrix[[1]], unique = T)
heatmap_matrix <- dplyr::select(heatmap_matrix, -symbol)
heatmap_matrix$symbol <- uq_symbol_names
#make gene symbols the row names
heatmap_matrix <- data.frame(heatmap_matrix, row.names = 9)
head(heatmap_matrix)
## HCC_1 HCC_125 HCC_101 HCC_109 HCC_118 HCC_25
## FGR 311.714844 432.940303 496.1685 116.5603 1580.7481 1171.2607
## FUCA2 3629.997725 2100.955702 2472.4329 2329.5875 3829.8064 5018.0484
## CFTR 4.353559 4.426792 0.0000 0.0000 203.5751 0.0000
## MCUB 111.451117 202.747095 215.2867 229.0734 444.4587 271.7631
## ABCB4 2133.244046 1781.341286 2485.8884 2452.6233 759.1484 506.2066
## RALA 1100.579785 1284.655172 750.1395 863.6796 1629.4115 2245.8733
## HCC_10 HCC_52
## FGR 912.755799 3599.6708
## FUCA2 6954.702981 10808.1933
## CFTR 5.223209 312.9786
## MCUB 588.916831 1338.7137
## ABCB4 919.284810 326.3323
## RALA 1943.033803 2197.5269
Save/load heatmap_matrix as RDS file
#saveRDS(heatmap_matrix, file = "heatmap_matrix.rds")
# load RDS file of dds if necessary
heatmap_matrix <- readRDS("heatmap_matrix.rds")
Annotation file for data is loaded.
#load annotation file
heatmap_meta <- read.csv("HCCSuccesvsFailure_anno.csv", header = TRUE)
The row names are changed to the sample names within annotation file.
#change row names
row.names(heatmap_meta) <- heatmap_meta$samplename
Select only the conditions to be included.
#select on the presentation column
heatmap_meta <- dplyr::select(heatmap_meta, presentation)
Save/load heatmap_meta as RDS file
#saveRDS(heatmap_meta, file = "heatmap_meta.rds")
# load RDS file if necessary
heatmap_meta <- readRDS("heatmap_meta.rds")
Colors are assigned to the heatmap expression values.
#expression value colors range from green to red
heatmap_colors <- colorRampPalette(colors = c("green", "black", "red"))(30)
#rev(brewer.pal(10, "RdBu"))
Colors are assigned to annotations.
#group colors are assigned red and blue
ann_colors <- list(presentation=c(Success="#fd0000", Failure="#0049bb"))
pheatmap
The heatmap is plotted with additional major arguments.
HCC_heatmap <- as.ggplot(pheatmap(heatmap_matrix,
color = heatmap_colors,
annotation_col = heatmap_meta,
clustering_method = 'ward.D',
show_colnames = T,
show_rownames = F,
scale = "row",
annotation_names_col = F,
fontsize_col = 7,
cellwidth = 12,
cellheight = 0.25,
treeheight_row = 0,
treeheight_col = 20,
))
The heatmap plot is saved as a RDS file.
#saveRDS(HCC_heatmap, file = "HCC_heatmap.rds")
The background gene set is prepared prior to creating gene list.
#get the log2 fold change
original_gene_list <- all_matrix$log2FoldChange
#name the vector
names(original_gene_list) <- all_matrix$input_id
#omit any NA values
background_gene_list<-na.omit(original_gene_list)
# sort the list in decreasing order
background_gene_list = sort(background_gene_list, decreasing = TRUE)
head(background_gene_list)
## ENSG00000172016 ENSG00000174358 ENSG00000126890 ENSG00000079112 ENSG00000106541
## 12.952985 12.082459 11.984914 10.790671 10.164150
## ENSG00000166206
## 9.751104
Data is prepared for significant gene set.
# Extract significant results based on padj
sig_genes_df = subset(all_matrix, padj < 0.05)
#filter using log2FC
sig_genes <- sig_genes_df$log2FoldChange
# Name the vector
names(sig_genes) <- sig_genes_df$input_id
# omit NA values
sig_genes <- na.omit(sig_genes)
# significant results based on log2foldchange
sig_genes <- names(sig_genes)[abs(sig_genes) > 1]
head(sig_genes)
## [1] "ENSG00000000938" "ENSG00000001036" "ENSG00000001626" "ENSG00000005059"
## [5] "ENSG00000005471" "ENSG00000006451"
enrichGO object is created.
go_enrich <- enrichGO(gene = sig_genes,
universe = background_gene_list,
OrgDb = org.Hs.eg.db,
keyType = 'ENSEMBL',
readable = T,
ont = "BP",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05)
enrichGO object is saved/loaded as RDS.
#saveRDS(go_enrich, file = "go_enrich.rds")
#load RDS if necessary
go_enrich <- readRDS("go_enrich.rds")
The 10 most enriched biological processes are displayed in a bargraph
bp_barplot <- as.ggplot(barplot(go_enrich, showCategory = 10, title = "GO Biological Pathways in HCC tumor samples" , font.size = 6 ))
bp_barplot
Save biological pathways barplot as RDS file
#saveRDS(bp_barplot, file = "bp_barplot.rds")
Results are extracted into a table
## Output results from enirchgo in a table
cluster_summary <- data.frame(go_enrich)
write.csv(cluster_summary, "clusterProfiler_HCC.csv")
head(cluster_summary)
## ID Description GeneRatio BgRatio
## GO:0044282 GO:0044282 small molecule catabolic process 60/758 403/21261
## GO:0016054 GO:0016054 organic acid catabolic process 42/758 276/21261
## GO:0046395 GO:0046395 carboxylic acid catabolic process 42/758 276/21261
## GO:0046394 GO:0046394 carboxylic acid biosynthetic process 44/758 361/21261
## GO:0016053 GO:0016053 organic acid biosynthetic process 44/758 364/21261
## GO:0015711 GO:0015711 organic anion transport 51/758 488/21261
## pvalue p.adjust qvalue
## GO:0044282 3.750416e-21 1.839204e-17 1.605178e-17
## GO:0016054 1.693721e-15 2.768670e-12 2.416376e-12
## GO:0046395 1.693721e-15 2.768670e-12 2.416376e-12
## GO:0046394 1.126820e-12 1.381481e-09 1.205697e-09
## GO:0016053 1.500622e-12 1.471810e-09 1.284533e-09
## GO:0015711 6.705804e-12 5.480877e-09 4.783474e-09
## geneID
## GO:0044282 HSD17B6/OAT/PFKP/PKM/ABCB11/ACACB/ECHDC1/UPB1/MLYCD/QPRT/IL4I1/SULT2A1/GCK/ASPA/ENO3/DAO/ALDH2/ENO2/ALDH5A1/EHHADH/FAAH/ALDH8A1/ECHDC2/ACADS/SARDH/TST/APOBEC3A/CDO1/APOE/CYP27A1/AMDHD1/SLC16A3/ABHD1/ADHFE1/IDNK/MAT1A/PFKM/HK2/UROC1/HAAO/LDHD/NUDT8/CYP7A1/GLYCTK/DCXR/AGXT/LEP/PIPOX/ABAT/CYP4F2/CYP4F3/SULT1A1/SULT1A2/ECI2/GALT/ACAD11/LOC102724560/ABCB11/PRODH/FTCD
## GO:0016054 OAT/ABCB11/ACACB/ECHDC1/MLYCD/QPRT/IL4I1/SULT2A1/ASPA/DAO/ALDH5A1/EHHADH/FAAH/ALDH8A1/ECHDC2/ACADS/SARDH/TST/CDO1/AMDHD1/SLC16A3/ABHD1/ADHFE1/IDNK/MAT1A/UROC1/HAAO/LDHD/NUDT8/DCXR/AGXT/LEP/PIPOX/ABAT/CYP4F2/CYP4F3/ECI2/ACAD11/LOC102724560/ABCB11/PRODH/FTCD
## GO:0046395 OAT/ABCB11/ACACB/ECHDC1/MLYCD/QPRT/IL4I1/SULT2A1/ASPA/DAO/ALDH5A1/EHHADH/FAAH/ALDH8A1/ECHDC2/ACADS/SARDH/TST/CDO1/AMDHD1/SLC16A3/ABHD1/ADHFE1/IDNK/MAT1A/UROC1/HAAO/LDHD/NUDT8/DCXR/AGXT/LEP/PIPOX/ABAT/CYP4F2/CYP4F3/ECI2/ACAD11/LOC102724560/ABCB11/PRODH/FTCD
## GO:0046394 MLXIPL/BCAT1/OAT/ACSM2B/ABCB11/PTGS2/ACACB/SLC27A5/SCD/UPB1/MTHFD1/MLYCD/APOA5/APOC3/PECR/ALDH8A1/MTHFD1L/CDO1/APOC1/ACSS2/ALOX5AP/CYP27A1/BAAT/ABHD1/BHMT/NAGS/HAAO/FABP5/AVPR1A/ACSM1/CYP7A1/GATM/AGXT/SEPHS2/CYP8B1/ABAT/ACSM5/ACSM2A/INSIG1/LOC102724560/ACACA/BAAT/ABCB11/ADI1
## GO:0016053 MLXIPL/BCAT1/OAT/ACSM2B/ABCB11/PTGS2/ACACB/SLC27A5/SCD/UPB1/MTHFD1/MLYCD/APOA5/APOC3/PECR/ALDH8A1/MTHFD1L/CDO1/APOC1/ACSS2/ALOX5AP/CYP27A1/BAAT/ABHD1/BHMT/NAGS/HAAO/FABP5/AVPR1A/ACSM1/CYP7A1/GATM/AGXT/SEPHS2/CYP8B1/ABAT/ACSM5/ACSM2A/INSIG1/LOC102724560/ACACA/BAAT/ABCB11/ADI1
## GO:0015711 CFTR/ABCB4/SLC11A1/SLC2A3/ABCB11/PTGS2/SLC27A5/ABCC6/SLCO4A1/AQP9/MFSD10/SLC6A12/SLC17A2/SLC16A10/RGS4/SLC2A1/APOE/SLC6A8/LRRC8A/SLC22A7/SLC7A1/SLC39A6/SLC16A3/CRABP2/SNCA/SLC17A4/SLC5A12/SLC6A1/SV2A/SLC1A7/SLC2A2/SLC51A/FABP5/AVPR1A/CYP7A1/FABP6/AGXT/SLC16A13/SLC6A19/LEP/SLC25A42/SLC25A18/ABAT/SLC25A10/SLC22A10/CYP4F2/SLC22A25/NMB/SLC10A5/ABCC6/ABCB11
## Count
## GO:0044282 60
## GO:0016054 42
## GO:0046395 42
## GO:0046394 44
## GO:0016053 44
## GO:0015711 51
The identification is converted to entrezID and duplicates are removed.
# Convert gene IDs
ids<-bitr(names(original_gene_list),
fromType = "ENSEMBL",
toType = "ENTREZID",
OrgDb="org.Hs.eg.db")
# remove duplicate IDS
dedup_ids = ids[!duplicated(ids[c("ENSEMBL")]),]
A new dataframe is created of only genes that were converted in previous step.
# Create a new dataframe
entrez_df = all_matrix[all_matrix$input_id %in% dedup_ids$ENSEMBL,]
# Create a new column for entrezIDs
entrez_df <- merge(all_matrix, dedup_ids, by.x = "input_id", by.y = "ENSEMBL")
str(entrez_df)
## 'data.frame': 936 obs. of 20 variables:
## $ input_id : chr "ENSG00000000938" "ENSG00000001036" "ENSG00000001626" "ENSG00000005059" ...
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ baseMean : num 1077.7 4643 66.3 425.3 1420.5 ...
## $ log2FoldChange: num 2.42 1.34 5.86 1.8 -1.82 ...
## $ lfcSE : num 0.627 0.442 1.757 0.576 0.42 ...
## $ stat : num 3.86 3.02 3.33 3.13 -4.33 ...
## $ pvalue : num 1.12e-04 2.49e-03 8.54e-04 1.76e-03 1.46e-05 ...
## $ padj : num 0.00678 0.04885 0.02495 0.03866 0.00161 ...
## $ HCC_1 : num 311.71 3630 4.35 111.45 2133.24 ...
## $ HCC_125 : num 432.94 2100.96 4.43 202.75 1781.34 ...
## $ HCC_101 : num 496 2472 0 215 2486 ...
## $ HCC_109 : num 117 2330 0 229 2453 ...
## $ HCC_118 : num 1581 3830 204 444 759 ...
## $ HCC_25 : num 1171 5018 0 272 506 ...
## $ HCC_10 : num 912.76 6954.7 5.22 588.92 919.28 ...
## $ HCC_52 : num 3600 10808 313 1339 326 ...
## $ threshold : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ symbol : chr "FGR" "FUCA2" "CFTR" "MCUB" ...
## $ diffexpressed : chr "Upregulated" "Upregulated" "Upregulated" "Upregulated" ...
## $ ENTREZID : chr "2268" "2519" "1080" "55013" ...
This dataframe is saved/loaded as RDS.
#saveRDS(entrez_df, file = "entrez_df.rds")
#load RDS if necessary
entrez_df <- readRDS("entrez_df.rds")
This dataframe is also written into a new file.
write.csv(entrez_df, "HCC_Success_vs_Failure_entrezid.csv", row.names = TRUE)
Dataframe will be read in.
gsea_data <- read.csv("HCC_Success_vs_Failure_entrezid.csv", header = TRUE)
#remove extra columns
gsea_data <- dplyr::select(gsea_data, -X)
gsea_data <- dplyr::select(gsea_data, -X.1)
Remove duplicates and insignificant values
## Remove NA values
res_entrez <- dplyr::filter(gsea_data, ENTREZID != "NA")
## Remove entrez duplicates
res_entrez <- res_entrez[which(duplicated(res_entrez$ENTREZID) == F), ]
Fold changes will be extracted.
## Extract foldchanges
foldchanges <- res_entrez$log2FoldChange
## Name fold change with the Entrez ID
names(foldchanges) <- res_entrez$ENTREZID
## Sort fold changes in decreasing order
foldchanges <- sort(foldchanges, decreasing = TRUE)
Save these as RDS files
#saveRDS(foldchanges, "foldchanges.RDS")
foldchanges <- readRDS("foldchanges.RDS")
gseaKEGG <- gseKEGG(geneList = foldchanges,
organism = "hsa",
nPerm = 1000,
minGSSize = 20,
pvalueCutoff = 0.05,
verbose = FALSE)
Save this output as a RDS file
#saveRDS(gseaKEGG, file = "gseaKEGG.RDS")
gseaKEGG <- readRDS("gseaKEGG.RDS")
Results are extracted.
gseaKEGG_results <- gseaKEGG@result
GSEA plot visualization is performed.
enrich_gsea1 <- as.ggplot(gseaplot2(gseaKEGG,
title = gseaKEGG_results$Description[1],
geneSetID = 1,
rel_heights = 2, base_size = 10))
enrich_gsea1
enrich_gsea2 <- as.ggplot(gseaplot2(gseaKEGG,
title = gseaKEGG_results$Description[2],
geneSetID = 2,
rel_heights = 2, base_size = 10))
enrich_gsea2
GSEA plots are saved as RDS files
#saveRDS(enrich_gsea1, file = "GSEA_1.rds")
#saveRDS(enrich_gsea2, file = "GSEA_2.rds")
Download MSigDB gene sets from human category chemical and genetic perturbations
CGP_gene_sets <- msigdbr(species = "human", category = "C2",
subcategory = "CGP") %>%
dplyr::select(gs_name, entrez_gene)
head(CGP_gene_sets)
## # A tibble: 6 × 2
## gs_name entrez_gene
## <chr> <int>
## 1 ABBUD_LIF_SIGNALING_1_DN 79026
## 2 ABBUD_LIF_SIGNALING_1_DN 214
## 3 ABBUD_LIF_SIGNALING_1_DN 91369
## 4 ABBUD_LIF_SIGNALING_1_DN 8289
## 5 ABBUD_LIF_SIGNALING_1_DN 594
## 6 ABBUD_LIF_SIGNALING_1_DN 146556
Save gene sets in RDS file.
#saveRDS(CGP_gene_sets, file= "CGP_gene_sets.rds")
CGP_gene_sets <- readRDS("CGP_gene_sets.rds")
Run GSEA on this.
msig_CGP_GSEA <- GSEA(foldchanges, TERM2GENE = CGP_gene_sets, verbose = FALSE)
Save as a RDS file
#saveRDS(msig_CGP_GSEA, "msig_CGP_GSEA.rds")
msig_CGP_GSEA <- readRDS("msig_CGP_GSEA.rds")
Extract GSEA results
gseaCGP_results <- msig_CGP_GSEA@result
# Write to file
write.csv(gseaCGP_results, "gseaCGP_results.csv", quote=F)
#View(gseaGOBP_results)
Compare to gene sets identified by literature.
## gsea plot for single pathway
prol_GSEA <- as.ggplot(gseaplot2(msig_CGP_GSEA, title = 'CHIANG_LIVER_CANCER_SUBCLASS_PROLIFERATION_UP', geneSetID = 'CHIANG_LIVER_CANCER_SUBCLASS_PROLIFERATION_UP', base_size = 10))
prol_GSEA
## gsea plot for single pathway
HCC_GSEA <- as.ggplot(gseaplot2(msig_CGP_GSEA, title = 'DESERT_STEM_CELL_HEPATOCELLULAR_CARCINOMA_SUBCLASS_UP', geneSetID = 'DESERT_STEM_CELL_HEPATOCELLULAR_CARCINOMA_SUBCLASS_UP', base_size = 10))
HCC_GSEA
Save as RDS files
saveRDS(prol_GSEA, file = "prol_GSEA.rds")
saveRDS(HCC_GSEA, file = "HCC_GSEA.rds")
Reload individual figures.
heatmap <- readRDS("HCC_heatmap.rds")
pca <- readRDS("PCAplot.rds")
bp_bar <- readRDS("bp_barplot.rds")
gsea_1 <- readRDS("GSEA_1.rds")
gsea_2 <- readRDS("GSEA_2.rds")
prol_GSEA <- readRDS("prol_GSEA.rds")
HCC_GSEA <- readRDS("HCC_GSEA.rds")
Cowplot is used to combine separate figures.
#for orginal figure
Fig1A <- plot_grid(heatmap, labels = "A")
Fig1B <- plot_grid(HCC_GSEA, prol_GSEA, labels = "B", nrow = 2)
orig_fig <- plot_grid(Fig1A, Fig1B, align = "hv")
ggsave(filename = "original_figure.png",
plot = orig_fig,
units = "in",
width = 15,
height = 6,
dpi = 300)
orig_fig
#additional figure
FigAB <- plot_grid(gsea_1, gsea_2, labels = c("A","B"))
FigC <- plot_grid(bp_bar, labels = "C")
comb_fig <- plot_grid(FigAB,FigC, nrow = 2)
ggsave(filename = "additional_figure.png",
plot = comb_fig,
units = "in",
width = 5,
height = 6,
dpi = 300)
comb_fig
Save each individual figures as their own png files
ggsave(filename = "heatmap.png",
plot =Fig1A,
units = "in",
width = 5,
height = 6,
dpi = 300)
ggsave(filename = "lit_GSEA.png",
plot = Fig1B,
units = "in",
width = 12,
height = 6,
dpi = 300)
ggsave(filename = "add_GSEA.png",
plot = FigAB,
units = "in",
width = 10,
height = 6,
dpi = 300)
ggsave(filename = "bp_bar.png",
plot = FigC,
units = "in",
width = 8,
height = 6,
dpi = 300)
ggsave(filename = "PCA.png",
plot = pca,
units = "in",
width = 8,
height = 6,
dpi = 300)
sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Red Hat Enterprise Linux Server 7.9 (Maipo)
##
## Matrix products: default
## BLAS: /gpfs1/sw/x86_64/rh7/pkgs/stacks/gcc/10.5.0/R/4.3.2-1/lib64/R/lib/libRblas.so
## LAPACK: /gpfs1/sw/x86_64/rh7/pkgs/stacks/gcc/10.5.0/R/4.3.2-1/lib64/R/lib/libRlapack.so; LAPACK version 3.11.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] fstcore_0.9.18 ggnewscale_0.4.10
## [3] DOSE_3.28.2 org.Hs.eg.db_3.18.0
## [5] genekitr_1.2.5 GSEABase_1.64.0
## [7] graph_1.80.0 annotate_1.80.0
## [9] XML_3.99-0.16.1 AnnotationDbi_1.64.1
## [11] enrichplot_1.22.0 pathview_1.42.0
## [13] msigdbr_7.5.1 clusterProfiler_4.10.1
## [15] ggplotify_0.1.2 patchwork_1.2.0
## [17] cowplot_1.1.3 lubridate_1.9.3
## [19] forcats_1.0.0 stringr_1.5.1
## [21] dplyr_1.1.4 purrr_1.0.2
## [23] readr_2.1.4 tidyr_1.3.0
## [25] tibble_3.2.1 tidyverse_2.0.0
## [27] ggplot2_3.5.1 pheatmap_1.0.12
## [29] RColorBrewer_1.1-3 DESeq2_1.42.1
## [31] SummarizedExperiment_1.32.0 Biobase_2.62.0
## [33] MatrixGenerics_1.14.0 matrixStats_1.3.0
## [35] GenomicRanges_1.54.1 GenomeInfoDb_1.38.8
## [37] IRanges_2.36.0 S4Vectors_0.40.2
## [39] BiocGenerics_0.48.1 knitr_1.45
##
## loaded via a namespace (and not attached):
## [1] splines_4.3.2 bitops_1.0-7 urltools_1.7.3
## [4] triebeard_0.4.1 polyclip_1.10-6 lifecycle_1.0.4
## [7] lattice_0.21-9 MASS_7.3-60 magrittr_2.0.3
## [10] openxlsx_4.2.5.2 sass_0.4.7 rmarkdown_2.25
## [13] jquerylib_0.1.4 yaml_2.3.7 zip_2.3.1
## [16] ggvenn_0.1.10 DBI_1.2.2 abind_1.4-5
## [19] zlibbioc_1.48.2 ggraph_2.2.1 RCurl_1.98-1.14
## [22] yulab.utils_0.1.4 tweenr_2.0.3 GenomeInfoDbData_1.2.11
## [25] ggrepel_0.9.5 tidytree_0.4.6 codetools_0.2-19
## [28] DelayedArray_0.28.0 xml2_1.3.5 ggforce_0.4.2
## [31] tidyselect_1.2.0 aplot_0.2.2 farver_2.1.1
## [34] viridis_0.6.5 jsonlite_1.8.7 fst_0.9.8
## [37] tidygraph_1.3.1 systemfonts_1.0.5 tools_4.3.2
## [40] progress_1.2.2 ragg_1.2.6 treeio_1.26.0
## [43] Rcpp_1.0.11 glue_1.6.2 gridExtra_2.3
## [46] SparseArray_1.2.4 xfun_0.41 qvalue_2.34.0
## [49] withr_2.5.2 fastmap_1.1.1 fansi_1.0.5
## [52] digest_0.6.33 timechange_0.2.0 R6_2.5.1
## [55] gridGraphics_0.5-1 textshaping_0.3.7 colorspace_2.1-0
## [58] GO.db_3.18.0 RSQLite_2.3.6 utf8_1.2.4
## [61] generics_0.1.3 data.table_1.14.8 prettyunits_1.2.0
## [64] graphlayouts_1.1.1 httr_1.4.7 S4Arrays_1.2.1
## [67] scatterpie_0.2.2 pkgconfig_2.0.3 gtable_0.3.4
## [70] blob_1.2.4 XVector_0.42.0 shadowtext_0.1.3
## [73] htmltools_0.5.7 geneset_0.2.7 fgsea_1.28.0
## [76] scales_1.3.0 png_0.1-8 ggfun_0.1.4
## [79] rstudioapi_0.15.0 tzdb_0.4.0 reshape2_1.4.4
## [82] nlme_3.1-163 cachem_1.0.8 parallel_4.3.2
## [85] HDO.db_0.99.1 pillar_1.9.0 grid_4.3.2
## [88] vctrs_0.6.4 xtable_1.8-4 Rgraphviz_2.46.0
## [91] evaluate_0.23 KEGGgraph_1.62.0 cli_3.6.1
## [94] locfit_1.5-9.9 compiler_4.3.2 rlang_1.1.2
## [97] crayon_1.5.2 labeling_0.4.3 plyr_1.8.9
## [100] fs_1.6.3 stringi_1.8.2 viridisLite_0.4.2
## [103] BiocParallel_1.36.0 babelgene_22.9 munsell_0.5.0
## [106] Biostrings_2.70.3 lazyeval_0.2.2 GOSemSim_2.28.1
## [109] Matrix_1.6-1.1 europepmc_0.4.3 hms_1.1.3
## [112] bit64_4.0.5 KEGGREST_1.42.0 highr_0.10
## [115] igraph_2.0.3 memoise_2.0.1 bslib_0.6.1
## [118] ggtree_3.10.1 fastmatch_1.1-4 bit_4.0.5
## [121] ape_5.8 gson_0.1.0