Dataset background

The data being analyzed in this project is a bulk RNA-seq of 8 different human tumor samples.These samples were surgically removed from various patients with hepatocellular carcinoma (HCC) primary liver cancer. These samples are divided into 2 groups, tumors that were able to successfully form tumor organoids and those that could not, with 4 biological replicates for each. The goal of this analysis is to determine what molecular mechanisms differ between successful and failed tumor samples

Raw RNA-seq data can be found with the SRA accession PRJNA841062 and is sourced from the following scientific paper:

Xian, L., Zhao, P., Chen, X., Wei, Z., Ji, H., Zhao, J., Liu, W., Li, Z., Liu, D., Han, X., Qian, Y., Dong, H., Zhou, X., Fan, J., Zhu, X., Yin, J., Tan, X., Jiang, D., Yu, H., & Cao, G. (2022). Heterogeneity, inherent and acquired drug resistance in patient-derived organoid models of primary liver cancer. Cellular oncology (Dordrecht), 45(5), 10191036. https://doi.org/10.1007/s13402-022-00707-3

Load the required libraries

The following libraries are necessary for data analysis and visualization.

library(knitr)
library(DESeq2) 
library(RColorBrewer)
library(pheatmap)
library(ggplot2)
library(tidyverse)
library(cowplot)
library(tidyr)
library(dplyr)
library(patchwork)
library(ggplotify)
library(clusterProfiler)
library(msigdbr)
library(pathview)
library(enrichplot) 
library(msigdbr)
library(GSEABase)
library(genekitr)
library(org.Hs.eg.db)
library(DOSE) 
library(ggnewscale)

htseq-count input

The file ‘PRJNA841062_HCC_anno.csv’ is loaded as the input annotation file into an object called ‘annotationTable’. This object is converted into a data frame format.

#load annotation file
sampleTable <- read.csv("PRJNA841062_HCC_anno.csv", header = TRUE)
sampleTable <- data.frame(sampleTable)
head(sampleTable)
##   samplename           filename presentation
## 1      HCC_1   HCC_1.ENSG.count      Failure
## 2    HCC_125 HCC_125.ENSG.count      Failure
## 3    HCC_101 HCC_101.ENSG.count      Failure
## 4    HCC_109 HCC_109.ENSG.count      Failure
## 5    HCC_118 HCC_118.ENSG.count      Success
## 6     HCC_25  HCC_25.ENSG.count      Success

The “presentation” column is specified as a factor by first creating a ‘presentation’ object (allowing categorization of data).

#specify the presentation column as factor 
presentation <- factor(sampleTable$presentation)

Running DESeq2 with DESeqDataSet

The design is set as ‘~presentation’, where the presentation is identified as the distinguishing parameter in data comparison. ‘dds’ is created as the DESEQ2 object

# perform DESeqDataSet 
dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                  directory = ".",
                                  design = ~presentation)
dds
## class: DESeqDataSet 
## dim: 69222 8 
## metadata(1): version
## assays(1): counts
## rownames(69222): ENSG00000000003.15 ENSG00000000005.6 ...
##   ENSG00000291316.1 ENSG00000291317.1
## rowData names(0):
## colnames(8): HCC_1 HCC_125 ... HCC_10 HCC_52
## colData names(1): presentation

Pre-filtering

Prefiltering is performed to filter out any insignificant rows that have less than 10 reads. A new object is created for this filtering.

#filter out insignificant data
keep <- rowSums(counts(dds)) >= 10 
dds <- dds[keep,]

The reference level is identified as the ‘Failure’ presentation group, which will be considered the control group by R.

dds$presentation <- relevel(dds$presentation, ref = "Failure")

The DESEQ2 Object is saved/loaded as RDS file

# saveRDS(dds, file = "dds.rds")
# load RDS file of dds if necessary 
dds <- readRDS("dds.rds")

Differential expression analysis

Differential expression analysis is conducted using ‘DESeq()’.

#perform differentioal expression analysis
DESeq_dds <- DESeq(dds)

Differential expression analysis output is saved/loaded as RDS file

#saveRDS(DESeq_dds, file = "DESeq_dds.rds")
# load RDS file if necessary 
DESeq_dds <- readRDS("DESeq_dds.rds")

The results of the analysis are extracted in a tabular format into a new object, with a comparison between the two presentation groups. Interpretations of the table columns are as follows: ‘baseMean’ = average of normalized counts of specific gene across all samples ‘log2FoldChange’ = log 2 fold change between the Success and Failure groups of HCC organoids ‘pvalue’ = P value (from Wald test) ‘padj’ = P adjusted value (Benjamini-Hochberg adjusted)

#extract results from DESeq
DESeq_result <- results(DESeq_dds, contrast = c("presentation", "Success", "Failure"))
head(DESeq_result)
## log2 fold change (MLE): presentation Success vs Failure 
## Wald test p-value: presentation Success vs Failure 
## DataFrame with 6 rows and 6 columns
##                     baseMean log2FoldChange     lfcSE      stat      pvalue
##                    <numeric>      <numeric> <numeric> <numeric>   <numeric>
## ENSG00000000003.15 2624.8871      -0.415806  0.491689 -0.845669 0.397737253
## ENSG00000000005.6     2.3538      -3.122146  2.335452 -1.336849 0.181272018
## ENSG00000000419.14 1550.9591       0.497281  0.383389  1.297067 0.194608247
## ENSG00000000457.14  618.2154      -0.364450  0.543778 -0.670218 0.502718570
## ENSG00000000460.17  353.3706       0.447382  0.382170  1.170638 0.241744226
## ENSG00000000938.13 1077.7274       2.421342  0.626762  3.863259 0.000111884
##                          padj
##                     <numeric>
## ENSG00000000003.15 0.74283710
## ENSG00000000005.6          NA
## ENSG00000000419.14 0.54942984
## ENSG00000000457.14 0.80791547
## ENSG00000000460.17 0.60408150
## ENSG00000000938.13 0.00677819

The results are sorted by P adjusted values and placed in a table

#sort by padj
DESeq_sorted_result <- DESeq_result [order(DESeq_result $padj), ]
head(DESeq_sorted_result)
## log2 fold change (MLE): presentation Success vs Failure 
## Wald test p-value: presentation Success vs Failure 
## DataFrame with 6 rows and 6 columns
##                     baseMean log2FoldChange     lfcSE      stat      pvalue
##                    <numeric>      <numeric> <numeric> <numeric>   <numeric>
## ENSG00000266964.6    137.612       -6.96926  0.671256 -10.38243 2.98102e-25
## ENSG00000174358.16  1066.969       12.08246  1.289632   9.36892 7.32776e-21
## ENSG00000019169.11   592.666        6.89127  0.778820   8.84836 8.88171e-19
## ENSG00000073734.10   541.499       -5.87127  0.683452  -8.59062 8.65002e-18
## ENSG00000183549.10  1676.780       -4.81096  0.561967  -8.56093 1.11960e-17
## ENSG00000106541.12  5203.983       10.16415  1.279555   7.94350 1.96554e-15
##                           padj
##                      <numeric>
## ENSG00000266964.6  6.26670e-21
## ENSG00000174358.16 7.70220e-17
## ENSG00000019169.11 6.22371e-15
## ENSG00000073734.10 4.54602e-14
## ENSG00000183549.10 4.70724e-14
## ENSG00000106541.12 6.88658e-12
table(DESeq_sorted_result$padj<0.05)
## 
## FALSE  TRUE 
## 19943  1079

The sorted results are saved/loaded as RDS file

#saveRDS(DESeq_sorted_result, file = "DESeq_sorted_result.rds")
# load RDS file if necessary 
DESeq_sorted_result <- readRDS("DESeq_sorted_result.rds")

The DESeq results are merged with the counts data. The first column is also renamed with the Gene IDs.

#merge counts and DESeq
resdata <- merge(as.data.frame(DESeq_sorted_result), as.data.frame(counts(DESeq_dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene_id"
head(resdata)
##              Gene_id  baseMean log2FoldChange     lfcSE       stat       pvalue
## 1  ENSG00000266964.6  137.6119      -6.969261 0.6712556 -10.382426 2.981022e-25
## 2 ENSG00000174358.16 1066.9688      12.082459 1.2896319   9.368921 7.327756e-21
## 3 ENSG00000019169.11  592.6665       6.891274 0.7788197   8.848357 8.881707e-19
## 4 ENSG00000073734.10  541.4995      -5.871274 0.6834516  -8.590621 8.650017e-18
## 5 ENSG00000183549.10 1676.7798      -4.810957 0.5619666  -8.560931 1.119598e-17
## 6 ENSG00000106541.12 5203.9835      10.164150 1.2795554   7.943501 1.965536e-15
##           padj       HCC_1    HCC_125     HCC_101     HCC_109     HCC_118
## 1 6.266704e-21  206.358710  336.43623  195.103552  353.728185    0.811056
## 2 7.702204e-17    0.000000    0.00000    0.000000    1.618893 1231.182993
## 3 6.223708e-15    6.094983   11.50966    3.363854   17.807826  466.357194
## 4 4.546017e-14  622.558977 1158.93427 1283.310433 1193.124358   13.787952
## 5 4.707238e-14 1749.260117 2362.13646 4305.733562 4534.520119   48.663359
## 6 6.886584e-12    4.353559   27.44611    1.681927    2.428340 1180.897522
##        HCC_25       HCC_10       HCC_52
## 1    2.870737     3.917407     1.669219
## 2 4330.028455   425.691546  2547.228247
## 3  656.441883   680.322992  2899.433463
## 4    9.569124    45.703080     5.007657
## 5  116.743309   203.705157    93.476266
## 6  162.675102 14652.407463 25599.977727

The results are written as a new file.

write.csv(resdata, file = "HCCSuccessvsFailure_ unfiltered_matrix.csv")

DEGs are filtered by significance

A list of differentially expressed genes are acquired from the previous data. These DEGs are filtered by a p adjusted value of < 0.05 and a log 2 fold change of 1.

#filter out insignificant data by padj and log2FC
resdata <- as.data.frame(resdata) %>% dplyr::filter(abs(log2FoldChange) > 1 & padj < 0.05)
head(resdata)
##              Gene_id  baseMean log2FoldChange     lfcSE       stat       pvalue
## 1  ENSG00000266964.6  137.6119      -6.969261 0.6712556 -10.382426 2.981022e-25
## 2 ENSG00000174358.16 1066.9688      12.082459 1.2896319   9.368921 7.327756e-21
## 3 ENSG00000019169.11  592.6665       6.891274 0.7788197   8.848357 8.881707e-19
## 4 ENSG00000073734.10  541.4995      -5.871274 0.6834516  -8.590621 8.650017e-18
## 5 ENSG00000183549.10 1676.7798      -4.810957 0.5619666  -8.560931 1.119598e-17
## 6 ENSG00000106541.12 5203.9835      10.164150 1.2795554   7.943501 1.965536e-15
##           padj       HCC_1    HCC_125     HCC_101     HCC_109     HCC_118
## 1 6.266704e-21  206.358710  336.43623  195.103552  353.728185    0.811056
## 2 7.702204e-17    0.000000    0.00000    0.000000    1.618893 1231.182993
## 3 6.223708e-15    6.094983   11.50966    3.363854   17.807826  466.357194
## 4 4.546017e-14  622.558977 1158.93427 1283.310433 1193.124358   13.787952
## 5 4.707238e-14 1749.260117 2362.13646 4305.733562 4534.520119   48.663359
## 6 6.886584e-12    4.353559   27.44611    1.681927    2.428340 1180.897522
##        HCC_25       HCC_10       HCC_52
## 1    2.870737     3.917407     1.669219
## 2 4330.028455   425.691546  2547.228247
## 3  656.441883   680.322992  2899.433463
## 4    9.569124    45.703080     5.007657
## 5  116.743309   203.705157    93.476266
## 6  162.675102 14652.407463 25599.977727

The results are written as a new file.

write.csv(resdata, file = "HCCSuccessvsFailure_ filtered_matrix.csv")

Padj threshold and vector are added to results table.

#significant padj adjustment
threshold_padj <- resdata$padj < 0.05 
length(which(threshold_padj))
## [1] 1053
resdata$threshold <- threshold_padj
str(resdata)
## 'data.frame':    1053 obs. of  16 variables:
##  $ Gene_id       : 'AsIs' chr  "ENSG00000266964.6" "ENSG00000174358.16" "ENSG00000019169.11" "ENSG00000073734.10" ...
##  $ baseMean      : num  138 1067 593 541 1677 ...
##  $ log2FoldChange: num  -6.97 12.08 6.89 -5.87 -4.81 ...
##  $ lfcSE         : num  0.671 1.29 0.779 0.683 0.562 ...
##  $ stat          : num  -10.38 9.37 8.85 -8.59 -8.56 ...
##  $ pvalue        : num  2.98e-25 7.33e-21 8.88e-19 8.65e-18 1.12e-17 ...
##  $ padj          : num  6.27e-21 7.70e-17 6.22e-15 4.55e-14 4.71e-14 ...
##  $ HCC_1         : num  206.36 0 6.09 622.56 1749.26 ...
##  $ HCC_125       : num  336.4 0 11.5 1158.9 2362.1 ...
##  $ HCC_101       : num  195.1 0 3.36 1283.31 4305.73 ...
##  $ HCC_109       : num  353.73 1.62 17.81 1193.12 4534.52 ...
##  $ HCC_118       : num  0.811 1231.183 466.357 13.788 48.663 ...
##  $ HCC_25        : num  2.87 4330.03 656.44 9.57 116.74 ...
##  $ HCC_10        : num  3.92 425.69 680.32 45.7 203.71 ...
##  $ HCC_52        : num  1.67 2547.23 2899.43 5.01 93.48 ...
##  $ threshold     : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...

The dataframe is altered to add gene symbols and upregulation/downregulation based on padj and log2FC

#add column that includes gene symbols 
resdata$Gene_id <- sub("\\.\\d+", "", resdata$Gene_id)
ids <- resdata$Gene_id
gene_symbol_map <- transId(ids, transTo = "symbol", org = "human")

names(resdata)[1] = "input_id"
resdata_gene <- merge(resdata, gene_symbol_map, by.x = "input_id", by.y = "input_id")

#specify up or downregulation based on padj and log2FC values, and add as new column
resdata_gene$diffexpressed <- "Unchanged"
resdata_gene$diffexpressed[resdata_gene$padj < 0.05 & resdata_gene$log2FoldChange > 1] <- "Upregulated"
resdata_gene$diffexpressed[resdata_gene$padj < 0.05 & resdata_gene$log2FoldChange < -1] <- "Downregulated"

str(resdata_gene)
## 'data.frame':    980 obs. of  18 variables:
##  $ input_id      : 'AsIs' chr  "ENSG00000000938" "ENSG00000001036" "ENSG00000001626" "ENSG00000005059" ...
##  $ baseMean      : num  1077.7 4643 66.3 425.3 1420.5 ...
##  $ log2FoldChange: num  2.42 1.34 5.86 1.8 -1.82 ...
##  $ lfcSE         : num  0.627 0.442 1.757 0.576 0.42 ...
##  $ stat          : num  3.86 3.02 3.33 3.13 -4.33 ...
##  $ pvalue        : num  1.12e-04 2.49e-03 8.54e-04 1.76e-03 1.46e-05 ...
##  $ padj          : num  0.00678 0.04885 0.02495 0.03866 0.00161 ...
##  $ HCC_1         : num  311.71 3630 4.35 111.45 2133.24 ...
##  $ HCC_125       : num  432.94 2100.96 4.43 202.75 1781.34 ...
##  $ HCC_101       : num  496 2472 0 215 2486 ...
##  $ HCC_109       : num  117 2330 0 229 2453 ...
##  $ HCC_118       : num  1581 3830 204 444 759 ...
##  $ HCC_25        : num  1171 5018 0 272 506 ...
##  $ HCC_10        : num  912.76 6954.7 5.22 588.92 919.28 ...
##  $ HCC_52        : num  3600 10808 313 1339 326 ...
##  $ threshold     : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ symbol        : chr  "FGR" "FUCA2" "CFTR" "MCUB" ...
##  $ diffexpressed : chr  "Upregulated" "Upregulated" "Upregulated" "Upregulated" ...

The results are written as a new file.

write.csv(resdata_gene, "HCC_Success_vs_Failure_DEG.csv", row.names = TRUE)

PCA Plot Visualization

Differential expression analysis data is used to create a PCA plot.

vstcounts <- vst(DESeq_dds, blind = FALSE)
PCAplot <- plotPCA(vstcounts, intgroup=c("presentation"), returnData=TRUE)
percentVar <- round(100 * attr(PCAplot, "percentVar"))
options(bitmapType="cairo")
#use ggplot to edit 
HCCpca <- ggplot(PCAplot, aes(PC1, PC2, color=presentation)) +
  geom_point(size=2) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed()
HCCpca

The PCA plot is saved as RDS file.

#saveRDS(HCCpca, file = "PCAplot.rds")

Heatmap visualization

Loading data

The previously written .csv files are loaded as newly named objects.

#load DEG file
all_matrix <- read.csv("HCC_Success_vs_Failure_DEG.csv", header = TRUE)
all_matrix <- data_frame(all_matrix)

Dataframe for the data required for heatmap visualization is loaded. Gene symbols are altered in consideration of multiple gene IDs matching to the same symbol.

#create dataframe with only sample columns from DEG dataframe
heatmap_matrix <- dplyr:::select.data.frame(all_matrix,symbol, HCC_1:HCC_52)

#change duplicated gene symbols
uq_symbol_names <- make.names(heatmap_matrix[[1]], unique = T)
heatmap_matrix <- dplyr::select(heatmap_matrix, -symbol)
heatmap_matrix$symbol <- uq_symbol_names

#make gene symbols the row names
heatmap_matrix <- data.frame(heatmap_matrix, row.names = 9)

head(heatmap_matrix)
##             HCC_1     HCC_125   HCC_101   HCC_109   HCC_118    HCC_25
## FGR    311.714844  432.940303  496.1685  116.5603 1580.7481 1171.2607
## FUCA2 3629.997725 2100.955702 2472.4329 2329.5875 3829.8064 5018.0484
## CFTR     4.353559    4.426792    0.0000    0.0000  203.5751    0.0000
## MCUB   111.451117  202.747095  215.2867  229.0734  444.4587  271.7631
## ABCB4 2133.244046 1781.341286 2485.8884 2452.6233  759.1484  506.2066
## RALA  1100.579785 1284.655172  750.1395  863.6796 1629.4115 2245.8733
##            HCC_10     HCC_52
## FGR    912.755799  3599.6708
## FUCA2 6954.702981 10808.1933
## CFTR     5.223209   312.9786
## MCUB   588.916831  1338.7137
## ABCB4  919.284810   326.3323
## RALA  1943.033803  2197.5269

Save/load heatmap_matrix as RDS file

#saveRDS(heatmap_matrix, file = "heatmap_matrix.rds")
# load RDS file of dds if necessary 
heatmap_matrix <- readRDS("heatmap_matrix.rds")

Annotation file for data is loaded.

#load annotation file
heatmap_meta <- read.csv("HCCSuccesvsFailure_anno.csv", header = TRUE)

Set up parameters for pheatmap

The row names are changed to the sample names within annotation file.

#change row names
row.names(heatmap_meta) <- heatmap_meta$samplename

Select only the conditions to be included.

#select on the presentation column
heatmap_meta <- dplyr::select(heatmap_meta, presentation)

Save/load heatmap_meta as RDS file

#saveRDS(heatmap_meta, file = "heatmap_meta.rds")
# load RDS file if necessary 
heatmap_meta <- readRDS("heatmap_meta.rds")

Colors are assigned to the heatmap expression values.

#expression value colors range from green to red 
heatmap_colors <- colorRampPalette(colors = c("green", "black", "red"))(30)
#rev(brewer.pal(10, "RdBu"))

Colors are assigned to annotations.

#group colors are assigned red and blue
ann_colors <- list(presentation=c(Success="#fd0000", Failure="#0049bb"))

Plotting with pheatmap

The heatmap is plotted with additional major arguments.

HCC_heatmap <- as.ggplot(pheatmap(heatmap_matrix,
         color = heatmap_colors,
         annotation_col = heatmap_meta,
         clustering_method = 'ward.D',
         show_colnames = T,
         show_rownames = F,
         scale = "row",
         annotation_names_col = F,
         fontsize_col = 7,
         cellwidth = 12,
         cellheight = 0.25,
         treeheight_row = 0,
         treeheight_col = 20,
        
         ))

The heatmap plot is saved as a RDS file.

#saveRDS(HCC_heatmap, file = "HCC_heatmap.rds")

Figures for Biological Signifcance

Set up gene lists

The background gene set is prepared prior to creating gene list.

#get the log2 fold change
original_gene_list <- all_matrix$log2FoldChange

#name the vector
names(original_gene_list) <- all_matrix$input_id

#omit any NA values
background_gene_list<-na.omit(original_gene_list)

# sort the list in decreasing order
background_gene_list = sort(background_gene_list, decreasing = TRUE)

head(background_gene_list)
## ENSG00000172016 ENSG00000174358 ENSG00000126890 ENSG00000079112 ENSG00000106541 
##       12.952985       12.082459       11.984914       10.790671       10.164150 
## ENSG00000166206 
##        9.751104

Data is prepared for significant gene set.

# Extract significant results based on padj
sig_genes_df = subset(all_matrix, padj < 0.05)

#filter using log2FC
sig_genes <- sig_genes_df$log2FoldChange

# Name the vector 
names(sig_genes) <- sig_genes_df$input_id

# omit NA values
sig_genes <- na.omit(sig_genes)

# significant results based on log2foldchange
sig_genes <- names(sig_genes)[abs(sig_genes) > 1]
head(sig_genes)
## [1] "ENSG00000000938" "ENSG00000001036" "ENSG00000001626" "ENSG00000005059"
## [5] "ENSG00000005471" "ENSG00000006451"

enrichGO object is created.

go_enrich <- enrichGO(gene = sig_genes,
                      universe = background_gene_list,
                      OrgDb = org.Hs.eg.db, 
                      keyType = 'ENSEMBL',
                      readable = T,
                      ont = "BP",
                      pvalueCutoff = 0.05, 
                      qvalueCutoff = 0.05)

enrichGO object is saved/loaded as RDS.

#saveRDS(go_enrich, file = "go_enrich.rds")
#load RDS if necessary
go_enrich <- readRDS("go_enrich.rds")

Visualize GO Pathways

The 10 most enriched biological processes are displayed in a bargraph

bp_barplot <- as.ggplot(barplot(go_enrich, showCategory = 10, title = "GO Biological Pathways in HCC tumor samples" , font.size = 6  ))
bp_barplot

Save biological pathways barplot as RDS file

#saveRDS(bp_barplot, file = "bp_barplot.rds")

Data is continued to be prepared for additional visualization

Results are extracted into a table

## Output results from enirchgo in a table
cluster_summary <- data.frame(go_enrich)
write.csv(cluster_summary, "clusterProfiler_HCC.csv")

head(cluster_summary)
##                    ID                          Description GeneRatio   BgRatio
## GO:0044282 GO:0044282     small molecule catabolic process    60/758 403/21261
## GO:0016054 GO:0016054       organic acid catabolic process    42/758 276/21261
## GO:0046395 GO:0046395    carboxylic acid catabolic process    42/758 276/21261
## GO:0046394 GO:0046394 carboxylic acid biosynthetic process    44/758 361/21261
## GO:0016053 GO:0016053    organic acid biosynthetic process    44/758 364/21261
## GO:0015711 GO:0015711              organic anion transport    51/758 488/21261
##                  pvalue     p.adjust       qvalue
## GO:0044282 3.750416e-21 1.839204e-17 1.605178e-17
## GO:0016054 1.693721e-15 2.768670e-12 2.416376e-12
## GO:0046395 1.693721e-15 2.768670e-12 2.416376e-12
## GO:0046394 1.126820e-12 1.381481e-09 1.205697e-09
## GO:0016053 1.500622e-12 1.471810e-09 1.284533e-09
## GO:0015711 6.705804e-12 5.480877e-09 4.783474e-09
##                                                                                                                                                                                                                                                                                                                                                                                  geneID
## GO:0044282 HSD17B6/OAT/PFKP/PKM/ABCB11/ACACB/ECHDC1/UPB1/MLYCD/QPRT/IL4I1/SULT2A1/GCK/ASPA/ENO3/DAO/ALDH2/ENO2/ALDH5A1/EHHADH/FAAH/ALDH8A1/ECHDC2/ACADS/SARDH/TST/APOBEC3A/CDO1/APOE/CYP27A1/AMDHD1/SLC16A3/ABHD1/ADHFE1/IDNK/MAT1A/PFKM/HK2/UROC1/HAAO/LDHD/NUDT8/CYP7A1/GLYCTK/DCXR/AGXT/LEP/PIPOX/ABAT/CYP4F2/CYP4F3/SULT1A1/SULT1A2/ECI2/GALT/ACAD11/LOC102724560/ABCB11/PRODH/FTCD
## GO:0016054                                                                                                             OAT/ABCB11/ACACB/ECHDC1/MLYCD/QPRT/IL4I1/SULT2A1/ASPA/DAO/ALDH5A1/EHHADH/FAAH/ALDH8A1/ECHDC2/ACADS/SARDH/TST/CDO1/AMDHD1/SLC16A3/ABHD1/ADHFE1/IDNK/MAT1A/UROC1/HAAO/LDHD/NUDT8/DCXR/AGXT/LEP/PIPOX/ABAT/CYP4F2/CYP4F3/ECI2/ACAD11/LOC102724560/ABCB11/PRODH/FTCD
## GO:0046395                                                                                                             OAT/ABCB11/ACACB/ECHDC1/MLYCD/QPRT/IL4I1/SULT2A1/ASPA/DAO/ALDH5A1/EHHADH/FAAH/ALDH8A1/ECHDC2/ACADS/SARDH/TST/CDO1/AMDHD1/SLC16A3/ABHD1/ADHFE1/IDNK/MAT1A/UROC1/HAAO/LDHD/NUDT8/DCXR/AGXT/LEP/PIPOX/ABAT/CYP4F2/CYP4F3/ECI2/ACAD11/LOC102724560/ABCB11/PRODH/FTCD
## GO:0046394                                                                                          MLXIPL/BCAT1/OAT/ACSM2B/ABCB11/PTGS2/ACACB/SLC27A5/SCD/UPB1/MTHFD1/MLYCD/APOA5/APOC3/PECR/ALDH8A1/MTHFD1L/CDO1/APOC1/ACSS2/ALOX5AP/CYP27A1/BAAT/ABHD1/BHMT/NAGS/HAAO/FABP5/AVPR1A/ACSM1/CYP7A1/GATM/AGXT/SEPHS2/CYP8B1/ABAT/ACSM5/ACSM2A/INSIG1/LOC102724560/ACACA/BAAT/ABCB11/ADI1
## GO:0016053                                                                                          MLXIPL/BCAT1/OAT/ACSM2B/ABCB11/PTGS2/ACACB/SLC27A5/SCD/UPB1/MTHFD1/MLYCD/APOA5/APOC3/PECR/ALDH8A1/MTHFD1L/CDO1/APOC1/ACSS2/ALOX5AP/CYP27A1/BAAT/ABHD1/BHMT/NAGS/HAAO/FABP5/AVPR1A/ACSM1/CYP7A1/GATM/AGXT/SEPHS2/CYP8B1/ABAT/ACSM5/ACSM2A/INSIG1/LOC102724560/ACACA/BAAT/ABCB11/ADI1
## GO:0015711           CFTR/ABCB4/SLC11A1/SLC2A3/ABCB11/PTGS2/SLC27A5/ABCC6/SLCO4A1/AQP9/MFSD10/SLC6A12/SLC17A2/SLC16A10/RGS4/SLC2A1/APOE/SLC6A8/LRRC8A/SLC22A7/SLC7A1/SLC39A6/SLC16A3/CRABP2/SNCA/SLC17A4/SLC5A12/SLC6A1/SV2A/SLC1A7/SLC2A2/SLC51A/FABP5/AVPR1A/CYP7A1/FABP6/AGXT/SLC16A13/SLC6A19/LEP/SLC25A42/SLC25A18/ABAT/SLC25A10/SLC22A10/CYP4F2/SLC22A25/NMB/SLC10A5/ABCC6/ABCB11
##            Count
## GO:0044282    60
## GO:0016054    42
## GO:0046395    42
## GO:0046394    44
## GO:0016053    44
## GO:0015711    51

The identification is converted to entrezID and duplicates are removed.

# Convert gene IDs
ids<-bitr(names(original_gene_list), 
          fromType = "ENSEMBL", 
          toType = "ENTREZID", 
          OrgDb="org.Hs.eg.db") 

# remove duplicate IDS 
dedup_ids = ids[!duplicated(ids[c("ENSEMBL")]),]

A new dataframe is created of only genes that were converted in previous step.

# Create a new dataframe
entrez_df = all_matrix[all_matrix$input_id %in% dedup_ids$ENSEMBL,]

# Create a new column for entrezIDs
entrez_df <- merge(all_matrix, dedup_ids, by.x = "input_id", by.y = "ENSEMBL")

str(entrez_df)
## 'data.frame':    936 obs. of  20 variables:
##  $ input_id      : chr  "ENSG00000000938" "ENSG00000001036" "ENSG00000001626" "ENSG00000005059" ...
##  $ X             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ baseMean      : num  1077.7 4643 66.3 425.3 1420.5 ...
##  $ log2FoldChange: num  2.42 1.34 5.86 1.8 -1.82 ...
##  $ lfcSE         : num  0.627 0.442 1.757 0.576 0.42 ...
##  $ stat          : num  3.86 3.02 3.33 3.13 -4.33 ...
##  $ pvalue        : num  1.12e-04 2.49e-03 8.54e-04 1.76e-03 1.46e-05 ...
##  $ padj          : num  0.00678 0.04885 0.02495 0.03866 0.00161 ...
##  $ HCC_1         : num  311.71 3630 4.35 111.45 2133.24 ...
##  $ HCC_125       : num  432.94 2100.96 4.43 202.75 1781.34 ...
##  $ HCC_101       : num  496 2472 0 215 2486 ...
##  $ HCC_109       : num  117 2330 0 229 2453 ...
##  $ HCC_118       : num  1581 3830 204 444 759 ...
##  $ HCC_25        : num  1171 5018 0 272 506 ...
##  $ HCC_10        : num  912.76 6954.7 5.22 588.92 919.28 ...
##  $ HCC_52        : num  3600 10808 313 1339 326 ...
##  $ threshold     : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ symbol        : chr  "FGR" "FUCA2" "CFTR" "MCUB" ...
##  $ diffexpressed : chr  "Upregulated" "Upregulated" "Upregulated" "Upregulated" ...
##  $ ENTREZID      : chr  "2268" "2519" "1080" "55013" ...

This dataframe is saved/loaded as RDS.

#saveRDS(entrez_df, file = "entrez_df.rds")
#load RDS if necessary
entrez_df <- readRDS("entrez_df.rds")

This dataframe is also written into a new file.

write.csv(entrez_df, "HCC_Success_vs_Failure_entrezid.csv", row.names = TRUE)

GSEA data preparation

Dataframe will be read in.

gsea_data <- read.csv("HCC_Success_vs_Failure_entrezid.csv", header = TRUE)
#remove extra columns
gsea_data <- dplyr::select(gsea_data, -X)
gsea_data <- dplyr::select(gsea_data, -X.1)

Remove duplicates and insignificant values

## Remove NA values 
res_entrez <- dplyr::filter(gsea_data, ENTREZID != "NA")
## Remove entrez duplicates
res_entrez <- res_entrez[which(duplicated(res_entrez$ENTREZID) == F), ]

Fold changes will be extracted.

## Extract foldchanges
foldchanges <- res_entrez$log2FoldChange

## Name fold change with the Entrez ID
names(foldchanges) <- res_entrez$ENTREZID

## Sort fold changes in decreasing order
foldchanges <- sort(foldchanges, decreasing = TRUE)

Save these as RDS files

#saveRDS(foldchanges, "foldchanges.RDS")
foldchanges <- readRDS("foldchanges.RDS")

Perform GSEA

gseaKEGG <- gseKEGG(geneList = foldchanges,
                    organism = "hsa", 
                    nPerm = 1000, 
                    minGSSize = 20, 
                    pvalueCutoff = 0.05, 
                    verbose = FALSE)

Save this output as a RDS file

#saveRDS(gseaKEGG, file = "gseaKEGG.RDS")
gseaKEGG <- readRDS("gseaKEGG.RDS")

Results are extracted.

gseaKEGG_results <- gseaKEGG@result

GSEA plot visualization is performed.

enrich_gsea1 <- as.ggplot(gseaplot2(gseaKEGG,
                         title = gseaKEGG_results$Description[1],
                         geneSetID = 1,
                         rel_heights = 2, base_size = 10))
enrich_gsea1

enrich_gsea2 <- as.ggplot(gseaplot2(gseaKEGG,
                         title = gseaKEGG_results$Description[2],
                         geneSetID = 2,
                         rel_heights = 2, base_size = 10))
enrich_gsea2

GSEA plots are saved as RDS files

#saveRDS(enrich_gsea1, file = "GSEA_1.rds")
#saveRDS(enrich_gsea2, file = "GSEA_2.rds")

Investigate MSigDB gene sets

Download MSigDB gene sets from human category chemical and genetic perturbations

CGP_gene_sets <- msigdbr(species = "human", category = "C2",
                            subcategory = "CGP") %>% 
  dplyr::select(gs_name, entrez_gene)
head(CGP_gene_sets)
## # A tibble: 6 × 2
##   gs_name                  entrez_gene
##   <chr>                          <int>
## 1 ABBUD_LIF_SIGNALING_1_DN       79026
## 2 ABBUD_LIF_SIGNALING_1_DN         214
## 3 ABBUD_LIF_SIGNALING_1_DN       91369
## 4 ABBUD_LIF_SIGNALING_1_DN        8289
## 5 ABBUD_LIF_SIGNALING_1_DN         594
## 6 ABBUD_LIF_SIGNALING_1_DN      146556

Save gene sets in RDS file.

#saveRDS(CGP_gene_sets, file= "CGP_gene_sets.rds")
CGP_gene_sets <- readRDS("CGP_gene_sets.rds")

Run GSEA on this.

msig_CGP_GSEA <- GSEA(foldchanges, TERM2GENE = CGP_gene_sets, verbose = FALSE)

Save as a RDS file

#saveRDS(msig_CGP_GSEA, "msig_CGP_GSEA.rds")
msig_CGP_GSEA <- readRDS("msig_CGP_GSEA.rds")

Extract GSEA results

gseaCGP_results <- msig_CGP_GSEA@result
# Write to file
write.csv(gseaCGP_results, "gseaCGP_results.csv", quote=F)
#View(gseaGOBP_results)

Compare to gene sets identified by literature.

## gsea plot for single pathway 
prol_GSEA <- as.ggplot(gseaplot2(msig_CGP_GSEA, title = 'CHIANG_LIVER_CANCER_SUBCLASS_PROLIFERATION_UP', geneSetID = 'CHIANG_LIVER_CANCER_SUBCLASS_PROLIFERATION_UP', base_size = 10))
prol_GSEA

## gsea plot for single pathway 
HCC_GSEA <- as.ggplot(gseaplot2(msig_CGP_GSEA, title = 'DESERT_STEM_CELL_HEPATOCELLULAR_CARCINOMA_SUBCLASS_UP', geneSetID = 'DESERT_STEM_CELL_HEPATOCELLULAR_CARCINOMA_SUBCLASS_UP', base_size = 10))
HCC_GSEA

Save as RDS files

saveRDS(prol_GSEA, file = "prol_GSEA.rds")
saveRDS(HCC_GSEA, file = "HCC_GSEA.rds")

Create a multipanel figure

Reload individual figures.

heatmap <- readRDS("HCC_heatmap.rds")
pca <- readRDS("PCAplot.rds")
bp_bar <- readRDS("bp_barplot.rds")
gsea_1 <- readRDS("GSEA_1.rds")
gsea_2 <- readRDS("GSEA_2.rds")
prol_GSEA <- readRDS("prol_GSEA.rds")
HCC_GSEA <- readRDS("HCC_GSEA.rds")

Cowplot is used to combine separate figures.

#for orginal figure
Fig1A <- plot_grid(heatmap, labels = "A")
Fig1B <- plot_grid(HCC_GSEA, prol_GSEA, labels = "B", nrow = 2)
orig_fig <- plot_grid(Fig1A, Fig1B, align = "hv")
ggsave(filename = "original_figure.png",
       plot = orig_fig,
       units = "in", 
       width = 15, 
       height = 6, 
       dpi = 300)
orig_fig

#additional figure
FigAB <- plot_grid(gsea_1, gsea_2, labels = c("A","B"))
FigC <- plot_grid(bp_bar, labels = "C")
comb_fig <- plot_grid(FigAB,FigC, nrow = 2)
ggsave(filename = "additional_figure.png",
       plot = comb_fig,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300) 
comb_fig

Save each individual figures as their own png files

ggsave(filename = "heatmap.png",
       plot =Fig1A,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300) 
ggsave(filename = "lit_GSEA.png",
       plot = Fig1B,
       units = "in", 
       width = 12, 
       height = 6, 
       dpi = 300) 
ggsave(filename = "add_GSEA.png",
       plot = FigAB,
       units = "in", 
       width = 10, 
       height = 6, 
       dpi = 300) 
ggsave(filename = "bp_bar.png",
       plot = FigC,
       units = "in", 
       width = 8, 
       height = 6, 
       dpi = 300) 
ggsave(filename = "PCA.png",
       plot = pca,
       units = "in", 
       width = 8, 
       height = 6, 
       dpi = 300) 

Session Information

sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Red Hat Enterprise Linux Server 7.9 (Maipo)
## 
## Matrix products: default
## BLAS:   /gpfs1/sw/x86_64/rh7/pkgs/stacks/gcc/10.5.0/R/4.3.2-1/lib64/R/lib/libRblas.so 
## LAPACK: /gpfs1/sw/x86_64/rh7/pkgs/stacks/gcc/10.5.0/R/4.3.2-1/lib64/R/lib/libRlapack.so;  LAPACK version 3.11.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] fstcore_0.9.18              ggnewscale_0.4.10          
##  [3] DOSE_3.28.2                 org.Hs.eg.db_3.18.0        
##  [5] genekitr_1.2.5              GSEABase_1.64.0            
##  [7] graph_1.80.0                annotate_1.80.0            
##  [9] XML_3.99-0.16.1             AnnotationDbi_1.64.1       
## [11] enrichplot_1.22.0           pathview_1.42.0            
## [13] msigdbr_7.5.1               clusterProfiler_4.10.1     
## [15] ggplotify_0.1.2             patchwork_1.2.0            
## [17] cowplot_1.1.3               lubridate_1.9.3            
## [19] forcats_1.0.0               stringr_1.5.1              
## [21] dplyr_1.1.4                 purrr_1.0.2                
## [23] readr_2.1.4                 tidyr_1.3.0                
## [25] tibble_3.2.1                tidyverse_2.0.0            
## [27] ggplot2_3.5.1               pheatmap_1.0.12            
## [29] RColorBrewer_1.1-3          DESeq2_1.42.1              
## [31] SummarizedExperiment_1.32.0 Biobase_2.62.0             
## [33] MatrixGenerics_1.14.0       matrixStats_1.3.0          
## [35] GenomicRanges_1.54.1        GenomeInfoDb_1.38.8        
## [37] IRanges_2.36.0              S4Vectors_0.40.2           
## [39] BiocGenerics_0.48.1         knitr_1.45                 
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.3.2           bitops_1.0-7            urltools_1.7.3         
##   [4] triebeard_0.4.1         polyclip_1.10-6         lifecycle_1.0.4        
##   [7] lattice_0.21-9          MASS_7.3-60             magrittr_2.0.3         
##  [10] openxlsx_4.2.5.2        sass_0.4.7              rmarkdown_2.25         
##  [13] jquerylib_0.1.4         yaml_2.3.7              zip_2.3.1              
##  [16] ggvenn_0.1.10           DBI_1.2.2               abind_1.4-5            
##  [19] zlibbioc_1.48.2         ggraph_2.2.1            RCurl_1.98-1.14        
##  [22] yulab.utils_0.1.4       tweenr_2.0.3            GenomeInfoDbData_1.2.11
##  [25] ggrepel_0.9.5           tidytree_0.4.6          codetools_0.2-19       
##  [28] DelayedArray_0.28.0     xml2_1.3.5              ggforce_0.4.2          
##  [31] tidyselect_1.2.0        aplot_0.2.2             farver_2.1.1           
##  [34] viridis_0.6.5           jsonlite_1.8.7          fst_0.9.8              
##  [37] tidygraph_1.3.1         systemfonts_1.0.5       tools_4.3.2            
##  [40] progress_1.2.2          ragg_1.2.6              treeio_1.26.0          
##  [43] Rcpp_1.0.11             glue_1.6.2              gridExtra_2.3          
##  [46] SparseArray_1.2.4       xfun_0.41               qvalue_2.34.0          
##  [49] withr_2.5.2             fastmap_1.1.1           fansi_1.0.5            
##  [52] digest_0.6.33           timechange_0.2.0        R6_2.5.1               
##  [55] gridGraphics_0.5-1      textshaping_0.3.7       colorspace_2.1-0       
##  [58] GO.db_3.18.0            RSQLite_2.3.6           utf8_1.2.4             
##  [61] generics_0.1.3          data.table_1.14.8       prettyunits_1.2.0      
##  [64] graphlayouts_1.1.1      httr_1.4.7              S4Arrays_1.2.1         
##  [67] scatterpie_0.2.2        pkgconfig_2.0.3         gtable_0.3.4           
##  [70] blob_1.2.4              XVector_0.42.0          shadowtext_0.1.3       
##  [73] htmltools_0.5.7         geneset_0.2.7           fgsea_1.28.0           
##  [76] scales_1.3.0            png_0.1-8               ggfun_0.1.4            
##  [79] rstudioapi_0.15.0       tzdb_0.4.0              reshape2_1.4.4         
##  [82] nlme_3.1-163            cachem_1.0.8            parallel_4.3.2         
##  [85] HDO.db_0.99.1           pillar_1.9.0            grid_4.3.2             
##  [88] vctrs_0.6.4             xtable_1.8-4            Rgraphviz_2.46.0       
##  [91] evaluate_0.23           KEGGgraph_1.62.0        cli_3.6.1              
##  [94] locfit_1.5-9.9          compiler_4.3.2          rlang_1.1.2            
##  [97] crayon_1.5.2            labeling_0.4.3          plyr_1.8.9             
## [100] fs_1.6.3                stringi_1.8.2           viridisLite_0.4.2      
## [103] BiocParallel_1.36.0     babelgene_22.9          munsell_0.5.0          
## [106] Biostrings_2.70.3       lazyeval_0.2.2          GOSemSim_2.28.1        
## [109] Matrix_1.6-1.1          europepmc_0.4.3         hms_1.1.3              
## [112] bit64_4.0.5             KEGGREST_1.42.0         highr_0.10             
## [115] igraph_2.0.3            memoise_2.0.1           bslib_0.6.1            
## [118] ggtree_3.10.1           fastmatch_1.1-4         bit_4.0.5              
## [121] ape_5.8                 gson_0.1.0