Belli, S., Esposito, D., Allotta, A., & others. (2023). Pak1 pathway hyper-activation mediates resistance to endocrine therapy and CDK4/6 inhibitors in ER+ breast cancer. NPJ Breast Cancer, 9(1), 48. https://doi.org/10.1038/s41523-023-00556-9
Packages installed are needed to perform NGS analysis.
#install packages
#install.packages(c("knitr", "RColorBrewer", "pheatmap"))
#BiocManager::install("apeglm") # Update all/some/none? [a/s/n] - type n
#install.packages("pacman")
#BiocManager::install("pathview")
#BiocManager::install("enrichplot")
#BiocManager::install("AnnotationDbi")
#BiocManager::install("org.Hs.eg.db")
#install.packages("VennDiagram")
#if (!requireNamespace("biomaRt", quietly = TRUE)) {
#install.packages("BiocManager")
#BiocManager::install("biomaRt")}
#BiocManager::install(c("clusterProfiler", "org.Hs.eg.db", "msigdbr", "enrichplot"))
Libraries loaded are needed to perform NGS visual analysis.
#load libraries
pacman::p_load(DESeq2, dplyr, tidyr, ggplot2, genekitr, pheatmap,
RColorBrewer, EnhancedVolcano, clusterProfiler, DOSE,
ggnewscale, enrichplot, knitr, cowplot, msigdbr, pathview,
patchwork, ggplotify)
library(dplyr)
library(DESeq2)
library(knitr)
library(RColorBrewer)
library(pheatmap)
library(ggplot2)
library(tidyverse)
library(patchwork)
library(ggplotify)
library(ggpubr)
library(ashr)
library(AnnotationDbi)
library(org.Hs.eg.db)
library(ggVennDiagram)
library(biomaRt)
library(clusterProfiler)
library(org.Hs.eg.db)
library(enrichplot)
library(dplyr)
library(cowplot)
Data is loaded into object countdata from the file final_meta_table.csv. The table is then converted into a matrix.
#load data
countdata <- read.csv("final_meta_table.csv",
header=TRUE,
row.names=1)
countdata <- as.matrix(countdata)
Column names of data are specified and extracted from countdata to ensure they match columns in final_meta_table.csv.
colnames(countdata)
## [1] "file_name" "cell_line" "Genotype" "replicate"
Sample names, genotype, and cell lines are specified as a factor by creating an object for each.
sample_names <- rep(c("T47D_control_rep1","T47D_control_rep2","T47D_control_rep3", "MCF7_control_rep1", "MCF7_control_rep2", "MCF7_control_rep3", "T47D_FAR_rep1", "T47D_FAR_rep2", "T47D_FAR_rep3", "MCF7_FAR_rep1", "MCF7_FAR_rep2", "MCF7_FAR_rep3"), each = 1, times = 1)
genotype <- rep(c("WT", "FAR"), each = 3, times = 2)
cell_line<- rep(c("T47D", "MCF7"), each = 3,times = 2)
Determine length of vectors to ensure that a data frame can be formed.
length(sample_names)
## [1] 12
length(genotype)
## [1] 12
length(cell_line)
## [1] 12
sampleTable is created from data in final_meta_table.csv to create a data frame. Cell line and genotype are specified as factors. Group is created as a factor by combining cell line and genotype. This will be used to distinguish cell lines and genotypes from each other.
sampleTable <- read.csv("final_meta_table.csv", header = T)
sampleTable <- data.frame(
sampleTable)
cell_line <- factor(sampleTable$cell_line)
genotype <- factor(sampleTable$genotype)
sampleTable$Group <- factor(paste(sampleTable$cell_line, sampleTable$Genotype, sep = "_"))
head(sampleTable)
## sample_name file_name cell_line Genotype
## 1 T47D_control_rep1 SRR23803508_sorted.gene_id.count.txt T47D WT
## 2 T47D_control_rep2 SRR23803507_sorted.gene_id.count.txt T47D WT
## 3 T47D_control_rep3 SRR23803506_sorted.gene_id.count.txt T47D WT
## 4 MCF7_control_rep1 SRR23803514_sorted.gene_id.count.txt MCF7 WT
## 5 MCF7_control_rep2 SRR23803513_sorted.gene_id.count.txt MCF7 WT
## 6 MCF7_control_rep3 SRR23803512_sorted.gene_id.count.txt MCF7 WT
## replicate Group
## 1 1 T47D_WT
## 2 2 T47D_WT
## 3 3 T47D_WT
## 4 1 MCF7_WT
## 5 2 MCF7_WT
## 6 3 MCF7_WT
DESeq2 analysis is stored in the object dds. Design is set as Group as it distinguishes the cell lines and genotypes at the same time. Three separate DESeq analyses are performed as dds_MCF7 and dds_T47D will be used for later downstream analysis.
#DESeq2
dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
directory = "counts/",
design = ~Group)
dds
## class: DESeqDataSet
## dim: 78932 12
## metadata(1): version
## assays(1): counts
## rownames(78932): ENSG00000000003 ENSG00000000005 ... ENSG00000310556
## ENSG00000310557
## rowData names(0):
## colnames(12): T47D_control_rep1 T47D_control_rep2 ... MCF7_FAR_rep2
## MCF7_FAR_rep3
## colData names(4): cell_line Genotype replicate Group
dds_MCF7 <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
directory = "counts/",
design = ~Group)
dds_MCF7
## class: DESeqDataSet
## dim: 78932 12
## metadata(1): version
## assays(1): counts
## rownames(78932): ENSG00000000003 ENSG00000000005 ... ENSG00000310556
## ENSG00000310557
## rowData names(0):
## colnames(12): T47D_control_rep1 T47D_control_rep2 ... MCF7_FAR_rep2
## MCF7_FAR_rep3
## colData names(4): cell_line Genotype replicate Group
dds_T47D <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
directory = "counts/",
design = ~Group)
dds_T47D
## class: DESeqDataSet
## dim: 78932 12
## metadata(1): version
## assays(1): counts
## rownames(78932): ENSG00000000003 ENSG00000000005 ... ENSG00000310556
## ENSG00000310557
## rowData names(0):
## colnames(12): T47D_control_rep1 T47D_control_rep2 ... MCF7_FAR_rep2
## MCF7_FAR_rep3
## colData names(4): cell_line Genotype replicate Group
Releveing is done to determine which genes are upregulated in later analyses. The WT samples for MCF7 and T47D are used as the control for each respective cell line.
dds_MCF7$Group <- relevel(dds$Group, ref = "MCF7_WT")
dds_T47D$Group <- relevel(dds$Group, ref = "T47D_WT")
Insignificant counts are considered those less than or equal to 10. These are filtered out and stored in keep.
#prefiltering
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
DESeq is used to determine differential gene expression among samples.
dds <- DESeq(dds)
dds_MCF7 <- DESeq(dds_MCF7)
dds_T47D <- DESeq(dds_T47D)
Testing for genes that are significant between treatment conditons. In Belli et al. comparisons were made between cell lines and genotypes. These comparisons can be used farther down the analysis pipeline.
#compare MCF7 control to MCF7 FAR
res_MCF7 <- results(dds_MCF7, contrast = c("Group", "MCF7_WT", "MCF7_FAR"))
#compare T47D control to T47D FAR
res_T47D <- results(dds_T47D, contrast = c('Group', 'T47D_WT', 'T47D_FAR'))
#compare controls
res_Control <- results(dds, contrast = c('Group', 'MCF7_WT', 'T47D_WT'))
#Compare FAR
res_FAR <- results(dds, contrast = c('Group','T47D_FAR', 'MCF7_FAR'))
Results are extracted from dds
res <- results(dds)
Results from DESeq are sorted by p-adjusted values.
dds_padj_sorted <- res[order(res$padj), ]
head(dds_padj_sorted)
## log2 fold change (MLE): Group T47D WT vs MCF7 FAR
## Wald test p-value: Group T47D WT vs MCF7 FAR
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000051523 4816.17 -10.05022 0.2564514 -39.1896 0
## ENSG00000070087 10004.72 -3.72653 0.0779138 -47.8288 0
## ENSG00000076864 4596.45 -5.28440 0.1087166 -48.6071 0
## ENSG00000092929 6227.59 4.14130 0.0850386 48.6990 0
## ENSG00000099864 1361.92 -4.92428 0.1213377 -40.5833 0
## ENSG00000100605 9014.81 -2.85575 0.0703275 -40.6064 0
## padj
## <numeric>
## ENSG00000051523 0
## ENSG00000070087 0
## ENSG00000076864 0
## ENSG00000092929 0
## ENSG00000099864 0
## ENSG00000100605 0
table(dds_padj_sorted$padj<0.05)
##
## FALSE TRUE
## 12122 14369
Results of counts and dds_padj_sorted are merged to create an output of results that can be used for analysis downstream. Statistics of the results of DESeq analysis were merged with the counts from dds. Gene id is specified for ENSEMBL names.
#merge results
result_PAK1_merged <- merge(as.data.frame(dds_padj_sorted),
as.data.frame(counts(dds, normalized=TRUE)),
by="row.names",
sort=FALSE)
names(result_PAK1_merged)[1] <- "Gene_id"
str(result_PAK1_merged)
## 'data.frame': 28139 obs. of 19 variables:
## $ Gene_id : chr "ENSG00000051523" "ENSG00000070087" "ENSG00000076864" "ENSG00000092929" ...
## $ baseMean : num 4816 10005 4596 6228 1362 ...
## $ log2FoldChange : num -10.05 -3.73 -5.28 4.14 -4.92 ...
## $ lfcSE : num 0.2565 0.0779 0.1087 0.085 0.1213 ...
## $ stat : num -39.2 -47.8 -48.6 48.7 -40.6 ...
## $ pvalue : num 0 0 0 0 0 0 0 0 0 0 ...
## $ padj : num 0 0 0 0 0 0 0 0 0 0 ...
## $ T47D_control_rep1: num 16.3 1761.9 184.1 8142.4 102.7 ...
## $ T47D_control_rep2: num 4.32 1842.08 174.85 8624.66 118.01 ...
## $ T47D_control_rep3: num 17.7 2078.3 149.3 8158.5 138.7 ...
## $ MCF7_control_rep1: num 5819 7599 11032 12288 1561 ...
## $ MCF7_control_rep2: num 6604 7589 11450 12664 1510 ...
## $ MCF7_control_rep3: num 6780 7870 11596 12079 1529 ...
## $ T47D_FAR_rep1 : num 19.9 5485.1 233.1 3779.3 141.2 ...
## $ T47D_FAR_rep2 : num 21.3 5281.9 285.9 3842.8 156.4 ...
## $ T47D_FAR_rep3 : num 23.4 5317.3 262.1 3740.6 141 ...
## $ MCF7_FAR_rep1 : num 12571 24667 6687 509 3838 ...
## $ MCF7_FAR_rep2 : num 12585 24788 6538 483 3439 ...
## $ MCF7_FAR_rep3 : num 13333 25777 6565 419 3667 ...
Results of merged data are saved to PAK1_merged_matrix_final.csv
#write results
write.csv(result_PAK1_merged,
file = "output/PAK1_merged_matrix_final.csv",
row.names = FALSE)
DEGs are filtered by significance and stored in results_PAK1_merged_filtered_FC_P.csv. Results that are considered significant are considered to have a log2FoldChange greater than 1 and a p adjusted value of 0.05.
#DEG filtered by significance
#filter out insignificant data by padj and log2FC
result_PAK1_merged <- as.data.frame(result_PAK1_merged) %>% dplyr::filter(abs(log2FoldChange) > 1 & padj < 0.05)
head(result_PAK1_merged)
## Gene_id baseMean log2FoldChange lfcSE stat pvalue padj
## 1 ENSG00000051523 4816.174 -10.050217 0.25645135 -39.18956 0 0
## 2 ENSG00000070087 10004.719 -3.726529 0.07791384 -47.82884 0 0
## 3 ENSG00000076864 4596.445 -5.284401 0.10871661 -48.60712 0 0
## 4 ENSG00000092929 6227.595 4.141297 0.08503856 48.69905 0 0
## 5 ENSG00000099864 1361.919 -4.924282 0.12133766 -40.58329 0 0
## 6 ENSG00000100605 9014.812 -2.855748 0.07032746 -40.60645 0 0
## T47D_control_rep1 T47D_control_rep2 T47D_control_rep3 MCF7_control_rep1
## 1 16.2766 4.317365 17.66491 5818.700
## 2 1761.9417 1842.075873 2078.27716 7599.097
## 3 184.1290 174.853296 149.26852 11032.411
## 4 8142.3682 8624.656804 8158.54062 12287.763
## 5 102.7460 118.007986 138.66958 1561.088
## 6 1905.3793 2055.785457 2137.45462 14654.784
## MCF7_control_rep2 MCF7_control_rep3 T47D_FAR_rep1 T47D_FAR_rep2 T47D_FAR_rep3
## 1 6603.949 6779.623 19.85222 21.32671 23.35168
## 2 7588.997 7870.444 5485.09593 5281.91454 5317.26353
## 3 11449.817 11595.766 233.07981 285.93585 262.05772
## 4 12664.274 12079.474 3779.27522 3842.75672 3740.59284
## 5 1510.189 1529.354 141.17137 156.39585 140.97494
## 6 14718.087 15337.615 4414.54638 4231.37673 4557.03668
## MCF7_FAR_rep1 MCF7_FAR_rep2 MCF7_FAR_rep3
## 1 12570.666 12584.8934 13333.4726
## 2 24666.765 24787.6518 25777.1067
## 3 6686.752 6538.4700 6564.8023
## 4 509.059 483.1004 419.2755
## 5 3838.162 3439.2148 3667.0576
## 6 15004.158 14802.2987 14359.2229
write.csv(result_PAK1_merged, file = "results_PAK1_merged_filtered_FC_P.csv")
Threshold for p adjusted values is set to less than 0.05.
#significant padj adjustment
threshold_padj <- result_PAK1_merged$padj < 0.05
length(which(threshold_padj))
## [1] 10281
Threshold is applied to merged results to filter p adjusted values less than 0.05.
result_PAK1_merged$threshold <- threshold_padj
str(result_PAK1_merged)
## 'data.frame': 10281 obs. of 20 variables:
## $ Gene_id : chr "ENSG00000051523" "ENSG00000070087" "ENSG00000076864" "ENSG00000092929" ...
## $ baseMean : num 4816 10005 4596 6228 1362 ...
## $ log2FoldChange : num -10.05 -3.73 -5.28 4.14 -4.92 ...
## $ lfcSE : num 0.2565 0.0779 0.1087 0.085 0.1213 ...
## $ stat : num -39.2 -47.8 -48.6 48.7 -40.6 ...
## $ pvalue : num 0 0 0 0 0 0 0 0 0 0 ...
## $ padj : num 0 0 0 0 0 0 0 0 0 0 ...
## $ T47D_control_rep1: num 16.3 1761.9 184.1 8142.4 102.7 ...
## $ T47D_control_rep2: num 4.32 1842.08 174.85 8624.66 118.01 ...
## $ T47D_control_rep3: num 17.7 2078.3 149.3 8158.5 138.7 ...
## $ MCF7_control_rep1: num 5819 7599 11032 12288 1561 ...
## $ MCF7_control_rep2: num 6604 7589 11450 12664 1510 ...
## $ MCF7_control_rep3: num 6780 7870 11596 12079 1529 ...
## $ T47D_FAR_rep1 : num 19.9 5485.1 233.1 3779.3 141.2 ...
## $ T47D_FAR_rep2 : num 21.3 5281.9 285.9 3842.8 156.4 ...
## $ T47D_FAR_rep3 : num 23.4 5317.3 262.1 3740.6 141 ...
## $ MCF7_FAR_rep1 : num 12571 24667 6687 509 3838 ...
## $ MCF7_FAR_rep2 : num 12585 24788 6538 483 3439 ...
## $ MCF7_FAR_rep3 : num 13333 25777 6565 419 3667 ...
## $ threshold : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
Table is further edited to make downstream analysis easier. Gene symbols are added to table based on ENSEMBL ID. Genes are specified whether they were up-regulated or down-regulated based on log2FoldChange.
#add gene symbols and up regulation/down regulation
result_PAK1_merged$Gene_id <- sub("\\.\\d+", "", result_PAK1_merged$Gene_id)
ids <- result_PAK1_merged$Gene_id
gene_symbol_map <- transId(ids, transTo = "symbol", org = "human")
names(result_PAK1_merged)[1] = "input_id"
PAK1_gene <- merge(result_PAK1_merged, gene_symbol_map, by.x = "input_id", by.y = "input_id")
#specify up or downregulation based on padj and log2FC values, and add as new column
PAK1_gene$diffexpressed <- "Unchanged"
PAK1_gene$diffexpressed[PAK1_gene$padj < 0.05 & PAK1_gene$log2FoldChange > 1] <- "Upregulated"
PAK1_gene$diffexpressed[PAK1_gene$padj < 0.05 & PAK1_gene$log2FoldChange < -1] <- "Downregulated"
str(PAK1_gene)
## 'data.frame': 8415 obs. of 22 variables:
## $ input_id : chr "ENSG00000000460" "ENSG00000001461" "ENSG00000001497" "ENSG00000001617" ...
## $ baseMean : num 714 3075 2160 5399 257 ...
## $ log2FoldChange : num -1.35 1.1 -1.01 -1.6 1.47 ...
## $ lfcSE : num 0.1603 0.1451 0.1126 0.0777 0.2572 ...
## $ stat : num -8.44 7.61 -8.99 -20.63 5.71 ...
## $ pvalue : num 3.12e-17 2.84e-14 2.42e-19 1.50e-94 1.12e-08 ...
## $ padj : num 2.27e-16 1.76e-13 1.94e-18 8.50e-93 4.66e-08 ...
## $ T47D_control_rep1: num 289 5877 1464 1762 460 ...
## $ T47D_control_rep2: num 324 6278 1568 1735 461 ...
## $ T47D_control_rep3: num 212 4217 1207 1520 245 ...
## $ MCF7_control_rep1: num 171 1258 1844 9544 215 ...
## $ MCF7_control_rep2: num 133 1260 1774 9400 239 ...
## $ MCF7_control_rep3: num 144 1157 1819 9706 247 ...
## $ T47D_FAR_rep1 : num 1671 2953 2709 5433 249 ...
## $ T47D_FAR_rep2 : num 1797 3098 2473 5140 271 ...
## $ T47D_FAR_rep3 : num 1717 3184 2505 5312 283 ...
## $ MCF7_FAR_rep1 : num 682 2532 2913 5185 158 ...
## $ MCF7_FAR_rep2 : num 697 2648 2663 5112 104 ...
## $ MCF7_FAR_rep3 : num 732 2440 2979 4936 159 ...
## $ threshold : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ symbol : chr "FIRRM" "NIPAL3" "LAS1L" "SEMA3F" ...
## $ diffexpressed : chr "Downregulated" "Upregulated" "Downregulated" "Downregulated" ...
Results are writen as .csv file.
#write results as csv
write.csv(PAK1_gene, "PAK1_DEG.csv", row.names = TRUE)
PCA plot is made to ensure samples are clumping together and no outliers are present.
vstcounts <- vst(dds, blind = FALSE)
pcaData <- plotPCA(vstcounts, intgroup=c("Group"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=Group)) +
geom_point(size=2) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
coord_fixed()
The first goal of this NGS analysis is to re-create the heatmap seen in Figure 2a. The one parameter that will be changed is the color of the grouping of samples. ### Load Data for Heatmap Merged results with gene symbols and up regulation and down regulation of genes specified is loaded into the object all_matrix.
#creating heatmap
#load DEG file
all_matrix <- read.csv("PAK1_DEG.csv", header = TRUE)
all_matrix <- tibble(all_matrix)
A data frame is constructed and edited by selecting samples and gene symbols from the DEG matrix .
#create dataframe with only sample columns from DEG dataframe
heatmap_matrix <- dplyr:::select.data.frame(all_matrix,symbol, T47D_control_rep1:MCF7_FAR_rep3)
#change duplicated gene symbols
uq_symbol_names <- make.names(heatmap_matrix[[1]], unique = T)
heatmap_matrix <- dplyr::select(heatmap_matrix, -symbol)
heatmap_matrix$symbol <- uq_symbol_names
#make gene symbols the row names
heatmap_matrix <- data.frame(heatmap_matrix, row.names = 13)
Matrix is saved as .rds.
saveRDS(heatmap_matrix, file = "heatmap_matrix_PAK1.rds")
# load RDS file of dds if necessary
heatmap_matrix <- readRDS("heatmap_matrix_PAK1.rds")
Annotation file is loaded for heatmap.
final_anno <- read.csv("final_anno.csv", header = TRUE)
str(final_anno)
## 'data.frame': 12 obs. of 3 variables:
## $ sample_name: chr "T47D_control_rep1" "T47D_control_rep2" "T47D_control_rep3" "MCF7_control_rep1" ...
## $ cell_line : chr "T47D" "T47D" "T47D" "MCF7" ...
## $ Genotype : chr "WT" "WT" "WT" "WT" ...
Heatmap matrix and annotation file are visualized to ensure they match.
head(heatmap_matrix)
## T47D_control_rep1 T47D_control_rep2 T47D_control_rep3 MCF7_control_rep1
## FIRRM 288.909615 323.802400 211.978970 170.693390
## NIPAL3 5876.869173 6278.168748 4217.498265 1257.513326
## LAS1L 1463.876535 1567.923175 1207.396885 1844.136810
## SEMA3F 1761.941736 1734.861301 1520.065867 9543.705086
## RAD52 459.813894 460.518968 244.659062 214.987244
## MAD1L1 5.086437 2.878244 8.832457 1.080338
## MCF7_control_rep2 MCF7_control_rep3 T47D_FAR_rep1 T47D_FAR_rep2
## FIRRM 132.644354 144.34106 1671.263145 1796.97257
## NIPAL3 1260.121365 1156.93219 2952.834487 3097.90172
## LAS1L 1774.390050 1819.13814 2708.725660 2473.10818
## SEMA3F 9400.353171 9706.11014 5432.891938 5139.73649
## RAD52 239.194737 246.81220 249.255700 270.92817
## MAD1L1 2.174498 1.10184 1.470535 3.94939
## T47D_FAR_rep3 MCF7_FAR_rep1 MCF7_FAR_rep2 MCF7_FAR_rep3
## FIRRM 1716.7808 681.52063 696.5337 732.1293
## NIPAL3 3184.4770 2532.21189 2648.1059 2440.0037
## LAS1L 2504.6837 2912.81678 2663.4424 2978.5226
## SEMA3F 5312.0743 5184.55220 5112.1736 4936.4236
## RAD52 282.8148 158.18891 103.5215 158.9913
## MAD1L1 0.0000 1.18939 0.0000 0.0000
head(final_anno)
## sample_name cell_line Genotype
## 1 T47D_control_rep1 T47D WT
## 2 T47D_control_rep2 T47D WT
## 3 T47D_control_rep3 T47D WT
## 4 MCF7_control_rep1 MCF7 WT
## 5 MCF7_control_rep2 MCF7 WT
## 6 MCF7_control_rep3 MCF7 WT
colnames(heatmap_matrix)
## [1] "T47D_control_rep1" "T47D_control_rep2" "T47D_control_rep3"
## [4] "MCF7_control_rep1" "MCF7_control_rep2" "MCF7_control_rep3"
## [7] "T47D_FAR_rep1" "T47D_FAR_rep2" "T47D_FAR_rep3"
## [10] "MCF7_FAR_rep1" "MCF7_FAR_rep2" "MCF7_FAR_rep3"
Row names are specified as column names and colors are specified for gene regulation and genotype.
#parameters for pheatmap
#change row names
row.names(final_anno) <- final_anno$sample_name
#expression value colors range from green to red
heatmap_colors <- colorRampPalette(colors = c("green", "black", "red"))(30)
annotation_colors=list(
Genotype=c(WT="blue4", FAR="goldenrod"))
Color gradient for expression is changed compared to original figure. Clustering has also been altered as Belli et al. used either a different clustering method or used ComplexHeatmap to construct heatmap.
PAK1_heatmap <- as.ggplot(pheatmap(heatmap_matrix,
color = heatmap_colors,
annotation_colors = annotation_colors,
annotation_col = final_anno[, "Genotype", drop = FALSE],
cluster_cols = T,
clustering_method = 'average',
show_colnames = T,
show_rownames = F,
scale = "row",
annotation_names_col = F,
fontsize_col = 7,
cellwidth = 12,
cellheight = 0.02,
treeheight_row = 0,
treeheight_col = 10,
))
Figure 2a. Heatmap comparing expression of genes among WT and FAR cell lines.
saveRDS(PAK1_heatmap, file = "PAK1_heatmap_draft_fianl.rds")
Selecting and filtering to have plot only contain genes upregulated in T47D-FAR and MCF7-FAR.
FAR_GO_terms <- all_matrix[ , -c(9:14) ]
FAR_GO_upreg <- FAR_GO_terms[FAR_GO_terms$log2FoldChange > 0, ]
head(FAR_GO_upreg)
## # A tibble: 6 × 17
## X input_id baseMean log2FoldChange lfcSE stat pvalue padj
## <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 ENSG00000001461 3075. 1.10 0.145 7.61 2.84e-14 1.76e-13
## 2 5 ENSG00000002016 257. 1.47 0.257 5.71 1.12e- 8 4.66e- 8
## 3 6 ENSG00000002822 2.31 3.64 1.57 2.31 2.08e- 2 3.95e- 2
## 4 7 ENSG00000003096 253. 1.61 0.232 6.93 4.20e-12 2.27e-11
## 5 9 ENSG00000003147 2056. 1.12 0.0981 11.4 3.41e-30 4.28e-29
## 6 11 ENSG00000003756 933. 1.01 0.104 9.70 2.87e-22 2.65e-21
## # ℹ 9 more variables: T47D_FAR_rep1 <dbl>, T47D_FAR_rep2 <dbl>,
## # T47D_FAR_rep3 <dbl>, MCF7_FAR_rep1 <dbl>, MCF7_FAR_rep2 <dbl>,
## # MCF7_FAR_rep3 <dbl>, threshold <lgl>, symbol <chr>, diffexpressed <chr>
Filter genes by p adjusted value and extract log2FoldChange
# Extract significant results based on padj
sig_genes_df = subset(all_matrix, padj < 0.05)
#filter using log2FoldChange
sig_genes <- sig_genes_df$log2FoldChange
# Name the vector
names(sig_genes) <- sig_genes_df$input_id
# omit NA values
sig_genes <- na.omit(sig_genes)
# significant results based on log2foldchange
sig_genes <- names(sig_genes)[abs(sig_genes) > 1]
head(sig_genes)
## [1] "ENSG00000000460" "ENSG00000001461" "ENSG00000001497" "ENSG00000001617"
## [5] "ENSG00000002016" "ENSG00000002822"
EnrichGO analysis
go_enrich <- enrichGO(gene = sig_genes,
OrgDb = org.Hs.eg.db,
keyType = 'ENSEMBL',
readable = T,
ont = "MF",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05)
Plotting results as a bar plot. Seven total terms are plotted, similaryly to Figure 2c.
GO_barplot <- as.ggplot(barplot(go_enrich, showCategory = 7, title = "GO Molecular Function", font.size = 6 ))
GO_barplot
Figure 2c. Bar plot of GO Enrichment terms for molecular function in genes that are up regulated in MCF7-FAR and T47D- FAR cell lines.
saveRDS(GO_barplot, file = "go_barplot.rds")
Figure panel is created using cowplot.
#create figure panel
cowplot::plot_grid(
PAK1_heatmap,
VennDiagram,
GO_barplot,
labels = letters[1:3],
ncol = 2,
rel_widths = c(6, 4, 12))
Figure 2. a) Heatmap comparing expression of genes among WT and FAR cell lines. b) Venn Diagram of shared upregulated genes in T47D-FAR and MCF7-FAR cells. c) Bar plot of GO Enrichment terms for molecular function in genes that are up regulated in MCF7-FAR and T47D- FAR cell lines.
sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.3.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] fstcore_0.10.0 biomaRt_2.64.0
## [3] ggVennDiagram_1.5.2 org.Hs.eg.db_3.21.0
## [5] AnnotationDbi_1.70.0 ashr_2.2-63
## [7] ggpubr_0.6.0 lubridate_1.9.4
## [9] forcats_1.0.0 stringr_1.5.1
## [11] purrr_1.0.4 readr_2.1.5
## [13] tibble_3.2.1 tidyverse_2.0.0
## [15] ggplotify_0.1.2 patchwork_1.3.0
## [17] pathview_1.48.0 msigdbr_10.0.2
## [19] cowplot_1.1.3 knitr_1.50
## [21] enrichplot_1.28.0 ggnewscale_0.5.1
## [23] DOSE_4.2.0 clusterProfiler_4.16.0
## [25] EnhancedVolcano_1.26.0 ggrepel_0.9.6
## [27] RColorBrewer_1.1-3 pheatmap_1.0.12
## [29] genekitr_1.2.8 ggplot2_3.5.2
## [31] tidyr_1.3.1 dplyr_1.1.4
## [33] DESeq2_1.48.0 SummarizedExperiment_1.38.0
## [35] Biobase_2.68.0 MatrixGenerics_1.20.0
## [37] matrixStats_1.5.0 GenomicRanges_1.60.0
## [39] GenomeInfoDb_1.44.0 IRanges_2.42.0
## [41] S4Vectors_0.46.0 BiocGenerics_0.54.0
## [43] generics_0.1.3
##
## loaded via a namespace (and not attached):
## [1] splines_4.5.0 later_1.4.2 filelock_1.0.3
## [4] bitops_1.0-9 urltools_1.7.3 R.oo_1.27.0
## [7] triebeard_0.4.1 polyclip_1.10-7 graph_1.86.0
## [10] XML_3.99-0.18 httr2_1.1.2 lifecycle_1.0.4
## [13] mixsqp_0.3-54 rstatix_0.7.2 lattice_0.22-7
## [16] MASS_7.3-65 backports_1.5.0 magrittr_2.0.3
## [19] openxlsx_4.2.8 sass_0.4.10 rmarkdown_2.29
## [22] jquerylib_0.1.4 yaml_2.3.10 remotes_2.5.0
## [25] httpuv_1.6.16 ggtangle_0.0.6 zip_2.3.2
## [28] sessioninfo_1.2.3 pkgbuild_1.4.7 ggvenn_0.1.10
## [31] DBI_1.2.3 abind_1.4-8 pkgload_1.4.0
## [34] R.utils_2.13.0 RCurl_1.98-1.17 ggraph_2.2.1
## [37] yulab.utils_0.2.0 rappdirs_0.3.3 tweenr_2.0.3
## [40] GenomeInfoDbData_1.2.14 irlba_2.3.5.1 tidytree_0.4.6
## [43] codetools_0.2-20 DelayedArray_0.34.1 xml2_1.3.8
## [46] ggforce_0.4.2 tidyselect_1.2.1 aplot_0.2.5
## [49] UCSC.utils_1.4.0 farver_2.1.2 viridis_0.6.5
## [52] BiocFileCache_2.16.0 jsonlite_2.0.0 fst_0.9.8
## [55] Formula_1.2-5 ellipsis_0.3.2 tidygraph_1.3.1
## [58] tools_4.5.0 progress_1.2.3 treeio_1.32.0
## [61] Rcpp_1.0.14 glue_1.8.0 gridExtra_2.3
## [64] SparseArray_1.8.0 xfun_0.52 qvalue_2.40.0
## [67] usethis_3.1.0 withr_3.0.2 fastmap_1.2.0
## [70] truncnorm_1.0-9 digest_0.6.37 timechange_0.3.0
## [73] R6_2.6.1 mime_0.13 gridGraphics_0.5-1
## [76] colorspace_2.1-1 GO.db_3.21.0 RSQLite_2.3.9
## [79] R.methodsS3_1.8.2 utf8_1.2.4 data.table_1.17.0
## [82] prettyunits_1.2.0 graphlayouts_1.2.2 httr_1.4.7
## [85] htmlwidgets_1.6.4 S4Arrays_1.7.2 pkgconfig_2.0.3
## [88] gtable_0.3.6 blob_1.2.4 XVector_0.48.0
## [91] htmltools_0.5.8.1 carData_3.0-5 geneset_0.2.7
## [94] profvis_0.4.0 fgsea_1.34.0 msigdbdf_24.1.1
## [97] scales_1.4.0 png_0.1-8 ggfun_0.1.8
## [100] rstudioapi_0.17.1 tzdb_0.5.0 reshape2_1.4.4
## [103] curl_6.2.2 nlme_3.1-168 cachem_1.1.0
## [106] parallel_4.5.0 miniUI_0.1.2 pillar_1.10.2
## [109] grid_4.5.0 vctrs_0.6.5 urlchecker_1.0.1
## [112] promises_1.3.2 car_3.1-3 dbplyr_2.5.0
## [115] xtable_1.8-4 Rgraphviz_2.52.0 evaluate_1.0.3
## [118] KEGGgraph_1.68.0 invgamma_1.1 cli_3.6.5
## [121] locfit_1.5-9.12 compiler_4.5.0 rlang_1.1.6
## [124] crayon_1.5.3 SQUAREM_2021.1 ggsignif_0.6.4
## [127] labeling_0.4.3 plyr_1.8.9 fs_1.6.6
## [130] stringi_1.8.7 viridisLite_0.4.2 BiocParallel_1.42.0
## [133] assertthat_0.2.1 babelgene_22.9 Biostrings_2.76.0
## [136] lazyeval_0.2.2 devtools_2.4.5 pacman_0.5.1
## [139] GOSemSim_2.34.0 Matrix_1.7-3 europepmc_0.4.3
## [142] hms_1.1.3 bit64_4.6.0-1 KEGGREST_1.48.0
## [145] shiny_1.10.0 broom_1.0.8 igraph_2.1.4
## [148] memoise_2.0.1 bslib_0.9.0 ggtree_3.16.0
## [151] fastmatch_1.1-6 bit_4.6.0 ape_5.8-1
## [154] gson_0.1.0