About the Data Set

Belli et al. aims to investigate PAK1 as a possible biomarker of resistance in ER+ breast cancer cell lines MCF7 and T47D. Cells resistant to endocrine therapies and CDK4/6 inhibitors have been seen to have higher PAK1 expression, leading to Belli et al. to establish the hypothesis: PAK1 is a mediator of resistance in anti estrogen therapy and CDK4/6 inhibitors. In order to test this hypothesis, Belli et al. conducted a series of spheroid growth assays, dose dependent viability assays, and RNA- seq analysis on parental and fulvestrant and abemaciclib resistant (FAR) cell lines.
Samples used in the NGS analysis below were used on samples obtained from GEO using the acession number GSE227102. Samples are comprised of three trials of each: MCF7-FAR, MCF7 parental, T47D-FAR, and T47D parental for a total of twelve samples. In order for FAR cell lines to confer resistance cells were treated with increasing doses of FA over a 6 month period.
The trail I attempted for my final project was the Green Trail. My goal was to recreate Figure 2a-c.
APA citation:

Belli, S., Esposito, D., Allotta, A., & others. (2023). Pak1 pathway hyper-activation mediates resistance to endocrine therapy and CDK4/6 inhibitors in ER+ breast cancer. NPJ Breast Cancer, 9(1), 48. https://doi.org/10.1038/s41523-023-00556-9

Install Packages

Packages installed are needed to perform NGS analysis.

#install packages
#install.packages(c("knitr", "RColorBrewer", "pheatmap"))
#BiocManager::install("apeglm") # Update all/some/none? [a/s/n] - type n 
#install.packages("pacman")

#BiocManager::install("pathview")
#BiocManager::install("enrichplot")
#BiocManager::install("AnnotationDbi")
#BiocManager::install("org.Hs.eg.db")
#install.packages("VennDiagram")
#if (!requireNamespace("biomaRt", quietly = TRUE)) {
    #install.packages("BiocManager")
    #BiocManager::install("biomaRt")}
#BiocManager::install(c("clusterProfiler", "org.Hs.eg.db", "msigdbr", "enrichplot"))

Load Libraries

Libraries loaded are needed to perform NGS visual analysis.

#load libraries
pacman::p_load(DESeq2, dplyr, tidyr, ggplot2, genekitr, pheatmap, 
               RColorBrewer, EnhancedVolcano, clusterProfiler, DOSE, 
               ggnewscale, enrichplot, knitr, cowplot, msigdbr, pathview, 
               patchwork, ggplotify)
library(dplyr)
library(DESeq2)
library(knitr)
library(RColorBrewer)
library(pheatmap)
library(ggplot2)
library(tidyverse)
library(patchwork)
library(ggplotify)
library(ggpubr)
library(ashr)
library(AnnotationDbi)
library(org.Hs.eg.db)
library(ggVennDiagram)
library(biomaRt)
library(clusterProfiler)
library(org.Hs.eg.db) 
library(enrichplot)    
library(dplyr)
library(cowplot)

Load Data

Data is loaded into object countdata from the file final_meta_table.csv. The table is then converted into a matrix.

#load data
countdata <- read.csv("final_meta_table.csv", 
                      header=TRUE, 
                      row.names=1)

countdata <- as.matrix(countdata)

Column names of data are specified and extracted from countdata to ensure they match columns in final_meta_table.csv.

colnames(countdata)

## [1] "file_name" "cell_line" "Genotype"  "replicate"

Sample names, genotype, and cell lines are specified as a factor by creating an object for each.

sample_names <- rep(c("T47D_control_rep1","T47D_control_rep2","T47D_control_rep3", "MCF7_control_rep1", "MCF7_control_rep2", "MCF7_control_rep3", "T47D_FAR_rep1", "T47D_FAR_rep2", "T47D_FAR_rep3", "MCF7_FAR_rep1", "MCF7_FAR_rep2", "MCF7_FAR_rep3"), each = 1, times = 1)
genotype <- rep(c("WT", "FAR"), each = 3, times = 2)
cell_line<- rep(c("T47D", "MCF7"), each = 3,times = 2)

Determine length of vectors to ensure that a data frame can be formed.

length(sample_names)

## [1] 12

length(genotype)

## [1] 12

length(cell_line)

## [1] 12

sampleTable is created from data in final_meta_table.csv to create a data frame. Cell line and genotype are specified as factors. Group is created as a factor by combining cell line and genotype. This will be used to distinguish cell lines and genotypes from each other.

sampleTable <- read.csv("final_meta_table.csv", header = T)
sampleTable <- data.frame(
  sampleTable)
cell_line <- factor(sampleTable$cell_line)
genotype <- factor(sampleTable$genotype)
sampleTable$Group <- factor(paste(sampleTable$cell_line, sampleTable$Genotype, sep = "_"))
head(sampleTable)

##         sample_name                            file_name cell_line Genotype
## 1 T47D_control_rep1 SRR23803508_sorted.gene_id.count.txt      T47D       WT
## 2 T47D_control_rep2 SRR23803507_sorted.gene_id.count.txt      T47D       WT
## 3 T47D_control_rep3 SRR23803506_sorted.gene_id.count.txt      T47D       WT
## 4 MCF7_control_rep1 SRR23803514_sorted.gene_id.count.txt      MCF7       WT
## 5 MCF7_control_rep2 SRR23803513_sorted.gene_id.count.txt      MCF7       WT
## 6 MCF7_control_rep3 SRR23803512_sorted.gene_id.count.txt      MCF7       WT
##   replicate   Group
## 1         1 T47D_WT
## 2         2 T47D_WT
## 3         3 T47D_WT
## 4         1 MCF7_WT
## 5         2 MCF7_WT
## 6         3 MCF7_WT

Running DESeq2

DESeq2 analysis is stored in the object dds. Design is set as Group as it distinguishes the cell lines and genotypes at the same time. Three separate DESeq analyses are performed as dds_MCF7 and dds_T47D will be used for later downstream analysis.

#DESeq2
dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                  directory = "counts/",
                                  design = ~Group)

dds

## class: DESeqDataSet 
## dim: 78932 12 
## metadata(1): version
## assays(1): counts
## rownames(78932): ENSG00000000003 ENSG00000000005 ... ENSG00000310556
##   ENSG00000310557
## rowData names(0):
## colnames(12): T47D_control_rep1 T47D_control_rep2 ... MCF7_FAR_rep2
##   MCF7_FAR_rep3
## colData names(4): cell_line Genotype replicate Group

dds_MCF7 <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                  directory = "counts/",
                                  design = ~Group)

dds_MCF7

## class: DESeqDataSet 
## dim: 78932 12 
## metadata(1): version
## assays(1): counts
## rownames(78932): ENSG00000000003 ENSG00000000005 ... ENSG00000310556
##   ENSG00000310557
## rowData names(0):
## colnames(12): T47D_control_rep1 T47D_control_rep2 ... MCF7_FAR_rep2
##   MCF7_FAR_rep3
## colData names(4): cell_line Genotype replicate Group

dds_T47D <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                  directory = "counts/",
                                  design = ~Group)

dds_T47D

## class: DESeqDataSet 
## dim: 78932 12 
## metadata(1): version
## assays(1): counts
## rownames(78932): ENSG00000000003 ENSG00000000005 ... ENSG00000310556
##   ENSG00000310557
## rowData names(0):
## colnames(12): T47D_control_rep1 T47D_control_rep2 ... MCF7_FAR_rep2
##   MCF7_FAR_rep3
## colData names(4): cell_line Genotype replicate Group

Releveing is done to determine which genes are upregulated in later analyses. The WT samples for MCF7 and T47D are used as the control for each respective cell line.

dds_MCF7$Group <- relevel(dds$Group, ref = "MCF7_WT")
dds_T47D$Group <- relevel(dds$Group, ref = "T47D_WT")

Pre-filtering

Insignificant counts are considered those less than or equal to 10. These are filtered out and stored in keep.

#prefiltering
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

DESeq: Differential Expression

DESeq is used to determine differential gene expression among samples.

dds <- DESeq(dds)
dds_MCF7 <- DESeq(dds_MCF7)
dds_T47D <- DESeq(dds_T47D)

Testing for genes that are significant between treatment conditons. In Belli et al. comparisons were made between cell lines and genotypes. These comparisons can be used farther down the analysis pipeline.

#compare MCF7 control to MCF7 FAR
res_MCF7 <- results(dds_MCF7, contrast = c("Group", "MCF7_WT", "MCF7_FAR"))
#compare T47D control to T47D FAR
res_T47D <- results(dds_T47D, contrast = c('Group', 'T47D_WT', 'T47D_FAR'))
#compare controls
res_Control <- results(dds, contrast = c('Group', 'MCF7_WT', 'T47D_WT'))
#Compare FAR
res_FAR <- results(dds, contrast = c('Group','T47D_FAR', 'MCF7_FAR'))

Results are extracted from dds

res <- results(dds)

Results from DESeq are sorted by p-adjusted values.

dds_padj_sorted <- res[order(res$padj), ]
head(dds_padj_sorted)

## log2 fold change (MLE): Group T47D WT vs MCF7 FAR 
## Wald test p-value: Group T47D WT vs MCF7 FAR 
## DataFrame with 6 rows and 6 columns
##                  baseMean log2FoldChange     lfcSE      stat    pvalue
##                 <numeric>      <numeric> <numeric> <numeric> <numeric>
## ENSG00000051523   4816.17      -10.05022 0.2564514  -39.1896         0
## ENSG00000070087  10004.72       -3.72653 0.0779138  -47.8288         0
## ENSG00000076864   4596.45       -5.28440 0.1087166  -48.6071         0
## ENSG00000092929   6227.59        4.14130 0.0850386   48.6990         0
## ENSG00000099864   1361.92       -4.92428 0.1213377  -40.5833         0
## ENSG00000100605   9014.81       -2.85575 0.0703275  -40.6064         0
##                      padj
##                 <numeric>
## ENSG00000051523         0
## ENSG00000070087         0
## ENSG00000076864         0
## ENSG00000092929         0
## ENSG00000099864         0
## ENSG00000100605         0

table(dds_padj_sorted$padj<0.05)

## 
## FALSE  TRUE 
## 12122 14369

Merge Data

Results of counts and dds_padj_sorted are merged to create an output of results that can be used for analysis downstream. Statistics of the results of DESeq analysis were merged with the counts from dds. Gene id is specified for ENSEMBL names.

#merge results
result_PAK1_merged <- merge(as.data.frame(dds_padj_sorted), 
                 as.data.frame(counts(dds, normalized=TRUE)), 
                 by="row.names", 
                 sort=FALSE)

names(result_PAK1_merged)[1] <- "Gene_id"
str(result_PAK1_merged)

## 'data.frame':    28139 obs. of  19 variables:
##  $ Gene_id          : chr  "ENSG00000051523" "ENSG00000070087" "ENSG00000076864" "ENSG00000092929" ...
##  $ baseMean         : num  4816 10005 4596 6228 1362 ...
##  $ log2FoldChange   : num  -10.05 -3.73 -5.28 4.14 -4.92 ...
##  $ lfcSE            : num  0.2565 0.0779 0.1087 0.085 0.1213 ...
##  $ stat             : num  -39.2 -47.8 -48.6 48.7 -40.6 ...
##  $ pvalue           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ padj             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ T47D_control_rep1: num  16.3 1761.9 184.1 8142.4 102.7 ...
##  $ T47D_control_rep2: num  4.32 1842.08 174.85 8624.66 118.01 ...
##  $ T47D_control_rep3: num  17.7 2078.3 149.3 8158.5 138.7 ...
##  $ MCF7_control_rep1: num  5819 7599 11032 12288 1561 ...
##  $ MCF7_control_rep2: num  6604 7589 11450 12664 1510 ...
##  $ MCF7_control_rep3: num  6780 7870 11596 12079 1529 ...
##  $ T47D_FAR_rep1    : num  19.9 5485.1 233.1 3779.3 141.2 ...
##  $ T47D_FAR_rep2    : num  21.3 5281.9 285.9 3842.8 156.4 ...
##  $ T47D_FAR_rep3    : num  23.4 5317.3 262.1 3740.6 141 ...
##  $ MCF7_FAR_rep1    : num  12571 24667 6687 509 3838 ...
##  $ MCF7_FAR_rep2    : num  12585 24788 6538 483 3439 ...
##  $ MCF7_FAR_rep3    : num  13333 25777 6565 419 3667 ...

Results of merged data are saved to PAK1_merged_matrix_final.csv

#write results
write.csv(result_PAK1_merged,
          file = "output/PAK1_merged_matrix_final.csv",
          row.names = FALSE)

DEGs are filtered by significance and stored in results_PAK1_merged_filtered_FC_P.csv. Results that are considered significant are considered to have a log2FoldChange greater than 1 and a p adjusted value of 0.05.

#DEG filtered by significance
#filter out insignificant data by padj and log2FC
result_PAK1_merged <- as.data.frame(result_PAK1_merged) %>% dplyr::filter(abs(log2FoldChange) > 1 & padj < 0.05)
head(result_PAK1_merged)

##           Gene_id  baseMean log2FoldChange      lfcSE      stat pvalue padj
## 1 ENSG00000051523  4816.174     -10.050217 0.25645135 -39.18956      0    0
## 2 ENSG00000070087 10004.719      -3.726529 0.07791384 -47.82884      0    0
## 3 ENSG00000076864  4596.445      -5.284401 0.10871661 -48.60712      0    0
## 4 ENSG00000092929  6227.595       4.141297 0.08503856  48.69905      0    0
## 5 ENSG00000099864  1361.919      -4.924282 0.12133766 -40.58329      0    0
## 6 ENSG00000100605  9014.812      -2.855748 0.07032746 -40.60645      0    0
##   T47D_control_rep1 T47D_control_rep2 T47D_control_rep3 MCF7_control_rep1
## 1           16.2766          4.317365          17.66491          5818.700
## 2         1761.9417       1842.075873        2078.27716          7599.097
## 3          184.1290        174.853296         149.26852         11032.411
## 4         8142.3682       8624.656804        8158.54062         12287.763
## 5          102.7460        118.007986         138.66958          1561.088
## 6         1905.3793       2055.785457        2137.45462         14654.784
##   MCF7_control_rep2 MCF7_control_rep3 T47D_FAR_rep1 T47D_FAR_rep2 T47D_FAR_rep3
## 1          6603.949          6779.623      19.85222      21.32671      23.35168
## 2          7588.997          7870.444    5485.09593    5281.91454    5317.26353
## 3         11449.817         11595.766     233.07981     285.93585     262.05772
## 4         12664.274         12079.474    3779.27522    3842.75672    3740.59284
## 5          1510.189          1529.354     141.17137     156.39585     140.97494
## 6         14718.087         15337.615    4414.54638    4231.37673    4557.03668
##   MCF7_FAR_rep1 MCF7_FAR_rep2 MCF7_FAR_rep3
## 1     12570.666    12584.8934    13333.4726
## 2     24666.765    24787.6518    25777.1067
## 3      6686.752     6538.4700     6564.8023
## 4       509.059      483.1004      419.2755
## 5      3838.162     3439.2148     3667.0576
## 6     15004.158    14802.2987    14359.2229

write.csv(result_PAK1_merged, file = "results_PAK1_merged_filtered_FC_P.csv")

Threshold for p adjusted values is set to less than 0.05.

#significant padj adjustment
threshold_padj <- result_PAK1_merged$padj < 0.05 
length(which(threshold_padj))

## [1] 10281

Threshold is applied to merged results to filter p adjusted values less than 0.05.

result_PAK1_merged$threshold <- threshold_padj
str(result_PAK1_merged)

## 'data.frame':    10281 obs. of  20 variables:
##  $ Gene_id          : chr  "ENSG00000051523" "ENSG00000070087" "ENSG00000076864" "ENSG00000092929" ...
##  $ baseMean         : num  4816 10005 4596 6228 1362 ...
##  $ log2FoldChange   : num  -10.05 -3.73 -5.28 4.14 -4.92 ...
##  $ lfcSE            : num  0.2565 0.0779 0.1087 0.085 0.1213 ...
##  $ stat             : num  -39.2 -47.8 -48.6 48.7 -40.6 ...
##  $ pvalue           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ padj             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ T47D_control_rep1: num  16.3 1761.9 184.1 8142.4 102.7 ...
##  $ T47D_control_rep2: num  4.32 1842.08 174.85 8624.66 118.01 ...
##  $ T47D_control_rep3: num  17.7 2078.3 149.3 8158.5 138.7 ...
##  $ MCF7_control_rep1: num  5819 7599 11032 12288 1561 ...
##  $ MCF7_control_rep2: num  6604 7589 11450 12664 1510 ...
##  $ MCF7_control_rep3: num  6780 7870 11596 12079 1529 ...
##  $ T47D_FAR_rep1    : num  19.9 5485.1 233.1 3779.3 141.2 ...
##  $ T47D_FAR_rep2    : num  21.3 5281.9 285.9 3842.8 156.4 ...
##  $ T47D_FAR_rep3    : num  23.4 5317.3 262.1 3740.6 141 ...
##  $ MCF7_FAR_rep1    : num  12571 24667 6687 509 3838 ...
##  $ MCF7_FAR_rep2    : num  12585 24788 6538 483 3439 ...
##  $ MCF7_FAR_rep3    : num  13333 25777 6565 419 3667 ...
##  $ threshold        : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...

Table is further edited to make downstream analysis easier. Gene symbols are added to table based on ENSEMBL ID. Genes are specified whether they were up-regulated or down-regulated based on log2FoldChange.

#add gene symbols and up regulation/down regulation
result_PAK1_merged$Gene_id <- sub("\\.\\d+", "", result_PAK1_merged$Gene_id)
ids <- result_PAK1_merged$Gene_id
gene_symbol_map <- transId(ids, transTo = "symbol", org = "human")

names(result_PAK1_merged)[1] = "input_id"
PAK1_gene <- merge(result_PAK1_merged, gene_symbol_map, by.x = "input_id", by.y = "input_id")

#specify up or downregulation based on padj and log2FC values, and add as new column
PAK1_gene$diffexpressed <- "Unchanged"
PAK1_gene$diffexpressed[PAK1_gene$padj < 0.05 & PAK1_gene$log2FoldChange > 1] <- "Upregulated"
PAK1_gene$diffexpressed[PAK1_gene$padj < 0.05 & PAK1_gene$log2FoldChange < -1] <- "Downregulated"

str(PAK1_gene)

## 'data.frame':    8415 obs. of  22 variables:
##  $ input_id         : chr  "ENSG00000000460" "ENSG00000001461" "ENSG00000001497" "ENSG00000001617" ...
##  $ baseMean         : num  714 3075 2160 5399 257 ...
##  $ log2FoldChange   : num  -1.35 1.1 -1.01 -1.6 1.47 ...
##  $ lfcSE            : num  0.1603 0.1451 0.1126 0.0777 0.2572 ...
##  $ stat             : num  -8.44 7.61 -8.99 -20.63 5.71 ...
##  $ pvalue           : num  3.12e-17 2.84e-14 2.42e-19 1.50e-94 1.12e-08 ...
##  $ padj             : num  2.27e-16 1.76e-13 1.94e-18 8.50e-93 4.66e-08 ...
##  $ T47D_control_rep1: num  289 5877 1464 1762 460 ...
##  $ T47D_control_rep2: num  324 6278 1568 1735 461 ...
##  $ T47D_control_rep3: num  212 4217 1207 1520 245 ...
##  $ MCF7_control_rep1: num  171 1258 1844 9544 215 ...
##  $ MCF7_control_rep2: num  133 1260 1774 9400 239 ...
##  $ MCF7_control_rep3: num  144 1157 1819 9706 247 ...
##  $ T47D_FAR_rep1    : num  1671 2953 2709 5433 249 ...
##  $ T47D_FAR_rep2    : num  1797 3098 2473 5140 271 ...
##  $ T47D_FAR_rep3    : num  1717 3184 2505 5312 283 ...
##  $ MCF7_FAR_rep1    : num  682 2532 2913 5185 158 ...
##  $ MCF7_FAR_rep2    : num  697 2648 2663 5112 104 ...
##  $ MCF7_FAR_rep3    : num  732 2440 2979 4936 159 ...
##  $ threshold        : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ symbol           : chr  "FIRRM" "NIPAL3" "LAS1L" "SEMA3F" ...
##  $ diffexpressed    : chr  "Downregulated" "Upregulated" "Downregulated" "Downregulated" ...

Results are writen as .csv file.

#write results as csv
write.csv(PAK1_gene, "PAK1_DEG.csv", row.names = TRUE)

PCA plot

PCA plot is made to ensure samples are clumping together and no outliers are present.

vstcounts <- vst(dds, blind = FALSE)
pcaData <- plotPCA(vstcounts, intgroup=c("Group"), returnData=TRUE)

percentVar <- round(100 * attr(pcaData, "percentVar"))

ggplot(pcaData, aes(PC1, PC2, color=Group)) +
  geom_point(size=2) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed()

Viualization

Heatmap

The first goal of this NGS analysis is to re-create the heatmap seen in Figure 2a. The one parameter that will be changed is the color of the grouping of samples. ### Load Data for Heatmap Merged results with gene symbols and up regulation and down regulation of genes specified is loaded into the object all_matrix.

#creating heatmap
#load DEG file
all_matrix <- read.csv("PAK1_DEG.csv", header = TRUE)
all_matrix <- tibble(all_matrix)

A data frame is constructed and edited by selecting samples and gene symbols from the DEG matrix .

#create dataframe with only sample columns from DEG dataframe
heatmap_matrix <- dplyr:::select.data.frame(all_matrix,symbol, T47D_control_rep1:MCF7_FAR_rep3)

#change duplicated gene symbols
uq_symbol_names <- make.names(heatmap_matrix[[1]], unique = T)
heatmap_matrix <- dplyr::select(heatmap_matrix, -symbol)
heatmap_matrix$symbol <- uq_symbol_names

#make gene symbols the row names
heatmap_matrix <- data.frame(heatmap_matrix, row.names = 13)

Matrix is saved as .rds.

saveRDS(heatmap_matrix, file = "heatmap_matrix_PAK1.rds")
# load RDS file of dds if necessary 
heatmap_matrix <- readRDS("heatmap_matrix_PAK1.rds")

Annotation file is loaded for heatmap.

final_anno <- read.csv("final_anno.csv", header = TRUE)
str(final_anno)

## 'data.frame':    12 obs. of  3 variables:
##  $ sample_name: chr  "T47D_control_rep1" "T47D_control_rep2" "T47D_control_rep3" "MCF7_control_rep1" ...
##  $ cell_line  : chr  "T47D" "T47D" "T47D" "MCF7" ...
##  $ Genotype   : chr  "WT" "WT" "WT" "WT" ...

Heatmap matrix and annotation file are visualized to ensure they match.

head(heatmap_matrix)

##        T47D_control_rep1 T47D_control_rep2 T47D_control_rep3 MCF7_control_rep1
## FIRRM         288.909615        323.802400        211.978970        170.693390
## NIPAL3       5876.869173       6278.168748       4217.498265       1257.513326
## LAS1L        1463.876535       1567.923175       1207.396885       1844.136810
## SEMA3F       1761.941736       1734.861301       1520.065867       9543.705086
## RAD52         459.813894        460.518968        244.659062        214.987244
## MAD1L1          5.086437          2.878244          8.832457          1.080338
##        MCF7_control_rep2 MCF7_control_rep3 T47D_FAR_rep1 T47D_FAR_rep2
## FIRRM         132.644354         144.34106   1671.263145    1796.97257
## NIPAL3       1260.121365        1156.93219   2952.834487    3097.90172
## LAS1L        1774.390050        1819.13814   2708.725660    2473.10818
## SEMA3F       9400.353171        9706.11014   5432.891938    5139.73649
## RAD52         239.194737         246.81220    249.255700     270.92817
## MAD1L1          2.174498           1.10184      1.470535       3.94939
##        T47D_FAR_rep3 MCF7_FAR_rep1 MCF7_FAR_rep2 MCF7_FAR_rep3
## FIRRM      1716.7808     681.52063      696.5337      732.1293
## NIPAL3     3184.4770    2532.21189     2648.1059     2440.0037
## LAS1L      2504.6837    2912.81678     2663.4424     2978.5226
## SEMA3F     5312.0743    5184.55220     5112.1736     4936.4236
## RAD52       282.8148     158.18891      103.5215      158.9913
## MAD1L1        0.0000       1.18939        0.0000        0.0000

head(final_anno)

##         sample_name cell_line Genotype
## 1 T47D_control_rep1      T47D       WT
## 2 T47D_control_rep2      T47D       WT
## 3 T47D_control_rep3      T47D       WT
## 4 MCF7_control_rep1      MCF7       WT
## 5 MCF7_control_rep2      MCF7       WT
## 6 MCF7_control_rep3      MCF7       WT

colnames(heatmap_matrix)

##  [1] "T47D_control_rep1" "T47D_control_rep2" "T47D_control_rep3"
##  [4] "MCF7_control_rep1" "MCF7_control_rep2" "MCF7_control_rep3"
##  [7] "T47D_FAR_rep1"     "T47D_FAR_rep2"     "T47D_FAR_rep3"    
## [10] "MCF7_FAR_rep1"     "MCF7_FAR_rep2"     "MCF7_FAR_rep3"

Visualizing Heatmap

Establish parameters for heatmap

Row names are specified as column names and colors are specified for gene regulation and genotype.

#parameters for pheatmap
#change row names
row.names(final_anno) <- final_anno$sample_name
#expression value colors range from green to red 
heatmap_colors <- colorRampPalette(colors = c("green", "black", "red"))(30)
annotation_colors=list(
  Genotype=c(WT="blue4", FAR="goldenrod"))

Heatmap is created using pheatmap.

Color gradient for expression is changed compared to original figure. Clustering has also been altered as Belli et al. used either a different clustering method or used ComplexHeatmap to construct heatmap.

PAK1_heatmap <- as.ggplot(pheatmap(heatmap_matrix,
         color = heatmap_colors,
         annotation_colors = annotation_colors,
         annotation_col = final_anno[, "Genotype", drop = FALSE],
         cluster_cols = T,
         clustering_method = 'average',
         show_colnames = T,
         show_rownames = F,
         scale = "row",
         annotation_names_col = F,
         fontsize_col = 7,
         cellwidth = 12,
         cellheight = 0.02,
         treeheight_row = 0,
         treeheight_col = 10,
        
         ))

Figure 2a. Heatmap comparing expression of genes among WT and FAR cell lines.

Heatmap file is saved as an RDS.

saveRDS(PAK1_heatmap, file = "PAK1_heatmap_draft_fianl.rds")

Venn Diagram of shared upregulated genes between MCF7-FAR and T47D-FAR

Creating a list of up regulated MCF7-FAR genes

Genes that are up regulated in MCF7-FAR compared to MCF7-WT are inputted into the data frame reg_MCF7_upreg. This has been filtered for genes with a p adjusted value of lower than 0.05.

#creating a list of up regulated genes for MCF7-FAR
# Add gene symbols as a column if they are rownames
res_MCF7 <- as.data.frame(res_MCF7)
res_MCF7$gene <- rownames(res_MCF7)

# Create a new column for direction
res_MCF7$regulation <- ifelse(res_MCF7$log2FoldChange > 0, "Upregulated", "Downregulated")

#filter to significant genes (e.g., adjusted p < 0.05)
res_MCF7_filtered <- res_MCF7[res_MCF7$padj < 0.05 & !is.na(res_MCF7$padj), ]
res_MCF7_upreg <- res_MCF7_filtered[res_MCF7_filtered$log2FoldChange > 0 & res_MCF7_filtered$padj < 0.05 & !is.na(res_MCF7_filtered$padj), ]

# Select only the gene and regulation columns
reg_MCF7_upreg <- res_MCF7_upreg[, c("gene", "regulation")]

names(reg_MCF7_upreg)[names(reg_MCF7_upreg) == "regulation"] <- "regulation_MCF7"

# View the result
head(reg_MCF7_upreg)

##                            gene regulation_MCF7
## ENSG00000001617 ENSG00000001617     Upregulated
## ENSG00000002016 ENSG00000002016     Upregulated
## ENSG00000002549 ENSG00000002549     Upregulated
## ENSG00000003147 ENSG00000003147     Upregulated
## ENSG00000003249 ENSG00000003249     Upregulated
## ENSG00000003402 ENSG00000003402     Upregulated

Creating a list of up regulated T47D-FAR genes

Genes that are up regulated in T47D-FAR compared to T47D-WT are inputted into the data frame reg_T47D_upreg. This has been filtered for genes with a p adjusted value of lower than 0.05.

# Add gene symbols as a column if they are rownames
res_T47D <- as.data.frame(res_T47D)
res_T47D$gene <- rownames(res_T47D)

# Create a new column for direction
res_T47D$regulation <- ifelse(res_T47D$log2FoldChange > 0, "Upregulated", "Downregulated")

#filter to significant genes (e.g., adjusted p < 0.05)
res_T47D_filtered <- res_T47D[res_T47D$padj < 0.05 & !is.na(res_T47D$padj), ]
res_T47D_upreg <- res_T47D_filtered[res_T47D_filtered$log2FoldChange > 0 & res_T47D_filtered$padj < 0.05 & !is.na(res_T47D_filtered$padj), ]

# Select only the gene and regulation columns
reg_T47D_upreg<- res_T47D_upreg[, c("gene", "regulation")]

names(reg_T47D_upreg)[names(reg_T47D_upreg) == "regulation"] <- "regulation_T47D"

# View the result
head(reg_T47D_upreg)

##                            gene regulation_T47D
## ENSG00000001461 ENSG00000001461     Upregulated
## ENSG00000002834 ENSG00000002834     Upregulated
## ENSG00000002919 ENSG00000002919     Upregulated
## ENSG00000003147 ENSG00000003147     Upregulated
## ENSG00000003249 ENSG00000003249     Upregulated
## ENSG00000003756 ENSG00000003756     Upregulated

Isolate genes and store them as vectors.

genes_MCF7 <- reg_MCF7_upreg$gene
genes_T47D <- reg_T47D_upreg$gene

Create a gene list of upregulated genes in both conditions.

venn_list <- list(
  MCF7_FAR = genes_MCF7,
  T47D_FAR = genes_T47D
)

Plot Venn Diagram using ggVennDiagram

Image appears different than Belli et al. Figure 2b most likely because of a different package that was used to plot the venn diagram.

ggVennDiagram(venn_list) + coord_flip()

Figure 2b. Venn Diagram of shared upregulated genes in T47D-FAR and MCF7-FAR cells.

Venn Diagram is saved as RDS.

VennDiagram <- ggVennDiagram(venn_list) + coord_flip() 
saveRDS(VennDiagram, file = "venn_diagram.rds")

GO enrich plot of Common Upregulated genes

Retrieve data for GO Enrich Terms Bar Plot

Selecting and filtering to have plot only contain genes upregulated in T47D-FAR and MCF7-FAR.

FAR_GO_terms <- all_matrix[ , -c(9:14) ] 
FAR_GO_upreg <- FAR_GO_terms[FAR_GO_terms$log2FoldChange > 0, ]
head(FAR_GO_upreg)

## # A tibble: 6 × 17
##       X input_id        baseMean log2FoldChange  lfcSE  stat   pvalue     padj
##   <int> <chr>              <dbl>          <dbl>  <dbl> <dbl>    <dbl>    <dbl>
## 1     2 ENSG00000001461  3075.             1.10 0.145   7.61 2.84e-14 1.76e-13
## 2     5 ENSG00000002016   257.             1.47 0.257   5.71 1.12e- 8 4.66e- 8
## 3     6 ENSG00000002822     2.31           3.64 1.57    2.31 2.08e- 2 3.95e- 2
## 4     7 ENSG00000003096   253.             1.61 0.232   6.93 4.20e-12 2.27e-11
## 5     9 ENSG00000003147  2056.             1.12 0.0981 11.4  3.41e-30 4.28e-29
## 6    11 ENSG00000003756   933.             1.01 0.104   9.70 2.87e-22 2.65e-21
## # ℹ 9 more variables: T47D_FAR_rep1 <dbl>, T47D_FAR_rep2 <dbl>,
## #   T47D_FAR_rep3 <dbl>, MCF7_FAR_rep1 <dbl>, MCF7_FAR_rep2 <dbl>,
## #   MCF7_FAR_rep3 <dbl>, threshold <lgl>, symbol <chr>, diffexpressed <chr>

Filter genes by p adjusted value and extract log2FoldChange

# Extract significant results based on padj
sig_genes_df = subset(all_matrix, padj < 0.05)

#filter using log2FoldChange
sig_genes <- sig_genes_df$log2FoldChange

# Name the vector 
names(sig_genes) <- sig_genes_df$input_id

# omit NA values
sig_genes <- na.omit(sig_genes)

# significant results based on log2foldchange
sig_genes <- names(sig_genes)[abs(sig_genes) > 1]
head(sig_genes)

## [1] "ENSG00000000460" "ENSG00000001461" "ENSG00000001497" "ENSG00000001617"
## [5] "ENSG00000002016" "ENSG00000002822"

EnrichGO analysis

go_enrich <- enrichGO(gene = sig_genes,
                      OrgDb = org.Hs.eg.db, 
                      keyType = 'ENSEMBL',
                      readable = T,
                      ont = "MF",
                      pvalueCutoff = 0.05, 
                      qvalueCutoff = 0.05)

Plotting results as a bar plot. Seven total terms are plotted, similaryly to Figure 2c.

GO_barplot <- as.ggplot(barplot(go_enrich, showCategory = 7, title = "GO Molecular Function", font.size = 6  ))
GO_barplot

Figure 2c. Bar plot of GO Enrichment terms for molecular function in genes that are up regulated in MCF7-FAR and T47D- FAR cell lines.

Bar plot is saved as an RDS.

saveRDS(GO_barplot, file = "go_barplot.rds")

Figure Panel

Figure panel is created using cowplot.

#create figure panel
cowplot::plot_grid(
  PAK1_heatmap, 
  VennDiagram, 
  GO_barplot, 
  labels = letters[1:3],
  ncol = 2,
  rel_widths = c(6, 4, 12))

Figure 2. a) Heatmap comparing expression of genes among WT and FAR cell lines. b) Venn Diagram of shared upregulated genes in T47D-FAR and MCF7-FAR cells. c) Bar plot of GO Enrichment terms for molecular function in genes that are up regulated in MCF7-FAR and T47D- FAR cell lines.

Conclusions

The NGS analysis I performed showed similar results to Figure 2a-c published in Belli et al. In the heatmap, my results showed that T47D cell lines had more similar upregulated genes than what was published in Belli et al. This was not too surprosing given the PCA plot I constructed showed that cell lines were more similar than their WT or FAR genotypes. However, the FAR genotypes and the MCF7 cell lines had many differentially expressed genes, which was seen in the publication. The differences seen in the heatmap can be explained by establishing different parameters and filtering than Belli et al. The parameters of analysis were not mentioned in the paper, so I performed filtering based on average filtering statistics. This would result in a different gene order than Figure 2a as well as a different list of genes that were being analyzed. One visualization that I changed in this heatmap was the clustering. In Belli et al. the clustering of samples was very organized as the genotypes were clustered. Pheatmap would not allow me to cluster in the manner I wanted to. I do prefer how Belli et al. clustered their samples as it is easier to interpret the data.
The venn diagram I constructed also showed a different output than Figure 2b. Figure 2b seemed to be using the package VennDiagram, where I used ggVennDiagram. I believe that using ggVennDiagram was able to put together a nicer visual plot than VennDiagram. The amount of genes and percentages of shared genes that I analyzed is much larger than what is seen in Figure 2b. Again, this is due to not knowing the filtering parameters Belli et al. used in their analysis. However, there is similarity in that T47D-FAR cells had more upregulated genes than MCF7-FAR cells.
The last Figure I constructed was a bar plot based on GO enrichment terms for molecular function in shared up regulated genes between MCF7-FAR and T47D-FAR.This plot was very informative as it showed that up regulated genes of these two conditions have protein serine kinase activity. This is representative of up regulation of PAK1 in resistant cell lines, as PAK1 phosphorylates serine residues. Belli et al. constructed their bar plot based on Gene Set Enrichment Analysis terms. I chose to do GO terms because it was able to give more insight on molecular function, while their plot was representative of terms that are very common in cancer progression, and would likely be seen when comparing the WT genotyes as well.
I found this project to be extremely rewarding for understanding how to visualize NGS data using R. While I hit many roadblocks in the visualization of data and creating data matrices, I was able to better understand my errors and construct plots that can have results extracted from them.

Session Info

sessionInfo()

## R version 4.5.0 (2025-04-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.3.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] fstcore_0.10.0              biomaRt_2.64.0             
##  [3] ggVennDiagram_1.5.2         org.Hs.eg.db_3.21.0        
##  [5] AnnotationDbi_1.70.0        ashr_2.2-63                
##  [7] ggpubr_0.6.0                lubridate_1.9.4            
##  [9] forcats_1.0.0               stringr_1.5.1              
## [11] purrr_1.0.4                 readr_2.1.5                
## [13] tibble_3.2.1                tidyverse_2.0.0            
## [15] ggplotify_0.1.2             patchwork_1.3.0            
## [17] pathview_1.48.0             msigdbr_10.0.2             
## [19] cowplot_1.1.3               knitr_1.50                 
## [21] enrichplot_1.28.0           ggnewscale_0.5.1           
## [23] DOSE_4.2.0                  clusterProfiler_4.16.0     
## [25] EnhancedVolcano_1.26.0      ggrepel_0.9.6              
## [27] RColorBrewer_1.1-3          pheatmap_1.0.12            
## [29] genekitr_1.2.8              ggplot2_3.5.2              
## [31] tidyr_1.3.1                 dplyr_1.1.4                
## [33] DESeq2_1.48.0               SummarizedExperiment_1.38.0
## [35] Biobase_2.68.0              MatrixGenerics_1.20.0      
## [37] matrixStats_1.5.0           GenomicRanges_1.60.0       
## [39] GenomeInfoDb_1.44.0         IRanges_2.42.0             
## [41] S4Vectors_0.46.0            BiocGenerics_0.54.0        
## [43] generics_0.1.3             
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.5.0           later_1.4.2             filelock_1.0.3         
##   [4] bitops_1.0-9            urltools_1.7.3          R.oo_1.27.0            
##   [7] triebeard_0.4.1         polyclip_1.10-7         graph_1.86.0           
##  [10] XML_3.99-0.18           httr2_1.1.2             lifecycle_1.0.4        
##  [13] mixsqp_0.3-54           rstatix_0.7.2           lattice_0.22-7         
##  [16] MASS_7.3-65             backports_1.5.0         magrittr_2.0.3         
##  [19] openxlsx_4.2.8          sass_0.4.10             rmarkdown_2.29         
##  [22] jquerylib_0.1.4         yaml_2.3.10             remotes_2.5.0          
##  [25] httpuv_1.6.16           ggtangle_0.0.6          zip_2.3.2              
##  [28] sessioninfo_1.2.3       pkgbuild_1.4.7          ggvenn_0.1.10          
##  [31] DBI_1.2.3               abind_1.4-8             pkgload_1.4.0          
##  [34] R.utils_2.13.0          RCurl_1.98-1.17         ggraph_2.2.1           
##  [37] yulab.utils_0.2.0       rappdirs_0.3.3          tweenr_2.0.3           
##  [40] GenomeInfoDbData_1.2.14 irlba_2.3.5.1           tidytree_0.4.6         
##  [43] codetools_0.2-20        DelayedArray_0.34.1     xml2_1.3.8             
##  [46] ggforce_0.4.2           tidyselect_1.2.1        aplot_0.2.5            
##  [49] UCSC.utils_1.4.0        farver_2.1.2            viridis_0.6.5          
##  [52] BiocFileCache_2.16.0    jsonlite_2.0.0          fst_0.9.8              
##  [55] Formula_1.2-5           ellipsis_0.3.2          tidygraph_1.3.1        
##  [58] tools_4.5.0             progress_1.2.3          treeio_1.32.0          
##  [61] Rcpp_1.0.14             glue_1.8.0              gridExtra_2.3          
##  [64] SparseArray_1.8.0       xfun_0.52               qvalue_2.40.0          
##  [67] usethis_3.1.0           withr_3.0.2             fastmap_1.2.0          
##  [70] truncnorm_1.0-9         digest_0.6.37           timechange_0.3.0       
##  [73] R6_2.6.1                mime_0.13               gridGraphics_0.5-1     
##  [76] colorspace_2.1-1        GO.db_3.21.0            RSQLite_2.3.9          
##  [79] R.methodsS3_1.8.2       utf8_1.2.4              data.table_1.17.0      
##  [82] prettyunits_1.2.0       graphlayouts_1.2.2      httr_1.4.7             
##  [85] htmlwidgets_1.6.4       S4Arrays_1.7.2          pkgconfig_2.0.3        
##  [88] gtable_0.3.6            blob_1.2.4              XVector_0.48.0         
##  [91] htmltools_0.5.8.1       carData_3.0-5           geneset_0.2.7          
##  [94] profvis_0.4.0           fgsea_1.34.0            msigdbdf_24.1.1        
##  [97] scales_1.4.0            png_0.1-8               ggfun_0.1.8            
## [100] rstudioapi_0.17.1       tzdb_0.5.0              reshape2_1.4.4         
## [103] curl_6.2.2              nlme_3.1-168            cachem_1.1.0           
## [106] parallel_4.5.0          miniUI_0.1.2            pillar_1.10.2          
## [109] grid_4.5.0              vctrs_0.6.5             urlchecker_1.0.1       
## [112] promises_1.3.2          car_3.1-3               dbplyr_2.5.0           
## [115] xtable_1.8-4            Rgraphviz_2.52.0        evaluate_1.0.3         
## [118] KEGGgraph_1.68.0        invgamma_1.1            cli_3.6.5              
## [121] locfit_1.5-9.12         compiler_4.5.0          rlang_1.1.6            
## [124] crayon_1.5.3            SQUAREM_2021.1          ggsignif_0.6.4         
## [127] labeling_0.4.3          plyr_1.8.9              fs_1.6.6               
## [130] stringi_1.8.7           viridisLite_0.4.2       BiocParallel_1.42.0    
## [133] assertthat_0.2.1        babelgene_22.9          Biostrings_2.76.0      
## [136] lazyeval_0.2.2          devtools_2.4.5          pacman_0.5.1           
## [139] GOSemSim_2.34.0         Matrix_1.7-3            europepmc_0.4.3        
## [142] hms_1.1.3               bit64_4.6.0-1           KEGGREST_1.48.0        
## [145] shiny_1.10.0            broom_1.0.8             igraph_2.1.4           
## [148] memoise_2.0.1           bslib_0.9.0             ggtree_3.16.0          
## [151] fastmatch_1.1-6         bit_4.6.0               ape_5.8-1              
## [154] gson_0.1.0

MMG 3320 Final Project: Analysis of Belli et al.

Lily Campbell

2025-04-24