As transcriptomics and next-gen sequencing expands in affordability and technology, the amount of gene expression data available explodes. Bioinformatic tools like HISAT2 and STAR were created with the intention of streamlining the bioinformatic pipeline and to more easily quantify gene expression. HISAT2 and STAR are two of the most prominent read alignment tools. It is important to understand the key differences in performance when comparing HISAT to STAR, as their alignment is a quintessential aspect of the process, and much of the results of these bioinformatic analyses hinge on the accuracy and precision of these aligners.
In order to assess the performance of these two aligners, I have chosen a dataset which came from a study on the heterogeneity in a mouse model of metastatic breast cancer (accession number GSE165393). GSE165393 is a dataset of organ-derived metastatic cell lines that were harvested from 10 female MMTV-PyMT mice. MMTV-PyMT female mice develop palpable mammary tumors which metastasize to the lung. Tumor-bearing organs from these mice were harvested for sample collection. These cell lines were isolated and sorted based on the expression of surface cancer markers, specifically CD44low/EpCAMhigh or CD44high/EpCAMhigh with previously sorted cell lines used as positive controls for comparison. To acquire the raw data for bulk RNA sequencing, total RNA was extracted using the QIAGEN RNeasy kit, with 2 distinct clonal isolate replicates per sample. A modified SMART-seq2 protocol was used to generate cDNA, and the Nextera XT DNA Sample Prep Kit was used to build Illumina libraries. Samples were then sequenced on a NextSeq500 with a minimum depth of 10 M reads.
This dataset is a compilation of counts files from 8 samples of mice tissue, which all have different treatments:
Each sample is either from Bone Marrow (BM), or Lymph Node (LN) tissue. The mouse either had high levels of CD44, a cell surface marker, or low levels of CD44.
The goal of this analysis is to compare and contrast HISAT2 aligner with STAR aligner. To do this, I will conduct differential expression (DE) analysis on counts files created by HISAT2 and STAR on the same dataset, simultaneously.
#BiocManager::install("limma")
#BiocManager::install("sva")
library(knitr)
library(DESeq2)
library(RColorBrewer)
library(pheatmap)
library(ggplot2)
library(tidyverse)
library(pheatmap)
library(cowplot)
library(tidyr)
library(dplyr)
library(patchwork)
library(ggplotify)
library(DESeq2)
library(ggvenn)
library(limma)
Hdirectory <- "~/final_proj/Rstudio/counts3/hisat2CountsFiles" ##path to where HTseq counts are
Sdirectory <- "~/final_proj/Rstudio/counts3/star"
sampleTableHisat <- read.csv("~/final_proj/Anna_Rees_Final_Project/counts3/sampleTableHisat.csv", header = T)
sampleTableStar <- read.csv("~/final_proj/Anna_Rees_Final_Project/counts3/sampleTableStar.csv", header = T)
head(sampleTableStar)
## SampleName
## 1 BMLow1
## 2 BMLow2
## 3 BMHigh1
## 4 BMHigh2
## 5 LNLow1
## 6 LNLow2
## fileName
## 1 counts3/star/SRR13515350_GSM5032720_BM1low_Mus_musculus_RNA-Seq_Aligned.sortedByCoord.out.ENSG.count
## 2 counts3/star/SRR13515351_GSM5032721_BM2low_Mus_musculus_RNA-Seq_Aligned.sortedByCoord.out.ENSG.count
## 3 counts3/star/SRR13515352_GSM5032722_BM1High_Mus_musculus_RNA-Seq_Aligned.sortedByCoord.out.ENSG.count
## 4 counts3/star/SRR13515353_GSM5032723_BM2High_Mus_musculus_RNA-Seq_Aligned.sortedByCoord.out.ENSG.count
## 5 counts3/star/SRR13515354_GSM5032724_LN1low_Mus_musculus_RNA-Seq_Aligned.sortedByCoord.out.ENSG.count
## 6 counts3/star/SRR13515355_GSM5032725_LN2low_Mus_musculus_RNA-Seq_Aligned.sortedByCoord.out.ENSG.count
## Tissue CD44Level Condition Rep Group
## 1 Bone Marrow Low BM 1 BMLow
## 2 Bone Marrow Low BM 2 BMLow
## 3 Bone Marrow High BM 1 BMHigh
## 4 Bone Marrow High BM 2 BMHigh
## 5 Lymph Node Low LN 1 LNLow
## 6 Lymph Node Low LN 2 LNLow
Next, I must specify the columns of interest as factors. For my purposes, this will be the “condition”, “Tissue”, and “CD44Level” columns
hisatCD44Level <- factor(sampleTableHisat$CD44Level, levels = c("Low", "High"))
starCD44Level <- factor(sampleTableStar$CD44Level, levels = c("Low", "High"))
#
hisatTissue <- factor(sampleTableHisat$Tissue, levels = c("Bone Marrow", "Lymph Node"))
starTissue <- factor(sampleTableStar$Tissue, levels = c("Bone Marrow", "Lymph Node"))
#
hisatcondition <- factor(sampleTableHisat$condition, levels = c("BM:Low", "BM:High", "LN:Low", "LN:High"))
starcondition <- factor(sampleTableStar$condition, levels = c("BM:Low", "BM:High", "LN:Low", "LN:High"))
hisatGroup <- factor(sampleTableHisat$Group)
starGroup <- factor(sampleTableStar$Group)
Firstly, I will create DDSs for HISAT2 and STAR with the CD44 level as the condition.
hisat_dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTableHisat,
directory = ".",
design = ~Group)
star_dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTableStar,
directory = ".",
design = ~Group)
head(hisat_dds)
## class: DESeqDataSet
## dim: 6 8
## metadata(1): version
## assays(1): counts
## rownames(6): ENSMUSG00000000001.5 ENSMUSG00000000003.16 ...
## ENSMUSG00000000037.18 ENSMUSG00000000049.12
## rowData names(0):
## colnames(8): BMLow1 BMLow2 ... LNHigh1 LNHigh2
## colData names(5): Tissue CD44Level Condition Rep Group
Pre-filtering low count genes before running DESeq2 functions in order to remove rows with very few reads and to reduce the memory size of the dds data objects
keepH <- rowSums(counts(hisat_dds)) >= 10
hisat_dds <- hisat_dds[keepH,]
keepS <- rowSums(counts(star_dds)) >= 10
star_dds <- star_dds[keepS,]
#Deciding the factor level - "Low" represents the control
hisat_dds$Group <- relevel(hisat_dds$Group, ref = "BMLow")
hisat_dds <- DESeq(hisat_dds)
hisat_res <- results(hisat_dds)
star_dds <- DESeq(star_dds)
star_res <- results(star_dds)
summary(hisat_res)
##
## out of 20591 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 26, 0.13%
## LFC < 0 (down) : 110, 0.53%
## outliers [1] : 0, 0%
## low counts [2] : 5589, 27%
## (mean count < 6)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
summary(star_res)
##
## out of 19908 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 50, 0.25%
## LFC < 0 (down) : 145, 0.73%
## outliers [1] : 0, 0%
## low counts [2] : 5790, 29%
## (mean count < 7)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
hisat_rld1 <- rlog(hisat_dds, blind=TRUE)
hisat_rld_mat1 <- assay(hisat_rld1)
hisat_rld_cor1 <- cor(hisat_rld_mat1)
heat.colors <- brewer.pal(6, "Blues")
hisat_heatmap1 <- pheatmap(hisat_rld_cor1, color = heat.colors, main = "HISAT2 Heatmap")
hisat_heatmap1 <- as.ggplot(hisat_heatmap1)
star_rld1 <- rlog(star_dds, blind=TRUE)
star_rld_mat1 <- assay(star_rld1)
star_rld_cor1 <- cor(star_rld_mat1)
heat.colors <- brewer.pal(6, "Blues")
star_heatmap1 <- pheatmap(star_rld_cor1, color = heat.colors, main = "STAR Heatmap")
star_heatmap1 <- as.ggplot(star_heatmap1)
ggsave(filename = "hisat_heatmap1.png",
plot = hisat_heatmap1,
units = "in",
width = 5,
height = 6,
dpi = 300)
ggsave(filename = "star_heatmap1.png",
plot = star_heatmap1,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(hisat_heatmap1, file = "hisat_heatmap1.png")
saveRDS(star_heatmap1, file = "star_heatmap1.png")
hisat_heatmap_plot <- readRDS("hisat_heatmap1.png")
star_heatmap_plot <- readRDS("star_heatmap1.png")
figHeatmap <- plot_grid(hisat_heatmap_plot, star_heatmap_plot,
ncol = 2,
labels = "A")
figHeatmap
The correlation heatmaps above show the hierarchical clustering of each condition type in the dataset. It appears there is higher correlation between the LNHigh1 and LNlow1 samples compared to other samples. The HISAT2 heatmap and the STAR heatmap appear to draw similar, although still different, correlations. The high correlation between different samples may be due to noise in the dataset, or unnormalized counts data.
hisat_vsd <- vst(hisat_dds, blind=FALSE)
plotPCA(hisat_vsd, intgroup=c("Group")) + theme_classic()
### A better PCA I have recreated the PCA above but with better
aesthetics
print(colData(hisat_dds))
## DataFrame with 8 rows and 6 columns
## Tissue CD44Level Condition Rep Group sizeFactor
## <character> <character> <character> <integer> <factor> <numeric>
## BMLow1 Bone Marrow Low BM 1 BMLow 0.816003
## BMLow2 Bone Marrow Low BM 2 BMLow 1.570110
## BMHigh1 Bone Marrow High BM 1 BMHigh 0.623038
## BMHigh2 Bone Marrow High BM 2 BMHigh 3.340993
## LNLow1 Lymph Node Low LN 1 LNLow 0.518614
## LNLow2 Lymph Node Low LN 2 LNLow 1.269942
## LNHigh1 Lymph Node High LN 1 LNHigh 0.680636
## LNHigh2 Lymph Node High LN 2 LNHigh 0.898237
hisat_PCA <- plotPCA(hisat_vsd, intgroup = c("Tissue", "CD44Level"), returnData = TRUE)
percentVar <- round(100 * attr(hisat_PCA, "percentVar"))
hisat_pca_plot <- ggplot(hisat_PCA, aes(PC1, PC2, color=Tissue, shape=CD44Level)) +
geom_point(size=3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
coord_fixed() +
ggtitle("PCA Plot of HISAT2 Counts")+
theme_classic()
hisat_pca_plot <- as.ggplot(hisat_pca_plot)
star_vsd <- vst(star_dds, blind=FALSE)
star_PCA <- plotPCA(star_vsd, intgroup = c("Tissue", "CD44Level"), returnData = TRUE)
percentVar <- round(100 * attr(star_PCA, "percentVar"))
star_pca_plot <- ggplot(star_PCA, aes(PC1, PC2, color=Tissue, shape=CD44Level)) +
geom_point(size=3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
coord_fixed() +
ggtitle("PCA Plot of STAR Counts")+
theme_classic()
star_pca_plot <- as.ggplot(star_pca_plot)
ggsave(filename = "hisat_pca.png",
plot = hisat_pca_plot,
units = "in",
width = 5,
height = 6,
dpi = 300)
ggsave(filename = "star_pca.png",
plot = star_pca_plot,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(hisat_pca_plot, file = "hisat_pca.png")
saveRDS(star_pca_plot, file = "star_pca.png")
hisat_pca_plot <- readRDS("hisat_pca.png")
star_pca_plot <- readRDS("star_pca.png")
figPCA1 <- plot_grid(hisat_pca_plot, star_pca_plot,
ncol = 2,
labels = "B")
figPCA1
PCA plots provide and understanding of how the features in the data
contribute to the principle components, and the distance between points
indicates how similar they are in terms of PCA values. This data appears
to be highly variable, as the plot points are widely spread across the
plot. Interestingly, the samples seem to cluster somewhat by tissue type
rather than by CD44 type. This indicates there may be a stronger
relationship between gene expression and the source tissue than between
the gene expression and CD44 level. Again, the HISAT2 data and the STAR
data seem to cluster in similar ways, although the variance for each PCA
variable is slightly different.
figPCA_Heatmap <- plot_grid(figHeatmap, figPCA1,
nrow = 2)
figPCA_Heatmap
hisat_res <- results(hisat_dds, contrast=c('Group', 'BMHigh', 'BMLow'))
head(hisat_res)
## log2 fold change (MLE): Group BMHigh vs BMLow
## Wald test p-value: Group BMHigh vs BMLow
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000000001.5 4908.86896 -0.451071 0.468883 -0.962013 0.336043
## ENSMUSG00000000028.16 1193.87193 -0.081853 0.549126 -0.149061 0.881506
## ENSMUSG00000000037.18 46.38722 0.894413 0.690580 1.295162 0.195264
## ENSMUSG00000000049.12 2.17764 0.647286 2.252424 0.287373 0.773827
## ENSMUSG00000000056.8 958.65044 -0.506372 0.530025 -0.955374 0.339389
## ENSMUSG00000000058.7 2421.74679 -0.321896 0.485281 -0.663317 0.507127
## padj
## <numeric>
## ENSMUSG00000000001.5 0.983242
## ENSMUSG00000000028.16 0.993699
## ENSMUSG00000000037.18 0.983242
## ENSMUSG00000000049.12 NA
## ENSMUSG00000000056.8 0.983242
## ENSMUSG00000000058.7 0.983242
dim(hisat_res)
## [1] 20591 6
hisat_result <- hisat_res[order(hisat_res$padj), ]
head(hisat_result)
## log2 fold change (MLE): Group BMHigh vs BMLow
## Wald test p-value: Group BMHigh vs BMLow
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000019831.15 65.359 3.87116 0.827763 4.67665 2.91592e-06
## ENSMUSG00000003135.16 132.135 3.80900 0.857747 4.44070 8.96665e-06
## ENSMUSG00000027637.4 228.286 3.74790 0.855406 4.38142 1.17906e-05
## ENSMUSG00000047843.17 231.727 3.90781 0.879141 4.44503 8.78790e-06
## ENSMUSG00000029229.9 204.232 3.58455 0.857485 4.18030 2.91122e-05
## ENSMUSG00000040331.17 325.703 3.69899 0.889321 4.15934 3.19162e-05
## padj
## <numeric>
## ENSMUSG00000019831.15 0.0367611
## ENSMUSG00000003135.16 0.0371610
## ENSMUSG00000027637.4 0.0371610
## ENSMUSG00000047843.17 0.0371610
## ENSMUSG00000029229.9 0.0640113
## ENSMUSG00000040331.17 0.0640113
table(hisat_result$padj<0.05)
##
## FALSE TRUE
## 12603 4
Doing the same for STAR
star_res <- results(star_dds, contrast=c('Group', 'BMHigh', 'BMLow'))
head(star_res)
## log2 fold change (MLE): Group BMHigh vs BMLow
## Wald test p-value: Group BMHigh vs BMLow
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000000001.5 5010.54538 -0.4520733 0.473439 -0.954871 0.339643
## ENSMUSG00000000028.16 1238.02136 -0.0932176 0.544633 -0.171157 0.864101
## ENSMUSG00000000037.18 47.48724 0.9486842 0.709786 1.336578 0.181360
## ENSMUSG00000000049.12 2.49053 0.8030867 2.007971 0.399949 0.689194
## ENSMUSG00000000056.8 998.47371 -0.4899313 0.528291 -0.927389 0.353725
## ENSMUSG00000000058.7 2536.63457 -0.3106055 0.482653 -0.643538 0.519875
## padj
## <numeric>
## ENSMUSG00000000001.5 0.993172
## ENSMUSG00000000028.16 0.996036
## ENSMUSG00000000037.18 0.993172
## ENSMUSG00000000049.12 NA
## ENSMUSG00000000056.8 0.993172
## ENSMUSG00000000058.7 0.993172
dim(star_res)
## [1] 19908 6
star_result <- star_res[order(star_res$padj), ]
head(star_result)
## log2 fold change (MLE): Group BMHigh vs BMLow
## Wald test p-value: Group BMHigh vs BMLow
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000019831.15 66.5985 3.68168 0.795608 4.62751 3.70091e-06
## ENSMUSG00000003135.16 135.2666 3.70524 0.850746 4.35528 1.32894e-05
## ENSMUSG00000027637.4 239.9591 3.55814 0.833949 4.26662 1.98461e-05
## ENSMUSG00000047843.17 238.2003 3.69590 0.866000 4.26778 1.97427e-05
## ENSMUSG00000029229.9 208.1444 3.55220 0.856138 4.14910 3.33791e-05
## ENSMUSG00000040331.17 331.1430 3.58783 0.876219 4.09467 4.22779e-05
## padj
## <numeric>
## ENSMUSG00000019831.15 0.0479638
## ENSMUSG00000003135.16 0.0643013
## ENSMUSG00000027637.4 0.0643013
## ENSMUSG00000047843.17 0.0643013
## ENSMUSG00000029229.9 0.0782745
## ENSMUSG00000040331.17 0.0782745
table(star_result$padj<0.05)
##
## FALSE TRUE
## 12959 1
hisat_resdata <- merge(as.data.frame(hisat_result), as.data.frame(counts(hisat_dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(hisat_resdata)[1] <- "Gene_id"
head(hisat_resdata)
## Gene_id baseMean log2FoldChange lfcSE stat
## 1 ENSMUSG00000019831.15 65.35899 3.871164 0.8277634 4.676655
## 2 ENSMUSG00000003135.16 132.13549 3.808998 0.8577472 4.440700
## 3 ENSMUSG00000027637.4 228.28555 3.747895 0.8554056 4.381425
## 4 ENSMUSG00000047843.17 231.72657 3.907809 0.8791410 4.445031
## 5 ENSMUSG00000029229.9 204.23224 3.584547 0.8574853 4.180302
## 6 ENSMUSG00000040331.17 325.70293 3.698992 0.8893208 4.159345
## pvalue padj BMLow1 BMLow2 BMHigh1 BMHigh2 LNLow1
## 1 2.915924e-06 0.03676106 7.352912 18.47004 54.57133 330.1414 44.34895
## 2 8.966653e-06 0.03716098 26.960678 33.75559 49.75621 800.3610 44.34895
## 3 1.179059e-05 0.03716098 64.950723 37.57698 70.62172 1297.5186 109.90827
## 4 8.787903e-06 0.03716098 56.372326 44.58285 73.83180 1435.5015 107.98006
## 5 2.911223e-05 0.06401130 72.303635 35.02939 70.62172 1206.8270 69.41575
## 6 3.191616e-05 0.06401130 82.107518 70.05877 81.85699 1889.5582 219.81655
## LNLow2 LNHigh1 LNHigh2
## 1 25.19800 24.97664 17.81267
## 2 53.54575 24.97664 23.37913
## 3 99.21713 80.80676 65.68422
## 4 47.24625 38.19956 50.09814
## 5 70.08194 61.70698 47.87155
## 6 99.21713 98.43733 64.57093
#write results
write.csv(hisat_resdata, file="HISAT_BMHighvsBMLow _unfiltered_matrix.csv")
star_resdata <- merge(as.data.frame(star_result), as.data.frame(counts(star_dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(star_resdata)[1] <- "Gene_id"
head(star_resdata)
## Gene_id baseMean log2FoldChange lfcSE stat
## 1 ENSMUSG00000019831.15 66.59854 3.681685 0.7956085 4.627508
## 2 ENSMUSG00000003135.16 135.26662 3.705243 0.8507465 4.355284
## 3 ENSMUSG00000027637.4 239.95911 3.558138 0.8339486 4.266615
## 4 ENSMUSG00000047843.17 238.20027 3.695900 0.8660002 4.267781
## 5 ENSMUSG00000029229.9 208.14436 3.552198 0.8561379 4.149096
## 6 ENSMUSG00000040331.17 331.14303 3.587826 0.8762195 4.094665
## pvalue padj BMLow1 BMLow2 BMHigh1 BMHigh2 LNLow1
## 1 3.700912e-06 0.04796382 10.94213 19.09247 56.06919 331.9771 40.57350
## 2 1.328944e-05 0.06430131 30.39481 35.63929 51.26326 808.2792 42.50557
## 3 1.984608e-05 0.06430131 72.94755 47.09477 78.49686 1327.6155 117.85635
## 4 1.974266e-05 0.06430131 66.86859 51.54968 76.89489 1451.7409 104.33185
## 5 3.337908e-05 0.07827447 72.94755 38.18495 68.88500 1225.4461 71.48664
## 6 4.227788e-05 0.07827447 89.96865 73.18782 86.50675 1870.0788 231.84855
## LNLow2 LNHigh1 LNHigh2
## 1 27.88879 28.27529 17.96992
## 2 56.57440 28.27529 29.20112
## 3 104.38375 89.29038 81.98777
## 4 53.38711 40.18067 60.64849
## 5 73.30767 65.47961 49.41729
## 6 109.96150 114.58932 73.00281
#write results
write.csv(star_resdata, file="STAR_BMHighvsBMLow _unfiltered_matrix.csv")
Now, getting a list of DEGs and filtering for rows which meet the criteria of having an absolute log2FC of 1 and padj of less than 0.05.
#Get DEG up/down lists
hisat_resdata_DEGs <- as.data.frame(hisat_result) %>% dplyr::filter(abs(log2FoldChange) > 1 & padj < 0.05) %>% tibble::rownames_to_column("Gene_id")
star_resdata_DEGs <- as.data.frame(star_result) %>% dplyr::filter(abs(log2FoldChange) > 1 & padj < 0.05) %>% tibble::rownames_to_column("Gene_id")
#write results
write.csv(star_resdata_DEGs, "STAR_CD44LowVsHigh_log2fc1_fdr05_DEGlist.csv", row.names = FALSE)
write.csv(hisat_resdata_DEGs, "HISAT_CD44LowVsHigh_log2fc1_fdr05_DEGlist.csv", row.names = FALSE)
head(hisat_resdata_DEGs)
## Gene_id baseMean log2FoldChange lfcSE stat
## 1 ENSMUSG00000019831.15 65.35899 3.871164 0.8277634 4.676655
## 2 ENSMUSG00000003135.16 132.13549 3.808998 0.8577472 4.440700
## 3 ENSMUSG00000027637.4 228.28555 3.747895 0.8554056 4.381425
## 4 ENSMUSG00000047843.17 231.72657 3.907809 0.8791410 4.445031
## pvalue padj
## 1 2.915924e-06 0.03676106
## 2 8.966653e-06 0.03716098
## 3 1.179059e-05 0.03716098
## 4 8.787903e-06 0.03716098
The following plots are taken inspiration from Fig. 6 in Raplee et. al.
The goal of the volcano plots is to identify genes of significance by the p-adjusted values.
# Add Threshold column based on criteria
hisat_resdata$threshold <- abs(hisat_resdata$log2FoldChange) >= 1 & hisat_resdata$padj < 0.05
hisat_threshold_padj <- hisat_resdata$padj < 0.05
## Which values in an object are considered true and what is the length??
length(which(hisat_threshold_padj))
## [1] 4
# Add Threshold column based on criteria
star_resdata$threshold <- abs(star_resdata$log2FoldChange) >= 1 & star_resdata$padj < 0.05
star_threshold_padj <- star_resdata$padj < 0.05
## Which values in an object are considered true and what is the length??
length(which(star_threshold_padj))
## [1] 1
str(star_resdata)
## 'data.frame': 19908 obs. of 16 variables:
## $ Gene_id : 'AsIs' chr "ENSMUSG00000019831.15" "ENSMUSG00000003135.16" "ENSMUSG00000027637.4" "ENSMUSG00000047843.17" ...
## $ baseMean : num 66.6 135.3 240 238.2 208.1 ...
## $ log2FoldChange: num 3.68 3.71 3.56 3.7 3.55 ...
## $ lfcSE : num 0.796 0.851 0.834 0.866 0.856 ...
## $ stat : num 4.63 4.36 4.27 4.27 4.15 ...
## $ pvalue : num 3.70e-06 1.33e-05 1.98e-05 1.97e-05 3.34e-05 ...
## $ padj : num 0.048 0.0643 0.0643 0.0643 0.0783 ...
## $ BMLow1 : num 10.9 30.4 72.9 66.9 72.9 ...
## $ BMLow2 : num 19.1 35.6 47.1 51.5 38.2 ...
## $ BMHigh1 : num 56.1 51.3 78.5 76.9 68.9 ...
## $ BMHigh2 : num 332 808 1328 1452 1225 ...
## $ LNLow1 : num 40.6 42.5 117.9 104.3 71.5 ...
## $ LNLow2 : num 27.9 56.6 104.4 53.4 73.3 ...
## $ LNHigh1 : num 28.3 28.3 89.3 40.2 65.5 ...
## $ LNHigh2 : num 18 29.2 82 60.6 49.4 ...
## $ threshold : logi TRUE FALSE FALSE FALSE FALSE FALSE ...
The labels
hisatVolcano1 <- ggplot(hisat_resdata) +
geom_point(aes(x=log2FoldChange,
y=-log10(padj), colour=threshold)) +
theme_classic() +
ggtitle("HISAT2 DEG Expression") +
xlab("Log2 fold change") +
ylab("-Log10(padj)") +
theme(legend.position = "right",
plot.title = element_text(size = rel(1.25), hjust = 0.5),
axis.title = element_text(size = rel(1.25))) +
scale_color_manual(values = c("darkslategrey", "magenta"),
labels = c("Not Significant", "Differentially Expressed"))
hisatVolcano1 <- as.ggplot(hisatVolcano1)
ggsave(filename = "hisatVolcano1.png",
plot = hisatVolcano1,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(hisatVolcano1, file = "hisatVolcano1.png")
hisatVolcano1_plot <- readRDS("hisatVolcano1.png")
The amount of genes which meet the threshold (a p-adjusted value less than 0.05) is different for the HISAT2 alignment data and the STAR alignment data. From the HISAT2 alignment, we have 76 genes of significance, but from the STAR alignment we only have 1 gene of significance. These results indicate the alignments created varied results which will impact analyses down the line.
starVolanco1 <- ggplot(star_resdata) +
geom_point(aes(x=log2FoldChange,
y=-log10(padj), colour=threshold)) + # Add labels
theme_classic() +
ggtitle("STAR DEG Expression") +
xlab("Log2 fold change") +
ylab("-Log10(padj)") +
theme(legend.position = "right",
plot.title = element_text(size = rel(1.25), hjust = 0.5),
axis.title = element_text(size = rel(1.25))) +
scale_color_manual(values = c("darkslategrey", "magenta"),
labels = c("Not Significant", "Differentially Expressed"))
starVolanco1 <- as.ggplot(starVolanco1)
ggsave(filename = "starVolanco1.png",
plot = starVolanco1,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(starVolanco1, file = "starVolanco1.png")
starVolanco1_plot <- readRDS("starVolanco1.png")
For both the HISAT and STAR volcano plots, you can see that (before the batch correction) all the DEGs were upregulated.
volcanoes1 <- plot_grid(hisatVolcano1_plot, starVolanco1_plot,
ncol = 2,
labels = "C")
volcanoes1
The volcano plots above indicates the statistical significance of each
of the genes according to the results of the HISAT and STAR analysis.
Genes that are either up or downregulated are labeled in magenta, and
all other genes are gray. Both plots show that the only differentially
expressed genes are upregulated, and that there are very very few genes
of significance (further emphasized by the venn diagram below)
generating a venn diagram comparing the DEGs identified by HISAT2 vs STAR
# Extract genes marked as differentially expressed in each dataset
hisat_threshold_genes <- subset(hisat_resdata, threshold)$Gene_id
star_threshold_genes <- subset(star_resdata, threshold)$Gene_id
# Create a list of gene IDs for both datasets
gene_lists <- list(
HISAT2 = hisat_threshold_genes,
STAR = star_threshold_genes
)
# Plot the Venn diagram
ggvenn(data = gene_lists)
In the original literature from which this dataset was derived, the authors found 5,509 DEGs from 8 distinct gene clusters using STAR. Since the results from this analysis are so different to the analysis seen above, this data clearly indicates a batch correction is necessary for the most accurate analysis. That being said, for the purposes of assessing the results drawn from HISAT2 vs STAR, it is clear the results are very different depending on the aligner chosen.
Genes of interest as they are indicated above:
STAR’s one DEG is ENSMUSG00000019831.15 This is gene Wasf1, a protein coding gene. Mutations in this gene have been associated with defects in the central nervous system. The protein it encodes is an actin binding protein This gene was identified as a DEG from HISAT2, which is why there is indicated crossover in the Venn diagram above. source: https://useast.ensembl.org/Mus_musculus/Gene/Summary?db=core;g=ENSMUSG00000019831;r=10:40759471-40814566
The 3 remaining DEGs identified only by HISAT2 are: - ENSMUSG00000003135.16: Cnot11 This is another protein coding gene for the CCR4-NOT transcription complex subunit 11. It is predicted to act upstream of or within regulation of translation and regulatory ncRNA-mediated gene silencing. source: https://www.informatics.jax.org/marker/MGI:106580
ENSMUSG00000027637.4: Rab5if This is a protein coding factor for the RAB5 interacting factor, which is a part of the GEL complex subunit OPTI. It is involved in mitochondrial repirasome assembly source: https://www.informatics.jax.org/marker/MGI:1914638
ENSMUSG00000047843: Bri3 This is a protein coding gene for the I3 brain protein, which is predicted to enable identical protein binding activity source: https://www.informatics.jax.org/marker/MGI:1933174
Because the results at this stage are evident to be symptomatic of some batch effect, I will next conduct a batch correction. I will use the sva package, which provides methods for removing batch effects from RNA-seq data
To aid in this process, I used the following resources: 1. The SVA package for removing batch effects and other … (n.d.). https://bioconductor.org/packages/release/bioc/vignettes/sva/inst/doc/sva.pdf 2. Ram, & Lj. (n.d.). Removing batch effects using combat and SVA. Bioinformatics Answers. https://www.biostars.org/p/196430/
# Load necessary libraries
library(sva)
# Extract counts matrix from HISAT2 and STAR datasets
hisat_counts <- counts(hisat_dds, normalized = TRUE)
star_counts <- counts(star_dds, normalized = TRUE)
hisat_counts <- as.data.frame(hisat_counts)
star_counts <- as.data.frame(star_counts)
# Perform batch correction for HISAT2 data
hisat_corrected <- ComBat(dat = hisat_counts, batch = hisatGroup, par.prior = TRUE, prior.plots = FALSE)
## Found 1667 genes with uniform expression within a single batch (all zeros); these will not be adjusted for batch.
hisat_corrected <- as.data.frame(hisat_corrected)
# Perform batch correction for STAR data
star_corrected <- ComBat(dat = star_counts, batch = starGroup, par.prior = TRUE, prior.plots = FALSE)
## Found 1650 genes with uniform expression within a single batch (all zeros); these will not be adjusted for batch.
star_corrected <- as.data.frame(star_corrected)
head(star_corrected)
## BMLow1 BMLow2 BMHigh1 BMHigh2
## ENSMUSG00000000001.5 4673.843966 5350.069764 5525.543325 3735.066003
## ENSMUSG00000000028.16 990.573357 1446.908007 1266.899870 1125.865281
## ENSMUSG00000000037.18 38.042464 52.222787 49.827578 59.658029
## ENSMUSG00000000049.12 2.945197 1.654655 1.621753 3.544434
## ENSMUSG00000000056.8 1169.631389 889.724868 1169.581047 712.059006
## ENSMUSG00000000058.7 2007.512416 3052.019547 2672.301236 2114.336938
## LNLow1 LNLow2 LNHigh1 LNHigh2
## ENSMUSG00000000001.5 4957.954944 5563.927395 5598.177380 4738.869383
## ENSMUSG00000000028.16 1400.301940 1237.704986 1619.295940 825.957161
## ENSMUSG00000000037.18 33.404537 64.454252 49.308261 31.622373
## ENSMUSG00000000049.12 1.515242 4.836595 1.467321 2.395742
## ENSMUSG00000000056.8 1049.582138 884.751566 807.036631 1314.573893
## ENSMUSG00000000058.7 2961.578349 2001.516379 2169.801004 3375.685164
After this batch correction, I must redo the visualizations above to see if the results are better.
First, I need to manage the transformed data for negative values since these arose during the batch correction. To deal with these, I chose to replace the values with 0.
Additionally, due to floating-point arithmetic in the data processing, the counts values are now floating point integers. In an abundance of caution to not disturb the quality of the data, I will create new dataframes for these more useable batch corrected counts values
hisat_rounded <- hisat_corrected
star_rounded <- star_corrected
# Replace negative values with zero
hisat_rounded[hisat_rounded < 0] <- 0
star_rounded[star_rounded < 0] <- 0
hisat_rounded <- round(hisat_rounded)
star_rounded <- round(star_rounded)
head(star_rounded)
## BMLow1 BMLow2 BMHigh1 BMHigh2 LNLow1 LNLow2 LNHigh1
## ENSMUSG00000000001.5 4674 5350 5526 3735 4958 5564 5598
## ENSMUSG00000000028.16 991 1447 1267 1126 1400 1238 1619
## ENSMUSG00000000037.18 38 52 50 60 33 64 49
## ENSMUSG00000000049.12 3 2 2 4 2 5 1
## ENSMUSG00000000056.8 1170 890 1170 712 1050 885 807
## ENSMUSG00000000058.7 2008 3052 2672 2114 2962 2002 2170
## LNHigh2
## ENSMUSG00000000001.5 4739
## ENSMUSG00000000028.16 826
## ENSMUSG00000000037.18 32
## ENSMUSG00000000049.12 2
## ENSMUSG00000000056.8 1315
## ENSMUSG00000000058.7 3376
# Recreate DESeq2 datasets (DDS) using the corrected data
hisat_dds_corrected <- DESeqDataSetFromMatrix(countData = hisat_rounded,
colData = colData(hisat_dds),
design = ~ Group)
star_dds_corrected <- DESeqDataSetFromMatrix(countData = star_rounded,
colData = colData(star_dds),
design = ~ Group)
# Run differential expression analysis
hisat_dds_corrected <- DESeq(hisat_dds_corrected)
star_dds_corrected <- DESeq(star_dds_corrected)
hisat_rld <- rlog(hisat_dds_corrected, blind=TRUE)
hisat_rld_mat <- assay(hisat_rld)
hisat_rld_cor <- cor(hisat_rld_mat)
heat.colors <- brewer.pal(6, "Blues")
hisat_heatmap <- pheatmap(hisat_rld_cor, color = heat.colors, main = "HISAT2 Corrected Heatmap")
hisat_heatmap <- as.ggplot(hisat_heatmap)
ggsave(filename = "hisat_heatmap_corrected.png",
plot = hisat_heatmap,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(hisat_heatmap, file = "hisat_heatmap_corrected.png")
hisat_heatmap_corrected_plot <- readRDS("hisat_heatmap_corrected.png")
star_rld <- rlog(star_dds_corrected, blind=TRUE)
star_rld_mat <- assay(star_rld)
star_rld_cor <- cor(star_rld_mat)
heat.colors <- brewer.pal(6, "Blues")
star_heatmap <- pheatmap(star_rld_cor, color = heat.colors, main = "STAR Corrected Heatmap")
star_heatmap <- as.ggplot(star_heatmap)
ggsave(filename = "star_heatmap_corrected.png",
plot = star_heatmap,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(star_heatmap, file = "star_heatmap_corrected.png")
star_heatmap_corrected_plot <- readRDS("star_heatmap_corrected.png")
corrected_heatmaps <- plot_grid(hisat_heatmap_corrected_plot, star_heatmap_corrected_plot,
ncol = 2)
heatmaps_combined <- plot_grid(figHeatmap, corrected_heatmaps,
nrow = 2,
labels = "D")
heatmaps_combined
Clearly, the batch effect has had an impact on the results of the data.
The heatmap provides a sanity check that our batch correction has
worked, and the correlations between the samples are showing what we
would expect; very strong correlation between the same samples, and a
weaker correlation between more different samples. Correlations which
were very strong before the batch correction are no longer appearing as
correlated
hisat_vsd <- vst(hisat_dds_corrected, blind=FALSE)
hisat_PCA <- plotPCA(hisat_vsd, intgroup = c("Tissue", "CD44Level"), returnData = TRUE)
percentVar <- round(100 * attr(hisat_PCA, "percentVar"))
hisat_pca_plot <- ggplot(hisat_PCA, aes(PC1, PC2, color=Tissue, shape=CD44Level)) +
geom_point(size=3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
coord_fixed() +
ggtitle("PCA Plot of HISAT2 Counts (corrected)")+
theme_classic()
hisat_pca_plot <- as.ggplot(hisat_pca_plot)
ggsave(filename = "hisat_pca_plot_corrected.png",
plot = hisat_pca_plot,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(hisat_pca_plot, file = "hisat_pca_plot_corrected.png")
hisat_pca_corrected_plot <- readRDS("hisat_pca_plot_corrected.png")
star_vsd <- vst(star_dds_corrected, blind=FALSE)
star_PCA <- plotPCA(star_vsd, intgroup = c("Tissue", "CD44Level"), returnData = TRUE)
percentVar <- round(100 * attr(star_PCA, "percentVar"))
star_pca_plot <- ggplot(star_PCA, aes(PC1, PC2, color=Tissue, shape=CD44Level)) +
geom_point(size=3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
coord_fixed() +
ggtitle("PCA Plot of STAR Counts (corrected)")+
theme_classic()
star_pca_plot <- as.ggplot(star_pca_plot)
ggsave(filename = "star_pca_plot_corrected.png",
plot = star_pca_plot,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(star_pca_plot, file = "star_pca_plot_corrected.png")
star_pca_corrected_plot <- readRDS("star_pca_plot_corrected.png")
corrected_PCAs <- plot_grid(hisat_pca_corrected_plot, star_pca_corrected_plot,
ncol = 2,
labels = "E")
corrected_PCAs
combined_PCAs <- plot_grid(figPCA1, corrected_PCAs,
nrow = 2)
combined_PCAs
With the corrected data, the PCAs indicate clustering between the lymph node samples, and less so between the bone marrow samples. The samples are still very far apart, and there is less variance in the PCA variables in the corrected values compared ot the uncorrected values.
Since I have performed a batch correction, the padj values are all 1. This is, of course, no longer a good metric to assess genes of significance. Instead, I will use the p-values < 0.05 and keep the L2FC values > 1 to identify genes of significance post batch correction.
hisat_res <- results(hisat_dds_corrected, contrast=c('Group', 'BMHigh', 'BMLow'))
head(hisat_res)
## log2 fold change (MLE): Group BMHigh vs BMLow
## Wald test p-value: Group BMHigh vs BMLow
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000000001.5 5697.31140 0.1226365 0.970918 0.1263098 0.899487
## ENSMUSG00000000028.16 1376.81073 0.1393018 0.970925 0.1434732 0.885916
## ENSMUSG00000000037.18 49.03352 0.2967228 0.953448 0.3112102 0.755641
## ENSMUSG00000000049.12 1.80616 -0.8096610 2.131410 -0.3798712 0.704041
## ENSMUSG00000000056.8 1134.49771 0.0521095 1.019564 0.0511096 0.959238
## ENSMUSG00000000058.7 2804.26885 0.1505281 0.974444 0.1544760 0.877234
## padj
## <numeric>
## ENSMUSG00000000001.5 1
## ENSMUSG00000000028.16 1
## ENSMUSG00000000037.18 1
## ENSMUSG00000000049.12 1
## ENSMUSG00000000056.8 1
## ENSMUSG00000000058.7 1
dim(hisat_res)
## [1] 20591 6
hisat_result_corrected <- hisat_res[order(hisat_res$pvalue), ]
head(hisat_result_corrected)
## log2 fold change (MLE): Group BMHigh vs BMLow
## Wald test p-value: Group BMHigh vs BMLow
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000031144.16 3.64667 -5.06534 2.04308 -2.47926 0.0131655
## ENSMUSG00000046337.18 7.97645 -3.10637 1.40110 -2.21709 0.0266168
## ENSMUSG00000109051.2 4.19634 4.64404 2.19334 2.11734 0.0342313
## ENSMUSG00000038583.14 2.03877 -5.48740 2.61062 -2.10196 0.0355571
## ENSMUSG00000044788.11 2.06370 5.27089 2.55278 2.06476 0.0389454
## ENSMUSG00000083906.2 2.47343 4.88317 2.42498 2.01370 0.0440414
## padj
## <numeric>
## ENSMUSG00000031144.16 1
## ENSMUSG00000046337.18 1
## ENSMUSG00000109051.2 1
## ENSMUSG00000038583.14 1
## ENSMUSG00000044788.11 1
## ENSMUSG00000083906.2 1
table(hisat_result_corrected$pvalue<0.05)
##
## FALSE TRUE
## 20581 10
Doing the same for STAR
star_res <- results(star_dds_corrected, contrast=c('Group', 'BMHigh', 'BMLow'))
head(star_res)
## log2 fold change (MLE): Group BMHigh vs BMLow
## Wald test p-value: Group BMHigh vs BMLow
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000000001.5 5837.42100 0.1299552 0.976352 0.1331028 0.894112
## ENSMUSG00000000028.16 1439.18739 0.1492737 0.980475 0.1522464 0.878993
## ENSMUSG00000000037.18 50.07160 0.2752073 0.952356 0.2889754 0.772600
## ENSMUSG00000000049.12 2.60973 -0.2616289 1.742020 -0.1501871 0.880617
## ENSMUSG00000000056.8 1185.41984 0.0654384 1.023048 0.0639642 0.948999
## ENSMUSG00000000058.7 2952.75353 0.1598753 0.980316 0.1630855 0.870451
## padj
## <numeric>
## ENSMUSG00000000001.5 1
## ENSMUSG00000000028.16 1
## ENSMUSG00000000037.18 1
## ENSMUSG00000000049.12 1
## ENSMUSG00000000056.8 1
## ENSMUSG00000000058.7 1
dim(star_res)
## [1] 19908 6
star_result_corrected <- star_res[order(star_res$pvalue), ]
head(star_result_corrected)
## log2 fold change (MLE): Group BMHigh vs BMLow
## Wald test p-value: Group BMHigh vs BMLow
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000060989.9 7.72410 6.78053 2.61845 2.58952 0.00961095
## ENSMUSG00000031144.16 3.77169 -4.41325 1.95291 -2.25983 0.02383201
## ENSMUSG00000046337.18 8.44016 -3.08623 1.38364 -2.23052 0.02571298
## ENSMUSG00000109051.2 4.40102 4.64114 2.18558 2.12353 0.03370935
## ENSMUSG00000038583.14 2.02994 -5.49382 2.60039 -2.11269 0.03462737
## ENSMUSG00000100927.2 1.98745 5.07013 2.41928 2.09572 0.03610707
## padj
## <numeric>
## ENSMUSG00000060989.9 1
## ENSMUSG00000031144.16 1
## ENSMUSG00000046337.18 1
## ENSMUSG00000109051.2 1
## ENSMUSG00000038583.14 1
## ENSMUSG00000100927.2 1
table(star_result_corrected$pvalue<0.05)
##
## FALSE TRUE
## 19897 11
hisat_resdata_corrected <- merge(as.data.frame(hisat_result_corrected), as.data.frame(counts(hisat_dds_corrected, normalized=TRUE)), by="row.names", sort=FALSE)
names(hisat_resdata_corrected)[1] <- "Gene_id"
head(hisat_resdata_corrected)
## Gene_id baseMean log2FoldChange lfcSE stat pvalue
## 1 ENSMUSG00000031144.16 3.646668 -5.065336 2.043083 -2.479261 0.01316549
## 2 ENSMUSG00000046337.18 7.976448 -3.106371 1.401101 -2.217092 0.02661681
## 3 ENSMUSG00000109051.2 4.196344 4.644044 2.193342 2.117337 0.03423127
## 4 ENSMUSG00000038583.14 2.038767 -5.487404 2.610617 -2.101956 0.03555709
## 5 ENSMUSG00000044788.11 2.063696 5.270887 2.552780 2.064763 0.03894539
## 6 ENSMUSG00000083906.2 2.473435 4.883165 2.424976 2.013697 0.04404139
## padj BMLow1 BMLow2 BMHigh1 BMHigh2 LNLow1 LNLow2 LNHigh1
## 1 1 18.38228 1.9106938 0.000000 0.2993122 3.856431 4.7246253 0.000000
## 2 1 34.31359 6.3689793 3.210078 1.7958735 0.000000 0.0000000 5.876855
## 3 1 0.00000 0.0000000 8.025195 0.8979367 11.569292 3.1497502 8.815283
## 4 1 11.02937 0.6368979 0.000000 0.0000000 3.856431 0.7874376 0.000000
## 5 1 0.00000 0.0000000 12.840312 0.2993122 0.000000 0.7874376 1.469214
## 6 1 0.00000 0.0000000 9.630234 0.5986245 3.856431 2.3623127 0.000000
## LNHigh2
## 1 0.000000
## 2 12.246211
## 3 1.113292
## 4 0.000000
## 5 1.113292
## 6 3.339876
#write results
write.csv(hisat_resdata_corrected, file="CORRECTED_HISAT_BMHighvsBMLow _unfiltered_matrix.csv")
star_resdata_corrected <- merge(as.data.frame(star_result_corrected), as.data.frame(counts(star_dds_corrected, normalized=TRUE)), by="row.names", sort=FALSE)
names(star_resdata_corrected)[1] <- "Gene_id"
head(star_resdata_corrected)
## Gene_id baseMean log2FoldChange lfcSE stat pvalue
## 1 ENSMUSG00000060989.9 7.724096 6.780527 2.618448 2.589521 0.009610954
## 2 ENSMUSG00000031144.16 3.771692 -4.413245 1.952913 -2.259827 0.023832010
## 3 ENSMUSG00000046337.18 8.440158 -3.086226 1.383635 -2.230519 0.025712977
## 4 ENSMUSG00000109051.2 4.401022 4.641142 2.185577 2.123531 0.033709353
## 5 ENSMUSG00000038583.14 2.029939 -5.493822 2.600393 -2.112689 0.034627366
## 6 ENSMUSG00000100927.2 1.987449 5.070125 2.419277 2.095720 0.036107067
## padj BMLow1 BMLow2 BMHigh1 BMHigh2 LNLow1 LNLow2 LNHigh1
## 1 1 0.00000 0.0000000 35.243489 0.8782462 15.456570 7.9682248 0.000000
## 2 1 18.23689 1.9092474 0.000000 0.5854974 3.864143 5.5777574 0.000000
## 3 1 36.47378 6.3641581 3.203954 2.0492411 0.000000 0.0000000 5.952692
## 4 1 0.00000 0.0000000 8.009884 0.8782462 11.592428 3.1872899 10.417211
## 5 1 10.94213 0.6364158 0.000000 0.0000000 3.864143 0.7968225 0.000000
## 6 1 0.00000 0.0000000 9.611861 1.7564923 0.000000 0.7968225 1.488173
## LNHigh2
## 1 2.24624
## 2 0.00000
## 3 13.47744
## 4 1.12312
## 5 0.00000
## 6 2.24624
#write results
write.csv(star_resdata_corrected, file="CORRECTED_STAR_BMHighvsBMLow _unfiltered_matrix_CORRECTED.csv")
Now, getting a list of DEGs and filtering for rows which meet the criteria of having an absolute log2FC of 1 and padj of less than 0.05.
#Get DEG up/down lists
hisat_resdata_DEGs <- as.data.frame(hisat_result_corrected) %>% dplyr::filter(abs(log2FoldChange) > 1 & pvalue < 0.05) %>% tibble::rownames_to_column("Gene_id")
star_resdata_DEGs <- as.data.frame(star_result_corrected) %>% dplyr::filter(abs(log2FoldChange) > 1 & pvalue < 0.05) %>% tibble::rownames_to_column("Gene_id")
#write results
write.csv(star_resdata_DEGs, "CORRECTED_STAR_CD44LowVsHigh_log2fc1_fdr05_DEGlist.csv", row.names = FALSE)
write.csv(hisat_resdata_DEGs, "CORRECTED_HISAT_CD44LowVsHigh_log2fc1_fdr05_DEGlist.csv", row.names = FALSE)
head(hisat_resdata_DEGs)
## Gene_id baseMean log2FoldChange lfcSE stat pvalue
## 1 ENSMUSG00000031144.16 3.646668 -5.065336 2.043083 -2.479261 0.01316549
## 2 ENSMUSG00000046337.18 7.976448 -3.106371 1.401101 -2.217092 0.02661681
## 3 ENSMUSG00000109051.2 4.196344 4.644044 2.193342 2.117337 0.03423127
## 4 ENSMUSG00000038583.14 2.038767 -5.487404 2.610617 -2.101956 0.03555709
## 5 ENSMUSG00000044788.11 2.063696 5.270887 2.552780 2.064763 0.03894539
## 6 ENSMUSG00000083906.2 2.473435 4.883165 2.424976 2.013697 0.04404139
## padj
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
The following volcano plots were created with the pvalue < 0.05 and L2FC > 1, which is an adjustment for the batch correction and different from the original volcano plots, which used padj < 0.05 and L2FC > 1
# Add Threshold column based on criteria
hisat_resdata_corrected$threshold <- abs(hisat_resdata_corrected$log2FoldChange) >= 1 & hisat_resdata_corrected$pvalue < 0.05
hisat_threshold_pvalue <- hisat_resdata_corrected$pvalue < 0.05
## Which values in an object are considered true and what is the length??
length(which(hisat_threshold_pvalue))
## [1] 10
# Add Threshold column based on criteria
star_resdata_corrected$threshold <- abs(star_resdata_corrected$log2FoldChange) >= 1 & star_resdata_corrected$pvalue < 0.05
star_threshold_pvalue <- star_resdata_corrected$pvalue < 0.05
## Which values in an object are considered true and what is the length??
length(which(star_threshold_pvalue))
## [1] 11
str(star_resdata_corrected)
## 'data.frame': 19908 obs. of 16 variables:
## $ Gene_id : 'AsIs' chr "ENSMUSG00000060989.9" "ENSMUSG00000031144.16" "ENSMUSG00000046337.18" "ENSMUSG00000109051.2" ...
## $ baseMean : num 7.72 3.77 8.44 4.4 2.03 ...
## $ log2FoldChange: num 6.78 -4.41 -3.09 4.64 -5.49 ...
## $ lfcSE : num 2.62 1.95 1.38 2.19 2.6 ...
## $ stat : num 2.59 -2.26 -2.23 2.12 -2.11 ...
## $ pvalue : num 0.00961 0.02383 0.02571 0.03371 0.03463 ...
## $ padj : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BMLow1 : num 0 18.2 36.5 0 10.9 ...
## $ BMLow2 : num 0 1.909 6.364 0 0.636 ...
## $ BMHigh1 : num 35.24 0 3.2 8.01 0 ...
## $ BMHigh2 : num 0.878 0.585 2.049 0.878 0 ...
## $ LNLow1 : num 15.46 3.86 0 11.59 3.86 ...
## $ LNLow2 : num 7.968 5.578 0 3.187 0.797 ...
## $ LNHigh1 : num 0 0 5.95 10.42 0 ...
## $ LNHigh2 : num 2.25 0 13.48 1.12 0 ...
## $ threshold : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
hisat_corrected_volcano <- ggplot(hisat_resdata_corrected) +
geom_point(aes(x=log2FoldChange,
y=-log10(pvalue), colour=threshold)) +
theme_classic() +
ggtitle("HISAT2 DEG Expression (corrected)") +
xlab("Log2 fold change") +
ylab("-Log10(p-value)") +
theme(legend.position = "right",
plot.title = element_text(size = rel(1.25), hjust = 0.5),
axis.title = element_text(size = rel(1.25))) +
scale_color_manual(values = c("darkslategrey", "magenta"),
labels = c("Not Significant", "Differentially Expressed")) +
ylim(0, 2)
ggsave(filename = "hisat_corrected_volcano.png",
plot = hisat_corrected_volcano,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(hisat_corrected_volcano, file = "hisat_corrected_volcano.png")
hisat_corrected_volcano_plot <- readRDS("hisat_corrected_volcano.png")
star_corrected_volcano <- ggplot(star_resdata_corrected) +
geom_point(aes(x=log2FoldChange,
y=-log10(pvalue), colour=threshold)) +
theme_classic() +
ggtitle("STAR DEG Expression (corrected)") +
xlab("Log2 fold change") +
ylab("-Log10(pvalue)") +
theme(legend.position = "right",
plot.title = element_text(size = rel(1.25), hjust = 0.5),
axis.title = element_text(size = rel(1.25))) +
scale_color_manual(values = c("darkslategrey", "magenta"),
labels = c("Not Significant", "Differentially Expressed")) +
ylim(0, 2)
ggsave(filename = "star_corrected_volcano.png",
plot = star_corrected_volcano,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(star_corrected_volcano, file = "star_corrected_volcano.png")
star_corrected_volcano_plot <- readRDS("star_corrected_volcano.png")
corrected_volcanos <- plot_grid(hisat_corrected_volcano_plot, star_corrected_volcano_plot,
ncol = 2,
labels = "F")
corrected_volcanos
Identified as a DEG by both aligners:
Identified by HISAT2 only:
Identified by STAR only:
# For HISAT data
hisat_upregulated <- hisat_resdata_corrected$Gene_id[hisat_resdata_corrected$threshold & hisat_resdata_corrected$log2FoldChange > 0]
hisat_downregulated <- hisat_resdata_corrected$Gene_id[hisat_resdata_corrected$threshold & hisat_resdata_corrected$log2FoldChange < 0]
# For STAR data
star_upregulated <- star_resdata_corrected$Gene_id[star_resdata_corrected$threshold & star_resdata_corrected$log2FoldChange > 0]
star_downregulated <- star_resdata_corrected$Gene_id[star_resdata_corrected$threshold & star_resdata_corrected$log2FoldChange < 0]
hisat_upregulated
## [1] "ENSMUSG00000109051.2" "ENSMUSG00000044788.11" "ENSMUSG00000083906.2"
## [4] "ENSMUSG00000044231.4" "ENSMUSG00000080801.3" "ENSMUSG00000109302.2"
hisat_downregulated
## [1] "ENSMUSG00000031144.16" "ENSMUSG00000046337.18" "ENSMUSG00000038583.14"
## [4] "ENSMUSG00000108688.2"
star_upregulated
## [1] "ENSMUSG00000060989.9" "ENSMUSG00000109051.2" "ENSMUSG00000100927.2"
## [4] "ENSMUSG00000082194.4" "ENSMUSG00000104843.2" "ENSMUSG00000109302.2"
star_downregulated
## [1] "ENSMUSG00000031144.16" "ENSMUSG00000046337.18" "ENSMUSG00000038583.14"
## [4] "ENSMUSG00000108688.2" "ENSMUSG00000084989.4"
The amount of genes which meet the threshold (a p-adjusted value less than 0.05) is different for the HISAT2 alignment data and the STAR alignment data. From the HISAT2 alignment, we have 10 genes of significance, and from the STAR alignment we have 11 gene of significance. As is shown in the Venn Diagram below, these DEGs the two aligners have identified are not all the same. In fact, only 40% of the DEGs found are the same between the two aligners. Based on the aligner used, which genes are considered differentially expressed are much different.
generating a venn diagram comparing the DEGs identified by HISAT2 vs STAR. I want to identify what genes each aligner considers a DEG
# Extract genes marked as differentially expressed in each dataset
hisat_threshold_genes_corrected <- subset(hisat_resdata_corrected, threshold)$Gene_id
star_threshold_genes_corrected <- subset(star_resdata_corrected, threshold)$Gene_id
# Create a list of gene IDs for both datasets
gene_lists <- list(
HISAT2 = hisat_threshold_genes_corrected,
STAR = star_threshold_genes_corrected
)
# Plot the Venn diagram
ggvenn(data = gene_lists)
To summarize the venn diagram’s findings below:
hisat_threshold_genes_corrected
## [1] "ENSMUSG00000031144.16" "ENSMUSG00000046337.18" "ENSMUSG00000109051.2"
## [4] "ENSMUSG00000038583.14" "ENSMUSG00000044788.11" "ENSMUSG00000083906.2"
## [7] "ENSMUSG00000044231.4" "ENSMUSG00000080801.3" "ENSMUSG00000108688.2"
## [10] "ENSMUSG00000109302.2"
star_threshold_genes_corrected
## [1] "ENSMUSG00000060989.9" "ENSMUSG00000031144.16" "ENSMUSG00000046337.18"
## [4] "ENSMUSG00000109051.2" "ENSMUSG00000038583.14" "ENSMUSG00000100927.2"
## [7] "ENSMUSG00000082194.4" "ENSMUSG00000104843.2" "ENSMUSG00000108688.2"
## [10] "ENSMUSG00000109302.2" "ENSMUSG00000084989.4"
The results of HISAT and STAR alignments look much more similar after the batch correction, but there are still significant differences in what each aligner found to be a DEG.
For the sake of simplicity, the following boxplots are created for the genes which both HISAT2 and STAR agree are differentially expressed, after batch correction.
hisat_all_matrix <- subset(hisat_resdata_corrected, threshold)
star_all_matrix <- subset(star_resdata_corrected, threshold)
hisat_symbol_deg <- hisat_all_matrix %>% select(Gene_id, BMLow1:LNHigh2)
star_symbol_deg <- star_all_matrix %>% select(Gene_id, BMLow1:LNHigh2)
hisat_expression_plot <- pivot_longer(hisat_symbol_deg,
cols = 2:9,
values_to = "normalized_counts")
star_expression_plot <- pivot_longer(star_symbol_deg,
cols = 2:9,
values_to = "normalized_counts")
head(hisat_expression_plot)
## # A tibble: 6 × 3
## Gene_id name normalized_counts
## <I<chr>> <chr> <dbl>
## 1 ENSMUSG00000031144.16 BMLow1 18.4
## 2 ENSMUSG00000031144.16 BMLow2 1.91
## 3 ENSMUSG00000031144.16 BMHigh1 0
## 4 ENSMUSG00000031144.16 BMHigh2 0.299
## 5 ENSMUSG00000031144.16 LNLow1 3.86
## 6 ENSMUSG00000031144.16 LNLow2 4.72
combined_expression_plot <- left_join(x = hisat_expression_plot,
y = star_expression_plot,
by = "Gene_id")
head(combined_expression_plot)
## # A tibble: 6 × 5
## Gene_id name.x normalized_counts.x name.y normalized_counts.y
## <I<chr>> <chr> <dbl> <chr> <dbl>
## 1 ENSMUSG00000031144.16 BMLow1 18.4 BMLow1 18.2
## 2 ENSMUSG00000031144.16 BMLow1 18.4 BMLow2 1.91
## 3 ENSMUSG00000031144.16 BMLow1 18.4 BMHigh1 0
## 4 ENSMUSG00000031144.16 BMLow1 18.4 BMHigh2 0.585
## 5 ENSMUSG00000031144.16 BMLow1 18.4 LNLow1 3.86
## 6 ENSMUSG00000031144.16 BMLow1 18.4 LNLow2 5.58
I chose to plot each DEG based on the aligner and the condition. Each boxplot has a y-axis scale of (0,20) and each value deals with the normalized counts values
hisat_syp_plot <- hisat_expression_plot %>%
filter(Gene_id == "ENSMUSG00000031144.16")
hisat_syp <- ggplot(hisat_syp_plot) +
geom_boxplot(aes(x=name,
y=normalized_counts,
fill=name)) +
ggtitle("HISAT syp") +
ylim(0, 20)
ggsave(filename = "hisat_syp.png",
plot = hisat_syp,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(hisat_syp, file = "hisat_syp.png")
hisat_syp <- readRDS("hisat_syp.png")
# ----------------------------------------------
star_syp_plot <- star_expression_plot %>%
filter(Gene_id == "ENSMUSG00000031144.16")
star_syp <- ggplot(star_syp_plot) +
geom_boxplot(aes(x=name,
y=normalized_counts,
fill=name)) +
ggtitle("STAR syp") +
ylim(0, 20)
ggsave(filename = "star_syp.png",
plot = star_syp,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(star_syp, file = "star_syp.png")
star_syp <- readRDS("star_syp.png")
syp_plot <- plot_grid(hisat_syp, star_syp,
nrow = 2,
labels = "G")
syp_plot
hisat_Fam178b_plot <- hisat_expression_plot %>%
filter(Gene_id == "ENSMUSG00000046337.18")
hisat_Fam178b <- ggplot(hisat_Fam178b_plot) +
geom_boxplot(aes(x=name,
y=normalized_counts,
fill=name)) +
ggtitle("HISAT Fam178b") +
ylim(0, 20)
ggsave(filename = "hisat_Fam178b.png",
plot = hisat_Fam178b,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(hisat_Fam178b, file = "hisat_Fam178b.png")
hisat_Fam178b <- readRDS("hisat_Fam178b.png")
#-----------------------------------------------
star_Fam178b_plot <- star_expression_plot %>%
filter(Gene_id == "ENSMUSG00000046337.18")
star_Fam178b <- ggplot(star_Fam178b_plot) +
geom_boxplot(aes(x=name,
y=normalized_counts,
fill=name)) +
ggtitle("STAR Fam178b") +
ylim(0, 20)
ggsave(filename = "star_Fam178b.png",
plot = star_Fam178b,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(star_Fam178b, file = "star_Fam178b.png")
star_Fam178b <- readRDS("star_Fam178b.png")
Fam178b_plots <- plot_grid(hisat_Fam178b, star_Fam178b,
nrow = 2,
labels = "H")
Fam178b_plots
hisat_Gm44913 <- hisat_expression_plot %>%
filter(Gene_id == "ENSMUSG00000109051.2")
hisat_Gm44913 <- ggplot(hisat_Gm44913) +
geom_boxplot(aes(x=name,
y=normalized_counts,
fill=name)) +
ggtitle("HISAT Gm44913") +
ylim(0, 20)
ggsave(filename = "hisat_Gm44913.png",
plot = hisat_Gm44913,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(hisat_Gm44913, file = "hisat_Gm44913.png")
hisat_Gm44913 <- readRDS("hisat_Gm44913.png")
#-------------------------------------------
star_Gm44913 <- star_expression_plot %>%
filter(Gene_id == "ENSMUSG00000109051.2")
star_Gm44913 <- ggplot(star_Gm44913) +
geom_boxplot(aes(x=name,
y=normalized_counts,
fill=name)) +
ggtitle("STAR Gm44913") +
ylim(0, 20)
ggsave(filename = "star_Gm44913.png",
plot = hisat_Gm44913,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(star_Gm44913, file = "star_Gm44913.png")
star_Gm44913 <- readRDS("star_Gm44913.png")
Gm44913_plots <- plot_grid(hisat_Gm44913, star_Gm44913,
nrow = 2,
labels = "I")
Gm44913_plots
hisat_Pln <- hisat_expression_plot %>%
filter(Gene_id == "ENSMUSG00000038583.14")
hisat_Pln <- ggplot(hisat_Pln) +
geom_boxplot(aes(x=name,
y=normalized_counts,
fill=name)) +
ggtitle("HISAT Pln") +
ylim(0, 20)
ggsave(filename = "hisat_Pln.png",
plot = hisat_Pln,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(hisat_Pln, file = "hisat_Pln.png")
hisat_Pln <- readRDS("hisat_Pln.png")
#-------------------------------------
star_Pln <- star_expression_plot %>%
filter(Gene_id == "ENSMUSG00000038583.14")
star_Pln <- ggplot(star_Pln) +
geom_boxplot(aes(x=name,
y=normalized_counts,
fill=name)) +
ggtitle("STAR Pln") +
ylim(0, 20)
ggsave(filename = "star_Pln.png",
plot = star_Pln,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(star_Pln, file = "star_Pln.png")
star_Pln <- readRDS("star_Pln.png")
Pln_plots <- plot_grid(hisat_Pln, star_Pln,
nrow = 2,
labels = "J")
Pln_plots
hisat_Gm44985 <- hisat_expression_plot %>%
filter(Gene_id == "ENSMUSG00000108688.2")
hisat_Gm44985 <- ggplot(hisat_Gm44985) +
geom_boxplot(aes(x=name,
y=normalized_counts,
fill=name)) +
ggtitle("HISAT Gm44985") +
ylim(0, 20)
ggsave(filename = "hisat_Gm44985.png",
plot = hisat_Gm44985,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(hisat_Gm44985, file = "hisat_Gm44985.png")
hisat_Gm44985 <- readRDS("hisat_Gm44985.png")
#---------------------------------------
star_Gm44985 <- star_expression_plot %>%
filter(Gene_id == "ENSMUSG00000108688.2")
star_Gm44985 <- ggplot(star_Gm44985) +
geom_boxplot(aes(x=name,
y=normalized_counts,
fill=name)) +
ggtitle("STAR Gm44985") +
ylim(0, 20)
ggsave(filename = "star_Gm44985.png",
plot = star_Gm44985,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(star_Gm44985, file = "star_Gm44985.png")
star_Gm44985 <- readRDS("star_Gm44985.png")
Gm44985_plot <- plot_grid(hisat_Gm44985, star_Gm44985,
nrow = 2,
labels = "K")
Gm44985_plot
hisat_Gm44967 <- hisat_expression_plot %>%
filter(Gene_id == "ENSMUSG00000109302.2")
hisat_Gm44967 <- ggplot(hisat_Gm44967) +
geom_boxplot(aes(x=name,
y=normalized_counts,
fill=name)) +
ggtitle("HISAT Gm44967") +
ylim(0, 20)
ggsave(filename = "hisat_Gm44967.png",
plot = hisat_Gm44967,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(hisat_Gm44967, file = "hisat_Gm44967.png")
hisat_Gm44967 <- readRDS("hisat_Gm44967.png")
#---------------------------------------------
star_Gm44967 <- star_expression_plot %>%
filter(Gene_id == "ENSMUSG00000109302.2")
star_Gm44967 <- ggplot(star_Gm44967) +
geom_boxplot(aes(x=name,
y=normalized_counts,
fill=name)) +
ggtitle("STAR Gm44967") +
ylim(0, 20)
ggsave(filename = "star_Gm44967.png",
plot = star_Gm44967,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(star_Gm44967, file = "star_Gm44967.png")
star_Gm44967 <- readRDS("star_Gm44967.png")
Gm44967_plots <- plot_grid(hisat_Gm44967, star_Gm44967,
nrow = 2,
labels = "L")
Gm44967_plots
All boxplots together
boxplots <- plot_grid(syp_plot, Fam178b_plots, Gm44913_plots, Pln_plots, Gm44985_plot, Gm44967_plots,
ncol = 6)
boxplots
This figure may be too small to use.
For the most part, the boxplots indicate that the normalized counts the DEGs found for each sample across both aligners are similar. This shows that when the aligners are in concensus, they are both aligning the dataset similarly. Further research is necessary to understand why the aligners results are so different, and what the implications of this could mean for transcriptomics and bioinformatic pipelines moving forward.
There are several DEGs identified by either HISAT2 or STAR, but not both. I have chosen 2 to examine further:
Identified by HISAT2 only: * ENSMUSG00000044788.11 - Fads6
Identified by STAR only: * ENSMUSG00000084989.4 - Crocc2
I must first add them to the hisat and star expression plot data
# Filter out the genes not meeting the threshold
hisat_additional_genes <- hisat_resdata_corrected %>%
filter(Gene_id %in% c("ENSMUSG00000084989.4", "ENSMUSG00000044788.11") & !threshold)
star_additional_genes <- star_resdata_corrected %>%
filter(Gene_id %in% c("ENSMUSG00000084989.4", "ENSMUSG00000044788.11") & !threshold)
# Combine additional genes with hisat_expression_plot and star_expression_plot
hisat_expression_plot <- bind_rows(hisat_expression_plot,
pivot_longer(hisat_additional_genes,
cols = 2:9,
values_to = "normalized_counts"))
star_expression_plot <- bind_rows(star_expression_plot,
pivot_longer(star_additional_genes,
cols = 2:9,
values_to = "normalized_counts"))
hisat_Fads6 <- hisat_expression_plot %>%
filter(Gene_id == "ENSMUSG00000044788.11")
hisat_Fads6 <- ggplot(hisat_Fads6) +
geom_boxplot(aes(x=name,
y=normalized_counts,
fill=name)) +
ggtitle("HISAT Fads6") +
ylim(0, 20)
ggsave(filename = "hisat_Fads6.png",
plot = hisat_Fads6,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(hisat_Fads6, file = "hisat_Fads6.png")
hisat_Fads6 <- readRDS("hisat_Fads6.png")
#---------------------------------------------
star_Fads6 <- star_expression_plot %>%
filter(Gene_id == "ENSMUSG00000044788.11")
star_Fads6 <- ggplot(star_Fads6) +
geom_boxplot(aes(x=name,
y=normalized_counts,
fill=name)) +
ggtitle("STAR Fads6") +
ylim(0, 20)
ggsave(filename = "star_Fads6.png",
plot = star_Fads6,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(star_Fads6, file = "star_Fads6.png")
star_Fads6 <- readRDS("star_Fads6.png")
Fads6_plots <- plot_grid(hisat_Fads6, star_Fads6,
nrow = 2,
labels = "M")
Fads6_plots
hisat_Crocc2 <- hisat_expression_plot %>%
filter(Gene_id == "ENSMUSG00000084989.4")
hisat_Crocc2 <- ggplot(hisat_Crocc2) +
geom_boxplot(aes(x=name,
y=normalized_counts,
fill=name)) +
ggtitle("HISAT Crocc2") +
ylim(0, 20)
ggsave(filename = "hisat_Crocc2.png",
plot = hisat_Crocc2,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(hisat_Crocc2, file = "hisat_Crocc2.png")
hisat_Crocc2 <- readRDS("hisat_Crocc2.png")
#---------------------------------------------
star_Crocc2 <- star_expression_plot %>%
filter(Gene_id == "ENSMUSG00000084989.4")
star_Crocc2 <- ggplot(star_Crocc2) +
geom_boxplot(aes(x=name,
y=normalized_counts,
fill=name)) +
ggtitle("STAR Crocc2") +
ylim(0, 20)
ggsave(filename = "star_Crocc2.png",
plot = star_Crocc2,
units = "in",
width = 5,
height = 6,
dpi = 300)
saveRDS(star_Crocc2, file = "star_Crocc2.png")
star_Crocc2 <- readRDS("star_Crocc2.png")
Crocc2_plots <- plot_grid(hisat_Crocc2, star_Crocc2,
nrow = 2,
labels = "N")
Crocc2_plots
It appears that when the aligners do not agree on whether a gene is a DEG, they have different results for the normalized counts as well. This means that they aligned the original data differently, and one may have spliced when another didnt.
sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] sva_3.46.0 BiocParallel_1.32.6
## [3] genefilter_1.80.3 mgcv_1.8-41
## [5] nlme_3.1-160 limma_3.54.2
## [7] ggvenn_0.1.10 ggplotify_0.1.2
## [9] patchwork_1.2.0 cowplot_1.1.3
## [11] lubridate_1.9.3 forcats_1.0.0
## [13] stringr_1.5.1 dplyr_1.1.4
## [15] purrr_1.0.2 readr_2.1.5
## [17] tidyr_1.3.1 tibble_3.2.1
## [19] tidyverse_2.0.0 ggplot2_3.5.1
## [21] pheatmap_1.0.12 RColorBrewer_1.1-3
## [23] DESeq2_1.38.3 SummarizedExperiment_1.28.0
## [25] Biobase_2.58.0 MatrixGenerics_1.10.0
## [27] matrixStats_1.3.0 GenomicRanges_1.50.2
## [29] GenomeInfoDb_1.34.9 IRanges_2.32.0
## [31] S4Vectors_0.36.2 BiocGenerics_0.44.0
## [33] knitr_1.46
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 fs_1.6.4 bit64_4.0.5
## [4] httr_1.4.7 tools_4.2.2 bslib_0.7.0
## [7] utf8_1.2.4 R6_2.5.1 DBI_1.2.2
## [10] colorspace_2.1-0 withr_3.0.0 tidyselect_1.2.1
## [13] bit_4.0.5 compiler_4.2.2 textshaping_0.3.7
## [16] cli_3.6.2 DelayedArray_0.24.0 labeling_0.4.3
## [19] sass_0.4.9 scales_1.3.0 systemfonts_1.0.6
## [22] digest_0.6.35 yulab.utils_0.1.4 rmarkdown_2.26
## [25] XVector_0.38.0 pkgconfig_2.0.3 htmltools_0.5.8.1
## [28] highr_0.10 fastmap_1.1.1 rlang_1.1.3
## [31] rstudioapi_0.16.0 RSQLite_2.3.6 farver_2.1.1
## [34] gridGraphics_0.5-1 jquerylib_0.1.4 generics_0.1.3
## [37] jsonlite_1.8.8 RCurl_1.98-1.14 magrittr_2.0.3
## [40] GenomeInfoDbData_1.2.9 Matrix_1.5-1 Rcpp_1.0.12
## [43] munsell_0.5.1 fansi_1.0.6 lifecycle_1.0.4
## [46] edgeR_3.40.2 stringi_1.8.3 yaml_2.3.8
## [49] zlibbioc_1.44.0 blob_1.2.4 parallel_4.2.2
## [52] crayon_1.5.2 lattice_0.20-45 splines_4.2.2
## [55] Biostrings_2.66.0 annotate_1.76.0 hms_1.1.3
## [58] KEGGREST_1.38.0 locfit_1.5-9.9 pillar_1.9.0
## [61] geneplotter_1.76.0 codetools_0.2-18 XML_3.99-0.16.1
## [64] glue_1.7.0 evaluate_0.23 png_0.1-8
## [67] vctrs_0.6.5 tzdb_0.4.0 gtable_0.3.5
## [70] cachem_1.0.8 xfun_0.43 xtable_1.8-4
## [73] survival_3.4-0 ragg_1.3.0 AnnotationDbi_1.60.2
## [76] memoise_2.0.1 timechange_0.3.0