Introduction and Background

As transcriptomics and next-gen sequencing expands in affordability and technology, the amount of gene expression data available explodes. Bioinformatic tools like HISAT2 and STAR were created with the intention of streamlining the bioinformatic pipeline and to more easily quantify gene expression. HISAT2 and STAR are two of the most prominent read alignment tools. It is important to understand the key differences in performance when comparing HISAT to STAR, as their alignment is a quintessential aspect of the process, and much of the results of these bioinformatic analyses hinge on the accuracy and precision of these aligners.

Information on the data

In order to assess the performance of these two aligners, I have chosen a dataset which came from a study on the heterogeneity in a mouse model of metastatic breast cancer (accession number GSE165393). GSE165393 is a dataset of organ-derived metastatic cell lines that were harvested from 10 female MMTV-PyMT mice. MMTV-PyMT female mice develop palpable mammary tumors which metastasize to the lung. Tumor-bearing organs from these mice were harvested for sample collection. These cell lines were isolated and sorted based on the expression of surface cancer markers, specifically CD44low/EpCAMhigh or CD44high/EpCAMhigh with previously sorted cell lines used as positive controls for comparison. To acquire the raw data for bulk RNA sequencing, total RNA was extracted using the QIAGEN RNeasy kit, with 2 distinct clonal isolate replicates per sample. A modified SMART-seq2 protocol was used to generate cDNA, and the Nextera XT DNA Sample Prep Kit was used to build Illumina libraries. Samples were then sequenced on a NextSeq500 with a minimum depth of 10 M reads.

This dataset is a compilation of counts files from 8 samples of mice tissue, which all have different treatments:

SRR13515350 - Bone Marrow, CD44 low (rep 1)
SRR13515351 - Bone Marrow, CD44 low (rep 2)
SRR13515352 - Bone Marrow, CD44 high (rep 1)
SRR13515353 - Bone Marrow, CD44 high (rep 2)
SRR13515354 - Lymph Node, CD44 low (rep 1)
SRR13515355 - Lymph Node, CD44 low (rep 2)
SRR13515356 - Lymph Node, CD44 High (rep 1)
SRR13515357 - Lymph Node, CD44 High (rep 2)

Each sample is either from Bone Marrow (BM), or Lymph Node (LN) tissue. The mouse either had high levels of CD44, a cell surface marker, or low levels of CD44.

The goal of this analysis is to compare and contrast HISAT2 aligner with STAR aligner. To do this, I will conduct differential expression (DE) analysis on counts files created by HISAT2 and STAR on the same dataset, simultaneously.

Analyzing RNA-Seq with DESeq2

Load required libraries

#BiocManager::install("limma")
#BiocManager::install("sva")

library(knitr)
library(DESeq2) 
library(RColorBrewer)
library(pheatmap)
library(ggplot2)
library(tidyverse)
library(pheatmap)
library(cowplot)
library(tidyr)
library(dplyr)
library(patchwork)
library(ggplotify)
library(DESeq2)
library(ggvenn)
library(limma)

htseq-count input

Hdirectory <- "~/final_proj/Rstudio/counts3/hisat2CountsFiles"  ##path to where HTseq counts are 

Sdirectory <- "~/final_proj/Rstudio/counts3/star" 

sampleTableHisat <- read.csv("~/final_proj/Anna_Rees_Final_Project/counts3/sampleTableHisat.csv", header = T)
sampleTableStar <- read.csv("~/final_proj/Anna_Rees_Final_Project/counts3/sampleTableStar.csv", header = T)

head(sampleTableStar)

##   SampleName
## 1     BMLow1
## 2     BMLow2
## 3    BMHigh1
## 4    BMHigh2
## 5     LNLow1
## 6     LNLow2
##                                                                                                fileName
## 1  counts3/star/SRR13515350_GSM5032720_BM1low_Mus_musculus_RNA-Seq_Aligned.sortedByCoord.out.ENSG.count
## 2  counts3/star/SRR13515351_GSM5032721_BM2low_Mus_musculus_RNA-Seq_Aligned.sortedByCoord.out.ENSG.count
## 3 counts3/star/SRR13515352_GSM5032722_BM1High_Mus_musculus_RNA-Seq_Aligned.sortedByCoord.out.ENSG.count
## 4 counts3/star/SRR13515353_GSM5032723_BM2High_Mus_musculus_RNA-Seq_Aligned.sortedByCoord.out.ENSG.count
## 5  counts3/star/SRR13515354_GSM5032724_LN1low_Mus_musculus_RNA-Seq_Aligned.sortedByCoord.out.ENSG.count
## 6  counts3/star/SRR13515355_GSM5032725_LN2low_Mus_musculus_RNA-Seq_Aligned.sortedByCoord.out.ENSG.count
##        Tissue CD44Level Condition Rep  Group
## 1 Bone Marrow       Low        BM   1  BMLow
## 2 Bone Marrow       Low        BM   2  BMLow
## 3 Bone Marrow      High        BM   1 BMHigh
## 4 Bone Marrow      High        BM   2 BMHigh
## 5  Lymph Node       Low        LN   1  LNLow
## 6  Lymph Node       Low        LN   2  LNLow

Next, I must specify the columns of interest as factors. For my purposes, this will be the “condition”, “Tissue”, and “CD44Level” columns

hisatCD44Level <- factor(sampleTableHisat$CD44Level, levels = c("Low", "High"))
starCD44Level <- factor(sampleTableStar$CD44Level, levels = c("Low", "High"))
# 
hisatTissue <- factor(sampleTableHisat$Tissue, levels = c("Bone Marrow", "Lymph Node"))
starTissue <- factor(sampleTableStar$Tissue, levels = c("Bone Marrow", "Lymph Node"))
# 
hisatcondition <- factor(sampleTableHisat$condition, levels = c("BM:Low", "BM:High", "LN:Low", "LN:High"))
starcondition <- factor(sampleTableStar$condition, levels = c("BM:Low", "BM:High", "LN:Low", "LN:High"))

hisatGroup <- factor(sampleTableHisat$Group)
starGroup <- factor(sampleTableStar$Group)

DESeq2 Dataset (DDS)

Firstly, I will create DDSs for HISAT2 and STAR with the CD44 level as the condition.

hisat_dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTableHisat,
                                  directory = ".",
                                  design = ~Group)


star_dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTableStar,
                                       directory = ".",
                                       design = ~Group)

head(hisat_dds)

## class: DESeqDataSet 
## dim: 6 8 
## metadata(1): version
## assays(1): counts
## rownames(6): ENSMUSG00000000001.5 ENSMUSG00000000003.16 ...
##   ENSMUSG00000000037.18 ENSMUSG00000000049.12
## rowData names(0):
## colnames(8): BMLow1 BMLow2 ... LNHigh1 LNHigh2
## colData names(5): Tissue CD44Level Condition Rep Group

Pre-filtering

Pre-filtering low count genes before running DESeq2 functions in order to remove rows with very few reads and to reduce the memory size of the dds data objects

keepH <- rowSums(counts(hisat_dds)) >= 10
hisat_dds <- hisat_dds[keepH,]

keepS <- rowSums(counts(star_dds)) >= 10
star_dds <- star_dds[keepS,]

#Deciding the factor level - "Low" represents the control
hisat_dds$Group <- relevel(hisat_dds$Group, ref = "BMLow")

Running the Differential Expression Analysis

hisat_dds <- DESeq(hisat_dds)
hisat_res <- results(hisat_dds)

star_dds <- DESeq(star_dds)
star_res <- results(star_dds)

summary(hisat_res)

## 
## out of 20591 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 26, 0.13%
## LFC < 0 (down)     : 110, 0.53%
## outliers [1]       : 0, 0%
## low counts [2]     : 5589, 27%
## (mean count < 6)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

summary(star_res)

## 
## out of 19908 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 50, 0.25%
## LFC < 0 (down)     : 145, 0.73%
## outliers [1]       : 0, 0%
## low counts [2]     : 5790, 29%
## (mean count < 7)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Hierarchical Clustering Heatmap

hisat_rld1 <- rlog(hisat_dds, blind=TRUE)
hisat_rld_mat1 <- assay(hisat_rld1)  
hisat_rld_cor1 <- cor(hisat_rld_mat1) 
heat.colors <- brewer.pal(6, "Blues")
hisat_heatmap1 <- pheatmap(hisat_rld_cor1, color = heat.colors, main = "HISAT2 Heatmap")

hisat_heatmap1 <- as.ggplot(hisat_heatmap1)

star_rld1 <- rlog(star_dds, blind=TRUE)
star_rld_mat1 <- assay(star_rld1)  
star_rld_cor1 <- cor(star_rld_mat1) 
heat.colors <- brewer.pal(6, "Blues")
star_heatmap1 <- pheatmap(star_rld_cor1, color = heat.colors, main = "STAR Heatmap")

star_heatmap1 <- as.ggplot(star_heatmap1)

Side by Side for the Heatmaps

ggsave(filename = "hisat_heatmap1.png",
       plot = hisat_heatmap1,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

ggsave(filename = "star_heatmap1.png",
       plot = star_heatmap1,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(hisat_heatmap1, file = "hisat_heatmap1.png")
saveRDS(star_heatmap1, file = "star_heatmap1.png")

hisat_heatmap_plot <- readRDS("hisat_heatmap1.png")
star_heatmap_plot <- readRDS("star_heatmap1.png")

figHeatmap <- plot_grid(hisat_heatmap_plot, star_heatmap_plot,
                        ncol = 2,
                        labels = "A")

figHeatmap

The correlation heatmaps above show the hierarchical clustering of each condition type in the dataset. It appears there is higher correlation between the LNHigh1 and LNlow1 samples compared to other samples. The HISAT2 heatmap and the STAR heatmap appear to draw similar, although still different, correlations. The high correlation between different samples may be due to noise in the dataset, or unnormalized counts data.

Principal components analysis (PCA) from the DESeq Results

hisat_vsd <- vst(hisat_dds, blind=FALSE)

plotPCA(hisat_vsd, intgroup=c("Group")) + theme_classic()

### A better PCA I have recreated the PCA above but with better aesthetics

print(colData(hisat_dds))

## DataFrame with 8 rows and 6 columns
##              Tissue   CD44Level   Condition       Rep    Group sizeFactor
##         <character> <character> <character> <integer> <factor>  <numeric>
## BMLow1  Bone Marrow         Low          BM         1   BMLow    0.816003
## BMLow2  Bone Marrow         Low          BM         2   BMLow    1.570110
## BMHigh1 Bone Marrow        High          BM         1   BMHigh   0.623038
## BMHigh2 Bone Marrow        High          BM         2   BMHigh   3.340993
## LNLow1   Lymph Node         Low          LN         1   LNLow    0.518614
## LNLow2   Lymph Node         Low          LN         2   LNLow    1.269942
## LNHigh1  Lymph Node        High          LN         1   LNHigh   0.680636
## LNHigh2  Lymph Node        High          LN         2   LNHigh   0.898237

hisat_PCA <- plotPCA(hisat_vsd, intgroup = c("Tissue", "CD44Level"), returnData = TRUE)

percentVar <- round(100 * attr(hisat_PCA, "percentVar"))

hisat_pca_plot <- ggplot(hisat_PCA, aes(PC1, PC2, color=Tissue, shape=CD44Level)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed() +
  ggtitle("PCA Plot of HISAT2 Counts")+
  theme_classic()

hisat_pca_plot <- as.ggplot(hisat_pca_plot)

star_vsd <- vst(star_dds, blind=FALSE)
star_PCA <- plotPCA(star_vsd, intgroup = c("Tissue", "CD44Level"), returnData = TRUE)

percentVar <- round(100 * attr(star_PCA, "percentVar"))

star_pca_plot <- ggplot(star_PCA, aes(PC1, PC2, color=Tissue, shape=CD44Level)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed() +
  ggtitle("PCA Plot of STAR Counts")+
  theme_classic()

star_pca_plot <- as.ggplot(star_pca_plot)

Side by Side of the PCA Plots

ggsave(filename = "hisat_pca.png",
       plot = hisat_pca_plot,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

ggsave(filename = "star_pca.png",
       plot = star_pca_plot,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)  

saveRDS(hisat_pca_plot, file = "hisat_pca.png")
saveRDS(star_pca_plot, file = "star_pca.png")

hisat_pca_plot <- readRDS("hisat_pca.png")
star_pca_plot <- readRDS("star_pca.png")

figPCA1 <- plot_grid(hisat_pca_plot, star_pca_plot,
                    ncol = 2,
                    labels = "B")

figPCA1

PCA plots provide and understanding of how the features in the data contribute to the principle components, and the distance between points indicates how similar they are in terms of PCA values. This data appears to be highly variable, as the plot points are widely spread across the plot. Interestingly, the samples seem to cluster somewhat by tissue type rather than by CD44 type. This indicates there may be a stronger relationship between gene expression and the source tissue than between the gene expression and CD44 level. Again, the HISAT2 data and the STAR data seem to cluster in similar ways, although the variance for each PCA variable is slightly different.

Combining heatmap and PCA visualizations

figPCA_Heatmap <- plot_grid(figHeatmap, figPCA1,
                            nrow = 2)

figPCA_Heatmap

hisat_res <- results(hisat_dds, contrast=c('Group', 'BMHigh', 'BMLow')) 
head(hisat_res)

## log2 fold change (MLE): Group BMHigh vs BMLow 
## Wald test p-value: Group BMHigh vs BMLow 
## DataFrame with 6 rows and 6 columns
##                         baseMean log2FoldChange     lfcSE      stat    pvalue
##                        <numeric>      <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000000001.5  4908.86896      -0.451071  0.468883 -0.962013  0.336043
## ENSMUSG00000000028.16 1193.87193      -0.081853  0.549126 -0.149061  0.881506
## ENSMUSG00000000037.18   46.38722       0.894413  0.690580  1.295162  0.195264
## ENSMUSG00000000049.12    2.17764       0.647286  2.252424  0.287373  0.773827
## ENSMUSG00000000056.8   958.65044      -0.506372  0.530025 -0.955374  0.339389
## ENSMUSG00000000058.7  2421.74679      -0.321896  0.485281 -0.663317  0.507127
##                            padj
##                       <numeric>
## ENSMUSG00000000001.5   0.983242
## ENSMUSG00000000028.16  0.993699
## ENSMUSG00000000037.18  0.983242
## ENSMUSG00000000049.12        NA
## ENSMUSG00000000056.8   0.983242
## ENSMUSG00000000058.7   0.983242

dim(hisat_res)

## [1] 20591     6

hisat_result <- hisat_res[order(hisat_res$padj), ]
head(hisat_result)

## log2 fold change (MLE): Group BMHigh vs BMLow 
## Wald test p-value: Group BMHigh vs BMLow 
## DataFrame with 6 rows and 6 columns
##                        baseMean log2FoldChange     lfcSE      stat      pvalue
##                       <numeric>      <numeric> <numeric> <numeric>   <numeric>
## ENSMUSG00000019831.15    65.359        3.87116  0.827763   4.67665 2.91592e-06
## ENSMUSG00000003135.16   132.135        3.80900  0.857747   4.44070 8.96665e-06
## ENSMUSG00000027637.4    228.286        3.74790  0.855406   4.38142 1.17906e-05
## ENSMUSG00000047843.17   231.727        3.90781  0.879141   4.44503 8.78790e-06
## ENSMUSG00000029229.9    204.232        3.58455  0.857485   4.18030 2.91122e-05
## ENSMUSG00000040331.17   325.703        3.69899  0.889321   4.15934 3.19162e-05
##                            padj
##                       <numeric>
## ENSMUSG00000019831.15 0.0367611
## ENSMUSG00000003135.16 0.0371610
## ENSMUSG00000027637.4  0.0371610
## ENSMUSG00000047843.17 0.0371610
## ENSMUSG00000029229.9  0.0640113
## ENSMUSG00000040331.17 0.0640113

table(hisat_result$padj<0.05)

## 
## FALSE  TRUE 
## 12603     4

Doing the same for STAR

star_res <- results(star_dds, contrast=c('Group', 'BMHigh', 'BMLow')) 
head(star_res)

## log2 fold change (MLE): Group BMHigh vs BMLow 
## Wald test p-value: Group BMHigh vs BMLow 
## DataFrame with 6 rows and 6 columns
##                         baseMean log2FoldChange     lfcSE      stat    pvalue
##                        <numeric>      <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000000001.5  5010.54538     -0.4520733  0.473439 -0.954871  0.339643
## ENSMUSG00000000028.16 1238.02136     -0.0932176  0.544633 -0.171157  0.864101
## ENSMUSG00000000037.18   47.48724      0.9486842  0.709786  1.336578  0.181360
## ENSMUSG00000000049.12    2.49053      0.8030867  2.007971  0.399949  0.689194
## ENSMUSG00000000056.8   998.47371     -0.4899313  0.528291 -0.927389  0.353725
## ENSMUSG00000000058.7  2536.63457     -0.3106055  0.482653 -0.643538  0.519875
##                            padj
##                       <numeric>
## ENSMUSG00000000001.5   0.993172
## ENSMUSG00000000028.16  0.996036
## ENSMUSG00000000037.18  0.993172
## ENSMUSG00000000049.12        NA
## ENSMUSG00000000056.8   0.993172
## ENSMUSG00000000058.7   0.993172

dim(star_res)

## [1] 19908     6

star_result <- star_res[order(star_res$padj), ]
head(star_result)

## log2 fold change (MLE): Group BMHigh vs BMLow 
## Wald test p-value: Group BMHigh vs BMLow 
## DataFrame with 6 rows and 6 columns
##                        baseMean log2FoldChange     lfcSE      stat      pvalue
##                       <numeric>      <numeric> <numeric> <numeric>   <numeric>
## ENSMUSG00000019831.15   66.5985        3.68168  0.795608   4.62751 3.70091e-06
## ENSMUSG00000003135.16  135.2666        3.70524  0.850746   4.35528 1.32894e-05
## ENSMUSG00000027637.4   239.9591        3.55814  0.833949   4.26662 1.98461e-05
## ENSMUSG00000047843.17  238.2003        3.69590  0.866000   4.26778 1.97427e-05
## ENSMUSG00000029229.9   208.1444        3.55220  0.856138   4.14910 3.33791e-05
## ENSMUSG00000040331.17  331.1430        3.58783  0.876219   4.09467 4.22779e-05
##                            padj
##                       <numeric>
## ENSMUSG00000019831.15 0.0479638
## ENSMUSG00000003135.16 0.0643013
## ENSMUSG00000027637.4  0.0643013
## ENSMUSG00000047843.17 0.0643013
## ENSMUSG00000029229.9  0.0782745
## ENSMUSG00000040331.17 0.0782745

table(star_result$padj<0.05)

## 
## FALSE  TRUE 
## 12959     1

hisat_resdata <- merge(as.data.frame(hisat_result), as.data.frame(counts(hisat_dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(hisat_resdata)[1] <- "Gene_id"
head(hisat_resdata)

##                 Gene_id  baseMean log2FoldChange     lfcSE     stat
## 1 ENSMUSG00000019831.15  65.35899       3.871164 0.8277634 4.676655
## 2 ENSMUSG00000003135.16 132.13549       3.808998 0.8577472 4.440700
## 3  ENSMUSG00000027637.4 228.28555       3.747895 0.8554056 4.381425
## 4 ENSMUSG00000047843.17 231.72657       3.907809 0.8791410 4.445031
## 5  ENSMUSG00000029229.9 204.23224       3.584547 0.8574853 4.180302
## 6 ENSMUSG00000040331.17 325.70293       3.698992 0.8893208 4.159345
##         pvalue       padj    BMLow1   BMLow2  BMHigh1   BMHigh2    LNLow1
## 1 2.915924e-06 0.03676106  7.352912 18.47004 54.57133  330.1414  44.34895
## 2 8.966653e-06 0.03716098 26.960678 33.75559 49.75621  800.3610  44.34895
## 3 1.179059e-05 0.03716098 64.950723 37.57698 70.62172 1297.5186 109.90827
## 4 8.787903e-06 0.03716098 56.372326 44.58285 73.83180 1435.5015 107.98006
## 5 2.911223e-05 0.06401130 72.303635 35.02939 70.62172 1206.8270  69.41575
## 6 3.191616e-05 0.06401130 82.107518 70.05877 81.85699 1889.5582 219.81655
##     LNLow2  LNHigh1  LNHigh2
## 1 25.19800 24.97664 17.81267
## 2 53.54575 24.97664 23.37913
## 3 99.21713 80.80676 65.68422
## 4 47.24625 38.19956 50.09814
## 5 70.08194 61.70698 47.87155
## 6 99.21713 98.43733 64.57093

#write results
write.csv(hisat_resdata, file="HISAT_BMHighvsBMLow _unfiltered_matrix.csv")

star_resdata <- merge(as.data.frame(star_result), as.data.frame(counts(star_dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(star_resdata)[1] <- "Gene_id"
head(star_resdata)

##                 Gene_id  baseMean log2FoldChange     lfcSE     stat
## 1 ENSMUSG00000019831.15  66.59854       3.681685 0.7956085 4.627508
## 2 ENSMUSG00000003135.16 135.26662       3.705243 0.8507465 4.355284
## 3  ENSMUSG00000027637.4 239.95911       3.558138 0.8339486 4.266615
## 4 ENSMUSG00000047843.17 238.20027       3.695900 0.8660002 4.267781
## 5  ENSMUSG00000029229.9 208.14436       3.552198 0.8561379 4.149096
## 6 ENSMUSG00000040331.17 331.14303       3.587826 0.8762195 4.094665
##         pvalue       padj   BMLow1   BMLow2  BMHigh1   BMHigh2    LNLow1
## 1 3.700912e-06 0.04796382 10.94213 19.09247 56.06919  331.9771  40.57350
## 2 1.328944e-05 0.06430131 30.39481 35.63929 51.26326  808.2792  42.50557
## 3 1.984608e-05 0.06430131 72.94755 47.09477 78.49686 1327.6155 117.85635
## 4 1.974266e-05 0.06430131 66.86859 51.54968 76.89489 1451.7409 104.33185
## 5 3.337908e-05 0.07827447 72.94755 38.18495 68.88500 1225.4461  71.48664
## 6 4.227788e-05 0.07827447 89.96865 73.18782 86.50675 1870.0788 231.84855
##      LNLow2   LNHigh1  LNHigh2
## 1  27.88879  28.27529 17.96992
## 2  56.57440  28.27529 29.20112
## 3 104.38375  89.29038 81.98777
## 4  53.38711  40.18067 60.64849
## 5  73.30767  65.47961 49.41729
## 6 109.96150 114.58932 73.00281

#write results
write.csv(star_resdata, file="STAR_BMHighvsBMLow _unfiltered_matrix.csv")

Now, getting a list of DEGs and filtering for rows which meet the criteria of having an absolute log2FC of 1 and padj of less than 0.05.

#Get DEG up/down lists
hisat_resdata_DEGs <- as.data.frame(hisat_result) %>% dplyr::filter(abs(log2FoldChange) > 1 & padj < 0.05) %>% tibble::rownames_to_column("Gene_id")

star_resdata_DEGs <- as.data.frame(star_result) %>% dplyr::filter(abs(log2FoldChange) > 1 & padj < 0.05) %>% tibble::rownames_to_column("Gene_id")

#write results
write.csv(star_resdata_DEGs, "STAR_CD44LowVsHigh_log2fc1_fdr05_DEGlist.csv", row.names = FALSE)
write.csv(hisat_resdata_DEGs, "HISAT_CD44LowVsHigh_log2fc1_fdr05_DEGlist.csv", row.names = FALSE)

head(hisat_resdata_DEGs)

##                 Gene_id  baseMean log2FoldChange     lfcSE     stat
## 1 ENSMUSG00000019831.15  65.35899       3.871164 0.8277634 4.676655
## 2 ENSMUSG00000003135.16 132.13549       3.808998 0.8577472 4.440700
## 3  ENSMUSG00000027637.4 228.28555       3.747895 0.8554056 4.381425
## 4 ENSMUSG00000047843.17 231.72657       3.907809 0.8791410 4.445031
##         pvalue       padj
## 1 2.915924e-06 0.03676106
## 2 8.966653e-06 0.03716098
## 3 1.179059e-05 0.03716098
## 4 8.787903e-06 0.03716098

Volcano plots

The following plots are taken inspiration from Fig. 6 in Raplee et. al.

The goal of the volcano plots is to identify genes of significance by the p-adjusted values.

# Add Threshold column based on criteria
hisat_resdata$threshold <- abs(hisat_resdata$log2FoldChange) >= 1 & hisat_resdata$padj < 0.05


hisat_threshold_padj <- hisat_resdata$padj < 0.05 
## Which values in an object are considered true and what is the length?? 

length(which(hisat_threshold_padj))

## [1] 4

# Add Threshold column based on criteria
star_resdata$threshold <- abs(star_resdata$log2FoldChange) >= 1 & star_resdata$padj < 0.05


star_threshold_padj <- star_resdata$padj < 0.05 
## Which values in an object are considered true and what is the length?? 

length(which(star_threshold_padj))

## [1] 1

str(star_resdata)

## 'data.frame':    19908 obs. of  16 variables:
##  $ Gene_id       : 'AsIs' chr  "ENSMUSG00000019831.15" "ENSMUSG00000003135.16" "ENSMUSG00000027637.4" "ENSMUSG00000047843.17" ...
##  $ baseMean      : num  66.6 135.3 240 238.2 208.1 ...
##  $ log2FoldChange: num  3.68 3.71 3.56 3.7 3.55 ...
##  $ lfcSE         : num  0.796 0.851 0.834 0.866 0.856 ...
##  $ stat          : num  4.63 4.36 4.27 4.27 4.15 ...
##  $ pvalue        : num  3.70e-06 1.33e-05 1.98e-05 1.97e-05 3.34e-05 ...
##  $ padj          : num  0.048 0.0643 0.0643 0.0643 0.0783 ...
##  $ BMLow1        : num  10.9 30.4 72.9 66.9 72.9 ...
##  $ BMLow2        : num  19.1 35.6 47.1 51.5 38.2 ...
##  $ BMHigh1       : num  56.1 51.3 78.5 76.9 68.9 ...
##  $ BMHigh2       : num  332 808 1328 1452 1225 ...
##  $ LNLow1        : num  40.6 42.5 117.9 104.3 71.5 ...
##  $ LNLow2        : num  27.9 56.6 104.4 53.4 73.3 ...
##  $ LNHigh1       : num  28.3 28.3 89.3 40.2 65.5 ...
##  $ LNHigh2       : num  18 29.2 82 60.6 49.4 ...
##  $ threshold     : logi  TRUE FALSE FALSE FALSE FALSE FALSE ...

The labels

hisatVolcano1 <- ggplot(hisat_resdata) +
        geom_point(aes(x=log2FoldChange, 
                       y=-log10(padj), colour=threshold)) + 
  theme_classic() + 
        ggtitle("HISAT2 DEG Expression") +
        xlab("Log2 fold change") + 
        ylab("-Log10(padj)") +
        theme(legend.position = "right",
              plot.title = element_text(size = rel(1.25), hjust = 0.5), 
              axis.title = element_text(size = rel(1.25))) + 
  scale_color_manual(values = c("darkslategrey", "magenta"),
                     labels = c("Not Significant", "Differentially Expressed"))

hisatVolcano1 <- as.ggplot(hisatVolcano1)

ggsave(filename = "hisatVolcano1.png",
       plot = hisatVolcano1,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(hisatVolcano1, file = "hisatVolcano1.png")

hisatVolcano1_plot <- readRDS("hisatVolcano1.png")

starVolanco1 <- ggplot(star_resdata) +
  geom_point(aes(x=log2FoldChange, 
                 y=-log10(padj), colour=threshold)) +  # Add labels
  theme_classic() + 
  ggtitle("STAR DEG Expression") +
  xlab("Log2 fold change") + 
  ylab("-Log10(padj)") +
  theme(legend.position = "right",
        plot.title = element_text(size = rel(1.25), hjust = 0.5), 
        axis.title = element_text(size = rel(1.25))) + 
  scale_color_manual(values = c("darkslategrey", "magenta"),
                     labels = c("Not Significant", "Differentially Expressed"))

starVolanco1 <- as.ggplot(starVolanco1)

ggsave(filename = "starVolanco1.png",
       plot = starVolanco1,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(starVolanco1, file = "starVolanco1.png")

starVolanco1_plot <- readRDS("starVolanco1.png")

For both the HISAT and STAR volcano plots, you can see that (before the batch correction) all the DEGs were upregulated.

volcanoes1 <- plot_grid(hisatVolcano1_plot, starVolanco1_plot,
                        ncol = 2,
                        labels = "C")

volcanoes1

The volcano plots above indicates the statistical significance of each of the genes according to the results of the HISAT and STAR analysis. Genes that are either up or downregulated are labeled in magenta, and all other genes are gray. Both plots show that the only differentially expressed genes are upregulated, and that there are very very few genes of significance (further emphasized by the venn diagram below)

Venn Diagrams

generating a venn diagram comparing the DEGs identified by HISAT2 vs STAR

# Extract genes marked as differentially expressed in each dataset
hisat_threshold_genes <- subset(hisat_resdata, threshold)$Gene_id
star_threshold_genes <- subset(star_resdata, threshold)$Gene_id

# Create a list of gene IDs for both datasets
gene_lists <- list(
  HISAT2 = hisat_threshold_genes,
  STAR = star_threshold_genes
)

# Plot the Venn diagram
ggvenn(data = gene_lists)

In the original literature from which this dataset was derived, the authors found 5,509 DEGs from 8 distinct gene clusters using STAR. Since the results from this analysis are so different to the analysis seen above, this data clearly indicates a batch correction is necessary for the most accurate analysis. That being said, for the purposes of assessing the results drawn from HISAT2 vs STAR, it is clear the results are very different depending on the aligner chosen.

Genes of interest as they are indicated above:

STAR’s one DEG is ENSMUSG00000019831.15 This is gene Wasf1, a protein coding gene. Mutations in this gene have been associated with defects in the central nervous system. The protein it encodes is an actin binding protein This gene was identified as a DEG from HISAT2, which is why there is indicated crossover in the Venn diagram above. source: https://useast.ensembl.org/Mus_musculus/Gene/Summary?db=core;g=ENSMUSG00000019831;r=10:40759471-40814566

The 3 remaining DEGs identified only by HISAT2 are: - ENSMUSG00000003135.16: Cnot11 This is another protein coding gene for the CCR4-NOT transcription complex subunit 11. It is predicted to act upstream of or within regulation of translation and regulatory ncRNA-mediated gene silencing. source: https://www.informatics.jax.org/marker/MGI:106580

ENSMUSG00000027637.4: Rab5if This is a protein coding factor for the RAB5 interacting factor, which is a part of the GEL complex subunit OPTI. It is involved in mitochondrial repirasome assembly source: https://www.informatics.jax.org/marker/MGI:1914638
ENSMUSG00000047843: Bri3 This is a protein coding gene for the I3 brain protein, which is predicted to enable identical protein binding activity source: https://www.informatics.jax.org/marker/MGI:1933174

Because the results at this stage are evident to be symptomatic of some batch effect, I will next conduct a batch correction. I will use the sva package, which provides methods for removing batch effects from RNA-seq data

Batch Correction

To aid in this process, I used the following resources: 1. The SVA package for removing batch effects and other … (n.d.). https://bioconductor.org/packages/release/bioc/vignettes/sva/inst/doc/sva.pdf 2. Ram, & Lj. (n.d.). Removing batch effects using combat and SVA. Bioinformatics Answers. https://www.biostars.org/p/196430/

# Load necessary libraries
library(sva)

# Extract counts matrix from HISAT2 and STAR datasets
hisat_counts <- counts(hisat_dds, normalized = TRUE)
star_counts <- counts(star_dds, normalized = TRUE)

hisat_counts <- as.data.frame(hisat_counts)
star_counts <- as.data.frame(star_counts)

# Perform batch correction for HISAT2 data
hisat_corrected <- ComBat(dat = hisat_counts, batch = hisatGroup, par.prior = TRUE, prior.plots = FALSE)

## Found 1667 genes with uniform expression within a single batch (all zeros); these will not be adjusted for batch.

hisat_corrected <- as.data.frame(hisat_corrected)


# Perform batch correction for STAR data
star_corrected <- ComBat(dat = star_counts, batch = starGroup, par.prior = TRUE, prior.plots = FALSE)

## Found 1650 genes with uniform expression within a single batch (all zeros); these will not be adjusted for batch.

star_corrected <- as.data.frame(star_corrected)


head(star_corrected)

##                            BMLow1      BMLow2     BMHigh1     BMHigh2
## ENSMUSG00000000001.5  4673.843966 5350.069764 5525.543325 3735.066003
## ENSMUSG00000000028.16  990.573357 1446.908007 1266.899870 1125.865281
## ENSMUSG00000000037.18   38.042464   52.222787   49.827578   59.658029
## ENSMUSG00000000049.12    2.945197    1.654655    1.621753    3.544434
## ENSMUSG00000000056.8  1169.631389  889.724868 1169.581047  712.059006
## ENSMUSG00000000058.7  2007.512416 3052.019547 2672.301236 2114.336938
##                            LNLow1      LNLow2     LNHigh1     LNHigh2
## ENSMUSG00000000001.5  4957.954944 5563.927395 5598.177380 4738.869383
## ENSMUSG00000000028.16 1400.301940 1237.704986 1619.295940  825.957161
## ENSMUSG00000000037.18   33.404537   64.454252   49.308261   31.622373
## ENSMUSG00000000049.12    1.515242    4.836595    1.467321    2.395742
## ENSMUSG00000000056.8  1049.582138  884.751566  807.036631 1314.573893
## ENSMUSG00000000058.7  2961.578349 2001.516379 2169.801004 3375.685164

After this batch correction, I must redo the visualizations above to see if the results are better.

Recreating DESeq2 datasets (DDS) using the corrected data.

First, I need to manage the transformed data for negative values since these arose during the batch correction. To deal with these, I chose to replace the values with 0.

Additionally, due to floating-point arithmetic in the data processing, the counts values are now floating point integers. In an abundance of caution to not disturb the quality of the data, I will create new dataframes for these more useable batch corrected counts values

hisat_rounded <- hisat_corrected
star_rounded <- star_corrected

# Replace negative values with zero
hisat_rounded[hisat_rounded < 0] <- 0
star_rounded[star_rounded < 0] <- 0

hisat_rounded <- round(hisat_rounded)
star_rounded <- round(star_rounded)

head(star_rounded)

##                       BMLow1 BMLow2 BMHigh1 BMHigh2 LNLow1 LNLow2 LNHigh1
## ENSMUSG00000000001.5    4674   5350    5526    3735   4958   5564    5598
## ENSMUSG00000000028.16    991   1447    1267    1126   1400   1238    1619
## ENSMUSG00000000037.18     38     52      50      60     33     64      49
## ENSMUSG00000000049.12      3      2       2       4      2      5       1
## ENSMUSG00000000056.8    1170    890    1170     712   1050    885     807
## ENSMUSG00000000058.7    2008   3052    2672    2114   2962   2002    2170
##                       LNHigh2
## ENSMUSG00000000001.5     4739
## ENSMUSG00000000028.16     826
## ENSMUSG00000000037.18      32
## ENSMUSG00000000049.12       2
## ENSMUSG00000000056.8     1315
## ENSMUSG00000000058.7     3376

# Recreate DESeq2 datasets (DDS) using the corrected data
hisat_dds_corrected <- DESeqDataSetFromMatrix(countData = hisat_rounded,
                                              colData = colData(hisat_dds),
                                              design = ~ Group)

star_dds_corrected <- DESeqDataSetFromMatrix(countData = star_rounded,
                                             colData = colData(star_dds),
                                             design = ~ Group)

# Run differential expression analysis
hisat_dds_corrected <- DESeq(hisat_dds_corrected)
star_dds_corrected <- DESeq(star_dds_corrected)

Recreating Hierarchical Clustering Heatmap

hisat_rld <- rlog(hisat_dds_corrected, blind=TRUE)
hisat_rld_mat <- assay(hisat_rld)  
hisat_rld_cor <- cor(hisat_rld_mat) 
heat.colors <- brewer.pal(6, "Blues")
hisat_heatmap <- pheatmap(hisat_rld_cor, color = heat.colors, main = "HISAT2 Corrected Heatmap")

hisat_heatmap <- as.ggplot(hisat_heatmap)
ggsave(filename = "hisat_heatmap_corrected.png",
       plot = hisat_heatmap,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(hisat_heatmap, file = "hisat_heatmap_corrected.png")

hisat_heatmap_corrected_plot <- readRDS("hisat_heatmap_corrected.png")

star_rld <- rlog(star_dds_corrected, blind=TRUE)
star_rld_mat <- assay(star_rld)  
star_rld_cor <- cor(star_rld_mat) 
heat.colors <- brewer.pal(6, "Blues")
star_heatmap <- pheatmap(star_rld_cor, color = heat.colors, main = "STAR Corrected Heatmap")

star_heatmap <- as.ggplot(star_heatmap)

ggsave(filename = "star_heatmap_corrected.png",
       plot = star_heatmap,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(star_heatmap, file = "star_heatmap_corrected.png")

star_heatmap_corrected_plot <- readRDS("star_heatmap_corrected.png")

corrected_heatmaps <- plot_grid(hisat_heatmap_corrected_plot, star_heatmap_corrected_plot,
                                ncol = 2)
heatmaps_combined <- plot_grid(figHeatmap, corrected_heatmaps,
                               nrow = 2,
                               labels = "D")
heatmaps_combined

Clearly, the batch effect has had an impact on the results of the data. The heatmap provides a sanity check that our batch correction has worked, and the correlations between the samples are showing what we would expect; very strong correlation between the same samples, and a weaker correlation between more different samples. Correlations which were very strong before the batch correction are no longer appearing as correlated

Recreating PCAs

hisat_vsd <- vst(hisat_dds_corrected, blind=FALSE)

hisat_PCA <- plotPCA(hisat_vsd, intgroup = c("Tissue", "CD44Level"), returnData = TRUE)

percentVar <- round(100 * attr(hisat_PCA, "percentVar"))

hisat_pca_plot <- ggplot(hisat_PCA, aes(PC1, PC2, color=Tissue, shape=CD44Level)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed() +
  ggtitle("PCA Plot of HISAT2 Counts (corrected)")+
  theme_classic()

hisat_pca_plot <- as.ggplot(hisat_pca_plot)
ggsave(filename = "hisat_pca_plot_corrected.png",
       plot = hisat_pca_plot,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(hisat_pca_plot, file = "hisat_pca_plot_corrected.png")
hisat_pca_corrected_plot <- readRDS("hisat_pca_plot_corrected.png")

star_vsd <- vst(star_dds_corrected, blind=FALSE)
star_PCA <- plotPCA(star_vsd, intgroup = c("Tissue", "CD44Level"), returnData = TRUE)

percentVar <- round(100 * attr(star_PCA, "percentVar"))

star_pca_plot <- ggplot(star_PCA, aes(PC1, PC2, color=Tissue, shape=CD44Level)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed() +
  ggtitle("PCA Plot of STAR Counts (corrected)")+
  theme_classic()

star_pca_plot <- as.ggplot(star_pca_plot)

ggsave(filename = "star_pca_plot_corrected.png",
       plot = star_pca_plot,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(star_pca_plot, file = "star_pca_plot_corrected.png")
star_pca_corrected_plot <- readRDS("star_pca_plot_corrected.png")

corrected_PCAs <- plot_grid(hisat_pca_corrected_plot, star_pca_corrected_plot,
                            ncol = 2,
                            labels = "E")

corrected_PCAs

combined_PCAs <- plot_grid(figPCA1, corrected_PCAs,
                           nrow = 2)
combined_PCAs

With the corrected data, the PCAs indicate clustering between the lymph node samples, and less so between the bone marrow samples. The samples are still very far apart, and there is less variance in the PCA variables in the corrected values compared ot the uncorrected values.

Since I have performed a batch correction, the padj values are all 1. This is, of course, no longer a good metric to assess genes of significance. Instead, I will use the p-values < 0.05 and keep the L2FC values > 1 to identify genes of significance post batch correction.

hisat_res <- results(hisat_dds_corrected, contrast=c('Group', 'BMHigh', 'BMLow')) 
head(hisat_res)

## log2 fold change (MLE): Group BMHigh vs BMLow 
## Wald test p-value: Group BMHigh vs BMLow 
## DataFrame with 6 rows and 6 columns
##                         baseMean log2FoldChange     lfcSE       stat    pvalue
##                        <numeric>      <numeric> <numeric>  <numeric> <numeric>
## ENSMUSG00000000001.5  5697.31140      0.1226365  0.970918  0.1263098  0.899487
## ENSMUSG00000000028.16 1376.81073      0.1393018  0.970925  0.1434732  0.885916
## ENSMUSG00000000037.18   49.03352      0.2967228  0.953448  0.3112102  0.755641
## ENSMUSG00000000049.12    1.80616     -0.8096610  2.131410 -0.3798712  0.704041
## ENSMUSG00000000056.8  1134.49771      0.0521095  1.019564  0.0511096  0.959238
## ENSMUSG00000000058.7  2804.26885      0.1505281  0.974444  0.1544760  0.877234
##                            padj
##                       <numeric>
## ENSMUSG00000000001.5          1
## ENSMUSG00000000028.16         1
## ENSMUSG00000000037.18         1
## ENSMUSG00000000049.12         1
## ENSMUSG00000000056.8          1
## ENSMUSG00000000058.7          1

dim(hisat_res)

## [1] 20591     6

hisat_result_corrected <- hisat_res[order(hisat_res$pvalue), ]
head(hisat_result_corrected)

## log2 fold change (MLE): Group BMHigh vs BMLow 
## Wald test p-value: Group BMHigh vs BMLow 
## DataFrame with 6 rows and 6 columns
##                        baseMean log2FoldChange     lfcSE      stat    pvalue
##                       <numeric>      <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000031144.16   3.64667       -5.06534   2.04308  -2.47926 0.0131655
## ENSMUSG00000046337.18   7.97645       -3.10637   1.40110  -2.21709 0.0266168
## ENSMUSG00000109051.2    4.19634        4.64404   2.19334   2.11734 0.0342313
## ENSMUSG00000038583.14   2.03877       -5.48740   2.61062  -2.10196 0.0355571
## ENSMUSG00000044788.11   2.06370        5.27089   2.55278   2.06476 0.0389454
## ENSMUSG00000083906.2    2.47343        4.88317   2.42498   2.01370 0.0440414
##                            padj
##                       <numeric>
## ENSMUSG00000031144.16         1
## ENSMUSG00000046337.18         1
## ENSMUSG00000109051.2          1
## ENSMUSG00000038583.14         1
## ENSMUSG00000044788.11         1
## ENSMUSG00000083906.2          1

table(hisat_result_corrected$pvalue<0.05)

## 
## FALSE  TRUE 
## 20581    10

Doing the same for STAR

star_res <- results(star_dds_corrected, contrast=c('Group', 'BMHigh', 'BMLow')) 
head(star_res)

## log2 fold change (MLE): Group BMHigh vs BMLow 
## Wald test p-value: Group BMHigh vs BMLow 
## DataFrame with 6 rows and 6 columns
##                         baseMean log2FoldChange     lfcSE       stat    pvalue
##                        <numeric>      <numeric> <numeric>  <numeric> <numeric>
## ENSMUSG00000000001.5  5837.42100      0.1299552  0.976352  0.1331028  0.894112
## ENSMUSG00000000028.16 1439.18739      0.1492737  0.980475  0.1522464  0.878993
## ENSMUSG00000000037.18   50.07160      0.2752073  0.952356  0.2889754  0.772600
## ENSMUSG00000000049.12    2.60973     -0.2616289  1.742020 -0.1501871  0.880617
## ENSMUSG00000000056.8  1185.41984      0.0654384  1.023048  0.0639642  0.948999
## ENSMUSG00000000058.7  2952.75353      0.1598753  0.980316  0.1630855  0.870451
##                            padj
##                       <numeric>
## ENSMUSG00000000001.5          1
## ENSMUSG00000000028.16         1
## ENSMUSG00000000037.18         1
## ENSMUSG00000000049.12         1
## ENSMUSG00000000056.8          1
## ENSMUSG00000000058.7          1

dim(star_res)

## [1] 19908     6

star_result_corrected <- star_res[order(star_res$pvalue), ]
head(star_result_corrected)

## log2 fold change (MLE): Group BMHigh vs BMLow 
## Wald test p-value: Group BMHigh vs BMLow 
## DataFrame with 6 rows and 6 columns
##                        baseMean log2FoldChange     lfcSE      stat     pvalue
##                       <numeric>      <numeric> <numeric> <numeric>  <numeric>
## ENSMUSG00000060989.9    7.72410        6.78053   2.61845   2.58952 0.00961095
## ENSMUSG00000031144.16   3.77169       -4.41325   1.95291  -2.25983 0.02383201
## ENSMUSG00000046337.18   8.44016       -3.08623   1.38364  -2.23052 0.02571298
## ENSMUSG00000109051.2    4.40102        4.64114   2.18558   2.12353 0.03370935
## ENSMUSG00000038583.14   2.02994       -5.49382   2.60039  -2.11269 0.03462737
## ENSMUSG00000100927.2    1.98745        5.07013   2.41928   2.09572 0.03610707
##                            padj
##                       <numeric>
## ENSMUSG00000060989.9          1
## ENSMUSG00000031144.16         1
## ENSMUSG00000046337.18         1
## ENSMUSG00000109051.2          1
## ENSMUSG00000038583.14         1
## ENSMUSG00000100927.2          1

table(star_result_corrected$pvalue<0.05)

## 
## FALSE  TRUE 
## 19897    11

hisat_resdata_corrected <- merge(as.data.frame(hisat_result_corrected), as.data.frame(counts(hisat_dds_corrected, normalized=TRUE)), by="row.names", sort=FALSE)
names(hisat_resdata_corrected)[1] <- "Gene_id"
head(hisat_resdata_corrected)

##                 Gene_id baseMean log2FoldChange    lfcSE      stat     pvalue
## 1 ENSMUSG00000031144.16 3.646668      -5.065336 2.043083 -2.479261 0.01316549
## 2 ENSMUSG00000046337.18 7.976448      -3.106371 1.401101 -2.217092 0.02661681
## 3  ENSMUSG00000109051.2 4.196344       4.644044 2.193342  2.117337 0.03423127
## 4 ENSMUSG00000038583.14 2.038767      -5.487404 2.610617 -2.101956 0.03555709
## 5 ENSMUSG00000044788.11 2.063696       5.270887 2.552780  2.064763 0.03894539
## 6  ENSMUSG00000083906.2 2.473435       4.883165 2.424976  2.013697 0.04404139
##   padj   BMLow1    BMLow2   BMHigh1   BMHigh2    LNLow1    LNLow2  LNHigh1
## 1    1 18.38228 1.9106938  0.000000 0.2993122  3.856431 4.7246253 0.000000
## 2    1 34.31359 6.3689793  3.210078 1.7958735  0.000000 0.0000000 5.876855
## 3    1  0.00000 0.0000000  8.025195 0.8979367 11.569292 3.1497502 8.815283
## 4    1 11.02937 0.6368979  0.000000 0.0000000  3.856431 0.7874376 0.000000
## 5    1  0.00000 0.0000000 12.840312 0.2993122  0.000000 0.7874376 1.469214
## 6    1  0.00000 0.0000000  9.630234 0.5986245  3.856431 2.3623127 0.000000
##     LNHigh2
## 1  0.000000
## 2 12.246211
## 3  1.113292
## 4  0.000000
## 5  1.113292
## 6  3.339876

#write results
write.csv(hisat_resdata_corrected, file="CORRECTED_HISAT_BMHighvsBMLow _unfiltered_matrix.csv")

star_resdata_corrected <- merge(as.data.frame(star_result_corrected), as.data.frame(counts(star_dds_corrected, normalized=TRUE)), by="row.names", sort=FALSE)
names(star_resdata_corrected)[1] <- "Gene_id"
head(star_resdata_corrected)

##                 Gene_id baseMean log2FoldChange    lfcSE      stat      pvalue
## 1  ENSMUSG00000060989.9 7.724096       6.780527 2.618448  2.589521 0.009610954
## 2 ENSMUSG00000031144.16 3.771692      -4.413245 1.952913 -2.259827 0.023832010
## 3 ENSMUSG00000046337.18 8.440158      -3.086226 1.383635 -2.230519 0.025712977
## 4  ENSMUSG00000109051.2 4.401022       4.641142 2.185577  2.123531 0.033709353
## 5 ENSMUSG00000038583.14 2.029939      -5.493822 2.600393 -2.112689 0.034627366
## 6  ENSMUSG00000100927.2 1.987449       5.070125 2.419277  2.095720 0.036107067
##   padj   BMLow1    BMLow2   BMHigh1   BMHigh2    LNLow1    LNLow2   LNHigh1
## 1    1  0.00000 0.0000000 35.243489 0.8782462 15.456570 7.9682248  0.000000
## 2    1 18.23689 1.9092474  0.000000 0.5854974  3.864143 5.5777574  0.000000
## 3    1 36.47378 6.3641581  3.203954 2.0492411  0.000000 0.0000000  5.952692
## 4    1  0.00000 0.0000000  8.009884 0.8782462 11.592428 3.1872899 10.417211
## 5    1 10.94213 0.6364158  0.000000 0.0000000  3.864143 0.7968225  0.000000
## 6    1  0.00000 0.0000000  9.611861 1.7564923  0.000000 0.7968225  1.488173
##    LNHigh2
## 1  2.24624
## 2  0.00000
## 3 13.47744
## 4  1.12312
## 5  0.00000
## 6  2.24624

#write results
write.csv(star_resdata_corrected, file="CORRECTED_STAR_BMHighvsBMLow _unfiltered_matrix_CORRECTED.csv")

Now, getting a list of DEGs and filtering for rows which meet the criteria of having an absolute log2FC of 1 and padj of less than 0.05.

#Get DEG up/down lists
hisat_resdata_DEGs <- as.data.frame(hisat_result_corrected) %>% dplyr::filter(abs(log2FoldChange) > 1 & pvalue < 0.05) %>% tibble::rownames_to_column("Gene_id")

star_resdata_DEGs <- as.data.frame(star_result_corrected) %>% dplyr::filter(abs(log2FoldChange) > 1 & pvalue < 0.05) %>% tibble::rownames_to_column("Gene_id")

#write results
write.csv(star_resdata_DEGs, "CORRECTED_STAR_CD44LowVsHigh_log2fc1_fdr05_DEGlist.csv", row.names = FALSE)
write.csv(hisat_resdata_DEGs, "CORRECTED_HISAT_CD44LowVsHigh_log2fc1_fdr05_DEGlist.csv", row.names = FALSE)

head(hisat_resdata_DEGs)

##                 Gene_id baseMean log2FoldChange    lfcSE      stat     pvalue
## 1 ENSMUSG00000031144.16 3.646668      -5.065336 2.043083 -2.479261 0.01316549
## 2 ENSMUSG00000046337.18 7.976448      -3.106371 1.401101 -2.217092 0.02661681
## 3  ENSMUSG00000109051.2 4.196344       4.644044 2.193342  2.117337 0.03423127
## 4 ENSMUSG00000038583.14 2.038767      -5.487404 2.610617 -2.101956 0.03555709
## 5 ENSMUSG00000044788.11 2.063696       5.270887 2.552780  2.064763 0.03894539
## 6  ENSMUSG00000083906.2 2.473435       4.883165 2.424976  2.013697 0.04404139
##   padj
## 1    1
## 2    1
## 3    1
## 4    1
## 5    1
## 6    1

Recreating volcano plots

The following volcano plots were created with the pvalue < 0.05 and L2FC > 1, which is an adjustment for the batch correction and different from the original volcano plots, which used padj < 0.05 and L2FC > 1

# Add Threshold column based on criteria
hisat_resdata_corrected$threshold <- abs(hisat_resdata_corrected$log2FoldChange) >= 1 & hisat_resdata_corrected$pvalue < 0.05


hisat_threshold_pvalue <- hisat_resdata_corrected$pvalue < 0.05 
## Which values in an object are considered true and what is the length?? 

length(which(hisat_threshold_pvalue))

## [1] 10

# Add Threshold column based on criteria
star_resdata_corrected$threshold <- abs(star_resdata_corrected$log2FoldChange) >= 1 & star_resdata_corrected$pvalue < 0.05


star_threshold_pvalue <- star_resdata_corrected$pvalue < 0.05 
## Which values in an object are considered true and what is the length?? 

length(which(star_threshold_pvalue))

## [1] 11

str(star_resdata_corrected)

## 'data.frame':    19908 obs. of  16 variables:
##  $ Gene_id       : 'AsIs' chr  "ENSMUSG00000060989.9" "ENSMUSG00000031144.16" "ENSMUSG00000046337.18" "ENSMUSG00000109051.2" ...
##  $ baseMean      : num  7.72 3.77 8.44 4.4 2.03 ...
##  $ log2FoldChange: num  6.78 -4.41 -3.09 4.64 -5.49 ...
##  $ lfcSE         : num  2.62 1.95 1.38 2.19 2.6 ...
##  $ stat          : num  2.59 -2.26 -2.23 2.12 -2.11 ...
##  $ pvalue        : num  0.00961 0.02383 0.02571 0.03371 0.03463 ...
##  $ padj          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BMLow1        : num  0 18.2 36.5 0 10.9 ...
##  $ BMLow2        : num  0 1.909 6.364 0 0.636 ...
##  $ BMHigh1       : num  35.24 0 3.2 8.01 0 ...
##  $ BMHigh2       : num  0.878 0.585 2.049 0.878 0 ...
##  $ LNLow1        : num  15.46 3.86 0 11.59 3.86 ...
##  $ LNLow2        : num  7.968 5.578 0 3.187 0.797 ...
##  $ LNHigh1       : num  0 0 5.95 10.42 0 ...
##  $ LNHigh2       : num  2.25 0 13.48 1.12 0 ...
##  $ threshold     : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...

hisat_corrected_volcano <- ggplot(hisat_resdata_corrected) +
        geom_point(aes(x=log2FoldChange, 
                       y=-log10(pvalue), colour=threshold)) + 
  theme_classic() + 
        ggtitle("HISAT2 DEG Expression (corrected)") +
        xlab("Log2 fold change") + 
        ylab("-Log10(p-value)") +
        theme(legend.position = "right",
              plot.title = element_text(size = rel(1.25), hjust = 0.5), 
              axis.title = element_text(size = rel(1.25))) + 
  scale_color_manual(values = c("darkslategrey", "magenta"),
                     labels = c("Not Significant", "Differentially Expressed")) +
  ylim(0, 2)
                     
ggsave(filename = "hisat_corrected_volcano.png",
       plot = hisat_corrected_volcano,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(hisat_corrected_volcano, file = "hisat_corrected_volcano.png")

hisat_corrected_volcano_plot <- readRDS("hisat_corrected_volcano.png")

star_corrected_volcano <- ggplot(star_resdata_corrected) +
        geom_point(aes(x=log2FoldChange, 
                       y=-log10(pvalue), colour=threshold)) + 
  theme_classic() + 
        ggtitle("STAR DEG Expression (corrected)") +
        xlab("Log2 fold change") + 
        ylab("-Log10(pvalue)") +
        theme(legend.position = "right",
              plot.title = element_text(size = rel(1.25), hjust = 0.5), 
              axis.title = element_text(size = rel(1.25))) + 
  scale_color_manual(values = c("darkslategrey", "magenta"),
                     labels = c("Not Significant", "Differentially Expressed"))  +
  ylim(0, 2)


ggsave(filename = "star_corrected_volcano.png",
       plot = star_corrected_volcano,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(star_corrected_volcano, file = "star_corrected_volcano.png")

star_corrected_volcano_plot <- readRDS("star_corrected_volcano.png")

corrected_volcanos <- plot_grid(hisat_corrected_volcano_plot, star_corrected_volcano_plot,
                                ncol = 2,
                                labels = "F")

corrected_volcanos

Identified as a DEG by both aligners:

ENSMUSG00000031144.16 - syp
- Downregulated
ENSMUSG00000046337.18 - Fam178b
- Downregulated
ENSMUSG00000109051.2 - Gm44913
- Upregulated
ENSMUSG00000038583.14 - Pln
- Downregulated
ENSMUSG00000108688.2 - Gm44985
- Downregulated
ENSMUSG00000109302.2 - Gm44967
- Upregulated

Identified by HISAT2 only:

ENSMUSG00000044788.11
- Upregulated
ENSMUSG00000083906.2
- Upregulated
ENSMUSG00000044231.4
- Upregulated
ENSMUSG00000080801.3
- Upregulated

Identified by STAR only:

ENSMUSG00000060989.9
- Upregulated
ENSMUSG00000100927.2
- Upregulated
ENSMUSG00000082194.4
- Upregulated
ENSMUSG00000104843.2
- Upregulated
ENSMUSG00000084989.4
- Downregulated

# For HISAT data
hisat_upregulated <- hisat_resdata_corrected$Gene_id[hisat_resdata_corrected$threshold & hisat_resdata_corrected$log2FoldChange > 0]
hisat_downregulated <- hisat_resdata_corrected$Gene_id[hisat_resdata_corrected$threshold & hisat_resdata_corrected$log2FoldChange < 0]

# For STAR data
star_upregulated <- star_resdata_corrected$Gene_id[star_resdata_corrected$threshold & star_resdata_corrected$log2FoldChange > 0]
star_downregulated <- star_resdata_corrected$Gene_id[star_resdata_corrected$threshold & star_resdata_corrected$log2FoldChange < 0]

hisat_upregulated

## [1] "ENSMUSG00000109051.2"  "ENSMUSG00000044788.11" "ENSMUSG00000083906.2" 
## [4] "ENSMUSG00000044231.4"  "ENSMUSG00000080801.3"  "ENSMUSG00000109302.2"

hisat_downregulated

## [1] "ENSMUSG00000031144.16" "ENSMUSG00000046337.18" "ENSMUSG00000038583.14"
## [4] "ENSMUSG00000108688.2"

star_upregulated

## [1] "ENSMUSG00000060989.9" "ENSMUSG00000109051.2" "ENSMUSG00000100927.2"
## [4] "ENSMUSG00000082194.4" "ENSMUSG00000104843.2" "ENSMUSG00000109302.2"

star_downregulated

## [1] "ENSMUSG00000031144.16" "ENSMUSG00000046337.18" "ENSMUSG00000038583.14"
## [4] "ENSMUSG00000108688.2"  "ENSMUSG00000084989.4"

The amount of genes which meet the threshold (a p-adjusted value less than 0.05) is different for the HISAT2 alignment data and the STAR alignment data. From the HISAT2 alignment, we have 10 genes of significance, and from the STAR alignment we have 11 gene of significance. As is shown in the Venn Diagram below, these DEGs the two aligners have identified are not all the same. In fact, only 40% of the DEGs found are the same between the two aligners. Based on the aligner used, which genes are considered differentially expressed are much different.

Corrected Venn Diagrams

generating a venn diagram comparing the DEGs identified by HISAT2 vs STAR. I want to identify what genes each aligner considers a DEG

# Extract genes marked as differentially expressed in each dataset
hisat_threshold_genes_corrected <- subset(hisat_resdata_corrected, threshold)$Gene_id
star_threshold_genes_corrected <- subset(star_resdata_corrected, threshold)$Gene_id

# Create a list of gene IDs for both datasets
gene_lists <- list(
  HISAT2 = hisat_threshold_genes_corrected,
  STAR = star_threshold_genes_corrected
)

# Plot the Venn diagram
ggvenn(data = gene_lists)

To summarize the venn diagram’s findings below:

hisat_threshold_genes_corrected

##  [1] "ENSMUSG00000031144.16" "ENSMUSG00000046337.18" "ENSMUSG00000109051.2" 
##  [4] "ENSMUSG00000038583.14" "ENSMUSG00000044788.11" "ENSMUSG00000083906.2" 
##  [7] "ENSMUSG00000044231.4"  "ENSMUSG00000080801.3"  "ENSMUSG00000108688.2" 
## [10] "ENSMUSG00000109302.2"

star_threshold_genes_corrected

##  [1] "ENSMUSG00000060989.9"  "ENSMUSG00000031144.16" "ENSMUSG00000046337.18"
##  [4] "ENSMUSG00000109051.2"  "ENSMUSG00000038583.14" "ENSMUSG00000100927.2" 
##  [7] "ENSMUSG00000082194.4"  "ENSMUSG00000104843.2"  "ENSMUSG00000108688.2" 
## [10] "ENSMUSG00000109302.2"  "ENSMUSG00000084989.4"

The results of HISAT and STAR alignments look much more similar after the batch correction, but there are still significant differences in what each aligner found to be a DEG.

Boxplots

For the sake of simplicity, the following boxplots are created for the genes which both HISAT2 and STAR agree are differentially expressed, after batch correction.

Data Transformation

hisat_all_matrix <- subset(hisat_resdata_corrected, threshold)
star_all_matrix <- subset(star_resdata_corrected, threshold)


hisat_symbol_deg <- hisat_all_matrix %>% select(Gene_id, BMLow1:LNHigh2)
star_symbol_deg <- star_all_matrix %>% select(Gene_id, BMLow1:LNHigh2)

hisat_expression_plot <- pivot_longer(hisat_symbol_deg,
                                cols = 2:9,
                                values_to = "normalized_counts")

star_expression_plot <- pivot_longer(star_symbol_deg,
                                cols = 2:9,
                                values_to = "normalized_counts")

head(hisat_expression_plot)

## # A tibble: 6 × 3
##   Gene_id               name    normalized_counts
##   <I<chr>>              <chr>               <dbl>
## 1 ENSMUSG00000031144.16 BMLow1             18.4  
## 2 ENSMUSG00000031144.16 BMLow2              1.91 
## 3 ENSMUSG00000031144.16 BMHigh1             0    
## 4 ENSMUSG00000031144.16 BMHigh2             0.299
## 5 ENSMUSG00000031144.16 LNLow1              3.86 
## 6 ENSMUSG00000031144.16 LNLow2              4.72

combined_expression_plot <- left_join(x = hisat_expression_plot, 
                             y = star_expression_plot, 
                             by = "Gene_id")

head(combined_expression_plot)

## # A tibble: 6 × 5
##   Gene_id               name.x normalized_counts.x name.y  normalized_counts.y
##   <I<chr>>              <chr>                <dbl> <chr>                 <dbl>
## 1 ENSMUSG00000031144.16 BMLow1                18.4 BMLow1               18.2  
## 2 ENSMUSG00000031144.16 BMLow1                18.4 BMLow2                1.91 
## 3 ENSMUSG00000031144.16 BMLow1                18.4 BMHigh1               0    
## 4 ENSMUSG00000031144.16 BMLow1                18.4 BMHigh2               0.585
## 5 ENSMUSG00000031144.16 BMLow1                18.4 LNLow1                3.86 
## 6 ENSMUSG00000031144.16 BMLow1                18.4 LNLow2                5.58

Agreed DEGs boxplots

I chose to plot each DEG based on the aligner and the condition. Each boxplot has a y-axis scale of (0,20) and each value deals with the normalized counts values

hisat_syp_plot <- hisat_expression_plot %>%
  filter(Gene_id == "ENSMUSG00000031144.16")

hisat_syp <- ggplot(hisat_syp_plot) +
  geom_boxplot(aes(x=name, 
                   y=normalized_counts, 
                   fill=name)) +
  ggtitle("HISAT syp") + 
  ylim(0, 20)

ggsave(filename = "hisat_syp.png",
       plot = hisat_syp,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(hisat_syp, file = "hisat_syp.png")
hisat_syp <- readRDS("hisat_syp.png")

# ----------------------------------------------

star_syp_plot <- star_expression_plot %>%
  filter(Gene_id == "ENSMUSG00000031144.16")

star_syp <- ggplot(star_syp_plot) +
  geom_boxplot(aes(x=name, 
                   y=normalized_counts, 
                   fill=name)) +
  ggtitle("STAR syp") + 
  ylim(0, 20)

ggsave(filename = "star_syp.png",
       plot = star_syp,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(star_syp, file = "star_syp.png")
star_syp <- readRDS("star_syp.png")

syp_plot <- plot_grid(hisat_syp, star_syp,
                      nrow = 2,
                      labels = "G")

syp_plot

hisat_Fam178b_plot <- hisat_expression_plot %>%
  filter(Gene_id == "ENSMUSG00000046337.18")

hisat_Fam178b <- ggplot(hisat_Fam178b_plot) +
  geom_boxplot(aes(x=name, 
                   y=normalized_counts, 
                   fill=name)) +
  ggtitle("HISAT Fam178b") + 
  ylim(0, 20)


ggsave(filename = "hisat_Fam178b.png",
       plot = hisat_Fam178b,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(hisat_Fam178b, file = "hisat_Fam178b.png")
hisat_Fam178b <- readRDS("hisat_Fam178b.png")

#-----------------------------------------------

star_Fam178b_plot <- star_expression_plot %>%
  filter(Gene_id == "ENSMUSG00000046337.18")

star_Fam178b <- ggplot(star_Fam178b_plot) +
  geom_boxplot(aes(x=name, 
                   y=normalized_counts, 
                   fill=name)) +
  ggtitle("STAR Fam178b") + 
  ylim(0, 20)

ggsave(filename = "star_Fam178b.png",
       plot = star_Fam178b,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(star_Fam178b, file = "star_Fam178b.png")
star_Fam178b <- readRDS("star_Fam178b.png")

Fam178b_plots <- plot_grid(hisat_Fam178b, star_Fam178b,
                           nrow = 2,
                           labels = "H")

Fam178b_plots

hisat_Gm44913 <- hisat_expression_plot %>%
  filter(Gene_id == "ENSMUSG00000109051.2")

hisat_Gm44913 <- ggplot(hisat_Gm44913) +
  geom_boxplot(aes(x=name, 
                   y=normalized_counts, 
                   fill=name)) +
  ggtitle("HISAT Gm44913") + 
  ylim(0, 20)

ggsave(filename = "hisat_Gm44913.png",
       plot = hisat_Gm44913,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(hisat_Gm44913, file = "hisat_Gm44913.png")
hisat_Gm44913 <- readRDS("hisat_Gm44913.png")

#-------------------------------------------

star_Gm44913 <- star_expression_plot %>%
  filter(Gene_id == "ENSMUSG00000109051.2")

star_Gm44913 <- ggplot(star_Gm44913) +
  geom_boxplot(aes(x=name, 
                   y=normalized_counts, 
                   fill=name)) +
  ggtitle("STAR Gm44913") + 
  ylim(0, 20)

ggsave(filename = "star_Gm44913.png",
       plot = hisat_Gm44913,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(star_Gm44913, file = "star_Gm44913.png")
star_Gm44913 <- readRDS("star_Gm44913.png")

Gm44913_plots <- plot_grid(hisat_Gm44913, star_Gm44913,
                           nrow = 2,
                           labels = "I")

Gm44913_plots

hisat_Pln <- hisat_expression_plot %>%
  filter(Gene_id == "ENSMUSG00000038583.14")

hisat_Pln <- ggplot(hisat_Pln) +
  geom_boxplot(aes(x=name, 
                   y=normalized_counts, 
                   fill=name)) +
  ggtitle("HISAT Pln") + 
  ylim(0, 20)

ggsave(filename = "hisat_Pln.png",
       plot = hisat_Pln,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(hisat_Pln, file = "hisat_Pln.png")
hisat_Pln <- readRDS("hisat_Pln.png")

#-------------------------------------

star_Pln <- star_expression_plot %>%
  filter(Gene_id == "ENSMUSG00000038583.14")

star_Pln <- ggplot(star_Pln) +
  geom_boxplot(aes(x=name, 
                   y=normalized_counts, 
                   fill=name)) +
  ggtitle("STAR Pln") + 
  ylim(0, 20)

ggsave(filename = "star_Pln.png",
       plot = star_Pln,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(star_Pln, file = "star_Pln.png")
star_Pln <- readRDS("star_Pln.png")

Pln_plots <- plot_grid(hisat_Pln, star_Pln,
                           nrow = 2,
                       labels = "J")

Pln_plots

hisat_Gm44985 <- hisat_expression_plot %>%
  filter(Gene_id == "ENSMUSG00000108688.2")

hisat_Gm44985 <- ggplot(hisat_Gm44985) +
  geom_boxplot(aes(x=name, 
                   y=normalized_counts, 
                   fill=name)) +
  ggtitle("HISAT Gm44985") + 
  ylim(0, 20)

ggsave(filename = "hisat_Gm44985.png",
       plot = hisat_Gm44985,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(hisat_Gm44985, file = "hisat_Gm44985.png")
hisat_Gm44985 <- readRDS("hisat_Gm44985.png")

#---------------------------------------


star_Gm44985 <- star_expression_plot %>%
  filter(Gene_id == "ENSMUSG00000108688.2")

star_Gm44985 <- ggplot(star_Gm44985) +
  geom_boxplot(aes(x=name, 
                   y=normalized_counts, 
                   fill=name)) +
  ggtitle("STAR Gm44985") + 
  ylim(0, 20)

ggsave(filename = "star_Gm44985.png",
       plot = star_Gm44985,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(star_Gm44985, file = "star_Gm44985.png")
star_Gm44985 <- readRDS("star_Gm44985.png")

Gm44985_plot <- plot_grid(hisat_Gm44985, star_Gm44985,
                           nrow = 2,
                          labels = "K")

Gm44985_plot

hisat_Gm44967 <- hisat_expression_plot %>%
  filter(Gene_id == "ENSMUSG00000109302.2")

hisat_Gm44967 <- ggplot(hisat_Gm44967) +
  geom_boxplot(aes(x=name, 
                   y=normalized_counts, 
                   fill=name)) +
  ggtitle("HISAT Gm44967") + 
  ylim(0, 20)

ggsave(filename = "hisat_Gm44967.png",
       plot = hisat_Gm44967,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(hisat_Gm44967, file = "hisat_Gm44967.png")
hisat_Gm44967 <- readRDS("hisat_Gm44967.png")

#---------------------------------------------


star_Gm44967 <- star_expression_plot %>%
  filter(Gene_id == "ENSMUSG00000109302.2")

star_Gm44967 <- ggplot(star_Gm44967) +
  geom_boxplot(aes(x=name, 
                   y=normalized_counts, 
                   fill=name)) +
  ggtitle("STAR Gm44967") + 
  ylim(0, 20)
ggsave(filename = "star_Gm44967.png",
       plot = star_Gm44967,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(star_Gm44967, file = "star_Gm44967.png")
star_Gm44967 <- readRDS("star_Gm44967.png")

Gm44967_plots <- plot_grid(hisat_Gm44967, star_Gm44967,
                           nrow = 2,
                           labels = "L")

Gm44967_plots

All boxplots together

boxplots <- plot_grid(syp_plot, Fam178b_plots, Gm44913_plots, Pln_plots, Gm44985_plot, Gm44967_plots,
                      ncol = 6)

boxplots

This figure may be too small to use.

For the most part, the boxplots indicate that the normalized counts the DEGs found for each sample across both aligners are similar. This shows that when the aligners are in concensus, they are both aligning the dataset similarly. Further research is necessary to understand why the aligners results are so different, and what the implications of this could mean for transcriptomics and bioinformatic pipelines moving forward.

Disagreed DEGs Boxplots

There are several DEGs identified by either HISAT2 or STAR, but not both. I have chosen 2 to examine further:

Identified by HISAT2 only: * ENSMUSG00000044788.11 - Fads6

Identified by STAR only: * ENSMUSG00000084989.4 - Crocc2

I must first add them to the hisat and star expression plot data

# Filter out the genes not meeting the threshold
hisat_additional_genes <- hisat_resdata_corrected %>%
  filter(Gene_id %in% c("ENSMUSG00000084989.4", "ENSMUSG00000044788.11") & !threshold)

star_additional_genes <- star_resdata_corrected %>%
  filter(Gene_id %in% c("ENSMUSG00000084989.4", "ENSMUSG00000044788.11") & !threshold)

# Combine additional genes with hisat_expression_plot and star_expression_plot
hisat_expression_plot <- bind_rows(hisat_expression_plot, 
                                   pivot_longer(hisat_additional_genes,
                                                cols = 2:9,
                                                values_to = "normalized_counts"))

star_expression_plot <- bind_rows(star_expression_plot, 
                                  pivot_longer(star_additional_genes,
                                               cols = 2:9,
                                               values_to = "normalized_counts"))

hisat_Fads6 <- hisat_expression_plot %>%
  filter(Gene_id == "ENSMUSG00000044788.11")

hisat_Fads6 <- ggplot(hisat_Fads6) +
  geom_boxplot(aes(x=name, 
                   y=normalized_counts, 
                   fill=name)) +
  ggtitle("HISAT Fads6") + 
  ylim(0, 20)

ggsave(filename = "hisat_Fads6.png",
       plot = hisat_Fads6,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(hisat_Fads6, file = "hisat_Fads6.png")
hisat_Fads6 <- readRDS("hisat_Fads6.png")

#---------------------------------------------


star_Fads6 <- star_expression_plot %>%
  filter(Gene_id == "ENSMUSG00000044788.11")

star_Fads6 <- ggplot(star_Fads6) +
  geom_boxplot(aes(x=name, 
                   y=normalized_counts, 
                   fill=name)) +
  ggtitle("STAR Fads6") + 
  ylim(0, 20)
ggsave(filename = "star_Fads6.png",
       plot = star_Fads6,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(star_Fads6, file = "star_Fads6.png")
star_Fads6 <- readRDS("star_Fads6.png")

Fads6_plots <- plot_grid(hisat_Fads6, star_Fads6,
                           nrow = 2,
                         labels = "M")

Fads6_plots

hisat_Crocc2 <- hisat_expression_plot %>%
  filter(Gene_id == "ENSMUSG00000084989.4")

hisat_Crocc2 <- ggplot(hisat_Crocc2) +
  geom_boxplot(aes(x=name, 
                   y=normalized_counts, 
                   fill=name)) +
  ggtitle("HISAT Crocc2") + 
  ylim(0, 20)

ggsave(filename = "hisat_Crocc2.png",
       plot = hisat_Crocc2,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(hisat_Crocc2, file = "hisat_Crocc2.png")
hisat_Crocc2 <- readRDS("hisat_Crocc2.png")

#---------------------------------------------


star_Crocc2 <- star_expression_plot %>%
  filter(Gene_id == "ENSMUSG00000084989.4")

star_Crocc2 <- ggplot(star_Crocc2) +
  geom_boxplot(aes(x=name, 
                   y=normalized_counts, 
                   fill=name)) +
  ggtitle("STAR Crocc2") + 
  ylim(0, 20)
ggsave(filename = "star_Crocc2.png",
       plot = star_Crocc2,
       units = "in", 
       width = 5, 
       height = 6, 
       dpi = 300)

saveRDS(star_Crocc2, file = "star_Crocc2.png")
star_Crocc2 <- readRDS("star_Crocc2.png")

Crocc2_plots <- plot_grid(hisat_Crocc2, star_Crocc2,
                           nrow = 2,
                          labels = "N")

Crocc2_plots

It appears that when the aligners do not agree on whether a gene is a DEG, they have different results for the normalized counts as well. This means that they aligned the original data differently, and one may have spliced when another didnt.

sessionInfo()

## R version 4.2.2 (2022-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] sva_3.46.0                  BiocParallel_1.32.6        
##  [3] genefilter_1.80.3           mgcv_1.8-41                
##  [5] nlme_3.1-160                limma_3.54.2               
##  [7] ggvenn_0.1.10               ggplotify_0.1.2            
##  [9] patchwork_1.2.0             cowplot_1.1.3              
## [11] lubridate_1.9.3             forcats_1.0.0              
## [13] stringr_1.5.1               dplyr_1.1.4                
## [15] purrr_1.0.2                 readr_2.1.5                
## [17] tidyr_1.3.1                 tibble_3.2.1               
## [19] tidyverse_2.0.0             ggplot2_3.5.1              
## [21] pheatmap_1.0.12             RColorBrewer_1.1-3         
## [23] DESeq2_1.38.3               SummarizedExperiment_1.28.0
## [25] Biobase_2.58.0              MatrixGenerics_1.10.0      
## [27] matrixStats_1.3.0           GenomicRanges_1.50.2       
## [29] GenomeInfoDb_1.34.9         IRanges_2.32.0             
## [31] S4Vectors_0.36.2            BiocGenerics_0.44.0        
## [33] knitr_1.46                 
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7           fs_1.6.4               bit64_4.0.5           
##  [4] httr_1.4.7             tools_4.2.2            bslib_0.7.0           
##  [7] utf8_1.2.4             R6_2.5.1               DBI_1.2.2             
## [10] colorspace_2.1-0       withr_3.0.0            tidyselect_1.2.1      
## [13] bit_4.0.5              compiler_4.2.2         textshaping_0.3.7     
## [16] cli_3.6.2              DelayedArray_0.24.0    labeling_0.4.3        
## [19] sass_0.4.9             scales_1.3.0           systemfonts_1.0.6     
## [22] digest_0.6.35          yulab.utils_0.1.4      rmarkdown_2.26        
## [25] XVector_0.38.0         pkgconfig_2.0.3        htmltools_0.5.8.1     
## [28] highr_0.10             fastmap_1.1.1          rlang_1.1.3           
## [31] rstudioapi_0.16.0      RSQLite_2.3.6          farver_2.1.1          
## [34] gridGraphics_0.5-1     jquerylib_0.1.4        generics_0.1.3        
## [37] jsonlite_1.8.8         RCurl_1.98-1.14        magrittr_2.0.3        
## [40] GenomeInfoDbData_1.2.9 Matrix_1.5-1           Rcpp_1.0.12           
## [43] munsell_0.5.1          fansi_1.0.6            lifecycle_1.0.4       
## [46] edgeR_3.40.2           stringi_1.8.3          yaml_2.3.8            
## [49] zlibbioc_1.44.0        blob_1.2.4             parallel_4.2.2        
## [52] crayon_1.5.2           lattice_0.20-45        splines_4.2.2         
## [55] Biostrings_2.66.0      annotate_1.76.0        hms_1.1.3             
## [58] KEGGREST_1.38.0        locfit_1.5-9.9         pillar_1.9.0          
## [61] geneplotter_1.76.0     codetools_0.2-18       XML_3.99-0.16.1       
## [64] glue_1.7.0             evaluate_0.23          png_0.1-8             
## [67] vctrs_0.6.5            tzdb_0.4.0             gtable_0.3.5          
## [70] cachem_1.0.8           xfun_0.43              xtable_1.8-4          
## [73] survival_3.4-0         ragg_1.3.0             AnnotationDbi_1.60.2  
## [76] memoise_2.0.1          timechange_0.3.0

AsreesFinalProject

Anna Rees

2024-04-22