Thyroid hormones and their receptors are important to regulating cell differentiation and metabolism in various tissues, with the receptor mainly acting as transcription factors to control different gene network in the presence of the thyroid hormone T3 (Triiodothyronine). With new evidence emerging that suggests a more complex model of transcriptional regulation of thyroid hormone receptors, Gillis et. al aim to further explore this complex model through various Next Generation Sequencing approach, carried out on the immortalized normal human thyroid epithelial cell line (Nthy-ORI). Gene expression analysis was done following RNA-sequencing, to compare the effect of T3 against vehicle control on Nthy-ORI cells.
The Nthy-ORI cells were cultured in RPMI-1640 media supplemented with growth factors and hormones, then hormone-deprived using phenol-red free and charcoal-stripped fetal bovine serum RPMI-1640 media for 24 h, and then treated with 10 nM T3 or 1 N NaOH vehicle for 6 or 24 h. Total RNA was extracted and purifed using RNeasy Plus Kit, for a total of two biological replicates per condition. Stranded Illumina cDNA libraries were prepared using the KAPA Stranded RNA-Seq library preparation kit, and sequenced on the Illumina HiSeq 1500 platform.
Citation: Gillis, N. E., Boyd, J. R., Tomczak, J. A., Frietze, S., & Carr, F. (2022). Thyroid hormone dependent transcriptional programming by TR² requires SWI/SNF chromatin remodelers. Nucleic Acids Research, 50(3), 13821395. https://doi.org/10.1093/nar/gkab1287
NCBI Gene Expression Omnibus accession code: GSE168954
#tximport is used to import the quantification data obtained from salmon output (quant.sf).
#ensembldb and EnsDb.Hsapiens.v86 is used to create tx2gene to import salmon output for DESeq2.
library("tximport")
library("ensembldb")
library("EnsDb.Hsapiens.v86")
library("readr")
library("DESeq2")
library("dplyr")
library("ggplot2")
library("patchwork")
#volcano
library("EnhancedVolcano")
library("dplyr")
library("tidyr")
library("ggplot2")
library("genekitr")
#heatmap
library("pheatmap")
library("RColorBrewer")
library("ggplot2")
library("cowplot")
library("tidyr")
library("dplyr")
library("patchwork")
library("ggplotify")
setwd("~/scratch/bioinformatics/noelle")
directory <- "/gpfs2/scratch/ihastomo/bioinformatics/noelle"
sampleTable <- data.frame(read.csv("NORI_annotation_salmon.csv", header = TRUE))
#set variables as factor
sampleTable$time <- factor(sampleTable$time)
sampleTable$treatment <- factor(sampleTable$treatment)
sampleTable$replicate <- factor(sampleTable$replicate)
rownames(sampleTable) <- sampleTable$samplename
files <- file.path(directory, sampleTable$filename, "quant.sf")
names(files) <- sampleTable$samplename
edb <- EnsDb.Hsapiens.v86 #load annotation database generated from Ensembl
Tx <- transcripts(edb, return.type = "DataFrame") #extract transcript info from database
tx2gene <- Tx[, c(1,7)] #extract transcript id and gene id
txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene, ignoreTxVersion = TRUE)
#ignoreTxVersion to remove version info to easily match transcript id with gene id.
tail(txi.salmon$counts)
## nori_24HR_T3_rep1 nori_24HR_T3_rep2 nori_24HR_vehicle_rep1
## ENSG00000283694 0.000 0.000 0.000
## ENSG00000283695 0.000 0.000 0.000
## ENSG00000283696 0.000 0.000 1.000
## ENSG00000283697 0.000 0.000 0.000
## ENSG00000283698 14.259 9.966 15.129
## ENSG00000283699 0.000 0.000 0.000
## nori_24HR_vehicle_rep2 nori_6HR_T3_rep1 nori_6HR_T3_rep2
## ENSG00000283694 0.000 0.000 0.000
## ENSG00000283695 0.000 0.000 0.000
## ENSG00000283696 1.000 2.000 4.197
## ENSG00000283697 0.000 0.000 0.000
## ENSG00000283698 3.252 2.332 3.479
## ENSG00000283699 0.000 0.000 0.000
## nori_6HR_vehicle_rep1 nori_6HR_vehicle_rep2
## ENSG00000283694 0.000 0.000
## ENSG00000283695 0.000 0.000
## ENSG00000283696 1.000 2.000
## ENSG00000283697 0.000 0.000
## ENSG00000283698 8.263 11.553
## ENSG00000283699 0.000 0.000
salmon_dds <- DESeqDataSetFromTximport(txi.salmon,
colData = sampleTable,
design = ~treatment + time)
keep <- rowSums(counts(salmon_dds)) >= 10
salmon_dds <- salmon_dds[keep,] #remove low counts
salmon_dds$treatment <- relevel(salmon_dds$treatment, ref = "vehicle")
salmon_dds <- DESeq(salmon_dds)
salmon_res_vehiclevsT3 <- results(salmon_dds, contrast = c('treatment', 'T3', 'vehicle'))
head(salmon_res_vehiclevsT3)
## log2 fold change (MLE): treatment T3 vs vehicle
## Wald test p-value: treatment T3 vs vehicle
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 247.383473 0.1647337 0.226486 0.727347 0.467013
## ENSG00000000419 426.846321 0.0291147 0.219182 0.132834 0.894325
## ENSG00000000457 89.361790 0.4255695 0.282939 1.504102 0.132555
## ENSG00000000460 227.464499 -0.2121521 0.239450 -0.885996 0.375619
## ENSG00000000938 0.918696 2.2082718 3.118336 0.708157 0.478848
## ENSG00000000971 2.088495 -0.7828573 1.488357 -0.525988 0.598897
## padj
## <numeric>
## ENSG00000000003 0.950111
## ENSG00000000419 0.996035
## ENSG00000000457 0.803926
## ENSG00000000460 0.924073
## ENSG00000000938 NA
## ENSG00000000971 NA
salmon_padj_res_vehiclevsT3 <- salmon_res_vehiclevsT3[order(salmon_res_vehiclevsT3$padj), ] #order based on padj value
salmon_resdata_vehiclevsT3 <- merge(as.data.frame(salmon_padj_res_vehiclevsT3),
as.data.frame(counts(salmon_dds, normalized=TRUE)),
by="row.names",
sort=FALSE)
names(salmon_resdata_vehiclevsT3)[1] <- "Gene_id"
write.csv(salmon_resdata_vehiclevsT3,
file="salmon_nori_vehiclevsT3_unfiltered_matrix.csv",
row.names = FALSE)
salmon_resdata_vehiclevsT3 <- as.data.frame(salmon_padj_res_vehiclevsT3) %>%
dplyr::filter(abs(log2FoldChange) > 0.5 & padj < 0.05) %>% #cutoff value was chosen based on the NGS dataset paper
tibble::rownames_to_column("Gene_id")
write.csv(salmon_resdata_vehiclevsT3,
"salmon_nori_vehiclevsT3_log2fc1_fdr05_DEGlist.csv",
row.names = FALSE)
coldata <- read.csv("NORI_annotation_STAR.csv", header = TRUE)
coldata <- data.frame(coldata)
# Set the variables as factor
coldata$time <- factor(coldata$time)
coldata$treatment <- factor(coldata$treatment)
coldata$replicate <- factor(coldata$replicate)
count_matrix <- as.matrix(read.csv("all_counts.txt",
header=TRUE,
sep=""))
STAR_dds <- DESeqDataSetFromMatrix(countData = count_matrix,
colData = coldata,
design = ~treatment + time)
keep <- rowSums(counts(STAR_dds)) >= 10 #remove low counts
STAR_dds <- STAR_dds[keep,]
STAR_dds$treatment <- relevel(STAR_dds$treatment, ref = "vehicle")
STAR_dds <- DESeq(STAR_dds)
STAR_res_vehiclevsT3 <- results(STAR_dds, contrast = c('treatment', 'T3', 'vehicle'))
head(STAR_res_vehiclevsT3)
## log2 fold change (MLE): treatment T3 vs vehicle
## Wald test p-value: treatment T3 vs vehicle
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000241860.7 1.86719 0.275374 1.574065 0.174945 0.861123
## ENSG00000279457.4 5.84251 -1.078011 0.938354 -1.148833 0.250625
## ENSG00000290385.1 2.85677 0.621789 1.132516 0.549033 0.582983
## ENSG00000230021.10 4.05194 -1.108843 1.143393 -0.969783 0.332155
## ENSG00000225972.1 2.76696 2.159290 1.355773 1.592664 0.111236
## ENSG00000225630.1 494.60968 0.479414 0.932520 0.514106 0.607178
## padj
## <numeric>
## ENSG00000241860.7 NA
## ENSG00000279457.4 NA
## ENSG00000290385.1 NA
## ENSG00000230021.10 NA
## ENSG00000225972.1 NA
## ENSG00000225630.1 0.808182
STAR_padj_res_vehiclevsT3 <- STAR_res_vehiclevsT3[order(STAR_res_vehiclevsT3$padj), ] #order based on padj value
STAR_resdata_vehiclevsT3 <- merge(as.data.frame(STAR_padj_res_vehiclevsT3),
as.data.frame(counts(STAR_dds, normalized=TRUE)),
by="row.names",
sort=FALSE)
names(STAR_resdata_vehiclevsT3)[1] <- "Gene_id"
write.csv(STAR_resdata_vehiclevsT3,
file="STAR_nori_vehiclevsT3_unfiltered_matrix.csv",
row.names = FALSE)
STAR_resdata_vehiclevsT3 <- as.data.frame(STAR_padj_res_vehiclevsT3) %>%
dplyr::filter(abs(log2FoldChange) > 0.5 & padj < 0.05) %>% #cutoff value was chosen based on the NGS dataset paper
tibble::rownames_to_column("Gene_id")
write.csv(STAR_resdata_vehiclevsT3,
"STAR_nori_vehiclevsT3_log2fc1_fdr05_DEGlist.csv",
row.names = FALSE)
#Salmon PCA Plot
salmon_vstcounts <- vst(salmon_dds, blind = FALSE)
salmon_PCA <- plotPCA(salmon_vstcounts, intgroup=c("treatment", "time"), returnData=TRUE)
percentVar <- round(100 * attr(salmon_PCA, "percentVar"))
salmon_PCA_plot <- ggplot(data = salmon_PCA, aes(PC1, PC2, color=treatment, shape=time)) +
geom_point(size=3) +
xlim(-25,25) +
ylim(-25,25) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
coord_fixed() +
theme_classic() +
ggtitle("Salmon PCA Plot") +
theme(legend.position = "right",
plot.title = element_text(size = rel(1.25), hjust = 0.5),
axis.title = element_text(size = rel(1.25)))
#STAR PCA Plot
STAR_vstcounts <- vst(STAR_dds, blind = FALSE)
STAR_PCA <- plotPCA(STAR_vstcounts, intgroup=c("treatment", "time"), returnData=TRUE)
percentVar <- round(100 * attr(STAR_PCA, "percentVar"))
STAR_PCA_plot <- ggplot(data = STAR_PCA, aes(PC1, PC2, color=treatment, shape=time)) +
geom_point(size=3) +
xlim(-25,25) +
ylim(-25,25) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
coord_fixed() +
theme_classic() +
ggtitle("STAR PCA Plot") +
theme(legend.position = "right",
plot.title = element_text(size = rel(1.25), hjust = 0.5),
axis.title = element_text(size = rel(1.25)))
#save PCA plots as png files
ggsave(filename = "salmon_PCA_plot.png",
plot = salmon_PCA_plot,
units = "in",
width = 5,
height = 6,
dpi = 300)
ggsave(filename = "STAR_PCA_plot.png",
plot = STAR_PCA_plot,
units = "in",
width = 5,
height = 6,
dpi = 300)
salmon_PCA_plot + STAR_PCA_plot
Differential analysis from DESeq2 uses raw counts. PCA extracts the most
important factors and allows the data to be more easily examined, using
variance stabilizing transformation (vst). The plot visualized the
overall effect of treatment and time, where the PC1 accounts for the
largest variation, and PC2 accounts for the second largest variation.
Ideally samples are clustered based on their conditions. In this case,
Salmon and STAR produced a similar pattern (simply inverted), where the
largest variance is time (left side are 6 HR), and the second largest
variance is the treatment (presence or absence of T3). All sample
replicates are clustered based on their condition, except for one of the
24 HR vehicle sample. These plots also suggest that there is a larger
variation due to time when using STAR, and a larger variation due to
treatment when using Salmon.
salmon_resdata_vehiclevsT3 <- read.csv("salmon_nori_vehiclevsT3_unfiltered_matrix.csv",
header = TRUE)
salmon_resdata_vehiclevsT3$threshold <- salmon_resdata_vehiclevsT3$padj < 0.05
salmon_resdata_vehiclevsT3$Gene_id <- sub("\\.\\d+", "", salmon_resdata_vehiclevsT3$Gene_id)
ids <- salmon_resdata_vehiclevsT3$Gene_id
gene_symbol_map <- transId(id = ids, transTo = "symbol")
names(salmon_resdata_vehiclevsT3)[1] = "input_id"
salmon_resdata_gene <- merge(salmon_resdata_vehiclevsT3, gene_symbol_map, by.x = "input_id", by.y = "input_id")
#Add a column to data frame to specify up or down regulated
salmon_resdata_gene$diffexpressed <- "Unchanged"
#if log2FC > 0.5 and padj < 0.05 set as upregulated
salmon_resdata_gene$diffexpressed[salmon_resdata_gene$padj < 0.05 & salmon_resdata_gene$log2FoldChange > 0.5] <- "Upregulated"
#if log2FC < -0.5 and padj < 0.05 set as downregulated
salmon_resdata_gene$diffexpressed[salmon_resdata_gene$padj < 0.05 & salmon_resdata_gene$log2FoldChange < -0.5] <- "Downregulated"
salmon_resdata_sig <- as.data.frame(salmon_resdata_gene) %>%
dplyr::filter(abs(log2FoldChange) > 0.5 & padj < 0.05)
write.csv(salmon_resdata_sig, file = "nori_salmon_vehiclevsT3_DEG.csv", row.names = FALSE)
salmon_upregulated_count <- nrow(salmon_resdata_sig[salmon_resdata_sig$diffexpressed == "Upregulated", ])
salmon_downregulated_count <- nrow(salmon_resdata_sig[salmon_resdata_sig$diffexpressed == "Downregulated", ])
salmon_upregulated_count
## [1] 15
salmon_downregulated_count
## [1] 250
STAR_resdata_vehiclevsT3 <- read.csv("STAR_nori_vehiclevsT3_unfiltered_matrix.csv",
header = TRUE)
STAR_resdata_vehiclevsT3$threshold <- STAR_resdata_vehiclevsT3$padj < 0.05
STAR_resdata_vehiclevsT3$Gene_id <- sub("\\.\\d+", "", STAR_resdata_vehiclevsT3$Gene_id)
ids <- STAR_resdata_vehiclevsT3$Gene_id
gene_symbol_map <- transId(id = ids, transTo = "symbol")
names(STAR_resdata_vehiclevsT3)[1] = "input_id"
STAR_resdata_gene <- merge(STAR_resdata_vehiclevsT3, gene_symbol_map, by.x = "input_id", by.y = "input_id")
#Add a column to dataframe to specify up or down regulated
STAR_resdata_gene$diffexpressed <- "Unchanged"
#if log2FC > 0.5 and padj < 0.05 set as upregulated
STAR_resdata_gene$diffexpressed[STAR_resdata_gene$padj < 0.05 & STAR_resdata_gene$log2FoldChange > 0.5] <- "Upregulated"
#if log2FC < -0.5 and padj < 0.05 set as downregulated
STAR_resdata_gene$diffexpressed[STAR_resdata_gene$padj < 0.05 & STAR_resdata_gene$log2FoldChange < -0.5] <- "Downregulated"
STAR_resdata_sig <- as.data.frame(STAR_resdata_gene) %>%
dplyr::filter(abs(log2FoldChange) > 0.5 & padj < 0.05)
write.csv(STAR_resdata_sig, file = "nori_STAR_vehiclevsT3_DEG.csv", row.names = FALSE)
STAR_upregulated_count <- nrow(STAR_resdata_sig[STAR_resdata_sig$diffexpressed == "Upregulated", ])
STAR_downregulated_count <- nrow(STAR_resdata_sig[STAR_resdata_sig$diffexpressed == "Downregulated", ])
STAR_upregulated_count
## [1] 57
STAR_downregulated_count
## [1] 104
#Salmon
salmon_volcano <-
ggplot(salmon_resdata_gene) +
geom_point(aes(x=log2FoldChange,
y=-log10(padj), colour=diffexpressed)) +
theme_classic() +
ggtitle("Salmon Vehicle versus T3") +
xlab("Log2 fold change") +
ylab("-Log10(padj)") +
theme(legend.position = "right",
plot.title = element_text(size = rel(1.25), hjust = 0.5),
axis.title = element_text(size = rel(1.25))) +
scale_color_manual(values = c("blue", "black", "red"),
labels = c("Downregulated", "Not Significant", "Upregulated")) +
scale_x_continuous(limits = c(-20, 20)) +
coord_cartesian(ylim=c(0, 10), clip = "off")
# Add annotation
salmon_volcano <- salmon_volcano +
annotate("text", x = 15, y = 10, label = "15", colour = "red", size = 4, hjust = 0) +
annotate("text", x = -10, y = 10, label = "250", colour = "blue", size = 4, hjust = 1)
STAR_volcano <-
ggplot(STAR_resdata_gene) +
geom_point(aes(x=log2FoldChange,
y=-log10(padj), colour=diffexpressed)) +
theme_classic() +
ggtitle("STAR Vehicle vs T3") +
xlab("Log2 fold change") +
ylab("-Log10(padj)") +
theme(legend.position = "right",
plot.title = element_text(size = rel(1.25), hjust = 0.5),
axis.title = element_text(size = rel(1.25))) +
scale_color_manual(values = c("blue", "black", "red"),
labels = c("Downregulated", "Not Significant", "Upregulated")) +
scale_x_continuous(limits = c(-20, 20)) +
coord_cartesian(ylim=c(0, 10), clip = "off")
# Add annotation
STAR_volcano <- STAR_volcano +
annotate("text", x = 15, y = 10, label = "57", colour = "red", size = 4, hjust = 0) +
annotate("text", x = -10, y = 10, label = "104", colour = "blue", size = 4, hjust = 1)
#save as PNG files
ggsave(filename = "salmon_volcano.png",
plot = salmon_volcano,
units = "in",
width = 5,
height = 6,
dpi = 300)
ggsave(filename = "STAR_volcano.png",
plot = STAR_volcano,
units = "in",
width = 5,
height = 6,
dpi = 300)
salmon_volcano / STAR_volcano + plot_layout(nrow = 2,
widths = c(4, 4),
heights = c(5, 5))
Salmon quantifies transcript from RNA-seq data, by mapping it against transcriptome index, while STAR counts the number of reads per gene while mapping to the whole genome. Volcano plot visualizes every gene and plots their -log10(padj) against the log2(FoldChange). Genes that have a padj of < 0.05 and a log2(FoldChange) > 0.5 or < -0.5 are labeled as upregulated or downregulated respectively. All the other genes are considered not significant. This highlights the effect of the treatment on gene regulation. The volcano plots differ in that Salmon has a higher downregulated genes and lower upregulated genes compared to STAR.
salmon_all_matrix <- data.frame(read.csv("nori_salmon_vehiclevsT3_DEG.csv", header = TRUE))
salmon_hm_matrix <- data.frame(dplyr::select(salmon_all_matrix, symbol, 8:15), row.names = 1)
salmon_hm_matrix <- salmon_hm_matrix[-c(53:264),] #remove ribosomal genes
salmon_heatmap_meta <- sampleTable
rownames(salmon_heatmap_meta) <- salmon_heatmap_meta$samplename
salmon_heatmap_meta <- dplyr::select(salmon_heatmap_meta, treatment, time)
salmon_heatmap_meta$treatment <- factor(salmon_heatmap_meta$treatment, levels = c("T3", "vehicle"))
salmon_heatmap_meta$time <- factor(salmon_heatmap_meta$time, levels = c("6HR", "24HR"))
salmon_heatmap_colors <- rev(brewer.pal(10, "RdBu"))
salmon_ann_colors <- list(treatment = c("T3" = "#D11313", "vehicle" = "#6DB0F8"),
time = c("6HR" = "#fff2cc", "24HR" = "#93c47d"))
STAR_all_matrix <- data.frame(read.csv("nori_STAR_vehiclevsT3_DEG.csv", header = TRUE))
STAR_all_matrix <- STAR_all_matrix[-c(150,155), ] #remove duplicate row names
STAR_hm_matrix <- data.frame(dplyr::select(STAR_all_matrix, symbol, 8:15), row.names = 1)
STAR_heatmap_meta <- coldata
rownames(STAR_heatmap_meta) <- STAR_heatmap_meta$samplename
STAR_heatmap_meta <- dplyr::select(STAR_heatmap_meta, treatment, time)
STAR_heatmap_meta$treatment <- factor(STAR_heatmap_meta$treatment, levels = c("T3", "vehicle"))
STAR_heatmap_meta$time <- factor(STAR_heatmap_meta$time, levels = c("6HR", "24HR"))
STAR_heatmap_colors <- rev(brewer.pal(10, "RdBu"))
STAR_ann_colors <- list(treatment = c("T3" = "#D11313", "vehicle" = "#6DB0F8"),
time = c("6HR" = "#fff2cc", "24HR" = "#93c47d"))
#salmon
salmon_heatmap <- as.ggplot(pheatmap(salmon_hm_matrix,
color = salmon_heatmap_colors,
annotation_col = salmon_heatmap_meta,
annotation_colors = salmon_ann_colors,
show_colnames = F,
show_rownames = F,
scale = "row",
annotation_names_col = F,
main = "Salmon Heatmap"))
ggsave(filename = "salmon_heatmap.png",
plot = salmon_heatmap,
units = "in",
width = 5,
height = 6,
dpi = 300)
#STAR
STAR_heatmap <- as.ggplot(pheatmap(STAR_hm_matrix,
color = STAR_heatmap_colors,
annotation_col = STAR_heatmap_meta,
annotation_colors = STAR_ann_colors,
show_colnames = F,
show_rownames = F,
scale = "row",
annotation_names_col = F,
main = "STAR Heatmap"))
ggsave(filename = "STAR_heatmap.png",
plot = STAR_heatmap,
units = "in",
width = 5,
height = 6,
dpi = 300)
salmon_heatmap + STAR_heatmap
Heatmap plots the treatment and time as column, and every genes as row. They are plotted according to their Z-scores, by substracting the mean from each gene, and dividing them by the standard deviation. Heatmap visualizes gene expression clustering based on their conditions, ideally with little differences between replicates. Based on the Salmon heatmap, there is a clear difference in gene expression profile between treatments (T3 and vehicle), while the STAR heatmap suggests a difference in profile between the two time points and treatments only at 24 HR.
Although both PCA plots indicate similarities in variation from the different sample conditions, subsequent downstream analysis showed that there are differences in the analysis result depending on whether salmon or STAR was used. These differences would likely impact lowly expressed genes the most, resulting in different numbers of differentially expressed genes, both upregulated and downregulated. These differences arise from the nature of each tools, with STAR being an aligner that aligns to the whole genome, while Salmon is more of a pseudo-aligner that maps to the transcriptome. Both of them have their specific purposes and output, along with their own advantages and disadvantages, such as Salmon generally working faster than STAR.