“Analyzing RNA-Seq of model media and MRS strains with DESeq2”

date: “4/27/2023”

output: html_document, png images

overview: Dataset in this file includes isolates grown in three model media: Pineapple Juice extract, Drosophila melanogaster extract and A. mellifera L. worker bee extract. It also includes the MRS medium in which one chosen isolated from each extract was grown. Differential gene expression analysis for PCA construction done by using MRS as relevel. The GEO accession numbers for these dataset are:

Geo Accession Identification
GSM3178354 BEE1ST
GSM3178355 BEE2ST
GSM3178356 BEE3ST
GSM3178357 DE12
GSM3178358 DE12II
GSM3178359 DE24
GSM3178376 PJ16
GSM3178377 PJ20
GSM3178378 PJ1
GSM3178389 BEE1ST_MRS
GSM3178383 DE12_MRS
GSM3178388 PJ16_MRS

This file also continues to explore the data set to visualize it in volcano plots, boxplots, and heatmaps. The analysis helps highlight the regulation of the transcriptome when strains are switched from model medias to corresponding MRS media. Overall limitations for this DEG analysis revolved around annotation of the reference file not including specific gene ids that could be sourced in BacteriaEnsembl and also a limitation in the experimental design itself where only one MRS strain was grown for each three corresponding model media strains, thus skewing visualization of some plots (hence I assume why they were not included in the paper).

Citation: Filannino, P., De Angelis, M., Di Cagno, R., Gozzi, G., Riciputi, Y., & Gobbetti, M. (2018). How Lactobacillus plantarum shapes its transcriptome in response to contrasting habitats. Environmental microbiology, 20(10), 3700–3716. https://doi.org/10.1111/1462-2920.14372

Install the required R libraries

#Install CRAN packages
install.packages(c("knitr", "RColorBrewer", "pheatmap"))
## Installing packages into '/users/g/e/gewhitne/R/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

Load the required libraries

library(knitr)
library(DESeq2) 
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which.max, which.min
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
## 
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians
## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians
library(RColorBrewer)
library(pheatmap)
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::%within%() masks IRanges::%within%()
## ✖ dplyr::collapse()     masks IRanges::collapse()
## ✖ dplyr::combine()      masks Biobase::combine(), BiocGenerics::combine()
## ✖ dplyr::count()        masks matrixStats::count()
## ✖ dplyr::desc()         masks IRanges::desc()
## ✖ tidyr::expand()       masks S4Vectors::expand()
## ✖ dplyr::filter()       masks stats::filter()
## ✖ dplyr::first()        masks S4Vectors::first()
## ✖ dplyr::lag()          masks stats::lag()
## ✖ ggplot2::Position()   masks BiocGenerics::Position(), base::Position()
## ✖ purrr::reduce()       masks GenomicRanges::reduce(), IRanges::reduce()
## ✖ dplyr::rename()       masks S4Vectors::rename()
## ✖ lubridate::second()   masks S4Vectors::second()
## ✖ lubridate::second<-() masks S4Vectors::second<-()
## ✖ dplyr::slice()        masks IRanges::slice()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(EnhancedVolcano)
## Loading required package: ggrepel
library(biomaRt)
library(dplyr)
library(tidyr)
library(cowplot)
## 
## Attaching package: 'cowplot'
## 
## The following object is masked from 'package:lubridate':
## 
##     stamp
  1. Generate sample table

    To start the analysis, a sample table with references is needed.

sampleTable <- read.csv("Annotation.csv", header = T)
#Convert to data frame 
sampleTable <- data.frame(sampleTable)
  1. Create factors for condition, batch, replicate, and color.
#specify as factor 
condition <- factor(sampleTable$condition)
replicate <- factor(sampleTable$replicate)
batch <- factor(sampleTable$batch)
color <- factor(sampleTable$color)
  1. Run DESeq2 with condition set as design. The sampleTable in this case is built from the annotation file that contains both medel media and MRS. Only condition is included in the design because were only comparing media strains to corresponding MRS strain. These files can be found in the same directory which is why “.” is specified.
dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                  directory = ".",
                                  design = ~ condition)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
dds
## class: DESeqDataSet 
## dim: 3174 12 
## metadata(1): version
## assays(1): counts
## rownames(3174): LP_RS00005 LP_RS00010 ... LP_RS16120 LP_RS16125
## rowData names(0):
## colnames(12): GSM3178354 GSM3178355 ... GSM3178378 GSM3178388
## colData names(4): condition replicate batch color
  1. Pre-filtering to discard any counts of less than 10.
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
  1. Set the three different MRS grown strains as controls for each growing condition.
dds$condition <- relevel(dds$condition, ref = "MRS1","MRS2","MRS3")
  1. Run differential analysis.

    Three contrasts are created because there are three conditions being avaluated in reference to three controls (with one replicate per control). It wouldnt make sense to contrast model media to an MRS control that doesnt correpond to that model media, so three different groups were created.

dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res1 <- results(dds, (contrast=c('condition','Model_Media1','MRS1')))
res2 <- results(dds,(contrast=c('condition','Model_Media2','MRS2')))
res3 <- results(dds,(contrast=c('condition','Model_Media3','MRS3')))


#ordering results table by the smallest *padj* value:
result1 <- res1[order(res1$padj), ]
result2 <- res2[order(res2$padj), ]
result3 <- res3[order(res3$padj), ]


write.csv(res1,"BeevsMRS1.csv")
write.csv(res2,"DrosphilavsMRS2.csv")
write.csv(res3,"PinapplevsMRS3.csv")
  1. p-values were adjusted with Benjamini-Hochberg’s FDR correction
table(result1$padj<0.05)
## 
## FALSE  TRUE 
##  2505   378
table(result2$padj<0.05)
## 
## FALSE  TRUE 
##  2154   729
table(result3$padj<0.05)
## 
## FALSE  TRUE 
##  2440   385
  1. Merge all three different results with normalized data.
resdata1 <- merge(as.data.frame(result1), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata1)[1] <- "Gene_id"
resdata2 <- merge(as.data.frame(result2), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata2)[1] <- "Gene_id"
resdata3 <- merge(as.data.frame(result3), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata3)[1] <- "Gene_id"



#write results into csv files
write.csv(resdata1, file="Model1vsMRS1_unfiltered_matrix.csv")
write.csv(resdata2, file="Model2vsMRS2_unfiltered_matrix.csv")
write.csv(resdata3, file="Model3vsMRS3_unfiltered_matrix.csv")

The authors of the paper set the criteria of a transcript being identified as significantly differentially expressed if it had fold change ≥ 2 (or ≤ -2) and FDR q-value ≤ 0.1. Filtering will be done to reflect this.

#Get DEG up/down lists
resdata1 <- as.data.frame(result1) %>% dplyr::filter(abs(log2FoldChange) >= 2 & padj <= 0.1) %>% tibble::rownames_to_column("Gene_id")
resdata2 <- as.data.frame(result2) %>% dplyr::filter(abs(log2FoldChange) >= 2 & padj <= 0.1) %>% tibble::rownames_to_column("Gene_id")
resdata3 <- as.data.frame(result3) %>% dplyr::filter(abs(log2FoldChange) >= 2 & padj <= 0.1) %>% tibble::rownames_to_column("Gene_id")

#write results
write.csv(resdata1, "Model1vsMRS1_log2fc2_fdr01_DEGlist.csv", row.names = TRUE)
write.csv(resdata2, "Model2vsMRS2_log2fc2_fdr01_DEGlist.csv", row.names = TRUE)
write.csv(resdata3, "Model3vsMRS3_log2fc2_fdr01_DEGlist.csv", row.names = TRUE)

Data transformation and visualization

Principal components analysis (PCA)

  1. Perform variance stabilizing transformation and specify to be blind. Generate a PCA plot once transformation is performed.
vstcounts <- rlog(dds, blind=FALSE)
plotPCA(vstcounts, intgroup=c("condition","batch"))

  1. Customize the ggplot to show each condition: pineapple juice, drosphilia extract, or bee extract. Set all the MRS as one different condition to highlight as the “generalized” transcriptome its supposed to have.
#Correct some more formattin for labels and making sure the x and y axis are similar using `ylim` argument.
pcaData <- plotPCA(vstcounts, intgroup=c("condition", "batch"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=color)) +
  scale_colour_discrete(name="Strain")+
  geom_point(size=3) +
  geom_text(label=batch, color = 'black', size=2.5, position = position_nudge(y = -3.5))+
  ylim(-60,60) + 
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  
  coord_fixed()

# Save the PCA plot
ggsave(path = "Figs", filename = "PCA2.png",  height=5, width=6, units='in', dpi = 300, bg = "transparent", device='png')
  1. Generating correlation heatmap, this is not an included figure of the paper but its nice to see that transcripts originating in the same media have higher correlation. It also helps highlight how some of the isolates, while not grown is MRS are still very generalized suggesting nomadic adaptations can also tend to generalized transcripts too.
library(pheatmap)
rld <- rlog(dds, blind=FALSE)
rld_mat <- assay(rld)
rld_cor <- cor(rld_mat)
pheatmap(rld_cor)

Volcano Plots

For this dataset I will continue analysis to include volcano plots to visualize the DEG between the model media and the MRS. I thought it would be more significant as it would help analyze how the strains return to a generalized transcriptome in the general nutrition media.

Simple volcano plots

  1. Prepping the data for volcano plot
resdata1 <- read.csv("Model1vsMRS1_unfiltered_matrix.csv", header = T)
resdata1 <- resdata1[, c(2:20)]


resdata2 <- read.csv("Model2vsMRS2_unfiltered_matrix.csv", header = T)
resdata2 <- resdata2[, c(2:20)]


resdata3 <- read.csv("Model3vsMRS3_unfiltered_matrix.csv", header = T)
resdata3 <- resdata3[, c(2:20)]
  1. Set differential expression level based on padj. Again, authors used a value of 0.1 to be considered as significant.
threshold_padj1 <- resdata1$padj < 0.1
length(which(threshold_padj1)) ## Determine the number of TRUE values
## [1] 482
threshold_padj2 <- resdata2$padj < 0.1
length(which(threshold_padj2)) ## Determine the number of TRUE values
## [1] 907
threshold_padj3 <- resdata3$padj < 0.1
length(which(threshold_padj3)) ## Determine the number of TRUE values
## [1] 521
  1. Setting thresholds.
resdata1$threshold <- threshold_padj1 ## Add vector as a column (threshold) to the resdata


resdata2$threshold <- threshold_padj2 ## Add vector as a column (threshold) to the resdata


resdata3$threshold <- threshold_padj3 ## Add vector as a column (threshold) to the resdata
  1. Now that threshold is set, we can visualize through ggplot.
ggplot(resdata1) +
        geom_point(aes(x=log2FoldChange, 
                       y=-log10(padj), colour=threshold)) + 
  theme_classic() + 
        ggtitle("Apis mellifera vs BEE1ST_MRS") +
        xlab("Log2 fold change") + 
        ylab("-Log10(padj)") +
        theme(legend.position = "none",
              plot.title = element_text(size = rel(1.25), hjust = 0.5), 
              axis.title = element_text(size = rel(1.25))) +
  scale_color_manual(values = c("black", "red")) +
  scale_x_continuous(limits = c(-20, 20)) + 
  coord_cartesian(ylim=c(0, 80), clip = "off")                                            
## Warning: Removed 97 rows containing missing values (`geom_point()`).

ggplot(resdata2) +
        geom_point(aes(x=log2FoldChange, 
                       y=-log10(padj), colour=threshold)) + 
  theme_classic() + 
        ggtitle("Drosophila melanogaster vs DE12_MRS") +
        xlab("Log2 fold change") + 
        ylab("-Log10(padj)") +
        theme(legend.position = "none",
              plot.title = element_text(size = rel(1.25), hjust = 0.5), 
              axis.title = element_text(size = rel(1.25))) +
  scale_color_manual(values = c("black", "red")) +
  scale_x_continuous(limits = c(-40, 40)) + 
  coord_cartesian(ylim=c(0, 40), clip = "off") 
## Warning: Removed 96 rows containing missing values (`geom_point()`).

ggplot(resdata3) +
        geom_point(aes(x=log2FoldChange, 
                       y=-log10(padj), colour=threshold)) + 
  theme_classic() + 
        ggtitle("Pineapple Juice vs PJ16_MRS") +
        xlab("Log2 fold change") + 
        ylab("-Log10(padj)") +
        theme(legend.position = "none",
              plot.title = element_text(size = rel(1.25), hjust = 0.5), 
              axis.title = element_text(size = rel(1.25))) +
  scale_color_manual(values = c("black", "red")) +
  scale_x_continuous(limits = c(-20, 40)) + 
  coord_cartesian(ylim=c(0, 50), clip = "off") 
## Warning: Removed 154 rows containing missing values (`geom_point()`).

Enhanced volcano plot

While the above are good for a quick visualization of the DEG, enhanced Volcano Plots are better at creating literature quality figures. In these figures I will highlight some of the most differential expressed genes that are also involved in energy production or de novo amino acid synthesis.

EnhancedVolcano(resdata1,
                lab = as.character(resdata1$Gene_id),
                x = 'log2FoldChange',
                y = 'padj',
                selectLab = c("LP_RS11495"),
                title = "A. mellifera strains vs MRS1",
                subtitle = "Visualizing purF", 
                FCcutoff = 2,
                pCutoff = 0.1,
                labSize = 4, 
                axisLabSize = 12, 
                col=c('black', 'black', 'black', 'red3'),
                labCol = 'black',
                labFace = 'bold',
                boxedLabels = TRUE,
                legendLabels=c('Not sig.','Log2FC','padj', 'padj & Log2FC'),
                legendPosition = 'right',
                legendLabSize = 16,
                legendIconSize = 5.0,
                drawConnectors = TRUE,
                widthConnectors = 1.0,
                colConnectors = 'black', 
                gridlines.major = FALSE,
                gridlines.minor = FALSE)

ggsave(path = "Figs",filename ='Bee-volcano.png')
## Saving 7 x 5 in image
EnhancedVolcano(resdata2,
                lab = as.character(resdata2$Gene_id),
                x = 'log2FoldChange',
                y = 'padj',
                selectLab = c("LP_RS15355"),
                title = "Drosophila strains vs MRS2",
                subtitle = "Visualizing adhE", 
                FCcutoff = 2,
                pCutoff = 0.1,
                labSize = 4, 
                axisLabSize = 12, 
                col=c('black', 'black', 'black', 'red3'),
                labCol = 'black',
                labFace = 'bold',
                boxedLabels = TRUE,
                legendLabels=c('Not sig.','Log2FC','padj', 'padj & Log2FC'),
                legendPosition = 'right',
                legendLabSize = 16,
                legendIconSize = 5.0,
                drawConnectors = TRUE,
                widthConnectors = 1.0,
                colConnectors = 'black', 
                gridlines.major = FALSE,
                gridlines.minor = FALSE)

ggsave(path = "Figs",filename ='Drosophila-volcano.png')
## Saving 7 x 5 in image
EnhancedVolcano(resdata3,
                lab = as.character(resdata3$Gene_id),
                x = 'log2FoldChange',
                y = 'padj',
                selectLab = c("LP_RS01415"),
                title = "Pineapple Juice strains vs MRS3",
                subtitle = "Visualizing biotin-[acetyl-CoA-carboxylase] ligase", 
                FCcutoff = 2,
                pCutoff = 0.1,
                labSize = 4, 
                axisLabSize = 12, 
                col=c('black', 'black', 'black', 'red3'),
                labCol = 'black',
                labFace = 'bold',
                boxedLabels = TRUE,
                legendLabels=c('Not sig.','Log2FC','padj', 'padj & Log2FC'),
                legendPosition = 'right',
                legendLabSize = 16,
                legendIconSize = 5.0,
                drawConnectors = TRUE,
                widthConnectors = 1.0,
                colConnectors = 'black', 
                gridlines.major = FALSE,
                gridlines.minor = FALSE)

ggsave(path = "Figs",filename ='Pineapple-volcano.png')
## Saving 7 x 5 in image

Boxplots

To further visualize the specific DEG we can use boxplots.

In my opinion boxplot grids wouldnt be as useful to visualize this data because the comparison to one MRS sample for every each model media limits the possible variation. Therefore, all the MRS boxes will be seen as a line will the model media variation will not be visualized properly. However, for the sake of looking at individual gene expression, its still useful.

  1. The first step would be to create filtered csv files and top20 DEG csv file.
#Top20 DEG
resdata1_sig <- as.data.frame(resdata1) %>% dplyr::filter(abs(log2FoldChange) > 2 & padj < 0.1)
write.csv(resdata1_sig, file="Model1vsMRS1_DEG.csv")

top20_Model1vsMRS1 <- resdata1_sig [c(1:20),c(1,8:11)]
write.csv(top20_Model1vsMRS1, file="top_20_DEG_Model1vsMRS1.csv")

#Contrast2
resdata2_sig <- as.data.frame(resdata2) %>% dplyr::filter(abs(log2FoldChange) > 2 & padj < 0.1)
write.csv(resdata2_sig, file="Model2vsMRS2_DEG.csv")

top20_Model2vsMRS2 <- resdata2_sig [c(1:20),c(1,12:15)]
write.csv(top20_Model2vsMRS2, file="top_20_DEG_Model2vsMRS2.csv")

#Contrast3
resdata3_sig <- as.data.frame(resdata3) %>% dplyr::filter(abs(log2FoldChange) > 2 & padj < 0.1)
write.csv(resdata3_sig, file="Model3vsMRS3_DEG.csv")

top20_Model3vsMRS3 <- resdata3_sig [c(1:20),c(1,16:19)]
write.csv(top20_Model3vsMRS3, file="top_20_DEG_Model3vsMRS3.csv")
  1. Create top_deg objects
top_deg1 <- read.csv("top_20_DEG_Model1vsMRS1.csv", header = T)
top_deg2 <- read.csv("top_20_DEG_Model2vsMRS2.csv", header = T)
top_deg3 <- read.csv("top_20_DEG_Model3vsMRS3.csv", header = T)
  1. Create annotation file
meta <- read.csv("Annotation_DEG.csv", header = T)
  1. Create Expression Plots for each contrast.
expression_plot1 <- pivot_longer(top_deg1,
                                cols = 3:6,
                                values_to = "normalized_counts1")

expression_plot2 <- pivot_longer(top_deg2,
                                cols = 3:6,
                                values_to = "normalized_counts2")

expression_plot3 <- pivot_longer(top_deg3,
                                cols = 3:6,
                                values_to = "normalized_counts3")
  1. Join expression plot and meta table
# Join metadata for visualizing groups or features for DEG 1
expression_plot1 <- left_join(x = expression_plot1, 
                             y = meta, 
                             by = "name")

# Join metadata for visualizing groups or features for DEG 2
expression_plot2 <- left_join(x = expression_plot2, 
                             y = meta, 
                             by = "name")


# Join metadata for visualizing groups or features for DEG 3
expression_plot3 <- left_join(x = expression_plot3, 
                             y = meta, 
                             by = "name")
  1. Look at genes of interest. I focused on one example for each contrasts. These were picked from the top 20 DEG and clearly stood out in the volcano plots. I also decided to focus on these because they were involved in either energy sources (AdheE and purF) or de novo amino acid synthesis (the Biotin-[acetyl-CoA-carboxylase] ligase)

    a. Look at gene of interest for Bee Extract.

#This would be the purF
LP_RS11495 <- expression_plot1 %>%
filter(Gene_id == "LP_RS11495")

# Plot into box chart
boxplot_purF<-ggplot(LP_RS11495) +
  geom_boxplot(aes(x=condition, 
                   y=normalized_counts1, 
                   fill=condition)) +
  labs(title = "purF",
              subtitle = "A. mellifera vs MRS1")+
     theme_bw() + scale_fill_manual(values = c("blue", "red")) +
        ggtitle("purF") +
        xlab("Condition") + 
        ylab("Normalized Counts") +
        theme(legend.position = "right",
              plot.title = element_text(size = rel(1.25), hjust = 0.5),
              plot.subtitle = element_text(size=rel(0.7),hjust = 0.5),#subtitle
              axis.title = element_text(size = rel(0.75)),
              axis.text = element_text(size = rel(0.75))) #axis

b. Look at a gene of interest for Fly Extract

#This would be the Acetaldehyde-alcohol dehydrogenase (AdhE gene)
LP_RS15355_exp <- expression_plot2 %>%
filter(Gene_id == "LP_RS15355")

# Plot into box chart
boxplot_AdhE<-ggplot(LP_RS15355_exp) +
  geom_boxplot(aes(x=condition, 
                   y=normalized_counts2, 
                   fill=condition)) +
  labs(title = "AdhE",
              subtitle = "Drosophila vs MRS2")+
     theme_bw() + scale_fill_manual(values = c("blue", "red")) +
        ggtitle("AdhE") +
        xlab("Condition") + 
        ylab("Normalized Counts") +
        theme(legend.position = "right",
              plot.title = element_text(size = rel(1.25), hjust = 0.5),
              plot.subtitle = element_text(size=rel(0.7),hjust = 0.5),#subtitle
              axis.title = element_text(size = rel(0.75)),
              axis.text = element_text(size = rel(0.75))) #axis

c. Look at a gene of interest for Pineapple juice extract

#This would be the biotin-[acetyl-CoA-carboxylase] ligase.
LP_RS01415_exp <- expression_plot3 %>%
filter(Gene_id == "LP_RS01415")

# Plot into box chart
boxplot_biotin<-ggplot(LP_RS01415_exp) +
  geom_boxplot(aes(x=condition, 
                   y=normalized_counts3, 
                   fill=condition)) +
     theme_bw() + scale_fill_manual(values = c("blue", "red")) +
        ggtitle("Biotin-[acetyl-CoA-carboxylase] ligase") +
    labs(title = "Biotin-[acetyl-CoA-carboxylase] ligase",
              subtitle = "Pineapple Juice vs MRS3")+
        xlab("Condition") + 
        ylab("Normalized Counts") +
        theme(legend.position = "right",
              plot.title = element_text(size = rel(0.9), hjust = 0.5), #title
              plot.subtitle = element_text(size=rel(0.7),hjust = 0.5),
              axis.title = element_text(size = rel(0.75)),
              axis.text = element_text(size = rel(0.75))) #axis
  1. Arrange into a grid for ease of visualization
# Arranging multiple plots in a figure
boxplot_grid <- plot_grid(boxplot_purF,
                          boxplot_AdhE,
                          boxplot_biotin,
                          ncol = 2)
boxplot_grid

ggsave(path = "Figs",filename ='boxplot_grid.png')
## Saving 7 x 5 in image

Heatmaps

Maybe a better way to visualize the DEG in this experiment is through heat maps. As comapring the different contrasts is better than looking at differences between groups.

  1. Start by placing the normalized DEG into a heatmap file. This is repeated for each contrast.
#Set conditions for Bee contrast

BeeFilteredModelvsMRS <- resdata1_sig [,c(8:11)]
write.csv(BeeFilteredModelvsMRS, file="BeeModelvsMRS_DEG.csv")


Beeheatmap_normDEG <- read.csv("BeeModelvsMRS_DEG.csv", header = T, row.names = 1)
heatmap_meta <- read.csv("BeeMap.csv", header = T)

#Set conditions for Bee contrast

FlyFilteredModelvsMRS <- resdata2_sig [,c(12:15)]
write.csv(FlyFilteredModelvsMRS, file="FlyModelvsMRS_DEG.csv")


Flyheatmap_normDEG <- read.csv("FlyModelvsMRS_DEG.csv", header = T, row.names = 1)
heatmap_meta2 <- read.csv("FlyMap.csv", header = T)

#Set conditions for Pineapple contrast

PineappleFilteredModelvsMRS <- resdata3_sig [,c(16:19)]
write.csv(PineappleFilteredModelvsMRS, file="PineappleModelvsMRS_DEG.csv")


Pineappleheatmap_normDEG <- read.csv("PineappleModelvsMRS_DEG.csv", header = T, row.names = 1)
heatmap_meta3 <- read.csv("PineappleMap.csv", header = T)
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'PineappleMap.csv'
  1. Set the conditions for each file. This includes names, reference levels, colors to indicate up or down regulation, and colors for annotation.
# Assign the sample names to be the row names
rownames(heatmap_meta) <- heatmap_meta$name
rownames(heatmap_meta2) <- heatmap_meta2$name
rownames(heatmap_meta3) <- heatmap_meta3$name

# Re-level factors
heatmap_meta$condition <- factor(heatmap_meta$condition, levels = c("Bee", "MRS"))
heatmap_meta2$condition <- factor(heatmap_meta2$condition, levels = c("Drosophila", "MRS"))
heatmap_meta3$condition <- factor(heatmap_meta3$condition, levels = c("Pineapple", "MRS"))

# Colors for heatmap
heatmap_colors <- colorRampPalette(c("blue", "white", "red"))(6)


# Colors for denoting annotations
ann_colors <- list(condition = c(Bee= "#D11313", MRS= "#6DB0F8"))
ann_colors2 <- list(condition = c(Drosophila= "#D11313", MRS= "#6DB0F8"))
ann_colors3 <- list(condition = c(Pineapple= "#D11313", MRS= "#6DB0F8"))
  1. Perform pheatmap and attribute each plot to an object.
bee<-pheatmap(Beeheatmap_normDEG,
         color = heatmap_colors,
         annotation_col = heatmap_meta,
         annotation_colors = ann_colors,
         show_colnames = F, 
         show_rownames = F,
         main = ('A. mellifera vs BEE1ST_MRS'),
         scale = "row")

drosophila<-pheatmap(Flyheatmap_normDEG,
         color = heatmap_colors,
         annotation_col = heatmap_meta2,
         annotation_colors = ann_colors2,
         show_colnames = F, 
         show_rownames = F,
        main = ('Drosophila vs DE12_MRS'),
         scale = "row")

pineapple<-pheatmap(Pineappleheatmap_normDEG,
         color = heatmap_colors,
         annotation_col = heatmap_meta3,
         annotation_colors = ann_colors3,
         show_colnames = F, 
         show_rownames = F,
         main = ('Pineapple Juice vs PJ16_MRS'),
         scale = "row")

save_pheatmap_pdf <- function(x, filename, width=7, height=7) {
   stopifnot(!missing(x))
   stopifnot(!missing(filename))
   pdf(filename, width=width, height=height)
   grid::grid.newpage()
   grid::grid.draw(x$gtable)
   dev.off()
}
save_pheatmap_pdf(bee, "bee.pdf")
## png 
##   2
save_pheatmap_pdf(drosophila, "drosophila.pdf")
## png 
##   2
save_pheatmap_pdf(pineapple, "pineapple.pdf")
## png 
##   2

Session info

sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] cowplot_1.1.1               biomaRt_2.54.1             
##  [3] EnhancedVolcano_1.16.0      ggrepel_0.9.3              
##  [5] lubridate_1.9.2             forcats_1.0.0              
##  [7] stringr_1.5.0               dplyr_1.1.1                
##  [9] purrr_1.0.1                 readr_2.1.4                
## [11] tidyr_1.3.0                 tibble_3.2.1               
## [13] tidyverse_2.0.0             ggplot2_3.4.2              
## [15] pheatmap_1.0.12             RColorBrewer_1.1-3         
## [17] DESeq2_1.38.3               SummarizedExperiment_1.28.0
## [19] Biobase_2.58.0              MatrixGenerics_1.10.0      
## [21] matrixStats_0.63.0          GenomicRanges_1.50.2       
## [23] GenomeInfoDb_1.34.9         IRanges_2.32.0             
## [25] S4Vectors_0.36.2            BiocGenerics_0.44.0        
## [27] knitr_1.42                 
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7           bit64_4.0.5            filelock_1.0.2        
##  [4] progress_1.2.2         httr_1.4.5             tools_4.2.2           
##  [7] bslib_0.4.2            utf8_1.2.3             R6_2.5.1              
## [10] DBI_1.1.3              colorspace_2.1-0       withr_2.5.0           
## [13] tidyselect_1.2.0       prettyunits_1.1.1      curl_5.0.0            
## [16] bit_4.0.5              compiler_4.2.2         textshaping_0.3.6     
## [19] cli_3.6.1              xml2_1.3.3             DelayedArray_0.24.0   
## [22] labeling_0.4.2         sass_0.4.5             scales_1.2.1          
## [25] rappdirs_0.3.3         systemfonts_1.0.4      digest_0.6.31         
## [28] rmarkdown_2.21         XVector_0.38.0         pkgconfig_2.0.3       
## [31] htmltools_0.5.5        highr_0.10             dbplyr_2.3.2          
## [34] fastmap_1.1.1          rlang_1.1.0            rstudioapi_0.14       
## [37] RSQLite_2.3.1          farver_2.1.1           jquerylib_0.1.4       
## [40] generics_0.1.3         jsonlite_1.8.4         BiocParallel_1.32.6   
## [43] RCurl_1.98-1.12        magrittr_2.0.3         GenomeInfoDbData_1.2.9
## [46] Matrix_1.5-4           Rcpp_1.0.10            munsell_0.5.0         
## [49] fansi_1.0.4            lifecycle_1.0.3        stringi_1.7.12        
## [52] yaml_2.3.7             zlibbioc_1.44.0        BiocFileCache_2.6.1   
## [55] grid_4.2.2             blob_1.2.4             parallel_4.2.2        
## [58] crayon_1.5.2           lattice_0.20-45        Biostrings_2.66.0     
## [61] annotate_1.76.0        hms_1.1.3              KEGGREST_1.38.0       
## [64] locfit_1.5-9.7         pillar_1.9.0           geneplotter_1.76.0    
## [67] codetools_0.2-18       XML_3.99-0.14          glue_1.6.2            
## [70] evaluate_0.20          png_0.1-8              vctrs_0.6.1           
## [73] tzdb_0.3.0             gtable_0.3.3           cachem_1.0.7          
## [76] xfun_0.38              xtable_1.8-4           ragg_1.2.5            
## [79] AnnotationDbi_1.60.2   memoise_2.0.1          timechange_0.2.0

Citation

Research paper:

Filannino, P., De Angelis, M., Di Cagno, R., Gozzi, G., Riciputi, Y., & Gobbetti, M. (2018). How Lactobacillus plantarum shapes its transcriptome in response to contrasting habitats. Environmental microbiology, 20(10), 3700–3716. https://doi.org/10.1111/1462-2920.14372

Reference script:

title: “Analyzing RNA-Seq with DESeq2” author: “Michael I. Love, Simon Anders, and Wolfgang Huber”

Note: if you use DESeq2 in published research, please cite:

Love, M.I., Huber, W., Anders, S. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15:550. 10.1186/s13059-014-0550-8