The Role of Urea Treatment on Arabidopsis thaliana Stress Response Over Time
About The Dataset
This data was retrieved from a study that focused on the role of exogenous glutamate (Glu) treatments in inducing immune responses in Arabidopsis thalia. The study consisted of treatments over time to Arabidopsis that was under stress by bacterial pathogens. The authors then examined the stress response genes impacted. They used urea as a treatment as well to ensure that nitrogen levels were accounted for in order to isolate genes that were only impacted by glutamate. In total, the authors had 3 replicates per treatment group, the treatment groups were administered at 6 hours, 24 hours and 48 hours, with the mock group administered at 0 hours (control of water).
This project was interested in the role of urea in the stress response of Arabidopsis thalia and downloaded data of only the 3 mock replicates, the 6 hour urea treatment replicates , and the 48 hour urea treatment replicates. The goal was to evaluate the different time applications of urea on Arabidopsis response to pathogen stress. The hypothesis being that the longer exposure to urea treatment induces pathogen stress tolerance.
Data for the project was downloaded into the Vermont Advanced Computing Cluster (VACC). From there FASTQC, MultiQC, trimming (using trimmomatic), alignment to the reference genome, RSeqC, and HTseq to generate read counts were all preformed. It is important to note that the first replicate of the 6 hour urea treatment was corrupted during generation of the read counts earlier, and was discarded. The workflow then continued into R, where DESeq was preformed and visualizations were generated. The workflow is seen below.
GEO Accession:
GSE123622
Citation:
Goto, Y., Maki, N., Ichihashi, Y., Kitazawa, D., Igarashi, D., Kadota, Y., & Shirasu, K. (2020). Exogenous Treatment with Glutamate Induces Immune Responses in Arabidopsis. Molecular plant-microbe interactions : MPMI, 33(3), 474–487. https://doi.org/10.1094/MPMI-09-19-0262-R
Setting Up the Workflow
Setting the Working Directory and Loading Libraries
Loading the Metadata Table
Meta data was originally sourced from the SRA Run Selector, but after having issues with implementation into R, Dr. Rodriguez provided the one below. This was then implemented using the read.csv function and named coldata.
## treatment time replicate group
## SRR8311214 mock 0h 1 mock_0h
## SRR8311215 mock 0h 2 mock_0h
## SRR8311216 mock 0h 3 mock_0h
## SRR8311218 urea 6h 1 urea_6h
## SRR8311219 urea 6h 2 urea_6h
## SRR8311226 urea 48h 1 urea_48h
Converting to a Data Frame
Titled the object “sampleTable”
## treatment time replicate group
## SRR8311214 mock 0h 1 mock_0h
## SRR8311215 mock 0h 2 mock_0h
## SRR8311216 mock 0h 3 mock_0h
## SRR8311218 urea 6h 1 urea_6h
## SRR8311219 urea 6h 2 urea_6h
## SRR8311226 urea 48h 1 urea_48h
## 'data.frame': 8 obs. of 4 variables:
## $ treatment: chr "mock" "mock" "mock" "urea" ...
## $ time : chr "0h" "0h" "0h" "6h" ...
## $ replicate: int 1 2 3 1 2 1 2 3
## $ group : chr "mock_0h" "mock_0h" "mock_0h" "urea_6h" ...
Loading the Counts Matrix
The counts matrix was created in a couple of steps, starting with the implementation of our raw unstranded counts files from the VACC, located at a file path that we named as “main_dir”. The files were then assigned to the object “count_data” by list.files using the object “main_dir”. Next readDGE was used to convert the .txt files into a matrix like table. We then corrected the column names to fit into our dataset better as not to have any interruptions down the line due to the necessity of formatting for DESeq. This was then saved as a .csv file. It was subsequently read in and set as a matrix.
The file path to the object “main_dir” was used since there were issues when assigning the file path to the “path =” part of the list.files function. When recreating the document, using the “main_dir” object may not be necessary as the files are located in folder titled “count_data” in the submission folder, and the counts matrix itself is attached there as well.
#lines 82-97 may not be needed as the counts matrix is attached in the submission, but it is included for reference.
main_dir <- "/gpfs1/home/r/v/rvanvran/final_proj/fastq/trimmed_output2/counts/Unstranded_Counts/count_data"
count_data <- list.files(path = main_dir,
pattern = "\\.txt$",
recursive = TRUE,
full.names = TRUE)
#print(count_data)
counts <- readDGE(count_data, columns = c(1,2))$counts## Meta tags detected: __no_feature, __ambiguous, __too_low_aQual, __not_aligned, __alignment_not_unique
colnames(counts) <- c("SRR8311214", "SRR8311215", "SRR8311216", "SRR8311218", "SRR8311219", "SRR8311226", "SRR8311227", "SRR8311228"
)
#counts
#write.csv(counts, file = "counts_matrix.csv", row.names = TRUE)
countdata <- read.csv("counts_matrix.csv",
header=TRUE,
row.names=1)
countdata <- as.matrix(countdata)## [1] "SRR8311214" "SRR8311215" "SRR8311216" "SRR8311218" "SRR8311219"
## [6] "SRR8311226" "SRR8311227" "SRR8311228"
DESeq
We used DESeqDataSetFromMatrix because we set the count data as a matrix and the names of the data used are self explanatory, where coldata is our metadata table from above, and countdata is the counts matrix.
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
## class: DESeqDataSet
## dim: 32837 8
## metadata(1): version
## assays(1): counts
## rownames(32837): AT1G01020 AT1G01030 ... __not_aligned
## __alignment_not_unique
## rowData names(0):
## colnames(8): SRR8311214 SRR8311215 ... SRR8311227 SRR8311228
## colData names(4): treatment time replicate group
Pre-Filtering
Although not always necessary, we chose to pre-filter mainly to ensure that future plots are cleaner by ensuring super low counts (or no count data cells) are not plotted. Also this helps to reduce file sizes and speed up the computation.
Running DESeq
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
Generating a PCA Plot
Preparing Data to Run a PCA Plot
In this chunk of code, the first step applies a variance-stabilizing transformation (vst) on the DESeqDataSet. Then plotPCA preforms the calculations of Principal Component Analysis.
vstcounts <- vst(dds, blind = FALSE)
pcaData <- plotPCA(vstcounts, intgroup=c("group"),
returnData=TRUE)## using ntop=500 top features by variance
Creating a PCA Plot with a Figure Caption
Final_PCA <- ggplot(pcaData, aes(PC1, PC2, color = group, shape = group)) +
geom_point(size = 4, alpha = 0.8) +
ylim(-15, 15) +
xlab(paste0("PC1: ", percentVar[1], "% variance")) +
ylab(paste0("PC2: ", percentVar[2], "% variance")) +
coord_fixed() +
ggtitle("PCA of the Treatment of Urea Using RNASeq data") +
labs(caption = "PCA was preformed using variance-stabilized counts") +
scale_color_brewer(palette = "Set1") +
theme_classic(base_size = 10) +
theme(
legend.position = "right",
legend.title = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.caption = element_text(hjust = 0.5, size = 8)
)
Final_PCAInterpreting the DESeq Data
Running the Results Function on the DESeq Output
This was done in order to further plot and interpret data, as well as a checkpoint on the data. The results() function is used for each treatment group in order to start the pairwise comparisons.
res_urea6h_mock0h <- results(dds, contrast=c('group', 'urea_6h', 'mock_0h'))
#head(res_urea6h_mock0h)
res_urea48h_mock0h <- results(dds, contrast=c('group', 'urea_48h', 'mock_0h'))
#head(res_urea48h_mock0h)
res_urea48h_urea6h <- results(dds, contrast=c('group', 'urea_48h', 'urea_6h'))
#head(res_urea48h_urea6h)
#summary(res_urea6h_mock0h)
#summary(res_urea48h_mock0h)
#summary(res_urea48h_urea6h)
#str(res_urea6h_mock0h)
#str(res_urea48h_mock0h)
#str(res_urea48h_urea6h)Conducting plotMA
The plotMA outputs here can be used to compare and ensure that the lfcShrink function (used later) worked. It is also a checkpoint to make sure the code is working as intended by exploring the relationship between log2fold changes, significant DE genes and the genes mean count. The table() function is used to gauge sizes of significant DE genes.
## [1] "Intercept" "group_urea_48h_vs_mock_0h"
## [3] "group_urea_6h_vs_mock_0h"
##
## FALSE TRUE
## 14344 56
##
## FALSE TRUE
## 14282 1209
##
## FALSE TRUE
## 10727 762
GeneID Conversion - lfcShrink and Volcano Plots for Each Comparison.
This step was kindly preformed by Dr. Rodriguez due to errors in loading genekitr on my (rvanvran) VACC. In order for Dr. Rodriguez to preform this though, the data for each comparison needed to undergo lfcShrink which shrinks the log fold changes to a manageable and readable output (can be seen visually in the plotMA functions before and after this step). These outputs were then organized into a specific format, where GeneID and Differentially Expressed columns were created. The DE column also went through a format change where it was specified that only genes that were padj < 0.05 and had a lfc of more than 1 or less than -1 were present and named accordingly (up regulated-lfc>1 & padj < 0.05, down regulated-lfc<-1 & padj < 0.05). All these steps were done to all three comparisons (6h Urea v Mock, 48h Urea v Mock, and 48h Urea v 6h Urea) and sent to Dr. Rodriguez for the GeneID conversions. The files that we received back were then loaded back in. A label column was created to select for only certain genes of our choosing in order to help visualize specific genes, but as not to over crowd the plot.
Urea 48h v 6h lfcShrink
shrunk_urea_compare <- lfcShrink(dds,
contrast = c("group", "urea_48h", "urea_6h"),
type = "ashr") # or "apeglm" if available## using 'ashr' for LFC shrinkage. If used in published research, please cite:
## Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
## https://doi.org/10.1093/biostatistics/kxw041
result_urea_compare <- shrunk_urea_compare[order(shrunk_urea_compare$padj), ]
table(shrunk_urea_compare$padj<0.05)##
## FALSE TRUE
## 10727 762
result_urea_compare <- merge(as.data.frame(result_urea_compare),
as.data.frame(counts(dds, normalized=TRUE)),
by="row.names",
sort=FALSE)
names(result_urea_compare)[1] <- "Gene_id"
str(result_urea_compare)## 'data.frame': 18767 obs. of 14 variables:
## $ Gene_id : 'AsIs' chr "AT4G10340" "AT5G64120" "AT4G30270" "AT2G34430" ...
## $ baseMean : num 1637.4 298.5 744.2 1218.3 93.2 ...
## $ log2FoldChange: num -1.49 2.41 2.36 -1.55 2.68 ...
## $ lfcSE : num 0.158 0.276 0.279 0.182 0.34 ...
## $ pvalue : num 2.38e-23 1.19e-22 1.29e-21 5.06e-20 4.61e-19 ...
## $ padj : num 2.74e-19 6.85e-19 4.96e-18 1.45e-16 1.06e-15 ...
## $ SRR8311214 : num 2280.8 105.9 469.1 1774.9 10.8 ...
## $ SRR8311215 : num 2109.93 94.3 465.6 1473.42 5.89 ...
## $ SRR8311216 : num 1970.8 114.4 233.4 1657.3 19.1 ...
## $ SRR8311218 : num 2219 125.8 237.9 1750.9 27.2 ...
## $ SRR8311219 : num 2261 83 268.3 1532.2 34.6 ...
## $ SRR8311226 : num 710 813 1539 568 267 ...
## $ SRR8311227 : num 769 576 1603 438 216 ...
## $ SRR8311228 : num 779 475 1138 552 164 ...
## Gene_id baseMean log2FoldChange lfcSE pvalue padj
## 1 AT4G10340 1637.44498 -1.494123 0.1583094 2.383608e-23 2.738527e-19
## 2 AT5G64120 298.46213 2.409674 0.2760458 1.192882e-22 6.852508e-19
## 3 AT4G30270 744.18880 2.356695 0.2789628 1.294432e-21 4.957242e-18
## 4 AT2G34430 1218.30165 -1.552136 0.1815641 5.058129e-20 1.452821e-16
## 5 AT4G15610 93.20197 2.684037 0.3397821 4.608409e-19 1.058920e-15
## 6 AT3G54890 747.15947 -1.411976 0.1687273 6.221532e-19 1.191320e-15
## SRR8311214 SRR8311215 SRR8311216 SRR8311218 SRR8311219 SRR8311226 SRR8311227
## 1 2280.78084 2109.931062 1970.82013 2219.04765 2261.01927 709.9491 769.4071
## 2 105.93200 94.298595 114.38382 125.83018 82.97318 812.6091 576.4408
## 3 469.12743 465.599313 233.36929 237.89170 268.27996 1538.8589 1602.7266
## 4 1774.90149 1473.415546 1657.25068 1750.91351 1532.23814 567.7512 437.5542
## 5 10.80939 5.893662 19.06397 27.15484 34.57216 267.4014 216.3189
## 6 964.19740 978.347923 844.73111 1043.35781 1045.46212 389.1367 336.7692
## SRR8311228
## 1 778.6047
## 2 475.2295
## 3 1137.6572
## 4 552.3885
## 5 164.4013
## 6 375.2734
result_urea_compare$diffexpressed <- "Unchanged"
result_urea_compare$diffexpressed[result_urea_compare$padj < 0.05 & result_urea_compare$log2FoldChange > 1] <- "Upregulated"
result_urea_compare$diffexpressed[result_urea_compare$padj < 0.05 & result_urea_compare$log2FoldChange < -1] <- "Downregulated"
str(result_urea_compare)## 'data.frame': 18767 obs. of 15 variables:
## $ Gene_id : 'AsIs' chr "AT4G10340" "AT5G64120" "AT4G30270" "AT2G34430" ...
## $ baseMean : num 1637.4 298.5 744.2 1218.3 93.2 ...
## $ log2FoldChange: num -1.49 2.41 2.36 -1.55 2.68 ...
## $ lfcSE : num 0.158 0.276 0.279 0.182 0.34 ...
## $ pvalue : num 2.38e-23 1.19e-22 1.29e-21 5.06e-20 4.61e-19 ...
## $ padj : num 2.74e-19 6.85e-19 4.96e-18 1.45e-16 1.06e-15 ...
## $ SRR8311214 : num 2280.8 105.9 469.1 1774.9 10.8 ...
## $ SRR8311215 : num 2109.93 94.3 465.6 1473.42 5.89 ...
## $ SRR8311216 : num 1970.8 114.4 233.4 1657.3 19.1 ...
## $ SRR8311218 : num 2219 125.8 237.9 1750.9 27.2 ...
## $ SRR8311219 : num 2261 83 268.3 1532.2 34.6 ...
## $ SRR8311226 : num 710 813 1539 568 267 ...
## $ SRR8311227 : num 769 576 1603 438 216 ...
## $ SRR8311228 : num 779 475 1138 552 164 ...
## $ diffexpressed : chr "Downregulated" "Upregulated" "Upregulated" "Downregulated" ...
## Gene_id baseMean log2FoldChange lfcSE pvalue padj
## 1 AT4G10340 1637.44498 -1.494123 0.1583094 2.383608e-23 2.738527e-19
## 2 AT5G64120 298.46213 2.409674 0.2760458 1.192882e-22 6.852508e-19
## 3 AT4G30270 744.18880 2.356695 0.2789628 1.294432e-21 4.957242e-18
## 4 AT2G34430 1218.30165 -1.552136 0.1815641 5.058129e-20 1.452821e-16
## 5 AT4G15610 93.20197 2.684037 0.3397821 4.608409e-19 1.058920e-15
## 6 AT3G54890 747.15947 -1.411976 0.1687273 6.221532e-19 1.191320e-15
## SRR8311214 SRR8311215 SRR8311216 SRR8311218 SRR8311219 SRR8311226 SRR8311227
## 1 2280.78084 2109.931062 1970.82013 2219.04765 2261.01927 709.9491 769.4071
## 2 105.93200 94.298595 114.38382 125.83018 82.97318 812.6091 576.4408
## 3 469.12743 465.599313 233.36929 237.89170 268.27996 1538.8589 1602.7266
## 4 1774.90149 1473.415546 1657.25068 1750.91351 1532.23814 567.7512 437.5542
## 5 10.80939 5.893662 19.06397 27.15484 34.57216 267.4014 216.3189
## 6 964.19740 978.347923 844.73111 1043.35781 1045.46212 389.1367 336.7692
## SRR8311228 diffexpressed
## 1 778.6047 Downregulated
## 2 475.2295 Upregulated
## 3 1137.6572 Upregulated
## 4 552.3885 Downregulated
## 5 164.4013 Upregulated
## 6 375.2734 Downregulated
Urea 48h v 6h Volcano Plot
result_urea_compare <- read.csv("urea6h_vs_urea48h_full_matrix_042525.csv")
genes_to_label <- c("AAT1", "FRK1", "ASN1", "MLO12", "MES18", "LDOX", "LTP6", "LOX2", "THAS1", "THA1")
result_urea_compare$label <- ifelse(result_urea_compare$symbol %in% genes_to_label,
result_urea_compare$symbol,
NA)
u48h_v_u6h_Volcano <- ggplot(result_urea_compare) +
geom_point(aes(x = log2FoldChange, y = -log10(padj), color = diffexpressed, alpha = 0.8)) +
geom_text_repel(aes(x = log2FoldChange,
y = -log10(padj),
label = label),
size = 3,
max.overlaps = Inf) +
theme_classic() +
ggtitle("Urea treated (48h vs 6h)") +
labs(caption = "Significant differentially expressed genes of the 48 hour urea treatment group in reference to the 6 hour urea treatment group.") +
xlab("Log2 fold change") +
ylab("-Log10(padj)") +
theme(legend.position = "right",
plot.title = element_text(size = rel(1.25), hjust = 0.5),
axis.title = element_text(size = rel(1.25))) +
scale_color_manual(values = c("lightblue", "grey", "blue"),
labels = c("Downregulated", "Not Significant", "Upregulated")) +
scale_x_continuous(limits = c(-5, 5)) +
coord_cartesian(ylim = c(0, 25), clip = "off")
u48h_v_u6h_Volcano## Warning: Removed 3578 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 11241 rows containing missing values or values outside the scale range
## (`geom_text_repel()`).
Urea 6h v Mock lfcShrink
shrunk_urea6hVmock_compare <- lfcShrink(dds,
contrast = c("group", "urea_6h", "mock_0h"),
type = "ashr") # or "apeglm" if available## using 'ashr' for LFC shrinkage. If used in published research, please cite:
## Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
## https://doi.org/10.1093/biostatistics/kxw041
result_urea6hVmock_compare <- shrunk_urea6hVmock_compare[order(shrunk_urea6hVmock_compare$padj), ]
table(shrunk_urea6hVmock_compare$padj<0.05)##
## FALSE TRUE
## 14344 56
result_urea6hVmock_compare <- merge(as.data.frame(result_urea6hVmock_compare),
as.data.frame(counts(dds, normalized=TRUE)),
by="row.names",
sort=FALSE)
names(result_urea6hVmock_compare)[1] <- "Gene_id"
str(result_urea6hVmock_compare)## 'data.frame': 18767 obs. of 14 variables:
## $ Gene_id : 'AsIs' chr "AT3G01420" "AT4G32940" "AT3G10040" "AT5G65010" ...
## $ baseMean : num 155.7 451.7 30.2 71.5 870.4 ...
## $ log2FoldChange: num 2.27 1.48 -2.83 -1.84 1.17 ...
## $ lfcSE : num 0.351 0.218 0.473 0.309 0.195 ...
## $ pvalue : num 2.53e-12 2.51e-12 4.61e-12 2.33e-10 7.54e-10 ...
## $ padj : num 1.82e-08 1.82e-08 2.21e-08 8.40e-07 2.17e-06 ...
## $ SRR8311214 : num 21.6 160 67 192.4 462.6 ...
## $ SRR8311215 : num 17.7 117.9 106.1 129.7 477.4 ...
## $ SRR8311216 : num 26.3 152.5 51.9 121 499 ...
## $ SRR8311218 : num 95.23 455.13 4.97 31.36 1241.47 ...
## $ SRR8311219 : num 206.05 420.4 5.53 41.49 997.06 ...
## $ SRR8311226 : num 361.738 908.333 0.694 16.994 1297.469 ...
## $ SRR8311227 : num 240.9 749.74 1.23 14.75 1016.45 ...
## $ SRR8311228 : num 275.76 649.28 3.95 24.11 971.5 ...
## Gene_id baseMean log2FoldChange lfcSE pvalue padj
## 1 AT3G01420 155.65903 2.274726 0.3512658 2.528340e-12 1.820405e-08
## 2 AT4G32940 451.65522 1.479497 0.2180476 2.509496e-12 1.820405e-08
## 3 AT3G10040 30.17612 -2.830457 0.4730992 4.609729e-12 2.212670e-08
## 4 AT5G65010 71.46617 -1.839989 0.3089873 2.332454e-10 8.396836e-07
## 5 AT1G47128 870.36722 1.167158 0.1945547 7.543727e-10 2.172594e-06
## 6 AT2G38170 82.39496 -1.598946 0.2905717 5.775610e-09 1.386146e-05
## SRR8311214 SRR8311215 SRR8311216 SRR8311218 SRR8311219 SRR8311226
## 1 21.61878 17.68099 26.29513 95.233172 206.050074 361.7376182
## 2 159.97894 117.87324 152.51176 455.130422 420.397467 908.3325235
## 3 67.01820 106.08592 51.93289 4.972013 5.531546 0.6936484
## 4 192.40710 129.66057 120.95761 31.361928 41.486592 16.9943847
## 5 462.64180 477.38664 498.95013 1241.473403 997.061098 1297.4692518
## 6 198.89274 159.12888 119.64285 43.983192 48.401024 24.9713409
## SRR8311227 SRR8311228
## 1 240.900625 275.755837
## 2 749.741742 649.275666
## 3 1.229085 3.945632
## 4 14.749018 24.112196
## 5 1016.453149 971.502280
## 6 35.643460 28.496231
result_urea6hVmock_compare$diffexpressed <- "Unchanged"
result_urea6hVmock_compare$diffexpressed[result_urea6hVmock_compare$padj < 0.05 & result_urea6hVmock_compare$log2FoldChange > 1] <- "Upregulated"
result_urea6hVmock_compare$diffexpressed[result_urea6hVmock_compare$padj < 0.05 & result_urea6hVmock_compare$log2FoldChange < -1] <- "Downregulated"
str(result_urea6hVmock_compare)## 'data.frame': 18767 obs. of 15 variables:
## $ Gene_id : 'AsIs' chr "AT3G01420" "AT4G32940" "AT3G10040" "AT5G65010" ...
## $ baseMean : num 155.7 451.7 30.2 71.5 870.4 ...
## $ log2FoldChange: num 2.27 1.48 -2.83 -1.84 1.17 ...
## $ lfcSE : num 0.351 0.218 0.473 0.309 0.195 ...
## $ pvalue : num 2.53e-12 2.51e-12 4.61e-12 2.33e-10 7.54e-10 ...
## $ padj : num 1.82e-08 1.82e-08 2.21e-08 8.40e-07 2.17e-06 ...
## $ SRR8311214 : num 21.6 160 67 192.4 462.6 ...
## $ SRR8311215 : num 17.7 117.9 106.1 129.7 477.4 ...
## $ SRR8311216 : num 26.3 152.5 51.9 121 499 ...
## $ SRR8311218 : num 95.23 455.13 4.97 31.36 1241.47 ...
## $ SRR8311219 : num 206.05 420.4 5.53 41.49 997.06 ...
## $ SRR8311226 : num 361.738 908.333 0.694 16.994 1297.469 ...
## $ SRR8311227 : num 240.9 749.74 1.23 14.75 1016.45 ...
## $ SRR8311228 : num 275.76 649.28 3.95 24.11 971.5 ...
## $ diffexpressed : chr "Upregulated" "Upregulated" "Downregulated" "Downregulated" ...
## Gene_id baseMean log2FoldChange lfcSE pvalue padj
## 1 AT3G01420 155.65903 2.274726 0.3512658 2.528340e-12 1.820405e-08
## 2 AT4G32940 451.65522 1.479497 0.2180476 2.509496e-12 1.820405e-08
## 3 AT3G10040 30.17612 -2.830457 0.4730992 4.609729e-12 2.212670e-08
## 4 AT5G65010 71.46617 -1.839989 0.3089873 2.332454e-10 8.396836e-07
## 5 AT1G47128 870.36722 1.167158 0.1945547 7.543727e-10 2.172594e-06
## 6 AT2G38170 82.39496 -1.598946 0.2905717 5.775610e-09 1.386146e-05
## SRR8311214 SRR8311215 SRR8311216 SRR8311218 SRR8311219 SRR8311226
## 1 21.61878 17.68099 26.29513 95.233172 206.050074 361.7376182
## 2 159.97894 117.87324 152.51176 455.130422 420.397467 908.3325235
## 3 67.01820 106.08592 51.93289 4.972013 5.531546 0.6936484
## 4 192.40710 129.66057 120.95761 31.361928 41.486592 16.9943847
## 5 462.64180 477.38664 498.95013 1241.473403 997.061098 1297.4692518
## 6 198.89274 159.12888 119.64285 43.983192 48.401024 24.9713409
## SRR8311227 SRR8311228 diffexpressed
## 1 240.900625 275.755837 Upregulated
## 2 749.741742 649.275666 Upregulated
## 3 1.229085 3.945632 Downregulated
## 4 14.749018 24.112196 Downregulated
## 5 1016.453149 971.502280 Upregulated
## 6 35.643460 28.496231 Downregulated
Urea 6h v Mock Volcano Plot
result_urea6hVmock_compare <- read.csv("urea6h_vs_mock0h_full_matrix_042525.csv")
genes_to_label <- c("NIA1", "ARR7", "ASN2", "CIPK3", "EXL5", "DOX1", "GSTF7", "NCED4", "OSM34", "G3Pp1")
result_urea6hVmock_compare$label <- ifelse(result_urea6hVmock_compare$symbol %in% genes_to_label,
result_urea6hVmock_compare$symbol,
NA)
u6h_v_mock_Volcano <- ggplot(result_urea6hVmock_compare) +
geom_point(aes(x = log2FoldChange, y = -log10(padj), color = diffexpressed, alpha = 0.8)) +
geom_text_repel(aes(x = log2FoldChange,
y = -log10(padj),
label = label),
size = 3,
max.overlaps = Inf) +
theme_classic() +
ggtitle("Urea vs Mock (6h vs 0h)") +
labs(caption = "Significant differentially expressed genes of the 6 hour urea treatment group in reference to the control group.") +
xlab("Log2 fold change") +
ylab("-Log10(padj)") +
theme(legend.position = "right",
plot.title = element_text(size = rel(1.25), hjust = 0.5),
axis.title = element_text(size = rel(1.25))) +
scale_color_manual(values = c("lightblue", "grey", "blue"),
labels = c("Downregulated", "Not Significant", "Upregulated")) +
scale_x_continuous(limits = c(-3, 3)) +
coord_cartesian(ylim = c(0, 10), clip = "off")
u6h_v_mock_Volcano## Warning: Removed 2029 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 11241 rows containing missing values or values outside the scale range
## (`geom_text_repel()`).
Urea 48h v Mock lfcShrink
shrunk_urea48hVmock_compare <- lfcShrink(dds,
contrast = c("group", "urea_48h", "mock_0h"),
type = "ashr") # or "apeglm" if available## using 'ashr' for LFC shrinkage. If used in published research, please cite:
## Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
## https://doi.org/10.1093/biostatistics/kxw041
result_urea48hVmock_compare <- shrunk_urea48hVmock_compare[order(shrunk_urea48hVmock_compare$padj), ]
table(shrunk_urea48hVmock_compare$padj<0.05)##
## FALSE TRUE
## 14282 1209
result_urea48hVmock_compare <- merge(as.data.frame(result_urea48hVmock_compare),
as.data.frame(counts(dds, normalized=TRUE)),
by="row.names",
sort=FALSE)
names(result_urea48hVmock_compare)[1] <- "Gene_id"
str(result_urea48hVmock_compare)## 'data.frame': 18767 obs. of 14 variables:
## $ Gene_id : 'AsIs' chr "AT4G32940" "AT3G45140" "AT3G62030" "AT3G01420" ...
## $ baseMean : num 452 194 288 156 1637 ...
## $ log2FoldChange: num 2.32 -2.83 -1.86 3.49 -1.43 ...
## $ lfcSE : num 0.209 0.277 0.191 0.342 0.144 ...
## $ pvalue : num 8.32e-32 5.62e-26 5.76e-26 1.46e-25 2.98e-25 ...
## $ padj : num 1.29e-27 2.97e-22 2.97e-22 5.65e-22 9.24e-22 ...
## $ SRR8311214 : num 160 387 454 21.6 2280.8 ...
## $ SRR8311215 : num 117.9 282.9 377.2 17.7 2109.9 ...
## $ SRR8311216 : num 152.5 322.1 454.9 26.3 1970.8 ...
## $ SRR8311218 : num 455.1 162.5 323.6 95.2 2219 ...
## $ SRR8311219 : num 420 264 347 206 2261 ...
## $ SRR8311226 : num 908.3 26.7 106.8 361.7 709.9 ...
## $ SRR8311227 : num 749.7 54.1 130.3 240.9 769.4 ...
## $ SRR8311228 : num 649.3 55.7 108.3 275.8 778.6 ...
## Gene_id baseMean log2FoldChange lfcSE pvalue padj
## 1 AT4G32940 451.65522 2.322686 0.2088205 8.316152e-32 1.288255e-27
## 2 AT3G45140 194.39095 -2.825656 0.2765005 5.622896e-26 2.972132e-22
## 3 AT3G62030 287.76910 -1.856688 0.1913241 5.755856e-26 2.972132e-22
## 4 AT3G01420 155.65903 3.490467 0.3422353 1.458632e-25 5.648917e-22
## 5 AT4G10340 1637.44498 -1.433637 0.1437147 2.980975e-25 9.235655e-22
## 6 AT4G15610 93.20197 3.660470 0.3652102 7.574763e-25 1.955677e-21
## SRR8311214 SRR8311215 SRR8311216 SRR8311218 SRR8311219 SRR8311226 SRR8311227
## 1 159.97894 117.873244 152.51176 455.13042 420.39747 908.33252 749.74174
## 2 386.97609 282.895785 322.11536 162.54658 264.13130 26.70546 54.07973
## 3 453.99429 377.194380 454.90578 323.56331 347.10449 106.82185 130.28299
## 4 21.61878 17.680987 26.29513 95.23317 206.05007 361.73762 240.90063
## 5 2280.78084 2109.931062 1970.82013 2219.04765 2261.01927 709.94909 769.40710
## 6 10.80939 5.893662 19.06397 27.15484 34.57216 267.40144 216.31893
## SRR8311228
## 1 649.27567
## 2 55.67725
## 3 108.28568
## 4 275.75584
## 5 778.60472
## 6 164.40133
result_urea48hVmock_compare$diffexpressed <- "Unchanged"
result_urea48hVmock_compare$diffexpressed[result_urea48hVmock_compare$padj < 0.05 & result_urea48hVmock_compare$log2FoldChange > 1] <- "Upregulated"
result_urea48hVmock_compare$diffexpressed[result_urea48hVmock_compare$padj < 0.05 & result_urea48hVmock_compare$log2FoldChange < -1] <- "Downregulated"
str(result_urea48hVmock_compare)## 'data.frame': 18767 obs. of 15 variables:
## $ Gene_id : 'AsIs' chr "AT4G32940" "AT3G45140" "AT3G62030" "AT3G01420" ...
## $ baseMean : num 452 194 288 156 1637 ...
## $ log2FoldChange: num 2.32 -2.83 -1.86 3.49 -1.43 ...
## $ lfcSE : num 0.209 0.277 0.191 0.342 0.144 ...
## $ pvalue : num 8.32e-32 5.62e-26 5.76e-26 1.46e-25 2.98e-25 ...
## $ padj : num 1.29e-27 2.97e-22 2.97e-22 5.65e-22 9.24e-22 ...
## $ SRR8311214 : num 160 387 454 21.6 2280.8 ...
## $ SRR8311215 : num 117.9 282.9 377.2 17.7 2109.9 ...
## $ SRR8311216 : num 152.5 322.1 454.9 26.3 1970.8 ...
## $ SRR8311218 : num 455.1 162.5 323.6 95.2 2219 ...
## $ SRR8311219 : num 420 264 347 206 2261 ...
## $ SRR8311226 : num 908.3 26.7 106.8 361.7 709.9 ...
## $ SRR8311227 : num 749.7 54.1 130.3 240.9 769.4 ...
## $ SRR8311228 : num 649.3 55.7 108.3 275.8 778.6 ...
## $ diffexpressed : chr "Upregulated" "Downregulated" "Downregulated" "Upregulated" ...
## Gene_id baseMean log2FoldChange lfcSE pvalue padj
## 1 AT4G32940 451.65522 2.322686 0.2088205 8.316152e-32 1.288255e-27
## 2 AT3G45140 194.39095 -2.825656 0.2765005 5.622896e-26 2.972132e-22
## 3 AT3G62030 287.76910 -1.856688 0.1913241 5.755856e-26 2.972132e-22
## 4 AT3G01420 155.65903 3.490467 0.3422353 1.458632e-25 5.648917e-22
## 5 AT4G10340 1637.44498 -1.433637 0.1437147 2.980975e-25 9.235655e-22
## 6 AT4G15610 93.20197 3.660470 0.3652102 7.574763e-25 1.955677e-21
## SRR8311214 SRR8311215 SRR8311216 SRR8311218 SRR8311219 SRR8311226 SRR8311227
## 1 159.97894 117.873244 152.51176 455.13042 420.39747 908.33252 749.74174
## 2 386.97609 282.895785 322.11536 162.54658 264.13130 26.70546 54.07973
## 3 453.99429 377.194380 454.90578 323.56331 347.10449 106.82185 130.28299
## 4 21.61878 17.680987 26.29513 95.23317 206.05007 361.73762 240.90063
## 5 2280.78084 2109.931062 1970.82013 2219.04765 2261.01927 709.94909 769.40710
## 6 10.80939 5.893662 19.06397 27.15484 34.57216 267.40144 216.31893
## SRR8311228 diffexpressed
## 1 649.27567 Upregulated
## 2 55.67725 Downregulated
## 3 108.28568 Downregulated
## 4 275.75584 Upregulated
## 5 778.60472 Downregulated
## 6 164.40133 Upregulated
Urea 48h v Mock Volcano Plot
result_urea48hVmock_compare <- read.csv("urea48h_vs_mock0h_full_matrix_042525.csv")
genes_to_label <- c("CYP79F1", "MES18", "GSTU20", "FTM1", "EXPA5", "DIN2", "UGT74E2", "CHX17", "DOX1", "AAT1")
result_urea48hVmock_compare$label <- ifelse(result_urea48hVmock_compare$symbol %in% genes_to_label,
result_urea48hVmock_compare$symbol,
NA)
u48h_v_mock_Volcano <- ggplot(result_urea48hVmock_compare) +
geom_point(aes(x = log2FoldChange, y = -log10(padj), color = diffexpressed, alpha = 0.8)) +
geom_text_repel(aes(x = log2FoldChange,
y = -log10(padj),
label = label),
size = 3,
max.overlaps = Inf) +
theme_classic() +
ggtitle("Urea 48h vs Mock (48h vs 0h)") +
labs(caption = "Significant differentially expressed genes of the 48 hour urea treatment group in reference to the control group.") +
xlab("Log2 fold change") +
ylab("-Log10(padj)") +
theme(legend.position = "right",
plot.title = element_text(size = rel(1.25), hjust = 0.5),
axis.title = element_text(size = rel(1.25))) +
scale_color_manual(values = c("lightblue", "grey", "blue"),
labels = c("Downregulated", "Not Significant", "Upregulated")) +
scale_x_continuous(limits = c(-5, 5)) +
coord_cartesian(ylim = c(0, 30), clip = "off")
u48h_v_mock_Volcano## Warning: Removed 1500 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 11241 rows containing missing values or values outside the scale range
## (`geom_text_repel()`).
Supplementary Barplot of the Amount of Significant DE per Treatment Group
This was conducted to visualize the size of the data being worked on.
sig_6h <- sum(res_urea6h_mock0h$padj < 0.05, na.rm = TRUE)
sig_48h <- sum(res_urea48h_mock0h$padj < 0.05, na.rm = TRUE)
sig_UvU <- sum(res_urea48h_urea6h$padj < 0.05, na.rm =TRUE)
de_summary <- data.frame(
comparison = c("urea_6h vs mock_0h", "urea_48h vs mock_0h", "urea_48h vs urea_6h"),
sig_genes = c(sig_6h, sig_48h, sig_UvU)
)
SigDE_barplot <- ggplot(de_summary, aes(x = comparison, y = sig_genes)) +
geom_col(fill = "steelblue") +
labs(caption = "Barplot showing amount of significant differentially expressed genes across treatment groups.", x = "Comparison", y = "Genes with padj < 0.05",
title = "Significant Differentially Expressed Genes per Treatment Group") +
theme_minimal()
SigDE_barplotBoxplots for Specific Genes
These genes were selected from the data used in the volcano plots to visualize thier counts per different treatment groups. Since it is a time series study, this helps visualize the counts over time. That is why the reordering of the group levels is important in this code so it makes logical sense when examining each boxplot.
Boxplot of Normalized Counts for AAT1
norm_counts_AAT1 <- plotCounts(dds,
gene = "AT1G02920",
intgroup = "group",
returnData = TRUE)
norm_counts_AAT1$group <- factor(norm_counts_AAT1$group,
levels = c("mock_0h", "urea_6h", "urea_48h"))
AAT1_plot <- ggplot(norm_counts_AAT1, aes(x = group, y = count, fill = group)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.2, size = 2) +
scale_fill_manual(values = c(
"mock_0h" = "gray70",
"urea_6h" = "skyblue",
"urea_48h" = "steelblue"
)) +
labs(caption = "Normalized counts of AAT1 expression across treatment groups.",
title = "Normalized Counts per Group for AAT1",
x = "Group",
y = "Normalized Counts"
) +
theme_minimal()
AAT1_plotBoxplot of Normalized Counts for DOX1
norm_counts_DOX1 <- plotCounts(dds,
gene = "AT3G01420",
intgroup = "group",
returnData = TRUE)
norm_counts_DOX1$group <- factor(norm_counts_DOX1$group,
levels = c("mock_0h", "urea_6h", "urea_48h"))
DOX1_plot <- ggplot(norm_counts_DOX1, aes(x = group, y = count, fill = group)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.2, size = 2) +
scale_fill_manual(values = c(
"mock_0h" = "gray70",
"urea_6h" = "pink",
"urea_48h" = "violet"
)) +
labs(caption = "Normalized counts of DOX1 expression across treatment groups.",
title = "Normalized Counts per Group for DOX1",
x = "Group",
y = "Normalized Counts"
) +
theme_minimal()
DOX1_plotBoxplot of Normalized Counts for FRK1
norm_counts_FRK1 <- plotCounts(dds,
gene = "AT2G19190",
intgroup = "group",
returnData = TRUE)
norm_counts_FRK1$group <- factor(norm_counts_FRK1$group,
levels = c("mock_0h", "urea_6h", "urea_48h"))
FRK1_plot <- ggplot(norm_counts_FRK1, aes(x = group, y = count, fill = group)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.2, size = 2) +
scale_fill_manual(values = c(
"mock_0h" = "gray70",
"urea_6h" = "lightgreen",
"urea_48h" = "darkgreen"
)) +
labs(caption = "Normalized counts of FRK1 expression across treatment groups.",
title = "Normalized Counts per Group for FRK1",
x = "Group",
y = "Normalized Counts"
) +
theme_minimal()
FRK1_plotSummary and Conclusion
The hypothesis of this project was that Arabidopsis exposed to urea for a longer duration of time will exhibit immune response to pathogen stress. This is partially seen, especially in the FRK1 box plot. The function of FRK1 was determined that it activates plant defense-related genes. The box plot showed that the 48 hour urea treatment groups counts were much larger than the mock and the 6h urea, which supports the hypothesis. The volcano plots also support the hypothesis presented, as there was a considerable increase in the amount of DE genes as time went on, and some of them were identified as relevant to immune and stress responses. These genes are all labeled on at least 1 of the volcano plots, including OSM34, LOX2, ASN1, and EDS1. Despite this, a true measure of success in this project would need to be determined from pathway enrichment data and graphs. This was a limitation in the project due to issues in downloading a reference genome for Arabidopsis thalia into R and issues with genekitr in general. The next steps with the data at this point would be to troubleshoot past issues in GeneID conversions and use gene ontology programs to generate pathway enrichment plots to further analyze the impacts of urea on Arabidopsis stress and immune responses. A larger change that may be necessary would to include the 12 hour and 24 hour replicates sourced from the original paper (Goto et al. 2020), as this project focused on only the upper and lower extremes of the time series, but expression could differ unexpectedly between the start and end points of the time series. Overall the dataset was a great test of abilities, it simply would not have been possible without Dr. Rodriguez being able to assist in troubleshooting where things went wrong, especially the issues surrounding genekitr.
Session Info
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Red Hat Enterprise Linux 9.4 (Plow)
##
## Matrix products: default
## BLAS: /gpfs1/sw/rh9/pkgs/stacks/gcc/13.3.0/R/4.4.1/lib64/R/lib/libRblas.so
## LAPACK: /gpfs1/sw/rh9/pkgs/stacks/gcc/13.3.0/R/4.4.1/lib64/R/lib/libRlapack.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggrepel_0.9.6 ggpubr_0.6.0
## [3] ggplotify_0.1.2 patchwork_1.3.0
## [5] lubridate_1.9.3 forcats_1.0.0
## [7] stringr_1.5.1 dplyr_1.1.4
## [9] purrr_1.0.2 readr_2.1.5
## [11] tidyr_1.3.1 tibble_3.2.1
## [13] tidyverse_2.0.0 ggplot2_3.5.2
## [15] pheatmap_1.0.12 RColorBrewer_1.1-3
## [17] DESeq2_1.46.0 SummarizedExperiment_1.36.0
## [19] Biobase_2.66.0 MatrixGenerics_1.18.1
## [21] matrixStats_1.5.0 GenomicRanges_1.58.0
## [23] GenomeInfoDb_1.42.3 IRanges_2.40.1
## [25] S4Vectors_0.44.0 BiocGenerics_0.52.0
## [27] knitr_1.50 edgeR_4.4.2
## [29] limma_3.62.2 rmdformats_1.0.4
##
## loaded via a namespace (and not attached):
## [1] rlang_1.1.4 magrittr_2.0.3 compiler_4.4.1
## [4] vctrs_0.6.5 pkgconfig_2.0.3 crayon_1.5.3
## [7] fastmap_1.2.0 backports_1.5.0 XVector_0.46.0
## [10] labeling_0.4.3 utf8_1.2.4 rmarkdown_2.29
## [13] tzdb_0.4.0 UCSC.utils_1.2.0 xfun_0.52
## [16] zlibbioc_1.52.0 cachem_1.1.0 jsonlite_1.8.9
## [19] DelayedArray_0.32.0 BiocParallel_1.40.2 irlba_2.3.5.1
## [22] broom_1.0.7 parallel_4.4.1 R6_2.5.1
## [25] bslib_0.8.0 stringi_1.8.4 SQUAREM_2021.1
## [28] car_3.1-3 jquerylib_0.1.4 Rcpp_1.0.13
## [31] bookdown_0.42 Matrix_1.7-0 timechange_0.3.0
## [34] tidyselect_1.2.1 rstudioapi_0.16.0 abind_1.4-8
## [37] yaml_2.3.10 codetools_0.2-20 lattice_0.22-6
## [40] withr_3.0.2 evaluate_1.0.3 gridGraphics_0.5-1
## [43] pillar_1.9.0 carData_3.0-5 generics_0.1.3
## [46] invgamma_1.1 truncnorm_1.0-9 hms_1.1.3
## [49] munsell_0.5.1 scales_1.3.0 ashr_2.2-63
## [52] glue_1.7.0 tools_4.4.1 locfit_1.5-9.12
## [55] ggsignif_0.6.4 fs_1.6.4 grid_4.4.1
## [58] colorspace_2.1-1 GenomeInfoDbData_1.2.13 Formula_1.2-5
## [61] cli_3.6.3 fansi_1.0.6 mixsqp_0.3-54
## [64] S4Arrays_1.6.0 gtable_0.3.5 rstatix_0.7.2
## [67] yulab.utils_0.2.0 sass_0.4.9 digest_0.6.37
## [70] SparseArray_1.6.2 farver_2.1.2 htmltools_0.5.8.1
## [73] lifecycle_1.0.4 httr_1.4.7 statmod_1.5.0