DIABIMMUNE Cohort - microbiome
Getting Started
Code chunk options
Install the required R libraries
About the DIABIMMUNE Study
The DIABIMMUNE project (short for “Pathogenesis of Type 1 Diabetes: Testing the Hygiene Hypothesis”) is a longitudinal, prospective study aiming to understand how the gut microbiome influences the development of autoimmune diseases, particularly type 1 diabetes (T1D).
Study Goals
Investigate the relationship between early-life microbial colonization and the development of autoimmunity.
Test the hygiene hypothesis, which suggests that decreased microbial exposure in early life may lead to higher rates of autoimmune and allergic diseases.
Identify microbial and immunological markers that predict or influence T1D onset.
Study Design Highlights
Cohort: Infants recruited in Finland, Estonia, and Russian Karelia—countries with varying levels of T1D incidence and hygiene standards.
Finland has one of the highest rates of T1D globally.
Estonia and Russia provide contrasting environments in terms of microbial exposure and T1D incidence.
Longitudinal Sampling:
Stool samples collected regularly from birth through early childhood.
Blood samples and health data collected to monitor the development of islet autoantibodies and progression toward clinical T1D.
Microbiome Analysis: Use of metagenomic sequencing to analyze gut microbial composition and functional capacity. Particular attention paid to temporal changes in microbial diversity and the appearance of taxa potentially involved in immune regulation.
Scientific Impact
One of the first studies to integrate genetic risk, environmental exposure, and gut microbiota in the context of autoimmunity.
Findings suggest that early-life microbiome composition is different in infants who go on to develop T1D compared to controls.
Has provided public datasets that are widely used for microbiome research, including studies on taxonomic shifts, immune modulation, and biomarker discovery.
Data Availability
Data from the DIABIMMUNE cohort is available through repositories like the European Nucleotide Archive (ENA) and has been included in various bioinformatics tutorials and benchmarking datasets.
For educational and tutorial purposes, subsets of the data (e.g., taxonomic profiles, sequencing reads) are often used to teach microbiome data analysis workflows.
## Load count matrix (Bracken output)
counts <- read.csv("T1D_bracken_output.csv", check.names = FALSE)
## Load sample metadata
meta <- read.csv("T1D_metadata.csv", check.names = FALSE)
## Create amplicon object using ampvis2
t1d <- amp_load(counts, meta)
## Print summary of loaded object
t1d
## ampvis2 object with 3 elements.
## Summary of OTU table:
## Samples OTUs Total#Reads Min#Reads Max#Reads Median#Reads
## 12 6988 74218389 4668626 9120775 5926845.5
## Avg#Reads
## 6184865.75
##
## Assigned taxonomy:
## Kingdom Phylum Class Order Family Genus
## 6988(100%) 6988(100%) 6798(97.28%) 6973(99.79%) 6973(99.79%) 6925(99.1%)
## Species
## 6970(99.74%)
##
## Metadata variables: 15
## Sample_ID, Subject_ID, Case_Control, Gender, Delivery_Route, AbxAtCollection, AbxPreCollection, IllnessAtCollection, IAA_Level, GADA_Level, IA2A_Level, ZNT8A_Level, ICA_Level, Age_at_Collection, T1D_Status
Starting with a heat map is a quick way to explore whether there are major shifts in the most abundant taxa between your groups of interest. For example, we can compare case vs. control:
Much like heat maps, box plots are another great way to get an overview of the top taxa. They offer an additional advantage: showing within-group variation. Let’s use the same grouping as before:
amp_boxplot(t1d,
group_by = "Case_Control",
tax_aggregate = "Species",
tax_add = "Genus",
tax_show = 20)
Ordination
Now that we have meaningful metadata, we can start incorporating it into ordination analyses. In addition to PCA and PCoA, which are both unconstrained ordination methods, ampvis2 also supports constrained ordination methods like CCA and RDA. These are especially useful when you suspect that a specific variable—such as T1D status—might help explain variation in the microbial communities.
Let’s generate a simple PCA to observe natural clustering in the data:
amp_ordinate(t1d,
type="pca",
transform = "hellinger",
sample_color_by="T1D_Status",
sample_label_by= "Gender")
Constrained Ordination
As an example of constrained ordination, we’ll use Redundancy Analysis (RDA), which can be thought of as a constrained version of PCA.
While PCA summarizes overall variation in the dataset without considering sample groupings, RDA constrains the ordination based on a known explanatory variable (here, T1D status). In short, it combines multiple linear regression with PCA to identify species that vary in a way that is associated with the explanatory variable.
So, RDA helps answer the question: “Which species differentiate samples based on T1D status?”
amp_ordinate(t1d,
type="RDA",
transform = "hellinger",
constrain = "T1D_Status",
sample_color_by="T1D_Status",
sample_label_by= "Gender",
sample_colorframe = TRUE,
sample_colorframe_label = "T1D_Status")
Venn Diagram of Core Taxa
Another way to visualize the data is with a Venn diagram, which shows the number of taxa—Species in this example—that are shared between groups at a defined threshold.
amp_venn(
t1d,
group_by = "T1D_Status",
cut_a = 0.1, # Minimum relative abundance
cut_f = 80, # Minimum percent of samples the taxon must be present in
text_size = 5,
normalise = TRUE,
detailed_output = FALSE
)
This plot allows you to see which species are core to each group and which are shared between them.
Alpha Diversity
We’ve already explored the top 20 most abundant taxa, but in total, there are 6,988 unique species detected across all samples. In microbiome studies, we often need to examine the data compositionally, rather than just focusing on abundance.
One common compositional metric is alpha diversity, which measures the richness and/or evenness of species within individual samples.
## Sample_ID Subject_ID Case_Control Gender Delivery_Route AbxAtCollection
## G45108 G45108 E003251 case female vaginal no_abx
## G45068 G45068 E006673 control female vaginal no_abx
## G45071 G45071 E010590 control male vaginal no_abx
## G45075 G45075 T025418 case female vaginal no_abx
## G45250 G45250 E010629 case male vaginal no_abx
## G45226 G45226 E026079 case male vaginal no_abx
## AbxPreCollection IllnessAtCollection IAA_Level GADA_Level IA2A_Level
## G45108 no_abx no_illness 14.083 21.754 0.076
## G45068 no_abx no_illness 0.370 2.700 0.080
## G45071 Amoxicillin no_illness 0.410 0.020 0.080
## G45075 Penicillin V no_illness 8.590 189.700 0.390
## G45250 Amoxicillin no_illness 5.130 7.460 0.120
## G45226 Amoxicillin no_illness 12.280 11.570 0.110
## ZNT8A_Level ICA_Level Age_at_Collection T1D_Status Reads uniqueOTUs
## G45108 0.296 0 508 T1D 4668626 5891
## G45068 0.080 0 705 control 5043178 5883
## G45071 0.100 0 1045 control 5185390 5527
## G45075 0.130 512 1025 T1D 5825803 5660
## G45250 1.370 256 1089 seroconverted 5831702 5797
## G45226 0.100 4 760 seroconverted 5891057 5483
## Shannon Simpson invSimpson
## G45108 3.557845 0.9077125 10.835709
## G45068 3.779808 0.9455507 18.365706
## G45071 3.196606 0.8970294 9.711513
## G45075 3.301305 0.8953137 9.552344
## G45250 3.780163 0.9452665 18.270349
## G45226 3.907345 0.9530379 21.293772
library(ggplot2)
# Merge alpha diversity data with metadata
alpha_merged <- merge(alpha, t1d$metadata, by = "Sample_ID")
# Plot Shannon diversity by T1D status
ggplot(alpha_merged, aes(x = T1D_Status.y,
y = Shannon,
fill = T1D_Status.y)) +
geom_boxplot(alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.4) +
labs(title = "Alpha Diversity (Shannon Index)",
x = "T1D Status",
y = "Shannon Diversity") +
theme_grey() +
theme(legend.position = "none")
What does Shannon Diversity Measure?
Shannon diversity takes into account two key components of diversity:
Richness – the number of different species (or taxa) present in a sample.
Evenness – how evenly the individuals are distributed across those species.
So, it’s not just about how many species are there, but also how balanced their abundances are.
What does it tell you?
A higher Shannon index means more diversity — either more species, more even distribution, or both.
A lower Shannon index means that the community is less diverse — for example, dominated by just a few species.
Session Info
## R version 4.4.3 (2025-02-28)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sequoia 15.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
## [5] purrr_1.0.4 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
## [9] tidyverse_2.0.0 ampvis2_2.8.9 ggplot2_3.5.2
##
## loaded via a namespace (and not attached):
## [1] plotly_4.10.4 sass_0.4.10 generics_0.1.3 stringi_1.8.7
## [5] lattice_0.22-7 hms_1.1.3 digest_0.6.37 magrittr_2.0.3
## [9] timechange_0.3.0 evaluate_1.0.3 grid_4.4.3 RColorBrewer_1.1-3
## [13] bookdown_0.42 fastmap_1.2.0 Matrix_1.7-3 plyr_1.8.9
## [17] jsonlite_2.0.0 ggrepel_0.9.6 ape_5.8-1 mgcv_1.9-3
## [21] httr_1.4.7 viridisLite_0.4.2 scales_1.3.0 permute_0.9-7
## [25] lazyeval_0.2.2 jquerylib_0.1.4 cli_3.6.4 rlang_1.1.6
## [29] crayon_1.5.3 splines_4.4.3 munsell_0.5.1 remotes_2.5.0
## [33] withr_3.0.2 cachem_1.1.0 yaml_2.3.10 vegan_2.6-10
## [37] tools_4.4.3 parallel_4.4.3 tzdb_0.5.0 colorspace_2.1-1
## [41] curl_6.2.2 vctrs_0.6.5 R6_2.6.1 lifecycle_1.0.4
## [45] htmlwidgets_1.6.4 MASS_7.3-65 cluster_2.1.8.1 pkgconfig_2.0.3
## [49] pillar_1.10.2 bslib_0.9.0 gtable_0.3.6 glue_1.8.0
## [53] data.table_1.17.0 rmdformats_1.0.4 Rcpp_1.0.14 xfun_0.52
## [57] tidyselect_1.2.1 rstudioapi_0.17.1 knitr_1.50 farver_2.1.2
## [61] htmltools_0.5.8.1 nlme_3.1-168 labeling_0.4.3 rmarkdown_2.29
## [65] compiler_4.4.3