DIABIMMUNE Cohort - microbiome

Getting Started

Code chunk options

Install the required R libraries

#install.packages("remotes")
remotes::install_github("MadsAlbertsen/ampvis2") #Enter 1

Libraries loaded

library(ampvis2)
library(tidyverse)

About the DIABIMMUNE Study

The DIABIMMUNE project (short for “Pathogenesis of Type 1 Diabetes: Testing the Hygiene Hypothesis”) is a longitudinal, prospective study aiming to understand how the gut microbiome influences the development of autoimmune diseases, particularly type 1 diabetes (T1D).

Study Goals

Investigate the relationship between early-life microbial colonization and the development of autoimmunity.
Test the hygiene hypothesis, which suggests that decreased microbial exposure in early life may lead to higher rates of autoimmune and allergic diseases.
Identify microbial and immunological markers that predict or influence T1D onset.

Study Design Highlights

Cohort: Infants recruited in Finland, Estonia, and Russian Karelia—countries with varying levels of T1D incidence and hygiene standards.
Finland has one of the highest rates of T1D globally.
Estonia and Russia provide contrasting environments in terms of microbial exposure and T1D incidence.

Longitudinal Sampling:

Stool samples collected regularly from birth through early childhood.
Blood samples and health data collected to monitor the development of islet autoantibodies and progression toward clinical T1D.
Microbiome Analysis: Use of metagenomic sequencing to analyze gut microbial composition and functional capacity. Particular attention paid to temporal changes in microbial diversity and the appearance of taxa potentially involved in immune regulation.

Scientific Impact

One of the first studies to integrate genetic risk, environmental exposure, and gut microbiota in the context of autoimmunity.
Findings suggest that early-life microbiome composition is different in infants who go on to develop T1D compared to controls.
Has provided public datasets that are widely used for microbiome research, including studies on taxonomic shifts, immune modulation, and biomarker discovery.

Data Availability

Data from the DIABIMMUNE cohort is available through repositories like the European Nucleotide Archive (ENA) and has been included in various bioinformatics tutorials and benchmarking datasets.
For educational and tutorial purposes, subsets of the data (e.g., taxonomic profiles, sequencing reads) are often used to teach microbiome data analysis workflows.

## Load count matrix (Bracken output)
counts <- read.csv("T1D_bracken_output.csv", check.names = FALSE)

## Load sample metadata
meta <- read.csv("T1D_metadata.csv", check.names = FALSE)

## Create amplicon object using ampvis2
t1d <- amp_load(counts, meta)

## Print summary of loaded object
t1d

## ampvis2 object with 3 elements. 
## Summary of OTU table:
##      Samples         OTUs  Total#Reads    Min#Reads    Max#Reads Median#Reads 
##           12         6988     74218389      4668626      9120775    5926845.5 
##    Avg#Reads 
##   6184865.75 
## 
## Assigned taxonomy:
##      Kingdom       Phylum        Class        Order       Family        Genus 
##   6988(100%)   6988(100%) 6798(97.28%) 6973(99.79%) 6973(99.79%)  6925(99.1%) 
##      Species 
## 6970(99.74%) 
## 
## Metadata variables: 15 
##  Sample_ID, Subject_ID, Case_Control, Gender, Delivery_Route, AbxAtCollection, AbxPreCollection, IllnessAtCollection, IAA_Level, GADA_Level, IA2A_Level, ZNT8A_Level, ICA_Level, Age_at_Collection, T1D_Status

Starting with a heat map is a quick way to explore whether there are major shifts in the most abundant taxa between your groups of interest. For example, we can compare case vs. control:

amp_heatmap(t1d,
            facet_by = "Case_Control",
            tax_aggregate = "Genus",
            tax_show = 20)

Much like heat maps, box plots are another great way to get an overview of the top taxa. They offer an additional advantage: showing within-group variation. Let’s use the same grouping as before:

amp_boxplot(t1d,
            group_by = "Case_Control",
            tax_aggregate = "Species",
            tax_add = "Genus",
            tax_show = 20)

Ordination

Now that we have meaningful metadata, we can start incorporating it into ordination analyses. In addition to PCA and PCoA, which are both unconstrained ordination methods, ampvis2 also supports constrained ordination methods like CCA and RDA. These are especially useful when you suspect that a specific variable—such as T1D status—might help explain variation in the microbial communities.

Let’s generate a simple PCA to observe natural clustering in the data:

amp_ordinate(t1d,
             type="pca",
             transform = "hellinger",
             sample_color_by="T1D_Status",
             sample_label_by= "Gender")

Constrained Ordination

As an example of constrained ordination, we’ll use Redundancy Analysis (RDA), which can be thought of as a constrained version of PCA.

While PCA summarizes overall variation in the dataset without considering sample groupings, RDA constrains the ordination based on a known explanatory variable (here, T1D status). In short, it combines multiple linear regression with PCA to identify species that vary in a way that is associated with the explanatory variable.

So, RDA helps answer the question: “Which species differentiate samples based on T1D status?”

amp_ordinate(t1d,
             type="RDA",
             transform = "hellinger",
             constrain = "T1D_Status",
             sample_color_by="T1D_Status",
             sample_label_by= "Gender",
             sample_colorframe = TRUE,
             sample_colorframe_label = "T1D_Status")

Venn Diagram of Core Taxa

Another way to visualize the data is with a Venn diagram, which shows the number of taxa—Species in this example—that are shared between groups at a defined threshold.

amp_venn(
  t1d,
  group_by = "T1D_Status",
  cut_a = 0.1,           # Minimum relative abundance
  cut_f = 80,            # Minimum percent of samples the taxon must be present in
  text_size = 5,
  normalise = TRUE,
  detailed_output = FALSE
)

This plot allows you to see which species are core to each group and which are shared between them.

Alpha Diversity

We’ve already explored the top 20 most abundant taxa, but in total, there are 6,988 unique species detected across all samples. In microbiome studies, we often need to examine the data compositionally, rather than just focusing on abundance.

One common compositional metric is alpha diversity, which measures the richness and/or evenness of species within individual samples.

alpha <- amp_alphadiv(t1d)
head(alpha)

##        Sample_ID Subject_ID Case_Control Gender Delivery_Route AbxAtCollection
## G45108    G45108    E003251         case female        vaginal          no_abx
## G45068    G45068    E006673      control female        vaginal          no_abx
## G45071    G45071    E010590      control   male        vaginal          no_abx
## G45075    G45075    T025418         case female        vaginal          no_abx
## G45250    G45250    E010629         case   male        vaginal          no_abx
## G45226    G45226    E026079         case   male        vaginal          no_abx
##        AbxPreCollection IllnessAtCollection IAA_Level GADA_Level IA2A_Level
## G45108           no_abx          no_illness    14.083     21.754      0.076
## G45068           no_abx          no_illness     0.370      2.700      0.080
## G45071      Amoxicillin          no_illness     0.410      0.020      0.080
## G45075     Penicillin V          no_illness     8.590    189.700      0.390
## G45250      Amoxicillin          no_illness     5.130      7.460      0.120
## G45226      Amoxicillin          no_illness    12.280     11.570      0.110
##        ZNT8A_Level ICA_Level Age_at_Collection    T1D_Status   Reads uniqueOTUs
## G45108       0.296         0               508           T1D 4668626       5891
## G45068       0.080         0               705       control 5043178       5883
## G45071       0.100         0              1045       control 5185390       5527
## G45075       0.130       512              1025           T1D 5825803       5660
## G45250       1.370       256              1089 seroconverted 5831702       5797
## G45226       0.100         4               760 seroconverted 5891057       5483
##         Shannon   Simpson invSimpson
## G45108 3.557845 0.9077125  10.835709
## G45068 3.779808 0.9455507  18.365706
## G45071 3.196606 0.8970294   9.711513
## G45075 3.301305 0.8953137   9.552344
## G45250 3.780163 0.9452665  18.270349
## G45226 3.907345 0.9530379  21.293772

library(ggplot2)

# Merge alpha diversity data with metadata
alpha_merged <- merge(alpha, t1d$metadata, by = "Sample_ID")

# Plot Shannon diversity by T1D status
ggplot(alpha_merged, aes(x = T1D_Status.y, 
                         y = Shannon, 
                         fill = T1D_Status.y)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.2, alpha = 0.4) +
  labs(title = "Alpha Diversity (Shannon Index)",
       x = "T1D Status",
       y = "Shannon Diversity") +
  theme_grey() +
  theme(legend.position = "none")

What does Shannon Diversity Measure?

Shannon diversity takes into account two key components of diversity:

Richness – the number of different species (or taxa) present in a sample.
Evenness – how evenly the individuals are distributed across those species.

So, it’s not just about how many species are there, but also how balanced their abundances are.

What does it tell you?

A higher Shannon index means more diversity — either more species, more even distribution, or both.
A lower Shannon index means that the community is less diverse — for example, dominated by just a few species.

Session Info

sessionInfo()

## R version 4.4.3 (2025-02-28)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sequoia 15.3
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] lubridate_1.9.4 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
##  [5] purrr_1.0.4     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
##  [9] tidyverse_2.0.0 ampvis2_2.8.9   ggplot2_3.5.2  
## 
## loaded via a namespace (and not attached):
##  [1] plotly_4.10.4      sass_0.4.10        generics_0.1.3     stringi_1.8.7     
##  [5] lattice_0.22-7     hms_1.1.3          digest_0.6.37      magrittr_2.0.3    
##  [9] timechange_0.3.0   evaluate_1.0.3     grid_4.4.3         RColorBrewer_1.1-3
## [13] bookdown_0.42      fastmap_1.2.0      Matrix_1.7-3       plyr_1.8.9        
## [17] jsonlite_2.0.0     ggrepel_0.9.6      ape_5.8-1          mgcv_1.9-3        
## [21] httr_1.4.7         viridisLite_0.4.2  scales_1.3.0       permute_0.9-7     
## [25] lazyeval_0.2.2     jquerylib_0.1.4    cli_3.6.4          rlang_1.1.6       
## [29] crayon_1.5.3       splines_4.4.3      munsell_0.5.1      remotes_2.5.0     
## [33] withr_3.0.2        cachem_1.1.0       yaml_2.3.10        vegan_2.6-10      
## [37] tools_4.4.3        parallel_4.4.3     tzdb_0.5.0         colorspace_2.1-1  
## [41] curl_6.2.2         vctrs_0.6.5        R6_2.6.1           lifecycle_1.0.4   
## [45] htmlwidgets_1.6.4  MASS_7.3-65        cluster_2.1.8.1    pkgconfig_2.0.3   
## [49] pillar_1.10.2      bslib_0.9.0        gtable_0.3.6       glue_1.8.0        
## [53] data.table_1.17.0  rmdformats_1.0.4   Rcpp_1.0.14        xfun_0.52         
## [57] tidyselect_1.2.1   rstudioapi_0.17.1  knitr_1.50         farver_2.1.2      
## [61] htmltools_0.5.8.1  nlme_3.1-168       labeling_0.4.3     rmarkdown_2.29    
## [65] compiler_4.4.3