ampvis2: Visualizing 16S amplicon
data
Getting Started
_ ### Code chunk options
Install the required R libraries
Click on Green Code button -> Download Zip
Unzip it
You will get a folder like “ampvis2-main/”
Install it from the local folder
The DIABIMMUNE Study
The DIABIMMUNE project (short for Pathogenesis of Type 1 Diabetes: Testing the Hygiene Hypothesis) is a longitudinal study designed to understand how early-life microbial exposure shapes immune development and the risk of autoimmune diseases such as Type 1 Diabetes (T1D).
Why this study matters
Type 1 Diabetes is an autoimmune disease in which the immune system attacks insulin-producing pancreatic β-cells. While genetics contribute to risk, genetics alone cannot explain the rapid increase in T1D incidence globally.
This study asks a key question:
Could differences in early microbial exposure influence whether the immune system develops tolerance or autoimmunity?
Scientific Impact
One of the first studies to integrate genetic risk, environmental exposure, and gut microbiota in the context of autoimmunity.
Findings suggest that early-life microbiome composition is different in infants who go on to develop T1D compared to controls.
Has provided public datasets that are widely used for microbiome research, including studies on taxonomic shifts, immune modulation, and biomarker discovery.
Data Availability
Data from the DIABIMMUNE cohort is available through repositories like the European Nucleotide Archive (ENA) and has been included in various bioinformatics tutorials and benchmarking datasets.
For educational and tutorial purposes, subsets of the data (e.g., taxonomic profiles, sequencing reads) are often used to teach microbiome data analysis workflows.
In this tutorial, you are working with:
- Taxonomic abundance data (Bracken output)
- Sample metadata (disease status, demographics)
## Load count matrix (Bracken output)
counts <- read.csv("T1D_bracken_output.csv", check.names = FALSE)
## Load sample metadata
meta <- read.csv("T1D_metadata.csv", check.names = FALSE)
## Create amplicon object using ampvis2
t1d <- amp_load(counts, meta)
## Print summary of loaded object
t1d## ampvis2 object with 3 elements.
## Summary of OTU table:
## Samples OTUs Total#Reads Min#Reads Max#Reads Median#Reads
## 12 6988 74218389 4668626 9120775 5926845.5
## Avg#Reads
## 6184865.75
##
## Assigned taxonomy:
## Kingdom Phylum Class Order Family Genus
## 6988(100%) 6988(100%) 6798(97.28%) 6973(99.79%) 6973(99.79%) 6925(99.1%)
## Species
## 6970(99.74%)
##
## Metadata variables: 15
## Sample_ID, Subject_ID, Case_Control, Gender, Delivery_Route, AbxAtCollection, AbxPreCollection, IllnessAtCollection, IAA_Level, GADA_Level, IA2A_Level, ZNT8A_Level, ICA_Level, Age_at_Collection, T1D_Status
This object now links:
- the metadata
- with what microbes are present (counts)
Dataset summary:
- This dataset contains 12 microbiome samples with 6,988 detected taxa (OTUs/species).
Sequencing Depth Total reads: ~74 million Per sample: Min: ~4.7 million Max: ~9.1 million Median: ~5.9 million
Interpretation: Sequencing depth is relatively high and consistent across samples, which is good for downstream comparisons. Nearly all taxa are classified across major levels: Kingdom & Phylum: 100% Genus: ~99% Species: ~99.7%
This is high-quality taxonomic annotation, meaning most reads can be confidently assigned to known organisms.
There are 15 sample variables, including:
Disease-related: T1D_Status, Case_Control Demographics: Gender, Age_at_Collection Clinical markers: autoantibody levels (e.g., IAA, GADA) Environmental factors: antibiotic use, illness, delivery route
This rich metadata allows you to test both biological and environmental influences on the microbiome.
Heatmap
Starting with a heat map is a quick way to explore whether there are major shifts in the most abundant taxa between your groups of interest.
For example, we can compare case vs. control:
What this shows
- Rows → taxa (collapsed to Genus level)
- Columns → samples
- Colors → relative abundance
Useful for:
- Identifying taxa that differ between groups
- Seeing patterns across many samples at once
Questions:
- Do certain genera appear enriched in cases vs controls?
- Are patterns consistent across individuals?
- Is there high variability within groups?
Boxplots
Much like heat maps, box plots are another great way to get an overview of the top taxa. They offer an additional advantage: showing within-group variation.
Below we will be: creating boxplots of the top 20 most abundant taxa, comparing cases vs controls.
#boxplot
amp_boxplot(t1d,
group_by = "Case_Control",
tax_aggregate = "Species",
tax_add = "Genus",
tax_show = 20)# t1d = full dataset
# group_by = splits samples into groups case and control
# we are looking at species level abundance
# tax_add will "add" extra labeling information Heatmaps show patterns across samples, but boxplots show:
- Distribution within groups
- Median differences
- Outliers
Are differences between Case and Control consistent? Are some taxa highly variable?
Heatmap = “Where are patterns?”; Boxplot = “How consistent are those patterns within groups?”
What is ordination?
Ordination reduces complex, high-dimensional data into 2D space so we can visualize relationships between samples.
- Each point = one microbiome sample
- Distance between points = similarity of microbial communities
PCA (Unconstrained)
amp_ordinate(t1d,
type="pca",
transform = "hellinger",
sample_color_by="Case_Control",
sample_label_by= "Case_Control")Why Hellinger transformation?
- Reduces impact of highly abundant taxa
Do Case and Control samples cluster separately? Or are they overlapping?
In addition to PCA and PCoA, which are both unconstrained ordination
methods, ampvis2 also supports constrained ordination
methods like CCA and RDA. These are especially useful when you suspect
that a specific variable—such as T1D status—might help explain variation
in the microbial communities.
Constrained Ordination
While PCA summarizes overall variation in the dataset without considering sample groupings, RDA constrains the ordination based on a known explanatory variable (here, T1D status).
So, RDA helps answer the question: “Which species differentiate samples based on T1D status?”
amp_ordinate(t1d,
type="RDA",
transform = "hellinger",
constrain = "T1D_Status",
sample_color_by="T1D_Status",
sample_label_by= "Gender",
sample_colorframe = TRUE,
sample_colorframe_label = "T1D_Status")Venn Diagram of Core Taxa
Another way to visualize the data is with a Venn diagram, which shows the number of taxa—Species in this example—that are shared between groups at a defined threshold.
amp_venn(
t1d,
group_by = "T1D_Status",
cut_a = 0.1, # Minimum relative abundance
cut_f = 80, # Minimum percent of samples the taxon must be present in
text_size = 5,
normalise = TRUE,
detailed_output = FALSE
)This plot allows you to see which species are core to each group and which are shared between them.
Alpha Diversity
We’ve already explored the top 20 most abundant taxa, but in total, there are 6,988 unique species detected across all samples. In microbiome studies, we often need to examine the data compositionally, rather than just focusing on abundance.
One common compositional metric is alpha diversity, which measures the richness and/or evenness of species within individual samples.
## Sample_ID Subject_ID Case_Control Gender Delivery_Route AbxAtCollection
## G45108 G45108 E003251 case female vaginal no_abx
## G45068 G45068 E006673 control female vaginal no_abx
## G45071 G45071 E010590 control male vaginal no_abx
## G45075 G45075 T025418 case female vaginal no_abx
## G45250 G45250 E010629 case male vaginal no_abx
## G45226 G45226 E026079 case male vaginal no_abx
## AbxPreCollection IllnessAtCollection IAA_Level GADA_Level IA2A_Level
## G45108 no_abx no_illness 14.083 21.754 0.076
## G45068 no_abx no_illness 0.370 2.700 0.080
## G45071 Amoxicillin no_illness 0.410 0.020 0.080
## G45075 Penicillin V no_illness 8.590 189.700 0.390
## G45250 Amoxicillin no_illness 5.130 7.460 0.120
## G45226 Amoxicillin no_illness 12.280 11.570 0.110
## ZNT8A_Level ICA_Level Age_at_Collection T1D_Status Reads uniqueOTUs
## G45108 0.296 0 508 T1D 4668626 5891
## G45068 0.080 0 705 control 5043178 5883
## G45071 0.100 0 1045 control 5185390 5527
## G45075 0.130 512 1025 T1D 5825803 5660
## G45250 1.370 256 1089 seroconverted 5831702 5797
## G45226 0.100 4 760 seroconverted 5891057 5483
## Shannon Simpson invSimpson
## G45108 3.557845 0.9077125 10.835709
## G45068 3.779808 0.9455507 18.365706
## G45071 3.196606 0.8970294 9.711513
## G45075 3.301305 0.8953137 9.552344
## G45250 3.780163 0.9452665 18.270349
## G45226 3.907345 0.9530379 21.293772
library(ggplot2)
# Merge alpha diversity data with metadata
alpha_merged <- merge(alpha, t1d$metadata, by = "Sample_ID")Final Class Exercise:
You will create a visualization to compare alpha diversity (Shannon Index) between individuals with different T1D statuses.
Using ggplot2, generate a figure that:
- Plots Shannon diversity on the y-axis
- Groups samples by
T1D_status.yon the x-axis - Uses boxplots to summarize distributions
- Overlays individual sample points
- Colors samples by T1D status
- Add a clear title “Alpha Diversity (Shannon Index)”
- Labels axis as T1D Status & Shannon Diversity
Submission:
upload PDF to brightspace!
“Do T1D cases have lower Shannon diversity compared to control?”
If so, this could suggest a disrupted microbiome. If not, diversity alone may not explain disease.
Low diversity = low Shannon = one species dominates High diversity = high Shannon = more species present
Session Info
## R version 4.5.3 (2026-03-11)
## Platform: x86_64-apple-darwin20
## Running under: macOS Tahoe 26.3.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lubridate_1.9.4 forcats_1.0.1 stringr_1.6.0 dplyr_1.1.4
## [5] purrr_1.2.1 readr_2.1.6 tidyr_1.3.2 tibble_3.3.1
## [9] tidyverse_2.0.0 ampvis2_2.8.11 ggplot2_4.0.1
##
## loaded via a namespace (and not attached):
## [1] plotly_4.11.0 sass_0.4.10 generics_0.1.4 stringi_1.8.7
## [5] lattice_0.22-9 hms_1.1.4 digest_0.6.39 magrittr_2.0.4
## [9] timechange_0.3.0 evaluate_1.0.5 grid_4.5.3 RColorBrewer_1.1-3
## [13] bookdown_0.46 fastmap_1.2.0 Matrix_1.7-4 plyr_1.8.9
## [17] jsonlite_2.0.0 ggrepel_0.9.6 ape_5.8-1 mgcv_1.9-4
## [21] httr_1.4.7 viridisLite_0.4.2 scales_1.4.0 permute_0.9-8
## [25] lazyeval_0.2.2 jquerylib_0.1.4 cli_3.6.5 rlang_1.1.7
## [29] crayon_1.5.3 splines_4.5.3 withr_3.0.2 cachem_1.1.0
## [33] yaml_2.3.12 vegan_2.7-2 otel_0.2.0 tools_4.5.3
## [37] parallel_4.5.3 tzdb_0.5.0 vctrs_0.7.0 R6_2.6.1
## [41] lifecycle_1.0.5 htmlwidgets_1.6.4 MASS_7.3-65 cluster_2.1.8.2
## [45] pkgconfig_2.0.3 pillar_1.11.1 bslib_0.9.0 gtable_0.3.6
## [49] glue_1.8.0 data.table_1.18.0 rmdformats_1.0.4 Rcpp_1.1.1
## [53] xfun_0.56 tidyselect_1.2.1 rstudioapi_0.18.0 knitr_1.51
## [57] farver_2.1.2 htmltools_0.5.9 nlme_3.1-168 labeling_0.4.3
## [61] rmarkdown_2.30 compiler_4.5.3 S7_0.2.1