ampvis2: Visualizing 16S amplicon data

Getting Started

_ ### Code chunk options

Install the required R libraries

  1. Go to https://github.com/KasperSkytte/ampvis2

  2. Click on Green Code button -> Download Zip

  3. Unzip it

  4. You will get a folder like “ampvis2-main/”

  5. Install it from the local folder

#install.packages("/full/path/to/ampvis2-main", repos = NULL, type = "source")

#**EXAMPLE
#install.packages("~/Downloads/ampvis2-main", repos = NULL, type = "source")

Libraries loaded

library(ampvis2)
library(tidyverse)

The DIABIMMUNE Study

The DIABIMMUNE project (short for Pathogenesis of Type 1 Diabetes: Testing the Hygiene Hypothesis) is a longitudinal study designed to understand how early-life microbial exposure shapes immune development and the risk of autoimmune diseases such as Type 1 Diabetes (T1D).

Why this study matters

Type 1 Diabetes is an autoimmune disease in which the immune system attacks insulin-producing pancreatic β-cells. While genetics contribute to risk, genetics alone cannot explain the rapid increase in T1D incidence globally.

This study asks a key question:

Could differences in early microbial exposure influence whether the immune system develops tolerance or autoimmunity?

Scientific Impact

  • One of the first studies to integrate genetic risk, environmental exposure, and gut microbiota in the context of autoimmunity.

  • Findings suggest that early-life microbiome composition is different in infants who go on to develop T1D compared to controls.

  • Has provided public datasets that are widely used for microbiome research, including studies on taxonomic shifts, immune modulation, and biomarker discovery.

Data Availability

  • Data from the DIABIMMUNE cohort is available through repositories like the European Nucleotide Archive (ENA) and has been included in various bioinformatics tutorials and benchmarking datasets.

  • For educational and tutorial purposes, subsets of the data (e.g., taxonomic profiles, sequencing reads) are often used to teach microbiome data analysis workflows.

In this tutorial, you are working with:

  • Taxonomic abundance data (Bracken output)
  • Sample metadata (disease status, demographics)
## Load count matrix (Bracken output)
counts <- read.csv("T1D_bracken_output.csv", check.names = FALSE)

## Load sample metadata
meta <- read.csv("T1D_metadata.csv", check.names = FALSE)

## Create amplicon object using ampvis2
t1d <- amp_load(counts, meta)

## Print summary of loaded object
t1d
## ampvis2 object with 3 elements. 
## Summary of OTU table:
##      Samples         OTUs  Total#Reads    Min#Reads    Max#Reads Median#Reads 
##           12         6988     74218389      4668626      9120775    5926845.5 
##    Avg#Reads 
##   6184865.75 
## 
## Assigned taxonomy:
##      Kingdom       Phylum        Class        Order       Family        Genus 
##   6988(100%)   6988(100%) 6798(97.28%) 6973(99.79%) 6973(99.79%)  6925(99.1%) 
##      Species 
## 6970(99.74%) 
## 
## Metadata variables: 15 
##  Sample_ID, Subject_ID, Case_Control, Gender, Delivery_Route, AbxAtCollection, AbxPreCollection, IllnessAtCollection, IAA_Level, GADA_Level, IA2A_Level, ZNT8A_Level, ICA_Level, Age_at_Collection, T1D_Status

This object now links:

  • the metadata
  • with what microbes are present (counts)

Dataset summary:

  • This dataset contains 12 microbiome samples with 6,988 detected taxa (OTUs/species).

Sequencing Depth Total reads: ~74 million Per sample: Min: ~4.7 million Max: ~9.1 million Median: ~5.9 million

Interpretation: Sequencing depth is relatively high and consistent across samples, which is good for downstream comparisons. Nearly all taxa are classified across major levels: Kingdom & Phylum: 100% Genus: ~99% Species: ~99.7%

This is high-quality taxonomic annotation, meaning most reads can be confidently assigned to known organisms.

There are 15 sample variables, including:

Disease-related: T1D_Status, Case_Control Demographics: Gender, Age_at_Collection Clinical markers: autoantibody levels (e.g., IAA, GADA) Environmental factors: antibiotic use, illness, delivery route

This rich metadata allows you to test both biological and environmental influences on the microbiome.

Heatmap

Starting with a heat map is a quick way to explore whether there are major shifts in the most abundant taxa between your groups of interest.

For example, we can compare case vs. control:

#heatmap
amp_heatmap(t1d,
            facet_by = "Case_Control",
            tax_aggregate = "Genus",
            tax_show = 20)

What this shows

  • Rows → taxa (collapsed to Genus level)
  • Columns → samples
  • Colors → relative abundance

Useful for:

  • Identifying taxa that differ between groups
  • Seeing patterns across many samples at once

Questions:

  1. Do certain genera appear enriched in cases vs controls?
  2. Are patterns consistent across individuals?
  3. Is there high variability within groups?

Boxplots

Much like heat maps, box plots are another great way to get an overview of the top taxa. They offer an additional advantage: showing within-group variation.

Below we will be: creating boxplots of the top 20 most abundant taxa, comparing cases vs controls.

#boxplot
amp_boxplot(t1d,
            group_by = "Case_Control",
            tax_aggregate = "Species",
            tax_add = "Genus",
            tax_show = 20)

# t1d = full dataset 
# group_by = splits samples into groups case and control 
# we are looking at species level abundance 
# tax_add will "add" extra labeling information 

Heatmaps show patterns across samples, but boxplots show:

  • Distribution within groups
  • Median differences
  • Outliers

Are differences between Case and Control consistent? Are some taxa highly variable?

Heatmap = “Where are patterns?”; Boxplot = “How consistent are those patterns within groups?”

What is ordination?

Ordination reduces complex, high-dimensional data into 2D space so we can visualize relationships between samples.

  • Each point = one microbiome sample
  • Distance between points = similarity of microbial communities

PCA (Unconstrained)

amp_ordinate(t1d,
             type="pca",
             transform = "hellinger",
             sample_color_by="Case_Control",
             sample_label_by= "Case_Control")

Why Hellinger transformation?

  • Reduces impact of highly abundant taxa

Do Case and Control samples cluster separately? Or are they overlapping?

In addition to PCA and PCoA, which are both unconstrained ordination methods, ampvis2 also supports constrained ordination methods like CCA and RDA. These are especially useful when you suspect that a specific variable—such as T1D status—might help explain variation in the microbial communities.

Constrained Ordination

While PCA summarizes overall variation in the dataset without considering sample groupings, RDA constrains the ordination based on a known explanatory variable (here, T1D status).

So, RDA helps answer the question: “Which species differentiate samples based on T1D status?”

amp_ordinate(t1d,
             type="RDA",
             transform = "hellinger",
             constrain = "T1D_Status",
             sample_color_by="T1D_Status",
             sample_label_by= "Gender",
             sample_colorframe = TRUE,
             sample_colorframe_label = "T1D_Status")

Venn Diagram of Core Taxa

Another way to visualize the data is with a Venn diagram, which shows the number of taxa—Species in this example—that are shared between groups at a defined threshold.

amp_venn(
  t1d,
  group_by = "T1D_Status",
  cut_a = 0.1,           # Minimum relative abundance
  cut_f = 80,            # Minimum percent of samples the taxon must be present in
  text_size = 5,
  normalise = TRUE,
  detailed_output = FALSE
)

This plot allows you to see which species are core to each group and which are shared between them.

Alpha Diversity

We’ve already explored the top 20 most abundant taxa, but in total, there are 6,988 unique species detected across all samples. In microbiome studies, we often need to examine the data compositionally, rather than just focusing on abundance.

One common compositional metric is alpha diversity, which measures the richness and/or evenness of species within individual samples.

alpha <- amp_alphadiv(t1d)
head(alpha)
##        Sample_ID Subject_ID Case_Control Gender Delivery_Route AbxAtCollection
## G45108    G45108    E003251         case female        vaginal          no_abx
## G45068    G45068    E006673      control female        vaginal          no_abx
## G45071    G45071    E010590      control   male        vaginal          no_abx
## G45075    G45075    T025418         case female        vaginal          no_abx
## G45250    G45250    E010629         case   male        vaginal          no_abx
## G45226    G45226    E026079         case   male        vaginal          no_abx
##        AbxPreCollection IllnessAtCollection IAA_Level GADA_Level IA2A_Level
## G45108           no_abx          no_illness    14.083     21.754      0.076
## G45068           no_abx          no_illness     0.370      2.700      0.080
## G45071      Amoxicillin          no_illness     0.410      0.020      0.080
## G45075     Penicillin V          no_illness     8.590    189.700      0.390
## G45250      Amoxicillin          no_illness     5.130      7.460      0.120
## G45226      Amoxicillin          no_illness    12.280     11.570      0.110
##        ZNT8A_Level ICA_Level Age_at_Collection    T1D_Status   Reads uniqueOTUs
## G45108       0.296         0               508           T1D 4668626       5891
## G45068       0.080         0               705       control 5043178       5883
## G45071       0.100         0              1045       control 5185390       5527
## G45075       0.130       512              1025           T1D 5825803       5660
## G45250       1.370       256              1089 seroconverted 5831702       5797
## G45226       0.100         4               760 seroconverted 5891057       5483
##         Shannon   Simpson invSimpson
## G45108 3.557845 0.9077125  10.835709
## G45068 3.779808 0.9455507  18.365706
## G45071 3.196606 0.8970294   9.711513
## G45075 3.301305 0.8953137   9.552344
## G45250 3.780163 0.9452665  18.270349
## G45226 3.907345 0.9530379  21.293772
library(ggplot2)

# Merge alpha diversity data with metadata
alpha_merged <- merge(alpha, t1d$metadata, by = "Sample_ID")

Final Class Exercise:

You will create a visualization to compare alpha diversity (Shannon Index) between individuals with different T1D statuses.

Using ggplot2, generate a figure that:

  • Plots Shannon diversity on the y-axis
  • Groups samples by T1D_status.y on the x-axis
  • Uses boxplots to summarize distributions
  • Overlays individual sample points
  • Colors samples by T1D status
  • Add a clear title “Alpha Diversity (Shannon Index)”
  • Labels axis as T1D Status & Shannon Diversity

Submission:

upload PDF to brightspace!

“Do T1D cases have lower Shannon diversity compared to control?”

If so, this could suggest a disrupted microbiome. If not, diversity alone may not explain disease.

Low diversity = low Shannon = one species dominates High diversity = high Shannon = more species present

# Plot Shannon diversity by T1D status using ggplot2

Session Info

sessionInfo()
## R version 4.5.3 (2026-03-11)
## Platform: x86_64-apple-darwin20
## Running under: macOS Tahoe 26.3.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] lubridate_1.9.4 forcats_1.0.1   stringr_1.6.0   dplyr_1.1.4    
##  [5] purrr_1.2.1     readr_2.1.6     tidyr_1.3.2     tibble_3.3.1   
##  [9] tidyverse_2.0.0 ampvis2_2.8.11  ggplot2_4.0.1  
## 
## loaded via a namespace (and not attached):
##  [1] plotly_4.11.0      sass_0.4.10        generics_0.1.4     stringi_1.8.7     
##  [5] lattice_0.22-9     hms_1.1.4          digest_0.6.39      magrittr_2.0.4    
##  [9] timechange_0.3.0   evaluate_1.0.5     grid_4.5.3         RColorBrewer_1.1-3
## [13] bookdown_0.46      fastmap_1.2.0      Matrix_1.7-4       plyr_1.8.9        
## [17] jsonlite_2.0.0     ggrepel_0.9.6      ape_5.8-1          mgcv_1.9-4        
## [21] httr_1.4.7         viridisLite_0.4.2  scales_1.4.0       permute_0.9-8     
## [25] lazyeval_0.2.2     jquerylib_0.1.4    cli_3.6.5          rlang_1.1.7       
## [29] crayon_1.5.3       splines_4.5.3      withr_3.0.2        cachem_1.1.0      
## [33] yaml_2.3.12        vegan_2.7-2        otel_0.2.0         tools_4.5.3       
## [37] parallel_4.5.3     tzdb_0.5.0         vctrs_0.7.0        R6_2.6.1          
## [41] lifecycle_1.0.5    htmlwidgets_1.6.4  MASS_7.3-65        cluster_2.1.8.2   
## [45] pkgconfig_2.0.3    pillar_1.11.1      bslib_0.9.0        gtable_0.3.6      
## [49] glue_1.8.0         data.table_1.18.0  rmdformats_1.0.4   Rcpp_1.1.1        
## [53] xfun_0.56          tidyselect_1.2.1   rstudioapi_0.18.0  knitr_1.51        
## [57] farver_2.1.2       htmltools_0.5.9    nlme_3.1-168       labeling_0.4.3    
## [61] rmarkdown_2.30     compiler_4.5.3     S7_0.2.1