Harvard Chan Microbiome in Public Health Center

Poster Session 2026

IRS: Iterative Reference Selection Normalizes Compositional Microbiome Data
Presented By: Yiming Shi

Microbiome sequencing data are inherently compositional, capturing relative rather than absolute microbial abundances. Standard normalization approaches such as total-sum scaling or log-ratio transformations implicitly assume that the total microbial load is comparable across samples; when this assumption fails, compositional bias causes estimated fold changes to systematically diverge from true absolute differences, inflating false discovery rates in downstream analyses. Reference-based normalization offers a principled remedy by identifying a subset of non-differential taxa and using the summed read counts of this selected reference subset—rather than the full community—to estimate sample-specific size factors. However, existing reference-based methods such as DACOMP and RSim select references in a single pass and remain susceptible to reference contamination: when differentially abundant taxa are inadvertently retained in the reference subset, size factor estimation is distorted and the very bias these methods aim to correct is reintroduced. Experimental solutions such as spike-in controls and flow cytometry can recover absolute abundances directly but are costly, labor-intensive, and infeasible for large-scale studies.
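To make the compositional-bias argument concrete, the following minimal sketch compares total-sum scaling with a reference-based size factor on a toy two-sample example. All counts are hypothetical, and the reference subset is assumed to be known in advance, which is exactly the part a reference-selection method has to estimate.

```python
import math

# Hypothetical read counts (one list per sample, one entry per taxon).
# In absolute terms taxon 0 is 10x more abundant in sample B, while
# taxa 1-3 are unchanged. Both samples are sequenced to 5200 reads.
counts = {
    "A": [1300, 1300, 1300, 1300],
    "B": [4000, 400, 400, 400],
}
reference = [1, 2, 3]  # assumed known non-differential taxa


def normalize(row, size_factor):
    """Divide every count in a sample by that sample's size factor."""
    return [c / size_factor for c in row]


def log2_fold_changes(norm_a, norm_b):
    """Per-taxon log2 fold change of sample B relative to sample A."""
    return [math.log2(b / a) for a, b in zip(norm_a, norm_b)]


# Total-sum scaling: the size factor is the full library size.
tss = {k: normalize(r, sum(r)) for k, r in counts.items()}
# Reference-based: the size factor is the summed count of reference taxa only.
ref = {k: normalize(r, sum(r[j] for j in reference)) for k, r in counts.items()}

tss_lfc = log2_fold_changes(tss["A"], tss["B"])
ref_lfc = log2_fold_changes(ref["A"], ref["B"])

print("TSS log2FC:      ", [round(v, 2) for v in tss_lfc])
print("Reference log2FC:", [round(v, 2) for v in ref_lfc])
```

Under total-sum scaling the unchanged taxa appear depleted and the true tenfold increase is underestimated. With a clean reference the unchanged taxa have a fold change of zero and taxon 0 recovers log2(10).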
We present Iterative Reference Selection (IRS), a computational normalization method that iteratively refines the reference subset to exclude differentially abundant taxa, enabling accurate estimation of sample-specific size factors from standard relative sequencing data. IRS applies a dual-filter strategy combining Kendall rank correlation (non-parametric) and a Poisson generalized linear model with sandwich variance estimation (parametric), retaining a taxon in the reference only if it is non-significant under both tests. Starting from the full taxa set, IRS successively removes differentially abundant features until the reference composition stabilizes, providing robust protection against compositional bias even when a large fraction of taxa are truly differential between conditions.
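The iterative dual-filter idea can be sketched as follows. This is an illustrative reading of the description above, not the IRS package code: the function names, the 0.05 threshold, the two-group design, the choice to re-test only current reference members, and the stopping rule are all assumptions. The Kendall test uses the tie-corrected normal approximation. The Poisson model uses the closed-form fit available for a single binary covariate with offset log(size factor), with an HC0 sandwich variance.

```python
import math


def two_sided_p(z):
    """Two-sided p-value of a standard normal statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def kendall_p(x, y):
    """Kendall rank correlation test (normal approximation with tie correction)."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = (x[j] > x[i]) - (x[j] < x[i])
            dy = (y[j] > y[i]) - (y[j] < y[i])
            s += dx * dy

    def tie_sums(v):
        sizes = [v.count(u) for u in set(v)]
        return (sum(t * (t - 1) for t in sizes),
                sum(t * (t - 1) * (t - 2) for t in sizes),
                sum(t * (t - 1) * (2 * t + 5) for t in sizes))

    x1, x2, x3 = tie_sums(x)
    y1, y2, y3 = tie_sums(y)
    var = (n * (n - 1) * (2 * n + 5) - x3 - y3) / 18.0
    var += x1 * y1 / (2.0 * n * (n - 1))
    var += x2 * y2 / (9.0 * n * (n - 1) * (n - 2))
    return 1.0 if var <= 0 else two_sided_p(s / math.sqrt(var))


def poisson_sandwich_p(group, y, size):
    """Wald test of the group effect in a Poisson log-linear model with
    offset log(size) and HC0 sandwich variance. With one binary covariate
    the fitted rate of each group is simply sum(y) / sum(size)."""
    log_rates, var = [], 0.0
    for g in (0, 1):
        idx = [i for i, gi in enumerate(group) if gi == g]
        tot_y = sum(y[i] for i in idx)
        tot_s = sum(size[i] for i in idx)
        if tot_y == 0:
            return 0.0  # sketch choice: absent from a whole group -> flag it
        rate = tot_y / tot_s
        resid_sq = sum((y[i] - rate * size[i]) ** 2 for i in idx)
        var += resid_sq / tot_y ** 2  # fitted means in a group sum to tot_y
        log_rates.append(math.log(rate))
    beta = log_rates[1] - log_rates[0]
    if var <= 0:
        return 1.0 if beta == 0 else 0.0
    return two_sided_p(beta / math.sqrt(var))


def irs_reference(counts, group, alpha=0.05, max_iter=50):
    """Shrink the reference set until no member is flagged by either test.
    Assumes every sample has a positive reference sum."""
    reference = list(range(len(counts[0])))
    for _ in range(max_iter):
        size = [sum(row[j] for j in reference) for row in counts]
        keep = []
        for j in reference:
            y = [row[j] for row in counts]
            rel = [yi / si for yi, si in zip(y, size)]
            if (kendall_p(group, rel) >= alpha
                    and poisson_sandwich_p(group, y, size) >= alpha):
                keep.append(j)
        if keep == reference or not keep:
            break  # stable, or nothing left (keep the last non-empty set)
        reference = keep
    return reference


# Hypothetical counts: 8 samples x 5 taxa; taxon 0 is raised in group 1.
group = [0, 0, 0, 0, 1, 1, 1, 1]
counts = [
    [5, 90, 110, 100, 100], [6, 110, 90, 100, 100],
    [4, 100, 100, 95, 105], [5, 100, 100, 105, 95],
    [20, 105, 95, 110, 90], [22, 95, 105, 90, 110],
    [18, 100, 100, 100, 100], [20, 100, 100, 100, 100],
]
reference = irs_reference(counts, group)
size_factors = [sum(row[j] for j in reference) for row in counts]
print("reference taxa:", reference)
print("size factors:  ", size_factors)
```

Restricting each pass to current reference members makes the set shrink monotonically, so the loop always terminates. A real implementation would also need to handle covariates, more than two groups, and sparse taxa, none of which this sketch attempts.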
We benchmarked IRS against widely used methods—including TSS, CLR, GMPR, TMM, DACOMP, and RSim—using Dirichlet-multinomial simulations and two real datasets with flow-cytometry-based absolute-abundance ground truth. Across all settings, IRS achieved the lowest mean squared error in log₂ fold-change estimation, near-negligible systematic bias, and the strongest FDR control among all practical methods, closely tracking an idealized oracle reference while preserving sensitivity. Integration with standard differential abundance frameworks (DESeq2, edgeR, Wilcoxon, MaAsLin2) consistently improved F1 scores over competing normalization strategies.
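For readers unfamiliar with the evaluation metrics named above, the sketch below shows how they are commonly computed for a single dataset. The inputs are made-up toy values for illustration only and are not results from this study. The helper names are hypothetical, and the false discovery proportion shown here is the per-dataset quantity whose expectation is the FDR.

```python
def mse(estimates, truth):
    """Mean squared error of estimated log2 fold changes."""
    return sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth)


def bias(estimates, truth):
    """Mean signed error; nonzero values indicate systematic bias."""
    return sum(e - t for e, t in zip(estimates, truth)) / len(truth)


def false_discovery_proportion(called, truly_differential):
    """Share of called taxa that are not truly differential."""
    called, truly_differential = set(called), set(truly_differential)
    if not called:
        return 0.0
    return len(called - truly_differential) / len(called)


def f1_score(called, truly_differential):
    """Harmonic mean of precision and recall of the called set."""
    called, truly_differential = set(called), set(truly_differential)
    tp = len(called & truly_differential)
    if tp == 0:
        return 0.0
    precision = tp / len(called)
    recall = tp / len(truly_differential)
    return 2 * precision * recall / (precision + recall)


# Toy values (not from the poster): four taxa, two truly differential.
true_lfc = [0.0, 0.0, 3.0, -2.0]
est_lfc = [0.5, -0.5, 2.0, -2.0]
truth_set = [2, 3]
called_set = [0, 2, 3]

print("MSE :", mse(est_lfc, true_lfc))
print("bias:", bias(est_lfc, true_lfc))
print("FDP :", false_discovery_proportion(called_set, truth_set))
print("F1  :", f1_score(called_set, truth_set))
```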
IRS provides a practical, computationally accessible alternative to experimental absolute quantification, enabling reliable recovery of absolute microbial signals in large-scale studies where spike-in or flow-cytometry measurements are unavailable. By making absolute-scale inference routinely tractable, IRS supports the translation of microbiome discovery into reproducible, quantitatively interpretable findings for clinical and therapeutic applications. An open-source R package with reproducible analysis code is provided.

