
The Lin Lab, led by Dr. Xihong Lin at the Harvard T.H. Chan School of Public Health, advances genomics and human disease research using innovative statistical and machine learning methods. Our team analyzes large-scale genetic, genomic, and health data to study complex diseases, focusing on areas like whole genome sequencing, functional variant annotation, polygenic risk prediction, and gene-environment interactions. We develop scalable tools, including FAVOR and STAAR, and prioritize improving prediction accuracy for underrepresented populations.

Projects

Ruoyu Wang, Haoyu Zhang

Debiased Estimating Equation Method for Versatile and Efficient Mendelian Randomization Using Large Numbers of Correlated and Invalid SNPs with Weak Effects

Mendelian randomization (MR) is a powerful tool for uncovering causal effects in the presence of unobserved confounding. It uses single nucleotide polymorphisms (SNPs) as instrumental variables (IVs) to estimate the causal effect. However, SNPs often have small effects on complex traits, leading to bias and low statistical efficiency in MR analysis. Strong linkage disequilibrium among SNPs compounds this issue and poses additional statistical hurdles. To address these challenges, this paper proposes DEEM (Debiased Estimating Equation Method), a summary-statistics-based MR approach that can incorporate numerous correlated SNPs with weak effects. DEEM effectively eliminates weak-IV bias, adequately accounts for the correlations among SNPs, and enhances efficiency by leveraging information from correlated weak IVs. DEEM is a versatile method that allows adjustment for pleiotropic effects and applies to both two-sample and one-sample MR analyses. We establish the consistency and asymptotic normality of the resulting estimator. Extensive simulations and two real-data examples demonstrate that DEEM improves the efficiency and robustness of MR analysis.
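
DEEM's debiased estimating equations are detailed in the paper; as orientation only, the sketch below implements the classical correlation-aware inverse-variance weighted (IVW) estimator from summary statistics, the baseline that DEEM improves upon. It does not include DEEM's weak-IV bias correction, and the variable names are hypothetical.

```python
# Baseline correlation-aware IVW estimator from summary statistics (not DEEM):
# a generalized-least-squares fit of SNP-outcome effects on SNP-exposure effects,
# using the LD correlation matrix to account for correlated instruments.
import numpy as np

def ivw_correlated(beta_exp, beta_out, se_out, ld_corr):
    """beta_exp : (p,) SNP-exposure effect estimates
       beta_out : (p,) SNP-outcome effect estimates
       se_out   : (p,) standard errors of beta_out
       ld_corr  : (p, p) LD correlation matrix among the SNPs"""
    omega = np.diag(se_out) @ ld_corr @ np.diag(se_out)  # covariance of beta_out
    omega_inv = np.linalg.pinv(omega)
    denom = beta_exp @ omega_inv @ beta_exp
    theta = (beta_exp @ omega_inv @ beta_out) / denom    # causal effect estimate
    se_theta = np.sqrt(1.0 / denom)                      # ignores error in beta_exp
    return theta, se_theta
```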

Presented at ENAR 2024 & JSM 2024

Ruoyu Wang

Divide-and-shrink: a heterogeneity-agnostic approach for safe data integration

Data integration is garnering significant interest in modern data science due to the increasing accessibility of data from diverse sources. However, the effectiveness of data integration is often challenged by the widespread and intricate heterogeneity among data sources. This paper focuses on integrating individual-level data from a target population with summary statistics from a source population. We propose the divide-and-shrink (dShrink) method to incorporate the summary statistics. The dShrink method is distinctive for being tuning-free, model-free, and robust to broad types of heterogeneity between data sources. It is guaranteed to outperform the estimator without data integration and can achieve a significant error reduction when the target and source populations are similar. Moreover, it offers flexibility to incorporate auxiliary information and to operate effectively even when the covariance matrix of the summary statistics is not accessible. Empirical evaluations of the dShrink estimator reveal its advantageous performance, underscoring its potential as a robust tool for data integration tasks.
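
The summary above does not spell out the estimator, so the sketch below is only a generic illustration of shrinkage-based data integration, not the dShrink construction: a James-Stein-type combination that shrinks a target-only estimate toward an estimate derived from the source summary statistics. All names are hypothetical.

```python
# Generic shrinkage-based data integration (illustration only, not dShrink):
# shrink a target-only estimate toward a source-derived estimate, borrowing more
# from the source when the two populations appear similar.
import numpy as np

def shrink_toward_source(theta_target, cov_target, theta_source):
    """theta_target : (d,) estimate from target individual-level data
       cov_target   : (d, d) covariance of theta_target
       theta_source : (d,) estimate reconstructed from source summary statistics"""
    diff = theta_target - theta_source
    discrepancy = max(diff @ diff, 1e-12)
    # James-Stein-type factor: near 0 when populations look alike (use the source),
    # near 1 when they differ (fall back to the target-only estimate).
    factor = max(0.0, 1.0 - np.trace(cov_target) / discrepancy)
    return theta_source + factor * diff
```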

Presented at 2024 IMS China Meeting

Tony Chen

ALL-Sum: Fast and scalable ensemble learning method for versatile polygenic risk prediction

Polygenic risk scores (PRS) using data from genome-wide association studies (GWAS) have garnered significant interest for predicting individuals’ genetic predisposition to complex traits and diseases. PRS present great potential for early disease detection, high-risk subject identification, and disease prevention and intervention. Existing PRS methodology faces the challenge of balancing prediction accuracy and computational efficiency. To tackle this challenge, we propose ALL-Sum, which integrates L0L2-penalized regression and ensemble learning using GWAS summary statistics. Our extensive simulations and analysis of 11 traits and diseases demonstrate ALL-Sum’s robustness to a broad spectrum of phenotypes and genetic architectures, offering improved prediction accuracy, runtime, and memory usage compared to the most commonly used methods. ALL-Sum stands as a promising tool to refine clinical risk assessment.
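
The ensembling step can be pictured as learning a linear combination of candidate polygenic scores, each built under a different penalty setting, on a tuning set. The sketch below illustrates that idea only; it is not the released ALL-Sum implementation (linked below), and the variable names are hypothetical.

```python
# Simplified sketch of the ensembling idea: regress the tuning-set phenotype on
# candidate polygenic scores (e.g., from different L0L2 penalty settings) to
# learn combination weights, then apply the weights to new individuals.
import numpy as np

def ensemble_weights(candidate_scores, phenotype):
    """candidate_scores : (n_tuning, k) matrix, one column per candidate PRS
       phenotype        : (n_tuning,) tuning-set phenotype"""
    X = np.column_stack([np.ones(len(phenotype)), candidate_scores])
    weights, *_ = np.linalg.lstsq(X, phenotype, rcond=None)
    return weights  # intercept followed by one weight per candidate score

def ensemble_prs(weights, new_scores):
    """Combine candidate scores for new individuals into a single PRS."""
    return weights[0] + new_scores @ weights[1:]
```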

Presented at JSM 2023, ASHG 2023 & ILCCO/LC3/INTEGRAL Annual Meeting 2023
GitHub: https://github.com/chen-tony/ALL-Sum
Published in PNAS: https://www.pnas.org/doi/10.1073/pnas.2403210121

Tony Chen

SPLENDID incorporates continuous genetic ancestry in biobank-scale data to improve polygenic risk prediction across diverse populations

Polygenic risk scores are widely used in disease risk stratification, but their accuracy varies across diverse populations. Recent methods leverage large-scale multi-ancestry data to improve accuracy in under-represented populations but require labelling individuals by ancestry for prediction. This poses challenges for practical use, as clinical practices are typically not based on ancestry. We propose SPLENDID, a novel penalized regression framework for diverse biobank-scale data. Our method utilizes ancestry principal component interactions to model genetic ancestry as a continuum within a single prediction model for all ancestries, eliminating the need for discrete labels. In extensive simulations and analyses of 9 traits from the All of Us Research Program and UK Biobank, SPLENDID significantly outperformed existing methods in prediction accuracy and model sparsity. By directly incorporating continuous genetic ancestry in model training, SPLENDID stands as a valuable tool for robust risk prediction across diverse populations and fairer clinical implementation.
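
The core modeling device is augmenting the genotype design matrix with SNP by ancestry-PC interaction columns, so that a single sparse penalized regression serves individuals of any ancestry without discrete labels. The sketch below shows only that design-matrix construction, with hypothetical names; it is not the SPLENDID software.

```python
# Minimal sketch of the design-matrix idea: SNP main effects plus SNP x
# ancestry-PC interaction terms, letting genetic ancestry enter a single
# prediction model as a continuum (not the SPLENDID implementation).
import numpy as np

def interaction_design(genotypes, ancestry_pcs):
    """genotypes    : (n, m) genotype dosages
       ancestry_pcs : (n, k) top ancestry principal components
       Returns an (n, m + m*k) matrix: main effects, then interactions."""
    n, m = genotypes.shape
    k = ancestry_pcs.shape[1]
    # column j*k + l of the interaction block is genotypes[:, j] * ancestry_pcs[:, l]
    interactions = (genotypes[:, :, None] * ancestry_pcs[:, None, :]).reshape(n, m * k)
    return np.hstack([genotypes, interactions])
```

The augmented matrix would then feed a sparse penalized regression, and prediction for a new individual uses that individual's own continuous PC values, so no ancestry label is ever required.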

Presented at ASHG 2024 & PRIMED Methods Working Group

Eric Van Buren, Hufeng Zhou, Yi Zhang

Integration of Single-Cell-Sequencing Data into Rare Variant Association Tests

Existing methods for performing rare variant association testing in candidate cis-regulatory element regions do not integrate single-cell-sequencing data, and therefore cannot capture the variability in regulatory element activity between cell types. cellSTAAR, a new method in development, uses functional annotations and variant sets constructed using single-cell-sequencing data to conduct functionally-informed rare variant association testing by cell type.
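
cellSTAAR's test statistic and its cell-type-specific variant sets are described in the full method; as rough orientation, the sketch below shows a simple annotation-weighted burden test for one variant set, the kind of building block into which cell-type-level functional annotations could enter. Names are hypothetical.

```python
# Simple annotation-weighted burden test for one rare-variant set (a generic
# building block of functionally informed tests, not the cellSTAAR statistic).
import numpy as np
from scipy import stats

def weighted_burden_test(genotypes, weights, phenotype, covariates):
    """genotypes  : (n, m) rare-variant genotypes in one regulatory element
       weights    : (m,) functional-annotation weights (e.g., cell-type specific)
       phenotype  : (n,) quantitative trait
       covariates : (n, q) covariates, including an intercept column"""
    burden = genotypes @ weights                      # weighted burden score
    X = np.column_stack([covariates, burden])
    beta, *_ = np.linalg.lstsq(X, phenotype, rcond=None)
    resid = phenotype - X @ beta
    df = len(phenotype) - X.shape[1]
    sigma2 = resid @ resid / df
    cov_beta = sigma2 * np.linalg.pinv(X.T @ X)
    t_stat = beta[-1] / np.sqrt(cov_beta[-1, -1])     # test the burden coefficient
    return 2 * stats.t.sf(abs(t_stat), df)            # two-sided p-value
```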

Presented at CHARGE 2023 & ASHG 2023

Eric Van Buren, Hufeng Zhou

Computational Prioritization of IGVF Experimental Variants using Functional Annotation & Variant Characterization Data

The IGVF consortium aims to massively enhance our understanding of how genetic variation influences disease through the use of experimental techniques and computational modelling. To help experimental centers prioritize variants for future study, and to benchmark the value of various annotation data and computational methods, we are working collaboratively with multiple IGVF centers to build a predictive model designed to prioritize variants that are likely to be functional for experimental study.
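
As a purely illustrative sketch of the prioritization setup (not the consortium pipeline), one could train a supervised classifier on functional-annotation features of variants with known experimental outcomes and rank new candidates by predicted probability of being functional; all names below are hypothetical.

```python
# Illustrative variant-prioritization sketch (not the IGVF model): a supervised
# classifier over annotation features, used to rank candidate variants.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def rank_variants(train_annotations, train_labels, candidate_annotations):
    """train_annotations     : (n, d) annotation features of labeled variants
       train_labels          : (n,) 1 = functional, 0 = non-functional
       candidate_annotations : (m, d) features of variants to prioritize"""
    model = GradientBoostingClassifier().fit(train_annotations, train_labels)
    scores = model.predict_proba(candidate_annotations)[:, 1]
    return np.argsort(-scores)  # candidate indices, highest priority first
```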

Presented at IGVF Annual Meeting in 2023

Rebecca Danning

LACE-UP: An Ensembling Method for Clustering Binary Symptom Data

Neuropsychiatric and behavioral conditions often lack biomarkers and are diagnosed based on the presence or absence of a variety of traits, leading to highly heterogeneous phenotypes that may comprise latent subtypes. Detecting these subtypes can be challenging due to a lack of existing methods for clustering binary data that are robust to various realistic data settings, including not knowing the number of subtypes, the inclusion of symptoms that are unrelated to the true underlying disease, and correlation of symptoms within individuals. LACE-UP is a novel and robust method for clustering binary data that does not require prespecifying the number of clusters and outperforms gold standard techniques in the presence of extraneous variables and local dependence.
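
The summary does not spell out the LACE-UP algorithm, so the sketch below is only a generic consensus-clustering illustration for binary symptom data, not LACE-UP itself: it aggregates co-clustering evidence across resampled runs and a range of candidate cluster numbers, so no single number of clusters has to be prespecified.

```python
# Generic consensus clustering of binary symptom data (illustration only, not
# the LACE-UP algorithm): count how often pairs of subjects co-cluster across
# subsampled runs and candidate cluster numbers.
import numpy as np
from sklearn.cluster import KMeans

def consensus_matrix(X, k_range=(2, 3, 4), n_resamples=50, seed=0):
    """X: (n, p) binary symptom matrix. Returns (n, n) co-clustering frequencies."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    hits = np.zeros((n, n))
    counts = np.zeros((n, n))
    for k in k_range:
        for _ in range(n_resamples):
            idx = rng.choice(n, size=int(0.8 * n), replace=False)
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
            same = (labels[:, None] == labels[None, :]).astype(float)
            hits[np.ix_(idx, idx)] += same
            counts[np.ix_(idx, idx)] += 1.0
    return hits / np.maximum(counts, 1.0)
```

A final partition can then be read off the consensus matrix, for example by hierarchical clustering on one minus the co-clustering frequency.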

Poster presentation at ASHG 2023

Hui Li

HEELS: Accurate and Efficient Estimation of Local Heritability using Summary Statistics and LD Matrix

Existing SNP-heritability estimators that leverage summary statistics from genome-wide association studies (GWAS) are much less efficient (i.e., have larger standard errors) than the restricted maximum likelihood (REML) estimators, which require access to individual-level data. We introduce a new method for local heritability estimation – Heritability Estimation with high Efficiency using LD and association Summary Statistics (HEELS) – that significantly improves the statistical efficiency of summary-statistics-based heritability estimators and attains statistical efficiency comparable to REML (with a relative statistical efficiency greater than 92%). Moreover, we propose representing the empirical LD matrix as the sum of a low-rank matrix and a banded matrix. We show that this way of modeling the LD can not only reduce storage and memory costs but also improve the computational efficiency of heritability estimation. We demonstrate the statistical efficiency of HEELS and the advantages of our proposed LD approximation strategies both in simulations and through empirical analyses of the UK Biobank data.
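
The low-rank-plus-banded representation can be illustrated directly: keep a central band of the empirical LD matrix and approximate the off-band remainder with its top eigencomponents. This is a sketch of the decomposition only, not the HEELS code (linked below).

```python
# Sketch of the banded + low-rank LD approximation: R ~= B + U @ diag(s) @ U.T,
# where B keeps a central band of the empirical LD matrix and (U, s) are the
# largest-magnitude eigencomponents of the off-band residual.
import numpy as np

def banded_plus_lowrank(R, bandwidth, rank):
    """R: (p, p) empirical LD (correlation) matrix."""
    p = R.shape[0]
    offsets = np.abs(np.arange(p)[:, None] - np.arange(p)[None, :])
    B = np.where(offsets <= bandwidth, R, 0.0)        # banded part
    eigval, eigvec = np.linalg.eigh(R - B)            # symmetric off-band residual
    top = np.argsort(np.abs(eigval))[::-1][:rank]     # largest-magnitude components
    return B, eigvec[:, top], eigval[top]
```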

GitHub repository: https://github.com/huilisabrina/HEELS
Presented at ENAR 2023, ASHG 2023 & NESS 2022
Published in Nature Communications

Hui Li

graphREML: Heritability partitioning and enrichment analyses with higher precision

Heritability enrichment analysis has been one of the most valuable approaches to understand genetic architecture and to link functional genomic datasets with disease genetics. Stratified LD score regression (S-LDSC) is the most widely used method for heritability partitioning and enrichment analyses, but S-LDSC has low statistical power; moreover, S-LDSC assumes an unrealistic linear relationship between the heritability of a SNP and its annotations. Recently, Salehi et al. proposed “LD graphical models (LDGMs)”, which represent LD patterns using extremely sparse matrices derived from genome-wide genealogies (Kelleher et al. 2019). LDGMs enable the use of efficient sparse matrix operations, potentially addressing the challenge of likelihood-based heritability partitioning. We introduce graphREML, a new heritability partitioning method that operates on GWAS summary statistics and sparse representations of the LD via the LDGM precision matrices, allowing for overlapping and continuous annotations. In both simulation studies and analyses of real traits, we found that graphREML improves upon S-LDSC by modeling the full likelihood of the summary statistics, and is robust to out-of-sample application of the LD.
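
graphREML's likelihood and link function are given in the full write-up; the sketch below is only a schematic of heritability partitioning with overlapping, continuous annotations, mapping each SNP's annotation vector to a nonnegative per-SNP heritability through a softplus link (an illustrative choice) and computing enrichment for a binary annotation.

```python
# Schematic per-SNP variance model for heritability partitioning (illustration
# only): per-SNP heritability is a nonnegative, possibly nonlinear function of
# overlapping annotations, in contrast to the linear S-LDSC model.
import numpy as np

def per_snp_h2(annotations, tau):
    """annotations: (p, k) annotation matrix; tau: (k,) coefficients.
       Softplus link keeps per-SNP heritability nonnegative."""
    return np.log1p(np.exp(annotations @ tau))

def enrichment(annotations, tau, annot_index):
    """Heritability enrichment of a binary annotation: its share of total h2
       divided by its share of SNPs."""
    h2 = per_snp_h2(annotations, tau)
    mask = annotations[:, annot_index] > 0
    return (h2[mask].sum() / h2.sum()) / mask.mean()
```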

GitHub repository: https://github.com/huilisabrina/graphREML
Presented at ASHG 2023, JSM 2023 & Probabilistic Modeling in Genomics 2023