Program in Quantitative Genomics
The Program in Quantitative Genomics (PQG) develops and applies quantitative methods to help handle massive genetic, genomic, and health data. Based in the Harvard Chan School and Longwood Medical Area, its goal is to improve health through the interdisciplinary study of genetics, behavior, environment, and health.
255 Huntington Ave
Building 2, 4th floor
Boston, MA 02115
PQG Working Group
Each year, the PQG organizes a less formal PQG Working Group for all local students, postdocs, and faculty. The goal is to provide the opportunity to present and participate in the discussion of works-in-progress, and to focus on the methods and analysis of high-dimensional data in genetics and genomics.
2024/2025 Student and Postdoc Seminar organizers: Tony Chen, Kerner Gaspard, Xinan Wang
Please direct any logistical questions to Amanda King.
Note: Harvard Chan School seeks to bring in speakers with a wide range of experiences and perspectives. They’re here to share their own insights; they do not speak for the school or the university.
Upcoming Seminar
Tuesday, December 17, 2024
1:00-2:00 PM
Yosuke Tanigawa
Research Scientist, PhD
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Enhancing polygenic prediction with flexible models on individual-level data
Accurate prediction of disease liability and medically relevant traits using genetic, demographic, and environmental factors is critical for advancing precision medicine. Polygenic score (PGS), a statistical approach to aggregate the genetic effects across multiple genetic variants, attracts substantial research interest given its improved predictive accuracy and potential medical relevance. However, the limited transferability of PGS across diverse genetic ancestry groups remains the key challenge. To address the challenges of limited transferability, I hypothesize that flexible predictive models applied directly to individual-level data offer unique opportunities. I will present a series of recently developed models: (1) promoting ancestry diversity and inclusion of admixed individuals, (2) leveraging biological knowledge in variable selection, (3) modeling nonlinear genetic dominance effects, and (4) exploring gene-by-sex interactions. The proposed methods leverage supervised learning applied directly to individual-level data, enabling the modeling of nonlinear genetic effects, which is not feasible with conventional PGS approaches that rely on univariate linear association summary statistics. Finally, I will discuss the challenges and opportunities in leveraging large-scale cohort data to construct flexible and equitable polygenic predictors. Overall, these methods pave the way toward equitable genomic predictions, advancing precision medicine for diverse global populations.
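As a rough illustration of the distinction the abstract draws, the sketch below contrasts a conventional additive polygenic score (a weighted sum of allele dosages) with a score that also carries a heterozygote-specific dominance term, which makes it nonlinear in dosage. It is not the speaker's method: the function names, the heterozygote-indicator coding, and all genotypes and effect sizes are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only. Genotypes, effect sizes, and function names are
# made up for this example and do not come from the talk.

def additive_score(dosages, betas):
    """Conventional additive PGS: weighted sum of allele dosages (0, 1, or 2)."""
    return sum(beta * g for g, beta in zip(dosages, betas))


def dominance_indicator(dosage):
    """Return 1 for heterozygotes (dosage 1), else 0.

    This is one common way to code a dominance deviation (an assumption here).
    """
    return 1 if dosage == 1 else 0


def additive_plus_dominance_score(dosages, betas, deltas):
    """Additive score plus a per-variant heterozygote term.

    The extra term lets a variant's contribution depart from a straight line
    in dosage, which a purely additive score cannot represent.
    """
    dominance = sum(
        delta * dominance_indicator(g) for g, delta in zip(dosages, deltas)
    )
    return additive_score(dosages, betas) + dominance


# One hypothetical individual genotyped at four variants.
dosages = [0, 1, 2, 1]
betas = [0.5, -0.25, 0.125, 1.0]   # hypothetical additive effects
deltas = [0.0, 0.5, 0.0, -0.25]    # hypothetical dominance deviations

print(additive_score(dosages, betas))
print(additive_plus_dominance_score(dosages, betas, deltas))
```

Estimating the dominance terms requires each individual's genotype at each variant, which is one reason such models are described as needing individual-level data rather than summary statistics alone.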
2024-2025 Dates
Phillip Nicol
PhD Candidate
Harvard T.H. Chan School of Public Health
Identifying spatially variable genes by projecting to morphologically relevant directions
Spatial transcriptomics allows for high-resolution sequencing while retaining two-dimensional sample coordinates. A common goal is to identify spatially variable genes within a predefined cell type or domain. However, in many cases this region is implicitly one-dimensional, and consequently standard two-dimensional coordinate-based methods may lack statistical power and precision as they ignore tissue organization. In this talk, we introduce a spectral approach to find the optimal one-dimensional curve approximating the spatial transcriptomics sample coordinates. We then leverage this curve to define a new coordinate system that better represents the tissue morphology. A generalized additive model (GAM) is developed to pinpoint genes exhibiting variable expression in this new coordinate system. Our method directly models gene counts, eliminating the need for normalization or preprocessing steps. Our results indicate superior performance compared to existing hypothesis tests for identifying spatially variable genes, while also accurately pinpointing the precise location and mode of relevant expression patterns. We validate our approach through comprehensive simulations and real data analysis, encompassing diverse platforms such as Visium, Slide-seq, and MERFISH.
Postdoctoral Fellow
MGH and Broad Institute
Genetic drivers and cellular selection of female mosaic X chromosome loss
Mosaic loss of the X chromosome (mLOX) is the most commonly occurring clonal somatic alteration detected in the leukocytes of women, yet little is known about its genetic determinants or phenotypic consequences. To address this, we estimated mLOX in more than 880,000 women across eight biobanks, identifying 12% of women with detectable X loss in approximately 2% of their leukocytes. Out of 1,253 diseases examined, women with mLOX had an elevated risk of myeloid and lymphoid leukemias. Genetic analyses identified 56 common variants influencing mLOX, implicating genes with established roles in chromosomal missegregation, cancer predisposition, and autoimmune diseases. Only a small fraction of these associations were shared with mosaic Y chromosome loss in men, suggesting that different biological processes drive the formation and clonal expansion of sex chromosome missegregation events. Allelic shift analyses identified alleles on the X chromosome that are preferentially retained, demonstrating that variation at many loci across the X chromosome is under cellular selection. A novel polygenic score including 44 independent X chromosome allelic shift loci correctly inferred the retained X chromosomes in 80.7% of mLOX cases in the top decile. Collectively, our results support a model in which germline variants predispose women to acquiring mLOX, with the allelic content of the X chromosome possibly shaping the magnitude of subsequent clonal expansion.
Instructor, MD/PhD
Divisions of Genetics and Rheumatology, Brigham and Women’s Hospital
Genetic determinants of RNA expression are critical for understanding disease mechanisms. However, conventional expression quantitative trait loci (eQTL) analysis using bulk RNA-seq lacks detail on the mRNA lifecycle. While eQTL can affect transcription through promoters or enhancers, they may also act on posttranscriptional modifications that affect RNA stability. To address this, we compared eQTL from mature cell RNA with eQTL from nascent nucleus RNA. We used (i) bulk RNA-seq and single-nucleus (sn)RNA-seq from brain and (ii) single-cell (sc)RNA-seq and snRNA-seq from kidney. Using fine-mapped causal probability, cell RNA eQTL variants in the brain were significantly enriched in transcribed regions (P=4.0×10⁻¹⁴⁵) and RBP binding sites (P=1.2×10⁻⁵¹), indicating regulation at the level of posttranscriptional modification. This enrichment was replicated in an independent kidney dataset. Conversely, nucleus eQTL were enriched in distant cCREs, suggesting regulation at the level of transcription whose effect may be diluted once RNAs are exported from the nuclei.
We identified eQTL driven by stop-gain variants causing nonsense-mediated decay only in cell RNA, and causal variants that lay in distant enhancers in nucleus RNA but in transcribed regions in cell RNA. Interestingly, there were examples of multiple (as many as 18) eQTL causal variants in linkage disequilibrium (LD), all in transcribed regions and RBP binding sites, potentially affecting the stability of mature RNA molecules synergistically. These examples suggest a possible novel multiple-causal-variant hypothesis for eQTL, in contrast to the conventional hypothesis in which conditionally independent variants act through distinct molecular mechanisms (e.g., promoter and enhancer effects). Indeed, we found that eQTL variants in transcribed regions have more variants tagged by LD than those in promoters or enhancers. Overall, cellular and nucleus RNA eQTL revealed distinct genetic determinants of expression, even within the same cell type and tissue.