Software
TGFM
(September 2024) The TGFM software can be downloaded here. TGFM is a fine-mapping method that infers the posterior inclusion probability (PIP) for each gene-tissue pair to mediate a disease locus by analyzing GWAS summary statistics and leveraging eQTL data from diverse tissues to build cis-predicted expression models; TGFM also assigns PIPs to causal variants that are not mediated by gene expression in assayed genes and tissues. Further details are provided in our manuscript, “Fine-mapping causal tissues and genes at disease-associated loci” (Strober et al. medrxiv).
pgBoost
(May 2024) The pgBoost software can be downloaded here. pgBoost is a gradient boosting approach for linking regulatory SNPs to target genes that trains a non-linear combination of existing single-cell multiome methods, plus genomic distance, on fine-mapped eQTL data to assign a probabilistic score to each candidate SNP-gene link. Further details are provided in our manuscript, “Linking regulatory variants to target genes by integrating single-cell multiome methods and genomic distance” (Dorans et al. medrxiv).
MultiSuSiE
(May 2024) The MultiSuSiE software can be downloaded here. MultiSuSiE is a method for multi-ancestry fine-mapping, allowing causal effect sizes to vary across ancestries based on a multivariate normal prior informed by empirical data. Further details are provided in our manuscript, “MultiSuSiE improves multi-ancestry fine-mapping in All of Us whole-genome sequencing data” (Rossen et al. medrxiv).
DDx-PRS
(February 2024) The DDx-PRS software can be downloaded here. Differential Diagnosis-Polygenic Risk Score (DDx-PRS) is a method that jointly estimates posterior probabilities of each possible diagnostic category (e.g. SCZ=50%, BIP=25%, MDD=15%, control=10%) by modeling variance/covariance structure across disorders, leveraging case-control polygenic risk scores for each disorder (computed using existing methods) and prior clinical probabilities for each diagnostic category. Further details are provided in our manuscript, “Distinguishing different psychiatric disorders using DDx-PRS” (Peyrot et al. medrxiv).
LDSPEC
(December 2023) The LDSPEC software can be downloaded here. LD SNP-pair effect correlation regression (LDSPEC) is a method for estimating the correlation of causal disease effect sizes of derived alleles between proximal SNPs, depending on their allele frequencies, LD, and functional annotations. Further details are provided in our manuscript, “Pervasive correlations between causal disease effects of proximal SNPs vary with functional annotations and implicate stabilizing selection” (Zhang et al. medrxiv).
ATM
(January 2023) The ATM software can be downloaded here. The Age-dependent Topic Modeling (ATM) method provides a low-rank representation of longitudinal records of hundreds of distinct diseases in large EHR data sets, enabling the identification of disease subtypes based on heterogeneous comorbidity profiles. Further details are provided in our manuscript, “Age-dependent topic modelling of comorbidities in UK Biobank identifies disease subtypes with differential genetic risk” (Jiang et al. 2023).
TCSC
(August 2022) The TCSC software can be downloaded here. Tissue co-regulation score regression (TCSC) is a method that jointly analyzes GWAS and gene expression data to disentangle causal tissues from tagging tissues and partition disease heritability (or covariance between two diseases/traits) into tissue-specific components, as described in our manuscript “Modeling tissue co-regulation to estimate tissue specific contributions to disease” (Amariuta et al. 2023).
scDRS
(September 2021) The scDRS software can be downloaded here. scDRS is a method that links scRNA-seq with polygenic risk of disease at individual cell resolution, as described in our manuscript “Polygenic enrichment distinguishes disease associations of individual cells in single-cell RNA-seq data” (Zhang*, Hou* et al. 2022).
S2G
(August 2021) The S2G software can be downloaded here. The S2G software can be used to estimate the precision and recall of SNP-to-gene (S2G) linking strategies, and to construct combined S2G strategies to optimize their informativeness for disease risk, as described in our manuscript “Combining SNP-to-gene linking strategies to identify disease genes and assess disease omnigenicity” (Gazal et al. 2022).
GCSC
(July 2021) The GCSC software can be downloaded here. GCSC is a method for leveraging gene co-regulation to identify gene sets enriched for disease heritability, as described in our manuscript “Leveraging gene co-regulation to identify gene sets enriched for disease heritability” (Siewert-Rocks et al. 2022).
Sc-linker
(May 2021) Sc-linker is a framework for integrating scRNA-seq, epigenetic maps and GWAS summary statistics to infer the underlying cell types and processes by which genetic variants influence disease, as described in our manuscript “Identifying disease-critical cell types and cellular processes across the human body by integration of single-cell profiles and human genetics” (Jagadeesh*, Dey* et al. 2022). Software for constructing cell type, disease progression and cellular process gene programs can be downloaded here and here. Postprocessed scRNA-seq data, gene programs, and enhancer-gene linking annotations can be downloaded here.
PRS-FH
(April 2021) The PRS-FH software can be downloaded here. PRS-FH is a method for leveraging family history information to increase polygenic risk score prediction accuracy, as described in our manuscript “Incorporating family history of disease improves polygenic risk scores in diverse populations” (Hujoel et al. 2022).
PolyFun, PolyPred and PolyLoc
(January 2021) The PolyFun software, implementing PolyFun and PolyPred, can be downloaded here. PolyFun is a method that leverages genome-wide functional annotations to improve fine-mapping power, as described in our manuscript “Functionally-informed fine-mapping and polygenic localization of complex trait heritability” (Weissbrod et al. 2020). PolyPred is a method that leverages fine-mapping via PolyFun to improve polygenic prediction accuracy, particularly in diverse populations, as described in our manuscript “Leveraging fine-mapping and non-European training data to improve trans-ethnic polygenic risk scores” (Weissbrod*, Kanai*, Shi* et al. 2022).
(October 2019) The PolyLoc software can be downloaded here. PolyLoc is a method that leverages the results of PolyFun to perform polygenic localization of complex trait heritability, as described in our manuscript “Functionally-informed fine-mapping and polygenic localization of complex trait heritability” (Weissbrod et al. 2020).
DeepBoost/Imperio
(September 2020) The DeepBoost/Imperio software can be downloaded here. This software includes (i) DeepBoost, a gradient boosting method for constructing boosted deep learning annotations by integrating deep learning allelic-effect annotations with fine-mapped SNPs; (ii) tools to combine these deep learning annotations with SNP-to-gene (S2G) linking strategies and relevant gene sets, and (iii) Imperio, a method for integrating deep learning annotations with S2G strategies to predict gene expression in whole blood and construct allelic-effect annotations based on changes in predicted expression. Applications of these 3 approaches to blood-related traits are described in our manuscript “Integrative approaches to improve the informativeness of deep learning models for human complex diseases” (Dey et al. biorxiv). The annotations can be downloaded here.
GSSG
(September 2020) The GSSG software can be downloaded here. GSSG consists of tools to generate enhancer-driven and master-regulator gene scores in blood, and combine these gene scores with distal and proximal SNP-to-gene (S2G) linking strategies to construct SNP annotations for blood-related traits, as described in our manuscript “Unique contribution of enhancer-driven and master-regulator genes to autoimmune disease revealed using functionally informed SNP-to-gene linking strategies” (Dey et al. 2022). The gene scores, S2G links, and SNP annotations can be downloaded here.
CC-GWAS
(March 2020) The CC-GWAS software can be downloaded here. CC-GWAS is a method to test for differences in allele frequency among cases of two different disorders using summary statistics from the respective case-control GWAS, as described in our manuscript “Identifying loci with different allele frequencies among cases of eight psychiatric disorders using CC-GWAS” (Peyrot et al. 2021).
AnnotBoost
(January 2020) The AnnotBoost software and annotations can be downloaded here. AnnotBoost is a gradient boosting-based framework to impute and denoise Mendelian disease-derived pathogenicity scores to improve their informativeness for common disease, as described in our manuscript “Improving the informativeness of Mendelian disease-derived pathogenicity scores for common disease” (Kim et al. 2020).
S-LDXR
(October 2019) The S-LDXR software can be downloaded here. S-LDXR is a method for stratifying squared trans-ethnic genetic correlation across genomic annotations, as described in our manuscript “Population-specific causal disease effect sizes in functionally important regions impacted by selection” (Shi et al. 2021).
LT-FH
(July 2019) The LT-FH software can be downloaded here. LT-FH is a method for leveraging family history information to improve association power, as described in our manuscript “Combining case-control status and family history of disease increases association power” (Hujoel et al. 2020).
LDpred-funct
(January 2019) The LDpred-funct software can be downloaded here. LDpred-funct is a method for leveraging functional enrichment to improve polygenic prediction accuracy, as described in our manuscript “Modeling functional enrichment improves polygenic prediction accuracy in UK Biobank and 23andMe data sets” (Marquez-Luna et al. 2021).
LD SCORE
(updated October 2018) The ldsc software can be downloaded here. LD Score regression (Bulik-Sullivan et al. 2015a) is a method for distinguishing confounding from polygenicity in genome-wide association studies. Stratified LD Score regression (Finucane et al. 2015; functional annotations here) is a method for partitioning heritability by functional category using GWAS summary statistics. Cross-trait LD Score regression (Bulik-Sullivan et al. 2015b) is a method for estimating genetic correlations using GWAS summary statistics. We have extended stratified LD score regression to gene expression phenotypes (Liu et al. 2017). We have also extended stratified LD score regression to continuous annotations (Gazal et al. 2017). We have also developed an approach that uses stratified LD score regression to identify disease-relevant tissues and cell types with heritability enrichment in specifically expressed genes (Finucane et al. 2018; functional annotations here). We have also extended stratified LD score regression to low-frequency variants (Gazal et al. 2018; functional annotations here). We have also identified conditionally independent signals of disease heritability enrichment for ancient enhancers, enhancers that are conserved across species, ancient promoters, and promoters of loss-of-function intolerant ExAC genes (Hujoel et al. 2019; functional annotations here). We have also identified enriched pathway, network, and pathway+network annotations, concluding that genes with high network connectivity are enriched for disease heritability (Kim et al. 2019; annotations here). We have also identified conditionally independent signals of disease heritability for transposable elements (Hormozdiari et al. 2019; annotations here). 
We have also evaluated conditionally independent signals of disease heritability for deep learning allelic-effect annotations, concluding that deep learning models have yet to achieve their full potential for complex disease and that their informativeness cannot be inferred from their accuracy in predicting regulatory annotations (Dey et al. 2020). The deep learning annotations can be downloaded here.
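The basic idea behind LD Score regression can be sketched in a few lines. Under the model of Bulik-Sullivan et al. 2015a, the expected chi-square statistic of SNP j is N·h²·ℓ_j/M + N·a + 1, where ℓ_j is the LD score, N the sample size, M the number of SNPs, h² the SNP-heritability, and a the contribution of confounding. Regressing chi-square statistics on LD scores therefore gives a slope proportional to h² and an intercept that reflects confounding. The sketch below is a plain unweighted least-squares fit on noise-free toy data; it is not the ldsc implementation, which uses a weighted regression and block-jackknife standard errors, and the function name `ld_score_regression` is hypothetical.

```python
def ld_score_regression(chi2, ld_scores, n_samples, n_snps):
    """Unweighted least-squares fit of chi-square statistics on LD scores.

    Returns (h2_estimate, intercept). Illustrative only: the real ldsc
    software uses regression weights and a block jackknife.
    """
    k = len(chi2)
    mean_l = sum(ld_scores) / k
    mean_c = sum(chi2) / k
    sxy = sum((l - mean_l) * (c - mean_c) for l, c in zip(ld_scores, chi2))
    sxx = sum((l - mean_l) ** 2 for l in ld_scores)
    slope = sxy / sxx                      # estimates N * h2 / M
    intercept = mean_c - slope * mean_l    # estimates N * a + 1
    h2 = slope * n_snps / n_samples
    return h2, intercept


# Toy, noise-free data generated from the model itself (assumed values).
n_samples, n_snps = 50_000, 1_000_000
true_h2, true_intercept = 0.4, 1.05
ld_scores = [5.0, 20.0, 60.0, 120.0, 250.0]
chi2 = [true_intercept + n_samples * true_h2 * l / n_snps for l in ld_scores]

h2_hat, intercept_hat = ld_score_regression(chi2, ld_scores, n_samples, n_snps)
print(round(h2_hat, 6), round(intercept_hat, 6))
```

An intercept above 1, as in this toy example, is the signature of inflation that does not scale with LD score, which is how the method separates confounding from polygenicity.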
S-LD4M
(September 2018) The S-LD4M software can be downloaded here. This software implements our Stratified LD 4th moments regression (S-LD4M) method for estimating polygenicity across allele frequencies and functional categories, as described in our manuscript “Polygenicity of complex traits is explained by negative selection” (O’Connor et al. 2019).
FINDOR
(January 2018) The FINDOR software can be downloaded here. This software implements our Functionally Informed Novel Discovery Of Risk loci (FINDOR) method, as described in our manuscript “Leveraging polygenic functional enrichment to improve GWAS power” (Kichaev et al. 2019).
LCV
(October 2017) The LCV software can be downloaded here. This software implements our Latent Causal Variable (LCV) causal inference method, as described in our manuscript “Distinguishing genetic correlation from causation across 52 diseases and complex traits” (O’Connor et al. 2018).
Signed LD profile regression
(October 2017) Signed LD profile regression software can be downloaded here. Signed LD profile regression is a method for identifying genome-wide directional effects of signed functional annotations on diseases and complex traits, as described in our manuscript “Detecting genome-wide directional effects of transcription factor binding on polygenic disease risk” (Reshef et al. 2018).
Molecular QTL annotations
(October 2017) Molecular QTL annotations described in our manuscript “Leveraging molecular QTL to understand the genetic architecture of diseases and complex traits” (Hormozdiari et al. 2018) can be downloaded here. The molecular QTL annotations include MaxCPP annotations constructed from GTEx eQTL data and MaxCPP annotations constructed from BLUEPRINT eQTL, hQTL (H3K27ac and H3K4me1), sQTL and meQTL data.
BOLT-LMM and BOLT-REML
(September 2017) The BOLT-LMM v2.3 software package (Loh et al. 2018), which includes multi-threaded support for the BGEN v1.2 imputed file format used by UK Biobank, can be downloaded here. Summary association statistics from BOLT-LMM analyses of all N=459K European-ancestry samples in UK Biobank are available here. The BOLT-LMM algorithm (Loh et al. 2015a) rapidly computes statistics for association between phenotype and genotypes using a linear mixed model (LMM). The BOLT-REML algorithm partitions SNP-heritability and estimates genetic correlations using a Monte Carlo algorithm for fast multi-component, multi-trait modeling (Loh et al. 2015b). By default, BOLT-LMM association analysis assumes a Bayesian mixture-of-normals prior for the random effect attributed to SNPs other than the one being tested. This model generalizes the standard “infinitesimal” mixed model used by previous mixed model association methods, providing an opportunity for increased power to detect associations while controlling false positives. Both algorithms are implemented in the BOLT-LMM v2.3 software package; see link for update log.
EIGENSOFT
(June 2017): EIGENSOFT version 7.2.1 is now available for download. The EIGENSOFT package combines functionality from our population genetics methods (Patterson et al. 2006) and our EIGENSTRAT stratification correction method (Price et al. 2006). The EIGENSTRAT method uses principal components analysis to explicitly model ancestry differences between cases and controls along continuous axes of variation; the resulting correction is specific to a candidate marker’s variation in frequency across ancestral populations, minimizing spurious associations while maximizing power to detect true associations. The EIGENSOFT package has a built-in plotting script and supports multiple file formats and quantitative phenotypes.
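The EIGENSTRAT idea described above can be sketched as follows: infer an axis of variation by principal components analysis of normalized genotypes, regress that axis out of both the candidate genotype and the phenotype, and test the adjusted values for association. This is a minimal pure-Python sketch on toy data, not the EIGENSOFT code. The function names are hypothetical, a single principal component is found by power iteration, and the statistic (N − K − 1) times the squared correlation is our reading of Price et al. 2006.

```python
import math


def top_principal_component(genotypes, iterations=100):
    """Top axis of variation across samples, by power iteration.

    genotypes: list of SNPs, each a list of 0/1/2 calls per sample.
    """
    n = len(genotypes[0])
    rows = []
    for snp in genotypes:
        mean = sum(snp) / n
        p = mean / 2.0
        if p <= 0.0 or p >= 1.0:
            continue  # skip monomorphic SNPs
        scale = math.sqrt(p * (1.0 - p))
        rows.append([(g - mean) / scale for g in snp])
    v = [math.sin(i + 1.0) for i in range(n)]  # deterministic start vector
    for _ in range(iterations):
        xv = [sum(r[i] * v[i] for i in range(n)) for r in rows]
        v = [sum(r[i] * w for r, w in zip(rows, xv)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v


def residualize(values, pc):
    """Center the values and remove their projection onto one axis."""
    mean = sum(values) / len(values)
    centered = [x - mean for x in values]
    gamma = sum(c * a for c, a in zip(centered, pc)) / sum(a * a for a in pc)
    return [c - gamma * a for c, a in zip(centered, pc)]


def association_statistic(genotype, phenotype, pcs):
    """(N - K - 1) * squared correlation after adjusting for K orthogonal axes."""
    g, y = list(genotype), list(phenotype)
    for pc in pcs:
        g, y = residualize(g, pc), residualize(y, pc)
    n = len(g)
    mg, my = sum(g) / n, sum(y) / n
    sgy = sum((a - mg) * (b - my) for a, b in zip(g, y))
    sgg = sum((a - mg) ** 2 for a in g)
    syy = sum((b - my) ** 2 for b in y)
    return (n - len(pcs) - 1) * sgy ** 2 / (sgg * syy)


# Toy data: samples 0-3 from one population, samples 4-7 from another.
reference_snps = [
    [2, 2, 2, 2, 0, 0, 0, 0],
    [0, 0, 0, 0, 2, 2, 2, 2],
    [2, 2, 2, 2, 0, 0, 0, 0],
]
candidate = [2, 1, 2, 1, 1, 0, 1, 0]   # frequency differs by population
phenotype = [1, 1, 0, 0, 0, 0, 0, 0]   # prevalence differs by population

pc = top_principal_component(reference_snps)
unadjusted = association_statistic(candidate, phenotype, [])
adjusted = association_statistic(candidate, phenotype, [pc])
print(unadjusted, adjusted)
```

In this toy data the candidate SNP and the phenotype are uncorrelated within each population, so the apparent association comes entirely from the frequency difference between populations and vanishes once the inferred axis is regressed out.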
The latest version of EIGENSOFT (7.2.1) can be downloaded here. Source code, documentation and executables for using EIGENSOFT 6.1.4 on a Linux platform can be downloaded here. New features of EIGENSOFT 6.x include a fastmode option, which implements a very fast PCA approximation (Galinsky et al. 2016a, Galinsky et al. 2016b), and support for multi-threading. EIGENSOFT 6.1.4 includes bug fixes for pcaselection and a better out-of-memory diagnostic message. Our previous release, version 6.1.3, can be downloaded here.
The EIGENSOFT FAQ (Frequently Asked Questions) is available here.
SNP loadings computed from samples with European ancestry from the GERA cohort can be downloaded here.
LTSOFT
(January 2017): LTSOFT version 4.0 can be downloaded here. Changes to version 4.0 include the addition of LT-Fam to the LTMLM software implementing a multivariate liability threshold mixed linear model for settings with related individuals (Hayeck et al. 2017). Changes to version 3.0 include the addition of LTMLM, new software implementing multivariate liability threshold mixed linear model association statistics for an additional increase in power in case-control studies of disease (Hayeck et al. 2015). LTSOFT is a software suite designed to more powerfully leverage clinical covariates such as age, BMI, smoking status, and gender, as well as genetic covariates such as known associated variants, when conducting case-control association studies. Including these covariates in standard regression models is not only suboptimal, but can in many instances reduce power. LTSOFT employs a liability threshold model approach that takes advantage of known epidemiological results to better model the covariates’ relationship to the phenotype of interest (Zaitlen et al. 2012 PLoS Genet and Zaitlen et al. 2012 Bioinformatics).
Eagle
(May 2016) The Eagle v2.0 software (Loh et al. 2016b) estimates haplotype phase either using a phased reference panel or within a genotyped cohort. Eagle2 is now the default phasing method on the Sanger and Michigan imputation servers and uses a new, very fast HMM-based algorithm that improves speed and accuracy over existing methods via two key ideas: a new data structure based on the positional Burrows-Wheeler transform and a rapid search algorithm that explores only the most relevant paths through the HMM. Compared to the Eagle1 algorithm (Loh et al. 2016a), Eagle2 has similar speed but much greater accuracy at sample sizes <50,000. The Eagle software can be downloaded here.
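For readers unfamiliar with the positional Burrows-Wheeler transform (PBWT), the sketch below shows its basic sorting step: at each site, haplotypes are stably partitioned by allele, so that after site k they are ordered by their reversed prefixes and haplotypes sharing a long match ending at k sit next to each other. This is the textbook PBWT ordering only, not Eagle2's data structure or code; the function name `pbwt_prefix_orders` and the toy haplotypes are hypothetical.

```python
def pbwt_prefix_orders(haplotypes):
    """Return the PBWT haplotype ordering before site 0 and after each site.

    haplotypes: list of equal-length lists of 0/1 alleles.
    """
    n_sites = len(haplotypes[0])
    order = list(range(len(haplotypes)))
    orders = [order]
    for k in range(n_sites):
        # Stable partition: carriers of allele 0 first, then allele 1,
        # each group keeping its order from the previous site.
        zeros = [h for h in order if haplotypes[h][k] == 0]
        ones = [h for h in order if haplotypes[h][k] == 1]
        order = zeros + ones
        orders.append(order)
    return orders


toy_haplotypes = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
]
orders = pbwt_prefix_orders(toy_haplotypes)
print(orders[-1])  # → [0, 2, 1, 3]
```

Haplotypes 0 and 2 are identical and end up adjacent in the final ordering, which is the property that makes searching for long shared haplotype matches fast.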
haploSNP
(July 2015) The haploSNP algorithm (Bhatia et al. biorxiv) constructs a set of haplotype polymorphisms (haploSNPs) from phased genotype data. haploSNPs are haplotypes of adjacent SNPs excluding a subset of masked sites that arise from skipped mismatches. Mismatches are skipped only if they can be potentially explained as mutations on a shared background. This is tested using a 4-gamete test between the haploSNP being extended and the mismatch SNP. If all 4 possible allelic combinations are observed, the mismatch cannot be explained as a mutation on a shared background, and the haploSNP is terminated.
Individuals are considered to carry 0, 1, or 2 copies of the haploSNP if none, one, or both of their chromosomes match the haplotype at all unmasked sites. As haploSNPs are biallelic, they can be used in downstream analyses such as heritability estimation and association.
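The two rules described above, the 4-gamete test and the copy-number coding, can be sketched as follows. This is an illustration on toy data rather than the haploSNP software; the function names and the encoding (one 0/1 value per chromosome) are assumptions.

```python
def passes_four_gamete_test(carries_haplosnp, mismatch_alleles):
    """True if fewer than 4 allelic combinations are observed.

    carries_haplosnp: 0/1 per chromosome, whether it carries the haploSNP
                      being extended.
    mismatch_alleles: 0/1 per chromosome, allele at the mismatch SNP.
    With all 4 combinations present, the mismatch cannot be explained as a
    mutation on a shared background, so the haploSNP would be terminated.
    """
    observed = set(zip(carries_haplosnp, mismatch_alleles))
    return len(observed) < 4


def haplosnp_copies(chrom1, chrom2, haplotype, masked_sites):
    """Number of an individual's chromosomes (0, 1, or 2) matching the
    haplotype at every unmasked site."""
    def matches(chrom):
        return all(
            c == h
            for i, (c, h) in enumerate(zip(chrom, haplotype))
            if i not in masked_sites
        )
    return int(matches(chrom1)) + int(matches(chrom2))


# Three combinations observed: the mismatch can be skipped (masked).
skip_ok = passes_four_gamete_test([1, 1, 0, 0, 1], [0, 1, 0, 0, 1])
# All four combinations observed: the haploSNP is terminated.
terminated = not passes_four_gamete_test([1, 1, 0, 0], [0, 1, 0, 1])

# Site 2 is masked, so only the first chromosome matches the haplotype.
copies = haplosnp_copies([0, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 0], {2})
print(skip_ok, terminated, copies)  # → True True 1
```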
The haploSNP software can be downloaded here.
Efficient PCGC Regression
(June 2015) PCGC regression (Golan et al. 2014 PNAS) is designed to avoid biases in REML estimation of heritability in the context of ascertained case-control studies. We have released an efficient implementation of the PCGC regression method. Subject to the restriction that all GRMs be computed over identical lists of individuals (all *.grm.id files must be identical), this implementation eliminates in-memory storage of N x N matrices by accumulating dot products among regressors on-the-fly (i.e., streaming the GRM inputs), speeds up jackknife computations (storing partition results on the fly), and eliminates storage of “cleaned” GRMs (i.e., with PCs projected out) by projecting PCs on-the-fly. The Efficient PCGC Regression software is used in Loh et al. 2015b and Bhatia et al. biorxiv, and can be downloaded here.
LDpred
(March 2015) The LDpred software can be downloaded here. LDpred (Vilhjalmsson et al. 2015) is a method for computing polygenic risk scores from summary association statistics while accounting for LD between markers. The method infers the posterior mean causal effect size of each marker using a non-infinitesimal prior distribution on effect sizes and LD information from an external reference panel.
SNPWEIGHTS
(May 2014): SNPweights version 2.1 can be downloaded here. SNPweights is a software package for inferring genome-wide genetic ancestry using SNP weights precomputed from large external reference panels (Chen et al. 2013 Bioinformatics). Changes to version 2.0 include new SNP weights for Native American reference samples, a new format for SNP weights files, and new software for users to derive SNP weights using their own reference samples. Version 2.1 incorporates a bug fix in the inferanc program, which now works with all snpwt files. SNP weights for European and West African ancestral populations can be downloaded here. SNP weights for European, West African and East Asian ancestral populations can be downloaded here. SNP weights for European, West African, East Asian and Native American ancestral populations can be downloaded here. SNP weights for NW, SE and AJ ancestral populations of European Americans can be downloaded here.
FUNCTIONAL ANNOTATIONS
(March 2014): Functional annotations of SNPs and regions from our functional heritability paper “Regulatory variants explain much more heritability than coding variants across 11 common diseases” (Gusev et al. 2014) can be downloaded here.
IMPG-SUMMARY
(January 2014): ImpG-Summary version 1.0 can be downloaded here. ImpG-Summary is a software package for Gaussian imputation from summary association statistics, as described in our paper “Fast and accurate imputation from summary association statistics” (Pasaniuc et al. 2014).
MIXSCORE
(July 2012): MIXSCORE version 1.3 can be downloaded here. MIXSCORE is a method for combining SNP association and admixture association statistics to increase power in GWAS in admixed populations. For details, see the MIXSCORE paper (Pasaniuc et al. 2011 PLoS Genet, “Enhanced statistical tests for GWAS in admixed populations: assessment using African Americans from CARe and a breast cancer consortium”).
TREESELECT
(April 2012): TreeSelect version 1.1 can be downloaded here. TreeSelect is a software package for inferring natural selection from unusual population differentiation between closely related populations. For details, see our Africa selection paper (Bhatia et al. 2011 AJHG, “Genome-wide comparison of African-ancestry populations from CARe and other cohorts reveals signals of natural selection”).
HAPMIX
(March 2011): HAPMIX version 1.2 can be downloaded here. Improvements to version 1.2 include an explicit check for discordance between admixed and reference population allele frequencies, and a script to interpolate estimates of local ancestry to a superset of SNPs. HAPMIX is an application for accurately inferring chromosomal segments of distinct continental ancestry in admixed populations, using dense genetic data. For details, see the HAPMIX paper (Price et al. 2009).
GENE EXPRESSION HERITABILITY
(January 2011): Source code and gene-by-gene results from our gene expression heritability paper “Single-tissue and cross-tissue heritability of gene expression via identity-by-descent in related or unrelated individuals” (Price et al. 2011) can be downloaded here.