Analysis Tools

The PRIMED consortium is using the NHGRI Analysis Visualization and Informatics Lab-space (AnVIL) for shared data storage and collaborative analysis in the cloud. One of the primary ways to perform data analysis and/or share analysis code on AnVIL is via a workflow. Workflows developed as part of PRIMED are located in the PRIMED organization on Dockstore. To use a workflow, click on its Dockstore link and then “Launch with -> AnVIL”.

PRS

Tutorial on applying a PRS to a PRIMED dataset in AnVIL (slides) (video)

The tutorial covers running the primed_fetch_pgs_catalog and primed_calc_pgs workflows. These workflows are illustrated in an example AnVIL workspace.

Ask questions or report issues on the PRIMED #anvil Slack channel or the GitHub repository.

PRSmix

The PRSmix workflow encapsulates the PRSmix R package, which integrates multiple PRS to improve prediction accuracy for a target trait.
Workflow steps are: 1) harmonizing SNP effects, 2) computing the PRS, 3) combining multiple PRS.
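
The three steps above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the PRSmix implementation: the function names, the allele-matching rule, and the fixed mixing weights are all assumptions made for the example (the R package estimates its mixing weights from training data).

```python
import numpy as np

def harmonize(beta, effect_allele, ref, alt):
    """Orient an effect size to the target ALT allele (hypothetical rule).

    Returns None when the effect allele matches neither target allele.
    """
    if effect_allele == alt:
        return beta
    if effect_allele == ref:
        # Flipping the sign shifts every sample's score by the same constant.
        return -beta
    return None

def compute_prs(dosages, betas):
    """Weighted sum of ALT-allele dosages (samples x variants) per sample."""
    return dosages @ betas

def combine_prs(scores, mix_weights):
    """Standardize each PRS column, then take a weighted sum across PRS."""
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    return z @ mix_weights

# Toy data: 2 samples x 3 variants, and 3 samples x 2 precomputed PRS.
dosages = np.array([[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]])
betas = np.array([0.1, -0.2, 0.3])
prs = compute_prs(dosages, betas)

scores = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
combined = combine_prs(scores, np.array([0.5, 0.5]))
```

Standardizing each score before mixing keeps one PRS from dominating the combination simply because it is on a larger scale.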

Ask questions or report issues on the PRIMED #anvil Slack channel or the PRSmix GitHub repository.

HAUDI

The HAUDI collection is a series of workflows to create local ancestry informed PRS models and apply them to a target dataset.

Ask questions or report issues on the PRIMED #anvil Slack channel or the GitHub repository.

GWAS

GENESIS

The primed_genesis_gwas workflow runs an association test using the GENESIS software and formats the resulting summary statistics in the PRIMED data model.

Ask questions or report issues on the PRIMED #anvil Slack channel or the GitHub repository.

Genetic Ancestry Inference

PCA

The create_pca_projection workflow creates SNP loadings from a reference dataset.
The projected_PCA workflow projects another dataset onto the reference PCA space.
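
As a rough illustration of what "creating SNP loadings" and "projecting onto the reference PCA space" mean, the NumPy sketch below fits loadings on a reference genotype matrix and applies them to a second dataset. It is a minimal sketch under simple assumptions (complete genotypes, the same SNPs in the same order, no monomorphic SNPs); the function names are hypothetical and the workflows themselves may use different software and options.

```python
import numpy as np

def fit_loadings(ref_geno, n_pcs):
    """Return per-SNP mean, SD, and loadings (SNPs x n_pcs) from reference data."""
    mean = ref_geno.mean(axis=0)
    sd = ref_geno.std(axis=0)
    z = (ref_geno - mean) / sd
    # Rows of vt are the principal axes in SNP space.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return mean, sd, vt[:n_pcs].T

def project(geno, mean, sd, loadings):
    """Scale genotypes with the reference mean/SD, then apply the loadings."""
    return ((geno - mean) / sd) @ loadings

rng = np.random.default_rng(0)
ref_geno = rng.binomial(2, 0.3, size=(50, 10)).astype(float)     # reference samples
target_geno = rng.binomial(2, 0.3, size=(5, 10)).astype(float)   # samples to project

mean, sd, loadings = fit_loadings(ref_geno, n_pcs=2)
ref_pcs = project(ref_geno, mean, sd, loadings)
target_pcs = project(target_geno, mean, sd, loadings)
```

The key point is that the target dataset is scaled with the reference means and standard deviations, so both datasets share one coordinate system.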

Admixture

The basic_admixture workflow runs the ADMIXTURE software in unsupervised or supervised mode.
The projected_admixture workflow estimates ancestry fractions relative to the set of clusters generated from basic_admixture.

Ask questions or report issues on the PRIMED #anvil Slack channel or the Ancestry Inference GitHub repository.

FLARE

The run_flare workflow runs the FLARE local ancestry inference software.

Ask questions or report issues on the PRIMED #anvil Slack channel or the GitHub repository.

Simulation

sim_admixed is a WDL workflow encapsulating the admix-kit software package.

The workflow uses HAPGEN2 to simulate ancestral populations from reference data, and then simulates generations of admixture from the simulated ancestral populations.
The final step in the workflow generates data tables in the PRIMED data model.
The run_hapgen and run_admix steps may also be run separately.
1000 Genomes data formatted for input to this workflow is available (see below).

Ask questions or report issues on the PRIMED #anvil Slack channel or the admix-kit GitHub repository.

Reference Panel

Public AnVIL workspace with 1000 Genomes data in the PRIMED Data Model:

Phase 3 low-coverage sequencing of 2504 samples
High-coverage sequencing of 3202 samples

Each callset has data in the following formats:

VCF files for all populations combined, per chromosome (23 files: chromosomes 1-22 and X)
PLINK2 files for all populations combined, with all chromosomes (3 files: pgen/psam/pvar)
PLINK2 files per population and chromosome, subset to HapMap3 or MEGA+HapMap3 SNPs. Each of the 26 populations and 5 superpopulations has 138 files: (chromosomes 1-22 plus ALL) x (SNP sets hm3 and mega+hm3) x (pgen/psam/pvar). These files can be used as input for simulation workflows (see above).
File listing samples related to other samples in the dataset at second degree or closer, as estimated by KING. Exclude the samples in this list to obtain an unrelated dataset.
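
The count of per-population PLINK2 files can be checked by enumerating the combinations. The short Python sketch below assumes that "ALL" is a combined file set listed alongside the 22 per-chromosome sets; the labels are illustrative and are not the actual file names in the workspace.

```python
from itertools import product

# Illustrative labels only; the workspace's real file names may differ.
chroms = [str(c) for c in range(1, 23)] + ["ALL"]   # 22 autosomes + combined set
snp_sets = ["hm3", "mega+hm3"]
extensions = ["pgen", "psam", "pvar"]
n_groups = 26 + 5                                   # populations + superpopulations

per_group = list(product(chroms, snp_sets, extensions))
total_files = len(per_group) * n_groups

print(len(per_group))   # 138
print(total_files)      # 4278
```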
