Analysis Tools | PRIMED Consortium

The PRIMED consortium is using the NHGRI Analysis Visualization and Informatics Lab-space (AnVIL) for shared data storage and collaborative analysis in the cloud. One of the primary ways to perform data analysis and/or share analysis code on AnVIL is via a workflow. Workflows developed as part of PRIMED are located in the PRIMED organization on Dockstore. To use a workflow, click on its Dockstore link and then “Launch with -> AnVIL”.

PRS

PRSmix

The PRSmix workflow encapsulates the PRSmix R package, which integrates multiple PRS to improve prediction accuracy for a target trait.
Workflow steps are: 1) harmonizing SNP effects, 2) computing the PRS, 3) combining multiple PRS.
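The three steps above can be illustrated with a toy sketch. It is not the PRSmix package's API or its actual algorithm: the function names, SNP IDs, effect sizes, dosages, and mixing weights below are all made up for illustration, and the way the package chooses how to combine scores is not shown.

```python
# Toy illustration only: hypothetical names and data, not the PRSmix API.

def harmonize(weights, target_alleles):
    """Step 1: align score effects to the allele counted in the target data.

    weights: {snp_id: (effect_allele, other_allele, beta)}
    target_alleles: {snp_id: (counted_allele, other_allele)}
    """
    aligned = {}
    for snp, (effect, other, beta) in weights.items():
        if snp not in target_alleles:
            continue  # SNP is absent from the target data
        counted, uncounted = target_alleles[snp]
        if (effect, other) == (counted, uncounted):
            aligned[snp] = beta
        elif (effect, other) == (uncounted, counted):
            # Alleles are swapped: flip the sign of the effect
            # (this changes the score only by a constant per SNP).
            aligned[snp] = -beta
        # Any other allele combination is treated as a mismatch and dropped.
    return aligned


def compute_prs(aligned, dosages):
    """Step 2: weighted sum of dosages for one sample."""
    return sum(beta * dosages[snp] for snp, beta in aligned.items() if snp in dosages)


def combine_prs(scores, mixing_weights):
    """Step 3: linear combination of several scores (weights are made up here)."""
    return sum(w * s for w, s in zip(mixing_weights, scores))


score_a = {"rs1": ("A", "G", 0.5), "rs2": ("C", "T", 0.2), "rs3": ("G", "A", 0.1)}
score_b = {"rs1": ("A", "G", 0.25)}
target = {"rs1": ("A", "G"), "rs2": ("T", "C")}
sample_dosages = {"rs1": 2, "rs2": 1}

aligned_a = harmonize(score_a, target)
prs_a = compute_prs(aligned_a, sample_dosages)
prs_b = compute_prs(harmonize(score_b, target), sample_dosages)
combined = combine_prs([prs_a, prs_b], [0.5, 0.5])
print(aligned_a, prs_a, prs_b, combined)
```

In this sketch rs2 is listed with swapped alleles in the target data, so its effect is negated, and rs3 is dropped because the target data does not contain it.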

Ask questions or report issues on the PRIMED #anvil Slack channel or the PRSmix GitHub repository.

GWAS

GENESIS

The primed_genesis_gwas workflow runs an association test using the GENESIS software and formats the resulting summary statistics in the PRIMED data model.

Ask questions or report issues on the PRIMED #anvil Slack channel or the GitHub repository.

Genetic Ancestry Inference

PCA

The create_pca_projection workflow creates SNP loadings from a reference dataset.
The projected_PCA workflow projects another dataset onto the reference PCA space.
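As a rough sketch of what projecting onto a reference PCA space means: each sample's genotypes are centered using reference values and multiplied by the per-SNP loadings of each principal component. The function name, the centering by reference means, and all numbers below are assumptions for illustration; the internals of the workflows (for example any scaling or SNP filtering) are not described here.

```python
# Toy illustration only: hypothetical names and data, not the workflow's code.

def project(genotypes, reference_means, loadings):
    """Project samples onto principal components defined by SNP loadings.

    genotypes: list of samples, each a list of dosages (one per SNP)
    reference_means: per-SNP mean dosage from the reference dataset
    loadings: list of components, each a list of per-SNP loadings
    All three use the same SNP order.
    """
    projected = []
    for sample in genotypes:
        centered = [g - m for g, m in zip(sample, reference_means)]
        coords = [sum(c * w for c, w in zip(centered, pc)) for pc in loadings]
        projected.append(coords)
    return projected


reference_means = [1.0, 0.5, 1.5]
loadings = [
    [1.0, 0.0, 0.0],  # made-up loadings for PC1
    [0.0, 0.6, 0.8],  # made-up loadings for PC2
]
new_samples = [
    [2.0, 0.5, 1.5],
    [1.0, 1.5, 1.5],
]

pcs = project(new_samples, reference_means, loadings)
print(pcs)
```

Because the loadings come from the reference dataset only, the new samples are placed in the reference coordinate system without changing it.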

Admixture

The basic_admixture workflow runs the ADMIXTURE software in unsupervised or supervised mode.
The projected_admixture workflow estimates ancestry fractions relative to the set of clusters generated from basic_admixture.

Ask questions or report issues on the PRIMED #anvil Slack channel or the Ancestry Inference GitHub repository.

Simulation

sim_admixed is a WDL workflow encapsulating the admix-kit software package.

The workflow uses HAPGEN2 to simulate ancestral populations from reference data, and then simulates generations of admixture from the simulated ancestral populations.
The final step in the workflow generates data tables in the PRIMED data model.
The run_hapgen and run_admix steps may also be run separately.
1000 Genomes data formatted for input to this workflow is available (see below).

Ask questions or report issues on the PRIMED #anvil Slack channel or the admix-kit GitHub repository.

Reference Panel

Public AnVIL workspace with 1000 Genomes data in the PRIMED Data Model:

Phase 3 low-coverage sequencing of 2504 samples
High-coverage sequencing of 3202 samples

Each callset has data in the following formats:

VCF files for all populations combined, per chromosome (23 files, chroms 1-22 and X)
PLINK2 files for all populations combined, with all chromosomes (3 files, pgen/psam/pvar)
PLINK2 files per population and chromosome, subset to HapMap3 or MEGA+HapMap3 SNPs: (chroms 1-22 and ALL) x (SNP sets hm3, mega+hm3) x (pgen/psam/pvar) = 138 files for each of 31 groups (26 populations + 5 superpopulations). These files can be used as input for simulation workflows (see above).
File listing samples related to other samples in the dataset at a threshold >= 2nd degree, as estimated by KING. Exclude the samples on this list to obtain an unrelated dataset.
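Using the list of related samples might look like the sketch below. It assumes the list is plain text with one sample ID per line, which is a guess about the file format; the function name and sample IDs are hypothetical.

```python
# Toy illustration only: the one-ID-per-line format and all IDs are assumptions.

def unrelated_samples(all_samples, related_list_text):
    """Return the samples that do not appear in the related-samples list."""
    related = {line.strip() for line in related_list_text.splitlines() if line.strip()}
    return [s for s in all_samples if s not in related]


all_samples = ["S1", "S2", "S3", "S4"]  # made-up sample IDs
related_text = "S2\nS4\n"               # stand-in for the contents of the list file

kept = unrelated_samples(all_samples, related_text)
print(kept)
```

With a real file, the text would be read from disk first and the sample IDs would come from the psam or VCF header of the callset in use.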