The PRIMED consortium is using the NHGRI Analysis Visualization and Informatics Lab-space (AnVIL) for shared data storage and collaborative analysis in the cloud. One of the primary ways to perform data analysis and/or share analysis code on AnVIL is via a workflow. Workflows developed as part of PRIMED are located in the PRIMED organization on Dockstore. To use a workflow, click on its Dockstore link and then “Launch with -> AnVIL”.
PRS
PRSmix
- The PRSmix workflow encapsulates the PRSmix R package, which integrates multiple PRS to improve prediction accuracy for a target trait.
- Workflow steps are: 1) harmonizing SNP effects, 2) computing the PRS, 3) combining multiple PRS.
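The three steps above can be illustrated with a minimal Python sketch. This is not the PRSmix R package API or the workflow's implementation: the function names, the allele-matching rule, and the fixed mixing weights are all assumptions made for illustration. In practice the mixing weights would be estimated from training data rather than supplied by hand.

```python
def harmonize_effect(beta, effect_allele, counted_allele, other_allele):
    """Step 1 (illustrative): align a SNP effect to the allele counted in the genotypes.

    Returns the effect as-is if the alleles already agree, the negated effect
    if the effect allele is the other allele, and None if neither matches.
    """
    if effect_allele == counted_allele:
        return beta
    if effect_allele == other_allele:
        return -beta
    return None  # alleles cannot be reconciled; the SNP is skipped


def compute_prs(dosages, betas):
    """Step 2 (illustrative): a PRS as the sum of dosage x harmonized effect."""
    return sum(d * b for d, b in zip(dosages, betas) if b is not None)


def combine_prs(scores, weights):
    """Step 3 (illustrative): a linear combination of several PRS.

    The weights here are hypothetical placeholders.
    """
    return sum(s * w for s, w in zip(scores, weights))


# Hypothetical example: one sample, three SNPs, two scoring files.
betas_a = [
    harmonize_effect(0.5, "A", "A", "G"),   # already aligned
    harmonize_effect(0.2, "C", "T", "C"),   # flipped
    harmonize_effect(0.3, "G", "A", "T"),   # dropped
]
betas_b = [0.1, 0.1, 0.1]
dosages = [2, 1, 0]

prs_a = compute_prs(dosages, betas_a)
prs_b = compute_prs(dosages, betas_b)
combined = combine_prs([prs_a, prs_b], [0.5, 0.5])
```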
Ask questions or report issues on the PRIMED #anvil Slack channel or the GitHub repository.
GWAS
GENESIS
- The primed_genesis_gwas workflow runs an association test using the GENESIS software and formats the resulting summary statistics in the PRIMED data model.
Ask questions or report issues on the PRIMED #anvil Slack channel or the GitHub repository.
Genetic Ancestry Inference
PCA
- The create_pca_projection workflow creates SNP loadings from a reference dataset.
- The projected_PCA workflow projects another dataset onto the reference PCA space.
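The general idea behind projecting onto a reference PCA space can be sketched as follows. This is an assumption-laden illustration, not the code used by create_pca_projection or projected_PCA: it assumes genotypes are coded as allele dosages (0, 1, 2), standardizes them with reference allele frequencies, and multiplies by the SNP loadings. The function name and input layout are hypothetical.

```python
from math import sqrt


def project_samples(genotypes, ref_allele_freqs, loadings):
    """Project samples onto reference principal components (illustrative).

    genotypes:        one list of allele dosages (0, 1, 2) per sample
    ref_allele_freqs: reference allele frequency for each SNP
    loadings:         one list of per-PC loadings for each SNP
    """
    n_pcs = len(loadings[0])
    projected = []
    for dosages in genotypes:
        scores = [0.0] * n_pcs
        for dosage, freq, snp_loadings in zip(dosages, ref_allele_freqs, loadings):
            if freq <= 0.0 or freq >= 1.0:
                continue  # monomorphic in the reference; cannot be standardized
            # Standardize the dosage using the reference frequency.
            z = (dosage - 2 * freq) / sqrt(2 * freq * (1 - freq))
            for k in range(n_pcs):
                scores[k] += z * snp_loadings[k]
        projected.append(scores)
    return projected


# Hypothetical example: one sample, two SNPs, two PCs.
pcs = project_samples(
    genotypes=[[2, 0]],
    ref_allele_freqs=[0.5, 0.5],
    loadings=[[1.0, 0.0], [0.0, 1.0]],
)
```

Because the loadings come from the reference dataset, both datasets must use the same SNPs and the same counted alleles for the projection to be meaningful.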
Admixture
- The basic_admixture workflow runs the ADMIXTURE software in unsupervised or supervised mode.
- The projected_admixture workflow estimates ancestry fractions relative to the set of clusters generated from basic_admixture.
Ask questions or report issues on the PRIMED #anvil Slack channel or the GitHub repository.
Simulation
sim_admixed is a WDL workflow encapsulating the admix-kit software package.
- The workflow uses HAPGEN2 to simulate ancestral populations from reference data, and then simulates generations of admixture from the simulated ancestral populations.
- The final step in the workflow generates data tables in the PRIMED data model.
- The run_hapgen and run_admix steps may also be run separately.
- 1000 Genomes data formatted for input to this workflow is available (see Reference Panel below).
Ask questions or report issues on the PRIMED #anvil Slack channel or the GitHub repository.
Reference Panel
Public AnVIL workspace with 1000 Genomes data in the PRIMED Data Model:
- Phase 3 low-coverage sequencing of 2504 samples
- High-coverage sequencing of 3202 samples
Each callset has data in the following formats:
- VCF files for all populations combined, per chromosome (23 files, chroms 1-22 and X)
- PLINK2 files for all populations combined, with all chromosomes (3 files, pgen/psam/pvar)
- PLINK2 files per population and chromosome, subset to HapMap3 or MEGA+HapMap3 SNPs. Each of the 26 populations and 5 superpopulations has 138 files: (chroms 1-22 and ALL) x (SNP sets hm3, mega+hm3) x (pgen/psam/pvar). These files can be used as input for simulation workflows (see above).
- File listing samples that are related to other samples in the dataset at second degree or closer, as estimated by KING. Exclude the samples in this list to obtain an unrelated dataset.
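Excluding the related samples can be sketched in a few lines of Python. The layout of the related-samples file is not described here, so this sketch assumes sample IDs have already been read into lists; the function name and the sample IDs are hypothetical placeholders.

```python
def unrelated_samples(all_samples, related_samples):
    """Return the samples not named in the related-samples list, in original order."""
    related = set(related_samples)
    return [sample for sample in all_samples if sample not in related]


# Hypothetical sample IDs for illustration only.
all_ids = ["sample_A", "sample_B", "sample_C", "sample_D"]
related_ids = ["sample_B", "sample_D"]

unrelated_ids = unrelated_samples(all_ids, related_ids)
```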