Analysis Tools

The PRIMED consortium is using the NHGRI Analysis Visualization and Informatics Lab-space (AnVIL) for shared data storage and collaborative analysis in the cloud. One of the primary ways to perform data analysis and/or share analysis code on AnVIL is via a workflow. Workflows developed as part of PRIMED are located in the PRIMED organization on Dockstore. To use a workflow, click on its Dockstore link and then “Launch with -> AnVIL”.

GWAS

GENESIS

  • The primed_genesis_gwas workflow runs an association test using the GENESIS software and formats the resulting summary statistics in the PRIMED data model.

Ask questions or report issues on the PRIMED #anvil Slack channel or the GitHub repository.

Genetic Ancestry Inference

PCA

  • The create_pca_projection workflow creates SNP loadings from a reference dataset.
  • The projected_PCA workflow projects another dataset onto the reference PCA space.
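As a conceptual illustration of what projecting onto a reference PCA space means, here is a minimal sketch in plain Python. It is not the code the workflows run. The function name `project_samples` and all numeric values are made up for this example, and it assumes genotypes are coded as allele counts and centered with the reference dataset's per-SNP means.

```python
def project_samples(genotypes, ref_means, loadings):
    """Project samples onto a reference PCA space (illustrative only).

    genotypes: one row per sample, one allele count (0, 1, or 2) per SNP
    ref_means: per-SNP mean allele count in the reference dataset
    loadings:  one row per SNP, one weight per principal component
    """
    n_pcs = len(loadings[0])
    projected = []
    for sample in genotypes:
        # Center with the REFERENCE means so new samples land in the
        # same coordinate system as the reference samples.
        centered = [g - m for g, m in zip(sample, ref_means)]
        # Each PC score is the dot product of the centered genotypes
        # with that component's SNP loadings.
        scores = [
            sum(c * snp_loading[k] for c, snp_loading in zip(centered, loadings))
            for k in range(n_pcs)
        ]
        projected.append(scores)
    return projected


# Toy inputs, invented for illustration: 3 SNPs, 2 PCs, 2 new samples.
ref_means = [1.0, 0.5, 1.5]
loadings = [[0.5, 0.0], [0.0, 1.0], [-0.5, 0.0]]
new_genotypes = [[2, 0, 1], [0, 1, 2]]

pcs = project_samples(new_genotypes, ref_means, loadings)
print(pcs)
```

The point of the sketch is that the loadings and centering values come from the reference dataset only, so the projected dataset does not influence the PCA axes.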

Admixture

  • The basic_admixture workflow runs the ADMIXTURE software in unsupervised or supervised mode.
  • The projected_admixture workflow estimates ancestry fractions relative to the set of clusters generated from basic_admixture.

Ask questions or report issues on the PRIMED #anvil Slack channel or the GitHub repository.

Simulation

sim_admixed is a WDL workflow encapsulating the admix-kit software package. 

  • The workflow uses HAPGEN2 to simulate ancestral populations from reference data, and then simulates generations of admixture from the simulated ancestral populations.
  • The final step in the workflow generates data tables in the PRIMED data model. 
  • The run_hapgen and run_admix steps may also be run separately.
  • 1000 Genomes data formatted for input to this workflow is available (see below).

Ask questions or report issues on the PRIMED #anvil Slack channel or the GitHub repository.

Reference Panel

Public AnVIL workspace with 1000 Genomes data in the PRIMED Data Model:

  • Phase 3 low-coverage sequencing of 2504 samples
  • High-coverage sequencing of 3202 samples

Each callset has data in the following formats:

  • VCF files for all populations combined, per chromosome (23 files, chroms 1-22 and X)
  • PLINK2 files for all populations combined, with all chromosomes (3 files, pgen/psam/pvar)
  • PLINK2 files per population and chromosome, subset to HapMap3 or MEGA+HapMap3 SNPs. Each of the 26 populations and 5 superpopulations has 138 files: 23 chromosome sets (chroms 1-22 plus ALL) x 2 SNP sets (hm3, mega+hm3) x 3 file types (pgen/psam/pvar). These files can be used as input for simulation workflows (see above).
  • A file listing samples related to other samples in the dataset at second degree or closer, as estimated by KING. Exclude the samples on this list to obtain an unrelated dataset.
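The exclusion step in the last item can be sketched in a few lines of Python. This is a minimal sketch: the helper name `drop_related` and the sample IDs are hypothetical, and it assumes both the full sample list and the related-sample list have already been read into memory as lists of ID strings.

```python
def drop_related(sample_ids, related_ids):
    """Return the samples that do not appear on the related-samples list."""
    related = set(related_ids)  # set lookup keeps the filter fast
    return [s for s in sample_ids if s not in related]


# Hypothetical IDs for illustration; real IDs come from the workspace files.
all_samples = ["S1", "S2", "S3", "S4"]
related_samples = ["S2", "S4"]

unrelated = drop_related(all_samples, related_samples)
print(unrelated)
```

The original sample order is preserved, which helps when the result has to line up with other per-sample tables.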