Analysis Tools

The PRIMED consortium is using the NHGRI Analysis Visualization and Informatics Lab-space (AnVIL) for shared data storage and collaborative analysis in the cloud. One of the primary ways to perform data analysis and/or share analysis code on AnVIL is via a workflow. Workflows developed as part of PRIMED are located in the PRIMED organization on Dockstore. To use a workflow, click on its Dockstore link and then “Launch with -> AnVIL”.

GWAS

GENESIS

  • The primed_genesis_gwas workflow runs an association test using the GENESIS software and formats the resulting summary statistics in the PRIMED data model.

Ask questions or report issues on the PRIMED #anvil Slack channel or the GitHub repository.

Genetic Ancestry Inference

PCA

  • The create_pca_projection workflow creates SNP loadings from a reference dataset.
  • The projected_PCA workflow projects another dataset onto the reference PCA space.
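The two-step idea above (derive SNP loadings from a reference panel, then project other samples onto them) can be illustrated with a small sketch. This is not the implementation used by create_pca_projection or projected_PCA; it is a toy that assumes a dense 0/1/2 genotype matrix, computes only the first principal component by power iteration, and uses hypothetical function names (reference_loadings, project).

```python
import math
import random


def standardize(genotypes, means, sds):
    """Center and scale each SNP column using the reference means and SDs."""
    return [[(g - m) / s for g, m, s in zip(row, means, sds)] for row in genotypes]


def reference_loadings(ref_genotypes, iterations=200, seed=0):
    """Return (means, sds, loading) for the first PC of a reference panel.

    ref_genotypes: list of samples, each a list of 0/1/2 allele counts.
    The loading is a unit-length vector with one weight per SNP.
    """
    n, m = len(ref_genotypes), len(ref_genotypes[0])
    means = [sum(row[j] for row in ref_genotypes) / n for j in range(m)]
    # Monomorphic SNPs have SD 0; fall back to 1.0 to avoid dividing by zero.
    sds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in ref_genotypes) / n) or 1.0
        for j in range(m)
    ]
    z = standardize(ref_genotypes, means, sds)

    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(m)]
    for _ in range(iterations):
        # Power iteration on Z^T Z: multiply by Z, then by Z^T, then normalize.
        scores = [sum(zi * vi for zi, vi in zip(row, v)) for row in z]
        v = [sum(z[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return means, sds, v


def project(genotypes, means, sds, loading):
    """Project samples onto the reference PC using the reference scaling."""
    z = standardize(genotypes, means, sds)
    return [sum(zi * vi for zi, vi in zip(row, loading)) for row in z]


# Toy reference panel: two samples from each of two made-up groups.
ref = [[0, 0, 2, 2], [0, 1, 2, 2], [2, 2, 0, 0], [2, 1, 0, 0]]
means, sds, loading = reference_loadings(ref)

# New samples are scaled with the reference means and SDs, not their own.
new_scores = project([[0, 0, 2, 2], [2, 2, 0, 0]], means, sds, loading)
print(new_scores)
```

The key design point is that the projected dataset reuses the reference panel's means, standard deviations, and loadings, so its scores land in the same PCA space as the reference samples. The sign of a principal component is arbitrary, so only relative positions are meaningful.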

Admixture

  • The basic_admixture workflow runs the ADMIXTURE software in unsupervised or supervised mode.
  • The projected_admixture workflow estimates ancestry fractions relative to the set of clusters generated from basic_admixture.

Ask questions or report issues on the PRIMED #anvil Slack channel or the GitHub repository.

Simulation

sim_admixed is a WDL workflow encapsulating the admix-kit software package. 

  • The workflow uses HAPGEN2 to simulate ancestral populations from reference data, and then simulates generations of admixture from the simulated ancestral populations.
  • The final step in the workflow generates data tables in the PRIMED data model. 
  • The run_hapgen and run_admix steps may also be run separately.
  • 1000 Genomes data formatted for input to this workflow is available (see below).
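To make "simulating generations of admixture" concrete, here is a toy forward simulation. It is not how sim_admixed, admix-kit, or HAPGEN2 work internally; it assumes two ancestral populations given as lists of 0/1 haplotypes, a single per-site recombination probability, and random mating. The function name simulate_admixture and all of its parameters are hypothetical.

```python
import random


def simulate_admixture(pop_a, pop_b, n_individuals, n_generations,
                       prop_a=0.5, recomb_rate=0.1, seed=0):
    """Toy forward simulation of admixed haplotypes.

    pop_a, pop_b: lists of haplotypes (equal-length lists of 0/1 alleles).
    Returns a list of (alleles, ancestry) pairs, where ancestry[i] is
    "A" or "B" depending on which population site i was inherited from.
    """
    rng = random.Random(seed)
    n_sites = len(pop_a[0])

    # Generation 0: each founder is copied whole from one ancestral population.
    current = []
    for _ in range(n_individuals):
        if rng.random() < prop_a:
            current.append((list(rng.choice(pop_a)), ["A"] * n_sites))
        else:
            current.append((list(rng.choice(pop_b)), ["B"] * n_sites))

    for _ in range(n_generations):
        offspring = []
        for _ in range(n_individuals):
            # Pick two parental haplotypes and copy along them, switching
            # between the two with probability recomb_rate at each site.
            parents = [rng.choice(current), rng.choice(current)]
            source = rng.randrange(2)
            alleles, ancestry = [], []
            for site in range(n_sites):
                if site > 0 and rng.random() < recomb_rate:
                    source = 1 - source
                alleles.append(parents[source][0][site])
                ancestry.append(parents[source][1][site])
            offspring.append((alleles, ancestry))
        current = offspring
    return current


# Made-up ancestral haplotypes: population A carries allele 0, B carries 1.
pop_a = [[0] * 10 for _ in range(5)]
pop_b = [[1] * 10 for _ in range(5)]
admixed = simulate_admixture(pop_a, pop_b, n_individuals=8, n_generations=4)
for alleles, ancestry in admixed[:2]:
    print("".join(map(str, alleles)), "".join(ancestry))
```

Tracking the ancestry label alongside each allele gives the true local ancestry of every simulated haplotype, which is the main reason to simulate admixture rather than sample admixed individuals directly.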

Ask questions or report issues on the PRIMED #anvil Slack channel or the GitHub repository.

Reference Panel

Public AnVIL workspace with 1000 Genomes data in the PRIMED Data Model:

  • Phase 3 low-coverage sequencing of 2504 samples
  • High-coverage sequencing of 3202 samples

Each callset has data in the following formats:

  • VCF files for all populations combined, per chromosome (23 files, chroms 1-22 and X)
  • PLINK2 files for all populations combined, with all chromosomes (3 files, pgen/psam/pvar)
  • PLINK2 files per population and chromosome, subset to HapMap3 or MEGA+HapMap3 SNPs. Each of the 26 populations and 5 superpopulations has 138 files: 23 chromosome sets (chroms 1-22 plus ALL) x 2 SNP sets (hm3, mega+hm3) x 3 file types (pgen/psam/pvar). These files can be used as input for simulation workflows (see above).
  • File listing samples that are related to another sample in the dataset at 2nd degree or closer, as estimated by KING. Exclude the samples in this list to obtain an unrelated dataset.
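The file counts and the related-sample exclusion described above can be sketched in a few lines. This only illustrates the arithmetic and the filtering step; the sample IDs are made up, the function names are hypothetical, and the actual file names and list format in the workspace are not assumed here.

```python
from itertools import product

# The three factors behind the per-group file count listed above.
CHROMS = [str(c) for c in range(1, 23)] + ["ALL"]   # 23 chromosome sets
SNP_SETS = ["hm3", "mega+hm3"]                       # 2 SNP sets
EXTENSIONS = ["pgen", "psam", "pvar"]                # 3 PLINK2 file types

N_GROUPS = 26 + 5  # populations + superpopulations


def files_per_group():
    """Enumerate (chromosome, SNP set, extension) combinations for one group."""
    return list(product(CHROMS, SNP_SETS, EXTENSIONS))


def unrelated_samples(all_samples, related_samples):
    """Drop every sample that appears in the related-samples list."""
    related = set(related_samples)
    return [s for s in all_samples if s not in related]


per_group = len(files_per_group())
print(per_group, "files per group,", per_group * N_GROUPS, "files in total")

# Hypothetical sample IDs, for illustration only.
samples = ["S1", "S2", "S3", "S4"]
related = ["S2", "S4"]
print(unrelated_samples(samples, related))
```

In practice the same exclusion can be applied with whatever tool reads the PLINK2 files, as long as it removes every sample named in the related-samples list before analysis.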