Data Overview

The PRIMED Consortium brings together over 75 new and existing studies and consortia with a broad range of phenotypes, molecular data types, and ancestral diversity. Consortium-generated data will be made available to the scientific community via the AnVIL platform.

Participant Diversity

Over 40 countries are represented among study participants whose data will be used by the PRIMED Consortium to improve polygenic risk score development and use in diverse genetic ancestry populations.

Map of Earth with country boundaries marked. Blue shading indicates countries from which participants were recruited across studies, cohorts, and consortia with datasets proposed for use by PRIMED Study Sites.

Molecular Data

The PRIMED Consortium utilizes molecular data generated via numerous technologies:

  • Exome and Genome Sequencing
  • Genotyping Array
  • Genome-wide Imputation
  • Genomic Summary Results

The Genotype Harmonization Working Group leads the effort to harmonize, standardize, and perform quality control of these data. All individual-level genotype data are available as VCFs in genome build GRCh38.
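As an illustration of what "genome build GRCh38" means in practice, the sketch below guesses the build of a VCF from its `##contig` header lines by comparing the declared length of chromosome 1 against the known lengths for GRCh38 and GRCh37. This is not the Working Group's quality-control procedure; the function name `guess_build` and the example header are hypothetical, and the parser assumes simple, unquoted `key=value` fields.

```python
import re

# Chromosome 1 length differs between builds, so it works as a quick fingerprint.
CHR1_LENGTHS = {248956422: "GRCh38", 249250621: "GRCh37"}

CONTIG_RE = re.compile(r"^##contig=<(.*)>$")


def guess_build(header_lines):
    """Return 'GRCh38', 'GRCh37', or None based on the chr1 contig length.

    Hypothetical helper: assumes contig fields contain no quoted commas.
    """
    for line in header_lines:
        match = CONTIG_RE.match(line.strip())
        if not match:
            continue
        fields = dict(
            part.split("=", 1) for part in match.group(1).split(",") if "=" in part
        )
        # Contigs may be named "1" or "chr1" depending on the reference used.
        if fields.get("ID") in ("1", "chr1") and fields.get("length", "").isdigit():
            return CHR1_LENGTHS.get(int(fields["length"]))
    return None


# Made-up header lines for demonstration only.
example_header = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=chr1,length=248956422>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]
print(guess_build(example_header))
```

A header without contig lines returns `None`, in which case the build would have to be confirmed from the dataset's accompanying metadata instead.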

Phenotype Data

The PRIMED Consortium analyzes phenotypes across many domains. Current priority phenotype domains and traits are:

  • Anthropometry: Height, Weight, BMI, Waist hip ratio
  • Blood Pressure: Systolic BP, Diastolic BP, Hypertension
  • Cancer: Breast cancer, Prostate cancer
  • Cardiovascular Disease Events: Coronary artery disease (CAD)
  • Diabetes: Type 1 diabetes, Type 2 diabetes
  • Glycemic Traits: Fasting plasma glucose, Fasting serum glucose, Fasting insulin, HbA1c
  • Kidney function: Cystatin C, Serum creatinine
  • Hematology: RBC, Hemoglobin, Hematocrit, MCV, MCH, MCHC, RDW, WBC, MPV, Basophil count, Eosinophil count, Lymphocyte count, Monocyte count, Neutrophil count, Platelet count
  • Lipids: HDL, LDL, Total cholesterol, Triglycerides, non-HDL cholesterol

The Phenotype Harmonization Working Group leads the effort to inventory, harmonize, standardize, and perform quality control of these data.
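Several of the listed traits are derived from other measurements, which is one reason harmonization matters: the inputs must share units before the derived value is comparable across studies. The sketch below shows the standard definitions of three such traits. The function names and the choice of input units (kilograms, centimeters, and a common cholesterol unit) are assumptions for illustration, not the Working Group's harmonization rules.

```python
def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2, assuming weight in kg and height in cm."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def non_hdl_cholesterol(total_cholesterol, hdl):
    """Non-HDL cholesterol: total cholesterol minus HDL, in the same unit."""
    return total_cholesterol - hdl


def waist_hip_ratio(waist, hip):
    """Waist circumference divided by hip circumference, in the same unit."""
    return waist / hip


# Made-up values for demonstration only.
print(round(bmi(70.0, 175.0), 2))
print(non_hdl_cholesterol(200.0, 50.0))
print(waist_hip_ratio(80.0, 100.0))
```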

PRIMED Data Model

AnVIL relies on data models to define a consistent structure of data and metadata in workspaces, including how data elements are linked across data types. AnVIL data models use the “data tables” feature to organize data, which creates a relational database-like structure that standardizes columns and defines how tables link to each other. This maximizes data findability and usefulness, and it simplifies the process of merging data across workspaces for harmonization and joint analysis. 

The PRIMED Genotype Harmonization, Phenotype Harmonization, and Population Descriptors Working Groups developed the PRIMED data model for use in Consortium data workspaces, and all data uploaded to these workspaces are required to conform to this data model. The PRIMED data model is available on GitHub and is depicted in the figure below. If you have questions or are interested in using the PRIMED data model for your own project, please contact the PRIMED Coordinating Center (primedconsortium@uw.edu).

The PRIMED data model. Each colored box represents a table. Genotype Tables (light blue) capture metadata about individual-level genotype datasets and provide links to genomic data files. Genomic Summary Result Tables (teal) capture metadata about GSR analyses and provide links to tabular data files. Cohort-format Phenotype Tables (light pink) are tabular data files that contain individual-level phenotype data in a wide format familiar to cohort studies. Unharmonized phenotype data (dark pink) can be shared in tabular data files before harmonization. The Population Descriptor Table (orange) captures detailed population descriptor information. The Subject Table (purple) captures information on each subject/participant and is the linking point to the other components of the data model. Lines represent a foreign key connection between tables. The data model is flexible, so tables can be added or updated over time.
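To make the foreign key idea concrete, the following minimal sketch treats two tables as lists of rows, checks that every phenotype row points to an existing subject, and then joins the two tables on the shared key. The column names (`subject_id`, `study`, `height_cm`), the row values, and the helper functions are hypothetical and chosen only for illustration; the actual table definitions are those in the PRIMED data model on GitHub.

```python
# Hypothetical Subject Table: one row per participant.
subjects = [
    {"subject_id": "S1", "study": "StudyA"},
    {"subject_id": "S2", "study": "StudyA"},
    {"subject_id": "S3", "study": "StudyB"},
]

# Hypothetical phenotype table: each row refers to a subject by key.
phenotypes = [
    {"subject_id": "S1", "height_cm": 170.0},
    {"subject_id": "S3", "height_cm": 182.5},
    {"subject_id": "S9", "height_cm": 160.0},  # no matching subject row
]


def find_orphan_keys(child_rows, parent_rows, key):
    """Return child key values that have no matching row in the parent table."""
    parent_keys = {row[key] for row in parent_rows}
    return sorted({row[key] for row in child_rows} - parent_keys)


def join_on_key(parent_rows, child_rows, key):
    """Inner join: one merged row per child row whose key exists in the parent."""
    parent_by_key = {row[key]: row for row in parent_rows}
    return [
        {**parent_by_key[row[key]], **row}
        for row in child_rows
        if row[key] in parent_by_key
    ]


orphans = find_orphan_keys(phenotypes, subjects, "subject_id")
joined = join_on_key(subjects, phenotypes, "subject_id")
print(orphans)
print(joined)
```

Because every table links back to the subject key, the same two helpers could be reused to combine rows from several workspaces, provided the workspaces follow the same column conventions.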

 
