Data Overview

The PRIMED Consortium brings together over 75 new and existing studies and consortia with a broad range of phenotypes, molecular data types, and ancestral diversity. Consortium-generated data will be made available to the scientific community via the AnVIL platform.

Participant Diversity

Over 40 countries are represented among study participants whose data will be used by the PRIMED Consortium to improve polygenic risk score development and use in diverse genetic ancestry populations.

Map of Earth with country boundaries marked. Blue shading indicates countries from which participants were recruited across studies, cohorts, and consortia with datasets proposed for use by PRIMED Study Sites. Countries: Burkina Faso, Ghana, Kenya, Nigeria, South Africa, Uganda, Canada, Chile, Colombia, Honduras, Jamaica, Mexico, Peru, United States, Bangladesh, China, Hong Kong SAR China, India, Japan, Pakistan, Philippines, Qatar, Singapore, South Korea, Sri Lanka, Taiwan, Austria, Belgium, Denmark, Estonia, Finland, France, Germany, Greece, Iceland, Ireland, Italy, Latvia, Lithuania, Netherlands, Romania, Spain, Sweden, United Kingdom, Australia.

 

Molecular Data

The PRIMED Consortium utilizes molecular data generated via numerous technologies:

  • Exome and Genome Sequencing
  • Genotyping Array
  • Genome-wide Imputation
  • Genomic Summary Results

The Genotype Harmonization Working Group leads the effort to harmonize, standardize, and perform quality control on these data. All individual-level genotype data are available as VCF files in genome build GRCh38.
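
As a rough illustration of what a build check on a VCF file can look like, the sketch below scans VCF meta-information lines for reference and assembly hints. It is not the Working Group's quality control procedure. The function names and the example header are hypothetical, and the `##reference` and `##contig ... assembly=` header fields are optional in VCF, so a missing hint does not by itself mean a file is on a different build.

```python
import re


def vcf_header_assemblies(lines):
    """Collect reference/assembly hints from VCF meta-information lines."""
    hints = set()
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information ends at the #CHROM column header line
        if line.startswith("##reference="):
            hints.add(line.strip().split("=", 1)[1])
        elif line.startswith("##contig="):
            match = re.search(r"assembly=([^,>]+)", line)
            if match:
                hints.add(match.group(1))
    return hints


def looks_like_grch38(lines):
    """Return True if any header hint mentions GRCh38 or hg38."""
    return any(
        re.search(r"grch38|hg38", hint, re.IGNORECASE)
        for hint in vcf_header_assemblies(lines)
    )


# Hypothetical header, for illustration only.
example_header = [
    "##fileformat=VCFv4.2",
    "##reference=GRCh38",
    "##contig=<ID=chr1,length=248956422,assembly=GRCh38>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]

print(looks_like_grch38(example_header))
```

For a real file, the same functions could be given the lines of an opened (and, if needed, decompressed) VCF instead of the in-memory list used here.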

Phenotype Data

The PRIMED Consortium analyzes phenotypes across many domains. Current priority phenotype domains and traits are:

| Domain | Phenotypes |
| --- | --- |
| Anthropometry | Height, Weight, BMI, Waist-hip ratio |
| Blood Pressure | Systolic BP, Diastolic BP, Hypertension |
| Cancer | Breast cancer, Prostate cancer |
| Cardiovascular Disease Events | Coronary artery disease (CAD) |
| Diabetes | Type 1 diabetes, Type 2 diabetes |
| Glycemic Traits | Fasting plasma glucose, Fasting serum glucose, Fasting insulin, HbA1c |
| Kidney Function | Cystatin C, Serum creatinine |
| Hematology | RBC, Hemoglobin, Hematocrit, MCV, MCH, MCHC, RDW, WBC, MPV, Basophil count, Eosinophil count, Lymphocyte count, Monocyte count, Neutrophil count, Platelet count |
| Lipids | HDL, LDL, Total cholesterol, Triglycerides, non-HDL cholesterol |

The Phenotype Harmonization Working Group leads the effort to inventory, harmonize, standardize, and perform quality control on these data.

PRIMED Data Model

AnVIL relies on data models to define a consistent structure of data and metadata in workspaces, including how data elements are linked across data types. AnVIL data models use the “data tables” feature to organize data, which creates a relational database-like structure that standardizes columns and defines how tables link to each other. This maximizes data findability and usefulness, and it simplifies the process of merging data across workspaces for harmonization and joint analysis. 

The PRIMED Genotype Harmonization, Phenotype Harmonization, and Population Descriptors Working Groups developed the PRIMED data model for use in Consortium data workspaces, and all data uploaded to these workspaces are required to conform to this data model. The PRIMED data model is available on GitHub and is depicted in the figure below. If you have questions or are interested in using the PRIMED data model for your own project, please contact the PRIMED Coordinating Center (primedconsortium@uw.edu).

The PRIMED data model. Each colored box represents a table, and lines represent links between tables. The Subject Table (purple) captures information on each subject/participant and is the linking point to the other components of the data model. The Population Descriptor Table (orange) captures detailed population descriptor information on each subject. The Phenotype Tables (dark pink) provide phenotype dataset metadata and links to phenotype data files, while the Phenotype Domain Tables (light pink) represent tabular data files containing individual-level phenotype data for specified harmonized variables in a wide format familiar to cohort studies; unharmonized phenotype data can be shared in tabular data files before harmonization. The Genotype Tables (light blue) capture sample metadata and group samples into sets, which are linked to genotyping technology-specific dataset tables with associated metadata and to file tables that provide links to individual-level genotype data files. The Genomic Summary Result (GSR) Tables (teal) capture metadata about the analyses that generated the GSR and provide links to tabular data files containing the GSR.
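
The linking idea can be sketched in a few lines of plain Python: a subject table acts as the hub, and rows in another table point back to it through a shared key. The table contents, the column names (`subject_id`, `sample_id`), and the helper functions below are hypothetical and much simpler than the actual PRIMED data model, which is defined on GitHub.

```python
# Hypothetical, simplified tables. Column names are illustrative only and
# are not taken from the actual PRIMED data model schema.
subject = [
    {"subject_id": "S1"},
    {"subject_id": "S2"},
    {"subject_id": "S3"},
]
sample = [
    {"sample_id": "SM1", "subject_id": "S1"},
    {"sample_id": "SM2", "subject_id": "S2"},
    {"sample_id": "SM3", "subject_id": "S2"},
]


def index_by(rows, key):
    """Group rows by the value of one column."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[key], []).append(row)
    return grouped


def link_to_subjects(subjects, linked_rows, key="subject_id"):
    """Attach rows from a linked table to each subject via the shared key."""
    by_subject = index_by(linked_rows, key)
    return {s[key]: by_subject.get(s[key], []) for s in subjects}


samples_per_subject = link_to_subjects(subject, sample)
for subject_id, rows in samples_per_subject.items():
    print(subject_id, [row["sample_id"] for row in rows])
```

Because every linked table carries the same key, the same helper could in principle attach population descriptor or phenotype rows to subjects, which is the kind of cross-table merging that a shared data model is meant to simplify.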

 
