synthGWAS

Getting Started Guide v1.0

1. Select Cohort Type

Begin by choosing the type of cohort you want to simulate: Prospective or Retrospective.

2. Define Common Cohort Parameters

Regardless of the chosen cohort type, you'll need to configure the following:

Country: Select the country whose age distribution will be used for generating study entry ages. The current version supports only the USA.
Gender: Choose the biological sex (Male, Female or Both) for the cohort. The current version supports only Female.
Disease: Provide a Polygenic Score identifier from the PGS Catalog, such as PGS000004 or simply 4. The current version supports only Breast Cancer (PGS000004).
Age of Entry Range: Define the lower and upper bounds for the entry age of each profile.
Follow-up Interval: Set the minimum and maximum number of years each profile will be followed after study entry.

3. Cohort-Specific Settings

Depending on the selected cohort type, specify the appropriate parameters:

Prospective Cohort:
- Total Number of Profiles: Choose the number of profiles to be generated.
Retrospective Cohort:
- Total Number of Cases: Choose the number of cases to be generated.
- Number of Controls per Case: Choose the number of controls to match to each case. The current version accepts only whole numbers.

4. Generate Data

Click the Generate Data button to begin the synthetic data generation process. This may take a few moments depending on the size of the dataset.

5. View Statistics and Download Results

Once generation is complete, you can:

Inspect summary statistics:
- Observed vs Predicted Incidence Rate: Compares the incidence rate observed in real-world data with the rate estimated via Cox proportional hazards modeling.
- Kaplan-Meier Curve: A step-function plot used to estimate the survival function and visualize time-to-event data, showing the proportion of individuals remaining event-free over time.
Download the output in your preferred format:

Prospective Cohort: All generated profiles in CSV or VCF format.
Retrospective Cohort: Matched case-control pairs in CSV or VCF format.
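
To make the Kaplan-Meier curve listed under the summary statistics more concrete, here is a minimal sketch of the product-limit estimator in plain Python. It is an illustration only, not the code synthGWAS runs; the function name kaplan_meier and the input layout (one duration and one event flag per profile) are assumptions made for this example.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival)] at each distinct event time.

    durations: follow-up time of each profile (hypothetical input layout)
    events:    True if the disease occurred at that time, False if censored
    """
    event_counts = Counter(t for t, e in zip(durations, events) if e)
    exit_counts = Counter(durations)  # every profile leaves the risk set at its own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exit_counts):
        d = event_counts.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction that stays event-free.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exit_counts[t]
    return curve


curve = kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])
```

Each tuple in curve is one downward step of the plot: the curve stays flat between event times, and censored profiles only shrink the risk set.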

Methodology

Parameters:

Country: An API request is made to the World Bank to retrieve the list of countries with available demographic data.

PGS ID: An API request is made to the PGS Catalog. The corresponding score file is retrieved and fully parsed to extract variant identifiers, allele frequencies, and their associated effect sizes.
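
As a rough illustration of how a PGS identifier and the parsed score data might be used, the sketch below normalizes an identifier such as 4 to PGS000004 and computes a score for one synthetic profile. Both function names are hypothetical. Drawing each genotype as two independent alleles from the effect-allele frequency (Hardy-Weinberg equilibrium) is an assumption of this example; the guide does not state how synthGWAS samples genotypes.

```python
import random
import re


def normalize_pgs_id(value):
    """Turn 4, "4", or "PGS000004" into the canonical "PGS000004" form."""
    match = re.fullmatch(r"(?:PGS)?(\d{1,6})", str(value).strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a valid PGS identifier: {value!r}")
    return f"PGS{int(match.group(1)):06d}"


def sample_score(variants, rng):
    """Sample one profile's genotypes and return its polygenic score.

    variants: list of (effect_allele_frequency, effect_weight) pairs,
              a hypothetical layout for the parsed score file.
    """
    score = 0.0
    for frequency, weight in variants:
        # Assumption: two independent allele draws per variant (Hardy-Weinberg).
        dosage = (rng.random() < frequency) + (rng.random() < frequency)
        score += dosage * weight
    return score


rng = random.Random(42)  # fixed seed so the example is reproducible
pgs_id = normalize_pgs_id(4)
score = sample_score([(0.30, 0.12), (0.05, -0.08)], rng)
```

The score is the usual weighted sum of effect-allele dosages (0, 1 or 2 per variant); the frequencies and weights in the usage lines are made-up values, not entries from PGS000004.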

Age of Entry: An API request is made to the World Bank to retrieve the age distribution for the specified country. The distribution is separated by gender, and each age bin represents the percentage of the total population of that gender. This allows sampling within the selected age interval based on realistic demographic proportions.

Male and Female: For the selected age interval, the percentages of the age bins are normalized so that they sum to 100%. These normalized percentages are then multiplied by the total population of the selected gender to estimate the number of individuals in each age bin.
Both: The gender distribution percentages are used to split the total number of synthetic profiles between males and females. Then, for each gender, the age bin allocation follows the same methodology as described for a single gender.
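
The allocation described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function names, the dictionary layout of the age bins, and the largest-remainder rounding used to obtain whole counts are choices made for this example, and a bin that only partly overlaps the selected interval is included in full.

```python
def split_by_gender(total, female_percent):
    """Split a total between females and males using the female share in percent."""
    females = round(total * female_percent / 100)
    return {"Female": females, "Male": total - females}


def allocate_by_age_bin(bin_percentages, age_range, total):
    """Distribute `total` individuals across the age bins inside `age_range`.

    bin_percentages: {(low, high): percent of that gender's population}
    age_range:       (min_entry_age, max_entry_age), inclusive
    """
    low, high = age_range
    selected = {b: p for b, p in bin_percentages.items() if b[1] >= low and b[0] <= high}
    percent_sum = sum(selected.values())
    if percent_sum <= 0:
        raise ValueError("no population in the selected age range")
    # Renormalize so the selected bins sum to 100%, then scale to the total.
    raw = {b: p / percent_sum * total for b, p in selected.items()}
    counts = {b: int(r) for b, r in raw.items()}
    # Hand out the units lost to truncation, largest fractional part first.
    leftover = total - sum(counts.values())
    by_remainder = sorted(raw, key=lambda b: raw[b] - counts[b], reverse=True)
    for b in by_remainder[:leftover]:
        counts[b] += 1
    return counts


# Made-up percentages, for illustration only.
bins = {(40, 44): 6.0, (45, 49): 6.0, (50, 54): 12.0, (60, 64): 5.0}
counts = allocate_by_age_bin(bins, (40, 54), 100)
```

For the Both option, split_by_gender would be applied first and allocate_by_age_bin would then run once per gender with that gender's own percentages.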

Follow-up Period: For each synthetic profile, a follow-up duration is sampled uniformly from the specified interval.
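
A minimal sketch of the uniform follow-up sampling, assuming follow-up is measured in years and may take fractional values (the guide does not say whether durations are rounded to whole years). The function name sample_follow_up is hypothetical.

```python
import random


def sample_follow_up(min_years, max_years, rng):
    """Draw one follow-up duration uniformly from [min_years, max_years]."""
    if min_years > max_years:
        raise ValueError("minimum follow-up exceeds maximum follow-up")
    return rng.uniform(min_years, max_years)


rng = random.Random(7)  # fixed seed so the example is reproducible
follow_ups = [sample_follow_up(5, 10, rng) for _ in range(3)]
```

If whole years are wanted instead, rng.randint(min_years, max_years) would give a uniform draw over the integers in the same interval.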

Prospective Cohort Generation:

For a detailed explanation of how prospective cohorts are simulated using incidence rates, PGS data, and age-based risk models, please refer to Prospective Cohort Methodology.

Retrospective Cohort Generation:

In retrospective simulation, we begin by generating the specified number of cases: synthetic profiles that develop the disease within their follow-up period, according to their age, polygenic risk score, and the population-specific incidence rate.

Once the required number of cases is generated, each case is matched with one or more controls who did not develop the disease during the same follow-up interval. Matching is performed based on age at entry and gender to maintain comparability. Controls are drawn from the remaining pool of profiles.
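
The matching step can be sketched as below. This is an illustration, not the synthGWAS implementation: the profile layout (dictionaries with "id", "entry_age" and "gender" keys), the function name match_controls, and the use of exact equality on entry age are assumptions of this example. The pool is expected to contain only profiles that stayed disease-free during follow-up.

```python
from collections import defaultdict


def match_controls(cases, pool, controls_per_case):
    """Match each case to controls with the same entry age and gender.

    Controls are taken without replacement, so no profile serves two cases.
    """
    available = defaultdict(list)
    for profile in pool:
        available[(profile["entry_age"], profile["gender"])].append(profile)

    matches = {}
    for case in cases:
        candidates = available[(case["entry_age"], case["gender"])]
        if len(candidates) < controls_per_case:
            raise ValueError(f"not enough controls for case {case['id']}")
        # pop() removes the chosen control from the pool.
        matches[case["id"]] = [candidates.pop() for _ in range(controls_per_case)]
    return matches


cases = [{"id": "c1", "entry_age": 50, "gender": "Female"}]
pool = [
    {"id": "p1", "entry_age": 50, "gender": "Female"},
    {"id": "p2", "entry_age": 50, "gender": "Female"},
    {"id": "p3", "entry_age": 51, "gender": "Female"},
]
matched = match_controls(cases, pool, controls_per_case=2)
```

Here case c1 is matched to p1 and p2, while p3 is left in the pool because its entry age differs. A tolerance on entry age (for example, within a few years) would be a natural relaxation when exact matches run out.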