Getting Started Guide v1.0
1. Define Population-Specific Parameters
Begin by configuring the parameters of the cohort you want to simulate:
- Country: Select the country whose age distribution will be used to generate study entry ages. The current version supports China (EAS), India (SAS), Portugal (EUR), Spain (EUR), United States of America (AMR), United States of America (EUR), and Zimbabwe (AFR).
- Gender: Choose Male or Female.
- Age of Entry Range: Set lower and upper bounds for entry age.
2. Define Cohort-Specific Parameters
Depending on the cohort type to simulate, specify the appropriate parameters:
- PGS ID: Enter the PGS Catalog score ID, for example PGS000004 (or simply 4). The current version supports Breast Cancer (PGS000004), Epithelial Ovarian Cancer (PGS003394), Kidney Cancer (PGS004908), Lung Cancer (PGS000740), and Prostate Cancer (PGS003765).
- Prospective Cohort
  - Total Number of Profiles: Number of profiles to generate.
- Retrospective Cohort
  - Total Number of Profiles: Number of cases to generate.
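The guide notes that a score can be given either as a full ID (PGS000004) or as a bare number (4). A minimal sketch of how such input could be normalized is shown here. The function name `normalize_pgs_id` is hypothetical, and the tolerance for lowercase prefixes and surrounding whitespace is an assumption, not documented platform behavior.

```python
def normalize_pgs_id(raw: str) -> str:
    """Return a full PGS Catalog ID such as 'PGS000004' from '4' or 'PGS000004'."""
    text = raw.strip().upper()
    if text.startswith("PGS"):
        text = text[3:]  # drop the prefix, keep the numeric part
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a PGS ID: {raw!r}")
    number = int(text)
    if number == 0 or number > 999_999:
        raise ValueError(f"PGS number out of range: {raw!r}")
    # PGS Catalog IDs are 'PGS' followed by six zero-padded digits.
    return f"PGS{number:06d}"


print(normalize_pgs_id("4"))
print(normalize_pgs_id("PGS003394"))
```

Normalizing once at the input boundary means the rest of a pipeline only ever has to handle the full six-digit form.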
3. Generate Data
Click Generate Data to begin the synthetic data generation process. This may take a few moments depending on dataset size.
- The platform retrieves the SNP information required for the simulation.
- It generates 100,000 Polygenic Risk Scores (PRS) from the SNP data.
- It fits a Cox Proportional Hazards model to estimate disease incidence parameters, using real-life incidence rates as the target. Parameters are optimized (Nelder–Mead) to minimize the difference between predicted and observed cumulative incidence.
- Using the fitted parameters, it generates synthetic profiles (age and other attributes) that follow the target incidence pattern.
- Finally, the system computes the predicted cumulative incidence curve, the observed cumulative incidence curve, and a Kaplan–Meier survival curve based on the synthetic cohort.
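The PRS step above can be illustrated with a small sketch. It assumes that each score is the sum of effect size times effect-allele count, that allele counts are drawn independently per SNP under Hardy-Weinberg equilibrium from the allele frequency, and that the SNP table is the made-up one below. The platform's actual sampling scheme is not described in this guide, so treat this as one plausible reading rather than its implementation.

```python
import random

# Hypothetical SNP table: (variant ID, effect-allele frequency, effect size).
SNPS = [
    ("rs0001", 0.12, 0.08),
    ("rs0002", 0.45, -0.03),
    ("rs0003", 0.30, 0.15),
]


def sample_dosage(freq, rng):
    """Draw an effect-allele count (0, 1 or 2), assuming Hardy-Weinberg equilibrium."""
    return (rng.random() < freq) + (rng.random() < freq)


def simulate_prs(snps, n_profiles, seed=0):
    """Return one score per profile: sum of effect size times sampled allele count."""
    rng = random.Random(seed)
    return [
        sum(beta * sample_dosage(freq, rng) for _, freq, beta in snps)
        for _ in range(n_profiles)
    ]


scores = simulate_prs(SNPS, 100_000)
mean_score = sum(scores) / len(scores)
# Under the assumptions above, the expected score is sum(2 * freq * beta).
expected_mean = sum(2 * freq * beta for _, freq, beta in SNPS)
print(f"simulated mean {mean_score:.4f}, expected {expected_mean:.4f}")
```

Comparing the simulated mean with its closed-form expectation is a cheap sanity check before the scores are passed on to any incidence model.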
4. View Statistics and Download Results
After generation completes, you can:
- Inspect summary statistics:
  - Observed vs. Predicted Incidence Rate: Compare real-world data with model estimates from the Cox PH model.
  - Kaplan–Meier Curve: A step-function survival estimate showing the proportion event-free over time.
- Download outputs:
  - Prospective Cohort: All generated profiles (CSV or VCF).
  - Retrospective Cohort: Matched case–control pairs (CSV or VCF).
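To make the Kaplan–Meier curve mentioned above concrete, here is a minimal sketch of the standard product-limit estimator in plain Python. The function name and the input layout (one follow-up duration and one event flag per profile) are assumptions for illustration; the platform's own implementation and the column names in its CSV output are not specified here.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival)] with one step at each distinct event time.

    durations: follow-up time per profile.
    events: 1 if the disease occurred at that time, 0 if the profile was censored.
    """
    records = sorted(zip(durations, events))
    at_risk = len(records)
    survival = 1.0
    curve = []
    i = 0
    while i < len(records):
        t = records[i][0]
        n_events = 0
        n_leaving = 0
        # Group all profiles that leave the risk set at the same time.
        while i < len(records) and records[i][0] == t:
            n_events += records[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve


for time, surv in kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0]):
    print(f"t={time}: S(t)={surv:.2f}")
```

Censored profiles never produce a step in the curve, but they still reduce the number at risk for later event times, which is why the drops get larger as follow-up progresses.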
Methodology
Parameters
- Country: Countries are retrieved from the World Bank to determine available demographic data.
- PGS ID: Score files are fetched from the PGS Catalog and parsed to extract variant IDs, allele frequencies, and effect sizes.
- Age of Entry: Age distributions (by sex) for the selected country are retrieved from the World Bank. Each age bin gives the proportion of the population in that age range and is used to weight the sampling of entry ages.
  - Male / Female: Percentages across bins are normalized to 100% and scaled by the selected sex’s population.
  - Both: The total number of profiles is split by the country’s sex distribution, then age bins are allocated per sex as above.
- Follow-up Period: For each profile, a duration is sampled uniformly from the specified interval.
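The age and follow-up sampling described above could look roughly like the following sketch. The bin shares are invented numbers, not World Bank data. The five-year bin layout, the proportional clipping of bins at the requested entry-age range, and the uniform draw of an integer age within a bin are all assumptions made for illustration.

```python
import random

# Hypothetical age bins: (lower age, upper age, share of population in percent).
AGE_BINS = [(40, 44, 6.1), (45, 49, 5.8), (50, 54, 5.2), (55, 59, 4.4), (60, 64, 3.9)]


def sample_entry_ages(bins, age_min, age_max, n, rng):
    """Sample n entry ages within [age_min, age_max], weighted by bin share."""
    clipped = []
    for lo, hi, share in bins:
        lo_c, hi_c = max(lo, age_min), min(hi, age_max)
        if lo_c <= hi_c:
            # Keep only the fraction of the bin that lies inside the range.
            kept = (hi_c - lo_c + 1) / (hi - lo + 1)
            clipped.append((lo_c, hi_c, share * kept))
    if not clipped:
        raise ValueError("no age bin overlaps the requested range")
    # random.choices rescales the weights, which is the "normalize to 100%" step.
    chosen = rng.choices(clipped, weights=[s for _, _, s in clipped], k=n)
    return [rng.randint(lo, hi) for lo, hi, _ in chosen]


def sample_follow_up(min_years, max_years, n, rng):
    """Sample one follow-up duration per profile, uniformly from the interval."""
    return [rng.uniform(min_years, max_years) for _ in range(n)]


rng = random.Random(42)
ages = sample_entry_ages(AGE_BINS, 45, 62, 1000, rng)
follow_up = sample_follow_up(5.0, 10.0, 1000, rng)
print(min(ages), max(ages), round(min(follow_up), 2), round(max(follow_up), 2))
```

For the "Both" option, the same function would be called once per sex after splitting the total number of profiles by the country's sex distribution.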
Prospective Cohort Generation
For a detailed explanation of how prospective cohorts are simulated using incidence rates, PGS data, and age-based risk models, see Prospective Cohort Methodology.
Retrospective Cohort Generation
In retrospective simulation, the system first generates the requested number of cases—profiles that develop the disease within follow-up according to age, PRS, and population incidence.
Each case is then matched with one or more controls who did not develop the disease during the same follow-up interval. Matching is based on age at entry and sex; controls are drawn from the remaining pool.
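The matching step can be sketched as follows. This version assumes exact matching on sex and integer age at entry, and draws controls without replacement from the pool of disease-free profiles. The record fields (`id`, `sex`, `entry_age`) and the function name are hypothetical, and the platform may instead use age bands or a different tie-breaking rule.

```python
import random
from collections import defaultdict


def match_controls(cases, pool, controls_per_case=1, seed=0):
    """Map each case ID to controls sharing its sex and entry age."""
    rng = random.Random(seed)
    available = defaultdict(list)
    for person in pool:
        available[(person["sex"], person["entry_age"])].append(person)
    for group in available.values():
        rng.shuffle(group)  # randomize which controls are picked first
    matches = {}
    for case in cases:
        group = available[(case["sex"], case["entry_age"])]
        if len(group) < controls_per_case:
            raise ValueError(f"not enough controls for case {case['id']}")
        # pop() removes the control, so it cannot be matched twice.
        matches[case["id"]] = [group.pop() for _ in range(controls_per_case)]
    return matches


cases = [
    {"id": "case1", "sex": "F", "entry_age": 52},
    {"id": "case2", "sex": "F", "entry_age": 52},
]
pool = [
    {"id": "ctl1", "sex": "F", "entry_age": 52},
    {"id": "ctl2", "sex": "F", "entry_age": 52},
    {"id": "ctl3", "sex": "F", "entry_age": 52},
    {"id": "ctl4", "sex": "M", "entry_age": 52},
]
matches = match_controls(cases, pool, controls_per_case=1)
for case_id, controls in matches.items():
    print(case_id, [c["id"] for c in controls])
```

Raising an error when a stratum runs out of controls keeps the shortfall visible; a real pipeline might instead generate more disease-free profiles and retry.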