Getting Started Guide v1.0

1. Define Population-Specific Parameters

Begin by configuring the parameters of the cohort you want to simulate:

  • Country: Select the country whose age distribution will be used to generate study entry ages. The current version supports China (EAS), India (SAS), Portugal (EUR), Spain (EUR), United States of America (AMR), United States of America (EUR), and Zimbabwe (AFR).
  • Gender: Choose Male or Female.
  • Age of Entry Range: Set lower and upper bounds for entry age.

2. Define Cohort-Specific Parameters

Depending on the cohort type to simulate, specify the appropriate parameters:

  • Disease: Choose a disease by its Polygenic Score identifier from the PGS Catalog, such as PGS000004 (or simply 4). The current version supports Breast Cancer (PGS000004), Epithelial Ovarian Cancer (PGS003394), Kidney Cancer (PGS004908), Lung Cancer (PGS000740), and Prostate Cancer (PGS003765).
  • Total Number of Profiles:
    • Prospective Cohort: Number of profiles to generate.
    • Retrospective Cohort: Number of cases to generate.
  • Follow-up Interval: Set the minimum and maximum number of years each profile will be followed after study entry.
  • Create Retrospective Cohort: Toggle whether the simulated cohort should be Prospective or Retrospective.
  • Number of Controls per Case: This parameter appears when Create Retrospective Cohort is enabled. It specifies how many controls (an integer) are matched to each case; the number of cases is set by Total Number of Profiles.
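
The two numeric conventions above (a PGS ID given either in full or as a bare number, and a retrospective cohort sized by cases plus matched controls) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical and are not part of the platform, and it assumes the PGS Catalog convention of "PGS" followed by six digits.

```python
import re

def normalize_pgs_id(value):
    """Return the 'PGS' + six-digit form for inputs such as 4, '4' or 'pgs000004'."""
    match = re.fullmatch(r"(?:PGS)?(\d{1,6})", str(value).strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a PGS Catalog score ID: {value!r}")
    return f"PGS{int(match.group(1)):06d}"

def retrospective_cohort_size(cases, controls_per_case):
    """Total profiles in a retrospective cohort: each case plus its matched controls."""
    if not isinstance(controls_per_case, int) or controls_per_case < 1:
        raise ValueError("controls_per_case must be a positive integer")
    return cases * (1 + controls_per_case)

# Hypothetical helper names; inputs mirror the examples in the list above.
breast_cancer_id = normalize_pgs_id(4)
total_profiles = retrospective_cohort_size(cases=100, controls_per_case=4)
```

With these inputs, 100 cases and 4 controls per case give 500 profiles in total.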
3. Generate Data

    Click Generate Data to begin the synthetic data generation process. This may take a few moments depending on dataset size.

    1. The platform retrieves the SNP information required for the simulation.
    2. It generates 100,000 Polygenic Risk Scores (PRS) from the SNP data.
    3. It fits a Cox Proportional Hazards model to estimate disease incidence parameters, using real-life incidence rates as the target. Parameters are optimized (Nelder–Mead) to minimize the difference between predicted and observed cumulative incidence.
    4. Using the approximated parameters, synthetic profiles are generated (age and other attributes) aligned with the target incidence pattern.
    5. Finally, the system computes the predicted cumulative incidence curve, the observed cumulative incidence curve, and a Kaplan–Meier survival curve based on the synthetic cohort.
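
The calibration in step 3 can be sketched as below. This is a minimal sketch under stated assumptions, not the platform's implementation: it assumes a Weibull baseline cumulative hazard, a fixed PRS effect of 0.5 per standard deviation, standardized PRS values drawn from a normal distribution, and a target curve generated from known parameters rather than real incidence data. The Nelder–Mead routine is a small hand-written version so that the example needs only the standard library; in practice `scipy.optimize.minimize` with `method="Nelder-Mead"` is a common choice.

```python
import math
import random

def predicted_incidence(age, shape, scale, rel_hazards):
    """Population-average cumulative incidence by `age` under proportional hazards."""
    base = (age / scale) ** shape  # Weibull baseline cumulative hazard (assumed form)
    return sum(1.0 - math.exp(-base * r) for r in rel_hazards) / len(rel_hazards)

def make_loss(ages, observed, rel_hazards):
    def loss(params):
        # Optimizing on the log scale keeps shape and scale positive.
        shape, scale = math.exp(params[0]), math.exp(params[1])
        return sum((predicted_incidence(a, shape, scale, rel_hazards) - o) ** 2
                   for a, o in zip(ages, observed))
    return loss

def nelder_mead(f, start, step=0.3, iterations=500):
    """Basic Nelder-Mead (reflect, expand, contract, shrink); returns (value, point)."""
    n = len(start)
    points = [list(start)]
    for i in range(n):
        p = list(start)
        p[i] += step
        points.append(p)
    simplex = [(f(p), p) for p in points]
    for _ in range(iterations):
        simplex.sort(key=lambda vp: vp[0])
        worst_value, worst = simplex[-1]
        centroid = [sum(p[j] for _v, p in simplex[:-1]) / n for j in range(n)]

        def along(t):
            # Point on the line through the centroid, away from the worst vertex.
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        reflected = along(1.0)
        r_value = f(reflected)
        if r_value < simplex[0][0]:
            expanded = along(2.0)
            e_value = f(expanded)
            simplex[-1] = (e_value, expanded) if e_value < r_value else (r_value, reflected)
        elif r_value < simplex[-2][0]:
            simplex[-1] = (r_value, reflected)
        else:
            contracted = along(-0.5)
            c_value = f(contracted)
            if c_value < worst_value:
                simplex[-1] = (c_value, contracted)
            else:
                best = simplex[0][1]
                shrunk = [[b + 0.5 * (x - b) for b, x in zip(best, p)]
                          for _v, p in simplex[1:]]
                simplex = [simplex[0]] + [(f(p), p) for p in shrunk]
    return min(simplex, key=lambda vp: vp[0])

rng = random.Random(0)
# exp(beta * PRS) with beta = 0.5 and standard-normal PRS: both are assumptions.
rel_hazards = [math.exp(0.5 * rng.gauss(0.0, 1.0)) for _ in range(100)]
ages = list(range(30, 85, 5))
# Synthetic target built from known parameters, standing in for observed incidence.
observed = [predicted_incidence(a, 3.0, 120.0, rel_hazards) for a in ages]

loss = make_loss(ages, observed, rel_hazards)
best_value, best_point = nelder_mead(loss, [math.log(2.0), math.log(100.0)])
fitted_shape, fitted_scale = math.exp(best_point[0]), math.exp(best_point[1])
```

Because the target is generated from shape 3.0 and scale 120.0, the fitted values should land close to those numbers, which makes the sketch easy to sanity-check.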

4. View Statistics and Download Results

    After generation completes, you can:

    • Inspect summary statistics:
      • Observed vs. Predicted Incidence Rate: Compare real-world data with model estimates from the Cox PH model.
      • Kaplan–Meier Curve: A step-function survival estimate showing the proportion event-free over time.
    • Download outputs:
      • Prospective Cohort: All generated profiles (CSV or VCF).
      • Retrospective Cohort: Matched case–control pairs (CSV or VCF).
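
For readers who want to reproduce the Kaplan–Meier curve from a downloaded file, the estimator itself is short. The sketch below assumes each profile reduces to a follow-up duration and an event flag (1 if the disease occurred, 0 if censored); the column layout of the actual CSV is not specified here, so the inputs are plain lists.

```python
from itertools import groupby

def kaplan_meier(durations, events):
    """Return [(time, survival)] at each distinct time with at least one event."""
    records = sorted(zip(durations, events))
    at_risk = len(records)
    survival = 1.0
    curve = []
    for time, group in groupby(records, key=lambda r: r[0]):
        group = list(group)
        n_events = sum(1 for _t, e in group if e)
        if n_events:
            # Multiply by the fraction of at-risk profiles that stay event-free.
            survival *= 1.0 - n_events / at_risk
            curve.append((time, survival))
        at_risk -= len(group)  # events and censored profiles both leave the risk set
    return curve

# Toy input: five profiles, two of them censored (event flag 0).
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Censored profiles lower the number at risk without producing a step, which is why the curve only drops at times where an event occurred.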

Methodology

Parameters

    • Country: The list of countries is retrieved from the World Bank to determine which demographic data are available.
    • PGS ID: Score files are fetched from the PGS Catalog and parsed to extract variant IDs, allele frequencies, and effect sizes.
    • Age of Entry: Age distributions (by sex) for the selected country are retrieved from the World Bank. Each age bin gives the proportion of the population in that age range and is used to sample entry ages realistically.
      • Male / Female: Percentages across bins are normalized to 100% and scaled by the selected sex’s population.
      • Both: The total number of profiles is split by the country’s sex distribution, then age bins are allocated per sex as above.
    • Follow-up Period: For each profile, a duration is sampled uniformly from the specified interval.
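
The age-bin normalization and the uniform follow-up sampling described above can be sketched as follows. The bin percentages are made-up illustrative numbers, not World Bank data, and two details are assumptions of this sketch: rounding is resolved with a largest-remainder rule, and ages are drawn uniformly within each bin.

```python
import random

def allocate_by_bin(bin_percentages, total):
    """Normalize bin percentages and split `total` profiles across the bins."""
    weight_sum = sum(bin_percentages.values())
    exact = {b: total * p / weight_sum for b, p in bin_percentages.items()}
    counts = {b: int(x) for b, x in exact.items()}
    leftover = total - sum(counts.values())
    # Largest-remainder rounding (an assumption) so the counts sum to `total`.
    for b in sorted(exact, key=lambda b: exact[b] - counts[b], reverse=True)[:leftover]:
        counts[b] += 1
    return counts

def sample_profiles(bin_percentages, total, follow_up_range, rng):
    """Return (entry_age, follow_up_years) pairs for `total` profiles."""
    profiles = []
    for (low, high), count in allocate_by_bin(bin_percentages, total).items():
        for _ in range(count):
            entry_age = rng.randint(low, high)         # uniform within the bin (assumption)
            follow_up = rng.uniform(*follow_up_range)  # uniform over the chosen interval
            profiles.append((entry_age, follow_up))
    return profiles

# Illustrative percentages for the bins inside a 40-54 entry range; they need not sum to 100.
bins = {(40, 44): 30.0, (45, 49): 25.0, (50, 54): 20.0}
counts = allocate_by_bin(bins, 10)
profiles = sample_profiles(bins, 10, (5.0, 10.0), random.Random(1))
```

Restricting the bins to the selected entry range before normalizing is what keeps the sampled ages inside the lower and upper bounds.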

Prospective Cohort Generation

    For a detailed explanation of how prospective cohorts are simulated using incidence rates, PGS data, and age-based risk models, see Prospective Cohort Methodology.
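
As a rough illustration of the PGS-data component, a polygenic score can be simulated from the allele frequencies and effect sizes of a score file. The sketch below is not the platform's method: it assumes Hardy–Weinberg equilibrium, independence between variants (no linkage disequilibrium), and a hypothetical three-variant score with invented frequencies and weights.

```python
import random

def simulate_prs(variants, n_profiles, rng):
    """Simulate scores; `variants` is a list of (effect_allele_frequency, effect_weight)."""
    scores = []
    for _ in range(n_profiles):
        score = 0.0
        for freq, weight in variants:
            # Two independent allele draws give a dosage of 0, 1 or 2 (Hardy-Weinberg).
            dosage = (rng.random() < freq) + (rng.random() < freq)
            score += weight * dosage
        scores.append(score)
    return scores

# Hypothetical variants, for illustration only.
variants = [(0.2, 0.10), (0.5, -0.05), (0.1, 0.30)]
scores = simulate_prs(variants, 20000, random.Random(42))
mean_score = sum(scores) / len(scores)
expected_mean = sum(2 * f * w for f, w in variants)
```

Under these assumptions the expected score is the sum of 2 × frequency × weight over the variants, so the simulated mean should sit close to `expected_mean`.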

Retrospective Cohort Generation

    In retrospective simulation, the system first generates the requested number of cases—profiles that develop the disease within follow-up according to age, PRS, and population incidence.

    Each case is then matched with one or more controls who did not develop the disease during the same follow-up interval. Matching is based on age at entry and sex; controls are drawn from the remaining pool.
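
The matching step can be sketched as below. The source states only that matching uses age at entry and sex; exact matching on both fields, drawing controls without replacement, and the dictionary keys `sex`, `entry_age`, and `event` are assumptions of this sketch.

```python
import random
from collections import defaultdict

def match_controls(cases, pool, controls_per_case, rng):
    """Pair each case with controls of the same sex and entry age (exact match assumed)."""
    by_key = defaultdict(list)
    for person in pool:
        if not person["event"]:  # only profiles that stayed disease-free can be controls
            by_key[(person["sex"], person["entry_age"])].append(person)
    for candidates in by_key.values():
        rng.shuffle(candidates)
    matched = []
    for case in cases:
        candidates = by_key[(case["sex"], case["entry_age"])]
        if len(candidates) < controls_per_case:
            raise ValueError(f"not enough controls for case {case['id']}")
        # pop() removes each chosen control, so no control is reused.
        controls = [candidates.pop() for _ in range(controls_per_case)]
        matched.append((case, controls))
    return matched

# Toy data with hypothetical field names.
cases = [{"id": 1, "sex": "F", "entry_age": 50, "event": True}]
pool = [
    {"id": 2, "sex": "F", "entry_age": 50, "event": False},
    {"id": 3, "sex": "F", "entry_age": 50, "event": False},
    {"id": 4, "sex": "M", "entry_age": 50, "event": False},
    {"id": 5, "sex": "F", "entry_age": 50, "event": True},
]
matched = match_controls(cases, pool, 2, random.Random(0))
```

A looser rule, such as matching within an age band, would only change how the lookup key is built.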