DCEG Data Science Group

Division of Cancer Epidemiology and Genetics (DCEG)
National Institutes of Health (NIH) / National Cancer Institute (NCI)

Established in 2019 with the recruitment of DCEG’s inaugural Chief Data Scientist, Jonas Almeida, the Data Science Group seeks to advance research and infrastructure for data-intensive Precision Prevention studies.

Mission statement

To advance Data Science and Engineering for Precision Epidemiology through the development of Computational Commons.


The main goals of the Data Science Group are to accelerate the investigation of epidemiologic and genetic causes of cancer, and to advance Cloud Computing infrastructure for Precision Prevention. These two aims are pursued as a multidisciplinary research program that combines systems biology, computational statistics, artificial intelligence, and software engineering for biomedical applications.


Outreach through Education and the development of trans-disciplinary human resources is the third aim of the Data Science Group, and is pursued through weekly Cloud4Bio Hackathons at NCI’s Shady Grove campus.


The evolution of the Web towards a global data space is opening entirely new opportunities for cancer prevention and for understanding cancer etiology. This technology development is particularly well suited to Epidemiology research, which is challenged by a widening diversity of data types and increasingly sensitive governance of data sources. The former ranges from digital pathology to wearable devices, while the latter stretches from federal- and state-sponsored reference data sources to consumer-facing cloud-hosted services. EpiSphere is therefore conceived as an epidemiology approach to the broader NIH datacommons initiative to advance interoperable data ecosystems, in a manner that is driven by specific data-intensive projects at DCEG. This practical focus drives the development of Data Science as computational infrastructure, enabled by scalable Cloud Computing and Artificial Intelligence (AI) made available by the NIH STRIDES initiative. In a nutshell, EpiSphere is an umbrella computational epidemiology initiative developed as infrastructure for data science projects.


Projects we’re involved in