Data Science and Engineering @ DCEG

Ask us not what Data Science can do for you,
but what you can do for the Science of your Data.
We’ll help :-)

Mission statement

To advance Data Science and Engineering for Precision Epidemiology through the development of Computational Commons.

Goals

The main goal of the Data Science Group is to accelerate the investigation of epidemiologic and genetic causes of cancer, and to advance Cloud Computing infrastructure for Precision Prevention. These two aims are pursued as a multidisciplinary research program that combines systems biology, computational statistics, artificial intelligence, and software engineering for biomedical applications.

Training

Outreach through education and the development of trans-disciplinary human resources is the third aim of the Data Science Group, pursued through weekly Cloud4Bio Hackathons at NCI’s Shady Grove campus.

EpiSphere

The evolution of the Web towards a global data space is creating new opportunities for cancer prevention and for understanding cancer etiology. This technology development is particularly well suited to Epidemiology research, which is challenged by a widening diversity of data types and by increasingly sensitive governance of data sources. The data types now range from digital pathology to wearable devices, while their governance needs to traverse environments stretching from federal and state sponsored reference data sources to consumer-facing cloud-hosted services. EpiSphere is therefore conceived as an epidemiology approach to the NIH datacommons initiative, with the goal of advancing interoperable data ecosystems in a manner that is driven by specific data-intensive projects at DCEG. Specifically, this practical focus drives the development of Data Science as computational infrastructure, enabled by scalable Cloud Computing and Artificial Intelligence (AI) made available by the NIH STRIDES initiative. As such, EpiSphere is an umbrella computational epidemiology framework that is informed, and validated, by the data science project infrastructure it develops.

People

Jonas Almeida, PhD - Senior Investigator, Chief Data Scientist.
Daniel Russ, PhD - Staff Scientist.
Jeya Balasubramanian, PhD - Postdoctoral Fellow.
Praful Bhawsar, MS - Data Engineer, AI for Computational Pathology.
Bhaumik Patel, MS - Software Engineer, Data Systems Lead Engineer.
Abhinav Jonnada - Software Engineer, Development and Operations.
Brian Shen - Development and Operations, Application Engineer.
Gene Barra - DevOps.
Lee Mason - PhD student.
Lorena Sandoval - PhD student.
We’re hiring! Postdoctoral Fellowship positions are open: careers.iscb.org/jobs/view/6543; analyst positions are open as well. If you are looking for internship positions, we have a challenge for you.

Projects we’re involved in

EpiSphere - Web tools to operate Cancer Epidemiology Commons.
FeatureScape - Interactive representation and analysis of feature landscapes.
Serverless OpenHealth - live demo at bit.ly/loadsparcs.
Connect for Cancer Prevention Study - a next-generation cohort study design that interoperates with integrated Health Care Systems (~200,000 participants).
Confluence - a research resource to uncover breast cancer genetics through genome-wide association studies (GWAS). The resource will include at least 300,000 breast cancer cases.
Digital Pathology - see for example
mortalityTracker - Web-based aggregation of CDC data services on causes of death, collated with real-time data on the ongoing COVID-19 pandemic.
epiTracker - Seeking to generalize interactive tracking of epidemiological data to create the next generation of Cancer Maps.
epiVerse - Upcoming, webSDKs

More on the translational projects:

epiSphere

Portable Data Science Applications for Cancer Precision Prevention. For open positions, see also pdf. Prospective internship candidates are typically challenged with a test project, which is then discussed in the selection interview.

Connect

Cancer Precision Prevention places an increasing focus on data-intensive platforms that can reach participants, and engage them, through consumer-facing digital applications. Ultimately, the emergence of a Learning Health Care System is orchestrated by computational systems that coordinate both medical records and consumer-facing services, from wearable sensors to genomics. A new generation of cohort studies, such as NCI/DCEG Connect, is being designed accordingly.

Confluence

BigData designates the computational aggregation of large volumes of diverse data and diverse analytical environments in order to enable comprehensive integrative analysis. Even more than the logistic challenges, BigData typically has to navigate complex governance and compliance landscapes, which can only be accomplished in Cloud Computing environments. Confluence is an international initiative aggregating data on 300k controls and 300k breast cancer cases.

FAIR Data Platform

The data platform developed for Confluence is being abstracted into a distributed FAIR data platform for cohort studies.

Data Commons

Identifying novel algorithms and designing Web Applications backed by Cloud-hosted APIs is the ubiquitous technology stack. EpiSphere seeks to integrate a multitude of health data streams, generated and consumed in real time, with the goal of contextualizing individual observations against reference BigData. This process defines the API ecosystems of Epidemiology Data Commons.

Digital Pathology (patterns)

epiPath, imageBox, Active Learning (in press)

Time series

Mortality tracker - J. Bioinformatics PMID:33135727

Wearables

MutationSignature (bioinformatics)

under development

Code

Open-source code repositories at github.com/episphere.

Who, Where

EpiSphere is a software engineering research project of the Data Science Group at the Division of Cancer Epidemiology and Genetics (DCEG) of the National Cancer Institute (NIH.NCI).