Taming the Many Barriers to Data Use and Reuse

4 minute read


Title: Taming the Many Barriers to Data Use and Reuse

Date: 2020-08-27

Hosted by: Department of Biomedical Informatics (DBMI), University of Pittsburgh

Speaker: Melissa Haendel

Summary and thoughts:

This talk was part of the 2020 annual DBMI retreat (virtual this year). Melissa Haendel is one of the key members of the MONARCH initiative, which I discovered through our work on knowledge graphs this summer. The talk, however, was not about the initiative itself, but about the broader issue of data use and reuse in the health sciences, using examples from various projects (MONARCH, N3C - more on this below). Dr. Haendel prefaced the talk by describing the barrier (aka the ‘chasm of semantic despair’) that exists between basic science data and clinical data in biomedical informatics. Major interoperability issues are an obstacle for data users and re-users in both the basic sciences and clinical research. In recent years, the FAIR data principles have gained traction in the health informatics field. FAIR stands for ‘Findable, Accessible, Interoperable and Reusable’ and is a set of guiding principles aimed at improving the quality and use of data. Dr. Haendel argues that these principles need some more TLC to truly guide our data use, namely an extension with 3 more principles - Traceable, Licensed, and Connected.

Traceable data carries information about its provenance, attribution and evidence. We don’t just know the source of the data but also where the data, and the specific entities in it, come from. We are further able to attribute contributions to the people involved in data generation and dataset creation. The license of the data should be clear and accessible - a study assessing the licensing of various data sources shows that fewer than half (~43%) of the licenses satisfy basic licensing requirements. Unclear licenses are a major barrier to the interoperability of data, and it is not always possible for a project to employ a legal department to handle all the licensing dependencies. A ‘connected’ dataset uses well-formed and persistent identifiers and relies heavily on identifier hygiene. More details on these studies can be found on the following websites: National Center for Data to Health and NIH National Center for Advancing Translational Science.
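To make ‘identifier hygiene’ a little more concrete, here is a minimal sketch of what a check for well-formed identifiers could look like. This is my own illustration, not something shown in the talk: it assumes identifiers are written in the compact `prefix:local_id` form, and both the `KNOWN_PREFIXES` set and the `check_identifier` function are hypothetical names made up for this example.

```python
import re

# Hypothetical prefix list for illustration only; a real project would
# rely on a curated, shared prefix registry instead.
KNOWN_PREFIXES = {"HP", "MONDO", "DOID"}

# Assumed shape of a well-formed identifier: prefix, colon, local id.
CURIE_PATTERN = re.compile(
    r"^(?P<prefix>[A-Za-z][A-Za-z0-9_.]*):(?P<local_id>\S+)$"
)


def check_identifier(identifier):
    """Return a list of hygiene problems; an empty list means none were found."""
    problems = []

    # Stray whitespace is a common source of silently failed joins.
    if identifier != identifier.strip():
        problems.append("leading or trailing whitespace")

    match = CURIE_PATTERN.match(identifier.strip())
    if match is None:
        problems.append("not in prefix:local_id form")
        return problems

    # Prefixes are compared case-sensitively, so "hp" and "HP" differ.
    if match.group("prefix") not in KNOWN_PREFIXES:
        problems.append("unknown prefix " + repr(match.group("prefix")))

    return problems


if __name__ == "__main__":
    for candidate in ["HP:0001250", "hp:0001250", "0001250", " MONDO:0005015"]:
        print(repr(candidate), check_identifier(candidate))
```

A check like this only covers the ‘well-formed’ half of the principle. Persistence (whether an identifier still resolves to the same thing years later) cannot be verified from the string alone.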

The second part of the talk focused on the National Covid Cohort Collaborative (N3C) effort by the National Center for Data to Health. This is a cloud-based initiative that accepts data in multiple standard biomedical data models (such as OMOP, ACT, PCORnet) and transforms them into a common analytic model for research. This is particularly relevant in the current pandemic, where we need a centralized dataset of COVID patients to enable generalizable, longitudinal research. It is the largest and fastest such effort of the last few months: it takes data from systems in around 47 states in the United States and develops tools that allow researchers to perform their own analyses on the data. It was wonderful to hear first-hand about the experience of developing this dataset, given my own part in translating EHR data to the OMOP model over the last few months. Dr. Haendel is quite optimistic about the value of this cohort collaborative dataset for a variety of COVID research using real-world data. One of the follow-up questions made a great point about the differences in data arising from different sites across the country and the need to adjust for these differences in analyses and inform users about them. The example of the N3C data (as well as other examples from MONARCH) tied back well to the overall theme of data usability and reusability, along with the access principle of “Share widely and wisely”.
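The idea of accepting several source data models and mapping them into one common model can be sketched in a few lines. This is a toy illustration under my own assumptions, not the actual N3C pipeline: the source models are deliberately named `model_a` and `model_b`, and every field name below is a simplified, hypothetical stand-in rather than a real OMOP, ACT, or PCORnet column.

```python
# Toy harmonization sketch. All models, fields, and formats here are
# hypothetical placeholders, not real schemas.


def from_model_a(record):
    """Model A (hypothetical) already stores the birth year as an integer."""
    return {
        "patient_id": str(record["person_id"]),
        "birth_year": record["year_of_birth"],
        "site": record["site"],
    }


def from_model_b(record):
    """Model B (hypothetical) stores the birth date as a 'YYYY-MM-DD' string."""
    return {
        "patient_id": str(record["PATID"]),
        "birth_year": int(record["BIRTH_DATE"][:4]),
        "site": record["site"],
    }


# One adapter per accepted source model.
ADAPTERS = {"model_a": from_model_a, "model_b": from_model_b}


def harmonize(source_model, record):
    """Map a source record into the common shape, keeping its origin."""
    if source_model not in ADAPTERS:
        raise ValueError("unsupported source model: " + source_model)
    common = ADAPTERS[source_model](record)
    # Keep the source model alongside the site so that analysts can
    # adjust for differences between contributors later on.
    common["source_model"] = source_model
    return common


if __name__ == "__main__":
    print(harmonize("model_a", {"person_id": 1, "year_of_birth": 1980, "site": "site_1"}))
    print(harmonize("model_b", {"PATID": "x9", "BIRTH_DATE": "1975-03-02", "site": "site_2"}))
```

Carrying `site` and `source_model` through to the harmonized record is one simple way to address the concern raised in the questions: the differences between sites stay visible to whoever analyzes the data.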

Other notes:

Dr. Haendel encouraged people to post artifacts (code snippets, mappings) on Zenodo/GitHub and cite them in addition to publications. Often, there is great value in these artifacts and they deserve to be recognized.