Data Cleaning and Curating for R&D

Case Study

Pfizer is one of the world’s largest research and development biopharmaceutical companies, striving to set the standard for quality, safety and value in the development and manufacture of health care products. Pfizer’s global portfolio includes medicines and vaccines as well as many consumer health care products. Pfizer colleagues work together across a wide range of medical disciplines to advance wellness, prevention, treatments and cures, bringing people therapies that extend and significantly improve their lives.

The Challenge

Pfizer uses model-based meta-analysis (MBMA) as one of its strategies for decision making in drug development. In many disease indications, there is an immense amount of publicly available information on approved drugs as well as those currently in development. Accumulating and integrating this clinical trial data is very useful for understanding the disease and its treatment options. Quantifying the efficacy and safety of the available treatments, and then benchmarking an investigational drug against them, provides invaluable information for critical decisions by development teams.

Since 2006, Pfizer has been developing literature databases covering more than 70 disease indications. It uses a third-party provider to extract data from publicly available information sources (e.g. journals, conference proceedings) and digitize it into standardized datasets. Pfizer has developed a standardized process for literature search, review, data capture and archiving into its central repository, which ensures a high-quality and consistent format. However, most of the data cleaning and curation is still manual: the analyst needs to check the data, go back to the original paper if necessary, and make the final data selections so that the dataset is ready for analysis. As such, cleaning and curation can be both difficult and time consuming. Pfizer required a user-friendly application, easily modifiable by the end user, to assist with this process.

The Solution

The key aim of the application was to provide users with an easy-to-use visualization tool for data cleaning and curation. Mango worked with Pfizer to understand and prioritise the features of the application, and delivered many of these features, along with several beyond the original requirements, at the proof-of-concept stage using Shiny, the R framework for web applications.

The newly designed interface allows users to quickly view the plots by paper, switch the endpoint for the plots, or create multi-panel plots by selecting x- and y-variables. It also makes it possible to select (or de-select) columns in just a few clicks, and the encoded column names are matched up with column descriptions for easier selection. For selecting data points, users have even more functionality, all based around an interactive graphic and table powered by Shiny. For example, users can select rows by simply clicking a point on the graphic, with zoom and pan features making it easy to distinguish between points. Alternatively, they can scroll through the table and select a row to remove it from the data. The tool was implemented in R using Shiny, which means the code is easy for end users to modify.
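The selection logic described above can be illustrated with a minimal sketch. This is not the code of the Shiny tool, which was written in R; it is a language-agnostic illustration in Python, and the function names (`curate`, `describe_columns`) and the example column codes are hypothetical.

```python
from typing import Dict, List, Set

# One extracted data point: encoded column name -> value.
Row = Dict[str, object]


def describe_columns(columns: List[str], descriptions: Dict[str, str]) -> List[str]:
    """Pair each encoded column name with its description for easier selection."""
    return [f"{col}: {descriptions.get(col, '(no description)')}" for col in columns]


def curate(rows: List[Row], excluded_rows: Set[int], selected_columns: List[str]) -> List[Row]:
    """Return analysis-ready rows.

    Rows whose index is in excluded_rows (de-selected by the user in the
    plot or the table) are dropped, and only the selected columns are kept.
    """
    return [
        {col: row[col] for col in selected_columns}
        for index, row in enumerate(rows)
        if index not in excluded_rows
    ]


# Hypothetical example data; column codes and values are illustrative only.
rows = [
    {"PAPER": "A", "WK": 12, "EFF": 0.41, "NOTE": "ok"},
    {"PAPER": "A", "WK": 24, "EFF": 9.99, "NOTE": "suspect value"},
    {"PAPER": "B", "WK": 12, "EFF": 0.38, "NOTE": "ok"},
]
descriptions = {"PAPER": "Source publication", "WK": "Study week", "EFF": "Efficacy endpoint"}

labels = describe_columns(["PAPER", "WK", "EFF", "NOTE"], descriptions)
curated = curate(rows, excluded_rows={1}, selected_columns=["PAPER", "WK", "EFF"])
```

Here the second row is excluded, as it would be after the user clicks a suspect point on the plot, and the `NOTE` column is de-selected before export.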

Business Benefits

The Shiny MBMA tool created by Mango has made data cleaning and curation quicker, easier and more efficient for analysts at Pfizer. Everything is done within a single application: after uploading the data into the Shiny app, the analyst can view the plots, clean the data, and select the columns using the interactive interface. The analysis-ready data can then be exported. Further analysis stages can easily be added to the application as additional screens whenever required.

The Shiny application saves considerable time and effort in the MBMA process, which matters because a quantitative understanding of the disease is critical for decision making in drug development at Pfizer.

‘The data cleaning and curation for MBMA was time-consuming and cumbersome because we needed to go back to the original paper to make sure that the data was correctly captured, and we need to understand what exactly was captured in the dataset out of more than 200 columns. This app makes it so much easier to create plots, check the data, and select/de-select the data points or columns we need for analysis. The app also has a one-click button to create a BibTeX file (bibliography for the dataset), which saves us a day of tedious work.’