2. PHARMACOVIGILANCE RESEARCH:
We have developed novel methods to account for errors in electronic health record (EHR) data. In our paper presented at the AMIA 2016 Annual Symposium, we used extensive simulation studies guided by eMERGE (Electronic Medical Records and Genomics) data to quantify the loss of power due to different levels of phenotyping errors in association studies (Duan et al., 2016, AMIA Annual Symposium Proceedings). This paper won first prize in the 2016 "Best of Student Papers in Knowledge Discovery and Data Mining (KDDM)" Awards at AMIA 2016.
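To illustrate the kind of simulation involved, the following is a minimal sketch and not the design used in the cited paper. It assumes a binary exposure, a binary phenotype recorded with nondifferential misclassification, and a two-proportion z-test; the function name `simulate_power` and all parameter values (exposure frequency, risks, sample size) are hypothetical.

```python
import math
import random


def simulate_power(n, sens, spec, n_rep=500, seed=0):
    """Estimate the power of a two-proportion z-test when the binary
    phenotype is observed with the given sensitivity and specificity."""
    rng = random.Random(seed)
    # Hypothetical data-generating values, chosen only for illustration.
    p_exposed, risk_unexposed, risk_exposed = 0.30, 0.10, 0.16
    z_crit = 1.959964  # two-sided test at the 5% level
    rejections = 0
    for _ in range(n_rep):
        # counts[g] = [number of subjects, number of observed cases]
        counts = [[0, 0], [0, 0]]
        for _ in range(n):
            g = 1 if rng.random() < p_exposed else 0
            risk = risk_exposed if g else risk_unexposed
            y_true = rng.random() < risk
            # Nondifferential error: the error rates do not depend on g.
            if y_true:
                y_obs = rng.random() < sens   # true case kept as a case
            else:
                y_obs = rng.random() >= spec  # control flagged as a case
            counts[g][0] += 1
            counts[g][1] += int(y_obs)
        (n0, c0), (n1, c1) = counts
        if n0 == 0 or n1 == 0:
            continue
        pooled = (c0 + c1) / (n0 + n1)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n0 + 1 / n1))
        if se > 0 and abs(c1 / n1 - c0 / n0) / se > z_crit:
            rejections += 1
    return rejections / n_rep


if __name__ == "__main__":
    print("power, error-free phenotype:", simulate_power(1000, 1.0, 1.0))
    print("power, sens=0.6, spec=0.9:  ", simulate_power(1000, 0.6, 0.9))
```

Under these assumed settings, the run with imperfect sensitivity and specificity rejects the null hypothesis far less often than the error-free run, which is the power loss such studies aim to quantify.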
More recently, we have developed an innovative statistical method using the cutting-edge theory of integrated likelihood to correct for the estimation bias caused by phenotyping errors (Huang et al., 2017, JAMIA); see the figure below.
We are currently extending this framework to more realistic settings in which EHR-derived phenotypes are subject to differential misclassification.
4. INTEGRATION OF HETEROGENEOUS BIOMEDICAL DATA:
We have recently applied this method to bladder cancer data (dbGaP), where information from the Spanish and Finnish populations (n=6,978) is integrated without sharing the raw data (Liu et al., 2018, Genetic Epidemiology; under review), and have identified potential novel interactions; see the figure below.
We have also studied meta-analytical methods to quantify gene-environment interactions for lung cancer using four studies conducted in the State of Pennsylvania (n=1,610) (Huang et al., 2017, Genetic Epidemiology). The key innovations of this paradigm are to embrace heterogeneity across populations and to combine information across studies without sharing raw data. These Big Data innovations are important because (1) heterogeneity across datasets is leveraged, for the first time, to improve statistical power, and (2) the algorithms are distributed, so raw data never need to be shared, which is critical for privacy protection.
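As a simple illustration of combining information without sharing raw data, the sketch below uses standard fixed-effect inverse-variance meta-analysis, in which each site contributes only an effect estimate and its standard error. This is a textbook technique shown for orientation and not the method developed in the papers cited above; the function name `inverse_variance_meta` and the site-level numbers are hypothetical.

```python
import math


def inverse_variance_meta(estimates, std_errors):
    """Pool per-site effect estimates using inverse-variance weights.

    Returns the pooled estimate, its standard error, and Cochran's Q,
    a summary of between-site heterogeneity.
    """
    if not estimates or len(estimates) != len(std_errors):
        raise ValueError("need one standard error per estimate")
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, estimates))
    return pooled, pooled_se, q


if __name__ == "__main__":
    # Hypothetical log odds ratios and standard errors from four sites.
    betas = [0.21, 0.35, 0.10, 0.28]
    ses = [0.12, 0.15, 0.20, 0.10]
    pooled, pooled_se, q = inverse_variance_meta(betas, ses)
    print(f"pooled = {pooled:.3f}, SE = {pooled_se:.3f}, Q = {q:.2f}")
```

Only summary statistics cross site boundaries here. A fixed-effect pool of this kind treats heterogeneity as something to be detected (through Q) and does not exploit it, which is the limitation the paradigm described above is meant to address.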
We are currently collaborating with investigators at the Penn Institute for Translational Medicine and Therapeutics to study Penn biobank data. By linking EHR data with genetic information, we are able to explore the contributions of genetic variation to multiple complex conditions. However, the high-dimensional genetic data pose new challenges for statistical modeling and inference, which call for methodological innovation. We are currently developing a novel statistical modeling and inferential framework to prioritize single nucleotide polymorphisms (SNPs) for identifying pleiotropic effects.