In the last decade, there have been major advances in the production and collection of data, from medical research to patient wellness regimes. These vast new troves of data from electronic health records (EHR), genetic databases, connected devices, and wearables offer a unique opportunity to make health care more predictive and preventive. Transforming data into knowledge requires a deep understanding of the features of the data, integration of disparate data sources, and strong analytical strategies.


We aim to tackle several key challenges in the modern data-rich era, including the heterogeneity, complexity, suboptimal quality, reproducibility, and high dimensionality of biomedical data. We believe that the fundamental principles and wisdom of statistics can be revived in tackling these problems, such as the likelihood principle, Bayesian analysis, robust inference, dimension reduction, and penalized methods, as well as a combination of statistical modeling and algorithm design.


Specifically, we are working on the following areas:


  1. Evidence synthesis under complex missing data mechanisms;

  2. Pharmacovigilance research: signal detection and dynamic risk prediction;

  3. Bias reduction methods to account for data errors in EHR;

  4. Integration of heterogeneous biomedical data;

  5. Statistical inference for complex data.


We have been contributing to multivariate meta-analysis, meta-analysis of diagnostic tests, and network meta-analysis. For multivariate meta-analysis, we have developed a unified framework of methods to address correlated continuous or binary outcomes, model robustness, computational difficulties, and missing within-study correlations (Chen et al., 2013, 2014, 2015, 2016, Statistics in Medicine). In addition, we have extended this framework to meta-analysis of diagnostic tests to compare multiple diagnostic options, and have applied these methods to the surveillance of melanoma patients (Chen et al., 2014, Statistical Methods in Medical Research; Chen et al., 2015, JRSS-C; Chen et al., 2017, Statistics in Medicine). We have also proposed a unification of models for meta-analysis of diagnostic accuracy studies when the reference test is not a gold standard (Liu et al., 2015, Biometrics). For network meta-analysis, we have developed a framework of evidence synthesis methods that simultaneously compare multiple diagnostic options or treatment options (Ma et al., 2017, Biostatistics; Liu et al., 2017, JRSS-C). To address the problem of publication bias, we have developed an innovative EM algorithm to correct for biases caused by a selective publication process (Ning et al., 2017, Biostatistics), and extended this work to diagnostic test accuracy studies (Piao et al., 2017, SMMR, under revision).
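For readers new to evidence synthesis, the following is a minimal sketch of a univariate random-effects meta-analysis using the classical DerSimonian-Laird estimator. It is a much simpler relative of the multivariate and network methods described above, not an implementation of them, and the study-level numbers are hypothetical values chosen only for illustration.

```python
import math


def dersimonian_laird(effects, variances):
    """Pool study-level effects with a DerSimonian-Laird random-effects model.

    effects:   per-study effect estimates (e.g., log odds ratios)
    variances: per-study within-study variances
    Returns (pooled effect, standard error, between-study variance tau^2).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # fixed-effect (inverse-variance) weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q statistic measures heterogeneity around the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Method-of-moments estimate of tau^2, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2


# Hypothetical log odds ratios and variances from four studies.
log_or = [0.10, 0.30, 0.35, 0.65]
var = [0.03, 0.02, 0.05, 0.04]
pooled, se, tau2 = dersimonian_laird(log_or, var)
print(f"pooled log OR = {pooled:.3f} (SE {se:.3f}), tau^2 = {tau2:.3f}")
```

When the study estimates agree closely, the estimated between-study variance is truncated to zero and the result coincides with the fixed-effect inverse-variance estimate.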


To better serve clinical investigators, we have developed two R packages, “mmeta” (developed in 2014) and “xmeta” (developed in 2016). In particular, our “xmeta” project serves as a web platform to disseminate our methods and facilitate collaborations within and beyond Penn School of Medicine; see the figure below.


We have published several systematic reviews, including in dermatology and microbiology (Chahoud et al., 2016, JAMA; Agarwal et al., 2017, Clinical Microbiology and Infection). We are currently collaborating with the Penn Center of Evidence-based Practice (led by Dr. Craig Umscheid) and Penn School of Medicine on projects including a meta-analysis of heart failure outcomes with telemedicine/telephone support (Kimmel et al., 2018, JAMA, under review), a meta-analysis of the association between the nurse work environment and nurse job outcomes (Lake et al., 2018, Medical Care, under review), and a meta-analysis of the association between gut microbiome dysbiosis and disease activity (Najem et al., 2018, under review), among others.


This line of research is supported by an R03 from AHRQ (2014-2016) and by an R21 (2015-2017) and an R01 (2017-2021) from NLM.




We have been developing signal detection methods for evaluating vaccine and drug safety using massive Vaccine Adverse Event Reporting System (VAERS) data and FDA Adverse Event Reporting System (FAERS) data. In a sequence of papers, we developed a sensitive signal detection method for identifying temporal variation of adverse effects using VAERS data (Cai et al., 2016, BMC Medical Informatics and Decision Making), and developed and applied methods to study individual differences in pharmacovigilance for the trivalent influenza vaccine (Tao et al., 2015, Studies in Health Technology and Informatics; Du et al., 2016, Biomedical Informatics Insight). We also applied machine learning and GLM methods to compare HPV opinion trends across Twitter user groups, as a way to study the impact of social media on consumers’ health behavior (Du et al., 2017, Medinfo), and developed pipelines of methods to analyze and visualize the differences in drug safety outcomes using FAERS data (Huang et al., 2017, Medinfo; Duan et al., 2017, Medinfo). More recently, we developed a novel method that integrates VAERS data with CDC survey data on vaccine coverage and U.S. census data on race/ethnicity distribution to quantify differential AE rates by race/ethnicity group for the HPV vaccine; see the figure below (Huang et al., 2018, Frontier). Our method uses a generalized linear mixed effects model to link the three data sources, where the components of the model are approximated by different data sources.
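As background on what a "signal" means in spontaneous reporting data, the sketch below computes the proportional reporting ratio (PRR), a standard disproportionality measure in pharmacovigilance. It is shown only as a simple baseline and is not the signal detection method developed in the papers cited above; the counts are hypothetical.

```python
import math


def proportional_reporting_ratio(a, b, c, d):
    """PRR and an approximate 95% confidence interval from a 2x2 report table.

    a: reports with the target vaccine and the target adverse event (AE)
    b: reports with the target vaccine and any other AE
    c: reports with other vaccines and the target AE
    d: reports with other vaccines and any other AE
    All four counts are assumed to be positive.
    """
    prr = (a / (a + b)) / (c / (c + d))
    # Standard error of log(PRR), the usual large-sample approximation.
    se_log = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(prr) - 1.96 * se_log)
    upper = math.exp(math.log(prr) + 1.96 * se_log)
    return prr, lower, upper


# Hypothetical counts, for illustration only.
prr, lower, upper = proportional_reporting_ratio(a=20, b=80, c=100, d=900)
print(f"PRR = {prr:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```

A PRR well above 1, with a confidence interval excluding 1, flags the vaccine-event pair for further review; it does not by itself establish causation, since spontaneous reports lack denominators and are subject to reporting bias.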


We are developing a novel machine-learning framework for temporal risk prediction that incorporates external clinical knowledge. In vaccine/drug safety reports or EHR, patient records are generally longitudinal and represented as medical event sequences, where the events include clinical notes, conditions, medications, vital signs, laboratory reports, and so forth. Building a prediction model from the massive number of longitudinal events can be difficult without the guidance of external knowledge. We aim to develop machine learning methods for risk prediction that use the temporal information in EHR data and allow external knowledge to be incorporated.


Vaccines have been one of the most successful public health interventions to date. They are, however, pharmaceutical products that carry risks. Effective analysis of post-vaccination adverse events (AEs) is vital to assuring the safety of vaccines, a key public health intervention for reducing the frequency of vaccine-preventable illnesses. The CDC/FDA Vaccine Adverse Event Reporting System (VAERS) contains up to 30,000 reports per year over the past 25 years. VAERS reports include both structured data (e.g., vaccination date, first onset date, age, and gender) and unstructured narratives that often provide detailed clinical information about the clinical events and the temporal relationship of the series of event occurrences post vaccination. The structured data provide only one onset date, whereas the temporal information on the sequence of events post vaccination is contained in the unstructured narratives.


While structured data in VAERS are widely used, the narratives are generally ignored because of the challenges inherent in working with unstructured data. Without these narratives, potentially valuable information is lost. We propose to develop a novel framework to extract and accurately interpret the temporal information contained in the narratives through informatics approaches, and to develop prediction models for the risk of severe AEs; see the figure below. Specifically, building upon state-of-the-art ontology and natural language processing technologies, we will develop and validate a Temporal Information Modeling, Extraction and Reasoning system for Vaccine data (TIMER-V), which will automatically extract post-vaccination events and their temporal relationships from VAERS reports, semantically infer temporal relations, and integrate the extracted unstructured data with the structured data. Furthermore, we will provide and maintain a publicly available data access interface to query the new integrated data repository, which will facilitate vaccine safety research, causal inference, and other temporally related discovery. We will also develop and validate models to predict severe AEs using the co-occurrence or temporal patterns of the series of AEs post vaccination. We aim to make the unstructured narratives in VAERS reports usable for temporally related discovery by a broad community of investigators in pharmacology, pharmacoepidemiology, and vaccine safety research, among others.


This line of research is joint work with Dr. Cui Tao at the University of Texas School of Biomedical Informatics, and is supported by NIAID (2017-2022).


EHRs have been increasingly used for research purposes due to the tremendous depth of patient data available, and the extensive information on health outcomes and risk factors contained in them. This tremendous trove of patient data necessitates the advancement of automated high-throughput phenotyping algorithms to expedite the identification of relevant patient cohorts with clearly defined phenotypes from EHRs. The resulting EHR-derived phenotypes are then used for general-purpose knowledge discovery, including identification of risk factors for chronic conditions, evaluation of the efficacy of new treatment options, prediction of adverse events following drug usage, drug repurposing, discovery of drug-drug interactions, and many others. While EHRs have been used for phenotyping and disease-related data mining for several years, there remain some major challenges to this type of research. A key challenge is that the reproducibility of findings across studies is limited, which raises a fundamental concern about the value of these research findings.


We are currently developing a prior-knowledge-guided integrated likelihood framework, with readily available software, to account for the EHR data errors that arise from imperfect phenotyping algorithms, which will ultimately enhance the reproducibility of EHR-based discovery; see the figure below.

In the past, we have been developing novel methods to account for errors in EHR data. In our paper presented at the AMIA 2016 Annual Symposium, we used extensive simulation studies guided by eMERGE (Electronic Medical Records and Genomics) data to quantify the loss of power due to different levels of phenotyping errors in association studies (Duan et al., 2016, AMIA Annual Symposium Proceedings). This paper won first prize in the 2016 “Best of Student Papers in Knowledge Discovery and Data Mining (KDDM)” Awards at AMIA 2016.
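The effect of phenotyping error on power can be illustrated with a small simulation. The sketch below is a toy example under stated assumptions, not the eMERGE-guided design of the cited paper: it assumes a binary exposure, a binary phenotype observed with nondifferential error (a fixed sensitivity and specificity), and a two-proportion z-test. All prevalences and sample sizes are hypothetical.

```python
import math
import random


def simulate_power(n, sens, spec, reps=200, seed=0):
    """Estimate the power of a two-proportion z-test when the phenotype is
    observed through an imperfect algorithm with the given sensitivity and
    specificity (nondifferential misclassification)."""
    rng = random.Random(seed)
    # Hypothetical design values, for illustration only.
    p_exposed, p_case_unexposed, p_case_exposed = 0.3, 0.10, 0.18
    z_crit = 1.959963984540054  # two-sided 5% significance level
    rejections = 0
    for _ in range(reps):
        counts = [[0, 0], [0, 0]]  # counts[exposure][observed phenotype]
        for _ in range(n):
            x = 1 if rng.random() < p_exposed else 0
            p_case = p_case_exposed if x else p_case_unexposed
            y_true = 1 if rng.random() < p_case else 0
            # True cases are detected with probability sens;
            # non-cases are falsely flagged with probability 1 - spec.
            p_obs = sens if y_true else 1.0 - spec
            y_obs = 1 if rng.random() < p_obs else 0
            counts[x][y_obs] += 1
        n0, n1 = sum(counts[0]), sum(counts[1])
        if n0 == 0 or n1 == 0:
            continue  # no comparison possible; counted as a non-rejection
        p0, p1 = counts[0][1] / n0, counts[1][1] / n1
        pooled = (counts[0][1] + counts[1][1]) / (n0 + n1)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n0 + 1 / n1))
        if se > 0 and abs(p1 - p0) / se > z_crit:
            rejections += 1
    return rejections / reps


for sens, spec in [(1.0, 1.0), (0.8, 0.9), (0.6, 0.8)]:
    power = simulate_power(n=1000, sens=sens, spec=spec)
    print(f"sensitivity={sens:.1f}, specificity={spec:.1f}: power ~ {power:.2f}")
```

In this setup the observed case proportions in the two exposure groups are pulled toward each other as sensitivity and specificity fall, so the same true association becomes harder to detect at a fixed sample size.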


More recently, we have developed an innovative statistical method using the cutting-edge theory of integrated likelihood to correct for the estimation bias caused by phenotyping errors (Huang et al., 2017, JAMIA); see the figure below.

We are currently expanding this framework to tackle more realistic settings of differential misclassifications in EHR-derived phenotypes.



We have developed a novel meta-analytic framework for integrating data from multiple studies in identifying high-dimensional genetic risk factors of lung cancer and bladder cancer. We proposed a novel paradigm, YETI (phylogenY-aware Effect-size Tests for Interactions), for detecting genetic interactions from heterogeneous GWAS (Liu et al., 2016, Genetic Epidemiology); see the figure below.

We have recently applied this method to bladder cancer data (dbGaP), where information from the Spanish and Finnish populations (n=6,978) is integrated without sharing the raw data (Liu et al., 2018, Genetic Epidemiology, under review), and have identified potential novel interactions; see the figure below.

We have also studied meta-analytical methods to quantify gene-environment interactions for lung cancer using four studies conducted in the State of Pennsylvania (n=1,610) (Huang et al., 2017, Genetic Epidemiology). The key innovations of this paradigm are to embrace heterogeneity across populations, and to combine information across studies without sharing raw data. These Big Data innovations are critical because (1) heterogeneity across datasets was first leveraged to improve statistical power, and (2) the algorithms were developed as distributed algorithms to avoid sharing raw data, which is critical for privacy protection.
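The idea of combining information across studies without sharing raw data can be sketched in its simplest form: each site computes a summary statistic locally and sends only that summary to a coordinator. The example below uses a log odds ratio per site and fixed-effect inverse-variance pooling. It is an illustration of the general principle only, not the YETI method or the distributed algorithms cited above; the function names and counts are hypothetical.

```python
import math


def local_summary(a, b, c, d):
    """Run inside each site on its own data.

    a, b: exposed cases and exposed controls
    c, d: unexposed cases and unexposed controls
    Returns the log odds ratio and its large-sample variance; only these
    two numbers leave the site. All counts are assumed to be positive.
    """
    log_or = math.log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d
    return log_or, variance


def combine(summaries):
    """Run at the coordinating center on the shared summaries only."""
    weights = [1.0 / v for _, v in summaries]
    pooled = sum(w * e for w, (e, _) in zip(weights, summaries)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return pooled, se


# Hypothetical 2x2 tables held privately by three sites.
site_tables = [(30, 70, 20, 80), (45, 105, 25, 125), (12, 38, 10, 40)]
summaries = [local_summary(*table) for table in site_tables]
pooled, se = combine(summaries)
print(f"pooled log OR = {pooled:.3f} (SE {se:.3f})")
```

This simple pooling assumes a common effect across sites; methods that embrace heterogeneity across populations, as described above, relax that assumption.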


We are currently collaborating with investigators at the Penn Institute for Translational Medicine and Therapeutics to study Penn biobank data. By linking EHR with genetic information, we are able to explore the contributions of genetic variations to multiple complex conditions. However, the high-dimensional genetic data pose new challenges in statistical modeling and inference, which call for new methods. We are currently developing a novel statistical modeling and inferential framework to prioritize single nucleotide polymorphisms (SNPs) for identifying pleiotropic effects.


We have developed general theory for statistical inference in non-standard problems that arise in applications including correlated data, variance component models, multivariate survival models, and mixture models. When these non-regular problems occur, special attention is needed to design test statistics that overcome misspecification of the Type I error and substantial loss of statistical power. In a sequence of papers published in Biometrika, one of the most prestigious journals in statistics, we laid a foundation for rigorous statistical inference on these non-regular problems (Chen and Liang, 2010, Biometrika; Chen et al., 2017, Biometrika; Chen et al., 2018, Biometrika). We have used this theoretical framework to develop tests for homogeneity in mixture models that are relevant to the analysis of gene expression data, DNA methylation data, and genetic quantitative trait loci (Chen et al., 2013, Genetic Epidemiology; Ning and Chen, 2015, Scandinavian Journal of Statistics; Hong et al., 2016, JASA; Hong et al., 2016, Biometrics), and to pharmacovigilance studies (Cai et al., 2016, BMC Medical Informatics and Decision Making; Huang et al., 2018, Statistica Sinica).


We have also contributed to methods for longitudinal data analysis and multivariate survival analysis. Motivated by a study evaluating the effectiveness of an intervention for weight loss using self-reported weights, we developed a novel framework of pseudo-likelihood methods to analyze longitudinal data where the observation times may be outcome dependent, in which case the standard generalized estimating equation (GEE) approach fails (Chen et al., 2015, Biostatistics; Cai et al., 2018, Statistics in Medicine, under revision; Shen et al., 2017+, Statistica Sinica, under review). A key innovation of this framework is that, unlike joint modeling methods, the validity of these methods does not rely on the correct specification of the observation time process or the complex correlation structures, offering greater model robustness. Motivated by a soft tissue sarcoma study, we proposed a time-dependent measure and developed pseudo-likelihood-based inference to quantify the local dependence between two types of recurrent event processes (e.g., local and distant disease recurrences), without specifying the joint recurrent event processes (Ning et al., 2015, Biometrika). In addition, for analyzing re-offense data of juvenile probationers, we developed a frailty model for recurrent events during alternating restraint periods (e.g., placement in a community unit) and non-restraint periods (e.g., released to home), which corrects the bias induced by ignoring the differences between the two periods and leads to superior dynamic risk prediction (Li et al., 2016, Statistics in Medicine).
