I am a postdoctoral fellow in Computational Biology at Harvard University and the Broad Institute.
I build big-data computational tools to decipher the role of the microbiome in health and disease.
Each of us carries trillions of microorganisms, including bacteria, fungi, and viruses, on the inner and outer surfaces of our bodies, and we do not yet fully understand what they are doing there. Increasing evidence indicates that the microbiota is critical to normal host physiology and, when disrupted (dysbiosis), a driver of human disease. Therapeutic manipulation of the microbiota is currently an area of intense investigation as a possible treatment for many diseases. However, best practices and resources for making the leap from 'omics survey to therapy development remain scarce, with no clear consensus on the appropriate computational methods either for scalable epidemiological studies of human health or for detailed molecular, mechanistic profiling of microbial communities.
At the core of my research are two urgent and unmet needs: (a) scalable tools for microbial epidemiology, and (b) robust diagnostic and prognostic prediction models built from massive genomic data, to rapidly translate scientific discoveries into better therapeutic outcomes. This entails research into the most appropriate quantitative approaches for analyzing health information related to the human microbiome, together with molecular measurements of microbial communities in epidemiological human populations and the environment. Specifically, it includes robust methods for identifying microbial associations with health outcomes in large cohorts, tracking microbial contributions to health over time, and linking microbial community composition to its associated biochemical activities. Together, these areas enable better integration of massive microbiome data within and across large populations and improved tailoring of therapeutic research to patient subpopulations. By harnessing high-throughput data, they allow us to better understand the dynamics of the human microbiome and to promote human health, moving beyond incremental advances toward translational intervention in microbiome research.
We envision a future in which new therapeutics and diagnostics enable the management of our microbiota to treat and prevent disease. By leveraging the power of big-data analytics gained through multiple microbiome studies, we will be prepared to enter the era of personalized medicine, where clinical interventions can be custom-tailored to individual patients, representing an opportunity to improve patient and community health by bringing scientific discoveries from ‘bench to bedside’.
Abstract: Wiping of the mouth and nose at birth is an alternative to oronasopharyngeal suction in delivery-room management of neonates, but whether the two methods are equivalently effective is unclear. For this randomised equivalency trial, neonates delivered at 35 weeks' gestation or later at the University of Alabama at Birmingham Hospital, Birmingham, AL, USA, between October, 2010, and November, 2011, were eligible. Before birth, neonates were randomly assigned to gentle wiping of the face, mouth, and nose with a towel (wipe group) or to suction of the mouth and nostrils with a bulb syringe (suction group), implemented by the paediatric or obstetric resident. The primary outcome was the respiratory rate in the first 24 h after birth. We hypothesised that respiratory rates would differ by fewer than 4 breaths per min between groups. Analysis was by intention to treat. This study is registered with ClinicalTrials.gov, number NCT01197807. 506 neonates born at a median of 39 weeks' gestation (IQR 38-40) were randomised. Three parents withdrew consent, and 15 non-vigorous neonates with meconium-stained amniotic fluid were excluded. Among the 488 treated neonates, the mean respiratory rate in the first 24 h was 51 (SD 8) breaths per min in the wipe group and 50 (6) breaths per min in the suction group (difference of means 1 breath per min, 95% CI -2 to 0, p<0.001). Wiping the nose and mouth has equivalent efficacy to routine oronasopharyngeal suction in neonates born at or beyond 35 weeks' gestation. Funding: None.
Pub.: 07 Jun '13, Pinned: 29 Jun '17
Abstract: To develop and validate a mortality risk algorithm for obese black and white men and women, and to elucidate risk factors prognostic of short-term mortality among obese persons. Prospective cohort study. The Reasons for Geographic and Racial Differences in Stroke (REGARDS) study is a cohort of black and white men and women aged ≥45 years. Obese (BMI ≥30 kg/m²) participants in REGARDS (n = 11,288) were randomly assigned to a derivation data set or an independent validation set. During the mean follow-up period of 4.9 years, 8.9% (n = 504) of the derivation cohort and 8.7% (n = 492) of the validation cohort died. The best-fitting model based on the derivation cohort included demographic variables (age, sex), coronary heart disease (CHD) conditions (diabetes, systolic blood pressure, history of CHD), health behaviors (smoking, physical activity, alcohol use), and socioeconomic variables (income, use of physician services). The C-statistic when the model was applied to the validation cohort was 0.80. Observed and predicted mortality rates were similar across deciles of mortality risk by race. A risk algorithm was established and validated to predict mortality among black and white obese subjects based on CHD risk factors, behavioral risk factors, and socioeconomic status.
Pub.: 12 Oct '13, Pinned: 29 Jun '17
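The validation step above hinges on the C-statistic (0.80 in the validation cohort). As a reminder of what that number measures, here is a minimal sketch; the function name and example data are illustrative, not from the study:

```python
import numpy as np

def c_statistic(risk, event):
    """Concordance (C-) statistic: the fraction of case/control pairs in
    which the subject who died was assigned the higher predicted risk,
    with ties counted as half."""
    risk = np.asarray(risk, dtype=float)
    event = np.asarray(event, dtype=bool)
    cases, controls = risk[event], risk[~event]
    # pairwise risk differences between every case and every control
    diff = cases[:, None] - controls[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
```

A value of 0.5 means the model ranks a random decedent above a random survivor no better than chance; 1.0 means perfect discrimination.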
Abstract: Count data are increasingly ubiquitous in genetic association studies, where it is possible to observe excess zero counts relative to what is expected under standard assumptions. For instance, in rheumatology, data are usually collected from multiple joints within a person or multiple sub-regions of a joint, and it is not uncommon for the phenotypes to contain an enormous number of zeros because the majority of patients have excess zero counts. Most existing statistical methods assume that the count phenotypes follow one of four distributions with appropriate dispersion-handling mechanisms: Poisson, zero-inflated Poisson (ZIP), negative binomial, and zero-inflated negative binomial (ZINB). However, little is known about their implications for genetic association studies, and there is a relative paucity of literature on their behavior under model misspecification and in variable selection. In this article, we investigate the performance of several state-of-the-art approaches for handling zero-inflated count data, along with a novel penalized regression approach with an adaptive LASSO penalty, by simulating data under a variety of disease models and linkage disequilibrium patterns. By incorporating data-adaptive weights into the estimation procedure, the proposed method provides greater flexibility in multi-SNP modeling of zero-inflated count phenotypes. A fast coordinate descent algorithm nested within an EM (expectation-maximization) algorithm is implemented to estimate the model parameters and conduct variable selection simultaneously. Results show that the proposed method performs best in the presence of multicollinearity, as measured by both prediction accuracy and empirical power, a difference that becomes more apparent as the sample size increases. Moreover, Type I error rates become poorly controlled for the competing methods when the model is misspecified, a phenomenon routinely encountered in practice.
Pub.: 12 Apr '16, Pinned: 29 Jun '17
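As a concrete reference for the zero-inflated models compared above, here is a minimal sketch of the ZIP distribution and of the E-step "responsibility" that an EM fit of it relies on. This is a generic textbook formulation, not the authors' implementation:

```python
import math

def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson: with probability pi the count is a structural
    zero; otherwise it is drawn from Poisson(lam)."""
    poisson = math.exp(-lam) * lam**k / math.factorial(k)
    return pi + (1.0 - pi) * poisson if k == 0 else (1.0 - pi) * poisson

def structural_zero_prob(lam, pi):
    """E-step of an EM fit: posterior probability that an observed zero is
    a structural zero rather than a Poisson zero."""
    p0 = math.exp(-lam)
    return pi / (pi + (1.0 - pi) * p0)
```

The mixture explains the "excess zeros": P(0) is inflated from e^(-lam) to pi + (1 - pi)e^(-lam), while all positive counts are simply down-weighted by (1 - pi).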
Abstract: Multiple comparisons or multiple testing has been viewed as a thorny issue in genetic association studies aiming to detect disease-associated genetic variants among a large number of genotyped variants. We alleviate the problem of multiple comparisons by proposing a hierarchical modeling approach that is fundamentally different from the existing methods. The proposed hierarchical models simultaneously fit as many variables as possible and shrink unimportant effects towards zero. Thus, the hierarchical models yield more efficient parameter estimates than traditional methods that analyze genetic variants separately, and also coherently address the multiple comparisons problem by greatly reducing the effective number of genetic effects and the number of statistically "significant" effects. We develop a method for computing the effective number of genetic effects in hierarchical generalized linear models, and propose a new adjustment for multiple comparisons, the hierarchical Bonferroni correction, based on the effective number of genetic effects. Our approach not only increases the power to detect disease-associated variants but also controls the Type I error. We illustrate and evaluate our method with real and simulated data sets from genetic association studies. The method has been implemented in our freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/).
Pub.: 22 Nov '13, Pinned: 29 Jun '17
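To make the "effective number of genetic effects" concrete, here is a hedged sketch that uses the ridge-type effective degrees of freedom as a stand-in for the hierarchical-GLM calculation: shrinkage makes correlated variants count as fewer than one test each. The function names and the specific shrinkage form are illustrative assumptions, not the BhGLM implementation:

```python
import numpy as np

def effective_number_of_tests(X, lam):
    """Effective number of effects under ridge-type shrinkage: the trace of
    the shrinkage 'hat' matrix, sum_j d_j^2 / (d_j^2 + lam), where the d_j
    are singular values of the centered genotype matrix X. Correlated
    (linked) variants reduce this well below the column count."""
    Xc = X - X.mean(axis=0)
    d = np.linalg.svd(Xc, compute_uv=False)
    return float(np.sum(d**2 / (d**2 + lam)))

def hierarchical_bonferroni(alpha, m_eff):
    """Per-test threshold alpha / m_eff, which is less conservative than the
    usual alpha / m whenever m_eff < m."""
    return alpha / m_eff
```

With perfectly correlated variants the effective count collapses to the rank of X, so the adjusted threshold can be substantially less stringent than the naive Bonferroni cut-off while still controlling Type I error.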
Abstract: The development of congenital heart defects (CHDs) involves a complex interplay between genetic variants, epigenetic variants, and environmental exposures. Previous studies have suggested that susceptibility to CHDs is associated with maternal genotypes, fetal genotypes, and maternal-fetal genotype (MFG) interactions. We conducted a haplotype-based genetic association study of obstructive heart defects (OHDs), aiming to detect the genetic effects of 877 SNPs involved in the homocysteine, folate, and transsulfuration pathways. Genotypes were available for 285 mother-offspring pairs with OHD-affected pregnancies and 868 mother-offspring pairs with unaffected pregnancies. A penalized logistic regression model with an adaptive least absolute shrinkage and selection operator (lasso) penalty was applied to dissect the maternal, fetal, and MFG interaction effects associated with OHDs. By examining the associations of 140 haplotype blocks, we identified 9 blocks potentially associated with OHD occurrence. Four haplotype blocks, located in the genes MGMT, MTHFS, CBS, and DNMT3L, were statistically significant at a Bayesian false-discovery probability threshold of 0.8. The two blocks in MGMT and MTHFS appear to have significant fetal effects, while the CBS and DNMT3L blocks may have significant MFG interaction effects.
Pub.: 05 Jun '14, Pinned: 29 Jun '17
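The adaptive-lasso idea used above can be sketched with a generic solver: an initial estimate supplies data-adaptive weights w_j = 1/|beta_j_init|^gamma, so strong effects are penalized lightly and weak ones heavily. What follows is a proximal-gradient sketch of that standard recipe, not the authors' solver; the function names and the crude linear-model initializer are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_lasso_logistic(X, y, lam, gamma=1.0, n_iter=2000):
    """Adaptive-lasso logistic regression via proximal gradient descent
    (ISTA). Sketch only: the initial estimate below is a rough ridge-type
    linear approximation used solely to set the adaptive weights."""
    n, p = X.shape
    beta0 = np.linalg.solve(X.T @ X + np.eye(p), X.T @ (y - 0.5)) * 4.0
    w = 1.0 / (np.abs(beta0) + 1e-6) ** gamma      # data-adaptive weights
    L = np.linalg.norm(X, 2) ** 2 / (4.0 * n)      # Lipschitz bound for the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ beta) - y) / n   # gradient of mean log-loss
        z = beta - grad / L
        # soft-thresholding step with weighted penalty lam * w_j
        beta = np.sign(z) * np.maximum(np.abs(z) - lam * w / L, 0.0)
    return beta
```

Because noise coefficients receive large weights, they are typically thresholded exactly to zero, which is what drives the sparsity behind selecting a handful of blocks out of 140.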
Abstract: African Americans are disproportionately affected by early-onset, high-grade malignancies. A fraction of this cancer health disparity can be explained by genetic differences between individuals of African or European descent. Here, the wild-type Pro/Pro genotype at the TP53 Pro72Arg (P72R) polymorphism (SNP rs1042522) is more frequent in African Americans with cancer than in African Americans without cancer (51% vs. 37%) and is associated with a significant increase in the rate of cancer diagnosis in African Americans. To test the hypothesis that TP53 allele-specific gene expression may contribute to African American cancer disparities, TP53 hemizygous knockout variants were generated and characterized in the RKO colon carcinoma cell line, which is wild type for TP53 and heterozygous at the Pro72Arg locus. Transcriptome profiling by RNA-seq in response to the DNA-damaging agent etoposide revealed a large number of p53-regulated transcripts, but also a subset of transcripts that were Pro72Arg allele specific. In addition, an shRNA-library suppressor screen for allele-specific escape from p53-induced arrest was performed. Several novel RNAi suppressors of p53 were identified, one of which, PRDM1β (BLIMP-1), was confirmed to be an Arg-specific transcript. PRDM1β silences target genes by recruiting H3K9 trimethyl (H3K9me3) repressive chromatin marks and is necessary for stem cell differentiation. These results reveal a novel model for African American cancer disparity, in which the TP53 codon 72 allele influences lifetime cancer risk by driving damaged cells to differentiation through an epigenetic gene-silencing mechanism. The TP53 P72R polymorphism significantly contributes to increased African American cancer disparity.
Pub.: 20 Apr '14, Pinned: 29 Jun '17
Abstract: Development of oncologic therapies has traditionally been performed in a sequence of clinical trials intended to assess safety (phase I), preliminary efficacy (phase II), and improvement over the standard of care (phase III) in homogeneous (in terms of tumor type and disease stage) patient populations. As cancer has become increasingly understood on the molecular level, newer "targeted" drugs that inhibit specific cancer cell growth and survival mechanisms have increased the need for new clinical trial designs, wherein pertinent questions on the relationship between patient biomarkers and response to treatment can be answered. Herein, we review the clinical trial design literature from initial to more recently proposed designs for targeted agents or those treatments hypothesized to have enhanced effectiveness within patient subgroups (e.g., those with a certain biomarker value or who harbor a certain genetic tumor mutation). We also describe a number of real clinical trials where biomarker-based designs have been utilized, including a discussion of their respective advantages and challenges. As cancers become further categorized and/or reclassified according to individual patient and tumor features, we anticipate a continued need for novel trial designs to keep pace with the changing frontier of clinical cancer research.
Pub.: 02 Feb '16, Pinned: 29 Jun '17
Abstract: To examine whether, as the initial surgical intervention for necrotizing enterocolitis, primary peritoneal drainage is associated with increased mortality or intestinal failure compared with primary laparotomy. Retrospective observational study of 240 infants with surgical necrotizing enterocolitis. There was no difference in the composite outcome of mortality before discharge or survival with intestinal failure after adjusting for known covariates (odds ratio 1.73, 95% CI 0.88 to 3.40). More surviving infants in the peritoneal drainage group who underwent subsequent salvage or secondary laparotomy had intestinal failure than those who received a peritoneal drain without subsequent laparotomy and survived (12% vs. 14% vs. 1%, p=0.015). There is no difference between peritoneal drainage and laparotomy in infants with surgical necrotizing enterocolitis with respect to the combined outcome of mortality or survival with intestinal failure. Intestinal failure is more frequent in surviving infants treated with a peritoneal drain plus either subsequent salvage or secondary laparotomy than with peritoneal drainage alone.
Pub.: 14 Mar '13, Pinned: 29 Jun '17
Abstract: In this article, we present a selective overview of recent developments in Bayesian model and variable selection methods for high-dimensional linear models. While most reviews in the literature cover conventional methods, we focus on recently developed methods that have proven successful for high-dimensional variable selection. First, we give a brief overview of traditional model selection criteria (Mallows' Cp, AIC, BIC, DIC), followed by a discussion of some recently developed approaches (e.g., EBIC and regularization) that have occupied the minds of many statisticians. Then, we review high-dimensional Bayesian methods with particular emphasis on Bayesian regularization, which has been used extensively in recent years. We conclude by briefly addressing the asymptotic behavior of Bayesian variable selection methods for high-dimensional linear models under different regularity conditions.
Pub.: 11 Feb '14, Pinned: 29 Jun '17
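For orientation, the traditional criteria mentioned above all trade goodness of fit against model complexity; for a Gaussian linear model with k parameters they have simple closed forms in terms of the residual sum of squares. A generic textbook sketch, not code from the review:

```python
import math

def aic_bic(rss, n, k):
    """AIC and BIC for a Gaussian linear model, via the maximized
    log-likelihood expressed through the residual sum of squares."""
    loglik = -0.5 * n * (math.log(2.0 * math.pi * rss / n) + 1.0)
    aic = 2.0 * k - 2.0 * loglik
    bic = k * math.log(n) - 2.0 * loglik
    return aic, bic
```

BIC's per-parameter penalty log n exceeds AIC's 2 once n > e² ≈ 7.4, which is why BIC tends to select sparser models in large samples, a behavior the extended criterion EBIC pushes further for high-dimensional settings.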
Abstract: A Bayesian bi-level variable selection method (BAGB: Bayesian Analysis of Group Bridge) is developed for regularized regression and classification. This new development is motivated by grouped data, where generic variables can be divided into multiple groups, with variables in the same group being mechanistically related or statistically correlated. As an alternative to frequentist group variable selection methods, BAGB incorporates structural information among predictors through a group-wise shrinkage prior. Posterior computation proceeds via an efficient MCMC algorithm. In addition to the usual ease of interpretation of hierarchical linear models, the Bayesian formulation produces valid standard errors, a feature notably absent from the frequentist framework. The attractiveness of the method is demonstrated by extensive Monte Carlo simulations and real data analysis. Finally, several extensions of this new approach are presented, providing a unified framework for bi-level variable selection in general models with flexible penalties.
Pub.: 18 Jan '17, Pinned: 29 Jun '17
Abstract: Recent advances in next-generation sequencing (NGS) technology enable researchers to collect large volumes of metagenomic sequencing data. These data provide valuable resources for investigating interactions between the microbiome and host environmental/clinical factors. In addition to the well-known properties of microbiome count measurements, for example, varied total sequence reads across samples, over-dispersion, and zero-inflation, microbiome studies usually collect samples with hierarchical structures, which introduce correlation among the samples and thus further complicate the analysis and interpretation of microbiome count data. In this article, we propose negative binomial mixed models (NBMMs) for detecting associations between the microbiome and host environmental/clinical factors in correlated microbiome count data. Although they do not address zero-inflation, the proposed mixed-effects models account for correlation among the samples by incorporating random effects into the commonly used fixed-effects negative binomial model, and can efficiently handle over-dispersion and varying total reads. We have developed a flexible and efficient IWLS (iterative weighted least squares) algorithm to fit the proposed NBMMs by taking advantage of the standard procedure for fitting linear mixed models. We evaluate and demonstrate the proposed method via extensive simulation studies and an application to mouse gut microbiome data. The results show that the proposed method has desirable properties and outperforms previously used methods in terms of both empirical power and Type I error. The method has been incorporated into the freely available R package BhGLM ( http://www.ssg.uab.edu/bhglm/ and http://github.com/abbyyan3/BhGLM ), providing a useful tool for analyzing microbiome data.
Pub.: 05 Jan '17, Pinned: 29 Jun '17
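To make the IWLS idea concrete, here is a minimal sketch of the fixed-effects part only (no random effects): a negative binomial log-linear model with known dispersion theta, where an offset of log total reads can absorb the varying library sizes mentioned above. The function name and the simplification to fixed effects are my assumptions, not the BhGLM code:

```python
import numpy as np

def nb_iwls(X, y, theta, offset=None, n_iter=100, tol=1e-8):
    """Fit a fixed-effects negative binomial log-linear model by IWLS.

    theta is a known NB dispersion parameter (variance mu + mu^2/theta);
    `offset` can carry log total reads so counts are modeled as rates."""
    n, p = X.shape
    if offset is None:
        offset = np.zeros(n)
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta + offset
        mu = np.exp(eta)
        # IWLS weight = mu^2 / Var(y) = mu / (1 + mu/theta)
        w = mu / (1.0 + mu / theta)
        # working response for the log link
        z = eta - offset + (y - mu) / mu
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta
```

Each iteration solves a weighted least-squares problem; extending this to the NBMMs of the abstract means solving a linear mixed model at each step instead, which is the "standard procedure" the authors exploit.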