A pinboard by
Himel Mallick

I am a postdoctoral fellow in Computational Biology at Harvard University and the Broad Institute.


I build big data computational tools to decipher the role of the microbiome in health and disease

Each of us carries around millions of microorganisms, including bacteria, fungi, and viruses, on the inner and outer surfaces of our bodies, and we don't yet clearly know what they are doing there. Increasing evidence indicates that the microbiota is critical to normal host physiology and becomes a driver of human disease when it is disrupted, a state known as dysbiosis. Therapeutic manipulation of the microbiota is currently an area of intense investigation as a possible treatment for many diseases. However, best practices and resources remain scarce for making the leap from 'omics survey to therapy development, with no clear consensus on the appropriate computational methods for scalable human health epidemiological studies or for detailed molecular mechanistic profiles of microbial communities.

At the core of my research are two urgent and unmet needs: (a) scalable tools for microbial epidemiology, and (b) robust diagnostic or prognostic prediction models based on a massive amount of genomic data, to rapidly translate scientific discoveries into better therapeutic outcomes. This entails research into the most appropriate quantitative approaches for analyzing health information related to the human microbiome, as well as molecular measurements of microbial communities in epidemiological human populations and from the environment. Specifically, this includes robust methods for identifying microbial associations with health outcomes in large cohorts, tracking microbial contributions to health over time, and linking microbial community composition to associated biochemical activities. Together, these areas lead to better integration of massive microbiome data within and across large populations and to improved tailoring of therapeutic research to patient subpopulations. Harnessing high-throughput data in this way lets us better understand the dynamics of the human microbiome and promote human health, moving beyond incremental advances toward translational intervention in microbiome research.

We envision a future in which new therapeutics and diagnostics enable the management of our microbiota to treat and prevent disease. By leveraging the power of big data analytics gained through multiple microbiome studies, we will be prepared to enter the era of personalized medicine, where clinical interventions can be custom-tailored to individual patients, representing an opportunity to improve patient and community health by bringing scientific discoveries from ‘bench to bedside’.


Oronasopharyngeal suction versus wiping of the mouth and nose at birth: a randomised equivalency trial.

Abstract: Wiping of the mouth and nose at birth is an alternative method to oronasopharyngeal suction in delivery-room management of neonates, but whether these methods have equivalent effectiveness is unclear. For this randomised equivalency trial, neonates delivered at 35 weeks' gestation or later at the University of Alabama at Birmingham Hospital, Birmingham, AL, USA, between October, 2010, and November, 2011, were eligible. Before birth, neonates were randomly assigned gentle wiping of the face, mouth (implemented by the paediatric or obstetric resident), and nose with a towel (wipe group) or suction with a bulb syringe of the mouth and nostrils (suction group). The primary outcome was the respiratory rate in the first 24 h after birth. We hypothesised that respiratory rates would differ by fewer than 4 breaths per min between groups. Analysis was by intention to treat. This study is registered with ClinicalTrials.gov, number NCT01197807. 506 neonates born at a median of 39 weeks' gestation (IQR 38-40) were randomised. Three parents withdrew consent and 15 non-vigorous neonates with meconium-stained amniotic fluid were excluded. Among the 488 treated neonates, the mean respiratory rates in the first 24 h were 51 (SD 8) breaths per min in the wipe group and 50 (6) breaths per min in the suction group (difference of means 1 breath per min, 95% CI -2 to 0, p<0·001). Wiping the nose and mouth has equivalent efficacy to routine use of oronasopharyngeal suction in neonates born at or beyond 35 weeks' gestation. Funding: None.

Pub.: 07 Jun '13, Pinned: 29 Jun '17
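
The equivalence criterion stated in the abstract above (respiratory rates differing by fewer than 4 breaths per min) can be illustrated with a small check that a confidence interval lies entirely inside the equivalence margin. This is only a sketch of the general idea, not the trial's actual analysis; the function name `within_equivalence_margin` is hypothetical, and the numbers are the interval and margin quoted in the abstract.

```python
def within_equivalence_margin(ci_low: float, ci_high: float, margin: float) -> bool:
    """Return True when the whole interval [ci_low, ci_high] lies strictly
    inside (-margin, +margin), the usual reading of an equivalence claim."""
    if margin <= 0:
        raise ValueError("margin must be positive")
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    return -margin < ci_low and ci_high < margin


# Interval and margin as quoted in the abstract: 95% CI -2 to 0, margin 4.
print(within_equivalence_margin(-2.0, 0.0, 4.0))
```

An interval that crosses either bound, for example -5 to 0 with the same margin, would not support equivalence under this rule.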

EM Adaptive LASSO-A Multilocus Modeling Strategy for Detecting SNPs Associated with Zero-inflated Count Phenotypes.

Abstract: Count data are increasingly ubiquitous in genetic association studies, where it is possible to observe excess zero counts as compared to what is expected based on standard assumptions. For instance, in rheumatology, data are usually collected in multiple joints within a person or multiple sub-regions of a joint, and it is not uncommon that the phenotypes contain an enormous number of zeroes due to the presence of excessive zero counts in the majority of patients. Most existing statistical methods assume that the count phenotypes follow one of these four distributions with appropriate dispersion-handling mechanisms: Poisson, Zero-inflated Poisson (ZIP), Negative Binomial, and Zero-inflated Negative Binomial (ZINB). However, little is known about their implications in genetic association studies. Also, there is a relative paucity of literature on their usefulness with respect to model misspecification and variable selection. In this article, we have investigated the performance of several state-of-the-art approaches for handling zero-inflated count data along with a novel penalized regression approach with an adaptive LASSO penalty, by simulating data under a variety of disease models and linkage disequilibrium patterns. By taking into account data-adaptive weights in the estimation procedure, the proposed method provides greater flexibility in multi-SNP modeling of zero-inflated count phenotypes. A fast coordinate descent algorithm nested within an EM (expectation-maximization) algorithm is implemented for estimating the model parameters and conducting variable selection simultaneously. Results show that the proposed method has optimal performance in the presence of multicollinearity, as measured by both prediction accuracy and empirical power, which is especially apparent as the sample size increases. Moreover, the Type I error rates become more or less uncontrollable for the competing methods when a model is misspecified, a phenomenon routinely encountered in practice.

Pub.: 12 Apr '16, Pinned: 29 Jun '17
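
To make the EM idea in the abstract above concrete, here is a minimal sketch of EM for an intercept-only zero-inflated Poisson (ZIP) model. It is a deliberate simplification and not the paper's method: it has no SNP covariates, no adaptive LASSO penalty, and no coordinate descent step. The function name `fit_zip_em` and the toy counts are assumptions made for illustration.

```python
import math


def fit_zip_em(counts, n_iter=1000, tol=1e-12):
    """Fit an intercept-only ZIP model by EM.

    Returns (pi, lam): pi is the probability of a structural zero and
    lam is the Poisson mean of the count component.
    """
    n = len(counts)
    n_zero = sum(1 for y in counts if y == 0)
    total = sum(counts)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero count")

    pi = 0.5 * n_zero / n   # crude starting value
    lam = total / n         # sample mean as starting value
    for _ in range(n_iter):
        # E-step: posterior probability that an observed zero is structural.
        # Non-zero counts can never be structural zeros.
        z0 = pi / (pi + (1.0 - pi) * math.exp(-lam))
        expected_structural = n_zero * z0

        # M-step: closed-form updates given the expected memberships.
        new_pi = expected_structural / n
        new_lam = total / (n - expected_structural)

        converged = abs(new_pi - pi) + abs(new_lam - lam) < tol
        pi, lam = new_pi, new_lam
        if converged:
            break
    return pi, lam


# Toy data with more zeros than a plain Poisson would produce.
toy_counts = [0] * 50 + [1] * 10 + [2] * 20 + [3] * 15 + [4] * 5
pi_hat, lam_hat = fit_zip_em(toy_counts)
print(pi_hat, lam_hat)
```

In a penalized regression version, the M-step would no longer have a closed form; that is where a nested optimizer such as the coordinate descent mentioned in the abstract would come in.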

Multiple comparisons in genetic association studies: a hierarchical modeling approach.

Abstract: Multiple comparisons or multiple testing has been viewed as a thorny issue in genetic association studies aiming to detect disease-associated genetic variants from a large number of genotyped variants. We alleviate the problem of multiple comparisons by proposing a hierarchical modeling approach that is fundamentally different from the existing methods. The proposed hierarchical models simultaneously fit as many variables as possible and shrink unimportant effects towards zero. Thus, the hierarchical models yield more efficient estimates of parameters than the traditional methods that analyze genetic variants separately, and also coherently address the multiple comparisons problem by largely reducing the effective number of genetic effects and the number of statistically "significant" effects. We develop a method for computing the effective number of genetic effects in hierarchical generalized linear models, and propose a new adjustment for multiple comparisons, the hierarchical Bonferroni correction, based on the effective number of genetic effects. Our approach not only increases the power to detect disease-associated variants but also controls the Type I error. We illustrate and evaluate our method with real and simulated data sets from genetic association studies. The method has been implemented in our freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/).

Pub.: 22 Nov '13, Pinned: 29 Jun '17
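
The abstract above does not say how the effective number of genetic effects is computed, so the sketch below takes it as a given input and only shows why a smaller effective number gives a less stringent Bonferroni threshold. The function name `bonferroni_threshold` and the figures of 1000 variants and 50 effective effects are hypothetical and are not taken from the paper.

```python
def bonferroni_threshold(alpha: float, n_effects: float) -> float:
    """Per-test significance threshold: alpha divided by the number of effects."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if n_effects < 1:
        raise ValueError("n_effects must be at least 1")
    return alpha / n_effects


# Hypothetical numbers: 1000 genotyped variants, but only 50 effective effects
# remaining after shrinkage (an assumed value, supplied by hand here).
standard = bonferroni_threshold(0.05, 1000)
hierarchical = bonferroni_threshold(0.05, 50)
print(standard, hierarchical)
```

Dividing by the smaller effective number yields a larger per-test threshold, which is consistent with the power gain the abstract describes.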

Transcriptomes and shRNA suppressors in a TP53 allele-specific model of early-onset colon cancer in African Americans.

Abstract: African Americans are disproportionately affected by early-onset, high-grade malignancies. A fraction of this cancer health disparity can be explained by genetic differences between individuals of African or European descent. Here the wild-type Pro/Pro genotype at the TP53Pro72Arg (P72R) polymorphism (SNP: rs1042522) is more frequent in African Americans with cancer than in African Americans without cancer (51% vs. 37%), and is associated with a significant increase in the rates of cancer diagnosis in African Americans. To test the hypothesis that Tp53 allele-specific gene expression may contribute to African American cancer disparities, TP53 hemizygous knockout variants were generated and characterized in the RKO colon carcinoma cell line, which is wild type for TP53 and heterozygous at the TP53Pro72Arg locus. Transcriptome profiling, using RNAseq, in response to the DNA-damaging agent etoposide revealed a large number of Tp53-regulated transcripts, but also a subset of transcripts that were TP53Pro72Arg allele specific. In addition, a shRNA-library suppressor screen for Tp53 allele-specific escape from Tp53-induced arrest was performed. Several novel RNAi suppressors of Tp53 were identified, one of which, PRDM1β (BLIMP-1), was confirmed to be an Arg-specific transcript. Prdm1β silences target genes by recruiting H3K9 trimethyl (H3K9me3) repressive chromatin marks, and is necessary for stem cell differentiation. These results reveal a novel model for African American cancer disparity, in which the TP53 codon 72 allele influences lifetime cancer risk by driving damaged cells to differentiation through an epigenetic mechanism involving gene silencing. TP53 P72R polymorphism significantly contributes to increased African American cancer disparity.

Pub.: 20 Apr '14, Pinned: 29 Jun '17

Negative binomial mixed models for analyzing microbiome count data.

Abstract: Recent advances in next-generation sequencing (NGS) technology enable researchers to collect a large volume of metagenomic sequencing data. These data provide valuable resources for investigating interactions between the microbiome and host environmental/clinical factors. In addition to the well-known properties of microbiome count measurements, for example, varied total sequence reads across samples, over-dispersion and zero-inflation, microbiome studies usually collect samples with hierarchical structures, which introduce correlation among the samples and thus further complicate the analysis and interpretation of microbiome count data. In this article, we propose negative binomial mixed models (NBMMs) for detecting the association between the microbiome and host environmental/clinical factors for correlated microbiome count data. Although they do not deal with zero-inflation, the proposed mixed-effects models account for correlation among the samples by incorporating random effects into the commonly used fixed-effects negative binomial model, and can efficiently handle over-dispersion and varying total reads. We have developed a flexible and efficient IWLS (Iterative Weighted Least Squares) algorithm to fit the proposed NBMMs by taking advantage of the standard procedure for fitting the linear mixed models. We evaluate and demonstrate the proposed method via extensive simulation studies and the application to mouse gut microbiome data. The results show that the proposed method has desirable properties and outperforms the previously used methods in terms of both empirical power and Type I error. The method has been incorporated into the freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/ and http://github.com/abbyyan3/BhGLM), providing a useful tool for analyzing microbiome data.

Pub.: 05 Jan '17, Pinned: 29 Jun '17
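
The ingredients named in the abstract above (varying total reads, over-dispersion, and a random effect for correlated samples) can be sketched as a negative binomial log-likelihood term with an offset. This is an illustration under stated assumptions rather than the BhGLM implementation: it assumes the common parameterization with variance mu + mu^2/theta and a log link with log(total reads) as offset, and it does not include the IWLS fitting step. The function names `expected_count` and `nb_log_pmf` are hypothetical.

```python
import math


def expected_count(total_reads, intercept, slope, x, random_effect):
    """Mean count under a log link with log(total_reads) as offset:
    log(mu) = log(total_reads) + intercept + slope * x + random_effect."""
    return total_reads * math.exp(intercept + slope * x + random_effect)


def nb_log_pmf(y, mu, theta):
    """Log-probability of count y under a negative binomial with mean mu and
    shape theta (variance mu + mu**2 / theta, so small theta means more
    over-dispersion)."""
    return (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )


# Two samples from the same hypothetical subject share one random effect
# but differ in sequencing depth; all values below are made up.
shared_effect = 0.3
for reads, observed in [(10_000, 140), (50_000, 660)]:
    mu = expected_count(reads, math.log(0.01), 0.0, 1.0, shared_effect)
    print(reads, mu, nb_log_pmf(observed, mu, theta=2.0))
```

Because the offset enters on the log scale, the expected count scales with sequencing depth while the covariate and random effects act on the relative abundance.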