Lecturer (Probationary), University of Ruhuna
My work presents several computational workflows for the identification and quantification of proteomics data. Because existing protein inference methods are still too slow at protein identification, this work introduces a hardware-accelerated protein inference framework (using an FPGA) and an open-source biomarker discovery tool. It also includes a critical analysis of existing multi-pattern matching algorithms in the context of proteomics data analysis.
Abstract: The Bicoid morphogen is amongst the earliest triggers of differential spatial patterns of gene expression and subsequent cell fate determination in the embryonic development of Drosophila. This maternally deposited morphogen is thought to diffuse in the embryo, establishing a concentration gradient which is sensed by downstream genes. In most model-based analyses of this process, the translation of the bicoid mRNA is thought to take place at a fixed rate from the anterior pole of the embryo and a supply of the resulting protein at a constant rate is assumed. Is this process of morphogen generation a passive one as assumed in the modelling literature so far, or would available data support an alternate hypothesis that the stability of the mRNA is regulated by active processes? We introduce a model in which the stability of the maternal mRNA is regulated by being held constant for a length of time, followed by rapid degradation. With this more realistic model of the source, we have analysed three computational models of spatial morphogen propagation along the anterior-posterior axis: (a) passive diffusion modelled as a deterministic differential equation, (b) diffusion enhanced by a cytoplasmic flow term, and (c) diffusion modelled by stochastic simulation of the corresponding chemical reactions. Parameter estimation on these models by matching to publicly available data on spatio-temporal Bicoid profiles suggests strong support for regulated stability over either a constant supply rate or one where the maternal mRNA is permitted to degrade in a passive manner.
Pub.: 29 Sep '11, Pinned: 29 Jul '17
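As a rough illustration of the source model described in the abstract above, the Python sketch below simulates passive diffusion (case (a)) with a source that is held constant for a while and then degrades rapidly. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the exponential form of the decay, the explicit finite-difference scheme, the reflecting boundaries, and all parameter values.

```python
import math

def source_rate(t, s0=1.0, t_hold=100.0, decay_rate=0.1):
    """Protein supply rate at the anterior pole (assumed form).

    Constant at s0 until t_hold, then exponential decay, mimicking
    mRNA that is kept stable for a while and then degraded rapidly.
    """
    if t <= t_hold:
        return s0
    return s0 * math.exp(-decay_rate * (t - t_hold))

def simulate(t_end, n_cells=50, dx=1.0, dt=0.1, diffusion=1.0, k_deg=0.01):
    """Explicit finite-difference solution of dB/dt = D d2B/dx2 - k B + source.

    The source acts only on cell 0 (anterior pole). Both ends are
    reflecting. Stable because diffusion * dt / dx**2 = 0.1 <= 0.5.
    """
    b = [0.0] * n_cells
    steps = int(round(t_end / dt))
    for step in range(steps):
        t = step * dt
        new = b[:]
        for i in range(n_cells):
            left = b[i - 1] if i > 0 else b[i]
            right = b[i + 1] if i < n_cells - 1 else b[i]
            laplacian = (left - 2.0 * b[i] + right) / dx ** 2
            new[i] = b[i] + dt * (diffusion * laplacian - k_deg * b[i])
        new[0] += dt * source_rate(t) / dx  # inject protein at the anterior pole
        b = new
    return b

profile = simulate(t_end=100.0)
print([round(v, 3) for v in profile[:5]])
```

Comparing profiles at a time before and well after `t_hold` shows the anterior concentration falling once the source shuts off, which is the qualitative signature a constant-supply model cannot produce.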
Abstract: With a large amount of information relating to proteins accumulating in databases widely available online, it is of interest to apply machine learning techniques that, by extracting underlying statistical regularities in the data, make predictions about the functional and evolutionary characteristics of unseen proteins. Such predictions can help reduce the space over which experiment designers need to search in order to improve our understanding of the biochemical properties. Previously it has been suggested that an integration of features computable by comparing a pair of proteins can be achieved by an artificial neural network, hence predicting the degree to which they may be evolutionarily related and homologous. We compiled two datasets of pairs of proteins, each pair being characterised by seven distinct features. We performed an exhaustive search through all possible combinations of features for the problem of separating remote homologous from analogous pairs, and note that a significant performance gain was obtained by the inclusion of sequence and structure information. We find that a linear classifier was enough to discriminate a protein pair at the family level. However, at the superfamily level, detecting remote homologous pairs was a relatively harder problem, and we find that nonlinear classifiers achieve significantly higher accuracies. In this paper, we compare three different pattern classification methods on two problems formulated as detecting evolutionary and functional relationships between pairs of proteins, and from extensive cross-validation and feature-selection-based studies quantify the average limits and uncertainties with which such predictions may be made. Feature selection points to a "knowledge gap" in currently available functional annotations. We demonstrate how the scheme may be employed in a framework to associate an individual protein with an existing family of evolutionarily related proteins.
Pub.: 01 Jan '08, Pinned: 29 Jul '17
Abstract: In high-throughput experimental biology, it is widely acknowledged that while expression levels measured at the level of the transcriptome and the corresponding proteome do not, in general, correlate well, messenger RNA levels are used as convenient proxies for protein levels. Our interest is in developing data-driven computational models that can bridge the gap between these two levels of measurement at which different mechanisms of regulation may act on different molecular species, causing any observed lack of correlation. To this end, we build data-driven predictors of protein levels using mRNA levels and known proxies of translation efficiencies as covariates. Previous work showed that in such a setting, outliers with respect to the model are reliable candidates for post-translational regulation. Here, we present and compare two novel formulations of deriving a protein concentration predictor from which outliers may be extracted in a systematic manner. The first approach, outlier rejecting regression, allows explicit specification of a certain fraction of the data as outliers. In a regression setting, this is a non-convex optimization problem which we solve by deriving a difference of convex functions algorithm (DCA). With post-translationally regulated proteins, one expects their concentrations to be affected primarily by disruption of protein stability. Our second algorithm exploits this observation by minimizing an asymmetric loss using quantile regression and extracts outlier proteins whose measured concentrations are lower than what a genome-wide regression would predict. We validate the two approaches on a dataset of yeast transcriptome and proteome. A functional annotation check on the detected outliers demonstrates that the methods are able to identify post-translationally regulated genes with high statistical confidence.
Pub.: 31 Mar '15, Pinned: 29 Jul '17
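The asymmetric loss mentioned in the abstract above is commonly called the pinball loss. The Python sketch below is a minimal, hedged illustration of quantile regression with a single covariate and of flagging points that fall below the fitted line. It is not the authors' implementation: the function names, the toy data, the choice `tau = 0.3`, and the brute-force search (which relies on the known property that some optimal quantile-regression line passes through two data points) are all assumptions made for illustration.

```python
from itertools import combinations

def pinball_loss(residual, tau):
    """Asymmetric loss: weight tau on positive residuals, 1 - tau on negative ones."""
    return tau * residual if residual >= 0 else (tau - 1.0) * residual

def fit_quantile_line(xs, ys, tau):
    """Brute-force quantile regression for y = a + b * x.

    Tries every line through two data points and keeps the one with
    the smallest total pinball loss. Only practical for small samples.
    """
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # vertical line, not representable as y = a + b * x
        b = (ys[j] - ys[i]) / (xs[j] - xs[i])
        a = ys[i] - b * xs[i]
        loss = sum(pinball_loss(y - (a + b * x), tau) for x, y in zip(xs, ys))
        if best is None or loss < best[0]:
            best = (loss, a, b)
    return best[1], best[2]

def below_line(xs, ys, a, b, tol=1e-9):
    """Indices of points whose measured value is lower than the fitted prediction."""
    return [k for k, (x, y) in enumerate(zip(xs, ys)) if y < a + b * x - tol]

# Toy data: y = 1 + 2x, except that point 4 is measured much lower than predicted.
xs = [float(x) for x in range(10)]
ys = [1.0 + 2.0 * x for x in xs]
ys[4] -= 5.0

a, b = fit_quantile_line(xs, ys, tau=0.3)
print(a, b, below_line(xs, ys, a, b))  # 1.0 2.0 [4]
```

A production analysis would use a linear-programming or iterative solver rather than the cubic-time search shown here; the sketch only makes the loss and the outlier-flagging step concrete.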
Abstract: Bicoid protein molecules, translated from maternally provided bicoid mRNA, establish a concentration gradient in Drosophila early embryonic development. There is experimental evidence that the synthesis and subsequent destruction of this protein is regulated at source by precise control of the stability of the maternal mRNA. Can we infer the driving function at the source from noisy observations of the spatio-temporal protein profile? We use non-parametric Gaussian process regression for modelling the propagation of Bicoid in the embryo and infer aspects of source regulation as a posterior function. With synthetic data from a 1D diffusion model with a source simulated to model mRNA stability regulation, our results establish that the Gaussian process method can accurately infer the driving function and capture the spatio-temporal dynamics of embryonic Bicoid propagation. On real data from the FlyEx database, too, the reconstructed source function is indicative of stability regulation, but is temporally smoother than what we expected, partly due to the fact that the dataset is only partially observed. To be in line with recent thinking on the subject, we also analyse this model with a spatial gradient of maternal mRNA, rather than being fixed at only the anterior pole. Supplementary data are available at Bioinformatics online.
Pub.: 02 Dec '11, Pinned: 29 Jul '17
Abstract: The problem of inferring proteins from complex peptide cocktails (digestion products of biological samples) in a shotgun proteomic workflow places extreme demands on computational resources in terms of the very high processing throughput, rapid processing rates and reliability of results required. This is exacerbated by the fact that, in general, a given protein cannot be defined by a fixed sequence of amino acids due to the existence of splice variants and isoforms of that protein. Therefore, the problem of protein inference could be considered as one of identifying sequences of amino acids with some limited tolerance. In the current paper a model-based hardware acceleration of a structured and practical inference approach is developed and validated on a mass spectrometry experiment of realistic size. We have achieved a maximum speed-up of 10 times in the co-designed workflow compared to a similar software-only workflow run on the processor used for co-design.
Pub.: 25 Dec '14, Pinned: 29 Jul '17
Abstract: Advances in life sciences over the last few decades have led to the generation of a huge amount of biological data. Computing research has become a vital part in driving biological discovery where analysis and categorization of biological data are involved. String matching algorithms can be applied for protein/gene sequence matching, and with the phenomenal increase in the size of string databases to be analyzed, software implementations of these algorithms seem to have hit a hard limit and hardware acceleration is increasingly being sought. Several hardware platforms such as Field Programmable Gate Arrays (FPGA), Graphics Processing Units (GPU) and Chip Multi Processors (CMP) are being explored. In this paper, we give a comprehensive overview of the literature on hardware acceleration of string matching algorithms. We then take an FPGA hardware exploration and expedite the design time by a design automation technique. Further, our design automation is also optimized for better hardware utilization through optimizing the number of peptides that can be represented in an FPGA tile. The results, which are reported in this paper, indicate significant improvements in design time and hardware utilization.
Pub.: 28 Mar '14, Pinned: 29 Jul '17
Abstract: The problem of inferring proteins from complex peptide samples in a shotgun proteomic workflow places extreme demands on computational resources. This is exacerbated by the fact that, in general, a given protein cannot be defined by a fixed sequence of amino acids due to the existence of splice variants and isoforms of that protein. Therefore, the problem of protein inference could be considered as one of identifying sequences of amino acids with some limited tolerance. Two problems arise from this: a) due to these variations, the applicability of exact string matching methodologies could be questioned, and b) it is difficult to define a reference sequence for a particular set of proteins that are functionally indistinguishable, but with some variation in features. This paper presents a model-based inference approach that is developed and validated to solve the inference problem. Our approach starts from an examination of the known set of splice variants and isoforms of a target protein to identify the Greatest Common Stable Substring (GCSS) of amino acids and the Substrings Subject to Limited Variation (SSLV) and their respective locations on the GCSS. Then we define and solve the Sub-string Matching Problem with Limited Tolerance (SMPLT). This approach is validated on identified peptides in a labelled and clustered data set from UNIPROT. Identification of Baylisascaris procyonis infection was used as an application instance that achieved up to 70 times speedup compared to a software-only system. This workflow can be generalised to any inexact multiple pattern matching application by replacing the patterns in a clustered and distributed environment which permits a distance between member strings to account for permitted deviations such as substitutions, insertions and deletions.
Pub.: 25 Dec '14, Pinned: 29 Jul '17
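To make the idea of matching "with some limited tolerance" in the abstract above concrete, the Python sketch below finds every place in a sequence where a pattern occurs within a bounded number of substitutions, insertions and deletions. It uses the standard dynamic-programming approach to approximate substring matching (often attributed to Sellers), not the GCSS/SSLV method or the hardware design from the paper. The function name and the toy sequences are assumptions made for illustration.

```python
def approx_find(pattern, text, max_edits):
    """Find approximate occurrences of pattern inside text.

    Returns a list of (end_index, distance) pairs, one for each position
    in text where some substring ending there is within max_edits edits
    (substitutions, insertions, deletions) of pattern. end_index is exclusive.
    """
    m = len(pattern)
    # Column for the empty text prefix: matching pattern[:i] costs i deletions.
    prev = list(range(m + 1))
    hits = []
    for j, ch in enumerate(text):
        # curr[0] stays 0 so that a match may start at any text position.
        curr = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            curr[i] = min(
                prev[i - 1] + cost,  # match or substitution
                prev[i] + 1,         # extra character in text
                curr[i - 1] + 1,     # pattern character missing from text
            )
        if curr[m] <= max_edits:
            hits.append((j + 1, curr[m]))
        prev = curr
    return hits

# "PEPTLDE" in the text differs from the pattern by one substitution (I -> L).
print(approx_find("PEPTIDE", "MKPEPTLDEQ", max_edits=1))  # [(9, 1)]
```

The running time is proportional to the product of the pattern and text lengths for each pattern, which gives some intuition for why large multi-pattern workloads motivate hardware acceleration.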
Abstract: Reverse vaccinology (RV) is a bioinformatics approach that can predict antigens with protective potential from the protein coding genomes of bacterial pathogens for subunit vaccine design. RV has become firmly established following the development of the BEXSERO® vaccine against Neisseria meningitidis serogroup B. RV studies have begun to incorporate machine learning (ML) techniques to distinguish bacterial protective antigens (BPAs) from non-BPAs. This research contributes significantly to the RV field by using permutation analysis to demonstrate that a signal for protective antigens can be curated from published data. Furthermore, the effects of the following on an ML approach to RV were also assessed: nested cross-validation, balancing selection of non-BPAs for subcellular localization, increasing the training data, and incorporating greater numbers of protein annotation tools for feature generation. These enhancements yielded a support vector machine (SVM) classifier that could discriminate BPAs (n = 200) from non-BPAs (n = 200) with an area under the curve (AUC) of 0.787. In addition, hierarchical clustering of BPAs revealed that intracellular BPAs clustered separately from extracellular BPAs. However, no immediate benefit was derived when training SVM classifiers on data sets exclusively containing intra- or extracellular BPAs. In conclusion, this work demonstrates that ML classifiers have great utility in RV approaches and will lead to new subunit vaccines in the future.
Pub.: 06 Feb '17, Pinned: 29 Jul '17