PhD candidate, University of Washington
Low-quality data leads to low-quality results
Dealing with large volumes of data from a variety of sources has become the norm for biomedical scientists. Genomics, proteomics, imaging, clinical presentation, and more must all be combined in analysis to generate personalized medical plans and treatments. Integrating all of this data is a necessary but often painful step in the analysis pipeline. Cleaning data and making it queryable and standards-compliant may not be the sexiest step on the road to scientific discovery, but its importance is clear, especially as poor experimental reproducibility has become an increasingly recognized problem in recent years.
When we generate data without taking the necessary steps to make it reusable and interoperable, we do the community a disservice. It becomes that much harder for others to reproduce the results, perform meta-analyses, build upon the foundations of the work, or achieve good results when reusing that data in secondary analyses. My research focuses on assessing these risks and proposing ways to improve the interoperability of public data resources.
I use computational techniques to audit and resolve inconsistencies in public databases and datasets. These inconsistencies can occur within or between datasets, but in both cases they negatively affect analysis results derived from the data. For example, representations of biological pathways describing physiological functions are often used to determine which genes contribute the most to a phenotype, such as cancer; this is called pathway enrichment analysis. Different databases have different representations of these pathways, some of which are quite poorly annotated and inconsistent. Choosing a different pathway representation for pathway enrichment analysis can alter the output, yielding a completely different set of genes. This is obviously a problem if we want to pursue those genes as drug targets.
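A toy example makes this concrete. The sketch below uses hypothetical gene identifiers and pathway memberships, with a one-sided hypergeometric test standing in for the many enrichment statistics used by real tools: two databases' versions of the "same" pathway yield very different significance for the same list of differentially expressed genes.

```python
# Toy sketch: the "same" pathway as annotated by two hypothetical databases,
# tested for enrichment against one list of differentially expressed genes.
# Gene names, set sizes, and memberships are all illustrative.
from math import comb

background = {f"GENE{i}" for i in range(1000)}  # all assayed genes
hits = {f"GENE{i}" for i in range(40)}          # differentially expressed genes

# Two databases annotate the "same" pathway with different member genes
pathway_db_a = {f"GENE{i}" for i in range(20, 60)}
pathway_db_b = {f"GENE{i}" for i in range(35, 120)}

def enrichment_p(pathway, hits, background):
    """One-sided hypergeometric test: P(overlap >= observed) by chance."""
    N, K, n = len(background), len(pathway), len(hits)
    k = len(hits & pathway)
    return sum(comb(K, j) * comb(N - K, n - j)
               for j in range(k, min(K, n) + 1)) / comb(N, n)

for name, pw in [("DB-A", pathway_db_a), ("DB-B", pathway_db_b)]:
    print(name, f"p = {enrichment_p(pw, hits, background):.3g}")
```

The underlying biology has not changed, only the annotation, yet one version of the pathway looks strongly enriched while the other may not pass a significance threshold at all.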
Instead of addressing the low quality of the analysis output, I believe we should be addressing data quality problems at the root. When contributions from basic biological science are converted into structured data for secondary use, that structured data should be clean, well annotated, and audited regularly for errors and inconsistencies. Only then can we confidently apply this data to other tasks, and speed up the development of novel medical treatments.
Abstract: The invention includes a medical document handling system and method and automated coding systems and methods for assigning predetermined medical codes to medical documents based on the documents' contents. The invention functions by analyzing electronic medical records and extracting medical information using natural language processing and machine learning. The system collects and amalgamates medical documentation in various formats from multiple sources and locations, normalizes the information, analyzes the information, recognizes information indicating contents corresponding to classification codes, assigns classification codes, and presents information in context correlated to medical records for billing and other purposes.
Pub.: 19 Jun '07, Pinned: 03 Jul '17
Abstract: A system for managing and exchanging electronic information provides a rules management component for executing conceptual rules, an ontology management component, an information model management component, and a system configuration management component. The ontology management component manages at least one ontology and mappings between members of different ontologies. The ontologies may include a code system and a terminology. The ontology management component may manage a value set that is a subset of the terminology. The information model management component manages one or more information model schemas, each defining an information model and comprising information defining at least one slot within the information model. The system configuration management component manages configuration information on the configuration of each system component. The system configuration component utilizes services of the rules management component, information model management component and ontology management component to dynamically bind value sets to slots of the information model.
Pub.: 05 Jul '16, Pinned: 03 Jul '17
Abstract: Ontology matching is a growing field of research that is of critical importance for the semantic web initiative. The use of background knowledge for ontology matching is often a key factor for success, particularly in complex and lexically rich domains such as the life sciences. However, in most ontology matching systems, the background knowledge sources are either predefined by the system or have to be provided by the user. In this paper, we present a novel methodology for automatically selecting background knowledge sources for any given ontologies to match. This methodology measures the usefulness of each background knowledge source by assessing the fraction of classes mapped through it over those mapped directly, which we call the mapping gain. We implemented this methodology in the AgreementMakerLight ontology matching framework, and evaluated it using the benchmark biomedical ontology matching tasks from the Ontology Alignment Evaluation Initiative (OAEI) 2013. In each matching problem, our methodology consistently identified the sources of background knowledge that led to the highest improvements over the baseline alignment (i.e., without background knowledge). Furthermore, our proposed mapping gain parameter is strongly correlated with the F-measure of the produced alignments, thus making it a good estimator for ontology matching techniques based on background knowledge.
Pub.: 08 Nov '14, Pinned: 03 Jul '17
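The "mapping gain" described in the abstract above can be sketched in a few lines. This is one reading of the fraction as stated, with hypothetical class-ID pairs; the actual AgreementMakerLight implementation is more involved.

```python
# Minimal sketch of the "mapping gain" idea: given the mappings found
# directly and the mappings found via a background knowledge (BK) source,
# the gain is the fraction of extra mappings the BK source contributes
# relative to the direct ones. Class-ID pairs here are hypothetical.

def mapping_gain(direct, via_bk):
    """Fraction of classes mapped through the BK source over those mapped directly."""
    extra = via_bk - direct
    return len(extra) / len(direct)

direct = {("A:1", "B:1"), ("A:2", "B:2"), ("A:3", "B:3"), ("A:4", "B:4")}
via_bk = {("A:5", "B:5"), ("A:6", "B:6")} | direct  # BK adds two new mappings
print(mapping_gain(direct, via_bk))  # 2 extra / 4 direct = 0.5
```

A BK source with a high gain contributes many mappings that direct matching misses, which is why the parameter tracks the final F-measure so closely.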
Abstract: A major goal in genomic research is to identify genes that may jointly influence a biological response. From many years of intensive biomedical research, a large body of biological knowledge, or pathway information, has accumulated in available databases. There is a strong interest in leveraging these pathways to improve the statistical power and interpretability in studying gene networks associated with complex phenotypes. This prior information is a valuable complement to large-scale genomic data such as gene expression data generated from microarrays. However, it is a non-trivial task to effectively integrate available biological knowledge into gene expression data when reconstructing gene networks. In this article, we developed and applied a Lasso method from a Bayesian perspective, a method we call prior Lasso (pLasso), for the reconstruction of gene networks. In this method, we partition edges between genes into two subsets: one subset of edges is present in known pathways, whereas the other has no prior information associated. Our method assigns different prior distributions to each subset according to a modified Bayesian information criterion that incorporates prior knowledge on both the network structure and the pathway information. Simulation studies have indicated that the method is more effective in recovering the underlying network than a traditional Lasso method that does not use the prior information. We applied pLasso to microarray gene expression datasets, where we used information from the Pathway Commons (PC) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) as prior information for the network reconstruction, and successfully identified network hub genes associated with clinical outcome in cancer patients. The source code is available at http://nba.uth.tmc.edu/homepage/liu/pLasso.
Pub.: 21 Aug '13, Pinned: 03 Jul '17
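The core mechanism of a prior-weighted Lasso like pLasso can be illustrated with the coordinate-wise soft-thresholding operator that Lasso solvers apply: edges with prior pathway support receive a weaker L1 penalty and so survive shrinkage more easily. The penalty values and the estimate below are illustrative, not taken from the paper.

```python
# Sketch of differential shrinkage: an edge appearing in a known pathway
# gets a weaker penalty than an edge with no prior information.
# Penalty values are illustrative, not those of pLasso itself.

def soft_threshold(z, penalty):
    """Lasso proximal step: shrink z toward zero by `penalty`."""
    if z > penalty:
        return z - penalty
    if z < -penalty:
        return z + penalty
    return 0.0

LAMBDA_PRIOR = 0.2   # weak penalty: edge is present in a known pathway
LAMBDA_NONE  = 0.8   # strong penalty: edge has no prior support

z = 0.5  # hypothetical unpenalized estimate for an edge weight
print(soft_threshold(z, LAMBDA_PRIOR))  # edge with prior support survives: 0.3
print(soft_threshold(z, LAMBDA_NONE))   # edge without prior is zeroed out: 0.0
```

The same data-driven evidence thus keeps a pathway-supported edge in the network while dropping an unsupported one, which is how the prior improves recovery of the true structure.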
Abstract: GeneMANIA (http://www.genemania.org) is a flexible, user-friendly web interface for generating hypotheses about gene function, analyzing gene lists and prioritizing genes for functional assays. Given a query list, GeneMANIA extends the list with functionally similar genes that it identifies using available genomics and proteomics data. GeneMANIA also reports weights that indicate the predictive value of each selected data set for the query. Six organisms are currently supported (Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Mus musculus, Homo sapiens and Saccharomyces cerevisiae) and hundreds of data sets have been collected from GEO, BioGRID, Pathway Commons and I2D, as well as organism-specific functional genomics data sets. Users can select arbitrary subsets of the data sets associated with an organism to perform their analyses and can upload their own data sets to analyze. The GeneMANIA algorithm performs as well or better than other gene function prediction methods on yeast and mouse benchmarks. The high accuracy of the GeneMANIA prediction algorithm, an intuitive user interface and large database make GeneMANIA a useful tool for any biologist.
Pub.: 02 Jul '10, Pinned: 03 Jul '17
Abstract: Pathway Commons (http://www.pathwaycommons.org) is a collection of publicly available pathway data from multiple organisms. Pathway Commons provides a web-based interface that enables biologists to browse and search a comprehensive collection of pathways from multiple sources represented in a common language, a download site that provides integrated bulk sets of pathway information in standard or convenient formats and a web service that software developers can use to conveniently query and access all data. Database providers can share their pathway data via a common repository. Pathways include biochemical reactions, complex assembly, transport and catalysis events and physical interactions involving proteins, DNA, RNA, small molecules and complexes. Pathway Commons aims to collect and integrate all public pathway data available in standard formats. Pathway Commons currently contains data from nine databases with over 1400 pathways and 687,000 interactions and will be continually expanded and updated.
Pub.: 13 Nov '10, Pinned: 03 Jul '17
Abstract: Issues and limitations related to accessibility, understandability and ease of use of signalling pathway databases may hamper or divert research workflow, leading, in the worst case, to the generation of confusing reference frameworks and misinterpretation of experimental results. In an attempt to retrieve signalling pathway data related to a specific set of test genes, we queried and analysed the results from six of the major curated signalling pathway databases: Reactome, PathwayCommons, KEGG, InnateDB, PID, and Wikipathways. Although we expected differences - often a desirable feature for the integration of each individual query - we observed variations of exceptional magnitude, with disproportionate quality and quantity of the results. Some of the more remarkable differences can be explained by the diverse conceptual designs and purposes of the databases, the types of data stored and the structure of the query, as well as by missing or erroneous descriptions of the search procedure. To go beyond the mere enumeration of these problems, we identified a number of operational features, in particular inner and cross coherence, which, once quantified, offer objective criteria to choose the best source of information. In silico biology heavily relies on the information stored in databases. To ensure that computational biology mirrors biological reality and offers focused hypotheses to be experimentally validated, coherence of data codification is crucial and yet highly underestimated. We make practical recommendations for the end-user to cope with the current state of the databases as well as for the maintainers of those databases to contribute to the goal of the full enactment of the open data paradigm.
Pub.: 15 Aug '13, Pinned: 03 Jul '17
Abstract: Pathway analysis has become the first choice for gaining insight into the underlying biology of differentially expressed genes and proteins, as it reduces complexity and has increased explanatory power. We discuss the evolution of knowledge base-driven pathway analysis over its first decade, distinctly divided into three generations. We also discuss the limitations that are specific to each generation, and how they are addressed by successive generations of methods. We identify a number of annotation challenges that must be addressed to enable development of the next generation of pathway analysis methods. Furthermore, we identify a number of methodological challenges that the next generation of methods must tackle to take advantage of the technological advances in genomics and proteomics in order to improve specificity, sensitivity, and relevance of pathway analysis.
Pub.: 03 Mar '12, Pinned: 30 Jun '17
Abstract: Several challenges to the field of ontology matching have been outlined in recent research. The selection of the appropriate similarity measures as well as the configuration tuning of their combination are known as fundamental issues the community should deal with. Verifying the semantic coherence of the discovered alignment is also known as a crucial task. As the challenging issues are both in basic matching techniques and in their combination, our approach is aimed to provide improvement at the basic matcher level and also at the level of framework. Matching large scale ontologies is currently one of the most challenging issues in the ontology matching field. The main reason is that large ontologies are highly heterogeneous both at terminological and conceptual levels. Furthermore, matching very large ontologies entails exploring a very large searching space to discover correspondences. It may also require a huge amount of main memory to maintain the temporary results at each computational step. These factors strongly impact the effectiveness and efficiency of any ontology matching tool. To overcome these issues, we have developed a disk-based ontology matching approach. The underlying idea of our approach is that the complexity and therefore the cost of the matching algorithms are reduced thanks to the indexing data structures by avoiding exhaustive pair-wise comparisons. Indeed, we extensively used indexing techniques in many places. For example, we defined a bitmap encoding the structural information of an ontology. This indexing structure will be exploited for accelerating similarity propagation. Moreover, our approach uses a disk-based mechanism to store temporary data. This allows any ontology matching task to be performed on a simple PC or laptop instead of a powerful server. In this paper, we describe YAM++, an ontology matching tool, aimed at solving these issues. We evaluated the efficiency of YAM++ in various OAEI 2012 and OAEI 2013 tracks.
YAM++ was one of the best ontology matching systems in terms of F-measure. Most notably, the current version of YAM++ has passed all scalability and large scale ontology matching tests and obtained high matching quality results.
Pub.: 15 Oct '16, Pinned: 30 Jun '17
Abstract: The study of biological pathways is key to a large number of systems analyses. However, many relevant tools consider a limited number of pathway sources, missing out on many genes and gene-to-gene connections. Simply pooling several pathway sources would result in redundancy and the lack of systematic pathway interrelations. To address this, we exercised a combination of hierarchical clustering and nearest neighbor graph representation, with judiciously selected cutoff values, thereby consolidating 3215 human pathways from 12 sources into a set of 1073 SuperPaths. Our unification algorithm finds a balance between reducing redundancy and optimizing the level of pathway-related informativeness for individual genes. We show a substantial enhancement of the SuperPaths' capacity to infer gene-to-gene relationships when compared with individual pathway sources, separately or taken together. Further, we demonstrate that the chosen 12 sources entail nearly exhaustive gene coverage. The computed SuperPaths are presented in a new online database, PathCards, showing each SuperPath, its constituent network of pathways, and its contained genes. This provides researchers with a rich, searchable systems analysis resource. Database URL: http://pathcards.genecards.org/
Pub.: 01 Mar '15, Pinned: 30 Jun '17
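The consolidation step described above can be caricatured as a greedy merge of gene sets by Jaccard similarity. The paper itself uses hierarchical clustering and nearest-neighbor graphs; the gene sets, pathway names, and cutoff below are illustrative only.

```python
# Toy sketch of pathway consolidation: greedily merge pathways (gene sets)
# whose Jaccard similarity to an existing group exceeds a cutoff.
# Pathway contents and the cutoff value are hypothetical.

def jaccard(a, b):
    return len(a & b) / len(a | b)

def consolidate(pathways, cutoff=0.5):
    merged = []
    for genes in pathways.values():
        for group in merged:
            if jaccard(group, genes) >= cutoff:
                group |= genes          # fold into an existing "SuperPath"
                break
        else:
            merged.append(set(genes))   # start a new group
    return merged

pathways = {
    "db1:apoptosis":  {"TP53", "BAX", "CASP3", "CASP9"},
    "db2:apoptosis":  {"TP53", "BAX", "CASP3", "BCL2"},
    "db1:glycolysis": {"HK1", "PFKM", "PKM"},
}
supers = consolidate(pathways)
print(len(supers))  # the two apoptosis variants merge: 2 groups remain
```

Note that this greedy version is order-dependent, which is exactly the kind of artifact a proper hierarchical clustering with tuned cutoffs avoids.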
Abstract: The introduction of omics data and advances in technologies involved in clinical treatment has led to a broad range of approaches to represent clinical information. Within this context, patient stratification across health institutions due to omic profiling presents a complex scenario to carry out multi-center clinical trials. This paper presents a standards-based approach to ensure semantic integration required to facilitate the analysis of clinico-genomic clinical trials. To ensure interoperability across different institutions, we have developed a Semantic Interoperability Layer (SIL) to facilitate homogeneous access to clinical and genetic information, based on different well-established biomedical standards and following International Health (IHE) recommendations. The SIL has shown suitability for integrating biomedical knowledge and technologies to match the latest clinical advances in healthcare and the use of genomic information. This genomic data integration in the SIL has been tested with a diagnostic classifier tool that takes advantage of harmonized multi-center clinico-genomic data for training statistical predictive models. The SIL has been adopted in national and international research initiatives, such as the EURECA-EU research project and the CIMED collaborative Spanish project, where the proposed solution has been applied and evaluated by clinical experts focused on clinico-genomic studies.
Pub.: 11 Jun '17, Pinned: 30 Jun '17
Abstract: In the summer of 2016 an international group of biomedical and health informatics faculty and graduate students gathered for the 16th meeting of the International Partnership in Health Informatics Education (IPHIE) masterclass at the University of Utah campus in Salt Lake City, Utah. This international biomedical and health informatics workshop was created to share knowledge and explore issues in biomedical health informatics (BHI). The goal of this paper is to summarize the discussions of biomedical and health informatics graduate students who were asked to define interoperability, and make critical observations to gather insight on how to improve biomedical education. Students were assigned to one of four groups and asked to define interoperability and explore potential solutions to current problems of interoperability in health care. We summarize here the student reports on the importance and possible solutions to the "interoperability problem" in biomedical informatics. Reports are provided from each of the four groups of highly qualified graduate students from leading BHI programs in the US, Europe and Asia. International workshops such as IPHIE provide a unique opportunity for graduate student learning and knowledge sharing. BHI faculty are encouraged to incorporate into their curriculum opportunities to exercise and strengthen student critical thinking to prepare our students for solving health informatics problems in the future.
Pub.: 22 Jun '17, Pinned: 30 Jun '17