A pinboard by
Lucy Wang

PhD candidate, University of Washington


Low-quality data leads to low-quality results

Dealing with large volumes of data from a variety of sources has become the norm for biomedical scientists. Genomics, proteomics, imaging, clinical presentation, and more must all be combined in analysis to generate personalized medical plans and treatments. Integrating all this data is a necessary but often painful step in the analysis pipeline. Cleaning data and making it queryable and adherent to standards may not be the sexiest step on the road to scientific discovery, but its importance is clear, especially as poor experimental reproducibility has become more widely recognized in recent years.

When we generate data without taking the necessary steps to make that data reusable and interoperable, we are doing the community a disservice. It becomes that much harder for others to reproduce our results, perform meta-analyses, build upon the foundations of our work, or achieve good results when using that data for secondary analysis. My research focuses on assessing these risks and proposing ways to improve the interoperability of public data resources.

I use computational techniques to audit and resolve inconsistencies in public databases and datasets. These inconsistencies can occur within or between datasets, but in both cases they negatively affect analysis results derived from the data. For example, representations of biological pathways describing physiological functions are often used to determine which genes contribute the most to a phenotype, like cancer; this is called pathway enrichment analysis. Different databases have different representations of these pathways, some of which can be quite poorly annotated and inconsistent. Choosing different pathway representations for pathway enrichment analysis can alter the output, yielding a completely different set of genes. This is obviously a problem if we want to pursue those genes as drug targets.
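To make that failure mode concrete, here is a toy over-representation (hypergeometric) test, the statistical core of many pathway enrichment tools. The gene sets are invented stand-ins for two databases' versions of the "same" pathway; none of the annotations are real.

```python
from math import comb

# Hypothetical example: the "same" apoptosis pathway as annotated by two
# different databases (gene sets are illustrative, not real annotations).
pathway_db_a = {"TP53", "CASP3", "CASP9", "BAX", "BCL2"}
pathway_db_b = {"TP53", "CASP3", "FAS", "FADD", "BID", "BCL2L1"}

# Differentially expressed genes from a made-up experiment.
hits = {"TP53", "CASP3", "BAX", "EGFR", "MYC"}

def enrichment_p(pathway, hits, n_genes=20000):
    """One-sided hypergeometric test: P(overlap >= observed) by chance."""
    k = len(pathway & hits)            # observed overlap
    big_k, n = len(pathway), len(hits)
    return sum(
        comb(big_k, i) * comb(n_genes - big_k, n - i)
        for i in range(k, min(big_k, n) + 1)
    ) / comb(n_genes, n)

p_a = enrichment_p(pathway_db_a, hits)  # 3 of 5 pathway genes are hits
p_b = enrichment_p(pathway_db_b, hits)  # only 2 of 6 pathway genes are hits
```

Swapping pathway definitions changes the overlap and the pathway size, and with them the p-value: the identical experiment can look strongly enriched under one database's annotation and much weaker under another's.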

Instead of addressing the low quality of the analysis output, I believe we should be addressing data quality problems at the root. When contributions from basic biological science are converted into structured data for secondary use, that structured data should be clean, well annotated, and audited regularly for errors and inconsistencies. Only then can we confidently apply this data to other tasks, and speed up the development of novel medical treatments.


Automatic background knowledge selection for matching biomedical ontologies.

Abstract: Ontology matching is a growing field of research that is of critical importance for the semantic web initiative. The use of background knowledge for ontology matching is often a key factor for success, particularly in complex and lexically rich domains such as the life sciences. However, in most ontology matching systems, the background knowledge sources are either predefined by the system or have to be provided by the user. In this paper, we present a novel methodology for automatically selecting background knowledge sources for any given ontologies to match. This methodology measures the usefulness of each background knowledge source by assessing the fraction of classes mapped through it over those mapped directly, which we call the mapping gain. We implemented this methodology in the AgreementMakerLight ontology matching framework, and evaluated it using the benchmark biomedical ontology matching tasks from the Ontology Alignment Evaluation Initiative (OAEI) 2013. In each matching problem, our methodology consistently identified the sources of background knowledge that led to the highest improvements over the baseline alignment (i.e., without background knowledge). Furthermore, our proposed mapping gain parameter is strongly correlated with the F-measure of the produced alignments, thus making it a good estimator for ontology matching techniques based on background knowledge.
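A rough sketch of the mapping gain idea, in my own simplified form rather than AgreementMakerLight's actual implementation: count the mappings that a background knowledge source contributes beyond direct matching, relative to the direct mappings. The class identifiers below are invented for illustration.

```python
# Toy alignments as sets of (source class, target class) pairs.
# All identifiers are made up for illustration.
direct_alignment = {("DOID:1234", "HP:0001"), ("DOID:5678", "HP:0002"),
                    ("DOID:9012", "HP:0003")}

# Alignment produced when a background ontology mediates extra matches.
alignment_with_background = direct_alignment | {("DOID:3456", "HP:0004"),
                                                ("DOID:7890", "HP:0005")}

def mapping_gain(direct, with_background):
    """Fraction of classes mapped through background knowledge over
    those mapped directly (a simplified reading of the paper's parameter)."""
    gained = with_background - direct
    return len(gained) / len(direct)

gain = mapping_gain(direct_alignment, alignment_with_background)  # 2 / 3
```

A source with high gain mediates many mappings the direct matcher misses, which is why the parameter can stand in as an estimator of how much a given background ontology will help.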

Pub.: 08 Nov '14, Pinned: 03 Jul '17

Incorporating prior knowledge into Gene Network Study.

Abstract: A major goal in genomic research is to identify genes that may jointly influence a biological response. From many years of intensive biomedical research, a large body of biological knowledge, or pathway information, has accumulated in available databases. There is a strong interest in leveraging these pathways to improve the statistical power and interpretability in studying gene networks associated with complex phenotypes. This prior information is a valuable complement to large-scale genomic data such as gene expression data generated from microarrays. However, it is a non-trivial task to effectively integrate available biological knowledge into gene expression data when reconstructing gene networks. In this article, we developed and applied a Lasso method from a Bayesian perspective, a method we call prior Lasso (pLasso), for the reconstruction of gene networks. In this method, we partition edges between genes into two subsets: one subset of edges is present in known pathways, whereas the other has no prior information associated. Our method assigns different prior distributions to each subset according to a modified Bayesian information criterion that incorporates prior knowledge on both the network structure and the pathway information. Simulation studies have indicated that the method is more effective in recovering the underlying network than a traditional Lasso method that does not use the prior information. We applied pLasso to microarray gene expression datasets, where we used information from the Pathway Commons (PC) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) as prior information for the network reconstruction, and successfully identified network hub genes associated with clinical outcome in cancer patients. The source code is available at http://nba.uth.tmc.edu/homepage/liu/pLasso
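The core intuition can be sketched as follows, in a form I have invented for illustration (not the paper's actual Bayesian estimator): edges supported by known pathways receive a weaker L1 penalty, so the Lasso's soft-thresholding step is less likely to zero them out. The edge names and penalty values are hypothetical.

```python
# Edges with prior pathway support (hypothetical examples).
known_pathway_edges = {("TP53", "MDM2"), ("EGFR", "GRB2")}

def penalty_weight(edge, lam=1.0, prior_factor=0.3):
    # Prior-supported edges are penalized less heavily.
    return lam * prior_factor if edge in known_pathway_edges else lam

def soft_threshold(z, w):
    # The Lasso coordinate update: shrink toward zero by the penalty weight.
    return (abs(z) - w) * (1 if z > 0 else -1) if abs(z) > w else 0.0

# The same raw coefficient survives for a supported edge but is zeroed
# out for an unsupported one.
kept = soft_threshold(0.5, penalty_weight(("TP53", "MDM2")))   # 0.5 - 0.3 = 0.2
dropped = soft_threshold(0.5, penalty_weight(("FOO", "BAR")))  # 0.0
```

The differential shrinkage is what lets prior knowledge tilt the reconstruction toward biologically plausible edges without forbidding novel ones outright.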

Pub.: 21 Aug '13, Pinned: 03 Jul '17

Signalling pathway database usability: lessons learned.

Abstract: Issues and limitations related to accessibility, understandability and ease of use of signalling pathway databases may hamper or divert research workflow, leading, in the worst case, to the generation of confusing reference frameworks and misinterpretation of experimental results. In an attempt to retrieve signalling pathway data related to a specific set of test genes, we queried and analysed the results from six of the major curated signalling pathway databases: Reactome, PathwayCommons, KEGG, InnateDB, PID, and Wikipathways. Although we expected differences - often a desirable feature for the integration of each individual query - we observed variations of exceptional magnitude, with disproportionate quality and quantity of the results. Some of the more remarkable differences can be explained by the diverse conceptual designs and purposes of the databases, the types of data stored and the structure of the query, as well as by missing or erroneous descriptions of the search procedure. To go beyond the mere enumeration of these problems, we identified a number of operational features, in particular inner and cross coherence, which, once quantified, offer objective criteria to choose the best source of information. In silico biology heavily relies on the information stored in databases. To ensure that computational biology mirrors biological reality and offers focused hypotheses to be experimentally validated, coherence of data codification is crucial and yet highly underestimated. We make practical recommendations for the end-user to cope with the current state of the databases as well as for the maintainers of those databases to contribute to the goal of the full enactment of the open data paradigm.

Pub.: 15 Aug '13, Pinned: 03 Jul '17

Overview of YAM++ — (not) Yet Another Matcher for ontology alignment task

Abstract: Several challenges to the field of ontology matching have been outlined in recent research. The selection of the appropriate similarity measures as well as the configuration tuning of their combination are known as fundamental issues the community should deal with. Verifying the semantic coherence of the discovered alignment is also known as a crucial task. As the challenging issues are both in basic matching techniques and in their combination, our approach is aimed at providing improvement at the basic matcher level and also at the level of the framework. Matching large-scale ontologies is currently one of the most challenging issues in the ontology matching field. The main reason is that large ontologies are highly heterogeneous at both terminological and conceptual levels. Furthermore, matching very large ontologies entails exploring a very large search space to discover correspondences. It may also require a huge amount of main memory to maintain the temporary results at each computational step. These factors strongly impact the effectiveness and efficiency of any ontology matching tool. To overcome these issues, we have developed a disk-based ontology matching approach. The underlying idea of our approach is that the complexity, and therefore the cost, of the matching algorithms is reduced thanks to indexing data structures that avoid exhaustive pair-wise comparisons. Indeed, we extensively used indexing techniques in many places. For example, we defined a bitmap encoding the structural information of an ontology. This indexing structure is exploited for accelerating similarity propagation. Moreover, our approach uses a disk-based mechanism to store temporary data. This allows any ontology matching task to be performed on a simple PC or laptop instead of a powerful server. In this paper, we describe YAM++, an ontology matching tool aimed at solving these issues. We evaluated the efficiency of YAM++ in various OAEI 2012 and OAEI 2013 tracks. YAM++ was one of the best ontology matching systems in terms of F-measure. Most notably, the current version of YAM++ has passed all scalability and large scale ontology matching tests and obtained high matching quality results.
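To illustrate the kind of bitmap indexing the abstract alludes to (a toy of my own construction, not YAM++'s actual data structure): encoding each class's ancestor set as a bit vector turns structural comparison into cheap bitwise operations instead of repeated set traversals. The tiny ontology fragment below is invented.

```python
# A tiny made-up ontology fragment: class -> its ancestors.
ancestors = {
    "thing": [],
    "organ": ["thing"],
    "heart": ["thing", "organ"],
    "heart valve": ["thing", "organ", "heart"],
}
index = {name: i for i, name in enumerate(ancestors)}

def ancestor_bits(cls):
    """Pack a class's ancestor set into an integer bitmap."""
    bits = 0
    for a in ancestors[cls]:
        bits |= 1 << index[a]
    return bits

def structural_similarity(c1, c2):
    """Jaccard overlap of ancestor sets, computed bitwise."""
    b1, b2 = ancestor_bits(c1), ancestor_bits(c2)
    union = b1 | b2
    if union == 0:
        return 0.0
    return bin(b1 & b2).count("1") / bin(union).count("1")

sim = structural_similarity("heart", "heart valve")  # shares 2 of 3 ancestors
```

Because each comparison is a couple of integer operations, this style of index is one way a matcher can scale structural similarity to very large ontologies without exhaustive pairwise set intersection.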

Pub.: 15 Oct '16, Pinned: 30 Jun '17

A semantic interoperability approach to support integration of gene expression and clinical data in breast cancer.

Abstract: The introduction of omics data and advances in technologies involved in clinical treatment have led to a broad range of approaches to represent clinical information. Within this context, patient stratification across health institutions due to omic profiling presents a complex scenario for carrying out multi-center clinical trials. This paper presents a standards-based approach to ensure the semantic integration required to facilitate the analysis of clinico-genomic clinical trials. To ensure interoperability across different institutions, we have developed a Semantic Interoperability Layer (SIL) to facilitate homogeneous access to clinical and genetic information, based on different well-established biomedical standards and following Integrating the Healthcare Enterprise (IHE) recommendations. The SIL has shown suitability for integrating biomedical knowledge and technologies to match the latest clinical advances in healthcare and the use of genomic information. This genomic data integration in the SIL has been tested with a diagnostic classifier tool that takes advantage of harmonized multi-center clinico-genomic data for training statistical predictive models. The SIL has been adopted in national and international research initiatives, such as the EURECA-EU research project and the CIMED collaborative Spanish project, where the proposed solution has been applied and evaluated by clinical experts focused on clinico-genomic studies.

Pub.: 11 Jun '17, Pinned: 30 Jun '17

Solving Interoperability in Translational Health. Perspectives of Students from the International Partnership in Health Informatics Education (IPHIE) 2016 Master Class.

Abstract: In the summer of 2016, an international group of biomedical and health informatics faculty and graduate students gathered for the 16th meeting of the International Partnership in Health Informatics Education (IPHIE) masterclass at the University of Utah campus in Salt Lake City, Utah. This international biomedical and health informatics workshop was created to share knowledge and explore issues in biomedical health informatics (BHI). The goal of this paper is to summarize the discussions of biomedical and health informatics graduate students who were asked to define interoperability, and make critical observations to gather insight on how to improve biomedical education. Students were assigned to one of four groups and asked to define interoperability and explore potential solutions to current problems of interoperability in health care. We summarize here the student reports on the importance and possible solutions to the "interoperability problem" in biomedical informatics. Reports are provided from each of the four groups of highly qualified graduate students from leading BHI programs in the US, Europe and Asia. International workshops such as IPHIE provide a unique opportunity for graduate student learning and knowledge sharing. BHI faculty are encouraged to incorporate into their curriculum opportunities to exercise and strengthen student critical thinking to prepare our students for solving health informatics problems in the future.

Pub.: 22 Jun '17, Pinned: 30 Jun '17