PhD candidate, University of Maryland
Human-Computer Interaction for Data Science, or Human-Data Interaction
Text mining extracts valuable insights from a text corpus. Many interesting problems in text mining, such as identifying characteristics of a group of documents, selecting high-quality comments to promote, or describing an image, are open-ended tasks where no ground truth exists. Humans must still provide world knowledge, reasoning, and context for these tasks. However, relying on human judgment does not scale to large corpora, and automating these tasks is proving to be a challenging problem. While sophisticated text mining algorithms are becoming increasingly proficient at extracting themes, identifying insightful documents, or labeling images, the lack of formative evaluation makes it difficult to assess and improve them.
My research proposes a general framework for transforming state-of-the-art text mining algorithms into interactive analytics processes using visual representations. First, the output of the system can be explored using interactive visualization. ParallelSpaces examines how users understand the results of topic modeling for Yelp business reviews: businesses and their reviews each constitute a separate visual space, and exploring these spaces enables the characterization of each space using the other. TopicLens is a Magic Lens-type interaction technique in which the documents under the lens are clustered according to topics in real time. Second, based on their understanding of the output, users can directly manipulate the model parameters. CommentIQ is a comment moderation tool in which moderators can adjust model parameters according to the context or goals. Third, based on their understanding of the output through visualizations, users can sculpt features for a concept that they can then use in document scoring. ConceptVector uses word embeddings to support this process. Finally, based on their understanding of the output, users can teach or improve a specific part of the model with a teaching dataset. My planned future project, CaptionViz, visualizes the output of an image caption generation model, and users can improve the model's performance by feeding it a complementary dataset called the "teaching set."
Abstract: Topic modeling has been widely used for analyzing text document collections. Recently, there have been significant advancements in various topic modeling techniques, particularly in the form of probabilistic graphical modeling. State-of-the-art techniques such as Latent Dirichlet Allocation (LDA) have been successfully applied in visual text analytics. However, most of the widely used methods based on probabilistic modeling have drawbacks in terms of consistency across multiple runs and empirical convergence. Furthermore, due to the complexity of its formulation and algorithm, LDA cannot easily incorporate various types of user feedback. To tackle this problem, we propose a reliable and flexible visual analytics system for topic modeling called UTOPIAN (User-driven Topic modeling based on Interactive Nonnegative Matrix Factorization). Centered around its semi-supervised formulation, UTOPIAN enables users to interact with the topic modeling method and steer the result in a user-driven manner. We demonstrate the capability of UTOPIAN via several usage scenarios with real-world document corpora such as the InfoVis/VAST paper data set and product review data sets.
Pub.: 21 Sep '13, Pinned: 27 Jun '17
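As a rough illustration of the nonnegative matrix factorization idea named in the abstract above, the sketch below factors a tiny term-document count matrix into term-topic and topic-document factors using the standard multiplicative update rules. It is plain, unsupervised NMF, not UTOPIAN's semi-supervised formulation; the toy vocabulary, the counts, and all function names (matmul, transpose, frob_error, nmf, top_terms) are assumptions made up for this example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def frob_error(v, w, h):
    """Squared Frobenius norm of v - w*h (the reconstruction error)."""
    approx = matmul(w, h)
    return sum((x - y) ** 2 for rv, ra in zip(v, approx) for x, y in zip(rv, ra))


def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factor nonnegative v (terms x docs) into w (terms x k) and h (k x docs).

    Uses multiplicative updates, which keep every entry nonnegative.
    """
    rng = random.Random(seed)
    n_terms, n_docs = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n_terms)]
    h = [[rng.random() + 0.1 for _ in range(n_docs)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_docs)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n_terms)]
    return w, h


def top_terms(w, terms, topic, n=2):
    """Return the n terms with the largest weight in the given topic column."""
    ranked = sorted(range(len(terms)), key=lambda i: w[i][topic], reverse=True)
    return [terms[i] for i in ranked[:n]]


# Toy data (hypothetical): rows are terms, columns are four short reviews.
TERMS = ["pizza", "pasta", "battery", "screen"]
V = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 3],
]

W, H = nmf(V, k=2)
for t in range(2):
    print("topic", t, top_terms(W, TERMS, t))
print("reconstruction error:", round(frob_error(V, W, H), 3))
```

Because both factors stay nonnegative, each column of W can be read as a weighted list of terms and each column of H as a mix of topics for one document. A steerable system would add further terms to the objective to encode user feedback, which this sketch does not attempt.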
Pub.: 12 Sep '15, Pinned: 27 Jun '17
Abstract: Topic modeling, which reveals underlying topics of a document corpus, has been actively adopted in visual analytics for large-scale document collections. However, due to its significant processing time and non-interactive nature, topic modeling has so far not been tightly integrated into a visual analytics workflow. Instead, most such systems are limited to utilizing a fixed, initial set of topics. Motivated by this gap in the literature, we propose a novel interaction technique called TopicLens that allows a user to dynamically explore data through a lens interface where topic modeling and the corresponding 2D embedding are efficiently computed on the fly. To support this interaction in real time while maintaining view consistency, we propose a novel efficient topic modeling method and a semi-supervised 2D embedding algorithm. Our work is based on improving state-of-the-art methods such as nonnegative matrix factorization and t-distributed stochastic neighbor embedding. Furthermore, we have built a web-based visual analytics system integrated with TopicLens. We use this system to measure the performance and the visualization quality of our proposed methods. We provide several scenarios showcasing the capability of TopicLens using real-world datasets.
Pub.: 23 Nov '16, Pinned: 27 Jun '17