A pinboard by
Deok Gun Park

PhD candidate, University of Maryland


Human-Computer Interaction for Data Science, or Human-Data Interaction

Text mining extracts valuable insights from a text corpus. Many interesting problems in text mining, such as identifying the characteristics of a group of documents, selecting high-quality comments to promote, or describing an image, are open-ended tasks for which no ground truth exists. Humans must still provide world knowledge, reasoning, and context for these tasks. However, relying on human effort does not scale to large corpora, and automating these tasks is proving to be a challenging problem. While sophisticated text mining algorithms are becoming increasingly proficient at extracting themes, identifying insightful documents, and labeling images, the lack of formative evaluation makes it difficult to assess and improve them.

My research suggests a general framework for transforming state-of-the-art text mining algorithms into an interactive analytics process using visual representations. First, the output of the system can be explored using interactive visualization. ParallelSpaces helps users understand the results of topic modeling for Yelp business reviews: businesses and their reviews each constitute a separate visual space, and exploring these spaces enables the user to characterize each space in terms of the other. TopicLens is a Magic Lens-type interaction technique in which the documents under the lens are clustered by topic in real time. Second, based on their understanding of the output, users can directly manipulate the model parameters. CommentIQ is a comment moderation tool in which moderators can adjust model parameters according to their context or goals. Third, based on their understanding of the output gained through visualization, users can sculpt features for a concept that they can then use in document scoring. ConceptVector uses word embeddings to support this process. Finally, based on their understanding of the output, users can teach or improve a specific part of the model with a teaching dataset. My planned future project, CaptionViz, visualizes the output of an image caption generation model, and users can improve the model's performance by feeding it a complementary dataset called the "teaching set."
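As a rough illustration of the third step, the following minimal sketch scores documents against a user-defined concept built from word embeddings. It is not ConceptVector's actual implementation. The toy three-dimensional vectors, the seed words, and the function names (`concept_vector`, `score_document`) are all assumptions made for this example, and a real system would presumably load pretrained embeddings rather than a hand-written table.

```python
import math

# Toy 3-dimensional word vectors, invented purely for illustration.
# A real system would load pretrained embeddings instead.
TOY_EMBEDDINGS = {
    "delicious": [0.9, 0.1, 0.0],
    "tasty":     [0.8, 0.2, 0.1],
    "flavor":    [0.7, 0.3, 0.0],
    "rude":      [0.0, 0.9, 0.2],
    "slow":      [0.1, 0.8, 0.3],
    "waiter":    [0.2, 0.6, 0.5],
    "parking":   [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def concept_vector(seed_words, embeddings):
    """Average the vectors of the seed words the user chose for a concept."""
    vectors = [embeddings[w] for w in seed_words if w in embeddings]
    if not vectors:
        raise ValueError("none of the seed words has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def score_document(text, concept, embeddings):
    """Mean cosine similarity between the concept and the known words in text."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    sims = [cosine(embeddings[t], concept) for t in tokens if t in embeddings]
    return sum(sims) / len(sims) if sims else 0.0


# Hypothetical "food quality" concept defined by two seed words.
food = concept_vector(["delicious", "tasty"], TOY_EMBEDDINGS)
for review in ["Delicious flavor!", "Rude waiter, slow."]:
    print(review, round(score_document(review, food, TOY_EMBEDDINGS), 3))
```

With these toy vectors, the review about food scores higher on the concept than the review about service. In an interactive setting, the user would refine the seed words after inspecting which documents rank highly, then rescore.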