A pinboard by Emeka Ogbuju

I am a Ph.D. student in Computer Science. I study public sentiment for automated decision making.

Pinboard Summary

Automated extraction of public opinions using big data technologies to assist government decisions

For years, Nigeria has depended largely on the exploration and export of crude oil. Wealth generated from crude oil has been responsible for virtually everything in Nigeria, ranging from politics and the debate on resource allocation to the manipulation of primordial sentiments, leading to a myriad of challenges that have made the formation of a national identity elusive. This study is, therefore, an attempt to refocus Nigeria on data as the new oil that can aid development. The vast data generated through social media must be mined, extracted, processed and analyzed for policy determination using ICT tools and techniques. The study adopted the resource curse theory as its theoretical framework to explain the contradiction of poverty in the midst of abundant natural resources in Nigeria and the urgent need to refocus on the use of data as a way of collecting feedback from people. We propose a centralized real-time data collection framework that is capable of aggregating all forms of tweets matching a predefined set of keywords into MongoDB, a NoSQL document store. We performed sentiment analysis on the tweets using a deep learning algorithm to determine citizens’ opinions on the sector and personalities involved. The results showed negative sentiment in certain areas of the sector and positive sentiment in others. They revealed the true feelings of the citizenry on the personalities analyzed and showed clearly the policy direction that the government should take to steer the sectors aright. There are indeed possibilities for data-driven decision making in the nation. Applying the concepts in this study will move Nigeria, and indeed any developing nation, towards true smartification of its governance: the creation of simple policies, measurable plans, attainable procedures, relevant programs and timely delivery of projects.
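
The keyword-based collection step described in the summary could be sketched roughly as follows. This is only an illustration under stated assumptions, not the framework used in the study: the keyword set, the function names, and the document fields are all hypothetical, and the write to MongoDB is left as a comment so that the sketch runs without a database.

```python
import re
from typing import Dict, Iterable, List, Optional

# Hypothetical keyword set; the study does not list its actual keywords.
KEYWORDS = {"electricity", "fuel", "power"}


def matched_keywords(text: str, keywords: Iterable[str] = KEYWORDS) -> List[str]:
    """Return the keywords that occur as whole words in the tweet text."""
    tokens = set(re.findall(r"[a-z0-9']+", text.lower()))
    return sorted(tokens & set(keywords))


def to_document(tweet_id: str, text: str,
                keywords: Iterable[str] = KEYWORDS) -> Optional[Dict]:
    """Shape a matching tweet as a document for a document store.

    Returns None when no keyword matches, so the tweet is skipped.
    The field names here are assumptions made for illustration.
    """
    hits = matched_keywords(text, keywords)
    if not hits:
        return None
    return {"_id": tweet_id, "text": text, "keywords": hits}


if __name__ == "__main__":
    stream = [
        ("1", "No electricity again, and #fuel is costly"),
        ("2", "Good morning Lagos"),
    ]
    docs = [d for d in (to_document(i, t) for i, t in stream) if d is not None]
    # With PyMongo, the batch could then be stored with something like
    # collection.insert_many(docs); that step is omitted here.
    print(docs)
```

Using the tweet identifier as `_id` is one possible design choice: a document store would then reject a tweet that is collected twice instead of storing a duplicate.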

24 items pinned

Analytical mapping of opinion mining and sentiment analysis research during 2000–2015

Abstract: The new transformed read-write Web has resulted in a rapid growth of user generated content on the Web, resulting in a huge volume of unstructured data. A substantial part of this data is unstructured text such as reviews and blogs. Opinion mining and sentiment analysis (OMSA) as a research discipline has emerged during the last 15 years and provides a methodology to computationally process the unstructured data, mainly to extract opinions and identify their sentiments. The relatively new but fast growing research discipline has changed a lot during these years. This paper presents a scientometric analysis of research work done on OMSA during 2000–2016. For the scientometric mapping, research publications indexed in the Web of Science (WoS) database are used as input data. The publication data is analyzed computationally to identify the year-wise publication pattern, rate of growth of publications, types of authorship of papers on OMSA, collaboration patterns in publications on OMSA, most productive countries, institutions, journals and authors, citation patterns and a year-wise citation reference network, and theme density plots and keyword bursts in OMSA publications during the period. A somewhat detailed manual analysis of the data is also performed to identify popular approaches (machine learning and lexicon-based) used in these publications, levels (document, sentence or aspect-level) of sentiment analysis work done and major application areas of OMSA. The paper presents a detailed analytical mapping of OMSA research work and charts the progress of the discipline on various useful parameters.

Pub.: 18 Jul '16, Pinned: 29 Jun '17

Twitter Opinion Topic Model: Extracting Product Opinions from Tweets by Leveraging Hashtags and Sentiment Lexicon

Abstract: Aspect-based opinion mining is widely applied to review data to aggregate or summarize opinions of a product, and the current state of the art is achieved with Latent Dirichlet Allocation (LDA)-based models. Although social media data like tweets are laden with opinions, their "dirty" nature (as natural language) has discouraged researchers from applying LDA-based opinion models for product review mining. Tweets are often informal, unstructured and lacking labeled data such as categories and ratings, making them challenging for product opinion mining. In this paper, we propose an LDA-based opinion model named Twitter Opinion Topic Model (TOTM) for opinion mining and sentiment analysis. TOTM leverages hashtags, mentions, emoticons and strong sentiment words that are present in tweets in its discovery process. It improves opinion prediction by modeling the target-opinion interaction directly, thus discovering target-specific opinion words, which are neglected in existing approaches. Moreover, we propose a new formulation for incorporating sentiment prior information into a topic model by utilizing an existing public sentiment lexicon. This is novel in that it learns and updates with the data. We conduct experiments on 9 million tweets on electronic products, and demonstrate the improved performance of TOTM in both quantitative evaluations and qualitative analysis. We show that aspect-based opinion analysis on a massive volume of tweets provides useful opinions on products.

Pub.: 21 Sep '16, Pinned: 21 Jun '17

A fuzzy computational model of emotion for cloud based sentiment analysis

Abstract: This paper presents a novel emotion modeling methodology for incorporating human emotion into intelligent computer systems. The proposed approach includes a method to elicit emotion information from users, a new representation of emotion (AV-AT model) that is modelled using a genetically optimized adaptive fuzzy logic technique, and a framework for predicting and tracking a user’s affective trajectory over time. The fuzzy technique is evaluated in terms of its ability to model affective states in comparison to other existing machine learning approaches. The performance of the proposed affect modeling methodology is tested through the deployment of a personalised learning system and a series of offline and online experiments. A hybrid cloud intelligence infrastructure is used to conduct large-scale experiments to analyze user sentiments and associated emotions, using data from a million Facebook users. A performance analysis of the infrastructure on processing, analyzing, and data storage has been carried out, illustrating its viability for large-scale data processing tasks. A comparison of the proposed emotion categorizing approach with Facebook’s sentiment analysis API demonstrates that our approach can achieve comparable performance. Finally, discussions on research contributions to cloud intelligence using sentiment analysis, emotion modeling, big data, and comparisons with other approaches are presented in detail.

Pub.: 10 Feb '17, Pinned: 21 Jun '17

Distributed Real-Time Sentiment Analysis for Big Data Social Streams

Abstract: The big data trend has forced data-centric systems to handle continuous fast data streams. In recent years, real-time analytics on stream data has formed into a new research field, which aims to answer queries about what-is-happening-now with a negligible delay. The real challenge with real-time stream data processing is that it is impossible to store instances of data, and therefore online analytical algorithms are utilized. To perform real-time analytics, pre-processing of data should be performed in a way that only a short summary of the stream is stored in main memory. In addition, due to the high speed of arrival, the average processing time for each instance of data should be short enough that incoming instances are not lost without being captured. Lastly, the learner needs to provide high analytical accuracy measures. Sentinel is a distributed system written in Java that aims to solve this challenge by enforcing both the processing and learning process to be done in distributed form. Sentinel is built on top of Apache Storm, a distributed computing platform. Sentinel's learner, Vertical Hoeffding Tree, is a parallel decision tree-learning algorithm based on the VFDT, with the ability to enable parallel classification in distributed environments. Sentinel also uses SpaceSaving to keep a summary of the data stream and stores its summary in a synopsis data structure. Application of Sentinel on the Twitter Public Stream API is shown and the results are discussed.

Pub.: 27 Dec '16, Pinned: 21 Jun '17
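
The SpaceSaving summary mentioned in the abstract above keeps a fixed number of counters and, once they are all in use, reassigns the smallest one to each newly seen item. The sketch below is a minimal single-machine version of that general algorithm, written in Python for illustration only; it is not Sentinel's Java implementation, and the class and method names are assumptions.

```python
from typing import Dict, Hashable, List, Tuple


class SpaceSaving:
    """Approximate frequent-item counter that tracks at most `capacity` items."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.counts: Dict[Hashable, int] = {}  # estimated count per tracked item
        self.errors: Dict[Hashable, int] = {}  # maximum overestimation per item

    def add(self, item: Hashable) -> None:
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # Evict the item with the smallest count; the newcomer inherits
            # that count plus one, and the inherited part is its error bound.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[item] = floor + 1
            self.errors[item] = floor

    def top(self, k: int) -> List[Tuple[Hashable, int]]:
        """Return up to k tracked items with the largest estimated counts."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


if __name__ == "__main__":
    summary = SpaceSaving(capacity=3)
    for tag in ["a"] * 5 + ["b"] * 3 + ["c", "d", "a"]:
        summary.add(tag)
    print(summary.top(2))
```

The linear scan for the smallest counter keeps the sketch short; an implementation meant for high-rate streams would typically use a structure that finds the minimum in constant time.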

Towards Near Real-Time BGP Deep Analysis: A Big-Data Approach

Abstract: BGP (Border Gateway Protocol) serves as the primary routing protocol for the Internet, enabling Autonomous Systems (individual network operators) to exchange network reachability information. Alongside significant on-going research and development efforts, there is a practical need to understand the nature of events that occur on the Internet. Network operators are acutely aware of security-related incidents such as 'Prefix Hijacking' as well as the impact of network instabilities that ripple through the Internet. Recent research focused on the study of BGP anomalies (both network/prefix instability and security-related incidents) has been based on the analysis of historical logs. Further analysis to understand the nature of these anomalous events is not always sufficient to be able to differentiate malicious activities, such as prefix- or sub-prefix-hijacking, from those events caused by inadvertent misconfigurations. In addition, such techniques are challenged by a lack of sufficient resources to store and process data feeds in real-time from multiple BGP Vantage Points (VPs). In this paper, we present a BGP Deep-analysis application developed using the PNDA (Platform for Network Data Analytics) 'Big-Data' platform. PNDA provides a highly scalable environment that enables the ingestion and processing of 'live' BGP feeds from many vantage points in a schema-agnostic manner. The Apache Spark-based application, in conjunction with PNDA's distributed processing capabilities, is able to provide high-level insights as well as near-to-real-time statistical analysis.

Pub.: 24 May '17, Pinned: 20 Jun '17

RSI-CB: A Large Scale Remote Sensing Image Classification Benchmark via Crowdsource Data

Abstract: Remote sensing image classification is a fundamental task in remote sensing image processing. The remote sensing field still lacks a large-scale benchmark comparable to ImageNet or Place2. We propose a remote sensing image classification benchmark (RSI-CB) based on crowd-source data which is massive, scalable, and diverse. Using crowd-source data, we can efficiently annotate ground objects in remote sensing images using points of interest, vector data from OSM or other crowd-source data. Based on this method, we construct a worldwide large-scale benchmark for remote sensing image classification. In this benchmark, there are two sub-datasets with image sizes of 256 * 256 and 128 * 128 respectively, since different convolutional neural networks require different image sizes. The former sub-dataset contains 6 categories with 35 subclasses and a total of more than 24,000 images; the latter contains 6 categories with 45 subclasses and a total of more than 36,000 images. The six categories are agricultural land, construction land and facilities, transportation and facilities, water and water conservancy facilities, woodland, and other land, and each category has several subclasses. This classification system is defined according to the national standard of land use classification in China, and is inspired by the hierarchy mechanism of ImageNet. Finally, we have done a large number of experiments to compare RSI-CB with the SAT-4 and UC-Merced datasets on handcrafted features, such as SIFT, and classical CNN models, such as AlexNet, VGG, GoogleNet, and ResNet. We also show that CNN models trained on RSI-CB perform well when transferred to other datasets, e.g. UC-Merced, and have good generalization ability. The experiments show that RSI-CB is more suitable as a benchmark for the remote sensing image classification task than other ones in the big data era, and can potentially be used in practical applications.

Pub.: 29 May '17, Pinned: 20 Jun '17