A pinboard by
Brandon Park

Ph.D. Candidate, George Mason University


Network based methodologies for time series that provide new insights into causality and forecasting

High-dimensional time series analysis deals with data that have small sample sizes and a large number of features with time trends. For example, when forecasting national GDP (the response variable), data on several economic features such as income, inflation rate, and unemployment rate are collected over time. However, several of these features are correlated with one another over time, so it is not known in advance which features actually affect the response variable. Identifying the important features is therefore a first step in analyzing time series data. Recently developed statistical methods for identifying relevant features tend to fail in these situations because of time lags, trends, and their interactions.
In my thesis, we develop new methodologies for identifying relevant features and their time effects for response variables of interest. Beyond identifying relevant features, we also identify clusters of features that have a similar effect on the response variables. These clusters take time effects and time lags into account. This is accomplished with new network-based methodologies that use network-wide metrics in a multilayer network. These network-wide metrics represent the importance of features in each layer of the network. We therefore carry out a detailed analysis of the multilayer networks and provide useful software, based on machine learning algorithms, for routine use by practitioners.
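As a rough illustration only (this is not the thesis software, and the construction is an assumption), one simple way to picture a multilayer network over time lags is to build one layer per lag, connect feature a to feature b in layer k when a at time t is strongly correlated with b at time t + k, and then read off a per-layer degree as a stand-in for a network-wide metric. The function names, the correlation threshold, and the use of out-degree are all hypothetical choices made for this sketch.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def lagged_layers(series, max_lag, threshold):
    """Assumed construction: one layer per lag k; a directed edge (a, b) in
    layer k means |corr(a[t], b[t + k])| >= threshold."""
    names = list(series)
    layers = {}
    for lag in range(max_lag + 1):
        edges = set()
        for a in names:
            for b in names:
                if a == b:
                    continue
                x = series[a][: len(series[a]) - lag]  # a at time t
                y = series[b][lag:]                    # b at time t + lag
                if abs(pearson(x, y)) >= threshold:
                    edges.add((a, b))
        layers[lag] = edges
    return layers


def degree_by_layer(layers, names):
    """Out-degree of every feature in every layer (a simple illustrative metric)."""
    return {
        lag: {n: sum(1 for a, _ in edges if a == n) for n in names}
        for lag, edges in layers.items()
    }


# Toy data: y is x delayed by one time step, so x "leads" y at lag 1.
series = {
    "x": [1, 3, 2, 5, 4, 6, 3, 7],
    "y": [0, 1, 3, 2, 5, 4, 6, 3],
}
layers = lagged_layers(series, max_lag=1, threshold=0.99)
degrees = degree_by_layer(layers, list(series))
```

In this toy example the edge from x to y appears only in the lag-1 layer, which is the kind of lag-specific relationship a single flat correlation matrix would miss.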

An Interactive Machine Learning Framework

Abstract: Machine learning (ML) is believed to be an effective and efficient tool for building reliable prediction models or extracting useful structure from an avalanche of data. However, ML is also criticized for being difficult to interpret and complicated to tune. In contrast, visualization can organize and visually encode the entangled information in data and guide audiences to simpler perceptual inferences and analytic thinking. But large-scale and high-dimensional data will usually cause many visualization methods to fail. In this paper, we close the loop between ML and visualization via interaction between the ML algorithm and users, so that machine intelligence and human intelligence can cooperate and improve each other in a mutually rewarding way. In particular, we propose the "transparent boosting tree (TBT)", which shows the user both the model structure and the prediction statistics of each step in the learning process of a gradient boosting tree, and incorporates the user's feedback operations on trees into the learning process. In TBT, ML is in charge of updating the weights in the learning model and filtering the information shown to the user from the big data, while visualization is in charge of providing a visual understanding of the ML model to facilitate user exploration. It combines the advantages of ML in big data statistics and of humans in decision making based on domain knowledge. We develop a user-friendly interface for this novel learning method and apply it to two datasets collected from real applications. Our study shows that making ML transparent through interactive visualization can significantly improve the exploration of ML algorithms, give rise to novel insights into ML models, and integrate both machine and human intelligence.
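To make the idea of exposing "the model structure and prediction statistics of each step" more concrete, here is a minimal sketch of gradient boosting with one-split regression trees (stumps) under squared loss that records what each stage did. It is not the TBT implementation from the paper: the function names, the `history` record, and the `excluded_features` argument (a stand-in for one possible kind of user feedback) are assumptions made for illustration.

```python
def fit_stump(X, residuals, allowed_features):
    """Find the single-feature threshold split that minimizes squared error
    on the residuals. Returns (feature, threshold, left_value, right_value),
    or None if no split is possible."""
    best = None
    for j in allowed_features:
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            sse = sum((r - lmean) ** 2 for r in left)
            sse += sum((r - rmean) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lmean, rmean)
    return None if best is None else best[1:]


def boost(X, y, n_stages=5, learning_rate=0.5, excluded_features=()):
    """Boosting loop that logs the structure and training error of every stage.

    excluded_features is a hypothetical feedback hook: features the user
    has asked the learner not to split on."""
    allowed = [j for j in range(len(X[0])) if j not in excluded_features]
    base = sum(y) / len(y)          # initial constant prediction
    pred = [base] * len(y)
    history = []
    for stage in range(n_stages):
        residuals = [t - p for t, p in zip(y, pred)]
        stump = fit_stump(X, residuals, allowed)
        if stump is None:
            break
        j, thr, lval, rval = stump
        pred = [
            p + learning_rate * (lval if row[j] <= thr else rval)
            for row, p in zip(X, pred)
        ]
        mse = sum((t - p) ** 2 for t, p in zip(y, pred)) / len(y)
        history.append(
            {"stage": stage, "feature": j, "threshold": thr, "train_mse": mse}
        )
    return base, history


# Toy data: feature 0 separates the two target levels, feature 1 is noise.
X = [[1, 5], [2, 3], [3, 8], [4, 1]]
y = [1.0, 1.0, 3.0, 3.0]
base, history = boost(X, y, n_stages=3)
_, history_feedback = boost(X, y, n_stages=3, excluded_features={0})
```

Each entry in `history` is the kind of per-stage record a front end could plot, and rerunning with `excluded_features` shows how a single user decision changes which trees are grown.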

Pub.: 18 Oct '16, Pinned: 03 Jul '17