Research Fellow, Monash University


Given a problem, which anomaly detection technique would be the most effective?

With enormous amounts of data being generated on all aspects of our world, an increasingly important problem is to analyse data and find anomalies: data points that are unusual in the set. Examples include fraudulent credit card transactions amongst billions of legitimate transactions; fetal anomalies that may indicate severe disabilities in unborn infants; chromosomal anomalies in tumours heralding cancers such as leukaemia; and unusual trends or patterns in social media that may herald social unrest or terror activities. These are a handful of real-world applications where anomaly detection is used. In every scientific field of study, anomalies are of interest because they tell a different story from the norm.

There are many anomaly detection methods available in the literature, and new methods are constantly being developed. However, the performance of each technique depends on the characteristics of the problem or dataset it is used on. For example, an algorithm that performs very well in identifying fraudulent credit card activity may be useless in identifying chromosomal anomalies in tumours. So, apart from the obvious difference in context, what makes an algorithm perform well in one scenario and poorly in another? This is our research problem.

Each scenario, including the ones discussed above, gives rise to a dataset with its own set of attributes and observations. Anomaly detection methods are applied to these datasets to discover anomalies. Therefore, our problem translates to datasets as follows: what makes an algorithm perform well on one dataset and poorly on another? The key intrinsic characteristics of datasets that affect anomaly detection algorithms are unknown. Our research builds a bridge between dataset characteristics and anomaly detection techniques. It answers the question: given a dataset (which arises from a scenario), which anomaly detection technique is best suited to it? Thus, we can understand algorithm performance in terms of dataset characteristics. For which datasets will a given algorithm perform best? For which datasets will it not work at all? This is the new knowledge that this research project contributes to the existing literature.
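The bridge described above can be pictured as a meta-learning step: measure intrinsic characteristics (meta-features) of a new dataset, then recommend the detector that performed best on the most similar previously benchmarked dataset. The sketch below is only illustrative; the meta-features and detector names are hypothetical placeholders, not the project's actual feature set or selection method.

```python
import numpy as np

def meta_features(X):
    """A few simple intrinsic characteristics of a dataset (hypothetical choice)."""
    return np.array([
        np.log(X.shape[0]),               # log number of observations
        np.log(X.shape[1]),               # log number of attributes
        np.mean(np.abs(X.std(axis=0))),   # average spread of the attributes
    ])

def recommend(X, benchmark):
    """Pick the detector that worked best on the most similar benchmarked dataset.

    `benchmark` is a list of (meta_feature_vector, best_detector_name) pairs
    gathered from previous experiments.
    """
    f = meta_features(X)
    best, best_dist = None, np.inf
    for feats, detector in benchmark:
        d = np.linalg.norm(f - np.asarray(feats))
        if d < best_dist:
            best, best_dist = detector, d
    return best
```

In practice the benchmark table would be built by running many detection algorithms over many characterised datasets; the nearest-neighbour lookup here simply stands in for that learned mapping.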


Childhood cancer risk in those with chromosomal and non-chromosomal congenital anomalies in Washington State: 1984-2013.

Abstract: The presence of a congenital anomaly is associated with increased childhood cancer risk, likely due to large effects of Down syndrome and chromosomal anomalies for leukemia. Less is known about associations with presence of non-chromosomal anomalies. Records of children diagnosed with cancer at <20 years of age during 1984-2013 in Washington State cancer registries were linked to their birth certificates (N = 4,105). A comparison group of children born in the same years was identified. Congenital anomalies were assessed from birth records and diagnosis codes in linked hospital discharge data. Logistic regression was used to estimate odds ratios (OR) and 95% confidence intervals (CI) for cancer, and for specific cancer types in relation to the presence of any anomaly and specific anomalies. Having any congenital anomaly was associated with an increased risk of childhood cancer (OR: 1.46, 95% CI 1.28-1.65). Non-chromosomal anomalies were also associated with increased childhood cancer risk overall (OR: 1.35; 95% CI: 1.18-1.54), and with increased risk of several cancer types, including neuroblastoma, renal, hepatoblastoma, soft-tissue sarcoma, and germ cell tumors. Increasing number of non-chromosomal anomalies was associated with a stronger risk of childhood cancer (OR for 3+ anomalies: 3.11, 95% CI: 1.54-6.11). Although central nervous system (CNS) anomalies were associated with CNS tumors (OR: 6.05, 95% CI 2.75-13.27), there was no strong evidence of other non-chromosomal anomalies being specifically associated with cancer occurring in the same organ system or anatomic location. Non-chromosomal anomalies increased risk of several cancer types. Additionally, we found that increasing number of non-chromosomal anomalies was associated with a stronger risk of cancer. Pooling similar data from many regions would increase power to identify specific associations in order to inform molecular studies examining possible common developmental pathways in the etiologies of birth defects and cancer.
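The odds ratios above come from logistic regression on the linked birth and cancer records. Purely to illustrate the statistic being reported (not the study's data or model), an odds ratio and Wald 95% confidence interval can be computed from a simple 2x2 exposure-by-outcome table:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table:
    a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)   # standard error of log(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi
```

Logistic regression generalises this calculation by adjusting for covariates; the exponentiated coefficient plays the role of the odds ratio.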

Pub.: 09 Jun '17, Pinned: 29 Aug '17

Anomaly Detection and Modeling in 802.11 Wireless Networks

Abstract: IEEE 802.11 Wireless Networks are getting more and more popular at university campuses, enterprises, shopping centers, airports and in so many other public places, providing Internet access to a large crowd openly and quickly. The wireless users are also getting more dependent on WiFi technology and therefore demanding more reliability and higher performance for this vital technology. However, due to unstable radio conditions, faulty equipment, and dynamic user behavior among other reasons, there are always unpredictable performance problems in a wireless covered area. Detection and prediction of such problems is of great significance to network managers if they are to alleviate the connectivity issues of the mobile users and provide a higher quality wireless service. This paper aims to improve the management of the 802.11 wireless networks by characterizing and modeling wireless usage patterns in a set of anomalous scenarios that can occur in such networks. We apply time-invariant (Gaussian Mixture Models) and time-variant (Hidden Markov Models) modeling approaches to a dataset generated from a large production network and describe how we use these models for anomaly detection. We then generate several common anomalies on a Testbed network and evaluate the proposed anomaly detection methodologies in a controlled environment. The experimental results of the Testbed show that HMM outperforms GMM and yields a higher anomaly detection ratio and a lower false alarm rate.
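As a rough sketch of the time-invariant half of this approach, a Gaussian Mixture Model can be fitted to features of normal usage and a new observation flagged when its log-likelihood falls below a low quantile of the training scores. The synthetic two-dimensional "usage features", the component count, and the 1% threshold below are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
normal = rng.normal(size=(500, 2))  # stand-in for features of normal wireless usage

# Fit a time-invariant model of normal behaviour
gmm = GaussianMixture(n_components=2, random_state=0).fit(normal)

# Threshold at the 1st percentile of log-likelihoods seen on normal data
threshold = np.quantile(gmm.score_samples(normal), 0.01)

def is_anomalous(x):
    """Flag an observation whose likelihood under the normal model is very low."""
    return bool(gmm.score_samples(np.asarray(x).reshape(1, -1))[0] < threshold)
```

The time-variant counterpart in the paper replaces the GMM with a Hidden Markov Model fitted to sequences of such features, so that the order of observations also informs the anomaly score.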

Pub.: 04 Jul '17, Pinned: 29 Aug '17

Time Series Anomaly Detection; Detection of anomalous drops with limited features and sparse examples in noisy highly periodic data

Abstract: Google uses continuous streams of data from industry partners in order to deliver accurate results to users. Unexpected drops in traffic can be an indication of an underlying issue and may be an early warning that remedial action may be necessary. Detecting such drops is non-trivial because streams are variable and noisy, with roughly regular spikes (in many different shapes) in traffic data. We investigated the question of whether or not we can predict anomalies in these data streams. Our goal is to utilize Machine Learning and statistical approaches to classify anomalous drops in periodic, but noisy, traffic patterns. Since we do not have a large body of labeled examples to directly apply supervised learning for anomaly classification, we approached the problem in two parts. First we used TensorFlow to train our various models including DNNs, RNNs, and LSTMs to perform regression and predict the expected value in the time series. Secondly we created anomaly detection rules that compared the actual values to predicted values. Since the problem requires finding sustained anomalies, rather than just short delays or momentary inactivity in the data, our two detection methods focused on continuous sections of activity rather than just single points. We tried multiple combinations of our models and rules and found that using the intersection of our two anomaly detection methods proved to be an effective method of detecting anomalies on almost all of our models. In the process we also found that not all data fell within our experimental assumptions, as one data stream had no periodicity, and therefore no time based model could predict it.
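The second stage described above, rules that compare actual to predicted values and require a sustained deviation rather than a single low point, could look something like the following sketch. The 30% tolerance and five-point window are invented parameters, and `predicted` would come from one of the trained regression models:

```python
import numpy as np

def sustained_drop(actual, predicted, tol=0.3, min_len=5):
    """Flag indices where traffic stays below (1 - tol) * predicted
    for at least `min_len` consecutive points."""
    low = actual < (1 - tol) * predicted
    flags = np.zeros(len(actual), dtype=bool)
    run = 0
    for i, is_low in enumerate(low):
        run = run + 1 if is_low else 0   # length of the current low streak
        if run >= min_len:
            flags[i - min_len + 1 : i + 1] = True
    return flags
```

Requiring a minimum run length is what filters out the short delays and momentary inactivity the abstract mentions; only sections of sustained low traffic survive the rule.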

Pub.: 11 Aug '17, Pinned: 29 Aug '17

Energy-based Models for Video Anomaly Detection

Abstract: Automated detection of abnormalities in data has been an active research area in recent years because of its diverse applications in practice, including video surveillance, industrial damage detection and network intrusion detection. However, building an effective anomaly detection system is a non-trivial task, since it requires tackling several challenging issues: the shortage of annotated data, the inability to define anomalous objects explicitly, and the expensive cost of feature engineering. Unlike existing approaches, which only partially solve these problems, we develop a unique framework to cope with all of them simultaneously. Instead of handling the ambiguous definition of anomalous objects, we propose to work with regular patterns, for which unlabeled data are abundant and usually easy to collect in practice. This allows our system to be trained completely unsupervised and liberates us from the need for costly data annotation. By learning a generative model that captures the normality distribution of the data, we can isolate abnormal data points that receive low normality scores (high abnormality scores). Moreover, by leveraging the power of generative networks, i.e. energy-based models, we are also able to learn the feature representation automatically rather than relying on the hand-crafted features that have dominated anomaly detection research for many decades. We demonstrate our proposal on the specific application of video anomaly detection, and the experimental results indicate that our method performs better than baselines and is comparable with state-of-the-art methods on many benchmark video anomaly detection datasets.
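The scoring idea, treating abnormality as low probability (high energy) under a model of regular patterns, can be illustrated with a simple Gaussian stand-in. A trained energy-based network such as the one in the paper would replace the closed-form energy function here, and the synthetic "frame features" are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
normal_frames = rng.normal(size=(1000, 4))  # stand-in features of regular video patterns

# Fit a Gaussian model of normality (a stand-in for a learned energy-based model)
mu = normal_frames.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(normal_frames, rowvar=False))

def energy(x):
    """Energy of a pattern: high energy corresponds to a low normality score."""
    d = np.asarray(x) - mu
    return 0.5 * d @ cov_inv @ d

# Flag the patterns with the top 1% of energies as anomalies
threshold = np.quantile([energy(f) for f in normal_frames], 0.99)
```

Only the regular patterns are needed to fit the model and set the threshold, which is what makes this style of training fully unsupervised.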

Pub.: 17 Aug '17, Pinned: 29 Aug '17