Research Fellow, Monash University
Given a problem, which anomaly detection technique would be the most effective?
With enormous amounts of data being generated on all aspects of our world, an increasingly important problem is to analyse the data and find anomalies: data points that are unusual in the set. Examples include fraudulent credit card transactions amongst billions of legitimate transactions; fetal anomalies that may indicate severe disabilities in unborn infants; chromosomal anomalies in tumours heralding cancers such as leukaemia; and unusual trends or patterns in social media that may herald social unrest or terror activities. These are just a handful of the real-world applications of anomaly detection. In every scientific field of study, anomalies are of interest because they tell a different story from the norm.
There are many anomaly detection methods in the literature, and new methods are constantly being developed. However, the performance of the different techniques depends on the characteristics of the problems and datasets they are applied to. For example, an algorithm that performs very well in identifying fraudulent credit card activity may be useless in identifying chromosomal anomalies in tumours. So, apart from the obvious difference in context, what makes an algorithm perform well in one scenario and poorly in another? This is our research problem.
Each scenario, including the ones discussed above, gives rise to a dataset with its own set of attributes and observations, and anomaly detection methods are applied to these datasets to discover anomalies. Our problem therefore translates to datasets as follows: what makes an algorithm perform well on one dataset and poorly on another? The key intrinsic characteristics of datasets that affect anomaly detection algorithms are unknown. Our research builds a bridge between dataset characteristics and anomaly detection techniques: given a dataset (which comes from a scenario), which anomaly detection technique is best suited to it? Thus, we can understand algorithm performance in terms of dataset characteristics. For which datasets will a given algorithm perform best? For which datasets will it not work at all? This is the new knowledge that this research project contributes to the existing literature.
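The bridging idea can be illustrated with a toy meta-learning sketch. Everything below is illustrative, not part of this project: the `meta_features` chosen, the `recommend` helper, and the algorithm names are all hypothetical. The idea: summarise each dataset by a few characteristics, then recommend the algorithm that worked best on the most similar previously seen dataset.

```python
import math
import statistics

def meta_features(rows):
    """Summarise a numeric dataset (list of equal-length rows) with a few
    simple characteristics: log of size, dimensionality, average spread.
    Real meta-features would be richer; these are placeholders."""
    n, d = len(rows), len(rows[0])
    cols = list(zip(*rows))
    mean_std = statistics.mean(statistics.pstdev(c) for c in cols)
    return [math.log(n), d, mean_std]

def recommend(history, new_rows):
    """history: list of (meta_feature_vector, best_algorithm_name) pairs
    from datasets whose best detector is already known. Recommend the
    algorithm of the nearest neighbour in meta-feature space."""
    target = meta_features(new_rows)

    def dist(mf):
        return sum((a - b) ** 2 for a, b in zip(mf, target))

    return min(history, key=lambda h: dist(h[0]))[1]

# hypothetical performance history and a new dataset to place
history = [
    ([math.log(1000), 30, 1.0], "isolation-forest"),
    ([math.log(50), 3, 0.2], "robust-mahalanobis"),
]
data = [[0.1 * i, 0.2 * i, 0.3 * i] for i in range(40)]
print(recommend(history, data))  # robust-mahalanobis
```

The small, low-dimensional dataset lands nearest the second history entry, so its algorithm is recommended; with richer meta-features and a larger history this becomes an algorithm-selection model.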
Abstract: Real data often contain anomalous cases, also known as outliers. These may spoil the resulting analysis but they may also contain valuable information. In either case, the ability to detect such anomalies is essential. A useful tool for this purpose is robust statistics, which aims to detect the outliers by first fitting the majority of the data and then flagging data points that deviate from it. We present an overview of several robust methods and the resulting graphical outlier detection tools. We discuss robust procedures for univariate, low-dimensional, and high-dimensional data, such as estimating location and scatter, linear regression, principal component analysis, classification, clustering, and functional data analysis. The challenging new topic of cellwise outliers is also introduced.
Pub.: 31 Jul '17, Pinned: 29 Aug '17
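The fit-the-majority-then-flag idea from the abstract can be sketched for the univariate case (a minimal illustration, not the authors' code): estimate location and scale robustly with the median and the MAD, then flag points far from that fit.

```python
import statistics

def mad_outliers(xs, cutoff=3.5):
    """Flag points whose robust z-score exceeds `cutoff`.
    Location and scale are estimated by the median and the MAD, which,
    unlike the mean and standard deviation, are not dragged towards the
    outliers themselves."""
    med = statistics.median(xs)
    mad = statistics.median(abs(x - med) for x in xs)
    if mad == 0:  # degenerate case: the majority of points are identical
        return [x for x in xs if x != med]
    scale = 1.4826 * mad  # consistency factor for Gaussian data
    return [x for x in xs if abs(x - med) / scale > cutoff]

data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.1, 55.0]
print(mad_outliers(data))  # [55.0]
```

A classical z-score on the same data would be inflated by the outlier itself; the robust fit keeps 55.0 clearly flagged.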
Abstract: To evaluate the accuracy of prenatal neurosonography in diagnosing underlying causes of fetal ventriculomegaly, posterior fossa anomalies and microcephaly before 24 weeks' gestational age (GA) and to study the accuracy of prenatal counseling on postnatal prognosis. A retrospective cohort study based on 146 cases of these fetal brain anomalies before 24 weeks' GA. Counseling on prognosis was compared with postnatal outcome. Data on genetic testing was analyzed. Out of 146 cases, 135 (92%) were diagnosed correctly before 24 weeks' GA. Accuracy was 98% (97/99) in cases with multiple anomalies and 81% (38/47) in cases with an isolated abnormality. Counseling on prognosis was correct in 143 out of 146 cases (98%). Prenatal genetic diagnostics detected an anomaly in 51/113 (45%) of cases. In 14/62 (23%) cases prenatal karyotyping was normal, but postnatal array-CGH detected a pathogenic anomaly. Despite the challenges of early gestation, accuracy in diagnosing and counseling fetal brain anomalies before 24 weeks' GA was high. Prenatal genetic testing is a valuable diagnostic tool and should be offered to all women with fetal brain anomalies. Considering the many different types of anomalies and diverse etiologies, a multidisciplinary approach is essential for counseling on postnatal outcome.
Pub.: 07 Jun '17, Pinned: 29 Aug '17
Abstract: Graph representations offer powerful and intuitive ways to describe data in a multitude of application domains. Here, we consider stochastic processes generating graphs and propose a methodology for detecting changes in stationarity of such processes. The methodology is general and considers a process generating attributed graphs with a variable number of vertices/edges, without the need to assume one-to-one correspondence between vertices at different time steps. The methodology acts by embedding every graph of the stream into a vector domain, where a conventional multivariate change detection procedure can be easily applied. We ground the soundness of our proposal by proving several theoretical results. In addition, we provide a specific implementation of the methodology and evaluate its effectiveness on several detection problems involving attributed graphs representing biological molecules and drawings. Experimental results are contrasted with respect to suitable baseline methods, demonstrating the competitiveness of our approach.
Pub.: 21 Jun '17, Pinned: 29 Aug '17
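The pipeline described above — embed each graph of the stream into a vector, then run a conventional multivariate change detection procedure in that domain — could be sketched as follows. This is a toy illustration: the three-number embedding and the CUSUM variant are simplifications I chose, not the paper's actual embedding or detector.

```python
import numpy as np

def embed_graph(edges, n_vertices):
    """Map a graph to a small feature vector (vertex count, edge count,
    mean degree): a crude stand-in for the paper's graph embedding."""
    deg = np.zeros(n_vertices)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return np.array([n_vertices, len(edges), deg.mean() if n_vertices else 0.0])

def cusum_change(vectors, ref_len=30, threshold=10.0):
    """Conventional change detection in the embedded vector domain: a
    one-sided CUSUM on the standardized distance to the reference
    window's mean. Returns the first index where the statistic trips."""
    X = np.array(vectors)
    mu = X[:ref_len].mean(axis=0)
    sd = X[:ref_len].std(axis=0) + 1e-9
    dist = np.linalg.norm((X - mu) / sd, axis=1)
    drift = dist[:ref_len].mean()  # typical distance under stationarity
    s = 0.0
    for i in range(ref_len, len(X)):
        s = max(0.0, s + dist[i] - drift - 0.5)
        if s > threshold:
            return i
    return None

# synthetic stream: 40 sparse graphs, then 20 denser ones (the "change")
rng = np.random.default_rng(2)
stream = []
for i in range(60):
    m = int((15 if i < 40 else 40) + rng.integers(-3, 4))
    edges = [(int(rng.integers(10)), int(rng.integers(10))) for _ in range(m)]
    stream.append(embed_graph(edges, 10))
print(cusum_change(stream))  # index at, or just after, the change point 40
```

The embedding reduces a stream of variable-structure graphs to fixed-length vectors, after which any standard multivariate detector applies — exactly the separation of concerns the abstract describes.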
Abstract: The presence of a congenital anomaly is associated with increased childhood cancer risk, likely due to large effects of Down syndrome and chromosomal anomalies for leukemia. Less is known about associations with presence of non-chromosomal anomalies. Records of children diagnosed with cancer at <20 years of age during 1984-2013 in Washington State cancer registries were linked to their birth certificates (N = 4,105). A comparison group of children born in the same years was identified. Congenital anomalies were assessed from birth records and diagnosis codes in linked hospital discharge data. Logistic regression was used to estimate odds ratios (OR) and 95% confidence intervals (CI) for cancer, and for specific cancer types in relation to the presence of any anomaly and specific anomalies. Having any congenital anomaly was associated with an increased risk of childhood cancer (OR: 1.46, 95% CI 1.28-1.65). Non-chromosomal anomalies were also associated with increased childhood cancer risk overall (OR: 1.35; 95% CI: 1.18-1.54), and with increased risk of several cancer types, including neuroblastoma, renal, hepatoblastoma, soft-tissue sarcoma, and germ cell tumors. Increasing number of non-chromosomal anomalies was associated with a stronger risk of childhood cancer (OR for 3+ anomalies: 3.11, 95% CI: 1.54-6.11). Although central nervous system (CNS) anomalies were associated with CNS tumors (OR: 6.05, 95% CI 2.75-13.27), there was no strong evidence of other non-chromosomal anomalies being specifically associated with cancer occurring in the same organ system or anatomic location. Non-chromosomal anomalies increased risk of several cancer types. Additionally, we found that increasing number of non-chromosomal anomalies was associated with a stronger risk of cancer.
Pooling similar data from many regions would increase power to identify specific associations in order to inform molecular studies examining possible common developmental pathways in the etiologies of birth defects and cancer.
Pub.: 09 Jun '17, Pinned: 29 Aug '17
Abstract: In the class of streaming anomaly detection algorithms for univariate time series, the size of the sliding window over which various statistics are calculated is an important parameter. To address anomalous variation in the scale of the pseudo-periodicity of a time series, we define a streaming multi-scale anomaly score using a streaming PCA over a multi-scale lag-matrix. We define three methods of aggregating the multi-scale anomaly scores, and we evaluate their performance on the Yahoo! and Numenta unsupervised anomaly detection benchmark datasets. To the best of the authors' knowledge, this is the first time a multi-scale streaming anomaly detection method has been proposed and systematically studied.
Pub.: 21 Jun '17, Pinned: 29 Aug '17
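The core construction — PCA over a lag matrix at several scales, with the per-scale scores aggregated — can be sketched offline. Note the hedges: the paper's method is streaming and studies three aggregations; this batch version with max-aggregation is only one possible reading.

```python
import numpy as np

def lag_matrix(x, w):
    """Stack sliding windows (lag vectors) of length w as rows."""
    return np.array([x[i:i + w] for i in range(len(x) - w + 1)])

def pca_scores(x_ref, x_new, w, k=2):
    """Fit PCA on lag vectors of a reference segment, then score each new
    lag vector by its reconstruction error outside the top-k principal
    subspace. (An offline stand-in for the paper's *streaming* PCA.)"""
    A = lag_matrix(x_ref, w)
    mu = A.mean(axis=0)
    _, _, Vt = np.linalg.svd(A - mu, full_matrices=False)
    P = Vt[:k]                     # top-k principal directions
    B = lag_matrix(x_new, w) - mu
    resid = B - (B @ P.T) @ P      # component outside the subspace
    return np.linalg.norm(resid, axis=1)

def multi_scale_score(x_ref, x_new, scales=(8, 16, 32), k=2):
    """One possible aggregation of the per-scale scores: take the max
    over scales at each (aligned) window start."""
    cut = len(x_new) - max(scales) + 1
    return np.max([pca_scores(x_ref, x_new, w, k)[:cut] for w in scales], axis=0)

t = np.arange(200)
x_ref = np.sin(2 * np.pi * t / 16)   # clean pseudo-periodic reference
x_new = np.sin(2 * np.pi * t / 16)
x_new[100] += 5.0                    # injected point anomaly
score = multi_scale_score(x_ref, x_new)
print(int(np.argmax(score)))         # a window start at or shortly before 100
```

Lag vectors of a clean sinusoid lie in a two-dimensional subspace, so windows untouched by the spike score near zero at every scale, while windows covering index 100 score highly.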
Abstract: IEEE 802.11 wireless networks are becoming more and more popular on university campuses and in enterprises, shopping centers, airports, and many other public places, providing Internet access to large crowds openly and quickly. Wireless users are also becoming more dependent on WiFi technology, and are therefore demanding more reliability and higher performance from this vital technology. However, due to unstable radio conditions, faulty equipment, and dynamic user behavior, among other reasons, there are always unpredictable performance problems in a wireless covered area. Detection and prediction of such problems is of great significance to network managers if they are to alleviate the connectivity issues of mobile users and provide a higher-quality wireless service. This paper aims to improve the management of 802.11 wireless networks by characterizing and modeling wireless usage patterns in a set of anomalous scenarios that can occur in such networks. We apply time-invariant (Gaussian Mixture Models) and time-variant (Hidden Markov Models) modeling approaches to a dataset generated from a large production network and describe how we use these models for anomaly detection. We then generate several common anomalies on a testbed network and evaluate the proposed anomaly detection methodologies in a controlled environment. The experimental results on the testbed show that HMM outperforms GMM, yielding a higher anomaly detection ratio and a lower false alarm rate.
Pub.: 04 Jul '17, Pinned: 29 Aug '17
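To illustrate how a fitted GMM is used for anomaly detection (the parameters and threshold below are made up for the example; the paper fits its models to real WiFi usage data), one can flag observations whose mixture log-likelihood falls below a threshold:

```python
import math

def gmm_loglik(x, weights, means, stds):
    """Log-likelihood of a scalar observation under a 1-D Gaussian
    mixture with pre-fitted parameters. (Fitting itself, e.g. by EM,
    is out of scope for this sketch.)"""
    p = sum(w * math.exp(-(x - m) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
            for w, m, s in zip(weights, means, stds))
    return math.log(p) if p > 0 else float("-inf")

# hypothetical usage model: two modes of per-user traffic (idle vs busy)
weights, means, stds = [0.7, 0.3], [5.0, 50.0], [2.0, 10.0]
threshold = -10.0                     # would be tuned on held-out normal data
for x in [6.0, 48.0, 200.0]:
    ll = gmm_loglik(x, weights, means, stds)
    print(x, ll < threshold)          # True marks an anomaly; only 200.0 trips
```

The time-variant HMM extends this idea by scoring the likelihood of the whole observation sequence, so that an observation plausible in one state becomes anomalous when it appears at the wrong time.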
Abstract: Anomaly detection in database management systems (DBMSs) is difficult because of the increasing number of statistics (stat) and event metrics in big data systems. In this paper, I propose an automatic DBMS diagnosis system that detects anomalous periods with abnormal DB stat metrics and finds causal events in those periods. Reconstruction error from a deep autoencoder and a statistical process control approach are applied to detect time periods with anomalies. Related events are found using time-series similarity measures between events and abnormal stat metrics. After training the deep autoencoder with DBMS metric data, the efficacy of anomaly detection is investigated on other DBMSs containing anomalies. Experimental results show the effectiveness of the proposed model, especially the batch temporal normalization layer. The proposed model is used for publishing automatic DBMS diagnosis reports in order to determine DBMS configuration and SQL tuning.
Pub.: 08 Aug '17, Pinned: 29 Aug '17
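The detection step — reconstruction error from an autoencoder combined with statistical process control — can be sketched with a linear autoencoder (equivalent to PCA) standing in for the paper's deep network and batch temporal normalization:

```python
import numpy as np

def fit_linear_ae(X, k=2):
    """A linear autoencoder fitted by SVD: encode = project onto the
    top-k principal components, decode = project back. A lightweight
    stand-in for the paper's deep autoencoder."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def reconstruction_error(X, mu, P):
    Z = (X - mu) @ P.T       # encode
    R = Z @ P + mu           # decode
    return np.linalg.norm(X - R, axis=1)

def spc_anomalies(train_err, test_err, n_sigma=3.0):
    """Statistical-process-control step: flag periods whose error exceeds
    mean + n_sigma * stdev of the training reconstruction error."""
    limit = train_err.mean() + n_sigma * train_err.std()
    return np.where(test_err > limit)[0]

# synthetic metrics: 6 correlated stats driven by 2 latent factors
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 6))
X = rng.normal(size=(300, 2)) @ W + 0.01 * rng.normal(size=(300, 6))
X_train, X_test = X[:200], X[200:].copy()
X_test[10] += 5.0            # injected anomalous periods
X_test[50] -= 5.0
mu, P = fit_linear_ae(X_train, k=2)
flags = spc_anomalies(reconstruction_error(X_train, mu, P),
                      reconstruction_error(X_test, mu, P))
print(sorted(flags))         # includes the injected indices 10 and 50
```

Normal periods reconstruct almost perfectly because they lie near the learned low-dimensional structure; the injected periods leave a large residual and cross the control limit.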
Abstract: In this paper, we use a variational recurrent neural network to investigate the anomaly detection problem on graph time series. The temporal correlation is modeled by the combination of a recurrent neural network (RNN) and variational inference (VI), while the spatial information is captured by a graph convolutional network. To incorporate external factors, we use a feature extractor to augment the transition of the latent variables, which can learn the influence of external factors. With the accumulative ELBO as the target function, it is easy to extend this model to an online method. An experimental study on traffic flow data shows the detection capability of the proposed method.
Pub.: 09 Aug '17, Pinned: 29 Aug '17
Abstract: Google uses continuous streams of data from industry partners in order to deliver accurate results to users. Unexpected drops in traffic can be an indication of an underlying issue and may be an early warning that remedial action may be necessary. Detecting such drops is non-trivial because streams are variable and noisy, with roughly regular spikes (in many different shapes) in traffic data. We investigated the question of whether or not we can predict anomalies in these data streams. Our goal is to utilize Machine Learning and statistical approaches to classify anomalous drops in periodic, but noisy, traffic patterns. Since we do not have a large body of labeled examples to directly apply supervised learning for anomaly classification, we approached the problem in two parts. First we used TensorFlow to train our various models including DNNs, RNNs, and LSTMs to perform regression and predict the expected value in the time series. Secondly we created anomaly detection rules that compared the actual values to predicted values. Since the problem requires finding sustained anomalies, rather than just short delays or momentary inactivity in the data, our two detection methods focused on continuous sections of activity rather than just single points. We tried multiple combinations of our models and rules and found that using the intersection of our two anomaly detection methods proved to be an effective method of detecting anomalies on almost all of our models. In the process we also found that not all data fell within our experimental assumptions, as one data stream had no periodicity, and therefore no time based model could predict it.
Pub.: 11 Aug '17, Pinned: 29 Aug '17
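The two-part scheme — predict the expected value, then apply rules that only flag sustained deviations — might look like this in miniature (the moving-average predictor is a deliberately crude stand-in for the trained DNN/RNN/LSTM models):

```python
def predict_baseline(series, window=7):
    """Predict each point as the mean of the previous `window` observations:
    a very simple stand-in for the trained regression models."""
    preds = []
    for i in range(len(series)):
        hist = series[max(0, i - window):i] or series[:1]
        preds.append(sum(hist) / len(hist))
    return preds

def sustained_drops(series, preds, drop=0.5, min_run=3):
    """Flag only *sustained* anomalies: runs of at least `min_run`
    consecutive points falling below `drop` * predicted value.
    Momentary dips are ignored, as the paper requires."""
    flagged, run = [], []
    for i, (y, p) in enumerate(zip(series, preds)):
        if y < drop * p:
            run.append(i)
        else:
            if len(run) >= min_run:
                flagged.extend(run)
            run = []
    if len(run) >= min_run:
        flagged.extend(run)
    return flagged

traffic = [100, 98, 103, 101, 30, 99, 102, 20, 22, 19, 21, 100, 101]
preds = predict_baseline(traffic)
print(sustained_drops(traffic, preds))  # [7, 8, 9, 10]
```

The momentary dip at index 4 is ignored while the four-point outage starting at index 7 is flagged: exactly the distinction between short inactivity and a sustained anomaly that the two intersected detection rules are designed to make.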
Abstract: Data leakage and theft from databases is a dangerous threat to organizations. Data Security and Data Privacy protection (DSDP) systems monitor data access and usage to identify leakage or suspicious activities that should be investigated. Because of the high-velocity nature of database systems, such systems audit only a portion of the vast number of transactions that take place. Anomalies are investigated by a Security Officer (SO) in order to choose the proper response. In this paper we investigate the effect of sampling methods based on the risk a transaction poses and propose a new method of "combined sampling" for capturing a more varied sample.
Pub.: 14 Aug '17, Pinned: 29 Aug '17
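The abstract does not give details of the proposed "combined sampling" method, but the general idea — mixing risk-weighted draws with uniform draws so the audited sample stays varied — could be sketched as follows (all names and parameters here are hypothetical):

```python
import random

def combined_sample(transactions, risk, k_risk, k_uniform, seed=None):
    """Hypothetical sketch: spend part of the audit budget on risk-weighted
    draws (high-risk transactions more likely to be audited) and the rest
    on uniform draws over the remainder, for a more varied sample."""
    rng = random.Random(seed)
    idx = range(len(transactions))
    risky = set(rng.choices(idx, weights=risk, k=k_risk))
    rest = [i for i in idx if i not in risky]
    uniform = rng.sample(rest, min(k_uniform, len(rest)))
    picked = sorted(risky | set(uniform))
    return [transactions[i] for i in picked]

txns = [f"txn-{i:03d}" for i in range(100)]
risk = [10.0 if i % 10 == 0 else 1.0 for i in range(100)]  # every 10th is risky
sample = combined_sample(txns, risk, k_risk=5, k_uniform=5, seed=42)
print(sample)
```

Pure risk-based sampling would concentrate the audit on the same high-risk strata; the uniform component keeps some probability on every transaction, which is what lets unexpected anomalies surface.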
Abstract: Automated detection of abnormalities in data has been an active research area in recent years because of its diverse practical applications, including video surveillance, industrial damage detection and network intrusion detection. However, building an effective anomaly detection system is a non-trivial task, since it requires tackling the challenging issues of the shortage of annotated data, the inability to define anomalous objects explicitly, and the expensive cost of feature engineering. Unlike existing approaches, which only partially solve these problems, we develop a unique framework to cope with them simultaneously. Instead of handling the ambiguous definition of anomalous objects, we propose to work with regular patterns, whose unlabeled data are abundant and usually easy to collect in practice. This allows our system to be trained in a completely unsupervised procedure and liberates us from the need for costly data annotation. By learning a generative model that captures the normality distribution in data, we can isolate abnormal data points that yield low normality scores (high abnormality scores). Moreover, by leveraging the power of generative networks, i.e. energy-based models, we are also able to learn the feature representation automatically rather than relying on the hand-crafted features that have dominated anomaly detection research over many decades. We demonstrate our proposal on the specific application of video anomaly detection, and the experimental results indicate that our method performs better than baselines and is comparable with state-of-the-art methods on many benchmark video anomaly detection datasets.
Pub.: 17 Aug '17, Pinned: 29 Aug '17
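Scoring normality with a learned generative density can be illustrated with a kernel density estimate in place of the paper's energy-based network (a deliberately simple stand-in; the training set contains only regular patterns, and a low score marks an anomaly):

```python
import numpy as np

def normality_scores(train, test, bandwidth=1.0):
    """Score test points by a Gaussian kernel density estimate fitted on
    normal (regular-pattern) data only. Low score = low normality =
    anomalous. A toy generative model standing in for the paper's
    energy-based network."""
    d2 = ((test[:, None, :] - train[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2)).mean(axis=1)

rng = np.random.default_rng(1)
train = rng.normal(0, 1, size=(200, 2))     # unlabeled "regular" data
test = np.array([[0.1, -0.2], [8.0, 8.0]])  # in-distribution vs far-out point
scores = normality_scores(train, test)
print(scores[0] > scores[1])  # True: the far-out point gets a lower score
```

No anomalous examples and no labels are needed to train this scorer — the same property that lets the paper's framework sidestep both annotation cost and the ambiguous definition of anomalies.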