Research Fellow, Deakin University
Efficient Identification of Arbitrarily Shaped and Varied Density Clusters in High-dimensional Data.
Clustering has become one of the most important processes of knowledge discovery from data in the era of big data. It explores and reveals the hidden patterns in the data, and provides insight into the natural groupings in the data. I am solving two existing problems of density-based clustering in order to efficiently identify the arbitrarily shaped and varied density clusters in high-dimensional data. I have investigated and designed different approaches for each problem. The effectiveness of these proposed approaches has been verified with extensive empirical evaluations on synthetic and real-world datasets.
Abstract: We investigate statistical properties of a clustering algorithm that receives level set estimates from a kernel density estimator and then either estimates the first split in the density level cluster tree, if such a split is present, or detects the absence of such a split. Key aspects of our analysis include finite sample guarantees, consistency, rates of convergence, and an adaptive data-driven strategy for choosing the kernel bandwidth. For the rates and the adaptivity we do not need continuity assumptions on the density such as Hölder continuity, but only require intuitive geometric assumptions of non-parametric nature.
Pub.: 17 Aug '17, Pinned: 27 Aug '17
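The level-set idea in the abstract above can be illustrated with a small sketch. It is not the paper's algorithm and carries none of its guarantees: it simply evaluates a Gaussian kernel density estimate on a one-dimensional grid, scans density levels from low to high, and reports the first level at which the set of grid points with density at or above that level breaks into more than one connected run. The function names, the toy data, the fixed bandwidth of 0.5, and the grid are all assumptions made for illustration.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density with a Gaussian kernel."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)
    return density

def level_set_components(grid, values, level):
    """Group grid points whose density is >= level into runs of adjacent points.

    Each run is returned as a (first_point, last_point) pair.
    """
    components, current = [], []
    for x, v in zip(grid, values):
        if v >= level:
            current.append(x)
        elif current:
            components.append((current[0], current[-1]))
            current = []
    if current:
        components.append((current[0], current[-1]))
    return components

def first_split(grid, values, n_levels=200):
    """Scan levels upward; return (level, components) at the first level
    whose level set has more than one component, or None if none does."""
    top = max(values)
    for i in range(1, n_levels):
        level = top * i / n_levels
        components = level_set_components(grid, values, level)
        if len(components) > 1:
            return level, components
    return None

# Toy data (hypothetical): two evenly spaced groups of points on a line.
left = [-2.5 + 0.02 * k for k in range(101)]    # covers -2.5 .. -0.5
right = [0.5 + 0.02 * k for k in range(101)]    # covers  0.5 ..  2.5
density = gaussian_kde(left + right, bandwidth=0.5)
grid = [-5.0 + 0.05 * k for k in range(201)]    # covers -5 .. 5
values = [density(x) for x in grid]
split = first_split(grid, values)

# A single group should produce no split at any scanned level.
one_group = gaussian_kde(left, bandwidth=0.5)
no_split = first_split(grid, [one_group(x) for x in grid])
```

In this sketch `split` holds the level just above the density in the valley between the two groups, together with the two intervals that remain, while `no_split` is `None`. The bandwidth is fixed by hand here, whereas the abstract describes a data-driven choice.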
Abstract: Machine learning algorithms are everywhere, ranging from simple data analysis and pattern recognition tools used across the sciences to complex systems that achieve super-human performance on various tasks. Ensuring that they are well-behaved (that they do not, for example, cause harm to humans or act in a racist or sexist way) is therefore not a hypothetical problem to be dealt with in the future, but a pressing one that we address here. We propose a new framework for designing machine learning algorithms that simplifies the problem of specifying and regulating undesirable behaviors. To show the viability of this new framework, we use it to create new machine learning algorithms that preclude the sexist and harmful behaviors exhibited by standard machine learning algorithms in our experiments. Our framework for designing machine learning algorithms simplifies the safe and responsible application of machine learning.
Pub.: 17 Aug '17, Pinned: 27 Aug '17
Abstract: The analysis of mixed data has been raising challenges in statistics and machine learning. One of the two most prominent challenges is to develop new statistical techniques and methodologies that handle mixed data effectively by making the data less heterogeneous with minimal loss of information. The other is that such methods must scale to large tasks involving huge amounts of mixed data. To tackle these challenges, we introduce parameter sharing and balancing extensions to our recent model, the mixed-variate restricted Boltzmann machine (MV.RBM), which can transform heterogeneous data into a homogeneous representation. We also integrate structured sparsity and distance metric learning into RBM-based models. Our proposed methods are applied in various applications, including latent patient profile modelling in medical data analysis and representation learning for image retrieval. The experimental results demonstrate that the models perform better than baseline methods on medical data and outperform state-of-the-art rivals on an image dataset.
Pub.: 18 Aug '17, Pinned: 27 Aug '17
Abstract: Neural networks are known to be vulnerable to adversarial examples: inputs that have been intentionally perturbed to remain visually similar to the source input, but cause a misclassification. Until now, black-box attacks against neural networks have relied on transferability of adversarial examples. White-box attacks are used to generate adversarial examples on a substitute model, which are then transferred to the black-box target model. In this paper, we introduce a direct attack against black-box neural networks that uses another attacker neural network to learn to craft adversarial examples. We show that our attack is capable of crafting adversarial examples that are indistinguishable from the source input and are misclassified with overwhelming probability, reducing accuracy of the black-box neural network from 99.4% to 0.77% on the MNIST dataset, and from 91.4% to 6.8% on the CIFAR-10 dataset. Our attack can adapt and reduce the effectiveness of proposed defenses against adversarial examples, requires very little training data, and produces adversarial examples that can transfer to different machine learning models such as Random Forest, SVM, and K-Nearest Neighbor. To demonstrate the practicality of our attack, we launch a live attack against a target black-box model hosted online by Amazon: the crafted adversarial examples reduce its accuracy from 91.8% to 61.3%. Additionally, we show that attacks proposed in the literature have unique, identifiable distributions. We use this information to train a classifier that is robust against such attacks.
Pub.: 17 Aug '17, Pinned: 21 Aug '17
Abstract: One of the areas where Artificial Intelligence is having the most impact is machine learning, which develops algorithms able to learn patterns and decision rules from data. Machine learning algorithms have been embedded into data mining pipelines, which can combine them with classical statistical strategies, to extract knowledge from data. Within the EU-funded MOSAIC project, a data mining pipeline has been used to derive a set of predictive models of type 2 diabetes mellitus (T2DM) complications based on electronic health record data of nearly one thousand patients. This pipeline comprises clinical center profiling, predictive model targeting, predictive model construction, and model validation. After having dealt with missing data by means of random forest (RF) and having applied suitable strategies to handle class imbalance, we have used Logistic Regression with stepwise feature selection to predict the onset of retinopathy, neuropathy, or nephropathy in different time scenarios, at 3, 5, and 7 years from the first visit at the Hospital Center for Diabetes (not from the diagnosis). The variables considered are gender, age, time from diagnosis, body mass index (BMI), glycated hemoglobin (HbA1c), hypertension, and smoking habit. The final models, tailored to each complication, provided an accuracy of up to 0.838. Different variables were selected for each complication and time scenario, leading to specialized models that are easy to translate into clinical practice.
Pub.: 13 May '17, Pinned: 21 Aug '17
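The modelling steps named in the abstract above (handle missing values, compensate for class imbalance, fit a logistic regression) can be sketched in a few lines. This is only an illustration under loud assumptions: it swaps the paper's random forest imputation for plain column-mean imputation, uses inverse-frequency sample weights as one possible imbalance strategy, omits stepwise feature selection, and runs on a tiny made-up table with two unnamed numeric features. Every function name and number below is hypothetical and unrelated to the MOSAIC data.

```python
import math

def impute_mean(rows):
    """Replace None entries with the mean of the observed values in that column.
    (Assumption: a simple stand-in for the random forest imputation in the abstract.)"""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]

def balanced_weights(labels):
    """Weight each sample inversely to the frequency of its class."""
    n = len(labels)
    pos = sum(labels)
    neg = n - pos
    return [n / (2.0 * pos) if y == 1 else n / (2.0 * neg) for y in labels]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Predicted probability of class 1 for one feature vector."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

def fit_logistic(rows, labels, weights, lr=0.1, epochs=2000):
    """Fit weighted logistic regression by batch gradient descent."""
    n_cols = len(rows[0])
    w = [0.0] * n_cols
    b = 0.0
    total = sum(weights)
    for _ in range(epochs):
        grad_w = [0.0] * n_cols
        grad_b = 0.0
        for x, y, s in zip(rows, labels, weights):
            err = s * (predict_proba(w, b, x) - y)
            for j in range(n_cols):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / total for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / total
    return w, b

# Made-up toy table: six negatives, two positives, two missing entries.
X_raw = [[0.1, 0.2], [0.3, None], [0.2, 0.1], [0.4, 0.3],
         [0.0, 0.4], [0.3, 0.2], [1.5, 1.2], [1.8, None]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X = impute_mean(X_raw)
sample_weights = balanced_weights(y)
w, b = fit_logistic(X, y, sample_weights)

p_low = predict_proba(w, b, [0.2, 0.2])     # a point among the negatives
p_high = predict_proba(w, b, [1.65, 0.8])   # a point between the positives
```

With inverse-frequency weights, the two positive rows together carry as much weight as the six negative rows, so the fitted boundary is not pulled toward the majority class. A real pipeline would also hold out data for validation, as the abstract describes.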
Abstract: Interactive model analysis, the process of understanding, diagnosing, and refining a machine learning model with the help of interactive visualization, is very important for users to efficiently solve real-world artificial intelligence and data mining problems. Dramatic advances in big data analytics have led to a wide variety of interactive model analysis tasks. In this paper, we present a comprehensive analysis and interpretation of this rapidly developing area. Specifically, we classify the relevant work into three categories: understanding, diagnosis, and refinement. Each category is exemplified by recent influential work. Possible future research opportunities are also explored and discussed.
Pub.: 03 Feb '17, Pinned: 21 Aug '17