Ph.D. Student, The University of Western Australia
To develop robust, efficient and practical algorithms for scene understanding.
The main aim of my research is to develop robust, efficient and practical algorithms for scene understanding, which seeks to equip computers with human-like vision capabilities. The performance of machine learning and computer vision techniques has improved considerably for the individual components of scene understanding, e.g., scene classification, segmentation, object recognition, text detection and depth map generation. The next essential step towards human-like perception is to combine these individual tasks so that they can support one another, which requires generalizable algorithms that produce state-of-the-art results across the different components of scene understanding. Deep learning has emerged as the dominant machine learning technique for feature extraction and classification in scene understanding tasks, and my first goal is to apply deep learning and machine learning techniques to develop robust algorithms, particularly for object recognition, face recognition and surveillance. Another important, yet relatively unexplored, cue in scene understanding is text, the most important form of human communication. Text occurring in natural scenes can reveal the context and category of a scene, the types of objects it may contain, and possible interactions between them. My second goal is to develop text detection and recognition methods that work in real-world scenarios; to this end, I am investigating novel deep neural network methods and architectures to improve the efficiency and robustness of text localization. My final goal is to integrate these recognition capabilities into a robot owned by our research group.
My research will be useful in the fields of surveillance, robotic scene understanding, image and video retrieval from large databases (or the internet), and autonomous driving vehicles. It will also help in the development of personal assistance devices for visually impaired, blind and elderly people.
Abstract: This paper presents a morphology-based text line extraction algorithm for extracting text regions from cluttered images. First, the method defines a novel set of morphological operations for extracting high-contrast regions as text line candidates. The contrast feature is robust to lighting changes and invariant to image transformations such as scaling, translation and skewing. To detect skewed text lines, a moment-based method is then used to estimate their orientations. According to the orientation, an x-projection technique can be applied to extract various text geometries from the text-analogue segments for text verification. However, due to noise, a text line region is often fragmented into several segments; therefore, after the projection, a novel recovery algorithm is proposed to reassemble a complete text line from its fragments. A verification scheme is then proposed to verify all extracted candidate text lines according to their text geometries. Experimental results show that the proposed method improves on the state of the art in terms of effectiveness and robustness for text line detection.
Pub.: 03 Aug '07, Pinned: 30 Jul '17
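The moment-based skew estimation and x-projection steps described above can be sketched in a few lines of NumPy. This is a minimal illustration of the standard central-moment orientation estimate and a column projection profile, not the paper's exact formulation:

```python
import numpy as np

def segment_orientation(mask):
    """Estimate a binary segment's dominant orientation from its
    second-order central moments (a standard moment-based skew
    estimate; the paper's exact formulation may differ)."""
    ys, xs = np.nonzero(mask)
    xbar, ybar = xs.mean(), ys.mean()
    mu11 = ((xs - xbar) * (ys - ybar)).sum()
    mu20 = ((xs - xbar) ** 2).sum()
    mu02 = ((ys - ybar) ** 2).sum()
    return 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)  # radians

def x_projection(mask):
    """Column-wise projection profile of a (deskewed) segment; gaps in
    the profile reveal where a text line has fragmented into pieces."""
    return mask.sum(axis=0)
```

A horizontal stroke yields an orientation near 0, a diagonal one near π/4, and zero-valued columns in the projection mark candidate fragment boundaries for the recovery step.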
Abstract: In this paper, we propose a novel algorithm to detect text information in natural scene images. Scene text classification and detection are still open research topics. Our proposed algorithm models both character appearance and structure to generate representative and discriminative text descriptors. The contributions of this paper include three aspects: 1) a new character appearance model based on a structure correlation algorithm, which extracts discriminative appearance features from detected interest points of character samples; 2) a new text descriptor based on structons and correlatons, which model character structure through structure differences among character samples and the co-occurrence of structure components; and 3) a new text region localization method that combines color decomposition, character contour refinement, and string line alignment to localize character candidates and refine detected text regions. We perform three groups of experiments to evaluate the effectiveness of our proposed algorithm, covering text classification, text detection, and character identification. Evaluation results on benchmark datasets demonstrate that our algorithm achieves state-of-the-art performance on scene text classification and detection, and significantly outperforms existing algorithms for character identification.
Pub.: 15 Jan '13, Pinned: 30 Jul '17
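One way to read the structon/correlaton idea is as a bag-of-visual-words pipeline: quantize local structure features against a codebook of prototypes ("structons") and additionally count label co-occurrences ("correlatons"). The sketch below is a loose NumPy illustration under that reading, not the authors' implementation; the `codebook` is assumed to be learned offline (e.g., by clustering training features):

```python
import numpy as np

def structon_histogram(features, codebook):
    """Assign each local structure feature to its nearest codeword
    ("structon") and return the labels plus a normalized histogram."""
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    labels = d.argmin(axis=1)
    hist = np.bincount(labels, minlength=len(codebook)).astype(float)
    return labels, hist / hist.sum()

def correlatons(labels, k):
    """Normalized co-occurrence of structon labels between consecutive
    samples, a crude stand-in for structure-component co-occurrence."""
    C = np.zeros((k, k))
    for a, b in zip(labels[:-1], labels[1:]):
        C[a, b] += 1.0
    return C / max(C.sum(), 1.0)
```

The histogram captures which structure prototypes occur in a character sample, while the co-occurrence matrix captures how they occur together.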
Abstract: This paper addresses the problem of scene understanding for driver assistance systems. To recognize the large number of objects that may be found on the road, several sensors and decision algorithms have to be used. The proposed approach is based on representing all available information in over-segmented image regions. The main novelty of the framework is its capability to incorporate new classes of objects and new sensors or detection methods while remaining robust to sensor failures. Several classes, such as ground, vegetation and sky, are considered, as well as three different sensors. The approach was evaluated on real, publicly available urban driving scene data.
Pub.: 16 Dec '14, Pinned: 30 Jul '17
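The framework's robustness to sensor failure can be illustrated with a toy fusion rule for one over-segmented region: each sensor or detector contributes a class posterior, failed sensors contribute nothing, and the remaining posteriors are combined. Simple averaging is my assumption for illustration, not the paper's actual combination scheme:

```python
import numpy as np

CLASSES = ["ground", "vegetation", "sky"]  # classes named in the abstract

def fuse_region(sensor_posteriors):
    """Fuse per-region class posteriors from several sensors/detectors.
    A failed sensor reports None and is simply skipped, so the region
    is still classified from whatever evidence remains."""
    valid = [np.asarray(p, dtype=float)
             for p in sensor_posteriors if p is not None]
    if not valid:
        # no working sensor: fall back to an uninformative prior
        return np.full(len(CLASSES), 1.0 / len(CLASSES))
    fused = np.mean(valid, axis=0)
    return fused / fused.sum()
```

Because the rule only averages over sensors that actually reported, adding a new sensor (or losing one) changes nothing in the fusion code, which mirrors the extensibility claim in the abstract.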
Abstract: In this paper we present our solution to the 300 Faces in the Wild Facial Landmark Localization Challenge. We demonstrate how to achieve very competitive localization performance with a simple deep-learning-based system. A human study is conducted to show that the accuracy of our system is very close to human performance, and we discuss how this finding affects our future directions for improving the system.
Pub.: 12 Dec '15, Pinned: 30 Jul '17
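Landmark localization challenges such as 300 Faces in the Wild (300-W) score submissions by the mean point-to-point error normalized by a reference distance, commonly the inter-ocular distance. A minimal NumPy version of that metric is sketched below; the eye-landmark indexing convention here is an assumption (300-W uses a fixed 68-point annotation scheme):

```python
import numpy as np

def normalized_error(pred, gt, left_eye, right_eye):
    """Mean landmark error divided by the ground-truth inter-ocular
    distance; lower is better.  `pred` and `gt` are (N, 2) arrays of
    landmark coordinates, `left_eye`/`right_eye` index the reference
    eye landmarks used for normalization."""
    iod = np.linalg.norm(gt[left_eye] - gt[right_eye])
    per_point = np.linalg.norm(pred - gt, axis=1)
    return per_point.mean() / iod
```

Normalizing by inter-ocular distance makes the score invariant to face scale, so predictions on small and large faces are directly comparable.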
Abstract: Face recognition (FR) plays an important role in video surveillance by enabling accurate recognition of individuals of interest over a distributed network of cameras. Systems for still-to-video FR are exposed to challenging operational environments. The appearance of faces captured under unconstrained conditions changes due to variations in pose, scale, illumination, occlusion, blur, etc. Moreover, the facial models used for matching may not be robust to intra-class variations because they are typically designed a priori with one reference facial still per person. Indeed, faces captured during enrollment (using still cameras) may differ considerably from those captured during operations (using surveillance cameras). In this paper, an efficient multi-classifier system (MCS) is proposed for accurate still-to-video FR based on multiple face representations and domain adaptation (DA). An individual-specific ensemble of exemplar-SVM (e-SVM) classifiers is designed to improve robustness to intra-class variations. During enrollment of a target individual, an ensemble is used to model the single reference still, where multiple face descriptors and random feature subspaces generate a diverse pool of patch-wise classifiers. To adapt these ensembles to the operational domain, e-SVMs are trained using labeled face patches extracted from the reference still versus patches extracted from cohort and other non-target stills, mixed with unlabeled patches extracted from the corresponding face trajectories captured with surveillance cameras. During operations, the most competent classifiers for a given probe face are dynamically selected and weighted based on internal criteria determined in the feature space of the e-SVMs. This paper also investigates the impact of different training schemes for DA, as well as the validation set of non-target faces extracted from stills and video trajectories of unknown individuals in the operational domain.
The performance of the proposed system was validated using videos from the COX-S2V and Chokepoint datasets. Results indicate that the proposed system can surpass state-of-the-art accuracy with significantly lower computational complexity; indeed, dynamic selection and weighting combine only the most relevant classifiers for each input probe.
Pub.: 13 Apr '17, Pinned: 30 Jul '17
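An exemplar-SVM, the building block of the ensembles above, is a linear SVM trained with a single positive sample (the reference patch) against many negatives, with the positive's loss weighted more heavily. The sketch below trains one by plain subgradient descent on the weighted hinge loss; all hyperparameters are illustrative assumptions, and the paper's actual training (descriptors, subspaces, DA, dynamic selection) is considerably more involved:

```python
import numpy as np

def train_exemplar_svm(pos, negs, lr=0.1, epochs=200,
                       C_pos=10.0, C_neg=1.0, lam=0.01):
    """Train a linear e-SVM: one positive (the reference patch) versus
    many negatives, with the positive term weighted more heavily.
    Plain subgradient descent on the regularized weighted hinge loss."""
    X = np.vstack([pos[None, :], negs])
    y = np.array([1.0] + [-1.0] * len(negs))
    cost = np.array([C_pos] + [C_neg] * len(negs))  # per-sample weights
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1  # samples violating the margin
        grad_w = lam * w - (cost[active, None] * y[active, None]
                            * X[active]).sum(0) / len(X)
        grad_b = -(cost[active] * y[active]).sum() / len(X)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

During operations, one would score a probe patch with `w @ x + b` for every enrolled e-SVM and keep only the most confident classifiers, loosely mirroring the dynamic selection and weighting step.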