A pinboard by
Krishna Kumar Singh

PhD student, University of California, Davis


Hiding image patches to improve weakly-supervised object and action localization performance.

Object localization/detection is a fundamental problem in computer vision. While tremendous advances have been made in recent years, existing state-of-the-art methods are trained in a strongly-supervised fashion, in which the system learns an object category’s appearance properties and precise localization information from images annotated with bounding boxes. However, such carefully labeled exemplars are expensive to obtain in the large numbers that are needed to fully represent a category’s variability, and methods trained in this manner can suffer from unintentional biases or errors imparted by annotators that hinder the system’s ability to generalize to new, unseen data.

To address these issues, researchers have proposed training object detectors with relatively inexpensive weak supervision, in which each training image is only weakly labeled with an image-level tag (e.g., “car”, “no car”) that states an object’s presence or absence but not its location. The main advantage of weakly-supervised learning is that it requires less detailed annotations than the fully-supervised setting, and therefore has the potential to use the vast weakly-annotated visual data available on the web.

Most existing weakly-supervised methods localize only the most discriminative parts of an object rather than all relevant parts, which leads to suboptimal performance. Our key idea is to hide patches in a training image randomly, forcing the network to seek other relevant parts when the most discriminative part is hidden. We thus name our approach ‘Hide-and-Seek’. To illustrate the intuition, consider localizing a dog in an image: if we randomly remove some patches from the image, there is a chance that the dog’s face, which is the most discriminative part, will not be visible to the model. In this case, the model must seek other relevant parts, such as the tail and legs, in order to do well on the classification task. Randomly hiding different patches during training thus forces the network to focus on multiple relevant parts of the object.
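
The patch-hiding step can be illustrated with a minimal sketch. It is not the actual implementation: the function name `hide_patches`, the grid layout, the patch size, the hiding probability and the fill value are all assumptions made for illustration, and the image is a plain grayscale list of rows to keep the example dependency-free.

```python
import random


def hide_patches(image, patch_size=2, hide_prob=0.5, fill_value=0, rng=None):
    """Return a copy of `image` (a list of pixel rows) with random patches hidden.

    The image is divided into a grid of patch_size x patch_size cells, and
    each cell is independently overwritten with `fill_value` with probability
    `hide_prob`. All parameter values here are illustrative assumptions.
    """
    if rng is None:
        rng = random.Random()
    height = len(image)
    width = len(image[0]) if height else 0
    hidden = [list(row) for row in image]  # copy so the input is untouched
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            if rng.random() < hide_prob:
                # Overwrite every pixel of this grid cell (clipped at borders).
                for y in range(top, min(top + patch_size, height)):
                    for x in range(left, min(left + patch_size, width)):
                        hidden[y][x] = fill_value
    return hidden
```

Calling such a function afresh on each training image at every epoch would hide a different set of patches each time, which is what pushes the classifier to rely on more than one object part.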

Since Hide-and-Seek only alters the input image, it can easily be generalized to different neural networks and tasks like action localization. For the temporal action localization task (in which the start and end times of an action need to be found), random frame sequences are hidden while training a network on action classification, which forces the network to learn the relevant frames corresponding to an action.
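
The temporal variant can be sketched in the same way, hiding contiguous runs of frames instead of spatial patches. Again, this is only an illustration under assumptions: the name `hide_frame_segments`, the fixed segment length, the hiding probability and the use of `None` as a placeholder for a hidden frame are hypothetical choices, not details taken from the method itself.

```python
import random


def hide_frame_segments(frames, segment_len=4, hide_prob=0.5, fill=None, rng=None):
    """Return a copy of `frames` with random contiguous segments hidden.

    The sequence is split into consecutive segments of `segment_len` frames,
    and each segment is independently replaced by `fill` with probability
    `hide_prob`. All parameter values here are illustrative assumptions.
    """
    if rng is None:
        rng = random.Random()
    hidden = list(frames)  # copy so the input sequence is untouched
    for start in range(0, len(frames), segment_len):
        if rng.random() < hide_prob:
            end = min(start + segment_len, len(frames))
            for i in range(start, end):
                hidden[i] = fill
    return hidden
```

In a real pipeline the placeholder would be a blank or mean-valued frame of the same shape as the others, so that the classification network still receives a fixed-size input.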