Weakly Supervised Object Localization and Detection

In this project, I created and trained two deep networks capable of object localization and detection in a weakly supervised setting. That is, the networks were trained using only image-level classification labels, yet were able to produce localization outputs. The two networks implemented here are based on Oquab et al. '15 and Bilen and Vedaldi '16.

In the former, object localization emerges from a small augmentation to a typical feature-extracting deep CNN such as ResNet18. A final set of convolutions is added, compressing the image representation to K feature maps for a K-class dataset. Each feature map is max-pooled to one element of a length-K vector of class scores, so each of these final feature maps acts as a heat map for the presence of a given class in the image. The max-pooling operation teaches the network to identify only the most salient features of a class in a given image, so the heat map tends not to cover the full extent of the object. Localization was made more robust by average-pooling the feature maps instead, and the visualization was improved by implementing class activation maps from Zhou et al. '16 (sketched after the figures below).
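As a rough sketch of this head, assuming a recent torchvision ResNet18 backbone truncated before its average-pool and fully connected layers, with a single 1x1 convolution standing in for the added convolutions (the class and argument names here are illustrative, not from the project code):

```python
import torch
import torch.nn as nn
import torchvision.models as models


class WeakLocalizer(nn.Module):
    """Illustrative Oquab-style head: per-class heat maps pooled to class scores."""

    def __init__(self, num_classes: int, pool: str = "max"):
        super().__init__()
        backbone = models.resnet18(weights=None)
        # Keep the convolutional trunk; drop ResNet18's own avgpool and fc layers.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        # Extra convolution compressing the 512-channel representation
        # into one feature map per class.
        self.classifier = nn.Conv2d(512, num_classes, kernel_size=1)
        self.pool = pool

    def forward(self, x):
        maps = self.classifier(self.features(x))  # (B, K, H', W') class heat maps
        if self.pool == "max":
            scores = maps.amax(dim=(2, 3))        # score = strongest response per class
        else:
            scores = maps.mean(dim=(2, 3))        # average-pooled variant
        return scores, maps                       # image-level scores + heat maps
```

With max pooling, each class is scored by its single strongest response, which is why the resulting heat maps highlight only the most salient part of the object; switching to mean pooling spreads the training signal over the whole map, giving the more robust localization described above.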

Figures: class heat maps produced with max-pooling and with average-pooling.
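For the class-activation-map visualization mentioned above, a minimal sketch in the style of Zhou et al. '16, assuming a backbone whose final conv features are globally average-pooled and fed to a linear classifier (the function and argument names are hypothetical):

```python
import torch
import torch.nn.functional as F


def class_activation_map(features: torch.Tensor,
                         fc_weight: torch.Tensor,
                         class_idx: int,
                         out_size=(224, 224)) -> torch.Tensor:
    """features: (C, H, W) conv features for one image.
    fc_weight: (K, C) weights of the linear layer applied after global average pooling."""
    # Weighted sum of feature maps, using the classifier weights for the chosen class.
    cam = torch.einsum("c,chw->hw", fc_weight[class_idx], features)
    cam = F.relu(cam)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
    # Upsample to input resolution so the map can be overlaid on the image.
    cam = F.interpolate(cam[None, None], size=out_size,
                        mode="bilinear", align_corners=False)[0, 0]
    return cam
```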

In the second implementation, region proposals are leveraged to produce actual bounding boxes in the weakly supervised setting. Given a set of region proposals for a target image, the network jointly reasons about which class is present in the image and in which proposal it is likely located. This is accomplished by splitting the network's classifier into two branches: one softmaxed along the region-proposal dimension and the other softmaxed along the class dimension. These are recombined via element-wise multiplication to give the probability that a given class is present in a given region proposal. Greedy NMS was implemented to retrieve the final bounding boxes.
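A minimal sketch of such a two-branch head, assuming per-proposal features have already been pooled (e.g. via RoI pooling) into an (R, D) matrix for R proposals, and using torchvision's built-in nms in place of the project's hand-rolled greedy NMS; all names here are illustrative:

```python
import torch
import torch.nn as nn
from torchvision.ops import nms


class WSDDNHead(nn.Module):
    """Illustrative WSDDN-style head in the spirit of Bilen and Vedaldi '16."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.cls_branch = nn.Linear(feat_dim, num_classes)  # softmaxed over classes
        self.det_branch = nn.Linear(feat_dim, num_classes)  # softmaxed over proposals

    def forward(self, roi_feats):                                 # roi_feats: (R, D)
        cls = torch.softmax(self.cls_branch(roi_feats), dim=1)    # which class, per region
        det = torch.softmax(self.det_branch(roi_feats), dim=0)    # which region, per class
        region_scores = cls * det                  # (R, K) joint class/region probabilities
        image_scores = region_scores.sum(dim=0)    # image-level scores for the weak label loss
        return region_scores, image_scores


def detect(region_scores, boxes, class_idx, score_thresh=0.01, iou_thresh=0.4):
    """Greedy NMS for one class: boxes is (R, 4) in (x1, y1, x2, y2) format."""
    scores = region_scores[:, class_idx]
    keep = scores > score_thresh
    idx = nms(boxes[keep], scores[keep], iou_thresh)  # torchvision's greedy NMS
    return boxes[keep][idx], scores[keep][idx]
```

Summing the joint scores over proposals yields image-level class probabilities, so the head can still be trained with only the image-level classification labels while the per-proposal scores provide the bounding boxes at test time.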