Visual Question Answering

In this project, I implemented two networks for the task of visual question answering. Specifically, my implementations were trained on the VQA dataset from Agrawal et al. '16. The task involves generating natural language answers to open-ended natural language questions about an image. These questions range from basic queries about the color, content, and count of features in the image to more interesting questions that require intuition and deeper understanding (or a semblance of understanding). To support this task, natural language question words were embedded as one-hot vectors over a dictionary of the most common question words in the dataset. Answers were embedded as one-hot vectors over the most common full answers in the dataset.
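
As a concrete illustration of this encoding scheme, here is a minimal sketch of how the question-word and answer vocabularies might be built and used for one-hot embedding. The helper names, toy data, and vocabulary sizes are illustrative assumptions, not the project's exact values.

```python
from collections import Counter

# Hypothetical sketch of the vocabulary construction described above; the
# helper names and vocabulary sizes are assumptions, not the project's values.

def build_vocab(items, size):
    """Map the `size` most common items to indices 0..size-1."""
    counts = Counter(items)
    return {tok: idx for idx, (tok, _) in enumerate(counts.most_common(size))}

def one_hot(index, size):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

# Toy data standing in for the dataset's tokenized questions and answer strings.
questions = [["what", "color", "is", "the", "dog"], ["how", "many", "cats"]]
answers = ["brown", "2"]

word_vocab = build_vocab([w for q in questions for w in q], size=1000)
answer_vocab = build_vocab(answers, size=1000)

# Each question word and each full answer becomes a one-hot vector.
word_vec = one_hot(word_vocab["dog"], len(word_vocab))
answer_vec = one_hot(answer_vocab["brown"], len(answer_vocab))
```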

The first network implemented is from Zhou et al. '15. As the paper's title states, it is a simple baseline architecture. Questions are transformed into word-frequency vectors using a bag-of-words embedding. Features are extracted from the question with a single linear layer and concatenated with image features extracted from GoogLeNet. One final linear layer maps the concatenated features to the answer output space.
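
The following is a minimal PyTorch sketch of this kind of baseline, assuming 1024-d GoogLeNet image features and illustrative vocabulary sizes; the class and layer names are my own for illustration, not the project's.

```python
import torch
import torch.nn as nn

# Sketch of a bag-of-words + image-feature baseline; dimensions are assumptions.
class BowImageBaseline(nn.Module):
    def __init__(self, vocab_size=1000, img_dim=1024, ques_dim=512, num_answers=1000):
        super().__init__()
        self.ques_proj = nn.Linear(vocab_size, ques_dim)          # question features
        self.classifier = nn.Linear(ques_dim + img_dim, num_answers)

    def forward(self, bow_question, image_features):
        # bow_question: (batch, vocab_size) word-frequency vectors
        # image_features: (batch, img_dim) pre-extracted GoogLeNet features
        q = self.ques_proj(bow_question)
        fused = torch.cat([q, image_features], dim=1)
        return self.classifier(fused)   # logits over the answer vocabulary

model = BowImageBaseline()
logits = model(torch.zeros(4, 1000), torch.zeros(4, 1024))   # shape (4, 1000)
```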

The second network is the hierarchical co-attention network from Lu et al. '17. Questions are embedded at the word, phrase, and sentence level. Word embeddings are generated through a simple linear projection of each one-hot input token. The word embeddings are then convolved into three separate n-gram embeddings at each word position, based on the adjacent unigram, bigram, and trigram. A max-pool over the individual values of these n-gram embeddings constitutes the phrase embedding for a given word. Finally, the phrase embeddings are fed into an LSTM to generate a sentence embedding.
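
A minimal PyTorch sketch of this hierarchical question encoder follows; the dimensions, and the use of nn.Embedding to realize the linear projection of one-hot tokens, are illustrative assumptions rather than the project's exact configuration.

```python
import torch
import torch.nn as nn

# Sketch of the word / phrase / sentence question hierarchy described above.
class QuestionHierarchy(nn.Module):
    def __init__(self, vocab_size=1000, dim=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, dim)   # linear projection of one-hot tokens
        # Unigram, bigram, and trigram convolutions over the word embeddings.
        self.convs = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=k, padding=k - 1) for k in (1, 2, 3)]
        )
        self.lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, T) integer word indices
        words = self.word_embed(tokens)                    # (batch, T, dim)
        x = words.transpose(1, 2)                          # (batch, dim, T)
        T = tokens.size(1)
        # Each n-gram conv keeps one response per word position (truncate the padding).
        ngrams = [conv(x)[:, :, :T] for conv in self.convs]
        # Max over the unigram/bigram/trigram responses gives the phrase embedding.
        phrases = torch.stack(ngrams, dim=0).max(dim=0).values.transpose(1, 2)
        sentence, _ = self.lstm(phrases)                   # (batch, T, dim) hidden states
        return words, phrases, sentence

encoder = QuestionHierarchy()
w, p, s = encoder(torch.zeros(4, 10, dtype=torch.long))
```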

Image features are extracted from the final max-pool layer of a resnet18 network. Alternating co-attention is applied separately between the question and image features at each of the 3 question levels, with weights shared across these three applications. The alternating co-attention is structured as follows: first, attention is paid to the question with no guidance. This produces a guidance vector that acts as a key for the attention paid to the image features, which in turn generates a new guidance vector. This image guidance vector is then used to guide attention one last time over the question features, producing a final question guidance vector. This final question guidance, together with the image guidance, constitutes the output of the alternating attention. The six output guidance vectors from alternating co-attention (two per question level) are combined hierarchically to generate the final network output. To preserve a larger batch size with limited computational resources, the ResNet image embeddings were pre-computed.
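
The sketch below shows one plausible PyTorch implementation of a single alternating co-attention step as described above; the layer names and dimensions are assumptions for illustration, not the project's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of one alternating co-attention step: question -> image -> question.
class AlternatingCoAttention(nn.Module):
    def __init__(self, dim=512, hidden=512):
        super().__init__()
        self.feat_proj = nn.Linear(dim, hidden)
        self.guide_proj = nn.Linear(dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def attend(self, features, guidance=None):
        # features: (batch, N, dim); guidance: (batch, dim) or None
        h = self.feat_proj(features)
        if guidance is not None:
            h = h + self.guide_proj(guidance).unsqueeze(1)
        weights = F.softmax(self.score(torch.tanh(h)), dim=1)   # (batch, N, 1)
        return (weights * features).sum(dim=1)                  # (batch, dim)

    def forward(self, question, image):
        # question: (batch, T, dim); image: (batch, R, dim) region features
        q0 = self.attend(question)              # unguided pass over the question
        v_hat = self.attend(image, q0)          # image attention guided by q0
        q_hat = self.attend(question, v_hat)    # question attention guided by v_hat
        return q_hat, v_hat

coattn = AlternatingCoAttention()
q_hat, v_hat = coattn(torch.zeros(4, 10, 512), torch.zeros(4, 49, 512))
```

Applying the same module instance to the word-, phrase-, and sentence-level question features would realize the weight sharing described above.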

Qualitative progression of answers over training (figure captions):

Iter 0: incorrect understanding of the part of speech needed to answer the question.

Iter 24500: correct part of speech, incorrect answer.

Iter 47250: correct part of speech and correct answer, demonstrating conceptual 'understanding' (or a good guess).