Description
Tutorial Topic: Visual-Linguistic Problem
The purpose of this assignment is to make sure every student has a fundamental understanding of
the topic of their tutorial and has the same foundation for the next tutorials. For each tutorial, you
will receive a different survey paper that gives an overview of the state of the art on (a subset of)
the topic of your tutorial. You can find the survey paper on Wattle, under Tutorial Material, in the
folder “Assignment 1”.
The assignment consists of two parts. Part 1 contains general questions, which are the same
for every tutorial. Part 2 contains paper-specific questions. You should be able to answer all questions
after carefully reading the survey paper. We expect the assignment to take about 7.5 to 10 hours to
complete.
Full reference of the survey paper: Zhang, D., Cao, R., & Wu, S. (2019). Information Fusion in
Visual Question Answering: A Survey. Information Fusion, 52, 268-280.
https://doi.org/10.1016/j.inffus.2019.03.005
Note: Short and precise answers are preferred. Answer in your own words. Unless a question
states a different limit, please do not exceed about 250 words per question.
Part 1: General Questions (7.5 marks)
1. Which branch of the survey paper do you find most interesting, and why? (1 mark)
2. Write a summary of the branch you picked in your own words. (maximum 500 words, 2 marks)
3. Which three papers would you read next if you were to do a research project on that branch?
Please explain why you would pick these papers and give their full references. (1.5 marks)
4. Find and list at least two research groups that conduct state-of-the-art research on this topic. Please
justify your answer. (1 mark)
5. Name two open research problems in the field of this survey paper and explain why they are hard
and interesting. (max 500 words, 2 marks)
Part 2: Paper-specific Questions (7.5 marks)
1. What are the significant steps for an end-to-end VQA model? For each step, what are the possible
techniques? (1 mark)
2. Why is the attention mechanism effective for VQA models? (1.5 marks)
3. Is the following statement on the attention mechanism correct? Please justify your answer. (1.5
marks)
Outputs of attention layers are individual visual and linguistic feature vectors (i.e. Vi and Vq), which
cannot be directly used to generate answers. And the fusion of two feature channels is always needed to
get a joint representation no matter whether the two features are attended or not. Therefore, we should
regard the attention mechanism acting more like feature extraction than feature fusion in a VQA model.
4. Among all the fusion methods, which one do you think outperforms the others? Please justify your
answer. (1.5 marks)
5. Why is information fusion important to VQA? Is information fusion essential for other
visual-linguistic problems? You may explain from either the survey’s perspective or your own opinion.
(1.5 marks)
6. After reading through the whole survey paper, can you design a combination of techniques
that may result in the best performance? You may refer to Fig. 3 in Section 4 for the end-to-end
framework. (0.5 marks)