Solved Homework 4: Transformer Deep Learning CSCI-GA 2572 Fall 2025


1 Theory (50pt)
1.1 Attention (13pts)
This question tests your intuitive understanding of attention and its properties.
(a) (1pt) Given queries Q ∈ R^{d×n}, keys K ∈ R^{d×m}, and values V ∈ R^{t×m}, describe the operations needed to calculate the output H of the standard dot-product attention. What is the output dimension? (You may use the softargmax_β function directly; it is applied to each column of the matrix.)
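For concreteness, the operations in (a) can be sketched in NumPy under the column convention used above (queries, keys, and values stored as columns). The function names `softargmax` and `attention` are just illustrative, and the default β = 1/√d anticipates part (b):

```python
import numpy as np

def softargmax(scores, beta, axis=0):
    """Column-wise soft(arg)max with inverse-temperature beta."""
    z = beta * scores
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, beta=None):
    """Q: (d, n) queries, K: (d, m) keys, V: (t, m) values -> H: (t, n)."""
    d = Q.shape[0]
    if beta is None:
        beta = 1.0 / np.sqrt(d)                 # the usual 1/sqrt(d) scale
    A = softargmax(K.T @ Q, beta, axis=0)       # (m, n); each column sums to 1
    return V @ A                                # (t, n)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 5)), rng.normal(size=(2, 5))
H = attention(Q, K, V)
print(H.shape)  # (2, 3)
```

Each output column of H is a convex combination of the value columns, with mixing weights given by the corresponding column of A.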
(b) (2pts) Explain how the scale β influences the output of the attention. What value of β is convenient to use?
(c) (2pts) One advantage of the attention operation is that it can easily preserve a value vector v in the output h. Explain in what situation the output preserves the value vectors. Also, what should the scale β be if we just want the attention operation to preserve value vectors? Which of the four types of attention are we referring to? How can this be done when using fully connected architectures?
(d) (2pts) On the other hand, the attention operation can also dilute different value vectors v to generate a new output h. Explain in what situation the output is a spread-out version of the value vectors. Also, what should the scale β be if we want the attention operation to diffuse as much as possible? Which of the four types of attention are we referring to? How can this be done when using fully connected architectures?
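The effect of β in parts (b)–(d) can be checked numerically on a single query with made-up similarity scores; `col_softmax` is a hypothetical helper, not part of the assignment:

```python
import numpy as np

def col_softmax(s, beta):
    """Softmax over one score vector with inverse-temperature beta."""
    z = beta * s
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5])   # one query's similarities to three keys

# beta -> 0: weights become uniform, so the output averages (dilutes) all values
print(col_softmax(scores, 0.01))     # approx. [0.336, 0.333, 0.331]

# large beta: weights approach one-hot, so the output preserves a single value
print(col_softmax(scores, 50.0))     # approx. [1, 0, 0]
```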
(e) (2pts) If we apply a small perturbation to one of the keys k_i (you can assume the perturbation is a zero-mean Gaussian with small variance, so the new k̂_i = k_i + ϵ), how will the output H change?
(f) (2pts) If we apply a small perturbation to one of the queries q_i, how will the output H change? How does this differ from the previous case?
(g) (2pts) If we apply a large perturbation that scales one key so that k̂ = αk for α > 1, how will the output H change?
1.2 Multi-headed Attention (3pts)
This question tests your intuitive understanding of Multi-headed Attention and its properties.
(a) (1pt) Given queries Q ∈ R^{d×n}, keys K ∈ R^{d×m}, and values V ∈ R^{t×m}, describe the operations for calculating the output H of the standard multi-headed scaled dot-product attention. Assume we have h heads.
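One plausible sketch of the multi-headed computation in (a), again in the column convention: split the d rows into h chunks, attend per head, and concatenate. A full implementation would also learn per-head projection matrices W_q^i, W_k^i, W_v^i and an output projection; this sketch omits the per-head projections for brevity.

```python
import numpy as np

def softargmax(s, beta, axis=0):
    z = beta * s
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, h, W_o=None):
    """Q: (d, n), K: (d, m), V: (d, m); h heads of size d/h each.
    Split rows into h chunks, attend per head with scale 1/sqrt(d/h),
    concatenate, then mix with an output projection W_o (identity if None)."""
    d = Q.shape[0]
    assert d % h == 0
    dh = d // h
    heads = []
    for i in range(h):
        q, k, v = (M[i * dh:(i + 1) * dh] for M in (Q, K, V))
        A = softargmax(k.T @ q, 1.0 / np.sqrt(dh), axis=0)  # (m, n)
        heads.append(v @ A)                                  # (dh, n)
    H = np.concatenate(heads, axis=0)                        # (d, n)
    return H if W_o is None else W_o @ H

rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(8, s)) for s in (3, 5, 5))
print(multi_head_attention(Q, K, V, h=2).shape)  # (8, 3)
```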
(b) (2pts) Is there anything similar to multi-headed attention in convolutional networks? Explain why you think they are similar.
1.3 Self Attention (11pts)
This question tests your intuitive understanding of Self Attention and its properties.
(a) (2pts) Given an input C ∈ R^{e×n}, what are the queries Q, the keys K, the values V, and the output H of the standard multi-headed scaled dot-product self-attention? Assume we have h heads. (You can name and define the weight matrices yourself.)
(b) (2pts) Explain what positional encoding is. What is the difference between absolute and relative positional encoding? When is it appropriate to use absolute positional encoding, and when is relative encoding more appropriate?
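As a reference point for (b), the absolute sinusoidal encoding from the original Transformer paper can be written out directly (assuming an even model dimension d; the helper name is illustrative):

```python
import numpy as np

def sinusoidal_pe(n, d):
    """Absolute sinusoidal positional encodings, as in Vaswani et al.:
    PE[pos, 2i]   = sin(pos / 10000**(2i/d))
    PE[pos, 2i+1] = cos(pos / 10000**(2i/d))
    Returns an (n, d) array; assumes d is even."""
    pos = np.arange(n)[:, None]              # (n, 1)
    i = np.arange(0, d, 2)[None, :]          # (1, d/2)
    angles = pos / 10000 ** (i / d)
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(50, 16)
print(pe.shape)  # (50, 16)
```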
(c) (2pts) Show one situation in which the self-attention layer behaves like an identity layer or a permutation layer.
(d) (3pts) Show one situation in which the self-attention layer behaves like a convolution layer with a kernel larger than 1. You can assume we use positional encoding.
(e) (2pts) Suppose we are training a transformer architecture for real time
automatic speech recognition. Do we need to do anything special to the
attention mechanism? How do we achieve this?
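A common answer to (e) is causal (masked) attention, so that each frame attends only to past and current positions. A minimal NumPy sketch under the column convention (the helper name `causal_attention` is illustrative):

```python
import numpy as np

def causal_attention(Q, K, V, beta):
    """Self-attention over a sequence of length n (columns of Q, K, V),
    with a causal mask: query j may only attend to keys i <= j."""
    n = Q.shape[1]
    scores = beta * (K.T @ Q)                        # (n, n): rows=keys, cols=queries
    mask = np.arange(n)[:, None] > np.arange(n)[None, :]
    scores[mask] = -np.inf                           # block future keys
    scores = scores - scores.max(axis=0, keepdims=True)
    e = np.exp(scores)                               # exp(-inf) -> 0
    A = e / e.sum(axis=0, keepdims=True)
    return V @ A

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 6))
H = causal_attention(X, X, X, beta=0.5)
print(H.shape)  # (4, 6)
```

Note that the first query can only attend to the first key, so the first output column equals the first value column exactly.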
1.4 Transformer (15pts)
Read the original paper on the Transformer model: “Attention is All You Need”
by Vaswani et al. (2017).
(a) (3pts) Explain the primary differences between the Transformer architecture and previous sequence-to-sequence models (such as RNNs and LSTMs).
(b) (3pts) Explain the concept of self-attention and its importance in the Transformer model.
(c) (3pts) Describe the multi-head attention mechanism and its benefits.
(d) (3pts) Explain the feed-forward neural networks used in the model and
their purpose.
(e) (3pts) Name two techniques used in the paper to improve training stability of the Transformer model, in particular with regard to the issue of exploding/vanishing gradients, and briefly explain how they do so.
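As a hint for (e), one relevant mechanism is the residual "Add & Norm" block from the paper; a minimal sketch in which the `0.1 * z` sublayer is just a stand-in for attention or the feed-forward network:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token (column) to zero mean, unit variance."""
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    """'Add & Norm' as in the original paper: LayerNorm(x + Sublayer(x)).
    The identity path lets gradients flow unchanged through depth, while
    the normalization keeps activations (and hence gradients) well scaled."""
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(3).normal(size=(8, 4))   # (d, n): tokens as columns
y = residual_block(x, lambda z: 0.1 * z)
print(y.shape)  # (8, 4)
```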
1.5 Vision Transformer (8pts)
Read the paper on the Transformer model: “An Image is Worth 16 × 16 Words:
Transformers for Image Recognition at Scale”.
(a) (2pts) What is the key difference between the Vision Transformer (ViT)
and traditional convolutional neural networks (CNNs) in terms of handling
input images? Can you spot a convolution layer in the ViT architecture?
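A hint for spotting the convolution in (a): ViT's patch embedding is equivalent to a strided convolution whose kernel size equals its stride. A minimal NumPy sketch (the function and parameter names are illustrative):

```python
import numpy as np

def patch_embed(img, patch, W):
    """ViT patch embedding: split an (H, W, C) image into non-overlapping
    patch x patch tiles, flatten each, and project with W: (patch*patch*C, d).
    Equivalent to a Conv2d with kernel_size = stride = patch."""
    Himg, Wimg, C = img.shape
    assert Himg % patch == 0 and Wimg % patch == 0
    tokens = []
    for r in range(0, Himg, patch):
        for c in range(0, Wimg, patch):
            tile = img[r:r + patch, c:c + patch].reshape(-1)  # (patch*patch*C,)
            tokens.append(tile @ W)                           # (d,)
    return np.stack(tokens)                                   # (num_patches, d)

rng = np.random.default_rng(4)
img = rng.normal(size=(32, 32, 3))
W = rng.normal(size=(16 * 16 * 3, 64))
print(patch_embed(img, 16, W).shape)  # (4, 64)
```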
(b) (2pts) What is the role of positional embeddings in the Vision Transformer
model, and how do they differ from positional encodings used in the original
Transformer architecture?
(c) (2pts) How does the Vision Transformer model generate the final classification output? Describe the process and components involved in this step.
(d) (2pts) How does ViT compare with CNN in terms of performance across
different data regimes? What explains this trend?
2 Implementation (50pt)
Please add your solutions to the notebook HW4-VIT-Student.ipynb. Please use
your NYU account to access the notebook. The notebook contains parts
marked as TODO, where you should put your code or explanations. The notebook
is a Google Colab notebook, you should copy it to your drive, add your solutions,
and then download and submit it to NYU Classes. You’re also free to run it on any
other machine, as long as the version you send us can be run on Google Colab.