We want to find the self-attention weights assigned to the tokens in the sequence “Attention is everything” using scaled dot product attention. A single head is used.

The sequence is of length 3, and the dimensionality of the transformer is 4.

Below is the input embedding of shape (3, 4). Note that this embedding is the sum of the token embedding and the position embedding.

X = [[1, 2, 3, 4],
     [5, 0, 7, 0],
     [9, 0, 1, 2]]

The weight matrices for Q, K, and V are:

Wq = [[0.3, 0.2, 0.8, 0.9],
      [0.4, 0.1, 0.4, 0.5],
      [0.5, 0.7, 0.2, 0.8],
      [0.8, 0.8, 0.7, 0.4]]

Wk = [[0.3, 0.9, 0.2, 0.7],
      [0.5, 0.4, 0.2, 0.2],
      [0.1, 0.7, 0.3, 0.6],
      [0.8, 0.4, 0.5, 0.9]]

Wv = [[0.2, 0.2, 0.3, 0.9],
      [0.2, 0.3, 0.8, 0.6],
      [0.7, 0.5, 0.9, 0.9],
      [1.0, 0.4, 0.2, 0.5]]

If a causal mask is applied, what attention weight does “is” assign to “everything” in the sequence “Attention is everything”? Give the answer to two decimal places.

Hint: Lecture 19 slides 17 - 27

BlackTom AI · Accepted Answer

We are asked to find the self-attention weight that the token 'is' (the second token in the sequence) assigns to the token 'everything' (the third token) under scaled dot-product attention with a causal mask.

First, restating the setup: we have a 3-token sequence 'Attention is everything' with transformer dimension 4. The model uses a single attention head, with given Q, K, V weight matrices. A causal mask is applied, meaning the attention mechanism cannot attend to future positions; for a token at position i, only positions up to i (including i) are allowed as keys/values.
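In symbols, the masked attention described above can be sketched with the common additive-mask formulation. The symbols A (the attention weight matrix), M (the mask), and d_k (the key dimension, 4 here) are notation introduced for illustration and do not appear in the question; the softmax is taken row by row.

```latex
\[
A = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right),
\qquad
M_{ij} =
\begin{cases}
0 & \text{if } j \le i,\\
-\infty & \text{if } j > i.
\end{cases}
\]
```

With this convention, row i of A holds the weights that token i assigns to every token j, and every entry with j > i is zero.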

Now, analyze the impact of the causal mask on the specific pair ('is' attending to 'everything').
- The token 'is' is at position 2. Under a causal mask, position 2 can attend to positions 1 and 2 (the past and current), but not to position 3 (the future).
- The token 'everything' is at position 3, which is in the future relative to position 2. Therefore, the attention score for the pair (2 -> 3) is masked out (set to negative infinity before the softmax, which yields a weight of exactly zero), regardless of the actual dot-product value between the corresponding query and key vectors.

As a result, the attention weight that 'is' assigns to 'everything' is exactly zero. The masked score for the pair (2 -> 3) is negative infinity, so its exponential is 0 and it contributes nothing to the softmax normalization. The softmax for position 2 is effectively taken over the scores for positions 1 and 2 only, and those two weights sum to 1.
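The result can be checked numerically with a minimal NumPy sketch. It assumes the row-vector convention Q = X Wq and K = X Wk, which the question does not state explicitly; the function name `causal_attention_weights` is hypothetical. The masked entry is zero under either convention, since the mask is applied after the scores are computed.

```python
import numpy as np

# Input embeddings and projection weights copied from the question.
X = np.array([[1, 2, 3, 4],
              [5, 0, 7, 0],
              [9, 0, 1, 2]], dtype=float)
Wq = np.array([[0.3, 0.2, 0.8, 0.9],
               [0.4, 0.1, 0.4, 0.5],
               [0.5, 0.7, 0.2, 0.8],
               [0.8, 0.8, 0.7, 0.4]])
Wk = np.array([[0.3, 0.9, 0.2, 0.7],
               [0.5, 0.4, 0.2, 0.2],
               [0.1, 0.7, 0.3, 0.6],
               [0.8, 0.4, 0.5, 0.9]])


def causal_attention_weights(X, Wq, Wk):
    """Single-head scaled dot-product attention weights with a causal mask."""
    Q = X @ Wq  # assumed convention: tokens are rows
    K = X @ Wk
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # True strictly above the diagonal, i.e. where the key is in the future.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)


weights = causal_attention_weights(X, Wq, Wk)
# Row index 1 is 'is', column index 2 is 'everything' (0-based indexing).
print(f"{weights[1, 2]:.2f}")  # prints 0.00
```

Wv is not needed here because the question asks only for the attention weights, not the attended output.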

Expressed to two decimal places, the weight is 0.00. No rounding ambiguity arises, because the masked weight is exactly zero rather than merely small.

Note on the general flow: while you could compute the exact Q, K, V projections and the dot products, the essential constraint here is the causal mask that blocks attention to future tokens. The masking directly enforces a zero weight to the future token in this context, making any nonzero computation for that pair moot for the final probability distribution.

Therefore, the attention weight from 'is' to 'everything' with a causal mask is 0.00 when expressed to two decimals.
