On scaled dot-product attention and training stability of a transformer:

I Without scaling by $D_k$, the variance of the dot product $q_n^\top k_m$ grows with dimensionality, producing large logits that can saturate the softmax.
II Scaling by $D_k$ primarily solves exploding-gradient problems inside the value projection $V$.
III The softmax-normalized matrix $\mathrm{Softmax}(QK^\top)$ is applied row-wise, making each row represent how strongly a query attends to all keys.
IV Scaled dot-product attention computes $\mathrm{Attention}(Q,K,V) = \mathrm{Softmax}\!\left(\frac{QK^\top}{D_k}\right)V$, and the resulting matrix always has the same dimension as $V$.

BlackTom AI · Accepted Answer

Let’s parse the statements about scaled dot-product attention and training stability in transformers, and test each one against the standard formulation.
Option I: 'Without scaling by $D_k$, the variance of the dot product $q_n^\top k_m$ grows with dimensionality, producing large logits that can saturate the softmax.' This is correct in spirit. If the components of $q_n$ and $k_m$ are independent with zero mean and unit variance, the unscaled dot product $q_n^\top k_m$ has variance $D_k$, so the logits grow in magnitude with the dimensionality and the softmax output becomes increasingly peaked. Dividing by $\sqrt{D_k}$ counteracts this by keeping the variance of $QK^\top/\sqrt{D_k}$ at roughly 1 regardless of $D_k$. The standard factor is $\sqrt{D_k}$ rather than $D_k$ itself, but read as a reference to that scaling, the statement is true.
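The variance argument can be checked numerically. The following is a minimal sketch in pure Python, assuming query and key components drawn i.i.d. from a standard normal distribution; the function name `dot_product_variance`, the sample count, and the seed are illustrative choices, not part of the original question.

```python
import random
import statistics

def dot_product_variance(d_k, n_samples=4000, seed=0):
    """Empirical variance of q.k when q and k have i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pvariance(dots)

for d_k in (4, 16, 64):
    raw = dot_product_variance(d_k)
    # Dividing the dot product by sqrt(d_k) divides its variance by d_k.
    scaled = raw / d_k
    print(f"d_k={d_k:3d}  var(q.k)={raw:7.2f}  var(q.k/sqrt(d_k))={scaled:5.2f}")
```

Under these assumptions the unscaled variance should come out close to `d_k`, while the scaled variance should stay close to 1 for every dimension.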
Option II: 'Scaling by $D_k$ primarily solves exploding-gradient problems inside the value projection $V$.' This is not correct. The scaling acts on the logits $QK^\top$ before the softmax; its purpose is to keep the softmax out of its saturated regime, where the gradients with respect to the logits become very small. It does not touch $V$, and it is not a remedy for exploding gradients in the value projection. Improved numerical stability is a side benefit, but the statement mischaracterizes the role of the scaling factor.
Option III: 'The softmax-normalized matrix $\mathrm{Softmax}(QK^\top)$ is applied row-wise, making each row represent how strongly a query attends to all keys.' The statement omits the scaling: the actual operation is $\mathrm{Softmax}(QK^\top/\sqrt{D_k})$, applied row-wise, which yields an attention weight matrix $A$ whose rows each correspond to one query and form a probability distribution over the keys. That omission is a caveat, but the claim itself (row-wise softmax, each row giving one query's attention over all keys) is accurate, so III is true with the caveat noted.
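To make the row-wise reading concrete, here is a small pure-Python sketch that builds the attention weight matrix for two queries and three keys. The helper names `softmax` and `attention_weights` and the toy vectors are assumptions made for illustration only.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(Q, K):
    """Row-wise softmax of Q K^T / sqrt(d_k); Q and K are lists of row vectors."""
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    weights = []
    for q in Q:
        # One row of logits per query: its scaled dot product with every key.
        logits = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        weights.append(softmax(logits))
    return weights

Q = [[1.0, 0.0], [0.0, 2.0]]               # 2 queries, d_k = 2
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 keys,    d_k = 2
A = attention_weights(Q, K)
for row in A:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 6))
```

Each printed row belongs to one query, has one entry per key, and sums to 1, which is the sense in which a row describes how strongly that query attends to all keys.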
Option IV: 'Scaled dot-product attention computes $\mathrm{Attention}(Q,K,V) = \mathrm{Softmax}(QK^\top/D_k)\,V$, and the resulting matrix always has the same dimension as $V$.' There are two issues here. First, the standard scaling factor is $\sqrt{D_k}$, not $D_k$, so the division by $D_k$ is incorrect. Second, the dimension claim is ambiguous. With $Q$ of shape $(n_\text{queries}, D_k)$, $K$ of shape $(n_\text{keys}, D_k)$, and $V$ of shape $(n_\text{keys}, D_v)$, the output has shape $(n_\text{queries}, D_v)$: the same number of rows as $Q$ and the same number of columns as $V$. It matches $V$ in feature dimension $D_v$, but it has the same overall shape as $V$ only when the number of queries equals the number of keys, which need not hold (for example in cross-attention). Read strictly, the incorrect scaling factor and the possible shape mismatch make this option false in the standard formulation.
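The shape argument can be verified with a toy example in which the number of queries differs from the number of keys. This is a pure-Python sketch of the standard formulation with the $\sqrt{D_k}$ factor; the function name `attention` and the input values are hypothetical.

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors (illustrative sketch)."""
    d_k = len(K[0])
    d_v = len(V[0])
    out = []
    for q in Q:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value rows: one output row of length d_v per query.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]               # 2 queries, d_k = 2
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 keys,    d_k = 2
V = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]                 # 3 values,  d_v = 4
out = attention(Q, K, V)
print("output shape:", (len(out), len(out[0])), "V shape:", (len(V), len(V[0])))
# prints: output shape: (2, 4) V shape: (3, 4)
```

The output shares its column count with V and its row count with Q, so it matches the shape of V only when the number of queries equals the number of keys.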
In summary: I is true, II is false, III is true apart from the missing scaling, and IV is false because of the incorrect scaling factor and the dimension caveat. The provided answer of I, III, and IV therefore includes a statement (IV) that conflicts with the standard formulation of scaled dot-product attention.
