On scaled dot-product attention and training stability of a transformer:   I Without scaling by 𝐷 𝑘 , the variance of the dot product 𝑞 𝑛 ⊤ 𝑘 𝑚 grows with dimensionality, producing large logits that can saturate the softmax. II Scaling by 𝐷 𝑘 primarily solves exploding-gradient problems inside the value projection 𝑉 . III The softmax-normalized matrix S o f t m a x ( 𝑄 𝐾 ⊤ ) is applied row-wise, making each row represent how strongly a query attends to all keys. IV Scaled dot-product attention computes A t t e n t i o n ( 𝑄 , 𝐾 , 𝑉 ) = S o f t m a x ! ( 𝑄 𝐾 ⊤ 𝐷 𝑘 ) 𝑉 , and the resulting matrix always has the same dimension as 𝑉 .  单项选择题

登录即可查看完整答案

我们收录了全球超50000道真实原题与详细解析,现在登录,立即获得答案。

类似问题

更多留学生实用工具

加入我们,立即解锁 海量真题独家解析,让复习快人一步!