Consider a single-headed attention layer. What happens to the dimensions of the value weight matrix Wv, when we double the maximum input sequence length? Select all that apply多项选择题
A
None of the above
B
Half the number of columns
C
Half the number of rows
D
Double the number of rows
E
Double the number of columns
登录即可查看完整答案
我们收录了全球超50000道真实原题与详细解析,现在登录,立即获得答案。
类似问题
Aside from the transformer architecture itself, what is the major technological breakthrough in natural language techniques that significantly advanced natural language AI capabilities, particularly with respect to understanding the relationships of words and their context, even when there are interdependencies and long-range dependencies (e.g., within sentences, paragraphs, etc). As a bonus, this breakthrough also solved many of the problems that recurrent neural network (RNN) techniques such as LSTMs had with long-range dependencies and contextual understanding, and is one of the major reasons that transformers are replacing the use of RNNs.
What is the main role of the attention mechanism in an LLM?
On scaled dot-product attention and training stability of a transformer: I Without scaling by 𝐷 𝑘 , the variance of the dot product 𝑞 𝑛 ⊤ 𝑘 𝑚 grows with dimensionality, producing large logits that can saturate the softmax. II Scaling by 𝐷 𝑘 primarily solves exploding-gradient problems inside the value projection 𝑉 . III The softmax-normalized matrix S o f t m a x ( 𝑄 𝐾 ⊤ ) is applied row-wise, making each row represent how strongly a query attends to all keys. IV Scaled dot-product attention computes A t t e n t i o n ( 𝑄 , 𝐾 , 𝑉 ) = S o f t m a x ! ( 𝑄 𝐾 ⊤ 𝐷 𝑘 ) 𝑉 , and the resulting matrix always has the same dimension as 𝑉 .
Which innovation is at the core of the transformer architecture and enables modeling long-range dependencies effectively?
更多留学生实用工具
希望你的学习变得更简单
加入我们,立即解锁 海量真题 与 独家解析,让复习快人一步!