Want to know more about the attention ❗ mechanism?

Tangrui (Tory) Li, tuo90515@temple.edu

---

**Forewords**

The "attention mechanism" comes from the deep learning field, but it is given a different interpretation here, one that may lead it to something symbolic.

---

As you may have seen on many webpages, the attention mechanism is usually interpreted as a mask over the original data instance, whether that instance is an image or a natural language sentence. So we may think such a mechanism has some "**intelligence**", and that it can automatically find the "**meaningful**" parts. This is true, but only **partially**. The "meaningfulness" is defined from a purely mathematical perspective: if we have two vector representations of high-level features, we pay attention to both when they are highly linearly correlated. It is very natural to measure this with inner products, and this is essentially how the classical self-attention mechanism works.

To be formal, suppose we have \( N \) features (vector representations) stored in a matrix, each feature stored as a **row vector**, namely \( X=[x_1;x_2;\dots;x_N],\ x_i\in \mathbb{R}^k \). To see how these features are linearly correlated, we compute \( XX^T \), the matrix of pairwise inner products. These inner products can then be used as weights applied to \( X \), giving \( (XX^T)X \), such that the higher the collinearity, the higher the "meaningfulness". If we come back to the self-attention score formula [1], we find the same structure.

$$
Attention(Q,K,V)=softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V \\
Q=XT_Q \\
K=XT_K \\
V=XT_V
$$

\( Q,K,V \) are linear transformations (without activation) of \( X \); these transformations are the trainable parts and add more flexibility. The \( softmax() \) function and the \( \sqrt{d_k} \) factor serve as a kind of normalization. Setting aside these auxiliary parts, we are left with \( Attention(X)=(XX^T)X \), exactly as explained above. This may account for the self-attention mechanism, but only mathematically.
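To make the reduction above concrete, here is a minimal NumPy sketch (my own illustration, not part of the original derivation): with the transformations set to the identity, the attention output differs from \( (XX^T)X \) only by the softmax and \( \sqrt{d_k} \) normalization. The shapes, the random data, and the identity choice are all assumptions made for the sketch.

```python
# Minimal sketch of scaled dot-product self-attention with row-vector features.
# Shapes, random data, and the identity transforms are illustrative assumptions.
import numpy as np

def self_attention(X, T_Q, T_K, T_V):
    Q, K, V = X @ T_Q, X @ T_K, X @ T_V
    scores = Q @ K.T / np.sqrt(K.shape[1])         # pairwise scaled inner products
    scores -= scores.max(axis=1, keepdims=True)    # for a numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                        # N = 4 features of dimension k = 3
I = np.eye(3)

# With T_Q = T_K = T_V = I, the only difference from (X X^T) X
# is the softmax / sqrt(d_k) normalization of the weight matrix.
print(self_attention(X, I, I, I))
print((X @ X.T) @ X)
```

The two printed matrices are not numerically equal, but row by row they weight the same inner-product structure; the trainable \( T_Q,T_K,T_V \) only add flexibility on top of it.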
---

When we human beings talk about "attention", we do not do it in this mathematical way; we do it "semantically". That is why \( Q,K,V \) are interpreted as **queries**, **keys** and **values**: the authors wanted to give these vector-represented features some semantic meaning. A hint for this naming comes from Hopfield networks, which are binary networks for storing and retrieving patterns. Here is a simple example to introduce their functionality. Say you have two remembered patterns, shown as the icons below.

☠️|🍎
Now you have a test image, and you would like to know which of the remembered patterns it is closest to. Here is your test image.
💀
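Before stating the outcome, here is a minimal sketch of how a classical Hopfield network could handle this example: the two patterns are stored with the Hebbian rule ("activated simultaneously, more likely") and the corrupted test image is used as the retrieval cue. The tiny ±1 vectors standing in for the icons, the single flipped entry, and the synchronous sign updates are all assumptions made for the sketch.

```python
# A toy classical Hopfield network (illustrative: small +/-1 vectors stand in
# for the skull and apple icons; the names and sizes are my assumptions).
import numpy as np

skull = np.array([ 1,  1,  1,  1, -1, -1, -1, -1])   # first stored pattern
apple = np.array([ 1, -1,  1, -1,  1, -1,  1, -1])   # second stored pattern
patterns = [skull, apple]

# Hebbian storage ("activated simultaneously, more likely"):
# sum of outer products, with self-connections removed.
W = sum(np.outer(p, p) for p in patterns).astype(float)
np.fill_diagonal(W, 0.0)

# The test image: the skull with one entry flipped (a corrupted cue; in the
# attention analogy it plays the role of the query, the stored patterns the keys).
test = skull.copy()
test[1] = -test[1]

state = test.astype(float)
for _ in range(5):                       # a few synchronous sign updates
    state = np.sign(W @ state)

print(state.astype(int))                 # [ 1  1  1  1 -1 -1 -1 -1]
print(np.array_equal(state, skull))      # True: the first pattern is recovered
```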
Though it is different from any of the stored patterns, both we and the Hopfield network will transform this test image into the first pattern.

---

Here, we may call your test image the "query \( Q \)", and the remembered patterns the "keys \( K \)". Though they do not appear in Hopfield networks, we may define the similarities between the query and the keys as the "values \( V \)". In neural networks these patterns exist as vectors, though they usually do not have a semantic meaning of the kind defined in Hopfield networks.

---

**Related to NARS?**

In NARS we will not have such vector representations, but note that the original Hopfield networks are trained with the Hebbian learning rule ("activated simultaneously, more likely"), which might also be of interest for NARS. I would like to find out how such learning rules could be applied in NARS and, together with the similarity between Hopfield networks and the self-attention mechanism, to discover more similarities between NARS and neural networks.

---

**Reference**

[1] Vaswani, Ashish, et al. "Attention is all you need." *Advances in neural information processing systems* 30 (2017).