Suppose we have 5 cars. Each car has a set of features, *e.g.* safety,
capacity, acceleration, mpg, etc. Each car also has a set of associated
values, *e.g.* financial, social, resale, etc. Now, suppose we have
a new car and we would like to determine its set of values. An obvious
approach is to say that if the new car has features similar to an existing
car, its values will also be similar to that car's.

In this example, if we determine that the new car is $70\%$ similar to car 1, $15\%$ similar to car 2, $10\%$ similar to car 3, $3\%$ similar to car 4 and $2\%$ similar to car 5, then the value of the new car is

\begin{align} V = 0.70\ V_1 + 0.15\ V_2 + 0.10\ V_3 + 0.03\ V_4 + 0.02\ V_5 \end{align}

Note that $0.70 + 0.15 + 0.10 + 0.03 + 0.02 = 1$. This process of
determining the value of the new car is called `attention`. It involves
finding the similarity between the new car and the existing cars. The
weighted values of the existing cars are then used to determine the value
of the new car.
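The weighted-sum computation above can be sketched in plain Python; the similarity scores are taken from the example, while the car values are made-up illustrative numbers:

```python
# Already-normalized similarity scores of the new car to cars 1-5 (from the example).
weights = [0.70, 0.15, 0.10, 0.03, 0.02]

# Hypothetical scalar values for cars 1-5 (e.g. resale value in thousands).
values = [20.0, 35.0, 15.0, 50.0, 8.0]

# Value of the new car: the similarity-weighted sum of the existing values.
new_value = sum(w * v for w, v in zip(weights, values))
```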

A vector which represents features is called the `key` vector, $k$. A
vector which represents values is called the `value` vector, $v$. A
vector whose value is to be determined is called the `query` vector, $q$.
The function used to determine the similarity between a query vector and a
key vector is called the attention function or the scoring function. The
scoring function returns a real-valued scalar. The scores are normalized,
typically using softmax, so that they sum to 1. The final value is then
the weighted sum of the value vectors.

There are many ways to define a scoring function [1,4]. Let's assume $k$, $v$ and $q$ are column vectors.

The dot product is the simplest scoring function and has no learnable parameters. However, it requires that $q$ and $k$ be of the same size.

\begin{align} \textit{score}(q, k) = q^Tk \end{align}
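As a quick sketch in plain Python (the example vectors are made up):

```python
def dot_score(q, k):
    """Dot-product score; requires q and k to have the same length."""
    assert len(q) == len(k), "q and k must be the same size"
    return sum(qi * ki for qi, ki in zip(q, k))

score = dot_score([1.0, 2.0], [3.0, 4.0])  # 1*3 + 2*4 = 11.0
```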

The magnitude of the dot product grows as the dimensions get larger. A fix, proposed in Transformers [3], is to introduce a scaling factor of $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors. This function also has no learnable parameters. The equation below differs from the one in [3] because the authors there assume $k$, $v$ and $q$ to be row vectors.

\begin{align} \textit{score}(q, k) = \frac{q^Tk}{\sqrt{d_k}} \end{align}
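A minimal sketch of the scaled variant, using $\sqrt{d_k}$ as the scaling factor per [3]:

```python
import math

def scaled_dot_score(q, k):
    """Scaled dot-product score: (q . k) / sqrt(d_k)."""
    assert len(q) == len(k), "q and k must be the same size"
    d_k = len(k)
    dot = sum(qi * ki for qi, ki in zip(q, k))
    return dot / math.sqrt(d_k)
```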

In this formulation, $W$ is a learnt matrix, and the dimensions of $q$ and $k$ can be different. Note that the dot-product and scaled dot-product functions are also bilinear, with $W$ fixed to the (scaled) identity matrix.

\begin{align} \textit{score}(q, k) = q^T W k \end{align}
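A sketch of the bilinear score in plain Python; in practice `W` would be learnt, but here it is just an illustrative argument:

```python
def bilinear_score(q, k, W):
    """Bilinear score q^T W k. W has shape (len(q), len(k)),
    so q and k may have different sizes."""
    # First compute W k, then take the dot product with q.
    Wk = [sum(W[i][j] * k[j] for j in range(len(k))) for i in range(len(q))]
    return sum(qi * wk_i for qi, wk_i in zip(q, Wk))
```

With `W` set to the identity matrix, this reduces to the plain dot product.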

In this formulation, we concatenate the query and key vectors, apply a linear transformation followed by a $\tanh$ non-linearity, and then take a dot product with a learnt vector $w$.

\begin{align} \textit{score}(q, k) &= w^T \tanh(W_1q + W_2k) \\ &= w^T \tanh(W[q;k]) \end{align}

where $[q;k]$ denotes concatenation and $W = [W_1\ W_2]$. Here $w$, $W_1$ and $W_2$ are learnt parameters. Also, the sizes of $q$ and $k$ can be different. This function was used in [2,4].
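A sketch of the additive score; `W1`, `W2` and `w` would be learnt in practice, but are passed in here for illustration:

```python
import math

def additive_score(q, k, W1, W2, w):
    """Additive score w^T tanh(W1 q + W2 k).
    W1 has shape (h, len(q)), W2 has shape (h, len(k)) and w has
    length h, so q and k may have different sizes."""
    h = [math.tanh(sum(W1[i][j] * q[j] for j in range(len(q))) +
                   sum(W2[i][j] * k[j] for j in range(len(k))))
         for i in range(len(w))]
    return sum(wi * hi for wi, hi in zip(w, h))
```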

When we want to determine the scores of multiple key and query vectors at once, we can replace the key and query vectors with the key and query matrices, $K$ and $Q$ respectively, in the above equations. In general, given $Q$, $K$ and $V$, the values of the corresponding query vectors are given by,

\begin{align} \textit{Attention}(Q,K,V) = V\ \textit{softmax}\ (\textit{score}(Q, K)) \end{align}
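For a single query vector, the whole pipeline (score, softmax, weighted sum) can be sketched in plain Python; the unscaled dot-product score is used here for simplicity:

```python
import math

def softmax(xs):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Score q against every key, normalize the scores with softmax,
    and return the weighted sum of the value vectors."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
```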

Self-attention is simply the case $Q = K = V$, *i.e.* we compute a new value for each vector by comparing it with all the vectors (including itself).

In Transformers [3], the authors first apply a linear transformation
to the input matrices $Q$, $K$ and $V$, and then perform attention *i.e.*
they compute

\begin{align} \textit{Attention}(W^QQ,W^KK,W^VV) = W^VV\ \textit{softmax}\ (\textit{score}(W^QQ, W^KK)) \end{align}

where, $W^V$, $W^Q$ and $W^K$ are learnt parameters. In multi-head attention, say with #heads = 4, the authors apply a linear transformation to the matrices and perform attention 4 times as follows.

\begin{align} \textit{head}_1 &= \textit{Attention}(W^Q_1Q, W^K_1K, W^V_1V) \\ \textit{head}_2 &= \textit{Attention}(W^Q_2Q, W^K_2K, W^V_2V) \\ \textit{head}_3 &= \textit{Attention}(W^Q_3Q, W^K_3K, W^V_3V) \\ \textit{head}_4 &= \textit{Attention}(W^Q_4Q, W^K_4K, W^V_4V) \end{align}

where, $W^V_i$, $W^Q_i$ and $W^K_i$ are learnt parameters. The 4 heads are concatenated and then linearly transformed again, as follows.

\begin{align} \textit{Multihead}(Q, K, V) = W^O [\textit{head}_1; \textit{head}_2; \textit{head}_3; \textit{head}_4] \end{align}
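Putting it together, multi-head attention can be sketched in plain Python under the column-vector convention used above, assuming $\textit{score}(Q, K) = K^TQ$ with softmax applied column-wise and heads concatenated row-wise; the dot-product score is used without scaling for brevity:

```python
import math

def matmul(A, B):
    """Product of A (m x n) and B (n x p), as nested lists."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def softmax_cols(S):
    """Column-wise softmax, so each column of weights sums to 1."""
    out_cols = []
    for col in zip(*S):
        m = max(col)
        exps = [math.exp(x - m) for x in col]
        total = sum(exps)
        out_cols.append([e / total for e in exps])
    return [list(row) for row in zip(*out_cols)]

def attention(Q, K, V):
    """Attention(Q, K, V) = V softmax(K^T Q); columns are vectors."""
    scores = matmul([list(r) for r in zip(*K)], Q)  # score(Q, K) = K^T Q
    return matmul(V, softmax_cols(scores))

def multihead(Q, K, V, WQs, WKs, WVs, WO):
    """Project Q, K and V per head, attend, concatenate the heads
    row-wise, and apply the final projection WO."""
    heads = [attention(matmul(WQ, Q), matmul(WK, K), matmul(WV, V))
             for WQ, WK, WV in zip(WQs, WKs, WVs)]
    concat = [row for head in heads for row in head]  # [head_1; ...; head_h]
    return matmul(WO, concat)
```

With a single head and all projection matrices set to the identity, `multihead` reduces to plain self-attention on the input.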