Suppose we have 5 cars. Each car has a set of features, e.g. safety, capacity, acceleration, and mpg. Each car also has a set of associated values, e.g. financial, social, and resale. Now suppose we have a new car and would like to determine its set of values. An obvious approach is to say that if the new car has features similar to those of an existing car, its values will also be similar to that car's.
In this example, if we determine that the new car is $70\%$ similar to car 1, $15\%$ similar to car 2, $10\%$ similar to car 3, $3\%$ similar to car 4 and $2\%$ similar to car 5, then the value of the new car,
\begin{align} V = 0.70\ V_1 + 0.15\ V_2 + 0.10\ V_3 + 0.03\ V_4 + 0.02\ V_5 \end{align}
Note that $0.70 + 0.15 + 0.10 + 0.03 + 0.02 = 1$. This process of determining the value of the new car is called attention. It involves finding the similarity between the new car and the existing cars; the weighted values of the existing cars are then used to determine the value of the new car.
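As a rough sketch, the car example can be written in a few lines of NumPy; the value vectors below are made up purely for illustration.

```python
import numpy as np

# Hypothetical value vectors for the 5 existing cars
# (columns: financial, social, resell) -- numbers are made up.
values = np.array([
    [9.0, 7.0, 8.0],   # car 1
    [6.0, 8.0, 5.0],   # car 2
    [7.0, 5.0, 6.0],   # car 3
    [4.0, 6.0, 4.0],   # car 4
    [5.0, 4.0, 3.0],   # car 5
])

# Similarities of the new car to the existing cars (they sum to 1).
weights = np.array([0.70, 0.15, 0.10, 0.03, 0.02])

# Value of the new car = weighted sum of the existing value vectors.
new_value = weights @ values
print(new_value)   # [8.12 6.86 7.13]
```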
A vector which represents features is called the key vector, $k$. A vector which represents values is called the value vector, $v$. A vector whose value is to be determined is called the query vector, $q$.
The function used to determine the similarity between a query vector and a key vector is called the attention function or the scoring function. The scoring function returns a real-valued scalar. The scores are normalized, typically using softmax, so that they sum to 1. The final value is the weighted sum of the value vectors, with the normalized scores as weights.
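As a minimal sketch of this pipeline for a single query (the function names attend and softmax, and the generic score argument, are mine, not taken from the references):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of scores.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, keys, values, score):
    # Score the query against every key, normalize the scores with
    # softmax, and return the weighted sum of the value vectors.
    weights = softmax(np.array([score(q, k) for k in keys]))
    return sum(w * v for w, v in zip(weights, values))
```

Any of the scoring functions described below can be plugged in as the score argument.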
There are many ways to define a scoring function [1,4]. Let's assume $k$, $v$ and $q$ are column vectors.
The dot product is the simplest scoring function and has no learnable parameters. However, it requires that $q$ and $k$ have the same size.
\begin{align} \textit{score}(q, k) = q^Tk \end{align}
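A one-line sketch with NumPy vectors (the name dot_score is mine):

```python
def dot_score(q, k):
    # q, k: 1-D NumPy arrays of the same dimension; no learnable parameters.
    return q @ k
```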
The magnitude of the dot product grows as the dimension gets larger. A fix, proposed in the Transformer [3], is to divide by a scaling factor ($\sqrt{d_k}$ in [3], where $d_k$ is the dimension of the key vectors). The scaled dot product also has no learnable parameters. The equation below differs from the one in [3] because the authors there assume $k$, $v$ and $q$ to be row vectors.
\begin{align} \textit{score}(q, k) = \frac{q^Tk}{\textit{scaling factor}} \end{align}
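A sketch assuming the scaling factor is $\sqrt{d_k}$ as in [3]:

```python
import numpy as np

def scaled_dot_score(q, k):
    # Dot product divided by sqrt(d_k), the dimension of the key vector.
    return (q @ k) / np.sqrt(k.shape[0])
```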
In the bilinear formulation, $W$ is a learnt matrix, and the dimensions of $q$ and $k$ can be different. Note that the dot product and scaled dot product functions are also bilinear (they correspond to $W$ being the identity matrix, up to scaling).
\begin{align} \textit{score}(q, k) = q^T W k \end{align}
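A sketch of the bilinear score; $W$ would be a learnt parameter in practice, here it is simply passed in:

```python
def bilinear_score(q, k, W):
    # W has shape (dim_q, dim_k), so q and k may have different sizes.
    return q @ W @ k
```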
In the additive (or concat) formulation, we concatenate the query and key vectors, apply a linear transformation followed by a $\tanh$ non-linearity, and then project the result to a scalar.
\begin{align} \textit{score}(q, k) &= w^T \tanh(W_1q + W_2k) \\ &= w^T \tanh(W[q;k]) \end{align}
where $[q; k]$ denotes concatenation and $W = [W_1\ W_2]$. Here $w$, $W_1$ and $W_2$ are learnt parameters. Again, the sizes of $q$ and $k$ can be different. This function was used in [2,4].
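A sketch of the additive score in the $W_1q + W_2k$ form; the parameters w, W1 and W2 would be learnt in practice and are passed in here:

```python
import numpy as np

def additive_score(q, k, w, W1, W2):
    # w: (hidden,), W1: (hidden, dim_q), W2: (hidden, dim_k).
    # Equivalent to w^T tanh(W [q; k]) with W = [W1 W2].
    return w @ np.tanh(W1 @ q + W2 @ k)
```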
When we want to determine the scores of multiple query and key vectors at once, we can replace the individual vectors in the above equations with matrices $Q$ and $K$ whose columns are the query and key vectors. In general, given $Q$, $K$ and $V$, the values of the corresponding query vectors are given by
\begin{align} \textit{Attention}(Q,K,V) = V \cdot \textit{softmax}\ (\textit{score}(Q, K)) \end{align}
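A minimal sketch of this matrix form, using the (unscaled) dot-product score and the column-vector convention from above; the function name attention is mine:

```python
import numpy as np

def attention(Q, K, V):
    # Columns of Q, K and V are the query, key and value vectors.
    # S[i, j] = score(q_j, k_i) with the dot-product score; softmax is
    # taken over the keys so each column of A sums to 1.
    S = K.T @ Q                                   # (n_keys, n_queries)
    e = np.exp(S - S.max(axis=0, keepdims=True))  # stable softmax
    A = e / e.sum(axis=0, keepdims=True)
    return V @ A                                  # (d_v, n_queries)
```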
Self-attention is nothing but the case $Q = K = V$, i.e. we compute a new value for each vector by comparing it with all the vectors (including itself).
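Reusing the attention sketch above, self-attention is just a call with the same matrix in all three slots:

```python
import numpy as np

X = np.random.randn(8, 5)     # 5 column vectors of dimension 8
new_X = attention(X, X, X)    # each column is re-expressed as a weighted
                              # sum of all the columns (including itself)
```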
In the Transformer [3], the authors first apply a linear transformation to the input matrices $Q$, $K$ and $V$, and then perform attention, i.e. they compute
\begin{align} \textit{Attention}(W^QQ,W^KK,W^VV) = W^VV \cdot \textit{softmax}\ (\textit{score}(W^QQ, W^KK)) \end{align}
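A shape-level sketch, again reusing the attention function above; the projection matrices are random placeholders here purely to illustrate the idea:

```python
import numpy as np

d, n = 8, 5
Q = K = V = np.random.randn(d, n)

# Learnt projections in practice; random placeholders here.
WQ, WK, WV = (np.random.randn(d, d) for _ in range(3))

out = attention(WQ @ Q, WK @ K, WV @ V)   # (d, n)
```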
where $W^Q$, $W^K$ and $W^V$ are learnt parameters. In multi-head attention, say with 4 heads, the authors apply a separate linear transformation for each head and perform attention 4 times, as follows.
\begin{align} \textit{head}_1 &= \textit{Attention}(W^Q_1Q, W^K_1K, W^V_1V) \\ \textit{head}_2 &= \textit{Attention}(W^Q_2Q, W^K_2K, W^V_2V) \\ \textit{head}_3 &= \textit{Attention}(W^Q_3Q, W^K_3K, W^V_3V) \\ \textit{head}_4 &= \textit{Attention}(W^Q_4Q, W^K_4K, W^V_4V) \end{align}
where $W^Q_i$, $W^K_i$ and $W^V_i$ are learnt parameters. The 4 heads are concatenated and then linearly transformed again, as follows.
\begin{align} \textit{Multihead}(Q, K, V) = W^O [\textit{head}_1; \textit{head}_2; \textit{head}_3; \textit{head}_4] \end{align}
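A sketch of multi-head attention with 4 heads, reusing the attention function above; the head size and the random projection matrices are illustrative assumptions, not values from [3]:

```python
import numpy as np

def multihead(Q, K, V, WQs, WKs, WVs, WO):
    # One (W^Q_i, W^K_i, W^V_i) triple per head; the heads are concatenated
    # along the feature dimension and mixed by the output projection W^O.
    heads = [attention(WQ @ Q, WK @ K, WV @ V)
             for WQ, WK, WV in zip(WQs, WKs, WVs)]
    return WO @ np.concatenate(heads, axis=0)

# 4 heads of size 2 on n = 5 input vectors of dimension d = 8.
d, d_head, n, h = 8, 2, 5, 4
Q = K = V = np.random.randn(d, n)
WQs = [np.random.randn(d_head, d) for _ in range(h)]
WKs = [np.random.randn(d_head, d) for _ in range(h)]
WVs = [np.random.randn(d_head, d) for _ in range(h)]
WO = np.random.randn(d, h * d_head)
out = multihead(Q, K, V, WQs, WKs, WVs, WO)   # (d, n)
```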