Matrix Multiplication: Inner Product, Outer Product & Systolic Array

June 14, 2018

There are multiple ways to implement matrix multiplication in software and hardware. Let $C_{M\times N} = A_{M\times K} \cdot B_{K \times N}$. The most straightforward software approach is to implement it using three nested for loops, as shown below (with $C$ initialized to zero). The animation on the right shows the matrices $A$ and $B$, with the elements accessed during code execution highlighted. Each element of $C$ is the accumulation of $K$ partial sums.


for(int m = 0; m < M; m++) {
    for(int n = 0; n < N; n++) {
        for(int k = 0; k < K; k++) {
            C[m][n] += A[m][k]*B[k][n];
        }
    }
}


It turns out that the result is the same even if we permute the order of the for loops. The loops can be permuted in $3! = 6$ ways. Each permutation results in a different access pattern for the elements of $A$ and $B$, and a different sequence in which the partial sums of $C$ are generated. The animation below shows the access pattern when the $M$ and $N$ loops are swapped: the elements of $C$ are produced column-wise as opposed to row-wise.


for(int n = 0; n < N; n++) {
    for(int m = 0; m < M; m++) {
        for(int k = 0; k < K; k++) {
            C[m][n] += A[m][k]*B[k][n];
        }
    }
}

In both of the above cases, the elements of $C$ are computed sequentially: each element of $C$ is the inner product (dot product) of a row vector from $A$ and a column vector from $B$. In the next two permutations, $K$ is the outermost loop. Here, one partial sum is generated for every element of $C$ before the next round of partial sums is generated; overall, $K$ partial sums are accumulated into each element. The order in which the partial sums are generated depends on the order of the $M$ and $N$ loops.


for(int k = 0; k < K; k++) {
    for(int m = 0; m < M; m++) {
        for(int n = 0; n < N; n++) {
            C[m][n] += A[m][k]*B[k][n];
        }
    }
}



for(int k = 0; k < K; k++) {
    for(int n = 0; n < N; n++) {
        for(int m = 0; m < M; m++) {
            C[m][n] += A[m][k]*B[k][n];
        }
    }
}

In the next two permutations, partial sums are generated and accumulated per row or column of $C$ before moving on to the next row or column.


for(int m = 0; m < M; m++) {
    for(int k = 0; k < K; k++) {
        for(int n = 0; n < N; n++) {
            C[m][n] += A[m][k]*B[k][n];
        }
    }
}



for(int n = 0; n < N; n++) {
    for(int k = 0; k < K; k++) {
        for(int m = 0; m < M; m++) {
            C[m][n] += A[m][k]*B[k][n];
        }
    }
}

Outer Product

One way to design a matrix multiplication accelerator is to generate and accumulate the partial sums in parallel. Generating the partial sums is equivalent to computing the outer product of a column vector from $A$ and a row vector from $B$. In the animation below, each time step generates one outer product, i.e., $MN$ partial sums in parallel. Accumulating them in parallel gives us the result in $K$ time steps.


for(int k = 0; k < K; k++) {
    #parallel
    for(int m = 0; m < M; m++) {
        for(int n = 0; n < N; n++) {
            C[m][n] += A[m][k]*B[k][n];
        }
    }
}

Systolic Array

Another way to design a matrix multiplication accelerator is to use a systolic array. Reference [3] explains it with a simple example. The inputs to the systolic array are the rows of $A$ and the columns of $B$, staggered in time. The partial sums are again generated and accumulated in parallel. However, it takes $M+(N-1)+(K-1)$ time steps to generate the result, and the computation proceeds as a diagonal wavefront, as shown in the animation.

References

  1. Inner Product
  2. Outer Product
  3. Systolic Array