There are multiple ways to implement matrix multiplication in software and hardware. Let, $C_{M\times N} = A_{M\times K}.B_{K \times N}$. The most straightforward software approach is to implement it using three nested for loops as shown below. The animation on the right shows the matrix $A$ in and $B$ in . The elements of $A$ and $B$ accessed during code execution are highlighted in and respectively. Each element of $C$ is the accumulation of $K$ partial sums and is shown in shades of .

```
for(int m = 0; m < M; m++) {
for(int n = 0; n < N; n++) {
for(int k = 0; k < K; k++) {
C[m][n] += A[m][k]*B[k][n];
}
}
}
```

It turns out that the result is the same even if we permute the order of the for loops. The for loops can be permuted in $3! = 6$ ways. Each permutation results in a different access pattern for the elements of $A$ and $B$, and sequence in which the partial sums are generated for the $C$ matrix. The animation below shows the access pattern when the $M$ and $N$ loops are swapped. The elements of $C$ are produced column wise as opposed to row wise.

```
for(int n = 0; n < N; n++) {
for(int m = 0; m < M; m++) {
for(int k = 0; k < K; k++) {
C[m][n] += A[m][k]*B[k][n];
}
}
}
```

In both of the above cases, the elements of $C$ are computed sequentially. For each element of $C$, we calculate the inner product or dot product of a row vector from $A$ and a column vector from $B$. In the next two permutations, $K$ is the outermost loop. In these cases, a partial sum is generated for each element of $C$ before the next partial sums are generated. Overall, $K$ partial sums are accumulated. The order in which partial sums are generated depends on the order of $M$ and $N$ loops.

```
for(int k = 0; k < K; k++) {
for(int m = 0; m < M; m++) {
for(int n = 0; n < N; n++) {
C[m][n] += A[m][k]*B[k][n];
}
}
}
```

```
for(int k = 0; k < K; k++) {
for(int n = 0; n < N; n++) {
for(int m = 0; m < M; m++) {
C[m][n] += A[m][k]*B[k][n];
}
}
}
```

In the next two permutations, partial sums are generated and accumulated per row or column of $C$ before moving on to the next row or column.

```
for(int m = 0; m < M; m++) {
for(int k = 0; k < K; k++) {
for(int n = 0; n < N; n++) {
C[m][n] += A[m][k]*B[k][n];
}
}
}
```

```
for(int n = 0; n < N; n++) {
for(int k = 0; k < K; k++) {
for(int m = 0; m < M; m++) {
C[m][n] += A[m][k]*B[k][n];
}
}
}
```

One way to design a matrix multiplication accelerator is to generate and accumulate partial sums in parallel. Generating partial sums is the same as computing an outer product of a column vector from $A$ and a row vector from $B$. In the animation below, in each time step, we generate an outer product i.e. we generate $MN$ partial sums in parallel. Accumulating them in parallel gives us the result in $K$ time steps.

```
for(int k = 0; k < K; k++) {
#parallel
for(int m = 0; m < M; m++) {
for(int n = 0; n < N; n++) {
C[m][n] += A[m][k]*B[k][n];
}
}
}
```

Another way to design a matrix multiplication accelerator is to use a systolic array. Reference [3] explains it with a simple example. The inputs to the systolic array are rows of $A$ and columns of $B$ staggered in time. The partial sums are also generated and accumulated in parallel. However, it takes $M+(N-1)+(K-1)$ time steps to generate the result and the computation has a diagonal wavefront as shown in the animation.