Row major format is most notably used by C/C++, while column major format
is used by Fortran, MATLAB, R, etc. External libraries and packages for
scientific computing, statistics, machine learning, and BLAS routines
(e.g. Nvidia cuBLAS) may either choose a particular storage format or
support both formats. The Nvidia cuBLAS library uses a column
major format but can be called from both C and Fortran code.
The question, then, is how a programmer deals with both
formats in the same application, *e.g.* C/C++ (row major) on the CPU and
cuBLAS (column major) on the GPU. Some software frameworks like PyTorch
hide this complexity completely. However, it is interesting to understand
what happens under the hood and how row major and column major software
systems interact. In the above example, row major code runs on the CPU
and column major code runs on the GPU. However, the device, *i.e.* CPU
or GPU, is not relevant in this context: one could substitute another
column major BLAS library (*e.g.* MKL) running on the CPU for cuBLAS
running on the GPU.

A computer's memory is essentially a byte-addressable data storage
device. How a matrix (an array) is laid out in memory, *e.g.* row major,
column major or another custom format, is purely a software (programming
language, library developer) choice. C/C++ by default follow a row major
storage format but can be used to implement, for example, a BLAS library
that follows a column major format. The same memory supports a row major
and a column major format equally well. Due to the way caches and DRAMs
are designed, accessing data elements sequentially from memory is better
for performance. However, these performance considerations apply equally
to any format chosen by the programmer.

Consider a $2\times3$ matrix,

\begin{align}
A =
\begin{bmatrix}
2 & 3 & 5 \\
7 & 11 & 13 \\
\end{bmatrix}
\end{align}

In a row major system, we store two scalars, the number of rows, $r=2$, and the number of columns, $c=3$. The data, $d$, is stored in memory as:

\begin{align}
\begin{array}{|c|c|c|c|c|c|}
\hline
2 & 3 & 5 & 7 & 11 & 13 \\
\hline
\end{array}
\end{align}

In a column major system, $r=2$ and $c=3$. However, $d$ is stored in memory as:

\begin{align}
\begin{array}{|c|c|c|c|c|c|}
\hline
2 & 7 & 3 & 11 & 5 & 13 \\
\hline
\end{array}
\end{align}
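The two layouts above can be illustrated with a small NumPy sketch (NumPy is used here purely as an illustration; `order="C"` corresponds to row major and `order="F"` to column major):

```python
import numpy as np

# The 2x3 matrix A from above.
A = np.array([[2, 3, 5],
              [7, 11, 13]])

# Row major (C order): the elements of each row are contiguous.
row_major = A.ravel(order="C")
print(row_major)   # [ 2  3  5  7 11 13]

# Column major (Fortran order): the elements of each column are contiguous.
col_major = A.ravel(order="F")
print(col_major)   # [ 2  7  3 11  5 13]
```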

If we want to pass (by reference on a CPU, or by an explicit copy to a GPU) the matrix $A$ from a row major system to a column major system, or vice versa, we first have to reorder the data in memory. If $A$ is large, or if we pass $A$ back and forth multiple times, this reordering is undesirable and inefficient. Is there a better solution? Consider the case where $r=3$, $c=2$ and the data in memory is:

\begin{align}
\begin{array}{|c|c|c|c|c|c|}
\hline
2 & 3 & 5 & 7 & 11 & 13 \\
\hline
\end{array}
\end{align}

If these values are interpreted by a column major system, it would reconstruct the matrix as:

\begin{align}
B =
\begin{bmatrix}
2 & 7 \\
3 & 11 \\
5 & 13 \\
\end{bmatrix}
= A^T
\end{align}

Therefore, if we have a matrix $A$ stored in a row major system, we can simply swap the values of $r$ and $c$ and pass it to a column major system. The column major system will interpret the data as $A^T$. Note that we neither reordered the data nor performed a transpose. The same works in reverse. Let's see how we can use this trick to our advantage for some common matrix operations.
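First, the reinterpretation itself can be checked with a small NumPy sketch (a sketch only; `reshape(..., order="F")` plays the role of the column major system reading the same buffer):

```python
import numpy as np

A = np.array([[2, 3, 5],
              [7, 11, 13]])        # 2x3, row major

buf = A.ravel(order="C")           # the raw data: [2, 3, 5, 7, 11, 13]

# Swap r and c (2x3 -> 3x2) and interpret the same buffer column major.
B = buf.reshape((3, 2), order="F")

print(B)
# [[ 2  7]
#  [ 3 11]
#  [ 5 13]]
assert (B == A.T).all()            # no data was moved; B is A transposed
```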

Suppose we want to perform the operation $C = A + B$, where $A$ and $B$ are matrices in a row major format, and we want the output $C$ to be in a row major format as well. Also suppose we have a fast matrix addition function, but it is in a column major BLAS library.

First, we pass $A$ and $B$ as $A^T$ and $B^T$ to the BLAS library. As described above, this requires neither data reordering nor an explicit transpose. Next, instead of computing $C = A + B$, we compute $C^T = A^T + B^T$ using the BLAS function. Finally, we pass $C^T$ from the column major library back to the row major system, which interprets the data as $C$. Again, this last step involves neither a data reordering nor a transpose. Hence, we were able to perform $C = A + B$ with the input and output in row major format while the computation was done by a column major library, all without any data reordering or transposes. Next, let's consider the more complicated example of matrix multiplication.
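Before moving on, the addition trick can be sketched in NumPy (an illustration only; the hypothetical `colmajor_add` stands in for a column major BLAS routine, simulated with `order="F"` views):

```python
import numpy as np

def colmajor_add(a_buf, b_buf, r, c):
    """Stand-in for a column major BLAS matrix add: interprets the raw
    buffers as r x c column major matrices, adds them, and returns the
    result's raw (column major) buffer."""
    A = a_buf.reshape((r, c), order="F")
    B = b_buf.reshape((r, c), order="F")
    return (A + B).ravel(order="F")

A = np.array([[1, 2, 3],
              [4, 5, 6]])          # 2x3, row major
B = np.array([[10, 20, 30],
              [40, 50, 60]])       # 2x3, row major

# Pass the row major buffers with r and c swapped: the column major
# routine sees A^T and B^T (3x2) and computes C^T = A^T + B^T.
c_buf = colmajor_add(A.ravel(order="C"), B.ravel(order="C"), r=3, c=2)

# Reinterpret the returned buffer as a 2x3 row major matrix: that is C.
C = c_buf.reshape((2, 3), order="C")
assert (C == A + B).all()
```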

Consider the linear layer in PyTorch [1]. Ignoring the bias component, the equations for forward and backward propagation are:

\begin{align}
Y &= X.W^T \\
dX &= dY.W \\
dW &= dY^TX
\end{align}

where, $X$ is the input, $W$ is the weight, $Y$ is the output, $dX$ is the input gradient, $dW$ is the weight gradient and $dY$ is the output gradient matrix. The $X$ and $W$ matrices are in row major format on the CPU. We would like $Y$, $dX$ and $dW$ to be in row major format as well, so that they can be inspected/modified on the CPU. However, we would like the computation to be performed on a GPU using cuBLAS (column major) because it is much faster.

During forward propagation, we copy $X$ and $W$ to GPU memory. As
described above, cuBLAS interprets these matrices as $X^T$ and $W^T$. Now,
instead of computing $Y = X.W^T$, we compute $Y^T = W.X^T$ on the
GPU. Since the GPU has $X^T$ and $W^T$, the first matrix is read in a
transposed fashion and the second matrix as is, which results in a **TN**
GEMM kernel. Note that no explicit transpose is performed.

For the weight gradient, instead of computing $dW = dY^TX$, we compute
$dW^T = X^T.dY$ on the GPU. The GPU has $X^T$, and $dY^T$ is produced by
the previous layer. Therefore, the second matrix is read in a transposed
fashion, which results in an **NT** GEMM kernel.

For the input gradient, instead of computing $dX = dY.W$, we compute
$dX^T = W^T.dY^T$ on the GPU. The GPU has $W^T$, and $dY^T$ is produced
by the previous layer. Therefore, both matrices are read as is, which
results in an **NN** GEMM kernel.

The matrices $Y^T$, $dX^T$ and $dW^T$, as seen by cuBLAS on the GPU, can be copied to the CPU, where they will be interpreted as $Y$, $dX$ and $dW$ by PyTorch, as desired.
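The whole scheme can be verified with a small NumPy simulation (a sketch only; `order="F"` reinterpretations stand in for the CPU-GPU copies, and plain `@` products stand in for the cuBLAS GEMM calls):

```python
import numpy as np

rng = np.random.default_rng(0)
N, in_f, out_f = 4, 3, 2                 # batch, in_features, out_features
X  = rng.standard_normal((N, in_f))      # row major on the "CPU"
W  = rng.standard_normal((out_f, in_f))
dY = rng.standard_normal((N, out_f))

def to_gpu(M):
    # Copying a row major buffer to a column major library: with r and c
    # swapped, the library sees M^T. Simulated by an order='F' reshape.
    return M.ravel(order="C").reshape(M.shape[::-1], order="F")

Xt, Wt, dYt = to_gpu(X), to_gpu(W), to_gpu(dY)

# "GPU" computations on the transposed views (TN, NT and NN GEMMs):
Yt  = Wt.T @ Xt          # Y^T  = W . X^T   (first operand transposed: TN)
dWt = Xt @ dYt.T         # dW^T = X^T . dY  (second operand transposed: NT)
dXt = Wt @ dYt           # dX^T = W^T . dY^T (both as is: NN)

def to_cpu(Mt):
    # Copying back reinterprets each buffer as the row major, untransposed
    # matrix, again with r and c swapped.
    return Mt.ravel(order="F").reshape(Mt.shape[::-1], order="C")

# The CPU sees exactly the row major results of the original equations.
assert np.allclose(to_cpu(Yt),  X @ W.T)     # Y  = X . W^T
assert np.allclose(to_cpu(dWt), dY.T @ X)    # dW = dY^T . X
assert np.allclose(to_cpu(dXt), dY @ W)      # dX = dY . W
```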