How does PyTorch (row major) interact with cuBLAS (column major)?

January 20, 2019

Row major format is most notably used by C/C++ while column major format is used by Fortran, MATLAB, R etc. External libraries and packages for scientific computing, statistics, BLAS routines (e.g. Nvidia cuBLAS), machine learning may either choose a particular storage format or support both formats. Nvidia cuBLAS library uses a column major format, but can be used with both C and Fortran code. The question then is, how does a programmer deal with both formats in the same application e.g. C/C++ (row major) on the CPU and cuBLAS (column major) on the GPU. Some software frameworks like PyTorch completely hide this complexity. However, it is interesting to understand what happens under the hood and how row major and column major software systems interact. In the above example, row major code runs on the CPU and column major code runs on the GPU. However, the device i.e. CPU or GPU is not relevant in this context. In the above example, one could substitute another column major BLAS library (e.g. MKL) running on the CPU instead of cuBLAS running on the GPU.

Hardware agnostic

A computers' memory is essentially a byte-addressable data storage device. How a matrix (an array) is laid out in memory e.g. row major, column major or other custom formats is purely a software (programming language, library developer) choice. C/C++ by default follow a row-major storage format but can be used to design, for example, a BLAS library which follows a column major format. The same memory can support and run equally well both a row major and a column major format. Due to the way caches and DRAMs are designed, accessing data elements sequentially from memory is better for performace. However, these performance considerations apply equally to any format chosen by the programmer.

$A_{\text{row major}} = A^T_{\text{column major}}$

Consider a $2\times3$ matrix,

\begin{align} A = \begin{bmatrix} 2 & 3 & 5 \\ 7 & 11 & 13 \\ \end{bmatrix} \end{align}

In a row major system, we save two scalars, #rows, $r=2$ and #columns, $c=3$. The data, $d$ is stored in memory as:

\begin{align} \begin{array}{|c|c|c|c|c|c|} \hline 2 & 3 & 5 & 7 & 11 & 13 \\ \hline \end{array} \end{align}

In a column major system, $r=2$ and $c=3$. However, $d$ is stored in memory as:

\begin{align} \begin{array}{|c|c|c|c|c|c|} \hline 2 & 7 & 3 & 11 & 5 & 13 \\ \hline \end{array} \end{align}

If we want to pass (by reference on a CPU or explicitly copy to a GPU) the matrix $A$ from a row major system to a column major system or vice versa, we will have to first reorder the data in memory. If $A$ is large or if we pass $A$ back and forth multiple times, the reordering operation is undesirable and inefficient. Is there a better solution? Consider the case where $r=3$, $c=2$ and the data in memory is:

\begin{align} \begin{array}{|c|c|c|c|c|c|} \hline 2 & 3 & 5 & 7 & 11 & 13 \\ \hline \end{array} \end{align}

If these values are interpreted by a column major system, it would reconstruct the matrix as:

\begin{align} B = \begin{bmatrix} 2 & 7 \\ 3 & 11 \\ 5 & 13 \\ \end{bmatrix} = A^T \end{align}

Therefore, if we have a matrix $A$ stored in a row major system, we can simply swap the values of $r$ and $c$, and pass it to a column major system. The column major system will interpret the data as $A^T$. Note that neither did we reorder data nor did we perform a transpose. The same works in reverse. Let's see how we can use this trick to our advantage for some common matrix operations.

Suppose we want to perform the operation, $C = A + B$, where $A$ and $B$ are matrices in a row major format. We want the output $C$ to be in a row major format as well. Also suppose, we have a fast matrix addition function but it is in a column major BLAS library.

First, we pass $A$ and $B$ as $A^T$ and $B^T$ to the BLAS library. As described above, this requires neither data reordering nor explicit transpose. Next, instead of computing $C = A + B$, we compute $C^T = A^T + B^T$ using the BLAS function. Finally, we pass $C^T$ from the column major library to the row major system which interprets the data as $C$. Again, for the last operation, we performed neither a data reordering nor a transpose. Hence, we were able to perform $C = A + B$ where the input and output data was in row major format but the computation was done by a column major library. We were able to do this without any data reordering or transposes. Next, let's consider a more complicated example of a matrix multiplication.

Matrix Multiplication

Consider the linear layer in PyTorch [1]. Ignoring the bias component, the equations for forward and backward propagation are:

\begin{align} Y &= X.W^T \\ dX &= dY.W \\ dW &= dY^TX \end{align}

where, $X$ is the input, $W$ is the weight, $Y$ is the output, $dX$ is the input gradient, $dW$ is the weight gradient and $dY$ is the output gradient matrix. The $X$ and $W$ matrices are in row major format on the CPU. We would like $Y$, $dX$ and $dW$ to be in row major format as well, so that they can be inspected/modified on the CPU. However, we would like the computation to be performed on a GPU using cuBLAS (column major) because it is much faster.

During forward propagation, we copy $X$ and $W$ to GPU memory. As described above, cuBLAS interprets these matrices as $X^T$ and $W^T$. Now, instead of computing $Y = X.W^T$, we compute $Y^T = W.X^T$ on the GPU. Since, the GPU has $X^T$ and $W^T$, the first matrix is read in a transposed fashion and the second matrix as is, which results in a TN GEMM kernel. Note that no explicit transpose is performed.

For the weight gradient, instead of computing $dW = dY^TX$, we compute $dW^T = X^T.dY$ on the GPU. The GPU has $X^T$, and $dY^T$ is produced by the previous layer. Therefore, the second matrix is read in a transposed fashion which results in a NT GEMM kernel.

For the input gradient, instead of computing $dX = dY.W$, we compute $dX^T = W^T.dY^T$ on the GPU. The GPU has $W^T$, and $dY^T$ is produced by the previous layer. Therefore, both matrices are read in as is which results in a NN GEMM kernel.

The matrices $Y^T$, $dX^T$ and $dW^T$, as seen by cuBLAS on the GPU can be copied to the CPU where they will be interpreted as $Y$, $dX$ and $dW$ by PyTorch as desired.