# Back Propagation in Strided Convolution Layer

###### January 10, 2018

Consider a valid strided convolution [1] between an input feature map $X$ and a filter (synonymously kernel or weights) $W$ to produce an output feature map $Y$. Let $U$ be the horizontal stride and $V$ be the vertical stride; in the following example, $U = 2$ and $V = 2$. With these strides there is some horizontal overlap but no vertical overlap between successive filter positions as we move the filter across the input feature map. The case where $U = V = 1$ has been discussed in a different post.

\begin{align} \begin{bmatrix} w_1 & w_2 & w_3 \\ w_4 & w_5 & w_6 \\ \end{bmatrix} \circledast \begin{bmatrix} x_1 & x_2 & x_3 & x_4 & x_5 \\ x_6 & x_7 & x_8 & x_9 & x_{10} \\ x_{11} & x_{12} & x_{13} & x_{14} & x_{15} \\ x_{16} & x_{17} & x_{18} & x_{19} & x_{20} \\ \end{bmatrix} &= \begin{bmatrix} y_1 & y_2 \\ y_3 & y_4 \\ \end{bmatrix} \\ \\ W \circledast X &= Y \\ \end{align}

During forward propagation, the outputs are given by

\begin{align} y_1 &= w_1x_1 + w_2x_2 + w_3x_3 + w_4x_6 + w_5x_7 + w_6x_8 \\ y_2 &= w_1x_3 + w_2x_4 + w_3x_5 + w_4x_8 + w_5x_9 + w_6x_{10} \\ y_3 &= w_1x_{11} + w_2x_{12} + w_3x_{13} + w_4x_{16} + w_5x_{17} + w_6x_{18} \\ y_4 &= w_1x_{13} + w_2x_{14} + w_3x_{15} + w_4x_{18} + w_5x_{19} + w_6x_{20} \\ \end{align}
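The forward pass above can be checked numerically. The sketch below (a minimal NumPy implementation; the function name and the concrete values of $X$ and $W$ are illustrative assumptions, not part of the derivation) computes the strided valid convolution as the element-wise products in the equations:

```python
import numpy as np

def conv2d_valid_strided(X, W, U=2, V=2):
    """Valid strided cross-correlation of filter W over input X,
    with horizontal stride U and vertical stride V, matching the
    element-wise products in the forward equations above."""
    kh, kw = W.shape
    out_h = (X.shape[0] - kh) // V + 1
    out_w = (X.shape[1] - kw) // U + 1
    Y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Filter window anchored at row i*V, column j*U
            Y[i, j] = np.sum(W * X[i * V:i * V + kh, j * U:j * U + kw])
    return Y

X = np.arange(1, 21, dtype=float).reshape(4, 5)  # x_1 ... x_20
W = np.arange(1, 7, dtype=float).reshape(2, 3)   # w_1 ... w_6
Y = conv2d_valid_strided(X, W)                   # 2x2 output y_1 ... y_4
```

With these values, $y_1 = 1\cdot1 + 2\cdot2 + 3\cdot3 + 4\cdot6 + 5\cdot7 + 6\cdot8 = 121$, in agreement with the first forward equation.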
Let $L$ be the loss (synonymously error or cost) of the network we want to minimize. During backward propagation, given $\frac{\partial L}{\partial Y}$ we calculate $\frac{\partial L}{\partial W}$, i.e. the weight gradient, and $\frac{\partial L}{\partial X}$, i.e. the input gradient. The weight gradient is used to adjust (learn) the values of the weight matrix, while the input gradient is propagated backwards through the network.

Using the equations above,

\begin{align} \frac{\partial L}{\partial W} &= \begin{bmatrix} \frac{\partial L}{\partial w_1} & \frac{\partial L}{\partial w_2} & \frac{\partial L}{\partial w_3} \\ \frac{\partial L}{\partial w_4} & \frac{\partial L}{\partial w_5} & \frac{\partial L}{\partial w_6} \\ \end{bmatrix} \\ &= \begin{bmatrix} (\frac{\partial L}{\partial y_1}\frac{\partial y_1}{\partial w_1} + \frac{\partial L}{\partial y_2}\frac{\partial y_2}{\partial w_1} + \frac{\partial L}{\partial y_3}\frac{\partial y_3}{\partial w_1} + \frac{\partial L}{\partial y_4}\frac{\partial y_4}{\partial w_1}) & (\frac{\partial L}{\partial y_1}\frac{\partial y_1}{\partial w_2} + \frac{\partial L}{\partial y_2}\frac{\partial y_2}{\partial w_2} + \frac{\partial L}{\partial y_3}\frac{\partial y_3}{\partial w_2} + \frac{\partial L}{\partial y_4}\frac{\partial y_4}{\partial w_2}) & (\frac{\partial L}{\partial y_1}\frac{\partial y_1}{\partial w_3} + \frac{\partial L}{\partial y_2}\frac{\partial y_2}{\partial w_3} + \frac{\partial L}{\partial y_3}\frac{\partial y_3}{\partial w_3} + \frac{\partial L}{\partial y_4}\frac{\partial y_4}{\partial w_3}) \\ (\frac{\partial L}{\partial y_1}\frac{\partial y_1}{\partial w_4} + \frac{\partial L}{\partial y_2}\frac{\partial y_2}{\partial w_4} + \frac{\partial L}{\partial y_3}\frac{\partial y_3}{\partial w_4} + \frac{\partial L}{\partial y_4}\frac{\partial y_4}{\partial w_4}) & (\frac{\partial L}{\partial y_1}\frac{\partial y_1}{\partial w_5} + \frac{\partial L}{\partial y_2}\frac{\partial y_2}{\partial w_5} + \frac{\partial L}{\partial y_3}\frac{\partial y_3}{\partial w_5} + \frac{\partial L}{\partial y_4}\frac{\partial y_4}{\partial w_5}) & (\frac{\partial L}{\partial y_1}\frac{\partial y_1}{\partial w_6} + \frac{\partial L}{\partial y_2}\frac{\partial y_2}{\partial w_6} + \frac{\partial L}{\partial y_3}\frac{\partial y_3}{\partial w_6} + \frac{\partial L}{\partial y_4}\frac{\partial y_4}{\partial w_6}) \\ \end{bmatrix} \\ &= \begin{bmatrix} (\frac{\partial L}{\partial y_1}x_1 
+ \frac{\partial L}{\partial y_2}x_3 + \frac{\partial L}{\partial y_3}x_{11} + \frac{\partial L}{\partial y_4}x_{13}) & (\frac{\partial L}{\partial y_1}x_2 + \frac{\partial L}{\partial y_2}x_4 + \frac{\partial L}{\partial y_3}x_{12} + \frac{\partial L}{\partial y_4}x_{14}) & (\frac{\partial L}{\partial y_1}x_3 + \frac{\partial L}{\partial y_2}x_5 + \frac{\partial L}{\partial y_3}x_{13} + \frac{\partial L}{\partial y_4}x_{15}) \\ (\frac{\partial L}{\partial y_1}x_6 + \frac{\partial L}{\partial y_2}x_8 + \frac{\partial L}{\partial y_3}x_{16} + \frac{\partial L}{\partial y_4}x_{18}) & (\frac{\partial L}{\partial y_1}x_7 + \frac{\partial L}{\partial y_2}x_9 + \frac{\partial L}{\partial y_3}x_{17} + \frac{\partial L}{\partial y_4}x_{19}) & (\frac{\partial L}{\partial y_1}x_8 + \frac{\partial L}{\partial y_2}x_{10} + \frac{\partial L}{\partial y_3}x_{18} + \frac{\partial L}{\partial y_4}x_{20}) \\ \end{bmatrix} \\ &= \begin{bmatrix} x_1 & x_2 & x_3 & x_4 & x_5 \\ x_6 & x_7 & x_8 & x_9 & x_{10} \\ x_{11} & x_{12} & x_{13} & x_{14} & x_{15} \\ x_{16} & x_{17} & x_{18} & x_{19} & x_{20} \\ \end{bmatrix} \circledast \begin{bmatrix} \frac{\partial L}{\partial y_1} & 0 & \frac{\partial L}{\partial y_2} \\ 0 & 0 & 0 \\ \frac{\partial L}{\partial y_3} & 0 & \frac{\partial L}{\partial y_4} \\ \end{bmatrix} \end{align}

Note that,

\begin{align} \begin{bmatrix} \frac{\partial L}{\partial y_1} & 0 & \frac{\partial L}{\partial y_2} \\ 0 & 0 & 0 \\ \frac{\partial L}{\partial y_3} & 0 & \frac{\partial L}{\partial y_4} \\ \end{bmatrix} &= \begin{bmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 1 \\ \end{bmatrix} \begin{bmatrix} \frac{\partial L}{\partial y_1} & \frac{\partial L}{\partial y_2} \\ \frac{\partial L}{\partial y_3} & \frac{\partial L}{\partial y_4} \\ \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ \end{bmatrix}\\ &= D_V \frac{\partial L}{\partial Y} D_U \\ \end{align}

where $D_V$ dilates the matrix vertically, for lack of a better term, and $D_U$ dilates it horizontally: $D_V$ inserts $V-1$ rows of zeros between consecutive rows and $D_U$ inserts $U-1$ columns of zeros between consecutive columns. Therefore, $\frac{\partial L}{\partial W}$ is equal to a valid convolution between $X$ and the horizontally and vertically dilated $\frac{\partial L}{\partial Y}$ i.e.
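The dilation operators can be sketched concretely. Below is a small NumPy illustration (the helper name and the example gradient values are assumptions) that builds $D_V$ and $D_U$ for an $\text{out\_h} \times \text{out\_w}$ output and verifies the matrix product on the example above:

```python
import numpy as np

def dilation_matrices(rows, cols, U, V):
    """D_V (left factor) inserts V-1 zero rows between consecutive rows;
    D_U (right factor) inserts U-1 zero columns between consecutive columns."""
    D_V = np.zeros(((rows - 1) * V + 1, rows))
    D_V[np.arange(rows) * V, np.arange(rows)] = 1.0
    D_U = np.zeros((cols, (cols - 1) * U + 1))
    D_U[np.arange(cols), np.arange(cols) * U] = 1.0
    return D_V, D_U

dLdY = np.array([[1., 2.],
                 [3., 4.]])      # stand-in values for dL/dy_1 ... dL/dy_4
D_V, D_U = dilation_matrices(2, 2, U=2, V=2)
dilated = D_V @ dLdY @ D_U       # 3x3 matrix with interleaved zeros
```

For $U = V = 2$ this reproduces the $3 \times 3$ dilated matrix shown above, with a single row and column of zeros interleaved.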

\begin{align} \frac{\partial L}{\partial W} = X \circledast (D_V \frac{\partial L}{\partial Y} D_U) \end{align}
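This identity can be verified numerically. The sketch below (assuming NumPy and an arbitrary stand-in upstream gradient) builds the dilated $\frac{\partial L}{\partial Y}$ directly with slicing and convolves $X$ with it at unit stride:

```python
import numpy as np

def conv2d_valid(A, K):
    """Unit-stride valid cross-correlation of kernel K over A."""
    kh, kw = K.shape
    out = np.zeros((A.shape[0] - kh + 1, A.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(K * A[i:i + kh, j:j + kw])
    return out

X = np.arange(1, 21, dtype=float).reshape(4, 5)  # x_1 ... x_20
dLdY = np.array([[1., 2.],
                 [3., 4.]])                      # stand-in upstream gradient
dilated = np.zeros((3, 3))
dilated[::2, ::2] = dLdY                         # D_V (dL/dY) D_U for U = V = 2
dLdW = conv2d_valid(X, dilated)                  # dL/dW via the identity above
```

Each entry matches the direct chain-rule sums, e.g. $\frac{\partial L}{\partial w_1} = 1\cdot x_1 + 2\cdot x_3 + 3\cdot x_{11} + 4\cdot x_{13} = 1 + 6 + 33 + 52 = 92$.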

Similarly, expanding each $\frac{\partial L}{\partial x_i}$ with the chain rule over the forward equations and collecting terms, the input gradient can be written as a convolution,

\begin{align} \frac{\partial L}{\partial X} &= \begin{bmatrix} w_6 & w_5 & w_4 \\ w_3 & w_2 & w_1 \\ \end{bmatrix} \circledast \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & \frac{\partial L}{\partial y_1} & 0 & \frac{\partial L}{\partial y_2} & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & \frac{\partial L}{\partial y_3} & 0 & \frac{\partial L}{\partial y_4} & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \end{bmatrix} \\ \begin{bmatrix} w_6 & w_5 & w_4 \\ w_3 & w_2 & w_1 \\ \end{bmatrix} &= J_2 W J_3 \end{align}

where $J_2$ and $J_3$ are exchange matrices [2]: $J_2$ reverses the rows of the matrix and $J_3$ reverses the columns of the matrix. The second matrix is the dilated $\frac{\partial L}{\partial Y}$, appropriately padded with zeros to prepare it for a full convolution [1]. Therefore, $\frac{\partial L}{\partial X}$ is equal to a full convolution between the row and column reversed $W$ and the dilated (horizontally by $U$ and vertically by $V$) $\frac{\partial L}{\partial Y}$ i.e.
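The input-gradient computation can also be sketched numerically (same assumptions as the earlier examples: NumPy, a stand-in upstream gradient). Reversing $W$ in both axes plays the role of $J_2 W J_3$, and padding the dilated $\frac{\partial L}{\partial Y}$ by the filter size minus one turns the valid convolution into a full one:

```python
import numpy as np

def conv2d_valid(A, K):
    """Unit-stride valid cross-correlation of kernel K over A."""
    kh, kw = K.shape
    out = np.zeros((A.shape[0] - kh + 1, A.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(K * A[i:i + kh, j:j + kw])
    return out

W = np.arange(1, 7, dtype=float).reshape(2, 3)   # w_1 ... w_6
dLdY = np.array([[1., 2.],
                 [3., 4.]])                      # stand-in upstream gradient
dilated = np.zeros((3, 3))
dilated[::2, ::2] = dLdY                         # dilate for U = V = 2
W_rev = W[::-1, ::-1]                            # J_2 W J_3: reverse rows and columns
padded = np.pad(dilated, ((1, 1), (2, 2)))       # kh-1 rows, kw-1 columns of zeros
dLdX = conv2d_valid(padded, W_rev)               # full convolution -> 4x5, shape of X
```

As a cross-check, differentiating the forward equations directly gives the same values, e.g. $\frac{\partial L}{\partial x_8} = w_6\frac{\partial L}{\partial y_1} + w_4\frac{\partial L}{\partial y_2} = 6 + 8 = 14$.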