Back Propagation in Fully Connected Layer

December 29, 2017

During forward propagation, the output of a fully connected (FC) layer is given by

$$ Y = f(WX + B) $$

where $W$ is the weight matrix, $X$ is the input matrix, $B$ is the bias matrix, $f$ is the activation function (such as sigmoid, tanh, ReLU, etc.) and $Y$ is the output matrix. To understand back propagation in a FC layer it helps to separate it into two layers. The first layer calculates $Z$, the matrix of weighted sums of the inputs, and the second layer applies the activation function to each element of the $Z$ matrix, as follows.

\begin{align} Z &= WX + B \\ Y &= f(Z) \\ \end{align}
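As a concrete reference point for the rest of the article, here is a minimal NumPy sketch of this two-step forward pass. The shapes (rows index neurons, columns index batch samples) and the names `fc_forward` and `f` are illustrative assumptions, not something fixed by the derivation.

```python
import numpy as np

def fc_forward(W, X, B, f):
    """Sketch of the FC forward pass: W is (neurons, inputs), X is (inputs, batch),
    B is (neurons, batch) and f is an element-wise activation function."""
    Z = W @ X + B   # first layer: weighted sum of the inputs plus bias
    Y = f(Z)        # second layer: element-wise activation
    return Z, Y     # Z is kept because the backward pass needs it
```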

Let $L$ be the loss (synonymously, error or cost) of the network that we want to minimize. During backward propagation, given $\frac{\partial L}{\partial Y}$, we first calculate $\frac{\partial L}{\partial Z}$, i.e. back propagation through the second layer. Then, using $\frac{\partial L}{\partial Z}$, we calculate $\frac{\partial L}{\partial W}$ i.e. the weight gradient, $\frac{\partial L}{\partial X}$ i.e. the input gradient, and $\frac{\partial L}{\partial B}$ i.e. the bias gradient. The weight gradient and the bias gradient are used to adjust (learn) the values of the weight and bias matrices respectively, while the input gradient is propagated backward through the network.

Calculating the latter three gradient values amounts to back propagation through the first layer. For a FC layer, these three values can be calculated simultaneously (in parallel). In a nutshell, given $\frac{\partial L}{\partial Y}$ we want to calculate $\frac{\partial L}{\partial X}$, i.e. propagate the loss (error) gradient backward through the network. The value of $\frac{\partial L}{\partial X}$ becomes the $\frac{\partial L}{\partial Y}$ for the preceding layer, and so on.

Given $\frac{\partial L}{\partial Y}$, let us first calculate $\frac{\partial L}{\partial Z}$ for the cases where the activation function $f$ is sigmoid, tanh, and ReLU, respectively.

Sigmoid

The sigmoid activation is defined as

$$ Y = \frac{1}{1 + e^{-Z}} $$

Given $\frac{\partial L}{\partial Y}$, $\frac{\partial L}{\partial Z}$ can be calculated as follows,

\begin{align} \frac{\partial L}{\partial Z} &= \frac{\partial L}{\partial Y} \frac{\partial Y}{\partial Z} \\ &= \frac{\partial L}{\partial Y} \frac{e^{-Z}}{(1 + e^{-Z})^2} \\ &= \frac{\partial L}{\partial Y} \frac{1 + e^{-Z} - 1}{(1 + e^{-Z})^2} \\ &= \frac{\partial L}{\partial Y} \left( \frac{1}{1 + e^{-Z}} - \frac{1}{(1 + e^{-Z})^2} \right) \\ &= \frac{\partial L}{\partial Y} (Y - Y^2) \\ &= \frac{\partial L}{\partial Y} Y (1-Y) \\ \end{align}
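A minimal NumPy sketch of this step, assuming `dY` holds $\frac{\partial L}{\partial Y}$ and `Y` is the cached sigmoid output; the function name is illustrative.

```python
def sigmoid_backward(dY, Y):
    # dL/dZ = dL/dY * Y * (1 - Y), applied element-wise
    return dY * Y * (1.0 - Y)
```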

Hyperbolic Tangent (tanh)

The hyperbolic tangent is defined as

$$ Y = \frac{e^Z - e^{-Z}}{e^Z + e^{-Z}} $$

Given $\frac{\partial L}{\partial Y}$, $\frac{\partial L}{\partial Z}$ can be calculated as follows,

\begin{align} \frac{\partial L}{\partial Z} &= \frac{\partial L}{\partial Y} \frac{\partial Y}{\partial Z} \\ &= \frac{\partial L}{\partial Y} \frac{(e^Z + e^{-Z})(e^Z + e^{-Z}) - (e^Z - e^{-Z})(e^Z - e^{-Z})}{(e^Z + e^{-Z})^2} \\ &= \frac{\partial L}{\partial Y} \left( 1 - \frac{(e^Z - e^{-Z})^2}{(e^Z + e^{-Z})^2} \right) \\ &= \frac{\partial L}{\partial Y} (1 - Y^2) \end{align}
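The corresponding sketch for tanh, again assuming `dY` holds $\frac{\partial L}{\partial Y}$ and `Y` is the cached tanh output:

```python
def tanh_backward(dY, Y):
    # dL/dZ = dL/dY * (1 - Y^2), applied element-wise
    return dY * (1.0 - Y ** 2)
```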

Rectified Linear Unit (ReLU)

The ReLU is defined as

\begin{align} Y &= \max(0, Z) \\ &= \begin{cases} 0, & \text{if $Z \lt 0$} \\ Z, & \text{if $Z \geq 0$} \end{cases} \\ \end{align}

Given $\frac{\partial L}{\partial Y}$, $\frac{\partial L}{\partial Z}$ can be calculated as follows,

\begin{align} \frac{\partial L}{\partial Z} &= \frac{\partial L}{\partial Y} \frac{\partial Y}{\partial Z} \\ &= \frac{\partial L}{\partial Y} \begin{cases} 0, & \text{if $Z \lt 0$} \\ 1, & \text{if $Z \geq 0$} \end{cases} \\ \end{align}
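And for ReLU, assuming `dY` holds $\frac{\partial L}{\partial Y}$ and `Z` is the cached pre-activation; the gradient passes through wherever $Z \geq 0$ and is zeroed elsewhere:

```python
import numpy as np

def relu_backward(dY, Z):
    # dL/dZ = dL/dY where Z >= 0, and 0 otherwise
    return np.where(Z >= 0, dY, 0.0)
```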

Having obtained the value of $\frac{\partial L}{\partial Z}$ we can now calculate the values of the weight gradient, input gradient and the bias gradient. We will start with the simplest case of a Single Neuron, Single Input, Batch Size 1. Then we will derive the equations for Single Neuron, Multiple Inputs, Batch Size 1. We will then extend the analysis to a Multiple Neurons, Multiple Inputs, Batch Size 1 case. Finally, we will derive the equation for the most general case i.e. Multiple Neurons, Multiple Inputs, Batch Size N.

Single Neuron, Single Input, Batch Size 1

Consider the output of a single neuron with a single input, as shown below

where $x$, $w$, $b$, and $z$ are scalars. The output $z$ is given by

$$ z = wx + b $$

Given $\frac{\partial L}{\partial z}$ (as calculated above), the weight gradient, input gradient and the bias gradient can be calculated as follows.

\begin{align} \frac{\partial L}{\partial x} &= \frac{\partial L}{\partial z} \frac{\partial z}{\partial x} \\ &= \frac{\partial L}{\partial z} w \\ \\ \frac{\partial L}{\partial w} &= \frac{\partial L}{\partial z} \frac{\partial z}{\partial w} \\ &= \frac{\partial L}{\partial z} x \\ \\ \frac{\partial L}{\partial b} &= \frac{\partial L}{\partial z} \frac{\partial z}{\partial b} \\ &= \frac{\partial L}{\partial z} \\ \end{align}
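As a sketch, the scalar case translates directly into code (the function name is illustrative):

```python
def fc_backward_scalar(dz, w, x):
    dx = dz * w   # input gradient
    dw = dz * x   # weight gradient
    db = dz       # bias gradient
    return dx, dw, db
```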


Single Neuron, Multiple Inputs, Batch Size 1

Now, consider the output of a single neuron with multiple inputs, as shown below



where $x_1$, $x_2$, $x_3$, $w_1$, $w_2$, $w_3$, $b$ and $z$ are scalars. The output $z$ is given by

\begin{align} z &= w_1x_1 + w_2x_2 + w_3x_3 + b \\ z &= \begin{bmatrix} w_1\ w_2\ w_3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \end{bmatrix} + b \\ z &= WX + b \\ \end{align}

where $W$ is a row vector representing the weights of the neuron and $X$ is a column vector representing the inputs. Note that this choice is completely arbitrary. The equation could also have been written as $z = XW + b$, where $X$ is a row vector and $W$ is a column vector, or $z = WX^T + b$, where both $W$ and $X$ are row vectors and so on. For the discussion below, we will assume $W$ is a row vector and $X$ is a column vector.

Given $\frac{\partial L}{\partial z}$ (as calculated above), the weight gradient, input gradient and the bias gradient can be calculated as follows. Note that the weight gradient has the same dimensions as the weight matrix, the input gradient has the same dimensions as the input matrix and so on.

\begin{align} \frac{\partial L}{\partial X} = \begin{bmatrix} \frac{\partial L}{\partial x_1} \\ \frac{\partial L}{\partial x_2} \\ \frac{\partial L}{\partial x_3} \\ \end{bmatrix} &= \begin{bmatrix} \frac{\partial L}{\partial z}\frac{\partial z}{\partial x_1} \\ \frac{\partial L}{\partial z}\frac{\partial z}{\partial x_2} \\ \frac{\partial L}{\partial z}\frac{\partial z}{\partial x_3} \\ \end{bmatrix} \\ \\ &= \frac{\partial L}{\partial z} \begin{bmatrix} w_1 \\ w_2 \\ w_3 \\ \end{bmatrix} \\ \\ &= \frac{\partial L}{\partial z} W^T \\ \\ \frac{\partial L}{\partial W} = \begin{bmatrix} \frac{\partial L}{\partial w_1}\ \frac{\partial L}{\partial w_2}\ \frac{\partial L}{\partial w_3} \end{bmatrix} &= \begin{bmatrix} \frac{\partial L}{\partial z}\frac{\partial z}{\partial w_1} \ \frac{\partial L}{\partial z}\frac{\partial z}{\partial w_2} \ \frac{\partial L}{\partial z}\frac{\partial z}{\partial w_3} \end{bmatrix} \\ &= \frac{\partial L}{\partial z} \begin{bmatrix} x_1\ x_2\ x_3 \end{bmatrix} \\ &= \frac{\partial L}{\partial z} X^T \\ \\ \frac{\partial L}{\partial b} &= \frac{\partial L}{\partial z}\frac{\partial z}{\partial b} \\ &= \frac{\partial L}{\partial z} \\ \end{align}
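A NumPy sketch of this case, assuming `W` is a `(1, 3)` row vector, `X` is a `(3, 1)` column vector and `dz` is the scalar $\frac{\partial L}{\partial z}$; note how each gradient comes out with the same shape as the quantity it corresponds to:

```python
def fc_backward_single_neuron(dz, W, X):
    dX = dz * W.T   # (3, 1), same shape as X
    dW = dz * X.T   # (1, 3), same shape as W
    db = dz         # scalar, same shape as b
    return dX, dW, db
```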


Multiple Neurons, Multiple Inputs, Batch Size 1

Now consider the outputs of multiple neurons ($A$ and $B$) each with 3 inputs, as shown below

where $x_1$, $x_2$, $x_3$, $w_{A1}$, $w_{A2}$, $w_{A3}$, $w_{B1}$, $w_{B2}$, $w_{B3}$, $b_1$, $b_2$, $z_1$ and $z_2$ are scalars. The outputs $z_1$ and $z_2$ are given by,
\begin{align} z_1 &= w_{A1}x_1 + w_{A2}x_2 + w_{A3}x_3 + b_1 \\ z_2 &= w_{B1}x_1 + w_{B2}x_2 + w_{B3}x_3 + b_2 \\ \begin{bmatrix}z_1 \\ z_2\end{bmatrix} &= \begin{bmatrix} w_{A1}\ w_{A2}\ w_{A3} \\ w_{B1}\ w_{B2}\ w_{B3} \end{bmatrix} \begin{bmatrix} x_1\\x_2\\x_3 \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} \\ \\ Z &= WX + B \\ \end{align}

where $W$ is the weight matrix, $X$ is the input matrix, $B$ is the bias matrix and $Z$ is the output matrix.

Given $\frac{\partial L}{\partial Z}$, the weight gradient, input gradient and the bias gradient can be calculated as follows. Note that the weight gradient has the same dimensions as the weight matrix, the input gradient has the same dimensions as the input matrix and so on.

\begin{align} \frac{\partial L}{\partial X} = \begin{bmatrix} \frac{\partial L}{\partial x_1} \\ \frac{\partial L}{\partial x_2} \\ \frac{\partial L}{\partial x_3} \\ \end{bmatrix} &= \begin{bmatrix} \frac{\partial L}{\partial z_1}\frac{\partial z_1}{\partial x_1} + \frac{\partial L}{\partial z_2}\frac{\partial z_2}{\partial x_1} \\ \frac{\partial L}{\partial z_1}\frac{\partial z_1}{\partial x_2} + \frac{\partial L}{\partial z_2}\frac{\partial z_2}{\partial x_2} \\ \frac{\partial L}{\partial z_1}\frac{\partial z_1}{\partial x_3} + \frac{\partial L}{\partial z_2}\frac{\partial z_2}{\partial x_3} \\ \end{bmatrix} \\ \\ % % &= \begin{bmatrix} \frac{\partial L}{\partial z_1}w_{A1} + \frac{\partial L}{\partial z_2}w_{B1} \\ \frac{\partial L}{\partial z_1}w_{A2} + \frac{\partial L}{\partial z_2}w_{B2} \\ \frac{\partial L}{\partial z_1}w_{A3} + \frac{\partial L}{\partial z_2}w_{B3} \\ \end{bmatrix} \\ \\ % % &= \begin{bmatrix} w_{A1}\ w_{B1} \\ w_{A2}\ w_{B2} \\ w_{A3}\ w_{B3} \\ \end{bmatrix} \begin{bmatrix} \frac{\partial L}{\partial z_1} \\ \frac{\partial L}{\partial z_2} \end{bmatrix} \\ \\ % % &= W^T \frac{\partial L}{\partial Z}\\ \\ % % \frac{\partial L}{\partial W} = \begin{bmatrix} \frac{\partial L}{\partial w_{A1}}\ \frac{\partial L}{\partial w_{A2}}\ \frac{\partial L}{\partial w_{A3}} \\ \frac{\partial L}{\partial w_{B1}}\ \frac{\partial L}{\partial w_{B2}}\ \frac{\partial L}{\partial w_{B3}} \\ \end{bmatrix} &= \begin{bmatrix} \frac{\partial L}{\partial z_1}\frac{\partial z_1}{\partial w_{A1}}\ \frac{\partial L}{\partial z_1}\frac{\partial z_1}{\partial w_{A2}}\ \frac{\partial L}{\partial z_1}\frac{\partial z_1}{\partial w_{A3}}\\ \frac{\partial L}{\partial z_2}\frac{\partial z_2}{\partial w_{B1}}\ \frac{\partial L}{\partial z_2}\frac{\partial z_2}{\partial w_{B2}}\ \frac{\partial L}{\partial z_2}\frac{\partial z_2}{\partial w_{B3}}\\ \end{bmatrix} \\ \\ % % &= \begin{bmatrix} \frac{\partial L}{\partial z_1}x_1\ \frac{\partial L}{\partial z_1}x_2\ \frac{\partial L}{\partial z_1}x_3\\ \frac{\partial L}{\partial z_2}x_1\ \frac{\partial L}{\partial z_2}x_2\ \frac{\partial L}{\partial z_2}x_3\\ \end{bmatrix} \\ \\ % % &= \begin{bmatrix} \frac{\partial L}{\partial z_1} \\ \frac{\partial L}{\partial z_2} \\ \end{bmatrix} \begin{bmatrix} x_1\ x_2\ x_3 \end{bmatrix} \\ \\ % % &= \frac{\partial L}{\partial Z} X^T \\ \\ % % \frac{\partial L}{\partial B} = \begin{bmatrix} \frac{\partial L}{\partial b_1} \\ \frac{\partial L}{\partial b_2} \\ \end{bmatrix} &= \begin{bmatrix} \frac{\partial L}{\partial z_1}\frac{\partial z_1}{\partial b_1} \\ \frac{\partial L}{\partial z_2}\frac{\partial z_2}{\partial b_2} \\ \end{bmatrix} \\ \\ &= \begin{bmatrix} \frac{\partial L}{\partial z_1} \\ \frac{\partial L}{\partial z_2} \\ \end{bmatrix} \\ \\ & = \frac{\partial L}{\partial Z} \\ \end{align}
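In code, the batch-size-1 case is just three small matrix products; here `W` is `(2, 3)`, `X` is `(3, 1)` and `dZ` is the `(2, 1)` matrix $\frac{\partial L}{\partial Z}$ (names and shapes follow the example above):

```python
def fc_backward_batch1(dZ, W, X):
    dX = W.T @ dZ   # (3, 1) input gradient
    dW = dZ @ X.T   # (2, 3) weight gradient
    dB = dZ         # (2, 1) bias gradient
    return dX, dW, dB
```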


Multiple Neurons, Multiple Inputs, Batch Size N

Consider the same network (2 neurons, 3 inputs) as in the previous example but now with a batch size of 2 ($N =2$), such that two sets of inputs ($x_{11}$, $x_{12}$, $x_{13}$) and ($x_{21}$, $x_{22}$, $x_{23}$) produce two sets of outputs ($z_{11}$, $z_{12}$) and ($z_{21}$, $z_{22}$), respectively. The outputs are given by
\begin{align} z_{11} &= w_{A1}x_{11} + w_{A2}x_{12} + w_{A3}x_{13} + b_1 \\ z_{12} &= w_{B1}x_{11} + w_{B2}x_{12} + w_{B3}x_{13} + b_2 \\ \\ z_{21} &= w_{A1}x_{21} + w_{A2}x_{22} + w_{A3}x_{23} + b_1 \\ z_{22} &= w_{B1}x_{21} + w_{B2}x_{22} + w_{B3}x_{23} + b_2 \\ \begin{bmatrix} z_{11}\ z_{21}\\ z_{12}\ z_{22} \end{bmatrix} &= \begin{bmatrix} w_{A1}\ w_{A2}\ w_{A3} \\ w_{B1}\ w_{B2}\ w_{B3} \end{bmatrix} \begin{bmatrix} x_{11}\ x_{21}\\ x_{12}\ x_{22}\\ x_{13}\ x_{23} \end{bmatrix} + \begin{bmatrix} b_1\ b_1 \\ b_2\ b_2 \end{bmatrix} \\ \\ Z &= WX + B \\ \end{align}

where $W$ is the weight matrix, $X$ is the input matrix, $B$ is the bias matrix and $Z$ is the output matrix. Comparing the matrices with those in the previous section, you will notice that the $W$ matrix stays the same, while the $X$ and $Z$ matrices now have two columns, corresponding to $N=2$. The $B$ matrix also has two columns corresponding to $N=2$, but its entries are simply duplicated across the columns.

Given $\frac{\partial L}{\partial Z}$, the weight gradient, input gradient and the bias gradient can be calculated as follows.

\begin{align} \frac{\partial L}{\partial X} &= \begin{bmatrix} \frac{\partial L}{\partial x_{11}}\ \frac{\partial L}{\partial x_{21}} \\ \frac{\partial L}{\partial x_{12}}\ \frac{\partial L}{\partial x_{22}} \\ \frac{\partial L}{\partial x_{13}}\ \frac{\partial L}{\partial x_{23}} \\ \end{bmatrix} \\ &= \begin{bmatrix} \frac{\partial L}{\partial z_{11}}\frac{\partial z_{11}}{\partial x_{11}} + \frac{\partial L}{\partial z_{12}}\frac{\partial z_{12}}{\partial x_{11}} \ \ \frac{\partial L}{\partial z_{21}}\frac{\partial z_{21}}{\partial x_{21}} + \frac{\partial L}{\partial z_{22}}\frac{\partial z_{22}}{\partial x_{21}} \\ \frac{\partial L}{\partial z_{11}}\frac{\partial z_{11}}{\partial x_{12}} + \frac{\partial L}{\partial z_{12}}\frac{\partial z_{12}}{\partial x_{12}} \ \ \frac{\partial L}{\partial z_{21}}\frac{\partial z_{21}}{\partial x_{22}} + \frac{\partial L}{\partial z_{22}}\frac{\partial z_{22}}{\partial x_{22}} \\ \frac{\partial L}{\partial z_{11}}\frac{\partial z_{11}}{\partial x_{13}} + \frac{\partial L}{\partial z_{12}}\frac{\partial z_{12}}{\partial x_{13}} \ \ \frac{\partial L}{\partial z_{21}}\frac{\partial z_{21}}{\partial x_{23}} + \frac{\partial L}{\partial z_{22}}\frac{\partial z_{22}}{\partial x_{23}} \\ \end{bmatrix} \\ \\ % % &= \begin{bmatrix} \frac{\partial L}{\partial z_{11}}w_{A1} + \frac{\partial L}{\partial z_{12}}w_{B1} \ \ \frac{\partial L}{\partial z_{21}}w_{A1} + \frac{\partial L}{\partial z_{22}}w_{B1} \\ \frac{\partial L}{\partial z_{11}}w_{A2} + \frac{\partial L}{\partial z_{12}}w_{B2} \ \ \frac{\partial L}{\partial z_{21}}w_{A2} + \frac{\partial L}{\partial z_{22}}w_{B2} \\ \frac{\partial L}{\partial z_{11}}w_{A3} + \frac{\partial L}{\partial z_{12}}w_{B3} \ \ \frac{\partial L}{\partial z_{21}}w_{A3} + \frac{\partial L}{\partial z_{22}}w_{B3} \\ \end{bmatrix} \\ \\ % % &= \begin{bmatrix} w_{A1}\ w_{B1} \\ w_{A2}\ w_{B2} \\ w_{A3}\ w_{B3} \\ \end{bmatrix} \begin{bmatrix} \frac{\partial L}{\partial z_{11}} \ \frac{\partial L}{\partial z_{21}} \\ \frac{\partial L}{\partial z_{12}} \ \frac{\partial L}{\partial z_{22}} \end{bmatrix} \\ \\ % % &= W^T \frac{\partial L}{\partial Z}\\ \end{align}
\begin{align} \frac{\partial L}{\partial W} &= \begin{bmatrix} \frac{\partial L}{\partial w_{A1}}\ \frac{\partial L}{\partial w_{A2}}\ \frac{\partial L}{\partial w_{A3}} \\ \frac{\partial L}{\partial w_{B1}}\ \frac{\partial L}{\partial w_{B2}}\ \frac{\partial L}{\partial w_{B3}} \end{bmatrix} \\ &= \begin{bmatrix} \frac{\partial L}{\partial z_{11}}\frac{\partial z_{11}}{\partial w_{A1}} + \frac{\partial L}{\partial z_{21}}\frac{\partial z_{21}}{\partial w_{A1}} \ \ \frac{\partial L}{\partial z_{11}}\frac{\partial z_{11}}{\partial w_{A2}} + \frac{\partial L}{\partial z_{21}}\frac{\partial z_{21}}{\partial w_{A2}} \ \ \frac{\partial L}{\partial z_{11}}\frac{\partial z_{11}}{\partial w_{A3}} + \frac{\partial L}{\partial z_{21}}\frac{\partial z_{21}}{\partial w_{A3}} \\ \frac{\partial L}{\partial z_{12}}\frac{\partial z_{12}}{\partial w_{B1}} + \frac{\partial L}{\partial z_{22}}\frac{\partial z_{22}}{\partial w_{B1}} \ \ \frac{\partial L}{\partial z_{12}}\frac{\partial z_{12}}{\partial w_{B2}} + \frac{\partial L}{\partial z_{22}}\frac{\partial z_{22}}{\partial w_{B2}} \ \ \frac{\partial L}{\partial z_{12}}\frac{\partial z_{12}}{\partial w_{B3}} + \frac{\partial L}{\partial z_{22}}\frac{\partial z_{22}}{\partial w_{B3}}\\ \end{bmatrix} \\ \\ % % &= \begin{bmatrix} \frac{\partial L}{\partial z_{11}}x_{11} + \frac{\partial L}{\partial z_{21}}x_{21} \ \ \frac{\partial L}{\partial z_{11}}x_{12} + \frac{\partial L}{\partial z_{21}}x_{22} \ \ \frac{\partial L}{\partial z_{11}}x_{13} + \frac{\partial L}{\partial z_{21}}x_{23} \\ \frac{\partial L}{\partial z_{12}}x_{11} + \frac{\partial L}{\partial z_{22}}x_{21} \ \ \frac{\partial L}{\partial z_{12}}x_{12} + \frac{\partial L}{\partial z_{22}}x_{22} \ \ \frac{\partial L}{\partial z_{12}}x_{13} + \frac{\partial L}{\partial z_{22}}x_{23} \\ \end{bmatrix} \\ \\ % % &= \begin{bmatrix} \frac{\partial L}{\partial z_{11}}\ \frac{\partial L}{\partial z_{21}} \\ \frac{\partial L}{\partial z_{12}}\ \frac{\partial L}{\partial z_{22}} \\ \end{bmatrix} \begin{bmatrix} x_{11}\ x_{12}\ x_{13} \\ x_{21}\ x_{22}\ x_{23} \\ \end{bmatrix} \\ \\ % % &= \frac{\partial L}{\partial Z} X^T \\ \end{align}
\begin{align} \frac{\partial L}{\partial B} = \begin{bmatrix} \frac{\partial L}{\partial b_1} \\ \frac{\partial L}{\partial b_2} \\ \end{bmatrix} &= \begin{bmatrix} \frac{\partial L}{\partial z_{11}}\frac{\partial z_{11}}{\partial b_1} + \frac{\partial L}{\partial z_{21}}\frac{\partial z_{21}}{\partial b_1} \\ \frac{\partial L}{\partial z_{12}}\frac{\partial z_{12}}{\partial b_2} + \frac{\partial L}{\partial z_{22}}\frac{\partial z_{22}}{\partial b_2} \\ \end{bmatrix} \\ \\ &= \begin{bmatrix} \frac{\partial L}{\partial z_{11}} + \frac{\partial L}{\partial z_{21}} \\ \frac{\partial L}{\partial z_{12}} + \frac{\partial L}{\partial z_{22}} \\ \end{bmatrix} \\ \\ &= \begin{bmatrix} \frac{\partial L}{\partial z_{11}}\ \frac{\partial L}{\partial z_{21}} \\ \frac{\partial L}{\partial z_{12}}\ \frac{\partial L}{\partial z_{22}} \\ \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ \end{bmatrix} \\ & = \frac{\partial L}{\partial Z}\ \mathbf{1} \\ \end{align}

where $\mathbf{1}$ is a column vector of ones with one entry per sample in the batch. In other words, the bias gradient is $\frac{\partial L}{\partial Z}$ summed across the batch dimension.
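Putting it all together, here is a minimal NumPy sketch of the backward pass for the general case, with a quick finite-difference check of one weight gradient. The column-per-sample layout follows the convention used above; the loss in the check is simply the sum of all the entries of $Z$, so that $\frac{\partial L}{\partial Z}$ is a matrix of ones. All names are illustrative.

```python
import numpy as np

def fc_backward(dZ, W, X):
    """General case: W is (neurons, inputs), X is (inputs, N), dZ is (neurons, N)."""
    dX = W.T @ dZ                        # (inputs, N) input gradient
    dW = dZ @ X.T                        # (neurons, inputs) weight gradient
    dB = dZ @ np.ones((dZ.shape[1], 1))  # (neurons, 1): dL/dZ summed over the batch
    return dX, dW, dB

# Finite-difference check of dL/dw_{A2} with L = sum(Z).
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 2))
B = rng.standard_normal((2, 1))          # broadcast across the batch
dZ = np.ones((2, 2))                     # dL/dZ for L = sum(Z)
dX, dW, dB = fc_backward(dZ, W, X)

eps = 1e-6
W_pert = W.copy()
W_pert[0, 1] += eps
numeric = ((W_pert @ X + B).sum() - (W @ X + B).sum()) / eps
print(np.allclose(numeric, dW[0, 1]))    # True, up to floating-point error
```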