Back Propagation in Fully Connected Layer
December 29, 2017
During forward propagation, the output of a fully connected (FC) layer is given by
$$
Y = f(WX + B)
$$
where $W$ is the weight matrix, $X$ is the input matrix, $B$ is
the bias matrix, $f$ is an activation function (such as sigmoid,
tanh, or ReLU), and $Y$ is the output matrix. To understand
back propagation in an FC layer, it helps to separate it into two
layers. The first layer calculates $Z$, the matrix of weighted sums of the
inputs, and the second layer applies the activation function to each element of $Z$, as follows.
\begin{align}
Z &= WX + B \\
Y &= f(Z) \\
\end{align}
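These two steps can be sketched in NumPy; the sizes, the random values, and the choice of sigmoid below are illustrative assumptions:

```python
import numpy as np

# Illustrative sizes: 2 neurons, 3 inputs, batch size 1 (assumed for this sketch)
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))   # weight matrix
X = rng.standard_normal((3, 1))   # input column vector
B = rng.standard_normal((2, 1))   # bias

Z = W @ X + B                     # first layer: weighted sums of the inputs
Y = 1.0 / (1.0 + np.exp(-Z))      # second layer: elementwise activation (sigmoid here)
```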
Let $L$ be the loss (also called the error or cost) of the network
that we want to minimize. During backward propagation, given
$\frac{\partial L}{\partial Y}$, we first calculate $\frac{\partial
L}{\partial Z}$, i.e., we back propagate through the second layer.
Then, using $\frac{\partial L}{\partial Z}$, we calculate
$\frac{\partial L}{\partial W}$ (the weight gradient),
$\frac{\partial L}{\partial X}$ (the input gradient), and
$\frac{\partial L}{\partial B}$ (the bias gradient). The weight
gradient and the bias gradient are used to adjust (learn) the values
of the weight and bias matrices respectively, while the input
gradient is propagated backward through the network.
Calculating the latter three gradients amounts to back
propagation through the first layer. For an FC layer, these three
values can be calculated simultaneously (in parallel). In a
nutshell, given $\frac{\partial L}{\partial Y}$ we want to calculate
$\frac{\partial L}{\partial X}$, i.e., propagate the loss (error)
gradient backward through the network. The value of $\frac{\partial
L}{\partial X}$ becomes the $\frac{\partial L}{\partial Y}$ of the
preceding layer, and so on.
Given $\frac{\partial L}{\partial Y}$, let us first calculate
$\frac{\partial L}{\partial Z}$ for the cases where the activation
function $f$ is sigmoid, tanh, or ReLU.
Sigmoid
The sigmoid activation is defined as
$$
Y = \frac{1}{1 + e^{-Z}}
$$
Given $\frac{\partial L}{\partial Y}$, $\frac{\partial L}{\partial Z}$ can be
calculated as follows,
\begin{align}
\frac{\partial L}{\partial Z}
&= \frac{\partial L}{\partial Y} \frac{\partial Y}{\partial Z} \\
&= \frac{\partial L}{\partial Y} \frac{e^{-Z}}{(1 + e^{-Z})^2} \\
&= \frac{\partial L}{\partial Y} \frac{1 + e^{-Z} - 1}{(1 + e^{-Z})^2} \\
&= \frac{\partial L}{\partial Y} (Y - Y^2) \\
&= \frac{\partial L}{\partial Y} Y (1-Y) \\
\end{align}
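Since $Y$ is already available from the forward pass, this backward step needs no extra exponentials. A minimal NumPy sketch (function and variable names are assumptions):

```python
import numpy as np

def sigmoid_backward(dL_dY, Y):
    """Given dL/dY and the forward output Y = sigmoid(Z), return dL/dZ = dL/dY * Y * (1 - Y)."""
    return dL_dY * Y * (1.0 - Y)

Z = np.array([[0.0], [2.0]])
Y = 1.0 / (1.0 + np.exp(-Z))
dL_dZ = sigmoid_backward(np.ones_like(Y), Y)  # sigmoid'(0) = 0.25
```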
Hyperbolic Tangent (tanh)
The hyperbolic tangent is defined as
$$
Y = \frac{e^Z - e^{-Z}}{e^Z + e^{-Z}}
$$
Given $\frac{\partial L}{\partial Y}$, $\frac{\partial L}{\partial Z}$ can be
calculated as follows,
\begin{align}
\frac{\partial L}{\partial Z}
&= \frac{\partial L}{\partial Y} \frac{\partial Y}{\partial Z} \\
&= \frac{\partial L}{\partial Y} \frac{(e^Z + e^{-Z})(e^Z + e^{-Z}) - (e^Z - e^{-Z})(e^Z - e^{-Z})}{(e^Z + e^{-Z})^2} \\
&= \frac{\partial L}{\partial Y} (1 - Y^2)
\end{align}
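As with sigmoid, the tanh backward step reuses the forward output $Y$. A minimal NumPy sketch (names are assumptions):

```python
import numpy as np

def tanh_backward(dL_dY, Y):
    """Given dL/dY and the forward output Y = tanh(Z), return dL/dZ = dL/dY * (1 - Y^2)."""
    return dL_dY * (1.0 - Y ** 2)

Z = np.array([[0.0], [1.0]])
Y = np.tanh(Z)
dL_dZ = tanh_backward(np.ones_like(Y), Y)  # tanh'(0) = 1
```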
Rectified Linear Unit (ReLU)
The ReLU is defined as
\begin{align}
Y &= \max(0, Z) \\
&= \begin{cases}
0, & \text{if $Z < 0$} \\
Z, & \text{if $Z \geq 0$}
\end{cases} \\
\end{align}
Given $\frac{\partial L}{\partial Y}$, $\frac{\partial L}{\partial Z}$ can be
calculated as follows,
\begin{align}
\frac{\partial L}{\partial Z}
&= \frac{\partial L}{\partial Y} \frac{\partial Y}{\partial Z} \\
&= \frac{\partial L}{\partial Y} \begin{cases}
0, & \text{if $Z < 0$} \\
1, & \text{if $Z \geq 0$}
\end{cases} \\
\end{align}
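The ReLU backward step is just a mask on the sign of $Z$. A minimal NumPy sketch (names are assumptions); note that the derivative at $Z = 0$ is taken as 1, following the case definition above, since $\max(0, Z)$ is not differentiable at exactly 0 and this is a common convention:

```python
import numpy as np

def relu_backward(dL_dY, Z):
    """Given dL/dY and the pre-activation Z, return dL/dZ for Y = max(0, Z)."""
    return dL_dY * (Z >= 0)  # boolean mask promotes to 0.0 / 1.0

Z = np.array([[-1.5], [0.0], [2.0]])
dL_dZ = relu_backward(np.ones_like(Z), Z)
```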
Having obtained the value of $\frac{\partial L}{\partial Z}$, we can now calculate the
values of the weight gradient, the input gradient, and the bias gradient. We will start
with the simplest case of a Single Neuron, Single Input, Batch Size 1. Then we
will derive the equations for Single Neuron, Multiple Inputs, Batch Size 1.
We will then extend the analysis to the Multiple Neurons, Multiple Inputs, Batch Size 1
case. Finally, we will derive the equations for the most general case,
Multiple Neurons, Multiple Inputs, Batch Size $N$.
Single Neuron, Single Input, Batch Size 1
Consider the output of a single neuron with a single input,
where $x$, $w$, $b$, and $z$ are scalars. The output $z$ is given by
$$
z = wx + b
$$
Given $\frac{\partial L}{\partial z}$ (as calculated above), the weight gradient, input
gradient and the bias gradient can be calculated as follows.
\begin{align}
\frac{\partial L}{\partial x}
&= \frac{\partial L}{\partial z} \frac{\partial z}{\partial x} \\
&= \frac{\partial L}{\partial z} w \\
\\
\frac{\partial L}{\partial w}
&= \frac{\partial L}{\partial z} \frac{\partial z}{\partial w} \\
&= \frac{\partial L}{\partial z} x \\
\\
\frac{\partial L}{\partial b}
&= \frac{\partial L}{\partial z} \frac{\partial z}{\partial b} \\
&= \frac{\partial L}{\partial z} \\
\end{align}
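These three scalar formulas can be verified with a finite-difference check; the loss $L = z^2$ below is an arbitrary assumption, chosen only so that $\frac{\partial L}{\partial z}$ is easy to write down:

```python
# Scalar case: z = w*x + b, with an assumed loss L = z**2 for checking.
w, x, b = 2.0, 3.0, 0.5
z = w * x + b            # z = 6.5
dL_dz = 2.0 * z          # dL/dz for L = z**2

dL_dx = dL_dz * w        # input gradient
dL_dw = dL_dz * x        # weight gradient
dL_db = dL_dz            # bias gradient

# Finite-difference check of the weight gradient
h = 1e-6
numeric = (((w + h) * x + b) ** 2 - ((w - h) * x + b) ** 2) / (2 * h)
```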
Single Neuron, Multiple Inputs, Batch Size 1
Now consider the output of a single neuron with multiple inputs,
where $x_1$, $x_2$, $x_3$, $w_1$, $w_2$, $w_3$, $b$ and $z$ are scalars. The output $z$ is given by
\begin{align}
z &= w_1x_1 + w_2x_2 + w_3x_3 + b \\
z &= \begin{bmatrix} w_1\ w_2\ w_3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \end{bmatrix} + b \\
z &= WX + b \\
\end{align}
where $W$ is a row vector representing the weights of the neuron
and $X$ is a column vector representing the inputs. Note that this
choice is completely arbitrary. The equation could also have been
written as $z = XW + b$, where $X$ is a row vector and $W$ is a
column vector, or $z = WX^T + b$, where both $W$ and $X$ are row
vectors and so on. For the discussion below, we will assume $W$ is
a row vector and $X$ is a column vector.
Given $\frac{\partial L}{\partial z}$ (as calculated above),
the weight gradient, input gradient and the bias gradient can be
calculated as follows. Note that the weight gradient has the same
dimensions as the weight matrix, the input gradient has the same
dimensions as the input matrix and so on.
\begin{align}
\frac{\partial L}{\partial X} =
\begin{bmatrix}
\frac{\partial L}{\partial x_1} \\
\frac{\partial L}{\partial x_2} \\
\frac{\partial L}{\partial x_3} \\
\end{bmatrix} &=
\begin{bmatrix}
\frac{\partial L}{\partial z}\frac{\partial z}{\partial x_1} \\
\frac{\partial L}{\partial z}\frac{\partial z}{\partial x_2} \\
\frac{\partial L}{\partial z}\frac{\partial z}{\partial x_3} \\
\end{bmatrix} \\
\\
&= \frac{\partial L}{\partial z} \begin{bmatrix} w_1 \\ w_2 \\ w_3 \\ \end{bmatrix} \\
\\
&= \frac{\partial L}{\partial z} W^T \\
\\
\frac{\partial L}{\partial W} =
\begin{bmatrix}
\frac{\partial L}{\partial w_1}\ \frac{\partial L}{\partial w_2}\ \frac{\partial L}{\partial w_3}
\end{bmatrix} &=
\begin{bmatrix}
\frac{\partial L}{\partial z}\frac{\partial z}{\partial w_1} \
\frac{\partial L}{\partial z}\frac{\partial z}{\partial w_2} \
\frac{\partial L}{\partial z}\frac{\partial z}{\partial w_3}
\end{bmatrix} \\
&= \frac{\partial L}{\partial z} \begin{bmatrix} x_1\ x_2\ x_3 \end{bmatrix} \\
&= \frac{\partial L}{\partial z} X^T \\
\\
\frac{\partial L}{\partial b}
&= \frac{\partial L}{\partial z}\frac{\partial z}{\partial b} \\
&= \frac{\partial L}{\partial z} \\
\end{align}
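In NumPy, with $W$ a row vector and $X$ a column vector as above (the numbers are arbitrary assumptions):

```python
import numpy as np

# Single neuron, 3 inputs, batch size 1
W = np.array([[0.1, -0.2, 0.3]])      # row vector, shape (1, 3)
X = np.array([[1.0], [2.0], [3.0]])   # column vector, shape (3, 1)
dL_dz = 2.0                           # assumed upstream gradient (scalar)

dL_dX = dL_dz * W.T   # shape (3, 1), same as X
dL_dW = dL_dz * X.T   # shape (1, 3), same as W
dL_db = dL_dz         # scalar, same as b
```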
Multiple Neurons, Multiple Inputs, Batch Size 1
Now consider the outputs of two neurons ($A$ and $B$), each with
three inputs,
where $x_1$, $x_2$, $x_3$, $w_{A1}$, $w_{A2}$, $w_{A3}$, $w_{B1}$,
$w_{B2}$, $w_{B3}$, $b_1$, $b_2$, $z_1$ and $z_2$ are scalars. The
outputs $z_1$ and $z_2$ are given by,
\begin{align}
z_1 &= w_{A1}x_1 + w_{A2}x_2 + w_{A3}x_3 + b_1 \\
z_2 &= w_{B1}x_1 + w_{B2}x_2 + w_{B3}x_3 + b_2 \\
\begin{bmatrix}z_1 \\ z_2\end{bmatrix} &=
\begin{bmatrix}
w_{A1}\ w_{A2}\ w_{A3} \\
w_{B1}\ w_{B2}\ w_{B3}
\end{bmatrix}
\begin{bmatrix}
x_1\\x_2\\x_3
\end{bmatrix} +
\begin{bmatrix}
b_1 \\ b_2
\end{bmatrix} \\
\\
Z &= WX + B \\
\end{align}
where $W$ is the weight matrix, $X$ is the input matrix, $B$ is
the bias matrix and $Z$ is the output matrix.
Given $\frac{\partial L}{\partial Z}$, the weight gradient, input
gradient and the bias gradient can be calculated as follows. Note
that the weight gradient has the same dimensions as the weight
matrix, the input gradient has the same dimensions as the input
matrix and so on.
\begin{align}
\frac{\partial L}{\partial X} =
\begin{bmatrix}
\frac{\partial L}{\partial x_1} \\
\frac{\partial L}{\partial x_2} \\
\frac{\partial L}{\partial x_3} \\
\end{bmatrix} &=
\begin{bmatrix}
\frac{\partial L}{\partial z_1}\frac{\partial z_1}{\partial x_1} + \frac{\partial L}{\partial z_2}\frac{\partial z_2}{\partial x_1} \\
\frac{\partial L}{\partial z_1}\frac{\partial z_1}{\partial x_2} + \frac{\partial L}{\partial z_2}\frac{\partial z_2}{\partial x_2} \\
\frac{\partial L}{\partial z_1}\frac{\partial z_1}{\partial x_3} + \frac{\partial L}{\partial z_2}\frac{\partial z_2}{\partial x_3} \\
\end{bmatrix} \\
\\
%
%
&=
\begin{bmatrix}
\frac{\partial L}{\partial z_1}w_{A1} + \frac{\partial L}{\partial z_2}w_{B1} \\
\frac{\partial L}{\partial z_1}w_{A2} + \frac{\partial L}{\partial z_2}w_{B2} \\
\frac{\partial L}{\partial z_1}w_{A3} + \frac{\partial L}{\partial z_2}w_{B3} \\
\end{bmatrix} \\
\\
%
%
&=
\begin{bmatrix}
w_{A1}\ w_{B1} \\
w_{A2}\ w_{B2} \\
w_{A3}\ w_{B3} \\
\end{bmatrix}
\begin{bmatrix}
\frac{\partial L}{\partial z_1} \\ \frac{\partial L}{\partial z_2}
\end{bmatrix} \\
\\
%
%
&= W^T \frac{\partial L}{\partial Z}\\
\\
%
%
\frac{\partial L}{\partial W} =
\begin{bmatrix}
\frac{\partial L}{\partial w_{A1}}\ \frac{\partial L}{\partial w_{A2}}\ \frac{\partial L}{\partial w_{A3}} \\
\frac{\partial L}{\partial w_{B1}}\ \frac{\partial L}{\partial w_{B2}}\ \frac{\partial L}{\partial w_{B3}} \\
\end{bmatrix} &=
\begin{bmatrix}
\frac{\partial L}{\partial z_1}\frac{\partial z_1}{\partial w_{A1}}\
\frac{\partial L}{\partial z_1}\frac{\partial z_1}{\partial w_{A2}}\
\frac{\partial L}{\partial z_1}\frac{\partial z_1}{\partial w_{A3}}\\
\frac{\partial L}{\partial z_2}\frac{\partial z_2}{\partial w_{B1}}\
\frac{\partial L}{\partial z_2}\frac{\partial z_2}{\partial w_{B2}}\
\frac{\partial L}{\partial z_2}\frac{\partial z_2}{\partial w_{B3}}\\
\end{bmatrix} \\
\\
%
%
&=
\begin{bmatrix}
\frac{\partial L}{\partial z_1}x_1\
\frac{\partial L}{\partial z_1}x_2\
\frac{\partial L}{\partial z_1}x_3\\
\frac{\partial L}{\partial z_2}x_1\
\frac{\partial L}{\partial z_2}x_2\
\frac{\partial L}{\partial z_2}x_3\\
\end{bmatrix} \\
\\
%
%
&=
\begin{bmatrix}
\frac{\partial L}{\partial z_1} \\
\frac{\partial L}{\partial z_2} \\
\end{bmatrix}
\begin{bmatrix} x_1\ x_2\ x_3 \end{bmatrix} \\
\\
%
%
&= \frac{\partial L}{\partial Z} X^T \\
\\
%
%
\frac{\partial L}{\partial B} =
\begin{bmatrix}
\frac{\partial L}{\partial b_1} \\
\frac{\partial L}{\partial b_2} \\
\end{bmatrix} &=
\begin{bmatrix}
\frac{\partial L}{\partial z_1}\frac{\partial z_1}{\partial b_1} \\
\frac{\partial L}{\partial z_2}\frac{\partial z_2}{\partial b_2} \\
\end{bmatrix} \\
\\
&=
\begin{bmatrix}
\frac{\partial L}{\partial z_1} \\
\frac{\partial L}{\partial z_2} \\
\end{bmatrix} \\
\\
& = \frac{\partial L}{\partial Z} \\
\end{align}
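The three matrix formulas can be sketched directly in NumPy (all values below are arbitrary assumptions):

```python
import numpy as np

# 2 neurons, 3 inputs, batch size 1
W = np.array([[0.0, 1.0, 2.0],
              [3.0, 4.0, 5.0]])       # weight matrix, shape (2, 3)
X = np.array([[1.0], [2.0], [3.0]])   # input column, shape (3, 1)
dL_dZ = np.array([[1.0], [-1.0]])     # upstream gradient, shape (2, 1)

dL_dX = W.T @ dL_dZ   # shape (3, 1), same as X
dL_dW = dL_dZ @ X.T   # shape (2, 3), same as W
dL_dB = dL_dZ         # shape (2, 1), same as B
```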
Multiple Neurons, Multiple Inputs, Batch Size N
Consider the same network (2 neurons, 3 inputs) as in the previous
example, but now with a batch size of 2 ($N = 2$), such that two sets
of inputs ($x_{11}$, $x_{12}$, $x_{13}$) and ($x_{21}$, $x_{22}$,
$x_{23}$) produce two sets of outputs ($z_{11}$, $z_{12}$) and
($z_{21}$, $z_{22}$), respectively. The outputs are given by
\begin{align}
z_{11} &= w_{A1}x_{11} + w_{A2}x_{12} + w_{A3}x_{13} + b_1 \\
z_{12} &= w_{B1}x_{11} + w_{B2}x_{12} + w_{B3}x_{13} + b_2 \\
\\
z_{21} &= w_{A1}x_{21} + w_{A2}x_{22} + w_{A3}x_{23} + b_1 \\
z_{22} &= w_{B1}x_{21} + w_{B2}x_{22} + w_{B3}x_{23} + b_2 \\
\begin{bmatrix}
z_{11}\ z_{21}\\
z_{12}\ z_{22}
\end{bmatrix} &=
\begin{bmatrix}
w_{A1}\ w_{A2}\ w_{A3} \\
w_{B1}\ w_{B2}\ w_{B3}
\end{bmatrix}
\begin{bmatrix}
x_{11}\ x_{21}\\
x_{12}\ x_{22}\\
x_{13}\ x_{23}
\end{bmatrix} +
\begin{bmatrix}
b_1\ b_1 \\
b_2\ b_2
\end{bmatrix} \\
\\
Z &= WX + B \\
\end{align}
where $W$ is the weight matrix, $X$ is the input matrix, $B$ is
the bias matrix and $Z$ is the output matrix. Comparing these
matrices with those in the previous section, you will notice that
the $W$ matrix stays the same, while the $X$ and $Z$ matrices have
two columns corresponding to $N = 2$. The $B$ matrix also has two
columns corresponding to $N = 2$, but its entries are simply
duplicated: the same biases are applied to every sample in the batch.
Given $\frac{\partial L}{\partial Z}$, the weight gradient, input gradient
and the bias gradient can be calculated as follows.
\begin{align}
\frac{\partial L}{\partial X} &=
\begin{bmatrix}
\frac{\partial L}{\partial x_{11}}\ \frac{\partial L}{\partial x_{21}} \\
\frac{\partial L}{\partial x_{12}}\ \frac{\partial L}{\partial x_{22}} \\
\frac{\partial L}{\partial x_{13}}\ \frac{\partial L}{\partial x_{23}} \\
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{\partial L}{\partial z_{11}}\frac{\partial z_{11}}{\partial x_{11}} + \frac{\partial L}{\partial z_{12}}\frac{\partial z_{12}}{\partial x_{11}} \ \
\frac{\partial L}{\partial z_{21}}\frac{\partial z_{21}}{\partial x_{21}} + \frac{\partial L}{\partial z_{22}}\frac{\partial z_{22}}{\partial x_{21}} \\
\frac{\partial L}{\partial z_{11}}\frac{\partial z_{11}}{\partial x_{12}} + \frac{\partial L}{\partial z_{12}}\frac{\partial z_{12}}{\partial x_{12}} \ \
\frac{\partial L}{\partial z_{21}}\frac{\partial z_{21}}{\partial x_{22}} + \frac{\partial L}{\partial z_{22}}\frac{\partial z_{22}}{\partial x_{22}} \\
\frac{\partial L}{\partial z_{11}}\frac{\partial z_{11}}{\partial x_{13}} + \frac{\partial L}{\partial z_{12}}\frac{\partial z_{12}}{\partial x_{13}} \ \
\frac{\partial L}{\partial z_{21}}\frac{\partial z_{21}}{\partial x_{23}} + \frac{\partial L}{\partial z_{22}}\frac{\partial z_{22}}{\partial x_{23}} \\
\end{bmatrix} \\
\\
%
%
&=
\begin{bmatrix}
\frac{\partial L}{\partial z_{11}}w_{A1} + \frac{\partial L}{\partial z_{12}}w_{B1} \ \
\frac{\partial L}{\partial z_{21}}w_{A1} + \frac{\partial L}{\partial z_{22}}w_{B1} \\
\frac{\partial L}{\partial z_{11}}w_{A2} + \frac{\partial L}{\partial z_{12}}w_{B2} \ \
\frac{\partial L}{\partial z_{21}}w_{A2} + \frac{\partial L}{\partial z_{22}}w_{B2} \\
\frac{\partial L}{\partial z_{11}}w_{A3} + \frac{\partial L}{\partial z_{12}}w_{B3} \ \
\frac{\partial L}{\partial z_{21}}w_{A3} + \frac{\partial L}{\partial z_{22}}w_{B3} \\
\end{bmatrix} \\
\\
%
%
&=
\begin{bmatrix}
w_{A1}\ w_{B1} \\
w_{A2}\ w_{B2} \\
w_{A3}\ w_{B3} \\
\end{bmatrix}
\begin{bmatrix}
\frac{\partial L}{\partial z_{11}} \
\frac{\partial L}{\partial z_{21}} \\
\frac{\partial L}{\partial z_{12}} \
\frac{\partial L}{\partial z_{22}}
\end{bmatrix} \\
\\
%
%
&= W^T \frac{\partial L}{\partial Z}\\
\end{align}
\begin{align}
\frac{\partial L}{\partial W} &=
\begin{bmatrix}
\frac{\partial L}{\partial w_{A1}}\ \frac{\partial L}{\partial w_{A2}}\ \frac{\partial L}{\partial w_{A3}} \\
\frac{\partial L}{\partial w_{B1}}\ \frac{\partial L}{\partial w_{B2}}\ \frac{\partial L}{\partial w_{B3}}
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{\partial L}{\partial z_{11}}\frac{\partial z_{11}}{\partial w_{A1}} +
\frac{\partial L}{\partial z_{21}}\frac{\partial z_{21}}{\partial w_{A1}} \ \
\frac{\partial L}{\partial z_{11}}\frac{\partial z_{11}}{\partial w_{A2}} +
\frac{\partial L}{\partial z_{21}}\frac{\partial z_{21}}{\partial w_{A2}} \ \
\frac{\partial L}{\partial z_{11}}\frac{\partial z_{11}}{\partial w_{A3}} +
\frac{\partial L}{\partial z_{21}}\frac{\partial z_{21}}{\partial w_{A3}} \\
\frac{\partial L}{\partial z_{12}}\frac{\partial z_{12}}{\partial w_{B1}} +
\frac{\partial L}{\partial z_{22}}\frac{\partial z_{22}}{\partial w_{B1}} \ \
\frac{\partial L}{\partial z_{12}}\frac{\partial z_{12}}{\partial w_{B2}} +
\frac{\partial L}{\partial z_{22}}\frac{\partial z_{22}}{\partial w_{B2}} \ \
\frac{\partial L}{\partial z_{12}}\frac{\partial z_{12}}{\partial w_{B3}} +
\frac{\partial L}{\partial z_{22}}\frac{\partial z_{22}}{\partial w_{B3}}\\
\end{bmatrix} \\
\\
%
%
&=
\begin{bmatrix}
\frac{\partial L}{\partial z_{11}}x_{11} +
\frac{\partial L}{\partial z_{21}}x_{21} \ \
\frac{\partial L}{\partial z_{11}}x_{12} +
\frac{\partial L}{\partial z_{21}}x_{22} \ \
\frac{\partial L}{\partial z_{11}}x_{13} +
\frac{\partial L}{\partial z_{21}}x_{23} \\
\frac{\partial L}{\partial z_{12}}x_{11} +
\frac{\partial L}{\partial z_{22}}x_{21} \ \
\frac{\partial L}{\partial z_{12}}x_{12} +
\frac{\partial L}{\partial z_{22}}x_{22} \ \
\frac{\partial L}{\partial z_{12}}x_{13} +
\frac{\partial L}{\partial z_{22}}x_{23} \\
\end{bmatrix} \\
\\
%
%
&=
\begin{bmatrix}
\frac{\partial L}{\partial z_{11}}\ \frac{\partial L}{\partial z_{21}} \\
\frac{\partial L}{\partial z_{12}}\ \frac{\partial L}{\partial z_{22}} \\
\end{bmatrix}
\begin{bmatrix}
x_{11}\ x_{12}\ x_{13} \\
x_{21}\ x_{22}\ x_{23} \\
\end{bmatrix} \\
\\
%
%
&= \frac{\partial L}{\partial Z} X^T \\
\end{align}
\begin{align}
\frac{\partial L}{\partial B} =
\begin{bmatrix}
\frac{\partial L}{\partial b_1} \\
\frac{\partial L}{\partial b_2} \\
\end{bmatrix} &=
\begin{bmatrix}
\frac{\partial L}{\partial z_{11}}\frac{\partial z_{11}}{\partial b_1} +
\frac{\partial L}{\partial z_{21}}\frac{\partial z_{21}}{\partial b_1} \\
\frac{\partial L}{\partial z_{12}}\frac{\partial z_{12}}{\partial b_2} +
\frac{\partial L}{\partial z_{22}}\frac{\partial z_{22}}{\partial b_2} \\
\end{bmatrix} \\
\\
&=
\begin{bmatrix}
\frac{\partial L}{\partial z_{11}} + \frac{\partial L}{\partial z_{21}} \\
\frac{\partial L}{\partial z_{12}} + \frac{\partial L}{\partial z_{22}} \\
\end{bmatrix} \\
\\
&=
\begin{bmatrix}
\frac{\partial L}{\partial z_{11}}\ \frac{\partial L}{\partial z_{21}} \\
\frac{\partial L}{\partial z_{12}}\ \frac{\partial L}{\partial z_{22}} \\
\end{bmatrix}
\begin{bmatrix}
1 \\
1 \\
\end{bmatrix}
\\
& = \frac{\partial L}{\partial Z}\, \mathbf{1} \\
\end{align}
where $\mathbf{1}$ is a column vector of ones. In other words, the
bias gradient is the sum of $\frac{\partial L}{\partial Z}$ over the
batch, i.e., across its columns.
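Putting the batch results together, the entire backward pass of an FC layer reduces to three matrix expressions, with the multiplication by a ones vector performing the column-wise (batch) sum. A sketch, where the sizes and random values are assumptions:

```python
import numpy as np

def fc_backward(dL_dZ, W, X):
    """Backward pass for Z = W @ X + B, where each column of X is one batch sample."""
    dL_dX = W.T @ dL_dZ                        # input gradient, passed to the previous layer
    dL_dW = dL_dZ @ X.T                        # weight gradient
    dL_dB = dL_dZ @ np.ones((X.shape[1], 1))   # bias gradient: sum over the batch columns
    return dL_dX, dL_dW, dL_dB

# 2 neurons, 3 inputs, batch size N = 2
rng = np.random.default_rng(1)
W = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 2))
dL_dZ = rng.standard_normal((2, 2))
dL_dX, dL_dW, dL_dB = fc_backward(dL_dZ, W, X)
```

Note that each gradient has the same shape as the matrix it corresponds to, which is a quick sanity check when implementing this.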