Back Propagation in Softmax Layer
January 28, 2018
Consider a softmax layer with 2 inputs $x_1$ and $x_2$, and 2 outputs
$y_1$ and $y_2$. Each output depends on all the inputs. During forward
propagation, the outputs are given by
\begin{align}
y_1 &= \frac{e^{x_1}}{e^{x_1} + e^{x_2}} \\
y_2 &= \frac{e^{x_2}}{e^{x_1} + e^{x_2}}
\end{align}
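As a quick sanity check on these formulas, the forward pass can be sketched in NumPy (this code is not from the original note; subtracting the max before exponentiating is a standard numerical-stability trick that leaves the outputs unchanged, since the shift cancels in the ratio):

```python
import numpy as np

def softmax(x):
    # Shift by the max for numerical stability; e^{x_i - m} / sum_j e^{x_j - m}
    # equals e^{x_i} / sum_j e^{x_j} because the factor e^{-m} cancels.
    e = np.exp(x - np.max(x))
    return e / e.sum()

y = softmax(np.array([1.0, 2.0]))
# y[0] = e^1 / (e^1 + e^2), y[1] = e^2 / (e^1 + e^2), and y[0] + y[1] = 1
```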
Note that $y_1 + y_2 = 1$. During forward propagation, given inputs
$x_1$ and $x_2$, we calculate the outputs $y_1$ and $y_2$. During
backward propagation, given $\frac{\partial L}{\partial y_1}$
and $\frac{\partial L}{\partial y_2}$, we want to calculate
$\frac{\partial L}{\partial x_1}$ and $\frac{\partial L}{\partial
x_2}$, where $L$ is the loss (error) of the network.
Using the equations above,
\begin{align}
\frac{\partial L}{\partial x_1}
&= \frac{\partial L}{\partial y_1}\frac{\partial y_1}{\partial x_1} +
\frac{\partial L}{\partial y_2}\frac{\partial y_2}{\partial x_1} \\
&= \frac{\partial L}{\partial y_1}\frac{e^{x_1}(e^{x_1}+e^{x_2}) - e^{x_1}e^{x_1}}{(e^{x_1} + e^{x_2})^2} +
\frac{\partial L}{\partial y_2}\frac{-e^{x_2}e^{x_1}}{(e^{x_1} + e^{x_2})^2} \\
&= \frac{\partial L}{\partial y_1}(y_1 - y_1^2) -
\frac{\partial L}{\partial y_2}(y_1y_2) \\
&= -y_1 \left(\frac{\partial L}{\partial y_1}y_1 + \frac{\partial L}{\partial y_2}y_2 - \frac{\partial L}{\partial y_1}\right) \\
&= -y_1 \left(\sum_{j=1}^{2}\frac{\partial L}{\partial y_j}y_j - \frac{\partial L}{\partial y_1}\right)
\end{align}
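The intermediate Jacobian entries used above, $\frac{\partial y_1}{\partial x_1} = y_1 - y_1^2$ and $\frac{\partial y_2}{\partial x_1} = -y_1 y_2$, can be verified against a finite-difference estimate (this check is my addition, not part of the original note; the inputs are arbitrary):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([0.5, -1.2])   # arbitrary test inputs
y = softmax(x)
eps = 1e-6

# Perturb x_1 only and estimate dy/dx_1 numerically.
dy_dx1 = (softmax(x + np.array([eps, 0.0])) - y) / eps

# Analytic values from the derivation: dy1/dx1 = y1 - y1^2, dy2/dx1 = -y1*y2.
analytic = np.array([y[0] - y[0]**2, -y[0] * y[1]])
assert np.allclose(dy_dx1, analytic, atol=1e-5)
```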
In general, for a softmax layer with $N$ inputs and $N$ outputs, the same derivation gives
\begin{equation}
\frac{\partial L}{\partial x_i} = -y_i \left(\sum_{j=1}^{N}\frac{\partial L}{\partial y_j}y_j - \frac{\partial L}{\partial y_i}\right)
\end{equation}
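The general-$N$ backward pass is a one-liner in NumPy. A minimal sketch (my own illustration, with hypothetical function names and an arbitrary upstream gradient), checked against finite differences of the loss $L = \sum_j \frac{\partial L}{\partial y_j} y_j$ with the upstream gradient held fixed:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_backward(y, dL_dy):
    # dL/dx_i = -y_i * (sum_j (dL/dy_j) y_j - dL/dy_i)
    return -y * (np.dot(dL_dy, y) - dL_dy)

x = np.array([0.2, -0.7, 1.5])          # arbitrary inputs
dL_dy = np.array([0.1, -0.3, 0.05])     # arbitrary upstream gradient
y = softmax(x)
dL_dx = softmax_backward(y, dL_dy)

# Finite-difference check: perturb each x_i and compare against dL_dx.
eps = 1e-6
for i in range(len(x)):
    x_p = x.copy()
    x_p[i] += eps
    numeric = (np.dot(dL_dy, softmax(x_p)) - np.dot(dL_dy, y)) / eps
    assert abs(numeric - dL_dx[i]) < 1e-5
```

A useful property of this form is that it never materializes the full $N \times N$ Jacobian; the whole gradient is computed with one dot product and elementwise operations.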