Back Propagation in Softmax Layer

January 28, 2018

Consider a softmax layer with 2 inputs $x_1$ and $x_2$, and 2 outputs $y_1$ and $y_2$. Each output depends on all the inputs. During forward propagation, the outputs are given by

\begin{align} y_1 &= \frac{e^{x_1}}{e^{x_1} + e^{x_2}} \\ y_2 &= \frac{e^{x_2}}{e^{x_1} + e^{x_2}} \end{align}

Note that $y_1 + y_2 = 1$. During forward propagation, given inputs $x_1$ and $x_2$, we calculate the outputs $y_1$ and $y_2$. During backward propagation, given $\frac{\partial L}{\partial y_1}$ and $\frac{\partial L}{\partial y_2}$, we want to calculate $\frac{\partial L}{\partial x_1}$ and $\frac{\partial L}{\partial x_2}$, where $L$ is the loss (error) of the network.
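
As a quick sanity check on the forward pass, here is a minimal NumPy sketch (the function name `softmax_forward` and the max-subtraction for numerical stability are my own additions, not part of the derivation above):

```python
import numpy as np

def softmax_forward(x):
    """Forward pass of a softmax layer for a 1-D input vector x."""
    # Subtracting max(x) before exponentiating is a common numerical-stability
    # trick; the extra factor cancels in the ratio, so the result is unchanged.
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

x = np.array([2.0, 1.0])   # x_1, x_2
y = softmax_forward(x)     # y_1, y_2
print(y, y.sum())          # the outputs sum to 1
```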

Using the equations above,

\begin{align}
\frac{\partial L}{\partial x_1} &= \frac{\partial L}{\partial y_1}\frac{\partial y_1}{\partial x_1} + \frac{\partial L}{\partial y_2}\frac{\partial y_2}{\partial x_1} \\
&= \frac{\partial L}{\partial y_1}\frac{e^{x_1}(e^{x_1}+e^{x_2}) - e^{x_1}e^{x_1}}{(e^{x_1} + e^{x_2})^2} + \frac{\partial L}{\partial y_2}\frac{-e^{x_2}e^{x_1}}{(e^{x_1} + e^{x_2})^2} \\
&= \frac{\partial L}{\partial y_1}(y_1 - y_1^2) - \frac{\partial L}{\partial y_2}(y_1 y_2) \\
&= -y_1 \left(\frac{\partial L}{\partial y_1}y_1 + \frac{\partial L}{\partial y_2}y_2 - \frac{\partial L}{\partial y_1}\right) \\
&= -y_1 \left(\sum_{j=1}^{2}\frac{\partial L}{\partial y_j}y_j - \frac{\partial L}{\partial y_1}\right)
\end{align}

The derivative with respect to $x_2$ follows from the same calculation with the indices swapped. In general, for a softmax layer with $N$ inputs and outputs,

\begin{equation}
\frac{\partial L}{\partial x_i} = -y_i \left(\sum_{j=1}^{N}\frac{\partial L}{\partial y_j}y_j - \frac{\partial L}{\partial y_i}\right)
\end{equation}
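
The general formula vectorizes directly. Below is a minimal NumPy sketch of the backward pass, together with a finite-difference check (the function names, the test values, and the linearized loss $L = \sum_j \frac{\partial L}{\partial y_j} y_j$ used for the check are my own illustrative choices):

```python
import numpy as np

def softmax_forward(x):
    # Repeated here so the block is self-contained.
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def softmax_backward(y, dL_dy):
    """Backward pass: dL/dx_i = -y_i * (sum_j dL/dy_j * y_j - dL/dy_i)."""
    s = np.dot(dL_dy, y)        # sum_j dL/dy_j * y_j
    return -y * (s - dL_dy)

x = np.array([2.0, 1.0])
dL_dy = np.array([0.3, -0.7])   # hypothetical upstream gradient dL/dy

y = softmax_forward(x)
analytic = softmax_backward(y, dL_dy)

# Finite-difference check: treat dL/dy as a fixed upstream gradient and
# differentiate L(x) = dL_dy . softmax(x) numerically.
eps = 1e-6
numeric = np.zeros_like(x)
for i in range(len(x)):
    xp, xm = x.copy(), x.copy()
    xp[i] += eps
    xm[i] -= eps
    numeric[i] = (np.dot(dL_dy, softmax_forward(xp)) -
                  np.dot(dL_dy, softmax_forward(xm))) / (2 * eps)

print(analytic)   # gradient from the formula above
print(numeric)    # numerical gradient; should agree closely
```

Note that the gradient components sum to zero. This is expected: adding the same constant to every input leaves the softmax outputs, and hence the loss, unchanged, and it also follows from the formula, since $\sum_i y_i \left(\sum_j \frac{\partial L}{\partial y_j} y_j\right) = \sum_i y_i \frac{\partial L}{\partial y_i}$ when $\sum_i y_i = 1$.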