The last layer in a deep neural network is typically a sigmoid layer or a softmax layer. Both emit values between 0 and 1. In the former case the output values are independent of one another, while in the latter they sum to 1. In either case, the index of the maximum output can be used for classification.
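The contrast between the two layers can be sketched as follows. This is a minimal illustration with made-up logits, not code from any particular network:

```python
import numpy as np

# Hypothetical raw scores (logits) from the last linear layer.
z = np.array([2.0, 1.0, -1.0])

# Sigmoid: each output is squashed into (0, 1) independently of the others.
sigmoid = 1.0 / (1.0 + np.exp(-z))

# Softmax: the outputs are coupled and always sum to 1.
softmax = np.exp(z) / np.exp(z).sum()

print(sigmoid)             # three independent values in (0, 1)
print(softmax.sum())       # 1.0
print(np.argmax(softmax))  # index of the maximum -> predicted class
```

Note that the sigmoid outputs need not sum to anything in particular, whereas the softmax outputs form a probability distribution over the classes.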

The choice of the last layer in the network has an important bearing on the choice of the loss (error) function, such as Mean Squared Error, Cross Entropy, or Log-likelihood. The loss function takes as input:

- The outputs (a) of the last layer
- The label/truth vector (y) (typically one hot)

**Mean Squared Error (MSE)**: It is defined as $\sum_i
(y_i-a_i)^2$, i.e. the sum of the squares of the differences between
the true and the predicted values. The $a_i$'s and $y_i$'s have
values between 0 and 1, so each difference between a true
and a predicted value lies between -1 and 1. However, squaring
a number of magnitude less than 1 makes it smaller. Essentially,
MSE attenuates the difference.
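The attenuation is easy to see numerically. Below, a hypothetical one-hot label and a made-up output vector are used; each squared term is smaller than the absolute error it came from:

```python
import numpy as np

# Hypothetical one-hot label and sigmoid-style outputs for 3 classes.
y = np.array([1.0, 0.0, 0.0])
a = np.array([0.6, 0.3, 0.1])

diff = y - a              # each entry lies in [-1, 1]
mse = np.sum(diff ** 2)   # 0.16 + 0.09 + 0.01 = 0.26

# Squaring a value of magnitude < 1 shrinks it, so every squared
# term is smaller than the corresponding absolute error.
print(np.abs(diff))   # absolute errors, roughly [0.4, 0.3, 0.1]
print(diff ** 2)      # squared errors, all smaller
print(mse)
```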

What we need is a function that amplifies the difference between the true and the predicted values on the domain $[0,1]$. The log function has this property: $-\ln(a)$ grows without bound as $a$ approaches 0. It is used in the Cross Entropy and Log-likelihood cost functions.
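A quick numerical comparison makes the amplification concrete. As the predicted probability for the correct class drops toward 0, the squared error can never exceed 1, while the negative log blows up (the probabilities below are arbitrary examples):

```python
import numpy as np

# Compare the squared-error penalty (1 - a)^2 with the log
# penalty -ln(a) as the prediction a for the correct class shrinks.
for a in [0.9, 0.5, 0.1, 0.01]:
    print(a, (1 - a) ** 2, -np.log(a))
# The squared penalty is capped at 1; the log penalty keeps growing.
```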

**Cross Entropy Cost (CEC)**: It is defined as

$$
-\sum_i \{y_i \ln(a_i) + (1-y_i) \ln(1-a_i)\}
$$

Note that the $a_i$'s have values between 0 and 1 and the label vector $y$ is one hot, so the above equation calculates the cost as

$$
-\ln(\text{value predicted for the correct class}) - \sum \ln(1 - \text{value predicted for each incorrect class})
$$

The CEC makes the cost a function of both the correct values and the incorrect values. Ideally, when the value predicted for the correct class is 1 and the values predicted for the incorrect classes are 0, the cost is 0. If a sigmoid layer (at the end of the network) is used to obtain the values of the $a_i$'s, then the values for the correct class and the incorrect classes are independent of each other. The CEC function attempts to reduce the cost by both

- Increasing the value of the correct class towards 1, and
- Decreasing the values of the incorrect classes towards 0.
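Both contributions to the cost can be seen by decomposing the sum. The label and outputs below are made-up values for illustration; with a one-hot label, the formula reduces to one term for the correct class plus one term per incorrect class:

```python
import numpy as np

# Hypothetical one-hot label and independent sigmoid outputs
# (note they need not sum to 1).
y = np.array([1.0, 0.0, 0.0])
a = np.array([0.8, 0.2, 0.1])

# Cross entropy cost over all classes.
cec = -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))

# Equivalent decomposition for a one-hot label:
correct_term = -np.log(a[0])                        # penalizes a low correct value
incorrect_terms = -np.log(1 - a[1]) - np.log(1 - a[2])  # penalizes high incorrect values

print(cec)  # equals correct_term + incorrect_terms
```

Driving the correct output toward 1 shrinks the first term, and driving the incorrect outputs toward 0 shrinks the rest, which is exactly the two-pronged pressure described above.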

**Log-likelihood Cost (LLC)**: It is defined as

$$
-\sum_i y_i \ln(a_i)
$$

As before, the $a_i$'s have values between 0 and 1 and the label vector $y$ is one hot, so the above equation calculates the cost as

$$
-\ln(\text{value predicted for the correct class})
$$

The LLC makes the cost a function of only the value predicted for the correct class. Ideally, when the value predicted for the correct class is 1, the cost is 0. Such a cost function makes sense when increasing the value of the correct class decreases the values of the incorrect classes, so that as the value of the correct class approaches 1, the values of the incorrect classes approach 0. This can be achieved by using a softmax as the last layer of the neural network. For a softmax layer, $\sum a_i = 1$, so increasing the value of one class necessarily decreases the values of the other classes.
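This coupling can be demonstrated directly. In the sketch below (with made-up logits), only the correct class contributes to the cost, and raising the correct class's logit drives the cost toward 0:

```python
import numpy as np

# Hypothetical logits; softmax couples the outputs so they sum to 1.
z = np.array([3.0, 1.0, 0.5])
a = np.exp(z) / np.exp(z).sum()
y = np.array([1.0, 0.0, 0.0])  # one-hot label, class 0 is correct

# Log-likelihood cost: only the correct class's term survives.
llc = -np.sum(y * np.log(a))
print(np.isclose(llc, -np.log(a[0])))  # True

# Raising the correct logit pushes its probability toward 1,
# shrinks the others (they must still sum to 1), and lowers the cost.
z2 = np.array([10.0, 1.0, 0.5])
a2 = np.exp(z2) / np.exp(z2).sum()
print(-np.log(a2[0]) < llc)  # True: the cost decreased
```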

To summarize, use the cross entropy loss function with a sigmoid layer and the log-likelihood loss function with a softmax layer.

References

- Michael Nielsen. Improving the way neural networks learn. link