Cross Entropy Loss
Machine learning is used in a myriad of classification tasks. What makes these tasks work is the loss function, which drives what the model learns from its training dataset and thereby helps it classify. This post is about the cross-entropy loss function, which is very commonly used for classification tasks in machine learning.
Multi-Class Classification
In multi-class classification, a data point can belong to only one label out of many available labels. For example, given the sentiment labels positive, neutral, and negative, a review can belong to only one of them: it can be positive, negative, or neutral, but not more than one at once. Consequently, the output vector of a neural network in multi-class classification represents the probability of an input instance belonging to each of the available labels. All the labels are mutually exclusive, i.e. no two labels can appear together, which is why the probabilities in the output vector sum to 1.
Let’s take an example. Suppose a review needs to be classified as positive, negative, or neutral. In this case, the output vector of the NN might look like this:
Output Vector: [0.2 0.3 0.5]
Each value in the output vector represents the probability of the input instance being a positive, negative, or neutral review respectively, and the values sum to 1. Assume the ground truth of the review is positive; its ground truth label vector then looks like this:
Ground Truth Vector: [1.0 0 0]
To train the NN (or model), the cross-entropy loss function computes a loss for each class and sums them up. In cross-entropy, the loss for each class is given by:
Loss of class X = -p(X).log(q(X))
where p(X) -> probability of class X in the ground truth vector &
q(X) -> probability of class X in the predicted output vector
So, the loss for each class in the above example becomes:
Loss for class Positive = -(1.0).log(0.2) ≈ 1.61 (using the natural log)
Loss for class Negative = -(0).log(0.3) = 0
Loss for class Neutral = -(0).log(0.5) = 0
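To make the arithmetic concrete, here is a minimal Python sketch (using the natural logarithm, which is the usual convention in ML; the variable names are my own) that reproduces the three per-class terms:

```python
import math

p = [1.0, 0.0, 0.0]  # ground truth vector (positive, negative, neutral)
q = [0.2, 0.3, 0.5]  # predicted output vector

for label, p_x, q_x in zip(["Positive", "Negative", "Neutral"], p, q):
    # Per-class term: -p(X) * log(q(X)); a zero ground truth value contributes nothing
    loss = -p_x * math.log(q_x) if p_x > 0 else 0.0
    print(f"Loss for class {label} = {loss:.2f}")
# Prints approximately 1.61, 0.00, 0.00
```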
We can draw a few conclusions from the above results:
- The loss for a class is higher when the difference between the ground truth value and the predicted value is larger. For the positive class, the ground truth value was 1.0 while the predicted value was only 0.2, which resulted in the relatively high loss of 1.61.
- On the same note, if the ground truth value and the predicted value for a class are the same (0 and 0, or 1 and 1), the loss for that class is 0, since the term is either multiplied by 0 or equals -log(1) = 0. Try it with the example above.
- Cross-entropy only cares about the loss for the actual label given in the ground truth vector. It gives no weight to the other labels while calculating the loss, so their terms are zero (here, the loss for the negative and neutral labels is zero).
Cross-Entropy loss can be summarized by the below formula:
CE = \(-\sum_{x}p(x)\log(q(x))\)
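As a sketch, the general formula can be written as a small Python function (the name cross_entropy is my own choice, not from any particular library):

```python
import math

def cross_entropy(p, q):
    # Sum of -p(x) * log(q(x)) over all classes; zero ground-truth terms are skipped
    return -sum(p_x * math.log(q_x) for p_x, q_x in zip(p, q) if p_x > 0)

print(round(cross_entropy([1.0, 0.0, 0.0], [0.2, 0.3, 0.5]), 2))  # ~1.61, matching the worked example
```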
Why sum up over all classes if the loss for most of them is zero?
We can see from the above example that the loss terms for the incorrect classes are always zero. Hence, it doesn’t make much sense to compute a loss for every class. Whenever the ground truth label vector is a one-hot vector, we can ignore the other labels and focus only on the hot class when computing the cross-entropy loss. The cross-entropy loss then becomes:
CE = -log(q(x)), where x is the hot (true) class
This is called categorical cross-entropy loss. In multi-class classification, this form is often used for simplicity.
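In code, the one-hot shortcut is just a lookup of the predicted probability of the true class. A minimal sketch, assuming the true class is given by its index:

```python
import math

def categorical_cross_entropy(true_index, q):
    # With a one-hot ground truth, only the true class contributes: -log(q[true_index])
    return -math.log(q[true_index])

print(f"{categorical_cross_entropy(0, [0.2, 0.3, 0.5]):.2f}")  # ~1.61 for the positive class
```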
The general summation form of CE is still needed when the ground truth is itself a distribution rather than a one-hot vector. Cross-entropy works on this as well:
Ground Truth Vector: [0.6 0.3 0.1]
Output Vector: [0.2 0.3 0.5]
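Plugging these two distributions into the summation form gives a loss of roughly 1.40 (a quick check, again using the natural log):

```python
import math

p = [0.6, 0.3, 0.1]  # ground truth distribution (not one-hot)
q = [0.2, 0.3, 0.5]  # predicted distribution

ce = -sum(p_x * math.log(q_x) for p_x, q_x in zip(p, q))
print(f"{ce:.2f}")  # ~1.40
```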
Binary Classification
Binary classification is exactly the same as multi-class classification, just with two labels to choose between. If one label is not the actual label for an instance, the other label automatically becomes the relevant one. In this setting, the output is not a probability vector but a single value obtained from the sigmoid activation function. Let a and b be the two labels. If the sigmoid outputs a value > 0.5, then label a becomes the predicted label; otherwise, label b does. We can also denote the labels as 1 and 0. All of this can be converted into ground truth and predicted output vectors to compute the cross-entropy loss.
Ground Truth Vectors:
For label a (1): [1.0 0.0]
For label b (0): [0.0 1.0]
Prediction Vector (sigmoid value 0.8): [0.8 0.2]
Suppose an instance has ground truth label a and its predicted vector is the one shown above. Then the cross-entropy can be computed as:
CE: -p(a).log(q(a)) - p(b).log(q(b))
Since this is binary classification, we can also write it as:
CE: -p(a).log(q(a)) - (1 - p(a)).log(1 - q(a)), since p(b) = 1 - p(a) and q(b) = 1 - q(a)
This is called the binary cross-entropy loss function.
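A minimal sketch of binary cross-entropy for the example above, with ground truth label a (p(a) = 1.0) and sigmoid output 0.8 (the function name is my own):

```python
import math

def binary_cross_entropy(p_a, q_a):
    # -p(a)*log(q(a)) - (1 - p(a))*log(1 - q(a))
    return -p_a * math.log(q_a) - (1.0 - p_a) * math.log(1.0 - q_a)

print(f"{binary_cross_entropy(1.0, 0.8):.3f}")  # ~0.223
```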
Multi-Label Classification
In multi-label classification, each instance can have more than one correct label, unlike multi-class classification where there is only one correct label. The task is to decide, for each label, whether it is relevant or not (i.e. 1 or 0). In other words, it is a binary classification problem with respect to every label. Consequently, a binary cross-entropy loss is calculated for each label.
Let there be 3 labels X, Y, Z. The total cross-entropy can be calculated as:
CE = Binary_CE(X) + Binary_CE(Y) + Binary_CE(Z)
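As a sketch with illustrative label values (the vectors below are my own, not from the post), the multi-label loss is simply the sum of per-label binary cross-entropy terms:

```python
import math

def binary_cross_entropy(p, q):
    eps = 1e-12  # small constant to guard against log(0), a common practical choice
    return -p * math.log(q + eps) - (1.0 - p) * math.log(1.0 - q + eps)

ground_truth = {"X": 1.0, "Y": 0.0, "Z": 1.0}  # labels X and Z are relevant (illustrative)
predictions  = {"X": 0.9, "Y": 0.2, "Z": 0.6}  # per-label sigmoid outputs (illustrative)

total = sum(binary_cross_entropy(ground_truth[k], predictions[k]) for k in ground_truth)
print(f"{total:.3f}")  # ~0.839 for these values
```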