How to Evaluate a Classifier

One of an ML practitioner's most critical tasks is evaluating a model's performance. This is a crucial step because it shows whether the model is of sufficient quality and ready to perform in a production environment. This section will teach you the metrics most commonly used to evaluate a classification model. These metrics will also help you detect overfitting or underfitting, leading to improved model quality.

As you learned in the previous chapter, classification is the task of predicting the class to which input data belongs. One example would be to classify whether the text from an email (input data) is spam (one class) or not spam (another class).

When building a classification system, we need a way to evaluate the classifier, and we want evaluation metrics that reflect its true performance.

This chapter will go through the most commonly used metrics and how they combine to provide a balanced view of a classifier’s performance. We will cover four metrics:

  • Accuracy
  • Precision
  • Recall
  • F1

Accuracy

Accuracy is a simple metric that measures the overall fraction of predictions the classifier gets right. You calculate it by dividing the number of correct predictions by the total number of predictions made. This evaluation metric works well when the target classes in the data are well balanced.

In spam detection, for example, accuracy on its own is only trustworthy if the target classes—spam and not spam—appear in roughly equal proportions. Unfortunately, real-world data is usually not well balanced, so this metric alone might not give you a reliable evaluation.

Suppose, for example, you have a dataset with 100 spam messages and 9,900 non-spam ones. A classifier that classifies everything as non-spam would have an accuracy of 99% (9,900/10,000).

Despite its very high accuracy, the classifier in this example is ineffective because it misses every single spam email. You’d therefore want to examine other metrics, too, because accuracy alone doesn’t quantify the classifier’s actual usefulness.
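To make this concrete, here is a minimal sketch—assuming scikit-learn and a toy dataset with the counts above—showing how a classifier that predicts "not spam" for everything still scores 99% accuracy:

```python
from sklearn.metrics import accuracy_score

# Toy labels: 9,900 non-spam (0) and 100 spam (1) messages
y_true = [0] * 9900 + [1] * 100

# A "classifier" that predicts non-spam for every message
y_pred = [0] * 10000

print(accuracy_score(y_true, y_pred))  # 0.99 -- high accuracy, yet every spam email is missed
```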

Confusion Matrix

The confusion matrix is a popular evaluation tool for classification models. It is a table that summarizes the model's performance by comparing actual and predicted values. It is common to split the data into training and testing sets when building a model. During testing, you compare the actual target class with the predicted one, which leads to one of four outcomes:

  • True positive: the model predicts a positive outcome, and the prediction is correct.
  • False positive: the model predicts a positive outcome, but the prediction is wrong. This is also called a type 1 error.
  • True negative: the model predicts a negative outcome, and the prediction is correct.
  • False negative: the model predicts a negative outcome, but the prediction is wrong—the actual class is positive. This is also called a type 2 error.

These four counts make up the confusion matrix, which remains informative even for imbalanced data. Ideally, you want high true positive and true negative counts while keeping false positives and false negatives low.

For example, in spam filtering, true positives are spam messages sent to the trash folder, while false positives are non-spam messages sent to the trash folder. True negatives are non-spam messages delivered to the inbox, while false negatives are spam messages delivered to the inbox.
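As a rough sketch—assuming scikit-learn and small, made-up label arrays where 1 means spam and 0 means not spam—you can compute the four counts like this:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = spam, 0 = not spam
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]

# With labels ordered [0, 1], ravel() returns tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")  # TP=3, FP=1, TN=3, FN=1
```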

Precision

Precision is a valuable metric for evaluating classifiers, especially for imbalanced data and scenarios where false positives are the bigger concern. It measures how many of the model's positive predictions turned out to be correct. You calculate it by dividing the number of true positives by the total number of predicted positives (true positives plus false positives).

Returning to the spam filtering example: precision is the ratio of spam messages sent to the trash to the total number of messages sent to the trash. In other words, it tells you how many of the messages flagged as spam really were spam.
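Continuing with the hypothetical counts from the confusion matrix sketch above, precision works out as follows:

```python
# Hypothetical counts from the confusion matrix sketch above
tp, fp = 3, 1

precision = tp / (tp + fp)  # spam correctly flagged / all messages flagged as spam
print(precision)  # 0.75
```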

Recall

Recall, also known as sensitivity, measures how many of the actual positives the model managed to find. You calculate it by dividing the number of true positives by the sum of true positives and false negatives. Monitoring recall is vital for imbalanced data and for scenarios where false negative predictions have a high cost.

In the case of the spam filtering example, recall refers to the number of spam messages sent to the trash divided by the total number of spam messages.
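Using the same hypothetical counts as before, recall is computed like this:

```python
# Hypothetical counts from the confusion matrix sketch above
tp, fn = 3, 1

recall = tp / (tp + fn)  # spam caught / all spam messages
print(recall)  # 0.75
```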

F1-Score

The F1-score is a metric used to evaluate binary classifiers—classifiers that have only two target classes, for example, positive or negative. You calculate it by taking the harmonic mean of precision and recall. It’s a popular performance metric that works well for imbalanced data and balances precision and recall in a single number. Using the harmonic mean rather than the simple average is preferable because the harmonic mean drops sharply whenever either precision or recall is small, so a classifier can only score well on F1 if both are reasonably high.
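The following sketch, with made-up precision and recall values, shows how the harmonic mean punishes a low recall where the simple average would hide it:

```python
precision, recall = 0.9, 0.1  # made-up values to illustrate the effect

f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
arithmetic_mean = (precision + recall) / 2

print(f1)               # 0.18 -- dragged down by the low recall
print(arithmetic_mean)  # 0.50 -- hides how poor the recall is
```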

Tradeoffs Between Different Metrics

Most classifiers output a probability, and a decision threshold turns that probability into a target class label. The threshold influences precision and recall, which typically trade off against each other: increasing one tends to decrease the other. This relationship is usually plotted as a precision versus recall curve.

You can also use the receiver operating characteristic (ROC) curve. It plots the true positive rate against the false positive rate at different thresholds and tells you how well the model separates the two classes in a binary classification task.

Depending on your use case, you may need fewer false positives (higher precision) or fewer false negatives (higher recall). Adjusting the threshold lets you trade one for the other: lowering it typically increases recall, while raising it typically increases precision.
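As a rough sketch—assuming scikit-learn and made-up predicted probabilities from some classifier—you can inspect this trade-off with precision_recall_curve and pick a threshold that fits your use case:

```python
from sklearn.metrics import precision_recall_curve, roc_curve

# Made-up true labels and predicted spam probabilities for illustration
y_true  = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.2, 0.8, 0.65, 0.55, 0.9]

# Precision and recall at each candidate threshold
precisions, recalls, thresholds = precision_recall_curve(y_true, y_score)
for p, r, t in zip(precisions, recalls, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")

# The ROC curve plots the true positive rate against the false positive rate instead
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)
```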

If you’d like to learn more about classification error metrics, please check out this video.

Conclusion

In this chapter, you learned the four basic metrics for evaluating classification models. In the next chapter, you'll take a deeper dive into these metrics.

Original Source

This material comes from the posts Text Classification Intuition for Software Developers and Classification Evaluation Metrics: Accuracy, Precision, Recall, and F1 Visually Explained.


What’s Next

That's it for the introduction to NLP! You are now ready to start learning about LLMs!