How to Evaluate a Classifier

Evaluating a model's performance is one of an ML practitioner's most critical tasks, because it shows whether the model is good enough to run in production environments. This section will teach you the main metrics used to evaluate a classification model. Tracking these metrics helps you spot overfitting or underfitting, which in turn helps you improve model quality.

As you learned in the previous chapter, classification is the task of predicting the class to which input data belongs. One example would be to classify whether the text from an email (input data) is spam (one class) or not spam (another class).

When building a classification system, we need a way to evaluate the classifier’s performance, and we want evaluation metrics that reflect how well it really performs.

This chapter will go through the most commonly used metrics and how they help provide a balanced view of a classifier’s performance. We will cover four metrics:

  • Accuracy
  • Precision
  • Recall
  • F1


Accuracy

Accuracy is a simple metric that measures the fraction of predictions the classifier gets right. You calculate it by dividing the number of correct predictions by the total number of predictions made. This evaluation metric works well for data that has well-balanced target classes.

For spam detection, for example, the two target classes (spam and not spam) would need to appear in roughly equal proportions. Unfortunately, real-world data is usually not well balanced, and accuracy alone might not give you a reliable evaluation. You might need to use it with other metrics if, for example:

  • You have a dataset with 100 spam messages and 9,900 non-spam ones
  • You have a classifier that labels everything as non-spam, which gives it 99% accuracy (9,900/10,000)

In this example, the classifier is ineffective because it missed every single spam email. Therefore, you’d want to measure other metrics, too, because accuracy alone doesn’t represent the classifier’s actual usefulness.
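The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch that assumes plain string labels and a hypothetical `accuracy` helper; it is not tied to any particular library.

```python
# Hypothetical dataset from the example: 100 spam and 9,900 non-spam messages.
actual = ["spam"] * 100 + ["not spam"] * 9900

# A "classifier" that labels every message as non-spam.
predicted = ["not spam"] * len(actual)


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


acc = accuracy(actual, predicted)

# Count how many spam messages were actually flagged as spam.
spam_caught = sum(
    a == "spam" and p == "spam" for a, p in zip(actual, predicted)
)

print(f"accuracy: {acc:.2%}")
print(f"spam messages caught: {spam_caught}")
```

The accuracy comes out at 99% even though the number of spam messages caught is zero, which is why accuracy alone is misleading here.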

Confusion Matrix

The confusion matrix is a popular tool for evaluating classification models. It is a table that summarizes the model's performance by comparing actual and predicted values. It is common to split the data into training and testing data when building a model. During testing, you compare the actual target class with the predicted one. This leads to four possible outcomes:

  • True positive: the model predicts the positive class, and the prediction is correct.
  • False positive: the model predicts the positive class, but the prediction is wrong. This is also called a type I error.
  • True negative: the model predicts the negative class, and the prediction is correct.
  • False negative: the model predicts the negative class, but the prediction is wrong. This is also called a type II error.

The counts of these four outcomes make up the confusion matrix, which remains informative even for imbalanced data. A good classifier has high counts of true positives and true negatives and low counts of false positives and false negatives.

For example, in spam filtering, true positives are spam messages sent to the trash folder, while false positives are non-spam messages sent to the trash folder. True negatives are non-spam messages delivered to the inbox, while false negatives are spam messages delivered to the inbox.
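As an illustration, the four counts can be tallied directly from paired lists of actual and predicted labels. The sketch below assumes "spam" is the positive class; the `confusion_counts` helper and the eight sample messages are made up for this example.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Tally true/false positives and negatives for a binary task."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            counts["tp"] += 1  # spam sent to the trash
        elif p == positive:
            counts["fp"] += 1  # non-spam sent to the trash
        elif a == positive:
            counts["fn"] += 1  # spam delivered to the inbox
        else:
            counts["tn"] += 1  # non-spam delivered to the inbox
    return counts


# Eight hypothetical messages: three are spam, five are not.
actual = ["spam", "spam", "spam",
          "not spam", "not spam", "not spam", "not spam", "not spam"]
predicted = ["spam", "spam", "not spam",
             "spam", "not spam", "not spam", "not spam", "not spam"]

counts = confusion_counts(actual, predicted)
print(counts)
```

With this sample, two spam messages are caught, one slips through to the inbox, one legitimate message is wrongly trashed, and four are delivered correctly.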


Precision

Precision is a valuable metric for evaluating classifiers, especially for imbalanced data and for scenarios where false positive predictions are the bigger concern. It measures how many of the model's positive predictions turned out to be correct. You calculate it by dividing the number of true positives by the total number of predicted positives (true positives plus false positives).

Return to the spam filtering example. Precision is the ratio of spam messages sent to the trash to the total number of messages sent to the trash. In other words, it tells you how many messages really were spam out of all those flagged as spam.
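Here is a small sketch of that calculation. The `precision` helper and the message counts are assumptions chosen for illustration; returning 0.0 when nothing is flagged is one common convention, not the only one.

```python
def precision(tp, fp):
    """True positives divided by all predicted positives."""
    predicted_positives = tp + fp
    if predicted_positives == 0:
        return 0.0  # assumption: report 0.0 when nothing was flagged
    return tp / predicted_positives


# Hypothetical filter: 40 messages sent to the trash, 30 of them really spam.
trash_spam = 30      # true positives
trash_not_spam = 10  # false positives

spam_precision = precision(trash_spam, trash_not_spam)
print(f"precision: {spam_precision:.2f}")
```

In this made-up case, three out of every four messages in the trash are spam, so precision is 0.75.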


Recall

Recall, also known as sensitivity, measures how many of the actual positives the model correctly identified. You calculate it by dividing the number of true positives by the sum of true positives and false negatives. Monitoring recall is vital for imbalanced data and for scenarios where false negative predictions have a high cost.

In the case of the spam filtering example, recall refers to the number of spam messages sent to the trash divided by the total number of spam messages.
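The same idea in code, as a minimal sketch. The `recall` helper and the counts are hypothetical; the second call reuses the earlier scenario of a classifier that labels everything as non-spam.

```python
def recall(tp, fn):
    """True positives divided by all actual positives."""
    actual_positives = tp + fn
    if actual_positives == 0:
        return 0.0  # assumption: report 0.0 when there are no positives
    return tp / actual_positives


# Hypothetical filter: 50 spam messages arrive, 30 are sent to the trash.
spam_recall = recall(tp=30, fn=20)

# The classifier that labels everything as non-spam catches none of 100.
always_negative_recall = recall(tp=0, fn=100)

print(f"recall: {spam_recall:.2f}")
print(f"recall of the always-non-spam classifier: {always_negative_recall:.2f}")
```

Recall exposes what accuracy hid: the classifier with 99% accuracy has a recall of zero.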


F1

The F1-score is a metric used to evaluate binary classifiers. These are classifiers that have only two target classes, for example, positive or negative. You calculate it by taking the harmonic mean of recall and precision. It’s a popular performance metric that works well for imbalanced data because it balances precision and recall in a single number. The harmonic mean is preferred over the simple average because it is pulled toward the smaller of the two values, so a classifier cannot get a high F1-score unless both precision and recall are reasonably high.
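To see the difference between the two averages, consider this sketch. The `f1_score` helper is written here for illustration, and the precision and recall values are made up to represent a classifier that is very precise but misses most positives.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumption: define F1 as 0.0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Hypothetical classifier: every flagged message is spam (precision 1.0),
# but it only finds 10% of the spam (recall 0.1).
p, r = 1.0, 0.1

simple_average = (p + r) / 2
f1 = f1_score(p, r)

print(f"simple average: {simple_average:.2f}")
print(f"F1-score:       {f1:.2f}")
```

The simple average of 0.55 suggests a mediocre classifier, while the F1-score of about 0.18 stays close to the weak recall value.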

Tradeoffs Between Different Metrics

A classifier typically outputs a probability, and the threshold is the cutoff that turns that probability into a target class label. The choice of threshold influences both precision and recall, which usually trade off against each other: increasing one tends to decrease the other. You usually plot this relationship on a precision versus recall curve.
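The tradeoff can be observed by computing precision and recall at a few thresholds. This is a sketch on invented data: the probabilities, labels, and the `precision_recall_at` helper are all assumptions, and label 1 stands for the positive class.

```python
def precision_recall_at(probs, labels, threshold):
    """Return (precision, recall) when probs >= threshold count as positive."""
    predicted = [1 if p >= threshold else 0 for p in probs]
    pairs = list(zip(labels, predicted))
    tp = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 1)
    fp = sum(1 for y, y_hat in pairs if y == 0 and y_hat == 1)
    fn = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical predicted probabilities and true labels (1 = positive).
probs = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

for threshold in (0.25, 0.5, 0.8):
    p, r = precision_recall_at(probs, labels, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this sample, raising the threshold from 0.25 to 0.8 lifts precision from about 0.67 to 1.00 while recall falls from 1.00 to 0.50. Plotting such pairs over many thresholds gives the precision versus recall curve.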

You can also use the receiver operating characteristic (ROC) curve. It plots the true positive rate versus the false positive rate at different thresholds. It shows how well the model's predicted probabilities separate the two classes in a binary classification task.
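Each point on a ROC curve is a pair of rates computed at one threshold. The sketch below shows one way to compute such points by hand; the `roc_point` helper and the sample data are hypothetical, with label 1 as the positive class.

```python
def roc_point(probs, labels, threshold):
    """Return (false positive rate, true positive rate) at a threshold."""
    predicted = [1 if p >= threshold else 0 for p in probs]
    pairs = list(zip(labels, predicted))
    tp = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 1)
    fp = sum(1 for y, y_hat in pairs if y == 0 and y_hat == 1)
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return fpr, tpr


# Hypothetical predicted probabilities and true labels (1 = positive).
probs = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

# Sweep from a strict threshold to a lenient one.
roc_points = [roc_point(probs, labels, t) for t in (0.8, 0.5, 0.25)]
for fpr, tpr in roc_points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

As the threshold loosens, both rates rise together; a model that separates the classes well reaches a high true positive rate while the false positive rate is still low.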

Depending on your use case, you may need fewer false positives (higher precision) or fewer false negatives (higher recall). Lowering the threshold generally raises recall at the cost of precision, while raising it does the opposite.

If you’d like to learn more about classification error metrics, please check out this video.


In this chapter, you learned the four basic metrics for evaluating classification models. In the next chapter, you'll take a deeper dive into these metrics.

Original Source

This material comes from the posts "Text Classification Intuition for Software Developers" and "Classification Evaluation Metrics: Accuracy, Precision, Recall, and F1 Visually Explained".

What’s Next

That's it for the introduction to NLP! You are now ready to start learning LLMs!