Classification Evaluation Metrics

Hubert Rzeminski
4 min read · Jun 17, 2021

Introduction

In today’s post, we will tackle the many useful methods of evaluating machine learning models used for classification. This is an important topic: it will make it much easier to judge your models’ success and allow you to pick the best possible model for your problem.

What are classification problems?

First off, let’s make sure you know what type of problems these evaluation metrics are good for.

A classification problem comes under one of three sub-types: binary, multi-class, and multi-label.

Binary classification is the simplest: you need to decide whether a data point belongs to one of two classes, e.g., is this email spam or is it not spam?

Multi-class classification problems are similar to binary, but the prediction is made between more than two possibilities. For example, is the traffic light green, yellow, or red?

Lastly, you have multi-label (also known as multi-output) classification, where a single instance can have multiple labels at once. For example, which topics is this YouTube video about?

What are evaluation metrics?

Evaluation metrics are essential in applied machine learning: they allow you to compare multiple models by measuring their quality in different ways, so you can pick the best model for your particular problem.

Contents

In this article we will go through some of the most popular metrics:

  • Confusion Matrix
  • Accuracy
  • F1 Score
  • ROC Curve/ AUC

Confusion Matrix

In a nutshell, the confusion matrix tabulates your model’s predictions against the actual labels. To understand it, you should know the 4 squares of the matrix: true positive (TP), false positive (FP), true negative (TN), and false negative (FN).

For example, a TP is when your model predicted ‘True’ and the actual value was also ‘True’, whereas an FP is when your model predicted ‘True’ but the actual value was ‘False’.

Once you have built the confusion matrix from your predictions, you can use its counts to calculate some useful values (a short code sketch follows the list).

  • Recall = TP/(TP+FN): out of all actual positive cases, how many were predicted correctly. The goal is to get this as high as possible.
  • Precision = TP/(TP+FP): out of all the cases we predicted as positive, how many are actually positive.
  • Prevalence = (TP+FN)/Total: how often the positive class actually occurs in your data.
  • Misclassification rate = (FP+FN)/Total: how often the model is wrong overall.
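
If you want to see these values computed in code, here is a minimal sketch using scikit-learn (assumed to be installed); the labels below are made-up example data, not from any real model.

    # Compute the four confusion-matrix counts and the derived values above.
    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # actual labels (made-up)
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]  # model predictions (made-up)

    # For binary 0/1 labels, ravel() returns the counts in the order TN, FP, FN, TP.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    total = tn + fp + fn + tp

    recall = tp / (tp + fn)                     # out of all actual positives
    precision = tp / (tp + fp)                  # out of all predicted positives
    prevalence = (tp + fn) / total              # how often the positive class occurs
    misclassification_rate = (fp + fn) / total  # how often the model is wrong

    print(recall, precision, prevalence, misclassification_rate)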

More info can be found here:

https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/

Accuracy

This is one of the simplest metrics: it tells you what fraction of your predictions were correct and is calculated as (correct predictions) / (total predictions).

One important note is that on a class-imbalanced data set this metric is unreliable. For example, if your test set has 100 examples of which 95 are actually true and only 5 false, a model that simply predicts ‘True’ every time scores 95% accuracy while telling you nothing about the minority class.
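
To make that concrete, here is a minimal sketch (again using scikit-learn, with made-up data matching the 95/5 split above):

    from sklearn.metrics import accuracy_score

    y_true = [1] * 95 + [0] * 5  # class-imbalanced test set: 95 positives, 5 negatives
    y_pred = [1] * 100           # a useless model that predicts 'True' every time

    print(accuracy_score(y_true, y_pred))  # 0.95: high accuracy, yet the minority class is never found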

https://developers.google.com/machine-learning/crash-course/classification/accuracy

F-1 score

To calculate the F1 score you will first need the recall and precision, which are found using the confusion matrix.

It is a measure of a model’s accuracy on a dataset; more formally, it is the harmonic mean of precision and recall: F1 = 2 * (precision * recall) / (precision + recall).

There is a tradeoff between precision and recall: when one increases, the other tends to decrease, and the challenge is to find the right balance for your problem. You need to decide whether fewer false negatives (higher recall) or fewer false positives (higher precision) matters more for your problem.

The F1 score can be interpreted in different ways depending on the scenario (assuming the target is a binary label):

  • Balanced classes: in this case, the F1 score can largely be ignored; accuracy is usually enough.
  • Unbalanced classes where both classes are important: in this case, look for a model that delivers a high F1 score and a low misclassification rate.
  • Unbalanced classes where one class is more important: in this case, it is good practice to pick a model that gets a high F1 score on the important class.

Have a look at the precision-recall curve to get a better understanding of this tradeoff; it will also help you pick the best threshold for your problem. A short code sketch of these metrics follows.
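
As a rough sketch of how you might compute these in practice (scikit-learn assumed, labels made up), classification_report is handy because it shows precision, recall, and F1 per class, which matches the unbalanced cases described above:

    from sklearn.metrics import classification_report, f1_score, precision_score, recall_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

    print(precision_score(y_true, y_pred))  # TP / (TP + FP)
    print(recall_score(y_true, y_pred))     # TP / (TP + FN)
    print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall

    # Per-class precision, recall and F1 in one table
    print(classification_report(y_true, y_pred))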

ROC Curve/ AUC

This is one of the most common ways to evaluate binary classifiers. The ROC curve plots the true positive rate (recall) against the false positive rate at various classification thresholds. A perfect classifier has an area under the curve (AUC) of 1.0, while a random classifier scores 0.5.

The two main reasons why AUC is a popular measure are listed here:

  • AUC is scale-invariant which means it measures how well predictions are ranked rather than their absolute values.
  • AUC is classification-threshold-invariant which means it measures the quality of predictions no matter what the classification threshold is.

It should be noted that the two properties listed above are not always desirable, for example when you need well-calibrated probability outputs or when false positives and false negatives have very different costs.
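
Here is a minimal sketch of computing both (scikit-learn assumed; the scores are made-up probabilities for the positive class, since the ROC curve needs scores rather than hard labels):

    from sklearn.metrics import roc_auc_score, roc_curve

    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
    y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1, 0.95, 0.5]  # predicted probabilities (made-up)

    fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points that trace out the ROC curve
    print(roc_auc_score(y_true, y_score))              # area under that curve: 1.0 is perfect, 0.5 is random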

https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc

Summary

To conclude, in my opinion, all relevant evaluation metrics should be used when comparing different models. This article is a simple introduction to them and a quick way to look up which metrics are relevant to the problem you are trying to solve.

You should go deeper into each metric to gain a better understanding and be able to use them more effectively. Stay tuned for the next blog on regression evaluation metrics.


Hubert Rzeminski

Hey, I'm a third-year computer science student. I create these blogs both to help me learn and to hopefully help others who are in a similar position to mine.