Optimal Threshold for Imbalanced Classification | by Audhi Aprilliant | Jan, 2021
How to choose the optimal threshold using a ROC curve and Precision-Recall curve
Classification is a supervised learning technique for predicting a categorical outcome, which may be binary or multiclass. There is a great deal of research and many applications of classification using algorithms that range from basic to advanced, such as logistic regression, discriminant analysis, Naïve Bayes, decision trees, random forests, support vector machines, and neural networks. These algorithms are well developed and have been applied successfully in many domains. However, an imbalanced class distribution poses a serious difficulty for most classifier learning algorithms, which assume a relatively balanced distribution.
All models are wrong, but some are useful
— George E. P. Box
Imbalanced class distribution occurs when one class, often the one of greater interest (the positive or minority class), is insufficiently represented. In other words, one of the classes is much smaller than the other. This happens when studying rare phenomena in areas such as medical diagnosis, risk management, and hoax detection.
Overview of the confusion matrix
Before discussing imbalanced classification and how to handle it in depth, it helps to have a solid understanding of the confusion matrix. According to Kohavi and Provost (1998), a confusion matrix (also known as an error matrix) contains information about the actual and predicted classifications made by a classification algorithm. The performance of such algorithms is commonly evaluated using the data in the matrix. The following table shows the confusion matrix for a two-class classifier.
A two-class classifier has four possible outcomes:
- True Positive (TP): an outcome where the model correctly predicts the positive class
- False Positive (FP), also known as a Type I error: an outcome where the model incorrectly predicts the positive class
- True Negative (TN): an outcome where the model correctly predicts the negative class
- False Negative (FN), also known as a Type II error: an outcome where the model incorrectly predicts the negative class
Several evaluation metrics are available for assessing a classification model, but choosing among them is tricky when the classes are imbalanced.
- Accuracy: the ratio of correctly predicted observations to the total number of observations
- Recall or Sensitivity: the ratio of correctly predicted positive observations to all observations in the actual positive class
- Specificity: the ratio of correctly predicted negative observations to all observations in the actual negative class
- Precision: the ratio of correctly predicted positive observations to all observations predicted as positive
- F1-Score: the harmonic mean of Precision and Recall. This score therefore takes both False Positives and False Negatives into account
Note: a model that produces no False Positives has a precision of 1.0 while a model that produces no False Negatives has a recall of 1.0
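The metric definitions above can be sketched as a small Python function. This is a minimal illustration, not code from the original article: the function name `classification_metrics` and the example counts are hypothetical, and the zero-denominator guards are a design choice made here.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common metrics from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard each ratio against an empty denominator (a choice made for this sketch).
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts, used only to exercise the function.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is what makes it informative when the classes are imbalanced.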
For imbalanced classification, we must choose evaluation metrics that are valid and unbiased, meaning that their values reflect the actual condition of the data. For instance, accuracy is biased in imbalanced classification because of the uneven class distribution. The following case study illustrates this.
Suppose we are data scientists at a tech company and are asked to develop a machine learning model that predicts whether a customer will churn. We have 165 customers, of whom 105 are categorized as not churned and the remaining 60 as churned. The model produces the following outcome.
For a reasonably balanced classification like this one, accuracy can serve as an unbiased evaluation metric. It represents model performance correctly over the balanced class distribution and, in this case, is highly correlated with recall, specificity, precision, and so on. From the confusion matrix, it is easy to conclude that the model performs well.
Now consider a similar case, but with the number of customers modified to create an imbalanced classification. There are 450 customers in total, of whom 15 are categorized as churned and the remaining 435 as not churned. The model produces the following outcome.
Judging by the accuracy in the confusion matrix above, the conclusion may be misleading because of the imbalanced class distribution. What is really happening when the algorithm reaches an accuracy of 0.98? Accuracy is biased in this case and does not represent the model’s performance well: it is high, but the recall is very poor. Furthermore, the specificity and precision both equal 1.0 because the model produces no False Positives. That is one of the consequences of imbalanced classification. The F1-score, however, gives a more realistic picture of model performance because it takes both recall and precision into account.
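To make the arithmetic concrete, here is a short sketch of the imbalanced case. The class totals (450 customers, 15 churned) come from the text. The individual confusion-matrix counts are hypothetical: they were chosen only to match those totals, zero False Positives, and an accuracy of about 0.98, and they are not the article’s own table.

```python
n_total, n_churn = 450, 15
n_not_churn = n_total - n_churn  # 435

# A trivial model that always predicts "not churn" is already very accurate.
baseline_accuracy = n_not_churn / n_total
print(f"Always-negative baseline accuracy: {baseline_accuracy:.4f}")

# Hypothetical counts (assumption): no False Positives, accuracy of 0.98.
tp, fn, fp, tn = 6, 9, 0, 435

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)
precision = tp / (tp + fp)
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} recall={recall:.4f} "
      f"precision={precision:.4f} specificity={specificity:.4f} f1={f1:.4f}")
```

Under these assumed counts, accuracy, precision, and specificity all look excellent while recall shows that most churners are missed, and the F1-score lands between the two pictures.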
Note: there is no rigid rule for deciding which class is labeled positive and which is labeled negative.
In addition to the evaluation metrics mentioned above, two more metrics are important to understand:
- False Positive Rate: the ratio of incorrectly predicted positive observations (False Positives) to all observations in the actual negative class
- False Negative Rate: the ratio of incorrectly predicted negative observations (False Negatives) to all observations in the actual positive class
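Using the standard definitions FPR = FP / (FP + TN) and FNR = FN / (FN + TP), the two rates are the complements of specificity and recall. The sketch below illustrates this with hypothetical counts; the function name `error_rates` is an assumption made for the example.

```python
def error_rates(tp, fp, tn, fn):
    """Return (false positive rate, false negative rate)."""
    fpr = fp / (fp + tn)  # share of actual negatives flagged as positive
    fnr = fn / (fn + tp)  # share of actual positives that were missed
    return fpr, fnr


# Hypothetical counts, used only for illustration.
tp, fp, tn, fn = 40, 10, 45, 5
fpr, fnr = error_rates(tp, fp, tn, fn)

specificity = tn / (tn + fp)
recall = tp / (tp + fn)

print(f"FPR={fpr:.4f} (1 - specificity = {1 - specificity:.4f})")
print(f"FNR={fnr:.4f} (1 - recall = {1 - recall:.4f})")
```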
The default threshold for classification
To compare evaluation metrics and determine the probability threshold for imbalanced classification, a data simulation is used. The simulation generates 10,000 samples with two variables, dependent and independent, and a ratio between the majority and minority classes of about 99:1. This is clearly an imbalanced classification problem.
To deal with the class imbalance, threshold moving is proposed as an alternative approach. Generating synthetic observations or resampling the data carries its own risks in theory, such as creating observations that never actually appear in the data, discarding valuable information, or flooding the data with redundant information.
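A setup along these lines could be sketched with scikit-learn as follows. The text does not say how the data were generated or which model was fitted, so everything here is an assumption: `make_classification` with two features, a logistic regression, a stratified 80/20 split, and the random seeds. The point is only to show the default 0.5 threshold at work on roughly 99:1 data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split

# Assumed simulation: 10,000 samples, about 99% majority and 1% minority class.
X, y = make_classification(
    n_samples=10_000, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)

# Probability of the positive (minority) class for each test observation.
proba = model.predict_proba(X_test)[:, 1]

# Default decision rule: predict positive when the probability is at least 0.5.
y_pred_default = (proba >= 0.5).astype(int)
f1_default = f1_score(y_test, y_pred_default, zero_division=0)
recall_default = recall_score(y_test, y_pred_default, zero_division=0)

print(f"Minority share: {y.mean():.4f}")
print(f"F1 at threshold 0.5: {f1_default:.4f}")
print(f"Recall at threshold 0.5: {recall_default:.4f}")
```

Threshold moving keeps the fitted model and its probabilities unchanged and only replaces the 0.5 cut-off with a value chosen to optimize a metric.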
ROC curve for finding the optimal threshold
A receiver operating characteristic (ROC) curve is a two-dimensional plot that illustrates how well a classifier performs as the discrimination cut-off value is varied over the range of the predictor variable. The X-axis, or independent variable, is the false positive rate of the predictive test. The Y-axis, or dependent variable, is the true positive rate. A perfect result would be the point (0, 1), indicating 0% false positives and 100% true positives. The nearer a classifier lies to the upper-left corner of ROC space, the better it is. Classifiers on the diagonal line behave randomly, and those below the line should be discarded.
Note: Each point in ROC space is a (false positive rate, true positive rate) pair for one discrimination cut-off value of the predictive test.
The geometric mean, or G-mean, is the geometric mean of sensitivity (also known as recall) and specificity. This measure tries to maximize the accuracy on each class while keeping these accuracies balanced, which makes it one of the unbiased evaluation metrics for imbalanced classification.
Using the G-mean as the unbiased evaluation metric and the focus of threshold moving yields an optimal threshold of 0.0131 for this binary classification. In other words, an observation is assigned to the minority (positive) class when its predicted probability is at least 0.0131, and to the majority class otherwise.
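The G-mean search over the ROC curve can be sketched as below. The scores are a hypothetical stand-in for model probabilities, drawn from two beta distributions chosen for this example, so the resulting threshold will not match the 0.0131 reported above. `roc_curve` is the scikit-learn function; the rest is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical scores: 10,000 observations, 100 of them positive (about 99:1).
rng = np.random.default_rng(42)
n_samples, n_pos = 10_000, 100
y_true = np.zeros(n_samples, dtype=int)
y_true[:n_pos] = 1
scores = rng.beta(1, 20, size=n_samples)     # negatives: mostly close to 0
scores[:n_pos] = rng.beta(3, 6, size=n_pos)  # positives: shifted higher

# One (FPR, TPR) pair per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, scores)

# G-mean = sqrt(sensitivity * specificity) = sqrt(TPR * (1 - FPR)).
gmeans = np.sqrt(tpr * (1 - fpr))
best = int(np.argmax(gmeans))
best_threshold = thresholds[best]
print(f"Best threshold={best_threshold:.4f}, G-mean={gmeans[best]:.4f}")

# Apply the threshold: positive when the score is at least the threshold.
y_pred = (scores >= best_threshold).astype(int)
```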
Youden’s J statistic
Youden’s J index combines sensitivity and specificity into a single measure (Sensitivity + Specificity - 1). For a classifier that performs at least as well as chance, it takes a value between 0 and 1. Youden’s index is often used in conjunction with ROC analysis. For a single decision threshold, it is also equal to the vertical distance from the diagonal line up to the ROC curve.
Youden’s J index gives the same threshold as the G-mean: an optimal threshold of 0.0131 for the binary classification.
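Since Sensitivity + Specificity - 1 equals TPR - FPR, Youden’s J can be read directly off the ROC arrays. The sketch below uses hypothetical beta-distributed scores as a stand-in for model probabilities, so its threshold is illustrative only and need not coincide with the G-mean threshold on other data.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical scores: 10,000 observations, 100 of them positive (about 99:1).
rng = np.random.default_rng(42)
n_samples, n_pos = 10_000, 100
y_true = np.zeros(n_samples, dtype=int)
y_true[:n_pos] = 1
scores = rng.beta(1, 20, size=n_samples)     # negatives: mostly close to 0
scores[:n_pos] = rng.beta(3, 6, size=n_pos)  # positives: shifted higher

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Youden's J = sensitivity + specificity - 1 = TPR - FPR,
# the vertical distance between the ROC curve and the diagonal.
j_scores = tpr - fpr
best = int(np.argmax(j_scores))
best_threshold = thresholds[best]
print(f"Best threshold={best_threshold:.4f}, J={j_scores[best]:.4f}")
```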
Precision-Recall curve for finding the optimal threshold
The precision-recall curve shows the tradeoff between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision, where high precision relates to a low false-positive rate, and high recall relates to a low false-negative rate.
Several evaluation metrics can serve as the objective of this calculation, such as the G-mean and the F1-score. As long as a metric is unbiased for imbalanced classification, it can be applied.
Using the Precision-Recall curve and the F1-score yields a threshold of 0.3503 for deciding whether a given observation belongs to the majority or minority class. This differs considerably from the threshold found with the ROC curve because the two approaches optimize different criteria.
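A sketch of the F1 search along the precision-recall curve follows. As before, the scores are hypothetical beta-distributed values rather than the article’s model output, so the threshold it prints is not the 0.3503 quoted above. `precision_recall_curve` returns one more precision and recall value than thresholds, which is why the last point is excluded from the search.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

# Hypothetical scores: 10,000 observations, 100 of them positive (about 99:1).
rng = np.random.default_rng(42)
n_samples, n_pos = 10_000, 100
y_true = np.zeros(n_samples, dtype=int)
y_true[:n_pos] = 1
scores = rng.beta(1, 20, size=n_samples)     # negatives: mostly close to 0
scores[:n_pos] = rng.beta(3, 6, size=n_pos)  # positives: shifted higher

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# F1 at every point on the curve, guarding against a zero denominator.
denom = precision + recall
f1 = np.divide(2 * precision * recall, denom,
               out=np.zeros_like(denom), where=denom > 0)

# The final (precision=1, recall=0) point has no threshold, so skip it.
best = int(np.argmax(f1[:-1]))
best_threshold = thresholds[best]
print(f"Best threshold={best_threshold:.4f}, F1={f1[best]:.4f}")

# Apply the threshold: positive when the score is at least the threshold.
y_pred = (scores >= best_threshold).astype(int)
```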
Additional method — threshold tuning
Threshold tuning is a common technique for determining an optimal threshold for imbalanced classification. Here the sequence of candidate thresholds is generated by the researcher as needed, whereas the previous techniques derive the candidate thresholds from the ROC and Precision-Recall curves. The advantage is that the threshold sequence can be customized; the drawback is a higher computational cost.
np.arange(0.0, 1.0, 0.0001) generates 10,000 candidate thresholds. A loop evaluates each candidate and keeps the one that maximizes the F1-score, used here as an unbiased metric. When the loop finishes, it prints the optimal threshold of 0.3227.
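The tuning loop can be sketched as follows. The scores are hypothetical beta-distributed values standing in for model probabilities, so the result will differ from the 0.3227 reported above. The helper `f1_at` is an assumption made for this example: it computes F1 directly with NumPy, as 2·TP / (predicted positives + actual positives), to keep 10,000 evaluations fast.

```python
import numpy as np

# Hypothetical scores: 10,000 observations, 100 of them positive (about 99:1).
rng = np.random.default_rng(42)
n_samples, n_pos = 10_000, 100
y_true = np.zeros(n_samples, dtype=int)
y_true[:n_pos] = 1
scores = rng.beta(1, 20, size=n_samples)     # negatives: mostly close to 0
scores[:n_pos] = rng.beta(3, 6, size=n_pos)  # positives: shifted higher

is_positive = y_true == 1
actual_pos = int(is_positive.sum())


def f1_at(threshold):
    """F1-score when predicting positive for scores at or above the threshold."""
    y_pred = scores >= threshold
    tp = int(np.sum(y_pred & is_positive))
    predicted_pos = int(np.sum(y_pred))
    # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (predicted_pos + actual_pos)
    return 2 * tp / (predicted_pos + actual_pos)


# Candidate thresholds defined by the researcher, step 0.0001.
thresholds = np.arange(0.0, 1.0, 0.0001)
f1_values = np.array([f1_at(t) for t in thresholds])

best = int(np.argmax(f1_values))
best_threshold = thresholds[best]
print(f"Best threshold={best_threshold:.4f}, F1={f1_values[best]:.4f}")
```

A finer step gives a more precise threshold at the price of more evaluations, which is the computational cost mentioned above.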
Big thanks to Jason Brownlee, whose clear and thorough articles have motivated me to learn and work harder on statistics and machine learning implementation, especially the threshold-moving technique. Thanks!
Machine learning algorithms generally work well on balanced classification because they assume a balanced distribution of the target variable. Accuracy is no longer a relevant metric in the imbalanced case because it is biased, so the focus must shift to unbiased metrics such as the G-mean and the F1-score. Threshold moving using the ROC curve, the Precision-Recall curve, or threshold tuning can be an alternative way to handle an imbalanced distribution when resampling does not make sense for the business logic. The options remain open, however, and the implementation must take the business needs into account.
 A. Ali, S.M. Shamsuddin, A. Ralescu. Classification with class imbalance problem: a review (2013). International Journal of Soft Computing and Its Applications. 5(3): 1–30.
 A. Wong, M.S. Kamel. Classification of imbalanced data: a review (2011). International Journal of Pattern Recognition and Artificial Intelligence. 23(4): 687–719.
 J. Brownlee. A Gentle Introduction to Threshold-Moving for Imbalanced Classification (2020). https://machinelearningmastery.com/.
 N. Smits. A note on Youden’s J and its cost ratio (2010). BMC Med Res Methodol 10(89). https://doi.org/10.1186/1471-2288-10-89.
 S. Yang, G. Berdine. The receiver operating characteristic (ROC) curve (2017). The Southwest Respiratory and Critical Care Chronicles. 5(19):34–36.
 S. Visa, B. Ramsay, A. Ralescu, E.v.d. Knaap. Confusion matrix-based feature selection (2011). Proceedings of The 22nd Midwest Artificial Intelligence and Cognitive Science Conference 2011, Cincinnati, Ohio, USA. April 16–17, 2011.
 T. Fawcett. Introduction to ROC analysis (2006). Pattern Recognition Letters. 27(8):861–874.