Classification problems are a subset of machine learning problems. Given some data points, you’d like to assign a class to each one of them based on some model. One of the popular techniques for assigning these classes is Extreme Gradient Boosting (XGBoost), an ensemble machine learning algorithm.
Within this technique, each decision tree tries to correct the errors of the previous trees using gradient descent. XGBoost differs from other gradient boosting methods in the techniques it employs to reduce training time and computation.
Now, XGBoost can predict the class labels in a classification problem, i.e., given an unknown data point, XGBoost can be used to determine which class it belongs to. However, there are some problems where the probability of a data point belonging to a class matters more than the assigned label. This is especially true for imbalanced classification problems.
Consider classifying a manufactured part as proper or improper using images. Classification gives you two classes. However, you may want to take graded actions based on the probability. For example, if the probability of the part being improper is >70%, reject it right away; if it is between 50-70%, store it separately for rework; if it is between 30-50%, store it separately for manual inspection; and if it is <30%, accept it.
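The graded actions described above can be sketched as a small function. This is only an illustration: the function name `graded_action`, the action labels, and the handling of the exact boundary values (70% falls in the rework band, 50% and 30% fall in the higher of their two bands) are assumptions, since the text leaves the boundaries open.

```python
def graded_action(p_improper: float) -> str:
    """Map the predicted probability that a part is improper to an action.

    Boundary handling is an assumption: the thresholds themselves come
    from the example in the text, the inclusive/exclusive choice does not.
    """
    if not 0.0 <= p_improper <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if p_improper > 0.70:
        return "reject"
    if p_improper >= 0.50:
        return "rework"
    if p_improper >= 0.30:
        return "manual inspection"
    return "accept"


# Hypothetical probabilities for four parts
for p in (0.85, 0.60, 0.40, 0.10):
    print(p, graded_action(p))
```

A rule like this is only as trustworthy as the probabilities fed into it, which is why the quality of the probability estimate matters more here than the predicted label.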
Fortunately, XGBoost can also help with probability prediction, enabling us to solve such kinds of problems.
In this article, we will see how XGBoost can be used to predict probability, when you should accept the results right away, and when you should take them with a grain of salt.
XGBoost probability executive summary
The table below summarizes the key XGBoost probability concepts this article will explore in more detail.
XGBoost for probability measurement
Before we start, we need to understand that some machine learning algorithms are probabilistic, i.e., they calculate the probability first, and then predict the class. XGBoost is not one of them.
So how does XGBoost predict probability? Well, first of all, it doesn’t output a probability, but a metric related to the probability of the class label. Visualizing this for XGBoost, or any other ensemble technique, is difficult.
However, we can build intuition by looking at how a DecisionTree Classifier predicts probabilities. Suppose we pass a data point through the classifier and it ends up at a particular leaf. The classifier then takes the ratio of positive class labels at that leaf to all the training labels that ended up at that leaf and outputs that number as the probability (see the two examples below). Muhammad Iqbal Tawakal explained his experimental validation of the concept in this article.
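One way to check this intuition is with scikit-learn's `DecisionTreeClassifier`. The sketch below assumes a synthetic dataset from `make_classification` and a shallow tree (both arbitrary choices for illustration) and compares the leaf ratio computed by hand with what `predict_proba` returns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; any binary dataset would do)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Index of the leaf reached by every training point
leaf_ids = tree.apply(X)

# Take one data point and find the leaf it ends at
point = X[:1]
leaf = tree.apply(point)[0]

# Ratio of positive labels among the training points in that leaf
manual = y[leaf_ids == leaf].mean()

# Probability of the positive class reported by the classifier
predicted = tree.predict_proba(point)[0, 1]

print("leaf ratio:", manual)
print("predict_proba:", predicted)
```

With default settings (no class weights), the two numbers should agree, which is the behavior described above.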
As you can see, while this is not an exact measure of probability, it helps. The problem is that it tilts toward the majority class and this causes an issue when there is an imbalanced dataset. That’s where the calibration of probability comes in.
How is the calibration performed? It is essentially curve fitting. You take the predicted probabilities and the corresponding true probabilities, and fit a curve between the two that converts your predicted probabilities into a more realistic, calibrated version.
How do you get the true probabilities? You group observations having the same predicted probability, then calculate the true probability as the fraction of that group that actually belongs to the positive class.
For example, if the predicted probability of 20 training observations is 30%, then you group these 20. If 6 of them (i.e. 30%) belong to the positive class, the probability is as good as calibrated. If not, we treat the number we get (say 8/20 or 40%) as the true probability. In practice, you would do this for bins of probability (say 15%-20%, 20%-25%, and so on), and then use the mean predicted and true probabilities of the bin for getting the curve.
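The binning procedure can be sketched in plain Python. The helper name `calibration_bins` and the equal-width binning are assumptions made for illustration; the data reproduces the worked example of 20 observations predicted at 30%, of which 8 are positive.

```python
def calibration_bins(pred_probs, labels, n_bins=10):
    """Return (mean predicted, observed positive fraction, count)
    for every non-empty equal-width probability bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(pred_probs, labels):
        # A prediction of exactly 1.0 is placed in the last bin
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    curve = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            frac_pos = sum(y for _, y in members) / len(members)
            curve.append((mean_pred, frac_pos, len(members)))
    return curve


# 20 observations predicted at 30%, 8 of them in the positive class
preds = [0.30] * 20
labels = [1] * 8 + [0] * 12
curve = calibration_bins(preds, labels)

for mean_pred, frac_pos, count in curve:
    print(f"predicted {mean_pred:.2f} -> observed {frac_pos:.2f} (n={count})")
```

Each tuple is one point on the calibration curve. scikit-learn offers `sklearn.calibration.calibration_curve` for the same purpose, so a hand-written helper like this is mainly useful for seeing what happens inside.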
Once the predicted and true probabilities are available, two methods can be used for fitting the calibration curve:
- isotonic regression (a weighted least squares regression method)
- Platt scaling (logistic regression method)
Platt scaling is preferred when the distortion is sigmoid-shaped, whereas isotonic regression works for any type of monotonic distortion. Luckily for us, the sklearn Python package already has both of these methods implemented, and we don’t have to worry about much of the mathematics behind them.
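A minimal sketch of both methods with sklearn follows. The data is synthetic and hypothetical: raw scores are drawn uniformly and the true probability is assumed to be a sigmoid function of the raw score, which is the case where Platt scaling is expected to do well. Platt scaling is approximated here with a plain `LogisticRegression` on the raw score (with its default L2 regularization), which is a simplification of Platt's original procedure. For brevity the calibrators are fit and applied on the same data; in practice you would apply them to held-out data.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical uncalibrated scores and an assumed sigmoid-shaped distortion
raw = rng.uniform(0, 1, size=2000)
true_p = 1 / (1 + np.exp(-8 * (raw - 0.5)))
y = (rng.uniform(0, 1, size=2000) < true_p).astype(int)

# Isotonic regression: fits any monotonic, piecewise-constant mapping
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw, y)
iso_cal = iso.predict(raw)

# Platt-style scaling: logistic regression with the raw score as sole feature
platt = LogisticRegression()
platt.fit(raw.reshape(-1, 1), y)
platt_cal = platt.predict_proba(raw.reshape(-1, 1))[:, 1]

# Mean absolute gap to the (known, because simulated) true probability
iso_err = np.abs(iso_cal - true_p).mean()
platt_err = np.abs(platt_cal - true_p).mean()
print("isotonic error:", iso_err)
print("Platt error:", platt_err)
```

Isotonic regression has more freedom and therefore typically needs more calibration data to avoid overfitting, while the logistic fit has only two parameters.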
How to use XGBoost predictions on a dataset
Let’s go through an example. We will consider the banknote authentication dataset. It consists of 1372 data points (well-balanced) and uses four attributes (variance, skewness, kurtosis, and entropy of the image of the notes) to predict if the note is forged or not. Here’s what we will do:
- Split the data into the train-validation-test sets
- Train our XGBoost model on the training data
- Evaluate probabilities on the validation dataset
- Visualize the histogram of the probability distribution
- Calibrate probability on the validation set using Isotonic Regression (by using the actual y values in the validation set)
- Evaluate calibrated probabilities on the test set
- Visualize the histogram of the calibrated probability distribution
The code is below:
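A minimal sketch of these steps, under several assumptions: the banknote data is replaced by a synthetic stand-in of the same size from `make_classification` (load the real dataset in its place), the 60/20/20 split ratio is an arbitrary choice, and histograms are computed with `np.histogram` rather than plotted. If the `xgboost` package is not installed, the sketch falls back to sklearn's `GradientBoostingClassifier` purely so that it still runs; that fallback is not XGBoost.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import train_test_split

try:
    from xgboost import XGBClassifier as Booster
except ImportError:
    # Stand-in only, so the sketch runs without xgboost installed
    from sklearn.ensemble import GradientBoostingClassifier as Booster

# Synthetic stand-in for the banknote data: 1372 rows, 4 features, balanced
X, y = make_classification(n_samples=1372, n_features=4, n_informative=4,
                           n_redundant=0, random_state=42)

# 1. Train / validation / test split (assumed 60 / 20 / 20)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)

# 2. Train the model on the training data
model = Booster(random_state=42)
model.fit(X_train, y_train)

# 3. Uncalibrated probabilities on the validation set
val_probs = model.predict_proba(X_val)[:, 1]

# 4. Histogram of the uncalibrated probabilities (counts per 0.1-wide bin)
val_hist, edges = np.histogram(val_probs, bins=10, range=(0.0, 1.0))

# 5. Fit isotonic regression on the validation set using the actual labels
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(val_probs, y_val)

# 6. Calibrated probabilities on the untouched test set
test_probs = model.predict_proba(X_test)[:, 1]
test_calibrated = calibrator.predict(test_probs)

# 7. Histogram of the calibrated probabilities
cal_hist, _ = np.histogram(test_calibrated, bins=10, range=(0.0, 1.0))

print("uncalibrated (validation):", val_hist)
print("calibrated (test):        ", cal_hist)
```

The counts can be passed to a plotting library such as matplotlib to draw the histograms. Because the data here is synthetic, the shapes will not match those obtained on the banknote dataset.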
For your reference, the histograms of the uncalibrated and calibrated probabilities are given below:
As you can see, the uncalibrated probabilities sit mostly toward the extremes, meaning the model has fit very well and is fairly confident about whether a note is forged or not. However, for some data points, the uncalibrated probability indicates that the model is not very sure.
As mentioned in this article, boosted trees tend to push the probabilities away from the extremes. The calibration tends to correct this, and we see that the calibrated probabilities make the model appear much more sure. This result or trend will not necessarily be replicated in other problems. It depends on dataset properties like size and whether it’s balanced or imbalanced, and on model effectiveness.
You may wonder why we performed a train-validation-test split instead of the conventional train-test split on the data. The reason is that, just as you shouldn’t evaluate a model on the same data on which it was trained, you shouldn’t evaluate the calibrated probability on the same set used for calibration. You could just as well call it a train-calibrate-test split.
In the above example, if the probability estimate was not required on the test set, then a simple train–calibrate split would have been enough.
While XGBoost has been the kingmaker in several Kaggle competitions, it is not a magic bullet that works everywhere. You need to examine the problem at hand and often evaluate several algorithms to see what works best.
XGBoost generally works well with structured data, especially when the number of features is less than the number of observations. However, it can overfit with noisy data (bagging techniques like RandomForest perform better here). Similarly, it is not ideal for problems involving image recognition or natural language processing (neural networks are preferred in these cases).
Moreover, the probability predictions of XGBoost are not accurate by design, and calibration can fix them only to the extent that your training data allows.
For more on XGBoost’s use cases and limitations, check out this thread on Kaggle that includes the observations and experiences of people in the data science community.
XGBoost best practices
If you decide XGBoost is the right solution for your use case, here are four best practices to help you get the most out of it.
- Evaluate if you need probability measurement in the first place.
- Perform calibration of the probabilities output by XGBoost.
- Bad probabilities can result from a lack of calibration, but more often they are the result of a bad model. Consider model optimization (feature selection, dimensionality reduction, and parameter tuning) first, before jumping into calibration.
- If possible, evaluate the nature of the distortion between the predicted and actual probabilities, and use Logistic Regression if the distortion is sigmoid-shaped.
XGBoost is a robust machine-learning algorithm that optimizes computational time and resource requirements. While the XGBoost Classifier can be used to predict class labels, it can also provide a measure of probability. However, since the algorithm is not probabilistic by design, it is prone to errors and requires calibration. Isotonic regression and logistic regression are two calibration methods that are generally used. Logistic regression is preferred when the distortion between the predicted and actual probabilities is sigmoid-shaped.
We saw how to calibrate the probabilities predicted by XGBoost in Python. While XGBoost works for a wide variety of use cases, you may want to evaluate whether it suits your problem. If you don’t get satisfactory probability output, then fine-tuning the model and calibration are the next steps.