How CatBoost Algorithm Works In Machine Learning
CatBoost is the first open-source machine learning library to come out of Russia. It was released in 2017 by machine learning researchers and engineers at Yandex (a technology company).
The intention was to build a general-purpose library that could serve a wide range of tasks.
One of the many unique features the CatBoost algorithm offers is its ability to work with diverse data types, which helps solve a wide range of problems faced by businesses.
Not only that, CatBoost also offers accuracy on par with the other algorithms in the tree-boosting family.
Before we get started, let’s have a look at the topics you are going to learn in this article.
What is CatBoost Algorithm?
The term CatBoost is an acronym that stands for “Category” and “Boosting.” Does the “Category” in CatBoost mean it only works with categorical features?
The answer is, “No.”
According to the CatBoost documentation, CatBoost supports numerical, categorical, and text features but has a good handling technique for categorical data.
The CatBoost algorithm has quite a number of parameters for tuning features at the processing stage.
It is built on gradient boosting, a technique that produces a prediction model from an ensemble of weak prediction models, typically decision trees.
Gradient boosting is a robust machine learning technique that performs well across many different types of business problems.
It can also return outstanding results with relatively little data, unlike many machine learning algorithms that only perform well after learning from extensive data.
We suggest reading the article How the Gradient Boosting Algorithm Works if you want to learn more about gradient boosting's functionality.
Features of CatBoost
Here we will look at the various features the CatBoost algorithm offers and why it stands out.
CatBoost can improve the performance of the model while reducing overfitting and the time spent on tuning.
CatBoost has several parameters to tune. Still, it reduces the need for extensive hyper-parameter tuning because the default parameters produce great results.
The CatBoost algorithm is a high-performance, greedy, novel gradient boosting implementation.
Hence, CatBoost (when implemented well) either leads or ties in competitions on standard benchmarks.
Categorical Features Support
Native support for categorical features is one of the significant reasons many practitioners choose CatBoost over other boosting algorithms such as LightGBM and XGBoost.
With other machine learning algorithms, after preprocessing and cleaning your data, you have to convert it into numerical features so that the machine can understand it and make predictions.
This is similar to text models, where text data is converted into numerical data using word embedding techniques.
This process of encoding or conversion is time-consuming. CatBoost supports working with non-numeric factors, which saves time and improves your training results.
CatBoost offers easy-to-use interfaces. The CatBoost algorithm can be used in Python with scikit-learn, R, and command-line interfaces.
Fast and scalable GPU version: the researchers and machine learning engineers designed CatBoost at Yandex to work on data sets as large as tens of thousands of objects without lagging.
Training your model on GPU gives a significant speedup compared to training on CPU.
Better still, the larger the dataset, the greater the speedup. CatBoost efficiently supports multi-card configurations, so for large datasets, use a multi-card configuration.
Faster Training & Predictions
A typical server tops out at 8 GPUs, and some datasets are more extensive than a single machine can handle, so CatBoost also supports distributed training across GPUs.
This feature enables CatBoost to learn faster and make predictions 13-16 times faster than other algorithms.
Supporting Community of Users
The non-availability of a team to contact when you encounter issues with a product you consume can be very annoying. This is not the case for CatBoost.
CatBoost has a growing community where the developers look out for feedback and contributions.
There is a Slack community, a Telegram channel (with English and Russian versions), and Stack Overflow support. If you ever discover a bug, there is a page via GitHub for bug reports.
Is tuning required in CatBoost?
The answer is not straightforward; it depends on the type and features of the dataset. In many cases, the default parameter settings in CatBoost do a good job.
CatBoost produces good results without extensive hyper-parameter tuning. However, some important parameters can be tuned in CatBoost to get a better result.
These features are easy to tune and are well-explained in the CatBoost documentation. Here are some of the parameters that can be optimized for a better result;
- cat_features,
- learning_rate & n_estimators,
CatBoost vs. LightGBM vs. XGBoost Comparison
These three popular machine learning algorithms are all based on gradient boosting techniques; hence they are greedy and very powerful.
Several Kagglers have won a Kaggle competition using one of these accuracy-based algorithms.
Before we dive into the differences among these algorithms, note that the CatBoost algorithm does not require converting the dataset to any specific (namely, numerical) format, unlike XGBoost and LightGBM.
The oldest of these three algorithms is the XGBoost algorithm. It was introduced sometime in March 2014 by Tianqi Chen, and the model became famous in 2016.
Microsoft introduced LightGBM in January 2017. Then Yandex open-sourced the CatBoost algorithm later, in April 2017.
The algorithms differ from one another in implementing the boosted trees algorithm and their technical compatibilities and limitations.
XGBoost was the first to improve on GBM's training time, followed by LightGBM and CatBoost, each with its own techniques, mostly related to the splitting mechanism.
Now we will compare the three models across several characteristics.
The split function is an essential technique, and these three machine learning algorithms split features in different ways.
One good way to decide how to split a feature during processing is to inspect the characteristics of the column.
LightGBM uses histogram-based split finding and utilizes gradient-based one-side sampling (GOSS), which reduces complexity through gradients.
Instances with small gradients are already well trained (small training errors), while instances with large gradients are undertrained.
In LightGBM, for GOSS to perform well while reducing complexity, the focus is on instances with large gradients, while a random sampling technique is applied to instances with small gradients.
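The GOSS idea above can be sketched in a few lines of NumPy: keep the top-a fraction of instances by absolute gradient, randomly sample a b fraction of the rest, and up-weight the sampled small-gradient instances to keep the loss estimate unbiased. The fractions `a` and `b` below are illustrative, not LightGBM's defaults:

```python
# Simplified sketch of gradient-based one-side sampling (GOSS).
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))   # largest |gradient| first
    top_k = int(a * n)
    top = order[:top_k]                      # always keep large gradients
    rest = order[top_k:]
    sampled = rng.choice(rest, size=int(b * n), replace=False)
    weights = np.ones(n)
    weights[sampled] = (1 - a) / b           # compensate for sub-sampling
    keep = np.concatenate([top, sampled])
    return keep, weights[keep]

grads = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(grads)
print(len(idx))  # 0.2*1000 + 0.1*1000 = 300 instances retained
```

The re-weighting factor `(1 - a) / b` is what lets the down-sampled small-gradient instances still contribute an unbiased estimate of the information gain.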
The CatBoost algorithm introduced a unique system called Minimal Variance Sampling (MVS), which is a weighted sampling version of the widely used approach to regularization of boosting models, Stochastic Gradient Boosting.
Also, Minimal Variance Sampling (MVS) is the new default option for subsampling in CatBoost.
With this technique, the number of examples needed for each iteration of boosting decreases, and the quality of the model improves significantly compared to the other gradient boosting models.
The features for each boosting tree are sampled in a way that maximizes the accuracy of split scoring.
In contrast to the two algorithms discussed above, XGBoost does not utilize any weighted sampling techniques.
This is why its splitting process is slower compared to GOSS in LightGBM and MVS in CatBoost.
A significant difference in the implementations of gradient boosting algorithms such as XGBoost, LightGBM, and CatBoost is the method of tree construction, also called leaf growth.
The CatBoost algorithm grows a balanced tree. In the tree structure, the feature-split pair is performed to choose a leaf.
The split with the smallest penalty is selected for all the level’s nodes according to the penalty function. This method is repeated level by level until the leaves match the depth of the tree.
By default, CatBoost uses symmetric trees, which are built roughly ten times faster and often give better quality than non-symmetric trees.
However, in some cases, other tree growing strategies (Lossguide, Depthwise) can provide better results than growing symmetric trees.
The parameters that change the tree growing policy include grow_policy, min_data_in_leaf, and max_leaves.
LightGBM grows the tree leaf-wise (best-first). Leaf-wise growth finds the leaf that most reduces the loss and splits only that leaf, without touching the rest, which allows an imbalanced tree structure.
The leaf-wise growth strategy is an excellent way to achieve a lower loss because it is not constrained to grow level by level, but it often results in overfitting when the dataset is small.
However, this strategy's greediness in LightGBM can be regularized using parameters such as num_leaves, max_depth, and min_data_in_leaf.
XGBoost grows trees level-wise (depth-wise) by default, although it also offers a leaf-wise policy similar to LightGBM's via grow_policy="lossguide" when the histogram tree method is used. This flexibility is one reason why XGBoost performs well on large datasets.
In XGBoost, the parameters that constrain the split process to reduce overfitting include max_depth and gamma (min_split_loss).
Missing Values Handling
CatBoost supports three modes for processing missing values:
- “Forbidden,”
- “Min,” and
- “Max.”
For “Forbidden,” CatBoost treats missing values as not supported; their presence is interpreted as an error.
For “Min,” missing values are processed as the minimum value of the feature.
With this method, the split that separates missing values from all other values is considered when selecting splits.
“Max” works just the same as “Min,” but the difference is the change from minimum to maximum values.
The method of handling missing values for LightGBM and XGBoost is similar. The missing values will be allocated to the side that reduces the loss in each split.
Categorical Features Handling
By default, CatBoost handles categorical features with a small number of distinct values by one-hot encoding them internally in most modes.
The number of categories for one-hot encoding can be controlled by the one_hot_max_size parameter in Python and R.
On the other hand, one-hot encoding applied externally is known to make the model slower.
For this reason, the engineers at Yandex state in the documentation that one-hot encoding should not be done during preprocessing, because it affects the model's speed.
LightGBM uses integer-encoding for handling the categorical features. This method has been found to perform better than one-hot encoding.
The categorical features must be encoded to non-negative integers (an integer that is either positive or zero).
The parameter that refers to handling categorical features in LightGBM is categorical_feature.
XGBoost was not engineered to handle categorical features. The algorithm supports only numerical features.
This, in turn, means that the encoding process would be done manually by the user.
Some manual methods of encoding include label encoding, mean encoding, and one-hot.
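For example, one-hot encoding for an XGBoost-style model can be done in one line with pandas (pandas only; no xgboost needed for this step; the column names are illustrative):

```python
# Manually one-hot encode a categorical column into numeric columns.
import pandas as pd

df = pd.DataFrame({"port": ["C", "Q", "S", "C"],
                   "fare": [71.3, 7.8, 8.1, 53.1]})

# Each category becomes its own indicator column.
encoded = pd.get_dummies(df, columns=["port"])
print(list(encoded.columns))
```

The resulting all-numeric frame can then be passed to XGBoost directly.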
When and When Not to Use CatBoost
We have discussed all of the strengths of the CatBoost algorithm without addressing when it is appropriate to use.
In this section, we will look at when CatBoost is a good fit for our data, and when it is not.
When To Use CatBoost
Working on a small data set
Unlike some other machine learning algorithms, CatBoost performs well with a small data set.
However, it is advisable to be mindful of overfitting; a little tweak to the parameters might be needed here.
When you are working on a categorical dataset
This is one of the significant strengths of the CatBoost algorithm. Suppose your data set has categorical features, and converting them to numerical format seems like quite a lot of work.
In that case, you can capitalize on the strength of CatBoost to make building your model easy.
When you need short training and prediction times
CatBoost is incredibly fast compared with many other machine learning algorithms. The splitting, tree structure, and training process are optimized to be faster on both GPU and CPU.
Training on GPU is up to 40 times faster than on CPU, two times faster than LightGBM, and 20 times faster than XGBoost.
When To Not Use CatBoost
There are not many disadvantages to using CatBoost, whatever the dataset.
So far, the main reason some practitioners avoid CatBoost is the slight difficulty of tuning its parameters to optimize the model for categorical features.
Practical Implementation of CatBoost Algorithm in Python
CatBoost Algorithm Overview in Python 3.x
- Import the libraries/modules needed
- Import data
- Data cleaning and preprocessing
- Train-test split
- CatBoost training and prediction
- Model Evaluation
Before we build the CatBoost model, let's have a look at the features of the dataset (the classic Titanic dataset):
- Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- Survival (0 = No; 1 = Yes)
- Number of Siblings/Spouses Aboard
- Number of Parents/Children Aboard
- Passenger Fare (British pound)
- Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
Before we implement CatBoost, we need to install the catboost library.
- Command: pip install catboost
You can get the complete code from our GitHub account. For your reference, the complete IPython notebook is also included; scroll through it for the full implementation.
In this article, we have discussed and shed light on the CatBoost algorithm.
The CatBoost algorithm is excellent and increasingly popular; many practitioners use it for the features it offers, especially its handling of categorical features.
This article covered an introduction to the CatBoost algorithm, its unique features, and the differences between CatBoost, LightGBM, and XGBoost.
We also covered whether hyper-parameter tuning is required for CatBoost, along with an introduction to using CatBoost in Python.
Recommended Machine Learning Courses
Machine Learning A to Z Course
Python Data Science Specialization Course
Complete Supervised Learning Algorithms