Hyperparameter Optimization in Gradient Boosting Packages with Bayesian Optimization | by Osman Mamun | Dec, 2020
Gradient Boosting is an ensemble-based machine learning algorithm, first proposed by Jerome H. Friedman in the paper Greedy Function Approximation: A Gradient Boosting Machine. It differs from other ensemble methods in how the individual decision trees are built and combined into the final model. In a Random Forest, for example, many decision trees are built in parallel, each on a subsample of the data. Because of this construction, each tree is tailored to its own subsample and doesn't generalize well on its own, so it has high variance. When many of these high-variance trees are averaged into a consensus, however, the result performs surprisingly well on a wide range of tasks. A Gradient Boosting model, in contrast, is built iteratively by combining the predictions of many weak learners: at each step a new tree is fit to the errors of the current ensemble, so more emphasis is placed on the data points that were predicted poorly in the previous steps. Each individual weak learner has high bias, but when these learners are added sequentially, with a predefined stopping criterion, the result is a very powerful algorithm. The effectiveness of gradient boosting is evident from the track record of its libraries in machine learning competitions and in scientific research. There are several implementations of the gradient boosting algorithm, most notably 1. XGBoost, 2. CatBoost, and 3. LightGBM.
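To make that residual-fitting loop concrete, here is a minimal sketch for squared-error regression. It is illustrative only and omits the regularization, second-order gradients, and engineering found in the libraries discussed below:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_rounds=100, learning_rate=0.1, max_depth=2):
    """Fit shallow trees sequentially, each one to the residuals of the current ensemble."""
    pred = np.full(len(y), float(np.mean(y)))  # start from the mean prediction
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                    # negative gradient of the squared-error loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += learning_rate * tree.predict(X)  # shrink each weak learner's contribution
        trees.append(tree)
    return trees, pred
```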
XGBoost became widely known for its success in several Kaggle competitions. It not only performs well on problems involving structured data, it is also far more flexible and faster than the originally proposed Gradient Boosting method. The trees are still built one after another, but XGBoost parallelizes the construction of each individual tree (for example, the split finding across features), which significantly speeds up training on parallel computing architectures. In this article, we will use the sklearn API of the XGBoost implementation. For installation instructions and a detailed discussion of the underlying theory, see the link XGBoost.
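As a quick illustration of that API, here is a minimal sketch; the hyperparameter values are placeholders, not the tuned ones we will search for later:

```python
from xgboost import XGBClassifier

# Placeholder hyperparameters; these are exactly what we will tune with Bayesian optimization.
model = XGBClassifier(
    n_estimators=200,      # number of boosting rounds
    max_depth=4,           # depth of each tree
    learning_rate=0.1,     # shrinkage applied to each tree's contribution
    subsample=0.8,         # fraction of rows sampled per tree
    colsample_bytree=0.8,  # fraction of columns sampled per tree
)
# model.fit(X, y)  # X, y become available once the dataset is loaded below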
CatBoost is another implementation of the Gradient Boosting algorithm that is also fast and scalable, handles both categorical and numerical features, and gives good predictions with its default hyperparameters. It was developed by Yandex researchers and is used in search, recommendation systems, and even self-driving cars. It is open-source and ships with a flexible sklearn-style API. For installation and documentation, follow this link CatBoost.
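A minimal sketch of that sklearn-style API, assuming a classification task; the categorical column indices below are hypothetical and would need to match the actual dataset:

```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=500,       # number of boosted trees
    depth=6,              # tree depth
    learning_rate=0.05,   # shrinkage
    verbose=False,        # silence per-iteration logging
)
# model.fit(X, y, cat_features=[1, 3, 5])  # raw categorical columns can be passed directly
```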
LightGBM is another implementation of Gradient Boosting, this one from Microsoft. It is generally faster and has lower memory usage because of the way it builds trees: it grows them leaf-wise rather than level-wise and uses histogram-based feature binning, which differs slightly from XGBoost's approach. For a detailed discussion of the differences and similarities between these three implementations, I found this article to be very helpful. To install LightGBM and find its documentation, follow this link LightGBM.
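A minimal sketch along the same lines (again, placeholder values); num_leaves is the key knob here because of the leaf-wise growth strategy:

```python
from lightgbm import LGBMClassifier

model = LGBMClassifier(
    n_estimators=300,      # number of boosting rounds
    num_leaves=31,         # maximum leaves per tree (leaf-wise growth)
    learning_rate=0.05,    # shrinkage
    subsample=0.8,         # row subsampling
    colsample_bytree=0.8,  # column subsampling
)
# model.fit(X, y)
```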
Here, we will use Bayesian optimization, rather than grid search or random search, to find good hyperparameters: it is well suited to the multidimensional, expensive-to-evaluate hyperparameter spaces we encounter in all of these Gradient Boosting implementations. Bayesian optimization is a probabilistic optimization method in which a utility (acquisition) function is used to choose the next point to evaluate. The choice of utility function depends on the problem at hand, and it uses both the surrogate model's prediction and the uncertainty of that prediction to propose the next point. A Gaussian Process is usually used as the surrogate probabilistic model. For a detailed discussion of Bayesian optimization, interested readers can check out these links: bayesopt and the beauty of bayesian optimization explained in simple terms.
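A minimal sketch of this loop using the bayes_opt package, which is one common choice (the tuning code later in this article may use a different library); the toy objective below is a stand-in for a cross-validated model score:

```python
from bayes_opt import BayesianOptimization

def objective(max_depth, learning_rate):
    """Hypothetical objective: in practice this would return the mean CV score
    of a booster built with these hyperparameters. Here it is a toy function."""
    return -((max_depth - 5.0) ** 2) - (learning_rate - 0.1) ** 2

# Bounds of the search space for each hyperparameter
pbounds = {"max_depth": (2, 10), "learning_rate": (0.01, 0.3)}

optimizer = BayesianOptimization(f=objective, pbounds=pbounds, random_state=42)
optimizer.maximize(init_points=5, n_iter=25)  # 5 random probes, then 25 guided evaluations
print(optimizer.max)                          # best hyperparameters and target found
```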
Having set up the preamble for our work, it's time to get our hands dirty: we will load a real dataset and write the code that applies Bayesian optimization to tune the hyperparameters of each Gradient Boosting implementation. First, load the adult dataset from the Penn Machine Learning Benchmarks (PMLB).
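Assuming the pmlb helper package is installed (pip install pmlb), loading the data can look like this:

```python
from pmlb import fetch_data

# Fetch the adult census dataset (binary income classification) from PMLB
X, y = fetch_data("adult", return_X_y=True)
print(X.shape, y.shape)
```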