Python Snippet of Bayesian Optimization for Xgboost
Introduction
Bayesian optimization is usually a faster alternative to GridSearch when we’re trying to find the best combination of hyperparameters for an algorithm. In Python, there’s a handy package that lets us apply it: bayes_opt.
This post is a code snippet to get started with the package’s functions alongside xgboost to solve a regression problem.
The Code
Preparing the environment.
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn import datasets
import bayes_opt as bopt
boston = datasets.load_boston()
dm_input = xgb.DMatrix(boston['data'], label=boston.target)
To run the Bayesian optimization, we need to create a custom function. This function has to return the target metric for a given trial combination of hyperparameters.
def objective(max_depth, eta, max_delta_step, colsample_bytree, subsample):
    # Integer hyperparameters arrive as floats and must be cast back
    cur_params = {'objective': 'reg:linear',
                  'max_depth': int(max_depth),
                  'eta': eta,
                  'max_delta_step': int(max_delta_step),
                  'colsample_bytree': colsample_bytree,
                  'subsample': subsample}
    cv_results = xgb.cv(params=cur_params,
                        dtrain=dm_input,
                        nfold=3,
                        seed=3,
                        num_boost_round=50000,
                        early_stopping_rounds=50,
                        metrics='rmse')
    # The optimizer maximizes, so return the negated RMSE
    return -1 * cv_results['test-rmse-mean'].min()
There’s an important detail for regression problems: the Bayesian optimization seeks to maximize the output value, while we’re trying to minimize the target RMSE metric. We have to put a minus sign on the function’s return value so that searching for the maximum properly finds the minimum RMSE.
Another detail is that the arguments of the custom function are restricted to hyperparameters only. That creates a problem, since we also have to pass the input dataset to the xgboost training function. I see two ways to handle it: we could either rely on the dataset as a global variable, as the function above does with dm_input, or declare the custom function inside a class that keeps the dataset as an attribute. The example below follows the class option.
class custom_bayesopt:
    """Holds the dataset so the objective only receives hyperparameters."""

    def __init__(self, dm_input):
        self.dm_input = dm_input

    def objective(self, max_depth, eta, max_delta_step, colsample_bytree, subsample):
        cur_params = {'objective': 'reg:linear',
                      'max_depth': int(max_depth),
                      'eta': eta,
                      'max_delta_step': int(max_delta_step),
                      'colsample_bytree': colsample_bytree,
                      'subsample': subsample}
        cv_results = xgb.cv(params=cur_params,
                            dtrain=self.dm_input,
                            nfold=3,
                            seed=3,
                            num_boost_round=50000,
                            early_stopping_rounds=50,
                            metrics='rmse')
        return -1 * cv_results['test-rmse-mean'].min()
Now we create the Bayesian optimization process, passing the custom function and the hyperparameters’ boundaries.
bopt_process = bopt.BayesianOptimization(custom_bayesopt(dm_input).objective,
                                         {'max_depth': (2, 15),
                                          'eta': (0.01, 0.3),
                                          'max_delta_step': (0, 10),
                                          'colsample_bytree': (0, 1),
                                          'subsample': (0, 1)},
                                         random_state=np.random.RandomState(1))
It’s possible to register each step’s outcome into a log file. This is especially useful when you want to create a new Bayesian optimization instance later and reuse the saved information (a sketch of that follows the optimization run below). Check the package documentation for more details.
logger = bopt.observer.JSONLogger(path="bopt.log.json")
bopt_process.subscribe(bopt.event.Events.OPTIMIZATION_STEP, logger)
# 12 random exploratory points followed by 10 Bayesian-guided iterations
bopt_process.maximize(n_iter=10, init_points=12)
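As an aside on reusing the log: a later run can replay the saved evaluations into a fresh optimizer instead of starting from scratch. Below is a minimal sketch; it assumes bopt.log.json exists from the run above and that load_logs is exposed in bayes_opt.util, which is where recent versions of the package keep it.
from bayes_opt.util import load_logs

new_bopt_process = bopt.BayesianOptimization(custom_bayesopt(dm_input).objective,
                                             {'max_depth': (2, 15),
                                              'eta': (0.01, 0.3),
                                              'max_delta_step': (0, 10),
                                              'colsample_bytree': (0, 1),
                                              'subsample': (0, 1)},
                                             random_state=np.random.RandomState(1))
# Feed the previously logged evaluations back into the new instance
load_logs(new_bopt_process, logs=["bopt.log.json"])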
Once the iterating process finishes, the winning hyperparameters are in the max attribute. Note that target is the negated RMSE and that all parameters come back as raw floats:
bopt_process.max
{'target': -2.9679186666666664,
 'params': {'colsample_bytree': 0.935372077775139,
            'eta': 0.013944731934196423,
            'max_delta_step': 0.07555792154893812,
            'max_depth': 3.2975847928232245,
            'subsample': 0.7161419730372364}}
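To actually train a final booster with the winning combination, the values can be read from max['params'], remembering to cast max_depth and max_delta_step back to integers since the optimizer reports raw floats. A minimal sketch follows; the num_boost_round of 500 is an arbitrary illustrative choice, not something tuned above.
best_params = bopt_process.max['params']
final_params = {'objective': 'reg:linear',
                'max_depth': int(best_params['max_depth']),
                'eta': best_params['eta'],
                'max_delta_step': int(best_params['max_delta_step']),
                'colsample_bytree': best_params['colsample_bytree'],
                'subsample': best_params['subsample']}
# Train on the full dataset with the tuned hyperparameters;
# 500 rounds is only a placeholder value for illustration
final_model = xgb.train(params=final_params, dtrain=dm_input, num_boost_round=500)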