### Python Snippet of Bayesian Optimization for Xgboost

Wed, Oct 16, 2019 2-minute read

## Introduction

Bayesian optimization is usually a faster alternative to GridSearch when we're trying to find the best combination of hyperparameters for an algorithm. In Python, there's a handy package that lets us apply it: bayes_opt.

This post is a code snippet to start using the package functions alongside xgboost to solve a regression problem.

## The Code

Preparing the environment.

import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn import datasets
import bayes_opt as bopt

boston = datasets.load_boston()
dm_input = xgb.DMatrix(boston['data'], label=boston.target)


To run the Bayesian optimization, we need to create a custom function. The function must return the target metric for a given combination of hyperparameters.

def objective(max_depth, eta, max_delta_step, colsample_bytree, subsample):
    # bayes_opt passes every hyperparameter as a float,
    # so integer parameters must be cast explicitly
    cur_params = {'objective': 'reg:linear',
                  'max_depth': int(max_depth),
                  'eta': eta,
                  'max_delta_step': int(max_delta_step),
                  'colsample_bytree': colsample_bytree,
                  'subsample': subsample}

    cv_results = xgb.cv(params=cur_params,
                        dtrain=dm_input,  # dm_input accessed as a global here
                        nfold=3,
                        seed=3,
                        num_boost_round=50000,
                        early_stopping_rounds=50,
                        metrics='rmse')

    # Negated: bayes_opt maximizes, but we want the smallest RMSE
    return -1 * cv_results['test-rmse-mean'].min()


In the case of a regression problem, there's an important detail: Bayesian optimization seeks to maximize the output value, while we're trying to minimize the target RMSE metric. We have to negate the function's output so that the search properly finds the minimum RMSE.
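The sign trick can be illustrated with a toy example that needs no optimizer at all: maximizing the negated metric selects the same candidate as minimizing the metric itself (the function and values below are made up for illustration).

```python
# Toy illustration of the sign trick: minimizing a loss is
# equivalent to maximizing its negation.
candidates = [0.8, 1.5, 2.9, 4.1]

def fake_rmse(x):
    # hypothetical metric: smaller is better, minimum at x = 2.9
    return (x - 2.9) ** 2 + 3.0

best_by_min = min(candidates, key=fake_rmse)
best_by_max = max(candidates, key=lambda x: -fake_rmse(x))
assert best_by_min == best_by_max  # both pick 2.9
```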

Another detail is that the arguments of the custom function are restricted to hyperparameters only. That creates a problem, since we also have to pass the input dataset to the xgboost training function. I see two ways to handle it: we could either use the dataset as a global variable, or declare the custom function inside a class that holds the dataset as an attribute. The example below follows the class option.

class custom_bayesopt:
    def __init__(self, dm_input):
        self.dm_input = dm_input

    def objective(self, max_depth, eta, max_delta_step, colsample_bytree, subsample):
        cur_params = {'objective': 'reg:linear',
                      'max_depth': int(max_depth),
                      'eta': eta,
                      'max_delta_step': int(max_delta_step),
                      'colsample_bytree': colsample_bytree,
                      'subsample': subsample}

        cv_results = xgb.cv(params=cur_params,
                            dtrain=self.dm_input,
                            nfold=3,
                            seed=3,
                            num_boost_round=50000,
                            early_stopping_rounds=50,
                            metrics='rmse')

        return -1 * cv_results['test-rmse-mean'].min()


Now we start the Bayesian process, passing the custom function and the hyperparameter boundaries.

bopt_process = bopt.BayesianOptimization(custom_bayesopt(dm_input).objective,
                                         {'max_depth': (2, 15),
                                          'eta': (0.01, 0.3),
                                          'max_delta_step': (0, 10),
                                          'colsample_bytree': (0, 1),
                                          'subsample': (0, 1)},
                                         random_state=np.random.RandomState(1))


It's possible to register the optimization events into a log file. This is especially useful when you want to create a new Bayesian optimization instance later, since the saved information can be reloaded. The package documentation covers this in more detail.

from bayes_opt.observer import JSONLogger
from bayes_opt.event import Events

logger = JSONLogger(path="bopt.log.json")

bopt_process.subscribe(Events.OPTIMIZATION_STEP, logger)

bopt_process.maximize(n_iter=10, init_points=12)


Once the iterations finish, the winning hyperparameters are stored in the max attribute:

bopt_process.max

{'target': -2.9679186666666664,
 'params': {'colsample_bytree': 0.935372077775139,
            'eta': 0.013944731934196423,
            'max_delta_step': 0.07555792154893812,
            'max_depth': 3.2975847928232245,
            'subsample': 0.7161419730372364}}
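Note that the winning values all come back as floats, even for integer parameters like max_depth. A minimal sketch of preparing them for a final training run, using the dict returned above (the cast mirrors what the objective function does internally):

```python
# The 'params' dict reported by bopt_process.max, copied from the
# output above. bayes_opt reports every tuned value as a float.
best = {'colsample_bytree': 0.935372077775139,
        'eta': 0.013944731934196423,
        'max_delta_step': 0.07555792154893812,
        'max_depth': 3.2975847928232245,
        'subsample': 0.7161419730372364}

# Rebuild the full parameter set, casting integer-typed parameters
final_params = {'objective': 'reg:linear', **best}
for key in ('max_depth', 'max_delta_step'):
    final_params[key] = int(final_params[key])

# final_params is now ready to pass to xgb.train along with dm_input
```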