Scikit-Learn: Search over Parameters AND Models

When training a machine learning model, we often use cross validation to search over parameters within a given model type, for example, choosing the best alpha in a lasso model. But what if we don't know which model type to use in the first place? Scikit-learn makes searching over different models just as straightforward as searching over parameters.
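For reference, the single-model case looks something like this (a minimal sketch; the alpha grid and the X, y variables are placeholders for your own data):

from sklearn import linear_model
from sklearn.model_selection import GridSearchCV

#search over alpha for a single model type (lasso)
lasso_search = GridSearchCV(linear_model.Lasso(),
                            {'alpha': [0.01, 0.05, 0.1]}, #placeholder grid
                            cv=3)
lasso_search.fit(X, y) #X, y stand in for your training data
print(lasso_search.best_params_)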

I wrote the following code to search over a list of tuples, where the first entry in each tuple is a model and the second entry is the parameter grid to search over for that model. Each tuple has a different grid since each model type takes different parameters.

#model types paired with the parameter grid to search for each
from sklearn import linear_model, ensemble
import xgboost as xgb

model_param_set = [
    (
     linear_model.Lasso(),
     {'alpha': [0.01, 0.05]}
    ),
    (
     linear_model.Ridge(),
     {'alpha': [0.1, 0.05]}
    ),
    (
     ensemble.GradientBoostingRegressor(),
     {"n_estimators": [100, 150],
      "max_leaf_nodes": [4, 10],
      "max_depth": [None],
      "random_state": [2],
      "min_samples_split": [5, 10]}
    ),
    (
     xgb.XGBClassifier(),
     {'n_estimators': [10, 100, 200],
      'tree_method': ['auto'],
      'subsample': [0.67, 0.33, 0.25],
      'colsample_bylevel': [0.06, 0.03, 0.01],
      'verbosity': [0],
      'n_jobs': [6],
      'random_state': [1234]}
    )
]

To add models or parameters, just add another tuple or edit an existing one.
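For example, another entry for the list above might look like this (the parameter values here are just illustrative):

#another (model, parameter grid) tuple for model_param_set
(
 ensemble.RandomForestRegressor(),
 {'n_estimators': [100, 200], #illustrative values
  'max_depth': [None, 10],
  'random_state': [2]}
)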

Then loop over the tuples, running GridSearchCV on each model, and save the resulting scores to find the model+parameter combination that performs best in cross validation.

from sklearn.model_selection import GridSearchCV, GroupKFold

#empty lists to populate with the best model and score from each search
model_list = []
model_score_list = []

for m, p in model_param_set:

    clf = GridSearchCV(m, #model
                       p, #parameter grid
                       cv=GroupKFold(n_splits=3), #use grouped cross validation
                       scoring='roc_auc') #how to score models

    clf.fit(X_train, #training features
            y_train, #training labels
            groups=df["groups"].loc[train_ix] #groups for the grouped cross validation
            )

    print('max score: ', clf.best_score_)
    print('best model: ', clf.best_estimator_)

    #keep the best score and estimator from each model type
    model_score_list.append(clf.best_score_)
    model_list.append(clf.best_estimator_)

I use grouped cross validation so that observations from the same group stay in the same fold, but the approach works just as well with more traditional cross validation.
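If you don't need the grouping, one option (a sketch, inside the same loop) is to swap GroupKFold for plain KFold and drop the groups argument:

from sklearn.model_selection import KFold

clf = GridSearchCV(m,
                   p,
                   cv=KFold(n_splits=3, shuffle=True, random_state=0), #ungrouped folds
                   scoring='roc_auc')
clf.fit(X_train, y_train) #no groups argument needed

In recent versions of scikit-learn, leaving cv unset gives 5-fold splitting, stratified when the estimator is a classifier.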

Finally, pick the best model+parameter combination and plot the ROC curve for it.

from sklearn import metrics
from sklearn.metrics import RocCurveDisplay

#combine the models and their scores into one dict (estimators are hashable, so they work as keys)
model_dict = dict(zip(model_list, model_score_list))

#find the max score and the associated model
max_value_key = max(model_dict, key=model_dict.get)
print(model_dict[max_value_key]) #the winning score
print(max_value_key) #the winning model

#plot the ROC curve
fpr, tpr, thresholds = metrics.roc_curve(y_test_cv, y_pred_cv)
roc_display = RocCurveDisplay(fpr=fpr, tpr=tpr).plot()
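The snippets above don't show where y_test_cv and y_pred_cv come from. One way to produce them (a sketch assuming a held-out X_test / y_test split not shown here) is to score that split with the winning estimator, which GridSearchCV has already refit on the full training data:

#a sketch: X_test and y_test are a held-out split not shown above
best_model = max_value_key #the winning estimator, already refit by GridSearchCV

if hasattr(best_model, "predict_proba"):
    y_pred_cv = best_model.predict_proba(X_test)[:, 1] #probability of the positive class
else:
    y_pred_cv = best_model.predict(X_test) #for regressors, use the raw prediction as a score
y_test_cv = y_test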
