Streamlining Classifier Model Comparisons with SK-Learn

Juan Felipe Alvarez Jaramillo
Jul 17, 2019

A classification problem is a type of supervised machine learning problem in which we train a model on a dataset whose observations are labelled with previously known classes, and then use that model to predict the class of new observations.

One tricky aspect of tackling these classification problems is that multiple models can be trained from the same dataset. To cite just a few examples, the scikit-learn library contains at least 8 different mathematical approaches that could accomplish the classification task, each with its own hyperparameters that can raise or lower performance when tuned.

Typically, finding the proper classification algorithm involves multiple reruns of the code until a good result is achieved. Since this is a highly repetitive process, I thought I would write a blog post to share a program that can be re-used in every run to produce a model comparison summary ranking the candidate algorithms from best to worst. If you are familiar with SAS Enterprise Miner, I wanted to create something similar to the SAS Model Comparison node to make the process more efficient.

In this example, I will be working with a very popular publicly available dataset that you can download from this Kaggle page, or load directly from scikit-learn with sklearn.datasets.load_breast_cancer. However, the code that follows should work with any classification dataset that has been properly preprocessed.

The code below assumes that the data has been loaded and preprocessed to have a normalised 0–1 scale, and also that it has already been split into train and test subsets.
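As a minimal sketch of that setup (assuming the scikit-learn version of the dataset, and my own choice of split ratio and random seed rather than the article's), the preparation might look like this:

```python
# Load the breast cancer dataset, split into train/test subsets,
# and scale features to a 0-1 range.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Fit the scaler on the training data only, to avoid leaking test information
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```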

For the comparison, we will be working with the following classifiers:
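The exact models and starting hyperparameters used in the original snippet are not reproduced here; as an illustrative sketch, a dictionary of candidate classifiers might look like this:

```python
# Candidate classifiers with illustrative initial hyperparameters
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

classifiers = {
    'Logistic Regression': LogisticRegression(C=1.0, max_iter=1000),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(max_depth=5),
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=5),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100),
    'Support Vector Machine': SVC(C=1.0, kernel='rbf'),
    'Gaussian Naive Bayes': GaussianNB(),
}
```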

Each object starts with an initial hyperparameter selection. These parameters can be changed in later iterations to see whether the models improve, and it is very likely that we will keep playing with them until we hit the jackpot.

Now we create the loop that will make our lives easier:
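A minimal sketch of that loop, reusing the classifiers dictionary and the train/test splits defined above, could be:

```python
import pandas as pd

# Fit every candidate model, collect its train and test accuracy,
# and rank the results from best to worst test score.
results = []
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    results.append({
        'Classifier': name,
        'Train Score': clf.score(X_train, y_train),
        'Test Score': clf.score(X_test, y_test),
    })

comparison = (pd.DataFrame(results)
                .sort_values('Test Score', ascending=False)
                .reset_index(drop=True))
print(comparison)
```

Keeping the models in a dictionary makes the loop generic: adding or removing a candidate, or changing its hyperparameters, is a one-line change that does not touch the comparison logic.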

Now, every time we run our script, we get a neat pandas DataFrame ranking the classification algorithms from best to worst by their Test Score.

If we wish to experiment and fine-tune our models' parameters, we only need to change their arguments and re-run the code. There is no need for multiple scripts whatsoever.

I hope that this loop makes your ML training process more efficient!

As usual, the complete code and the datasets can be found on my GitHub.
