KNN Hyperparameter Optimization

In this tutorial, we will use NiaPy to optimize the hyperparameters of a K-nearest neighbors (KNN) classifier using the Hybrid Bat Algorithm. We will test our implementation on the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset.

Dependencies

Before we get started, make sure you have the following packages installed:

  • niapy: pip install niapy --pre

  • scikit-learn: pip install scikit-learn

Defining the problem

Our problem consists of 4 variables for which we must find the optimal values in order to maximize the classification accuracy of a K-nearest neighbors classifier. Those variables are:

  1. Number of neighbors (integer)

  2. Weight function {‘uniform’, ‘distance’}

  3. Algorithm {‘ball_tree’, ‘kd_tree’, ‘brute’}

  4. Leaf size (integer), used with the ‘ball_tree’ and ‘kd_tree’ algorithms

The solution will be a 4-dimensional vector, with each component representing a tunable parameter of the KNN classifier. Since the problem variables in niapy are continuous real values, we must map our solution vector \(\vec x;\ x_i \in [0, 1]\) to integers:

  • Number of neighbors: \(y_1 = \lfloor 5 + x_1 \times 10 \rfloor; y_1 \in [5, 15]\)

  • Weight function: \(y_2 = \lfloor x_2 \rceil; y_2 \in [0, 1]\)

  • Algorithm: \(y_3 = \lfloor x_3 \times 2 \rfloor; y_3 \in [0, 2]\)

  • Leaf size: \(y_4 = \lfloor 10 + x_4 \times 40 \rfloor; y_4 \in [10, 50]\)
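
For example, an arbitrary candidate vector \(\vec x = (0.5, 0.2, 0.9, 0.0)\) maps to \(y_1 = \lfloor 5 + 0.5 \times 10 \rfloor = 10\) neighbors, \(y_2 = \lfloor 0.2 \rceil = 0\) (‘uniform’), \(y_3 = \lfloor 0.9 \times 2 \rfloor = 1\) (‘kd_tree’), and a leaf size of \(y_4 = \lfloor 10 + 0.0 \times 40 \rfloor = 10\).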

Implementation

First, we will implement two helper functions: one that maps a solution vector to the parameters of the classifier, and one that constructs said classifier.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

from niapy.problems import Problem
from niapy.task import OptimizationType, Task
from niapy.algorithms.modified import HybridBatAlgorithm


def get_hyperparameters(x):
    """Get hyperparameters for solution `x`."""
    algorithms = ('ball_tree', 'kd_tree', 'brute')
    n_neighbors = int(5 + x[0] * 10)
    weights = 'uniform' if x[1] < 0.5 else 'distance'
    algorithm = algorithms[int(x[2] * 2)]
    leaf_size = int(10 + x[3] * 40)

    params = {
        'n_neighbors': n_neighbors,
        'weights': weights,
        'algorithm': algorithm,
        'leaf_size': leaf_size
    }
    return params


def get_classifier(x):
    """Get classifier from solution `x`."""
    params = get_hyperparameters(x)
    return KNeighborsClassifier(**params)
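
To see the mapping in action, we can pass a sample vector through these helpers; the vector below is arbitrary, chosen to match the worked example above:

import numpy as np

# An arbitrary candidate solution; every component lies in [0, 1].
sample = np.array([0.5, 0.2, 0.9, 0.0])
print(get_hyperparameters(sample))
# {'n_neighbors': 10, 'weights': 'uniform', 'algorithm': 'kd_tree', 'leaf_size': 10}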

Next, we need to write a custom problem class. As discussed, the problem will be 4-dimensional, with lower and upper bounds set to 0 and 1, respectively. The class will also store our training dataset, on which 2-fold cross-validation will be performed. The fitness function, which we’ll be maximizing, will be the mean of the cross-validation scores.

class KNNHyperparameterOptimization(Problem):
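    """Hyperparameter optimization of a KNN classifier as a NiaPy problem."""
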
    def __init__(self, X_train, y_train):
        super().__init__(dimension=4, lower=0, upper=1)
        self.X_train = X_train
        self.y_train = y_train

    def _evaluate(self, x):
        model = get_classifier(x)
        scores = cross_val_score(model, self.X_train, self.y_train, cv=2, n_jobs=-1)
        return scores.mean()

We will then load the breast cancer dataset and split it into a train and a test set in a stratified fashion, so that both sets preserve the class proportions of the full dataset.

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1234)
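
As a quick optional check, we can confirm that both splits preserve the class proportions (NumPy is assumed to be available, since scikit-learn depends on it):

import numpy as np

# Both splits should show roughly the same class ratio
# (about 63% benign to 37% malignant for this dataset).
print(np.bincount(y_train) / len(y_train))
print(np.bincount(y_test) / len(y_test))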

Now it’s time to run the algorithm. We set the maximum number of iterations to 100 and the population size of the algorithm to 10.

problem = KNNHyperparameterOptimization(X_train, y_train)
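
# Optional sanity check: score one random candidate before optimizing.
# This assumes the Problem base class exposes evaluate(), which wraps
# _evaluate() in NiaPy 2.x.
import numpy as np
print('Fitness of a random candidate:', problem.evaluate(np.random.rand(4)))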

# We will be running maximization for 100 iters on `problem`
task = Task(problem, max_iters=100, optimization_type=OptimizationType.MAXIMIZATION)

algorithm = HybridBatAlgorithm(population_size=10, seed=1234)
best_params, best_accuracy = algorithm.run(task)

print('Best parameters:', get_hyperparameters(best_params))
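
Optionally, since algorithm.run returns the best fitness alongside the solution, we can also report the best mean cross-validation accuracy found during the search:

print('Best cross-validation accuracy:', best_accuracy)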

Finally, let’s compare our best model with the default one.

default_model = KNeighborsClassifier()
best_model = get_classifier(best_params)

default_model.fit(X_train, y_train)
best_model.fit(X_train, y_train)

default_score = default_model.score(X_test, y_test)
best_score = best_model.score(X_test, y_test)

print('Default model accuracy:', default_score)
print('Best model accuracy:', best_score)

Output:

Best parameters: {'n_neighbors': 8, 'weights': 'uniform', 'algorithm': 'kd_tree', 'leaf_size': 10}
Default model accuracy: 0.9210526315789473
Best model accuracy: 0.9385964912280702