KNN Hyperparameter Optimization

In this tutorial, we will use NiaPy to optimize the hyperparameters of a K-nearest neighbors (KNN) classifier using the Hybrid Bat Algorithm. We will test our implementation on the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset.

Dependencies

Before we get started, make sure you have the following packages installed:

  • niapy: pip install niapy --pre

  • scikit-learn: pip install scikit-learn

Defining the problem

Our problem consists of 4 variables for which we must find the optimal values in order to maximize the classification accuracy of a K-nearest neighbors classifier. Those variables are:

  1. Number of neighbors (integer)

  2. Weight function {‘uniform’, ‘distance’}

  3. Algorithm {‘ball_tree’, ‘kd_tree’, ‘brute’}

  4. Leaf size (integer), used with the ‘ball_tree’ and ‘kd_tree’ algorithms

The solution will be a 4-dimensional vector, with each component representing a tunable parameter of the KNN classifier. Since the problem variables in niapy are continuous real values, we must map our solution vector \(\vec x;\ x_i \in [0, 1]\) to integers:

  • Number of neighbors: \(y_1 = \lfloor 5 + x_1 \times 10 \rfloor; y_1 \in [5, 15]\)

  • Weight function: \(y_2 = \lfloor x_2 \rceil; y_2 \in [0, 1]\)

  • Algorithm: \(y_3 = \lfloor x_3 \times 2 \rfloor; y_3 \in [0, 2]\)

  • Leaf size: \(y_4 = \lfloor 10 + x_4 \times 40 \rfloor; y_4 \in [10, 50]\)
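
For example, an arbitrary candidate vector \(\vec x = (0.5, 0.2, 0.9, 0.0)\) maps to \(y_1 = \lfloor 5 + 0.5 \times 10 \rfloor = 10\) neighbors, \(y_2 = \lfloor 0.2 \rceil = 0\) (‘uniform’), \(y_3 = \lfloor 0.9 \times 2 \rfloor = 1\) (‘kd_tree’), and a leaf size of \(y_4 = \lfloor 10 + 0.0 \times 40 \rfloor = 10\).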

Implementation

First, we will implement two helper functions: one that maps a solution vector to the parameters of the classifier, and one that constructs said classifier.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

from niapy.problems import Problem
from niapy.task import OptimizationType, Task
from niapy.algorithms.modified import HybridBatAlgorithm


def get_hyperparameters(x):
    """Get hyperparameters for solution `x`."""
    algorithms = ('ball_tree', 'kd_tree', 'brute')
    n_neighbors = int(5 + x[0] * 10)
    weights = 'uniform' if x[1] < 0.5 else 'distance'
    algorithm = algorithms[int(x[2] * 2)]
    leaf_size = int(10 + x[3] * 40)

    params = {
        'n_neighbors': n_neighbors,
        'weights': weights,
        'algorithm': algorithm,
        'leaf_size': leaf_size
    }
    return params


def get_classifier(x):
    """Get classifier from solution `x`."""
    params = get_hyperparameters(x)
    return KNeighborsClassifier(**params)
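
To see the mapping in action, we can pass a sample vector through these helpers; the vector below is arbitrary, chosen to match the worked example above:

import numpy as np

# An arbitrary candidate solution; every component lies in [0, 1].
sample = np.array([0.5, 0.2, 0.9, 0.0])
print(get_hyperparameters(sample))
# {'n_neighbors': 10, 'weights': 'uniform', 'algorithm': 'kd_tree', 'leaf_size': 10}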

Next, we need to write a custom problem class. As discussed, the problem will be 4-dimensional, with lower and upper bounds set to 0 and 1, respectively. The class will also store our training dataset, on which 2-fold cross-validation will be performed. The fitness function, which we’ll be maximizing, will be the mean of the cross-validation scores.

class KNNHyperparameterOptimization(Problem):
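    """Hyperparameter optimization of a KNN classifier as a NiaPy problem."""
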
    def __init__(self, X_train, y_train):
        super().__init__(dimension=4, lower=0, upper=1)
        self.X_train = X_train
        self.y_train = y_train

    def _evaluate(self, x):
        model = get_classifier(x)
        scores = cross_val_score(model, self.X_train, self.y_train, cv=2, n_jobs=-1)
        return scores.mean()

We will then load the breast cancer dataset and split it into a train and a test set in a stratified fashion, so that both sets preserve the class proportions of the full dataset.

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1234)
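
As a quick optional check, we can confirm that both splits preserve the class proportions (NumPy is assumed to be available, since scikit-learn depends on it):

import numpy as np

# Both splits should show roughly the same class ratio
# (about 63% benign to 37% malignant for this dataset).
print(np.bincount(y_train) / len(y_train))
print(np.bincount(y_test) / len(y_test))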

Now it’s time to run the algorithm. We set the maximum number of iterations to 100 and the population size of the algorithm to 10.

problem = KNNHyperparameterOptimization(X_train, y_train)
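
# Optional sanity check: score one random candidate before optimizing.
# This assumes the Problem base class exposes evaluate(), which wraps
# _evaluate() in NiaPy 2.x.
import numpy as np
print('Fitness of a random candidate:', problem.evaluate(np.random.rand(4)))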

# We will be running maximization for 100 iters on `problem`
task = Task(problem, max_iters=100, optimization_type=OptimizationType.MAXIMIZATION)

algorithm = HybridBatAlgorithm(population_size=10, seed=1234)
best_params, best_accuracy = algorithm.run(task)

print('Best parameters:', get_hyperparameters(best_params))
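
Optionally, since algorithm.run returns the best fitness alongside the solution, we can also report the best mean cross-validation accuracy found during the search:

print('Best cross-validation accuracy:', best_accuracy)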

Finally, let’s compare our best model with the default one.

default_model = KNeighborsClassifier()
best_model = get_classifier(best_params)

default_model.fit(X_train, y_train)
best_model.fit(X_train, y_train)

default_score = default_model.score(X_test, y_test)
best_score = best_model.score(X_test, y_test)

print('Default model accuracy:', default_score)
print('Best model accuracy:', best_score)

Output:

Best parameters: {'n_neighbors': 8, 'weights': 'uniform', 'algorithm': 'kd_tree', 'leaf_size': 10}
Default model accuracy: 0.9210526315789473
Best model accuracy: 0.9385964912280702