# KNN Hyperparameter Optimization

In this tutorial we will use NiaPy to optimize the hyperparameters of a KNN classifier with the Hybrid Bat Algorithm. We will test our implementation on the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset.

## Dependencies

Before we get started, make sure you have the following packages installed:

• niapy: pip install niapy --pre

• scikit-learn: pip install scikit-learn

## Defining the problem

Our problem consists of 4 variables, for which we must find the values that maximize the classification accuracy of a K-nearest neighbors classifier. Those variables are:

1. Number of neighbors (integer)

2. Weight function {‘uniform’, ‘distance’}

3. Algorithm {‘ball_tree’, ‘kd_tree’, ‘brute’}

4. Leaf size (integer), used with the ‘ball_tree’ and ‘kd_tree’ algorithms

The solution will be a 4-dimensional vector, with each component representing a tunable parameter of the KNN classifier. Since the problem variables in niapy are continuous real values, we must map each component of our solution vector $$\vec x; x_i \in [0, 1]$$ to an integer:

• Number of neighbors: $$y_1 = \lfloor 5 + x_1 \times 10 \rfloor; y_1 \in [5, 15]$$

• Weight function: $$y_2 = \lfloor x_2 \rceil; y_2 \in \{0, 1\}$$

• Algorithm: $$y_3 = \lfloor x_3 \times 2 \rfloor; y_3 \in \{0, 1, 2\}$$

• Leaf size: $$y_4 = \lfloor 10 + x_4 \times 40 \rfloor; y_4 \in [10, 50]$$
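
The mapping above can be sketched in plain Python. The function name `decode` and the sample vectors are illustrative assumptions rather than part of NiaPy, and the weight code is computed with a 0.5 threshold as a stand-in for rounding to the nearest integer.

```python
import math


def decode(x):
    """Map a vector x in [0, 1]^4 to the integer codes (y1, y2, y3, y4)."""
    y1 = math.floor(5 + x[0] * 10)   # number of neighbors, 5..15
    y2 = 0 if x[1] < 0.5 else 1      # weight function index (threshold at 0.5)
    y3 = math.floor(x[2] * 2)        # algorithm index, 0..2
    y4 = math.floor(10 + x[3] * 40)  # leaf size, 10..50
    return y1, y2, y3, y4


print(decode([0.0, 0.0, 0.0, 0.0]))      # (5, 0, 0, 10)
print(decode([0.375, 0.8, 0.75, 0.25]))  # (8, 1, 1, 20)
print(decode([1.0, 1.0, 1.0, 1.0]))      # (15, 1, 2, 50)
```

With this floor-based mapping, the algorithm index 2 is produced only when $$x_3$$ is exactly 1; every value of $$x_3$$ in $$[0.5, 1)$$ maps to index 1.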

## Implementation

First we will implement two helper functions, which map our solution vector to the parameters of the classifier, and construct said classifier.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    from niapy.problems import Problem
    from niapy.algorithms.modified import HybridBatAlgorithm

    def get_hyperparameters(x):
        """Get hyperparameters for solution x."""
        algorithms = ('ball_tree', 'kd_tree', 'brute')
        n_neighbors = int(5 + x[0] * 10)
        weights = 'uniform' if x[1] < 0.5 else 'distance'
        algorithm = algorithms[int(x[2] * 2)]
        leaf_size = int(10 + x[3] * 40)

        params = {
            'n_neighbors': n_neighbors,
            'weights': weights,
            'algorithm': algorithm,
            'leaf_size': leaf_size
        }
        return params

    def get_classifier(x):
        """Get classifier from solution x."""
        params = get_hyperparameters(x)
        return KNeighborsClassifier(**params)


Next, we need to write a custom problem class. As discussed, the problem will be 4-dimensional, with lower and upper bounds set to 0 and 1 respectively. The class will also store our training dataset, on which 2-fold cross-validation will be performed. The fitness function, which we’ll be maximizing, will be the mean of the cross-validation scores.

    class KNNHyperparameterOptimization(Problem):
        def __init__(self, X_train, y_train):
            super().__init__(dimension=4, lower=0, upper=1)
            self.X_train = X_train
            self.y_train = y_train

        def _evaluate(self, x):
            model = get_classifier(x)
            scores = cross_val_score(model, self.X_train, self.y_train, cv=2, n_jobs=-1)
            return scores.mean()
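
To make the fitness value concrete, here is a minimal sketch of the same computation for a single classifier, outside of NiaPy. The hyperparameter values are hypothetical and chosen only for illustration, and the sketch scores on the full dataset rather than on the training split used in the tutorial.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hypothetical candidate, chosen only for illustration.
candidate = KNeighborsClassifier(n_neighbors=7, weights='distance',
                                 algorithm='kd_tree', leaf_size=30)

# cv=2 yields one accuracy score per fold; their mean is the fitness.
scores = cross_val_score(candidate, X, y, cv=2)
fitness = scores.mean()

print('Fold scores:', scores)
print('Fitness:', fitness)
```

Each call to `_evaluate` repeats this computation for the classifier decoded from `x`, so every fitness evaluation fits the model twice.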


We will then load the breast cancer dataset, and split it into a train and test set in a stratified fashion.

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1234)
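
As a quick sanity check on the stratified split, the sketch below compares the class proportions of the full dataset with those of the train and test subsets. The variable names ending in `_ratio` are illustrative, and NumPy is assumed to be available since scikit-learn depends on it.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1234)

# Fraction of samples belonging to each class, per subset.
full_ratio = np.bincount(y) / len(y)
train_ratio = np.bincount(y_train) / len(y_train)
test_ratio = np.bincount(y_test) / len(y_test)

print('Train size:', len(X_train), 'Test size:', len(X_test))
print('Class ratios (full, train, test):', full_ratio, train_ratio, test_ratio)
```

With `stratify=y`, the three ratios should agree to within rounding, so the test accuracy is not skewed by a different class balance.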


Now it’s time to run the algorithm. We set the maximum number of iterations to 100, and set the population size of the algorithm to 10.

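    from niapy.task import Task, OptimizationType

    problem = KNNHyperparameterOptimization(X_train, y_train)

    # We will be running maximization for 100 iters on problem
    task = Task(problem, max_iters=100, optimization_type=OptimizationType.MAXIMIZATION)

    algorithm = HybridBatAlgorithm(population_size=10, seed=1234)
    # run() returns the best solution vector and its fitness
    best_params, best_accuracy = algorithm.run(task)

    print('Best parameters:', get_hyperparameters(best_params))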


Finally, let’s compare our optimized model with the default one on the test set.

    default_model = KNeighborsClassifier()
    best_model = get_classifier(best_params)

    default_model.fit(X_train, y_train)
    best_model.fit(X_train, y_train)

    default_score = default_model.score(X_test, y_test)
    best_score = best_model.score(X_test, y_test)

    print('Default model accuracy:', default_score)
    print('Best model accuracy:', best_score)


Output:

    Best parameters: {'n_neighbors': 8, 'weights': 'uniform', 'algorithm': 'kd_tree', 'leaf_size': 10}
    Default model accuracy: 0.9210526315789473
    Best model accuracy: 0.9385964912280702