K-Nearest Neighbors (or simply KNN) is a simple supervised algorithm that uses proximity measure to perform classification or regression. It assumes that similar data points are closer to one another.
A supervised machine learning algorithm uses the labeled training (input) data to learn a model that predicts labels of unknown samples. For example, we have a dataset that has two variables: height and age. Here, height is a predictor, and age is the target (output) variable. If we use an algorithm that learns a function using this labeled data, then it would be a supervised algorithm. On the contrary, an unsupervised algorithm does not use the target variable to learn a model.
Consider the above figure. Here, the data points have two attributes (predictors) and two classes, i.e., circle and diamond. We can observe here that there are two groups. Items belonging to the same class are in proximity.
Consider the figure below, where we have an unknown data point. To which class will this point belong if we apply the concept of KNN?
Since the neighbors of the unknown example belong to the class circle, we can easily say that this is also a circle. Furthermore, you can notice here that we used the class information of the input data to predict the output of the unknown sample as KNN is a supervised algorithm.
KNN finds the similarity of data points using some distance measure. There are multiple ways to calculate the distance. The choice of the distance measure depends upon the type of data and problem at hand.
For the sake of this tutorial, we will use the Euclidean distance, which is a popular distance metric.
In the above example, we did not consider a specific number of neighbors to predict the output because the example was simple, and we were able to find the class just by looking at the data. However, in practical cases, we need to choose a specific number of neighbors to decide, i.e., we have to select the value of K in k-nearest neighbors.
Now, let’s go ahead and see the algorithm of KNN.
Given a set of n training examples having m predictors and a target variable y, i.e., pi = (x1i, x2i, x3i, …, xmi, yi), we want to predict the outcome of the test sample q = (x1, x2, x3, …, xm, ?).
Choosing the right value for K is important because it has a strong effect on the performance of the KNN algorithm.
If we take K=1, then we are only considering a single neighbor to predict the outcome. In this case, the algorithm will be unstable and sensitive to outliers (anomalies). Consider the following figure in which we want to classify the unknown example.
If K=1, then the class of the test sample will be equal to the class of the neighbor that is closest to it. Therefore, KNN would predict that the unknown example is a circle as its nearest neighbor is also a circle. However, we can observe that it is surrounded by diamonds mostly, and the circle is probably an outlier.
If we choose a very high value for K, then the algorithm would get affected by the number of training examples of each class. In the above example, if we choose K=11, then the unknown sample will get classified as a circle because it is the most probable class.
Therefore, we cannot choose a value that is too high or too low. The optimal value can be obtained by trying various values for K. We can take aside a portion of the training data known as the validation set. Then, test the algorithm on that data with different K values. Finally, we pick the value that gives us the least error.
Furthermore, we usually take K as an odd number to avoid ties, i.e., an equal number of samples of each class.
Let’s now implement KNN from scratch. First, let’s import some libraries.
import numpy as np import pandas as pd import copy from sklearn.metrics import accuracy_score from sklearn.model_selection import train_test_split
Here, we will use the iris dataset. It contains four attributes and three classes.
Target Variable (species)
Let’s load it into a Padas DataFrame.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data" data = pd.read_csv(url,names=['sepal length','sepal width','petal length','petal width','target']) data.head()
Output Now, let’s separate the dataset into features and the target variable. Then, we split the data for training and testing. We reserve 20% of the data for testing.
X = data.iloc[:, 0:4] #predictors y = data.iloc[:,4] #target variable #splitting the data into train and test X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)
Now, we write the code to implement KNN. It has five parameters, the value of K, training data including predictors and labels, testing data, and a flag for classification or regression.
def KNN(k, X_trani, y_train, X_test, classification=True): predicted_labels =  #iterate through every sample in the test dataset to predict its outcome for i in range(0, len(X_test)): dt = copy.deepcopy(X_train) #create a copy of X_train #calculate the euclidean distance between the test sample and every training example dt['distance'] = np.sqrt(np.sum(np.square(X_train-X_test.iloc[i]), axis=1)) dt['target'] = y_train #sort the values by distance in the ascending order dt = dt.sort_values('distance') #get the labels of the k nearest neighbors k_labels = dt.iloc[0:k, -1 if classification: predicted_labels.append(k_labels.mode()) #take the most frequent item else: predicted_labels.append(k_labels.mean()) #take the mean return predicted_labels
Let’s call this function to predict the labels of the test data. y_prediction = KNN(5, X_train, y_train, X_test) y_prediction[0:10] #first ten labels
Finally, let’s check the accuracy when K=5.
If we set K=7, then we will get an accuracy of 100%.
y_prediction = KNN(7, X_train, y_train, X_test) accuracy_score(y_test, y_prediction)
KNN is a simple algorithm, and thus it is quite easy to implement. It is non-parametric, meaning that it does not make any prior assumptions about the data. It only assumes that similar items are nearby.
KNN is a lazy learner, which means that it does not learn a specific function from the training data. It simply stores all the training data. While testing, it will use each training sample to predict the outcome. Therefore, it consumes more space and is computationally expensive. While the training complexity is constant O(1), the testing time complexity is O(nd) because we need to calculate the distance with each point. Here n is the number of training samples, and d is the cost of computing distance.