K-Nearest Neighbors

There is no underlying statistical model associated with KNN. For classification problems, the model makes its prediction by a majority vote among the k nearest neighbors. As a result, we typically want to select an odd number for the k hyper-parameter to avoid ties (at least when there are only two classes). For regression problems (continuous targets), a KNN model makes its prediction by averaging the target values of its k nearest neighbors.
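To make both prediction modes concrete, here is a minimal sketch using scikit-learn's KNeighborsClassifier and KNeighborsRegressor. The tiny one-feature dataset is made up purely for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Made-up training data: one feature, five observations.
X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0]])
y_class = np.array([0, 0, 0, 1, 1])                  # class labels
y_value = np.array([10.0, 12.0, 14.0, 40.0, 42.0])   # continuous target

query = np.array([[2.5]])

# Classification: majority vote among the k=3 nearest neighbors.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
print(clf.predict(query))   # neighbors 2.0, 3.0, 1.0 -> votes [0, 0, 0] -> class 0

# Regression: average of the k=3 nearest neighbors' target values.
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_value)
print(reg.predict(query))   # mean of 10.0, 12.0, 14.0 -> 12.0
```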

The key to this algorithm is picking the right ‘k’. If we pick too large a ‘k’, we might end up comparing an individual who lives in the Fan District in Richmond with an individual who lives in the Downtown area. If we pick too small a ‘k’, we might not capture all of the diversity within a specific neighborhood. Changing ‘k’ often changes the predictions, as the sketch below shows.
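The following sketch, again with made-up numbers, constructs a single query point whose predicted class flips as ‘k’ grows: with a small ‘k’ the two nearby 0-labeled points decide the vote, while a larger ‘k’ pulls in the more numerous but more distant 1-labeled points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up data: two 0-labeled points near the query, three 1-labeled points farther away.
X = np.array([[0.0], [1.0], [6.0], [7.0], [8.0]])
y = np.array([0, 0, 1, 1, 1])
query = np.array([[2.0]])

for k in (1, 3, 5):
    model = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(f"k={k} -> predicted class {model.predict(query)[0]}")
# k=1 and k=3 predict class 0; k=5 includes all three 1s and predicts class 1.
```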

There is no purely statistical way to select the correct ‘k’. One approach is trial and error: fit the model with several different ‘k’ hyper-parameter values and compare their accuracy scores. Another common approach is to start with the square root of the training set's sample size (often divided by 2).
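Here is a sketch of both approaches, using scikit-learn's built-in iris data as a stand-in for a real training set:

```python
import math
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Trial and error: compare accuracy scores across odd values of k.
for k in range(1, 16, 2):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:2d}  accuracy={model.score(X_test, y_test):.3f}")

# Rule-of-thumb starting point: square root of the training sample size,
# often divided by 2.
n = len(X_train)
print("sqrt(n):   ", round(math.sqrt(n)))       # ~10 for 105 training rows
print("sqrt(n)/2: ", round(math.sqrt(n) / 2))   # ~5
```

Neither approach is definitive; the trial-and-error loop measures what we actually care about (out-of-sample accuracy), while the square-root heuristic merely gives a reasonable starting value for ‘k’.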