k-Nearest Neighbors
Classify a point by majority vote of its k closest neighbors in feature space.
The Mathematics
Distance metrics, majority voting, and the bias-variance tradeoff
Problem Definition
Given labeled training points $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$ and a new query point $\mathbf{q}$, classify $\mathbf{q}$ based on the labels of its $k$ nearest neighbors.
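In code, that setup is just a list of feature vectors, a matching list of labels, and one query vector. A minimal sketch (the values here are illustrative, not from the original):

X_train = [(1.0, 2.0), (2.0, 3.0), (3.0, 1.0)]  # feature vectors (illustrative)
y_train = ["A", "A", "B"]                        # one label per training point
query = (2.5, 2.5)                               # the point to classify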
Euclidean Distance
The distance between two points $\mathbf{p}$ and $\mathbf{q}$ in $d$-dimensional space:
$$\mathrm{dist}(\mathbf{p}, \mathbf{q}) = \sqrt{\sum_{i=1}^{d} (p_i - q_i)^2}$$
For 2D data, this simplifies to the familiar formula: $\sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}$.
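The same formula in Python, as a small sketch (the helper name euclidean is an illustrative choice):

from math import sqrt

def euclidean(p, q):
    # Square root of the summed squared coordinate differences
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean((0, 0), (3, 4)))  # 5.0, the classic 3-4-5 right triangle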
Choosing k: Bias-Variance Tradeoff
The choice of $k$ controls model complexity; a quick sketch of the effect follows the two cases below:
Small k (e.g., k=1): Low bias, high variance. The decision boundary is jagged and sensitive to noise.
Large k (e.g., k=n): High bias, low variance. The model always predicts the majority class.
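A small demonstration of the tradeoff, assuming scikit-learn is available (the toy data below is illustrative, not from the original): a single mislabeled point sitting right next to the query flips the k=1 prediction but is outvoted at k=5.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Five class-"A" points plus one mislabeled "B" outlier that lies closest to the query
X = np.array([[0.1, 0.0],                       # the outlier, labeled "B"
              [0.3, 0.0], [0.0, 0.3],
              [-0.3, 0.0], [0.0, -0.3],
              [0.4, 0.4]])
y = np.array(["B", "A", "A", "A", "A", "A"])
query = [[0.0, 0.0]]

print(KNeighborsClassifier(n_neighbors=1).fit(X, y).predict(query))  # ['B'], follows the noisy point
print(KNeighborsClassifier(n_neighbors=5).fit(X, y).predict(query))  # ['A'], the outlier is outvoted 4-1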
No Training Phase
KNN is a lazy learner: it stores all training data and defers computation to prediction time. Prediction cost is $O(nd)$ for $n$ training points in $d$ dimensions.
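Because every query pays that full cost, a vectorized version of the prediction step makes the $O(nd)$ work explicit. A sketch assuming NumPy (the function name knn_predict is illustrative):

import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    # X_train is an (n, d) array; this line does the O(n*d) work:
    # one pass over all n training points across all d dimensions
    dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                   # majority vote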
See It Work
Classifying query point (5, 3.5) with k=3
Query point: (5, 3.5). Classify using its 3 nearest neighbors among 3 classes.
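The demo's actual training points are not listed in the text, but the scenario can be sketched with illustrative stand-in clusters; with the data below, all three nearest neighbors of (5, 3.5) fall in class B, so the prediction is B. This assumes scikit-learn is available.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Illustrative stand-ins for the demo's three classes (not the demo's real data)
X_train = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],    # class A cluster
                    [5.0, 4.0], [5.5, 3.0], [4.5, 3.5],    # class B cluster
                    [8.0, 1.0], [7.5, 2.0], [8.5, 1.5]])   # class C cluster
y_train = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C"])

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(clf.predict([[5.0, 3.5]]))  # ['B']: the 3 nearest neighbors are all class B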
The Code
Bridge from mathematical formulation to Python implementation
Mathematical Formulation
Compute Euclidean distance from query to each training point
Sort all distances in ascending order
Select the k nearest neighbors
Count class votes among k neighbors
Return the class with the most votes
Python Implementation
from math import sqrt

def knn_classify(X_train, y_train, query, k=3):
    distances = []
    for i, x in enumerate(X_train):                  # Compute distance to each training point
        d = sqrt(sum((q - xi) ** 2 for q, xi in zip(query, x)))
        distances.append((d, i))
    distances.sort()                                 # Sort by distance (ascending)
    top_k = distances[:k]                            # Select the k nearest neighbors
    votes = {}
    for d, idx in top_k:                             # Count class votes
        label = y_train[idx]
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)                 # Return the majority class
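A quick usage check against the See It Work scenario, reusing the illustrative clusters from above (not the demo's real data); the three nearest neighbors of (5, 3.5) are all in class B:

X_train = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0),    # class A (illustrative)
           (5.0, 4.0), (5.5, 3.0), (4.5, 3.5),    # class B (illustrative)
           (8.0, 1.0), (7.5, 2.0), (8.5, 1.5)]    # class C (illustrative)
y_train = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]

print(knn_classify(X_train, y_train, query=(5.0, 3.5), k=3))  # 'B'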