k-Nearest Neighbors

Classify a point by majority vote of its k closest neighbors in feature space.

Classification

The Mathematics

Distance metrics, majority voting, and the bias-variance tradeoff

Problem Definition

Given $n$ labeled training points $(x_i, y_i)$ and a new query point $q$, classify $q$ based on the labels of its $k$ nearest neighbors.

\hat{y}(q) = \arg\max_c \sum_{x_i \in \mathcal{N}_k(q)} \mathbb{1}[y_i = c]

Euclidean Distance

The distance between two points in $d$-dimensional space:

d(q, x_i) = \sqrt{\sum_{j=1}^{d}(q_j - x_{ij})^2}

For 2D data, this simplifies to the familiar formula: $d = \sqrt{(\Delta x)^2 + (\Delta y)^2}$.
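As a quick sanity check, the formula can be computed directly in the standard library (a minimal sketch; the helper name `euclidean` is ours):

```python
from math import sqrt

def euclidean(q, x):
    # d(q, x) = sqrt(sum_j (q_j - x_j)^2)
    return sqrt(sum((qj - xj) ** 2 for qj, xj in zip(q, x)))

# 2D case: the classic 3-4-5 right triangle
print(euclidean((0, 0), (3, 4)))  # 5.0
```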

Choosing k: Bias-Variance Tradeoff

The choice of $k$ controls model complexity:

- Small k (e.g., k=1): Low bias, high variance. The decision boundary is jagged and sensitive to noise.

- Large k (e.g., k=n): High bias, low variance. The model always predicts the majority class.
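Both extremes can be seen on a toy dataset (the numbers below are made up for illustration; `math.dist` is the standard-library Euclidean distance):

```python
from math import dist
from collections import Counter

# Hypothetical 1-D training set: class 1 dominates, with one
# noisy class-0 point at x = 2.0
X = [(0.0,), (1.0,), (2.0,), (8.0,), (9.0,), (10.0,), (11.0,)]
y = [1, 1, 0, 1, 1, 1, 1]

def predict(query, k):
    # Indices of the k training points closest to the query
    nearest = sorted(range(len(X)), key=lambda i: dist(query, X[i]))[:k]
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

print(predict((2.1,), k=1))  # 0: the single noisy neighbor decides (high variance)
print(predict((2.1,), k=7))  # 1: the global majority class wins (high bias)
```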

No Training Phase

KNN is a lazy learner: it stores all training data and defers computation to prediction time. Prediction cost is $O(nd)$ for $n$ points in $d$ dimensions.
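In practice that $O(nd)$ distance pass is usually vectorized. A sketch assuming NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.random((1000, 5))   # n = 1000 points, d = 5 dimensions
query = rng.random(5)

# All n distances in one broadcasted O(nd) pass -- no training phase needed
dists = np.linalg.norm(X_train - query, axis=1)

# Indices of the k nearest, via O(n) partial selection instead of a full sort
k = 3
nearest = np.argpartition(dists, k)[:k]
```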


See It Work

Classifying query point (5, 3.5) with k=3

[Interactive scatter plot: training points from Class 0, Class 1, and Class 2 on axes $x_1$, $x_2$, with the query point $(5, 3.5)$ highlighted; $k = 3$]

Query point: (5, 3.5). Classify using its 3 nearest neighbors among 3 classes.
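The vote for this step can be reproduced in a few lines (the coordinates below are invented for the sketch, not the exact points from the chart):

```python
from math import dist
from collections import Counter

# Hypothetical 2-D training points for three classes
X = [(1.0, 1.0), (2.0, 1.5),              # class 0
     (5.0, 3.0), (5.5, 4.0), (6.0, 3.5),  # class 1
     (8.0, 7.0), (9.0, 6.0)]              # class 2
y = [0, 0, 1, 1, 1, 2, 2]

query, k = (5.0, 3.5), 3
nearest = sorted(range(len(X)), key=lambda i: dist(query, X[i]))[:k]
print([y[i] for i in nearest])                        # [1, 1, 1]
print(Counter(y[i] for i in nearest).most_common(1))  # [(1, 3)]
```

With these points, all three nearest neighbors belong to class 1, so the vote is unanimous.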


The Code

Bridge from mathematical formulation to Python implementation

Mathematical Formulation

d(q, x_i) = \sqrt{\sum_j (q_j - x_{ij})^2}

Compute Euclidean distance from query to each training point

\text{sort}\{d(q, x_i)\}

Sort all distances in ascending order

\mathcal{N}_k(q) = \text{top-}k \text{ nearest}

Select the k nearest neighbors

\text{count}(y_i = c),\; x_i \in \mathcal{N}_k

Count class votes among k neighbors

\hat{y} = \arg\max_c \text{votes}(c)

Return the class with the most votes

Python Implementation

from math import sqrt

def knn_classify(X_train, y_train, query, k=3):
    distances = []
    for i, x in enumerate(X_train):       # Compute distance to each training point
        d = sqrt(sum((qj - xj)**2 for qj, xj in zip(query, x)))
        distances.append((d, i))
    distances.sort()                      # Sort by distance, ascending
    top_k = distances[:k]                 # Select the k nearest
    votes = {}
    for d, idx in top_k:                  # Count class votes
        label = y_train[idx]
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)      # Majority vote (ties broken arbitrarily)