k-Nearest Neighbors

Classify a point by majority vote of its k closest neighbors in feature space.

Classification

The Mathematics

Distance metrics, majority voting, and the bias-variance tradeoff

Problem Definition

Given $n$ labeled training points $(x_i, y_i)$ and a new query point $q$, classify $q$ based on the labels of its $k$ nearest neighbors.

\hat{y}(q) = \arg\max_c \sum_{x_i \in \mathcal{N}_k(q)} \mathbb{1}[y_i = c]

Euclidean Distance

The distance between two points in $d$-dimensional space:

d(q, x_i) = \sqrt{\sum_{j=1}^{d}(q_j - x_{ij})^2}

For 2D data, this simplifies to the familiar formula: $d = \sqrt{(\Delta x)^2 + (\Delta y)^2}$.
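As a quick sanity check, the formula can be computed directly in the standard library (a minimal sketch; the helper name `euclidean` is ours):

```python
from math import sqrt

def euclidean(q, x):
    # d(q, x) = sqrt(sum_j (q_j - x_j)^2)
    return sqrt(sum((qj - xj) ** 2 for qj, xj in zip(q, x)))

# 2D case: the classic 3-4-5 right triangle
print(euclidean((0, 0), (3, 4)))  # 5.0
```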

Choosing k: Bias-Variance Tradeoff

The choice of $k$ controls model complexity:

- Small k (e.g., k=1): Low bias, high variance. The decision boundary is jagged and sensitive to noise.

- Large k (e.g., k=n): High bias, low variance. The model always predicts the majority class.
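Both extremes can be seen on a toy dataset (the numbers below are made up for illustration; `math.dist` is the standard-library Euclidean distance):

```python
from math import dist
from collections import Counter

# Hypothetical 1-D training set: class 1 dominates, with one
# noisy class-0 point at x = 2.0
X = [(0.0,), (1.0,), (2.0,), (8.0,), (9.0,), (10.0,), (11.0,)]
y = [1, 1, 0, 1, 1, 1, 1]

def predict(query, k):
    # Indices of the k training points closest to the query
    nearest = sorted(range(len(X)), key=lambda i: dist(query, X[i]))[:k]
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

print(predict((2.1,), k=1))  # 0: the single noisy neighbor decides (high variance)
print(predict((2.1,), k=7))  # 1: the global majority class wins (high bias)
```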

No Training Phase

KNN is a lazy learner: it stores all training data and defers computation to prediction time. Prediction cost is $O(nd)$ for $n$ points in $d$ dimensions.
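In practice that $O(nd)$ distance pass is usually vectorized. A sketch assuming NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.random((1000, 5))   # n = 1000 points, d = 5 dimensions
query = rng.random(5)

# All n distances in one broadcasted O(nd) pass -- no training phase needed
dists = np.linalg.norm(X_train - query, axis=1)

# Indices of the k nearest, via O(n) partial selection instead of a full sort
k = 3
nearest = np.argpartition(dists, k)[:k]
```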


See It Work

Classifying query point (5, 3.5) with k=3

[Interactive scatter plot: training points from Class 0, Class 1, and Class 2 on axes $x_1$, $x_2$, with the query point $(5, 3.5)$ highlighted; $k = 3$]

Query point: (5, 3.5). Classify using its 3 nearest neighbors among 3 classes.
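The vote for this step can be reproduced in a few lines (the coordinates below are invented for the sketch, not the exact points from the chart):

```python
from math import dist
from collections import Counter

# Hypothetical 2-D training points for three classes
X = [(1.0, 1.0), (2.0, 1.5),              # class 0
     (5.0, 3.0), (5.5, 4.0), (6.0, 3.5),  # class 1
     (8.0, 7.0), (9.0, 6.0)]              # class 2
y = [0, 0, 1, 1, 1, 2, 2]

query, k = (5.0, 3.5), 3
nearest = sorted(range(len(X)), key=lambda i: dist(query, X[i]))[:k]
print([y[i] for i in nearest])                        # [1, 1, 1]
print(Counter(y[i] for i in nearest).most_common(1))  # [(1, 3)]
```

With these points, all three nearest neighbors belong to class 1, so the vote is unanimous.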


The Code

Bridge from mathematical formulation to Python implementation

Mathematical Formulation

d(q, x_i) = \sqrt{\sum_j (q_j - x_{ij})^2}

Compute Euclidean distance from query to each training point

\text{sort}\{d(q, x_i)\}

Sort all distances in ascending order

\mathcal{N}_k(q) = \text{top-}k \text{ nearest}

Select the k nearest neighbors

\text{count}(y_i = c),\; x_i \in \mathcal{N}_k

Count class votes among k neighbors

\hat{y} = \arg\max_c \text{votes}(c)

Return the class with the most votes

Python Implementation

from math import sqrt

def knn_classify(X_train, y_train, query, k=3):
    distances = []
    for i, x in enumerate(X_train):       # Compute distance to each training point
        d = sqrt(sum((qj - xj)**2 for qj, xj in zip(query, x)))
        distances.append((d, i))
    distances.sort()                      # Sort by distance, ascending
    top_k = distances[:k]                 # Select the k nearest
    votes = {}
    for d, idx in top_k:                  # Count class votes
        label = y_train[idx]
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)      # Majority vote (ties broken arbitrarily)