K-Means Clustering

Partition data into k clusters by iteratively assigning points and updating centroids.

Phase 1

The Mathematics

WCSS objective, assignment step, update step, and convergence

Objective: Within-Cluster Sum of Squares

K-Means seeks to partition $n$ data points into $k$ clusters $C_1, \ldots, C_k$ by minimizing the total squared distance from each point to its cluster centroid:

\text{WCSS} = \sum_{j=1}^{k}\sum_{x_i \in C_j} \|x_i - \mu_j\|^2

where $\mu_j$ is the centroid (mean) of cluster $C_j$.
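
As a concrete check, the objective can be evaluated directly from the data, the assignments, and the centroids. A minimal NumPy sketch (the function name wcss and the example arrays are illustrative, not part of the original):

import numpy as np

def wcss(X, assignments, centroids):
    # Sum of squared distances from each point x_i to its assigned centroid mu_{c_i}
    return float(np.sum((X - centroids[assignments]) ** 2))

# Example: two clusters in 2-D
X = np.array([[0.0, 0.0], [1.0, 0.0], [9.0, 9.0]])
assignments = np.array([0, 0, 1])
centroids = np.array([[0.5, 0.0], [9.0, 9.0]])
print(wcss(X, assignments, centroids))  # 0.25 + 0.25 + 0.0 = 0.5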

Lloyd's Algorithm: Two Alternating Steps

1. Assignment Step

Assign each point to the nearest centroid:

c_i = \arg\min_j \|x_i - \mu_j\|^2
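
The assignment step is a nearest-centroid lookup. A vectorized sketch, assuming X is an (n, d) NumPy array and centroids is a (k, d) array (the function name assign_step is illustrative):

import numpy as np

def assign_step(X, centroids):
    # Squared distances ||x_i - mu_j||^2 as an (n, k) matrix
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)  # c_i = index of the nearest centroid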

2. Update Step

Recompute each centroid as the mean of its assigned points:

\mu_j = \frac{1}{|C_j|}\sum_{x_i \in C_j} x_i
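
The update step averages the points in each cluster. A sketch under the same assumptions as above; leaving an empty cluster's centroid unchanged is one common convention, not the only one:

import numpy as np

def update_step(X, assignments, centroids):
    new_centroids = centroids.copy()
    for j in range(len(centroids)):
        members = X[assignments == j]
        if len(members) > 0:                      # Leave empty clusters unchanged
            new_centroids[j] = members.mean(axis=0)
    return new_centroids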

Convergence

Each step decreases (or maintains) WCSS: the assignment step moves every point to its nearest centroid, and the update step replaces each centroid with the cluster mean, which minimizes the within-cluster sum of squares. Since WCSS is bounded below and there are only finitely many possible assignments, the algorithm is guaranteed to converge. However, it may converge to a local minimum rather than the global one:

\text{WCSS}^{(t+1)} \leq \text{WCSS}^{(t)}
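
The claim about the update step follows from a standard identity: for a fixed cluster $C_j$ with mean $\mu_j$ and any candidate center $\mu$, the cross terms cancel, so no center can beat the mean:

\sum_{x_i \in C_j} \|x_i - \mu\|^2 = \sum_{x_i \in C_j} \|x_i - \mu_j\|^2 + |C_j|\,\|\mu_j - \mu\|^2 \;\geq\; \sum_{x_i \in C_j} \|x_i - \mu_j\|^2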

K-Means++ Initialization

Random initialization can lead to poor local minima. K-Means++ selects initial centroids that are spread far apart, yielding an expected $O(\log k)$-competitive ratio with respect to the optimal WCSS.
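
A minimal sketch of the K-Means++ seeding rule: the first centroid is chosen uniformly at random, and each subsequent centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far. The function name kmeans_pp_init and its parameters are illustrative:

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]          # First centroid: uniform at random
    for _ in range(k - 1):
        # D(x)^2: squared distance from each point to its nearest chosen centroid
        diffs = X[:, None, :] - np.asarray(centroids)[None, :, :]
        d2 = np.min((diffs ** 2).sum(axis=2), axis=1)
        probs = d2 / d2.sum()                      # Sample proportional to D(x)^2
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.asarray(centroids)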

Phase 2

See It Work

Watch K-Means alternate between assign and update steps

Clusters & Centroids

[Interactive scatter plot of the data in the (x₁, x₂) plane, colored by cluster (Cluster 1, Cluster 2, Cluster 3), with centroids μ₁, μ₂, μ₃ marked.]

WCSS Over Iterations

[Interactive chart tracking the WCSS value at each iteration.]

Step 1 of 9 (phase: init; k = 3, with μ₁, μ₂, μ₃ initialized). Initialize 3 centroids at arbitrary positions, then alternate between assign and update steps.

Phase 3

The Code

Bridge from mathematical formulation to Python implementation

Mathematical Formulation

\mu_1, \ldots, \mu_k \leftarrow \text{init}

Initialize k centroids (random or heuristic)

\|x_i - \mu_j\|^2 \;\forall j

Compute distance from each point to each centroid

c_i = \arg\min_j \|x_i - \mu_j\|^2

Assign each point to its nearest centroid

\mu_j = \frac{1}{|C_j|}\sum_{x_i \in C_j} x_i

Update centroid to mean of assigned points

\mu^{(t+1)} = \mu^{(t)} \Rightarrow \text{stop}

Stop when centroids no longer change

Python Implementation

import numpy as np

def k_means(X, k=3, max_iter=100, tol=1e-6, seed=0):
    # X is expected to be an (n, d) NumPy array
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # Random init
    assignments = np.zeros(len(X), dtype=int)
    for iteration in range(max_iter):
        # Assignment step
        for i, x in enumerate(X):                          # For each point
            dists = [np.linalg.norm(x - c) for c in centroids]
            assignments[i] = int(np.argmin(dists))         # Nearest centroid
        # Update step
        new_centroids = centroids.copy()
        for j in range(k):                                 # For each cluster
            cluster = X[assignments == j]
            if len(cluster) > 0:                           # Skip empty clusters
                new_centroids[j] = cluster.mean(axis=0)    # New centroid
        if np.allclose(new_centroids, centroids, atol=tol):
            break                                          # Done: centroids unchanged
        centroids = new_centroids
    return assignments, centroids
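
A quick way to exercise the function above on synthetic data; the blob locations, scale, and seed here are illustrative:

import numpy as np

rng = np.random.default_rng(42)
# Three well-separated 2-D blobs of 50 points each
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])
labels, centers = k_means(X, k=3)
print(centers)  # Typically lands near (0, 0), (5, 5), and (0, 5), up to label order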