Principal Component Analysis
Reduce dimensionality by projecting data onto the directions of maximum variance.
The Mathematics
Variance maximization, covariance matrix, and eigendecomposition
Variance Maximization
PCA finds the unit direction $w$ that maximizes the variance of the projected data:

$$w^* = \arg\max_{\lVert w \rVert = 1} w^\top C w$$

This is equivalent to finding the leading eigenvector of the covariance matrix $C$.
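A quick numerical check of this equivalence (a sketch with NumPy on made-up correlated data, not part of the original): the leading eigenvector's projected variance should match the best over a fine grid of unit directions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated 2-D data (illustrative only)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])
X_c = X - X.mean(axis=0)
C = X_c.T @ X_c / (len(X) - 1)

# Leading eigenvector of C (eigh returns eigenvalues in ascending order)
w_star = np.linalg.eigh(C)[1][:, -1]
var_star = w_star @ C @ w_star

# Variance of the projection onto a fine grid of unit directions
angles = np.linspace(0.0, np.pi, 1000)
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # 1000 unit vectors
variances = np.einsum("ij,jk,ik->i", W, C, W)           # w^T C w per row
```

No grid direction beats `var_star`, which is exactly the largest eigenvalue of $C$.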
Covariance Matrix
First center the data: $X_c = X - \bar{x}$, where $\bar{x}$ is the per-feature mean. The sample covariance matrix is:

$$C = \frac{1}{n-1} X_c^\top X_c$$

Entry $C_{ij}$ measures how features $i$ and $j$ co-vary. Diagonal entries are the per-feature variances.
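In NumPy this is two lines (a sketch on random data; the array names are my own, not from the original):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))          # 50 samples, 3 features

X_c = X - X.mean(axis=0)              # center each feature
C = X_c.T @ X_c / (X.shape[0] - 1)    # d x d sample covariance
```

The result is symmetric, its diagonal equals the per-feature variances, and it matches `np.cov(X, rowvar=False)`.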
Eigendecomposition
Decompose the symmetric covariance matrix into its eigenvectors (principal directions) and eigenvalues (explained variances):

$$C = V \Lambda V^\top$$

Columns of $V$ are orthonormal eigenvectors; $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_d)$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$.
Projection
Project the centered data onto the top $k$ principal components to get the lower-dimensional representation:

$$Z = X_c V_k$$

where $V_k$ contains the first $k$ columns of $V$.
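The eigendecomposition and projection steps together look like this (a sketch on random data; note that `np.linalg.eigh`, the routine for symmetric matrices, returns eigenvalues in ascending order, so both outputs are reversed):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
X_c = X - X.mean(axis=0)
C = np.cov(X_c, rowvar=False)

# eigh handles symmetric matrices; reverse for descending eigenvalues
lam, V = np.linalg.eigh(C)
lam, V = lam[::-1], V[:, ::-1]

k = 2
Z = X_c @ V[:, :k]    # n x k projection onto the top-k components
```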
Choosing k — Scree Plot
Plot the eigenvalues in decreasing order and look for the "elbow" where the curve flattens. The fraction of variance explained by $k$ components is $\sum_{i=1}^{k} \lambda_i \,/\, \sum_{i=1}^{d} \lambda_i$. A common rule of thumb is to retain enough components to explain 95% of the total variance.
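The 95% rule reduces to a cumulative sum over the sorted eigenvalues (a sketch with made-up eigenvalues, not data from the original):

```python
import numpy as np

# Illustrative eigenvalues, already in decreasing order
lam = np.array([4.2, 2.1, 0.4, 0.2, 0.1])

ratio = np.cumsum(lam) / lam.sum()            # cumulative explained variance
k = int(np.searchsorted(ratio, 0.95)) + 1     # smallest k with ratio >= 95%
```

Here the first two components explain 90% of the variance and the third pushes the total past 95%, so `k` comes out as 3.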
See It Work
Watch PCA find the directions of maximum variance
Centered Data & Principal Components
Original dataset: 10 points with strong positive correlation.
The Code
Bridge from mathematical formulation to Python implementation
Mathematical Formulation
1. Compute the mean of each feature: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
2. Center the data by subtracting the mean: $X_c = X - \bar{x}$
3. Compute the sample covariance matrix: $C = \frac{1}{n-1} X_c^\top X_c$
4. Eigendecompose the covariance matrix: $C = V \Lambda V^\top$
5. Sort the eigenvalues (and their eigenvectors) in descending order
6. Select the top-$k$ eigenvectors as the principal components $V_k$
7. Project the centered data onto the principal components: $Z = X_c V_k$
8. Compute the fraction of variance explained by $k$ components: $\sum_{i=1}^{k} \lambda_i / \sum_{i=1}^{d} \lambda_i$
Python Implementation
import numpy as np

def pca(X, n_components=1):
    # Center the data
    X_mean = X.mean(axis=0)
    X_c = X - X_mean
    # Sample covariance matrix (d x d)
    n = X.shape[0]
    cov = (X_c.T @ X_c) / (n - 1)
    # Eigendecomposition for symmetric matrices (ascending eigenvalues)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Sort eigenvalues and matching eigenvectors in descending order
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]
    # Top-k eigenvectors are the principal components
    components = eigenvectors[:, :n_components]
    # Project centered data onto the components
    X_proj = X_c @ components
    # Fraction of total variance explained by each kept component
    total_var = eigenvalues.sum()
    explained = eigenvalues[:n_components] / total_var
    return X_proj, components, explained
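A minimal usage sketch (the data-generation code is my own, for illustration): on strongly correlated 2-D data, a single component should capture nearly all the variance.

```python
import numpy as np

# pca as defined above
def pca(X, n_components=1):
    X_mean = X.mean(axis=0)
    X_c = X - X_mean
    cov = (X_c.T @ X_c) / (X.shape[0] - 1)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[idx], eigenvectors[:, idx]
    components = eigenvectors[:, :n_components]
    X_proj = X_c @ components
    explained = eigenvalues[:n_components] / eigenvalues.sum()
    return X_proj, components, explained

# Two features that are almost linearly related
rng = np.random.default_rng(3)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 0.9 * t + 0.1 * rng.normal(size=(200, 1))])

X_proj, components, explained = pca(X, n_components=1)
```

Because the second feature is mostly a scaled copy of the first, `explained[0]` comes out close to 1 and `X_proj` has shape `(200, 1)`.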