Random Forest

Ensemble of decision trees trained on bootstrapped data with feature subsampling.

Phase 1

The Mathematics

Bootstrap aggregation, feature subsampling, and variance reduction

Bias-Variance Decomposition

Averaging $T$ uncorrelated models, each with variance $\sigma^2$, reduces the variance to $\sigma^2/T$. With pairwise correlation $\rho$ between trees, the variance of the averaged ensemble is

$$\text{Var}(\bar{h}) = \rho\sigma^2 + \frac{1-\rho}{T}\sigma^2 \;\xrightarrow{\;T\to\infty\;}\; \rho\sigma^2$$

Random feature subsampling decorrelates trees (reduces ρ\rho), so the ensemble variance shrinks even further than plain bagging.
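
A quick simulation makes this concrete. The sketch below (an illustration with synthetic numbers, not part of the original article) builds $T$ correlated Gaussian "tree errors" with variance $\sigma^2$ and pairwise correlation $\rho$, then checks that the variance of their average approaches $\rho\sigma^2$ rather than zero.

import numpy as np

rng = np.random.default_rng(0)
T, sigma, rho, trials = 200, 1.0, 0.3, 20_000

# Correlated "tree errors": a shared component plus a private one gives
# Var(h_t) = sigma^2 and Corr(h_s, h_t) = rho for s != t.
shared = rng.normal(0.0, sigma, size=(trials, 1))
private = rng.normal(0.0, sigma, size=(trials, T))
h = np.sqrt(rho) * shared + np.sqrt(1 - rho) * private

var_avg = h.mean(axis=1).var()
theory = rho * sigma**2 + (1 - rho) * sigma**2 / T
print(f"simulated: {var_avg:.4f}  theory: {theory:.4f}  limit rho*sigma^2: {rho * sigma**2:.4f}")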

Bootstrap Sampling

Each tree trains on a bootstrap sample: $n$ points drawn with replacement from the training set. On average, each sample omits about 36.8% of the original points (the OOB set):

$$P(\text{not sampled}) = \left(1 - \frac{1}{n}\right)^n \to e^{-1} \approx 0.368$$

The left-out OOB (out-of-bag) points serve as a free validation set.
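
As a sanity check, a few lines of NumPy (an illustrative sketch, not from the article) recover the $\approx 36.8\%$ out-of-bag fraction empirically:

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
idx = rng.integers(0, n, size=n)                # bootstrap: n draws with replacement
oob_fraction = 1 - np.unique(idx).size / n      # share of points never drawn
print(oob_fraction, (1 - 1 / n) ** n)           # both come out near 0.368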

Feature Subsampling

At each split, only a random subset of features is considered. For classification with $d$ features, the standard choice is:

$$m = \lfloor\sqrt{d}\rfloor$$

This prevents any single dominant feature from appearing in every tree, creating diverse, decorrelated trees.
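
To make the per-split mechanic concrete, here is a rough sketch of a node-splitting search restricted to $m=\lfloor\sqrt{d}\rfloor$ randomly chosen features. The helpers below are hypothetical and are not part of the DecisionTree used later in the article.

import numpy as np

def gini(labels):
    # Gini impurity of an integer label array
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split_with_subsampling(X, y, rng):
    # Search thresholds only over m = floor(sqrt(d)) randomly chosen features
    n, d = X.shape
    m = int(np.sqrt(d))
    candidates = rng.choice(d, size=m, replace=False)
    best = (None, None, np.inf)                 # (feature, threshold, weighted impurity)
    for j in candidates:
        for thr in np.unique(X[:, j]):
            left, right = y[X[:, j] <= thr], y[X[:, j] > thr]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, thr, score)
    return best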

Majority Vote

For classification, the final prediction is the class chosen by the most trees:

$$\hat{y}(x) = \text{mode}\{\hat{y}_t(x)\}_{t=1}^T$$

For regression, trees are averaged instead: $\hat{y}(x) = \frac{1}{T}\sum_{t=1}^T \hat{y}_t(x)$.
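
A toy aggregation example (illustrative numbers only, not from the article): rows are trees, columns are test points; classification takes the per-column mode, regression the per-column mean.

import numpy as np

class_preds = np.array([[0, 1, 1],              # tree 1
                        [1, 1, 0],              # tree 2
                        [0, 1, 1]])             # tree 3
majority = np.array([np.bincount(col).argmax() for col in class_preds.T])
print(majority)                                 # [0 1 1]

reg_preds = np.array([[2.0, 3.0],
                      [2.4, 2.6],
                      [1.6, 3.4]])
print(reg_preds.mean(axis=0))                   # [2. 3.]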

OOB Error

Each data point can be evaluated by the trees that did not train on it, giving an unbiased estimate of generalization error:

$$\text{OOB error} = \frac{1}{n}\sum_i \mathbf{1}[\hat{y}_{\text{OOB}}(x_i) \neq y_i]$$
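
A minimal sketch of the OOB estimate, assuming the forest has recorded a per-tree prediction matrix and an in-bag mask (both hypothetical names, not defined in the article):

import numpy as np

def oob_error(all_preds, in_bag, y):
    # all_preds: (n_trees, n) integer class predictions, one row per tree
    # in_bag:    (n_trees, n) boolean mask, True where point i was in tree t's bootstrap
    errors, counted = 0, 0
    for i in range(len(y)):
        votes = all_preds[~in_bag[:, i], i]     # only trees that never saw point i
        if votes.size == 0:                     # rare: every tree sampled this point
            continue
        y_oob = np.bincount(votes).argmax()     # majority vote among OOB trees
        errors += int(y_oob != y[i])
        counted += 1
    return errors / counted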

Free Feature Importance

Feature importance can be computed by measuring how much OOB error increases when a feature's values are randomly permuted — no separate validation set required.
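
Building on the oob_error sketch above, permutation importance for a single feature can be sketched as follows; predict_all_trees is a hypothetical stand-in for re-running every tree on a (possibly permuted) copy of X.

import numpy as np

def permutation_importance(X, y, predict_all_trees, in_bag, feature, rng):
    # Baseline OOB error vs. OOB error after shuffling one feature's column.
    baseline = oob_error(predict_all_trees(X), in_bag, y)
    X_perm = X.copy()
    X_perm[:, feature] = rng.permutation(X_perm[:, feature])   # break the feature-target link
    permuted = oob_error(predict_all_trees(X_perm), in_bag, y)
    return permuted - baseline                                 # large increase => important feature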

Phase 2

See It Work

Watch trees bootstrap, split, and vote to form the ensemble

[Interactive demo: a "Data & Decision Boundaries" panel plots 12 two-class points on the $x_1$–$x_2$ plane and overlays the growing ensemble's decision boundaries, while an "Ensemble Status" panel tracks how many of the 5 trees have been trained and the current phase. The 13-step animation starts from: Initialize: 5 trees, 12 data points, feature subsampling $m=\lfloor\sqrt{2}\rfloor=1$.]

Phase 3

The Code

Bridge from mathematical formulation to Python implementation

Mathematical Formulation

$$B_t = \{i_1,\ldots,i_n\},\; i_j \sim \text{Uniform}(1,n)$$

Bootstrap sample: draw n indices with replacement

$$m = \lfloor\sqrt{d}\rfloor$$

Number of features to consider at each split

$$\hat{y}_t = \text{tree}(X_{B_t})$$

Fit a depth-limited decision tree on the bootstrap sample

$$\hat{y}_t(x) = \text{tree}_t.\text{predict}(x)$$

Get prediction from tree t for all test points

$$\text{collect } \hat{y}_1,\ldots,\hat{y}_T$$

Stack predictions from all trees

$$\hat{y}(x) = \text{mode}\{\hat{y}_t(x)\}_{t=1}^T$$

Majority vote: final prediction is the most common class

Python Implementation

import numpy as np
from scipy.stats import mode


def random_forest(X, y, n_trees=5, max_depth=1):
    """Train n_trees depth-limited trees on bootstrap samples with feature subsampling."""
    trees = []
    for t in range(n_trees):
        # Bootstrap sample B_t: n indices drawn with replacement
        idx = np.random.choice(len(X), len(X), replace=True)
        X_boot, y_boot = X[idx], y[idx]
        # m = floor(sqrt(d)) features, sampled once per tree
        # (a simplification; classic random forests resample features at every split)
        max_features = int(np.sqrt(X.shape[1]))
        feat_idx = np.random.choice(X.shape[1], max_features, replace=False)
        # Fit a depth-limited decision tree on the bootstrap sample
        tree = DecisionTree(max_depth=max_depth)
        tree.fit(X_boot[:, feat_idx], y_boot)
        trees.append((tree, feat_idx))
    return trees


def predict(X, trees):
    """Majority vote over all trees' predictions."""
    all_preds = []
    for tree, feat_idx in trees:
        preds = tree.predict(X[:, feat_idx])    # each tree sees only its feature subset
        all_preds.append(preds)
    all_preds = np.array(all_preds)             # shape (n_trees, n_samples)
    # Most common class per column; keepdims=True requires SciPy >= 1.9
    return mode(all_preds, axis=0, keepdims=True).mode[0]
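
A small usage sketch, assuming scikit-learn's DecisionTreeClassifier as a stand-in for the DecisionTree class referenced above and a synthetic two-class dataset:

import numpy as np
from sklearn.tree import DecisionTreeClassifier as DecisionTree   # stand-in for DecisionTree above

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)         # simple synthetic two-class labels

trees = random_forest(X, y, n_trees=25, max_depth=3)
y_hat = predict(X, trees)
print("training accuracy:", (y_hat == y).mean())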