Random Forest

Ensemble of decision trees trained on bootstrapped data with feature subsampling.

Phase 1

The Mathematics

Bootstrap aggregation, feature subsampling, and variance reduction

Bias-Variance Decomposition

Averaging $T$ uncorrelated models, each with variance $\sigma^2$, reduces the variance to $\sigma^2/T$. With pairwise correlation $\rho$ between trees, the variance of the averaged ensemble is

$$\text{Var}(\bar{h}) = \rho\sigma^2 + \frac{1-\rho}{T}\sigma^2 \;\xrightarrow{\;T\to\infty\;}\; \rho\sigma^2$$

Random feature subsampling decorrelates trees (reduces ρ\rho), so the ensemble variance shrinks even further than plain bagging.
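
A quick simulation makes this concrete. The sketch below (an illustration with synthetic numbers, not part of the original article) builds $T$ correlated Gaussian "tree errors" with variance $\sigma^2$ and pairwise correlation $\rho$, then checks that the variance of their average approaches $\rho\sigma^2$ rather than zero.

import numpy as np

rng = np.random.default_rng(0)
T, sigma, rho, trials = 200, 1.0, 0.3, 20_000

# Correlated "tree errors": a shared component plus a private one gives
# Var(h_t) = sigma^2 and Corr(h_s, h_t) = rho for s != t.
shared = rng.normal(0.0, sigma, size=(trials, 1))
private = rng.normal(0.0, sigma, size=(trials, T))
h = np.sqrt(rho) * shared + np.sqrt(1 - rho) * private

var_avg = h.mean(axis=1).var()
theory = rho * sigma**2 + (1 - rho) * sigma**2 / T
print(f"simulated: {var_avg:.4f}  theory: {theory:.4f}  limit rho*sigma^2: {rho * sigma**2:.4f}")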

Bootstrap Sampling

Each tree trains on a bootstrap sample: $n$ points drawn with replacement from the training set. On average, each sample omits about 36.8% of the original points (the OOB set):

$$P(\text{not sampled}) = \left(1 - \frac{1}{n}\right)^n \to e^{-1} \approx 0.368$$

The left-out OOB (out-of-bag) points serve as a free validation set.
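
As a sanity check, a few lines of NumPy (an illustrative sketch, not from the article) recover the $\approx 36.8\%$ out-of-bag fraction empirically:

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
idx = rng.integers(0, n, size=n)                # bootstrap: n draws with replacement
oob_fraction = 1 - np.unique(idx).size / n      # share of points never drawn
print(oob_fraction, (1 - 1 / n) ** n)           # both come out near 0.368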

Feature Subsampling

At each split, only a random subset of features is considered. For classification with $d$ features, the standard choice is:

$$m = \lfloor\sqrt{d}\rfloor$$

This prevents any single dominant feature from appearing in every tree, creating diverse, decorrelated trees.
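
To make the per-split mechanic concrete, here is a rough sketch of a node-splitting search restricted to $m=\lfloor\sqrt{d}\rfloor$ randomly chosen features. The helpers below are hypothetical and are not part of the DecisionTree used later in the article.

import numpy as np

def gini(labels):
    # Gini impurity of an integer label array
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split_with_subsampling(X, y, rng):
    # Search thresholds only over m = floor(sqrt(d)) randomly chosen features
    n, d = X.shape
    m = int(np.sqrt(d))
    candidates = rng.choice(d, size=m, replace=False)
    best = (None, None, np.inf)                 # (feature, threshold, weighted impurity)
    for j in candidates:
        for thr in np.unique(X[:, j]):
            left, right = y[X[:, j] <= thr], y[X[:, j] > thr]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, thr, score)
    return best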

Majority Vote

For classification, the final prediction is the class chosen by the most trees:

$$\hat{y}(x) = \text{mode}\{\hat{y}_t(x)\}_{t=1}^T$$

For regression, trees are averaged instead: $\hat{y}(x) = \frac{1}{T}\sum_{t=1}^T \hat{y}_t(x)$.
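
A toy aggregation example (illustrative numbers only, not from the article): rows are trees, columns are test points; classification takes the per-column mode, regression the per-column mean.

import numpy as np

class_preds = np.array([[0, 1, 1],              # tree 1
                        [1, 1, 0],              # tree 2
                        [0, 1, 1]])             # tree 3
majority = np.array([np.bincount(col).argmax() for col in class_preds.T])
print(majority)                                 # [0 1 1]

reg_preds = np.array([[2.0, 3.0],
                      [2.4, 2.6],
                      [1.6, 3.4]])
print(reg_preds.mean(axis=0))                   # [2. 3.]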

OOB Error

Each data point can be evaluated by the trees that did not train on it, giving an unbiased estimate of generalization error:

$$\text{OOB error} = \frac{1}{n}\sum_i \mathbf{1}[\hat{y}_{\text{OOB}}(x_i) \neq y_i]$$
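
A minimal sketch of the OOB estimate, assuming the forest has recorded a per-tree prediction matrix and an in-bag mask (both hypothetical names, not defined in the article):

import numpy as np

def oob_error(all_preds, in_bag, y):
    # all_preds: (n_trees, n) integer class predictions, one row per tree
    # in_bag:    (n_trees, n) boolean mask, True where point i was in tree t's bootstrap
    errors, counted = 0, 0
    for i in range(len(y)):
        votes = all_preds[~in_bag[:, i], i]     # only trees that never saw point i
        if votes.size == 0:                     # rare: every tree sampled this point
            continue
        y_oob = np.bincount(votes).argmax()     # majority vote among OOB trees
        errors += int(y_oob != y[i])
        counted += 1
    return errors / counted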

Free Feature Importance

Feature importance can be computed by measuring how much OOB error increases when a feature's values are randomly permuted — no separate validation set required.
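
Building on the oob_error sketch above, permutation importance for a single feature can be sketched as follows; predict_all_trees is a hypothetical stand-in for re-running every tree on a (possibly permuted) copy of X.

import numpy as np

def permutation_importance(X, y, predict_all_trees, in_bag, feature, rng):
    # Baseline OOB error vs. OOB error after shuffling one feature's column.
    baseline = oob_error(predict_all_trees(X), in_bag, y)
    X_perm = X.copy()
    X_perm[:, feature] = rng.permutation(X_perm[:, feature])   # break the feature-target link
    permuted = oob_error(predict_all_trees(X_perm), in_bag, y)
    return permuted - baseline                                 # large increase => important feature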

Phase 2

See It Work

Watch trees bootstrap, split, and vote to form the ensemble

[Interactive demo: a "Data & Decision Boundaries" panel plots 12 two-class points on the $x_1$–$x_2$ plane and overlays the growing ensemble's decision boundaries, while an "Ensemble Status" panel tracks how many of the 5 trees have been trained and the current phase. The 13-step animation starts from: Initialize: 5 trees, 12 data points, feature subsampling $m=\lfloor\sqrt{2}\rfloor=1$.]

Phase 3

The Code

Bridge from mathematical formulation to Python implementation

Mathematical Formulation

$$B_t = \{i_1,\ldots,i_n\},\; i_j \sim \text{Uniform}(1,n)$$

Bootstrap sample: draw n indices with replacement

$$m = \lfloor\sqrt{d}\rfloor$$

Number of features to consider at each split

$$\hat{y}_t = \text{tree}(X_{B_t})$$

Fit a depth-limited decision tree on the bootstrap sample

$$\hat{y}_t(x) = \text{tree}_t.\text{predict}(x)$$

Get prediction from tree t for all test points

$$\text{collect } \hat{y}_1,\ldots,\hat{y}_T$$

Stack predictions from all trees

$$\hat{y}(x) = \text{mode}\{\hat{y}_t(x)\}_{t=1}^T$$

Majority vote: final prediction is the most common class

Python Implementation

import numpy as np
from scipy.stats import mode


def random_forest(X, y, n_trees=5, max_depth=1):
    """Train n_trees depth-limited trees on bootstrap samples with feature subsampling."""
    trees = []
    for t in range(n_trees):
        # Bootstrap sample B_t: n indices drawn with replacement
        idx = np.random.choice(len(X), len(X), replace=True)
        X_boot, y_boot = X[idx], y[idx]
        # m = floor(sqrt(d)) features, sampled once per tree
        # (a simplification; classic random forests resample features at every split)
        max_features = int(np.sqrt(X.shape[1]))
        feat_idx = np.random.choice(X.shape[1], max_features, replace=False)
        # Fit a depth-limited decision tree on the bootstrap sample
        tree = DecisionTree(max_depth=max_depth)
        tree.fit(X_boot[:, feat_idx], y_boot)
        trees.append((tree, feat_idx))
    return trees


def predict(X, trees):
    """Majority vote over all trees' predictions."""
    all_preds = []
    for tree, feat_idx in trees:
        preds = tree.predict(X[:, feat_idx])    # each tree sees only its feature subset
        all_preds.append(preds)
    all_preds = np.array(all_preds)             # shape (n_trees, n_samples)
    # Most common class per column; keepdims=True requires SciPy >= 1.9
    return mode(all_preds, axis=0, keepdims=True).mode[0]
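
A small usage sketch, assuming scikit-learn's DecisionTreeClassifier as a stand-in for the DecisionTree class referenced above and a synthetic two-class dataset:

import numpy as np
from sklearn.tree import DecisionTreeClassifier as DecisionTree   # stand-in for DecisionTree above

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)         # simple synthetic two-class labels

trees = random_forest(X, y, n_trees=25, max_depth=3)
y_hat = predict(X, trees)
print("training accuracy:", (y_hat == y).mean())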