Decision Tree

Recursively split feature space using information gain to build interpretable classifiers.

Classification
Phase 1

The Mathematics

Entropy, information gain, and recursive partitioning

Entropy

Entropy measures the impurity of a set of labels. A pure set (all same class) has entropy 0; a perfectly balanced set has maximum entropy:

$$H(S) = -\sum_{c=1}^{C} p_c \log_2 p_c$$

where $p_c = |S_c|/|S|$ is the proportion of class $c$ in set $S$.
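A minimal NumPy sketch of this formula (the `entropy` helper and the toy label arrays below are illustrative, not part of the original):

import numpy as np

def entropy(y):
    # H(S) = -sum_c p_c log2 p_c over the class proportions in y
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return -np.sum(p * np.log2(p))

print(entropy(np.array([0, 0, 1, 1])))   # 1.0: perfectly balanced set
print(entropy(np.array([0, 0, 0, 1])))   # ~0.811: mostly class 0
# A pure set such as [0, 0, 0, 0] gives entropy 0.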

Information Gain

We split on the feature and threshold that maximize the reduction in entropy:

$$IG(S, f, t) = H(S) - \frac{|S_L|}{|S|} H(S_L) - \frac{|S_R|}{|S|} H(S_R)$$

where $S_L = \{x \in S : x_f \leq t\}$ and $S_R = S \setminus S_L$.
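As a concrete (illustrative) example: if $S$ contains 10 samples split 5/5 between two classes, then $H(S) = 1$. A candidate split that sends 4 samples of class 0 and 1 of class 1 to $S_L$, and the remaining 1 and 4 to $S_R$, gives $H(S_L) = H(S_R) = -(0.8 \log_2 0.8 + 0.2 \log_2 0.2) \approx 0.722$, so

$$IG(S, f, t) = 1 - \tfrac{5}{10}(0.722) - \tfrac{5}{10}(0.722) \approx 0.278.$$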

Recursive Splitting

The tree is built top-down by greedily selecting the best split at each node, then recursing on each partition:

• Find the best split: $f^*, t^* = \arg\max_{f,t} IG(S, f, t)$
• Partition into $S_L$ and $S_R$ based on $x_{f^*} \leq t^*$
• Recurse on each partition until leaves are pure or the maximum depth is reached

Gini Impurity (Alternative)

An alternative to entropy, often used in CART trees:

$$G(S) = 1 - \sum_{c=1}^{C} p_c^2$$

Entropy vs. Gini

In practice, entropy and Gini produce very similar trees. Gini is slightly faster to compute (no logarithm). Scikit-learn uses Gini by default.
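As a rough sanity check, here is a short sketch comparing the two criteria on a few class distributions (the `gini` and `entropy` helpers below are illustrative and operate directly on probability vectors):

import numpy as np

def entropy(p):
    p = p[p > 0]                     # convention: 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

def gini(p):
    return 1 - np.sum(p ** 2)

for p in [np.array([0.5, 0.5]), np.array([0.75, 0.25]), np.array([0.9, 0.1])]:
    print(p, round(float(entropy(p)), 3), round(float(gini(p)), 3))
# [0.5  0.5 ] -> entropy 1.0,   gini 0.5
# [0.75 0.25] -> entropy 0.811, gini 0.375
# [0.9  0.1 ] -> entropy 0.469, gini 0.18
# Both hit 0 on a pure set and peak on a balanced one, so they
# usually rank candidate splits the same way.

In scikit-learn the same choice is exposed through the criterion argument of DecisionTreeClassifier ("gini" by default, "entropy" as the alternative).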

Phase 2

See It Work

Watch the tree grow and partition feature space

[Interactive demo: a Decision Tree panel alongside a Feature Space Partitions plot (axes x₁ and x₂, classes 0 and 1). Step 1 of 5: all 10 samples start in the root node, where $H(S) = -\sum_c p_c \log_2 p_c = 1$.]

Phase 3

The Code

Bridge from mathematical formulation to Python implementation

Mathematical Formulation

$$H(S) = 0 \Rightarrow \text{leaf}$$

If all samples have the same class, create a leaf node

$$IG(S, f, t) = H(S) - \frac{|S_L|}{|S|} H(S_L) - \frac{|S_R|}{|S|} H(S_R)$$

Compute information gain for each candidate split

$$f^*, t^* = \arg\max_{f,t} IG(S, f, t)$$

Select the feature and threshold with highest information gain

$$S_L = \{x \in S : x_{f^*} \leq t^*\}$$

Partition data: left child gets samples below threshold

$$\text{recurse}(S_L),\; \text{recurse}(S_R)$$

Recursively build subtrees for each partition

$$H(S) = -\sum_c p_c \log_2 p_c$$

Entropy measures impurity of a set of labels

Python Implementation

from collections import Counter, namedtuple
import numpy as np

Leaf = namedtuple("Leaf", "label")
Node = namedtuple("Node", "feature threshold left right")

def entropy(y):
    # H(S) = -sum_c p_c log2 p_c
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return -np.sum(p * np.log2(p))

def majority(y):
    # Most common class label in y
    return Counter(y).most_common(1)[0][0]

def build_tree(X, y, depth=0, max_depth=5):
    # X: (n, d) feature array; y: (n,) class labels
    if len(set(y)) == 1:                      # Pure node
        return Leaf(y[0])
    if depth >= max_depth:                    # Max depth reached
        return Leaf(majority(y))
    best_feat, best_thresh, best_gain = None, None, 0.0
    for feature in range(X.shape[1]):         # Try each feature
        for thresh in np.unique(X[:, feature]):
            gain = info_gain(y, X[:, feature], thresh)
            if gain > best_gain:              # Track best split
                best_gain = gain
                best_feat, best_thresh = feature, thresh
    if best_feat is None:                     # No split reduces entropy
        return Leaf(majority(y))
    left_mask = X[:, best_feat] <= best_thresh
    left = build_tree(X[left_mask], y[left_mask], depth + 1, max_depth)
    right = build_tree(X[~left_mask], y[~left_mask], depth + 1, max_depth)
    return Node(best_feat, best_thresh, left, right)

def info_gain(y, feature, threshold):
    H = entropy(y)                            # Parent entropy
    left = y[feature <= threshold]
    right = y[feature > threshold]
    if len(left) == 0 or len(right) == 0:     # Degenerate split
        return 0.0
    H_split = (len(left)*entropy(left) + len(right)*entropy(right)) / len(y)
    return H - H_split                        # Information gain
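A minimal usage sketch, assuming the definitions above; the `predict_one` helper and the four-point toy dataset are illustrative additions, not part of the original:

def predict_one(node, x):
    # Walk from the root to a leaf, following each node's threshold test
    while isinstance(node, Node):
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label

X = np.array([[1.0, 2.0], [1.5, 1.0], [4.0, 5.0], [4.5, 4.0]])
y = np.array([0, 0, 1, 1])
tree = build_tree(X, y, max_depth=3)
print([int(predict_one(tree, x)) for x in X])   # [0, 0, 1, 1]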