Decision Tree

Recursively split feature space using information gain to build interpretable classifiers.

Classification
Phase 1

The Mathematics

Entropy, information gain, and recursive partitioning

Entropy

Entropy measures the impurity of a set of labels. A pure set (all same class) has entropy 0; a perfectly balanced set has maximum entropy:

$$H(S) = -\sum_{c=1}^{C} p_c \log_2 p_c$$

where $p_c = |S_c|/|S|$ is the proportion of class $c$ in set $S$.
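A minimal NumPy sketch of this formula (the `entropy` helper and the toy label arrays below are illustrative, not part of the original):

import numpy as np

def entropy(y):
    # H(S) = -sum_c p_c log2 p_c over the class proportions in y
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return -np.sum(p * np.log2(p))

print(entropy(np.array([0, 0, 1, 1])))   # 1.0: perfectly balanced set
print(entropy(np.array([0, 0, 0, 1])))   # ~0.811: mostly class 0
# A pure set such as [0, 0, 0, 0] gives entropy 0.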

Information Gain

We split on the feature and threshold that maximize the reduction in entropy:

$$IG(S, f, t) = H(S) - \frac{|S_L|}{|S|} H(S_L) - \frac{|S_R|}{|S|} H(S_R)$$

where $S_L = \{x \in S : x_f \leq t\}$ and $S_R = S \setminus S_L$.
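As a concrete (illustrative) example: if $S$ contains 10 samples split 5/5 between two classes, then $H(S) = 1$. A candidate split that sends 4 samples of class 0 and 1 of class 1 to $S_L$, and the remaining 1 and 4 to $S_R$, gives $H(S_L) = H(S_R) = -(0.8 \log_2 0.8 + 0.2 \log_2 0.2) \approx 0.722$, so

$$IG(S, f, t) = 1 - \tfrac{5}{10}(0.722) - \tfrac{5}{10}(0.722) \approx 0.278.$$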

Recursive Splitting

The tree is built top-down by greedily selecting the best split at each node, then recursing on each partition:

• Find the best split: $f^*, t^* = \arg\max_{f,t} IG(S, f, t)$
• Partition into $S_L$ and $S_R$ based on $x_{f^*} \leq t^*$
• Recurse on each partition until leaves are pure or the maximum depth is reached

Gini Impurity (Alternative)

An alternative to entropy, often used in CART trees:

$$G(S) = 1 - \sum_{c=1}^{C} p_c^2$$

Entropy vs. Gini

In practice, entropy and Gini produce very similar trees. Gini is slightly faster to compute (no logarithm). Scikit-learn uses Gini by default.
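As a rough sanity check, here is a short sketch comparing the two criteria on a few class distributions (the `gini` and `entropy` helpers below are illustrative and operate directly on probability vectors):

import numpy as np

def entropy(p):
    p = p[p > 0]                     # convention: 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

def gini(p):
    return 1 - np.sum(p ** 2)

for p in [np.array([0.5, 0.5]), np.array([0.75, 0.25]), np.array([0.9, 0.1])]:
    print(p, round(float(entropy(p)), 3), round(float(gini(p)), 3))
# [0.5  0.5 ] -> entropy 1.0,   gini 0.5
# [0.75 0.25] -> entropy 0.811, gini 0.375
# [0.9  0.1 ] -> entropy 0.469, gini 0.18
# Both hit 0 on a pure set and peak on a balanced one, so they
# usually rank candidate splits the same way.

In scikit-learn the same choice is exposed through the criterion argument of DecisionTreeClassifier ("gini" by default, "entropy" as the alternative).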

Phase 2

See It Work

Watch the tree grow and partition feature space

[Interactive demo: a Decision Tree panel alongside a Feature Space Partitions plot (axes x₁ and x₂, classes 0 and 1). Step 1 of 5: all 10 samples start in the root node, where $H(S) = -\sum_c p_c \log_2 p_c = 1$.]

Phase 3

The Code

Bridge from mathematical formulation to Python implementation

Mathematical Formulation

$$H(S) = 0 \Rightarrow \text{leaf}$$

If all samples have the same class, create a leaf node

$$IG(S, f, t) = H(S) - \frac{|S_L|}{|S|} H(S_L) - \frac{|S_R|}{|S|} H(S_R)$$

Compute information gain for each candidate split

$$f^*, t^* = \arg\max_{f,t} IG(S, f, t)$$

Select the feature and threshold with highest information gain

$$S_L = \{x \in S : x_{f^*} \leq t^*\}$$

Partition data: left child gets samples below threshold

$$\text{recurse}(S_L),\; \text{recurse}(S_R)$$

Recursively build subtrees for each partition

$$H(S) = -\sum_c p_c \log_2 p_c$$

Entropy measures impurity of a set of labels

Python Implementation

from collections import Counter, namedtuple
import numpy as np

Leaf = namedtuple("Leaf", "label")
Node = namedtuple("Node", "feature threshold left right")

def entropy(y):
    # H(S) = -sum_c p_c log2 p_c
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return -np.sum(p * np.log2(p))

def majority(y):
    # Most common class label in y
    return Counter(y).most_common(1)[0][0]

def build_tree(X, y, depth=0, max_depth=5):
    # X: (n, d) feature array; y: (n,) class labels
    if len(set(y)) == 1:                      # Pure node
        return Leaf(y[0])
    if depth >= max_depth:                    # Max depth reached
        return Leaf(majority(y))
    best_feat, best_thresh, best_gain = None, None, 0.0
    for feature in range(X.shape[1]):         # Try each feature
        for thresh in np.unique(X[:, feature]):
            gain = info_gain(y, X[:, feature], thresh)
            if gain > best_gain:              # Track best split
                best_gain = gain
                best_feat, best_thresh = feature, thresh
    if best_feat is None:                     # No split reduces entropy
        return Leaf(majority(y))
    left_mask = X[:, best_feat] <= best_thresh
    left = build_tree(X[left_mask], y[left_mask], depth + 1, max_depth)
    right = build_tree(X[~left_mask], y[~left_mask], depth + 1, max_depth)
    return Node(best_feat, best_thresh, left, right)

def info_gain(y, feature, threshold):
    H = entropy(y)                            # Parent entropy
    left = y[feature <= threshold]
    right = y[feature > threshold]
    if len(left) == 0 or len(right) == 0:     # Degenerate split
        return 0.0
    H_split = (len(left)*entropy(left) + len(right)*entropy(right)) / len(y)
    return H - H_split                        # Information gain
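A minimal usage sketch, assuming the definitions above; the `predict_one` helper and the four-point toy dataset are illustrative additions, not part of the original:

def predict_one(node, x):
    # Walk from the root to a leaf, following each node's threshold test
    while isinstance(node, Node):
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label

X = np.array([[1.0, 2.0], [1.5, 1.0], [4.0, 5.0], [4.5, 4.0]])
y = np.array([0, 0, 1, 1])
tree = build_tree(X, y, max_depth=3)
print([int(predict_one(tree, x)) for x in X])   # [0, 0, 1, 1]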