
Decision Tree

Recursively split feature space using information gain to build interpretable classifiers.

Classification
Phase 1

The Mathematics

Entropy, information gain, and recursive partitioning

Entropy

Entropy measures the impurity of a set of labels. A pure set (all same class) has entropy 0; a perfectly balanced set has maximum entropy:

H(S) = -\sum_{c=1}^{C} p_c \log_2 p_c

where p_c = |S_c| / |S| is the proportion of class c in set S.
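As a quick check of the formula, a set of 10 labels split 6/4 between two classes gives

H(S) = -\left(0.6 \log_2 0.6 + 0.4 \log_2 0.4\right) \approx 0.971

while a 5/5 split yields exactly 1 bit and a 10/0 split yields 0.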

Information Gain

We split on the feature and threshold that maximize the reduction in entropy:

IG(S, f, t) = H(S) - \frac{|S_L|}{|S|}H(S_L) - \frac{|S_R|}{|S|}H(S_R)

where S_L = \{x \in S : x_f \leq t\} and S_R = S \setminus S_L.
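As a worked example, suppose the parent holds 5 samples of each class, so H(S) = 1, and a candidate split sends 4 samples of class 0 plus 1 of class 1 to S_L and the mirror image to S_R:

H(S_L) = H(S_R) = -\left(0.8 \log_2 0.8 + 0.2 \log_2 0.2\right) \approx 0.722

IG(S, f, t) = 1 - \tfrac{5}{10}(0.722) - \tfrac{5}{10}(0.722) \approx 0.278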

Recursive Splitting

The tree is built top-down by greedily selecting the best split at each node, then recursing on each partition:

• Find f^*, t^* = \arg\max_{f,t} IG(S, f, t)

• Split into S_L, S_R based on x_{f^*} \leq t^*

• Recurse until leaves are pure or max depth is reached

Gini Impurity (Alternative)

An alternative to entropy, often used in CART trees:

G(S) = 1 - \sum_{c=1}^{C} p_c^2

Entropy vs. Gini

In practice, entropy and Gini produce very similar trees. Gini is slightly faster to compute (no logarithm). Scikit-learn uses Gini by default.
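A minimal sketch comparing the two impurity measures on a few illustrative class distributions (the proportions are arbitrary examples; plain NumPy, no scikit-learn required):

import numpy as np

def entropy(p):
    p = p[p > 0]                                # skip empty classes so log2 is defined
    return max(0.0, -np.sum(p * np.log2(p)))    # H = -sum p_c log2 p_c (clamp -0.0)

def gini(p):
    return 1.0 - np.sum(p ** 2)                 # G = 1 - sum p_c^2

for p in [np.array([1.0, 0.0]),                 # pure node
          np.array([0.8, 0.2]),                 # mostly one class
          np.array([0.5, 0.5])]:                # perfectly mixed
    print(p, round(entropy(p), 3), round(gini(p), 3))

Both measures are 0 for a pure node, peak at the balanced split (1.0 vs. 0.5), and rank the intermediate case the same way, which is why they tend to select the same splits; in scikit-learn the choice is exposed through DecisionTreeClassifier's criterion parameter ("gini" by default, "entropy" as the alternative).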

Phase 2

See It Work

Watch the tree grow and partition feature space

[Interactive visualization: a "Decision Tree" panel and a "Feature Space Partitions" panel show the same data on axes x₁ and x₂, with points colored Class 0 and Class 1. Step 1 of 5: all 10 samples sit in the root node, so H(S) = -\sum_c p_c \log_2 p_c = 1.]

Phase 3

The Code

Bridge from mathematical formulation to Python implementation

Mathematical Formulation

H(S) = 0 \Rightarrow \text{leaf}

If all samples have the same class, create a leaf node

IG(S, f, t) = H(S) - \frac{|S_L|}{|S|}H(S_L) - \frac{|S_R|}{|S|}H(S_R)

Compute information gain for each candidate split

f^*, t^* = \arg\max_{f,t} IG(S, f, t)

Select the feature and threshold with highest information gain

S_L = \{x \in S : x_{f^*} \leq t^*\}

Partition data: left child gets samples below threshold

\text{recurse}(S_L),\; \text{recurse}(S_R)

Recursively build subtrees for each partition

H(S) = -\sum_c p_c \log_2 p_c

Entropy measures impurity of a set of labels

Python Implementation

import numpy as np
from collections import namedtuple

# Assumed helper structures the original listing references
Leaf = namedtuple("Leaf", ["label"])
Node = namedtuple("Node", ["feature", "threshold", "left", "right"])

def entropy(y):
    _, counts = np.unique(y, return_counts=True)   # class frequencies
    p = counts / len(y)
    return -np.sum(p * np.log2(p))                 # H(S) = -sum p_c log2 p_c

def majority(y):
    values, counts = np.unique(y, return_counts=True)
    return values[np.argmax(counts)]               # most common class label

def build_tree(X, y, depth=0, max_depth=5):
    if len(set(y)) == 1:                           # Pure node
        return Leaf(y[0])
    if depth >= max_depth:                         # Max depth reached
        return Leaf(majority(y))
    best_feat, best_thresh, best_gain = None, None, 0.0
    for feature in range(X.shape[1]):              # Try each feature
        for thresh in np.unique(X[:, feature]):
            gain = info_gain(y, X[:, feature], thresh)
            if gain > best_gain:                   # Track best split
                best_gain = gain
                best_feat, best_thresh = feature, thresh
    if best_feat is None:                          # No split reduces entropy
        return Leaf(majority(y))
    left_mask = X[:, best_feat] <= best_thresh
    left = build_tree(X[left_mask], y[left_mask], depth + 1, max_depth)
    right = build_tree(X[~left_mask], y[~left_mask], depth + 1, max_depth)
    return Node(best_feat, best_thresh, left, right)

def info_gain(y, feature, threshold):
    H = entropy(y)                                 # Parent entropy H(S)
    left = y[feature <= threshold]
    right = y[feature > threshold]
    if len(left) == 0 or len(right) == 0:          # Degenerate split: no gain
        return 0.0
    H_split = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
    return H - H_split                             # IG = H(S) - weighted child entropy
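As a quick end-to-end check, the sketch below adds a small predict helper and a tiny synthetic dataset; both are illustrative additions rather than part of the original listing, and they assume the Node/Leaf structures defined above:

import numpy as np

def predict(node, x):
    # Walk from the root to a leaf, following the learned thresholds
    while isinstance(node, Node):
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label

# Tiny illustrative dataset: class 0 clusters at small x1, class 1 at large x1
X = np.array([[1.0, 2.0], [1.5, 1.0], [2.0, 3.0],
              [5.0, 2.5], [5.5, 1.5], [6.0, 3.5]])
y = np.array([0, 0, 0, 1, 1, 1])

tree = build_tree(X, y, max_depth=3)
print([int(predict(tree, x)) for x in X])   # expected: [0, 0, 0, 1, 1, 1]

On this data the root split lands at x1 <= 2.0, which separates the classes perfectly, so both children become pure leaves after a single split.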