Decision Tree
Recursively split feature space using information gain to build interpretable classifiers.
The Mathematics
Entropy, information gain, and recursive partitioning
Entropy
Entropy measures the impurity of a set of labels. A pure set (all same class) has entropy 0; a perfectly balanced set has maximum entropy:

$$H(S) = -\sum_{k} p_k \log_2 p_k$$

where $p_k$ is the proportion of class $k$ in set $S$.
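As a quick sanity check, consider the root node of the demo below, assuming its 10 samples split 5/5 between the two classes:

$$H(S) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1 \text{ bit},$$

while a node containing only one class gives $H(S) = -1 \cdot \log_2 1 = 0$.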
Information Gain
We split on the feature $f$ and threshold $t$ that maximize the reduction in entropy:

$$\mathrm{IG}(S, f, t) = H(S) - \frac{|S_L|}{|S|}\,H(S_L) - \frac{|S_R|}{|S|}\,H(S_R)$$

where $S_L = \{x \in S : x_f \le t\}$ and $S_R = \{x \in S : x_f > t\}$.
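For a hypothetical split of a balanced 10-sample parent (5 per class) that sends 5 class-0 samples plus 1 class-1 sample left and the remaining 4 class-1 samples right:

$$\mathrm{IG} = 1 - \tfrac{6}{10}\,H(S_L) - \tfrac{4}{10}\cdot 0 \approx 1 - 0.6 \times 0.650 \approx 0.61,$$

since the right child is pure ($H(S_R) = 0$) and $H(S_L) = -\left(\tfrac{5}{6}\log_2\tfrac{5}{6} + \tfrac{1}{6}\log_2\tfrac{1}{6}\right) \approx 0.650$.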
Recursive Splitting
The tree is built top-down by greedily selecting the best split at each node, then recursing on each partition:
Find $(f^*, t^*) = \arg\max_{f,\,t} \mathrm{IG}(S, f, t)$
Split: $S \to S_L,\ S_R$ based on $(f^*, t^*)$
Recurse until leaves are pure or max depth is reached
Gini Impurity (Alternative)
An alternative to entropy, often used in CART trees:

$$G(S) = 1 - \sum_{k} p_k^2$$
Entropy vs. Gini
In practice, entropy and Gini produce very similar trees. Gini is slightly faster to compute (no logarithm). Scikit-learn uses Gini by default.
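A minimal sketch of the two impurity measures side by side, assuming integer class labels in a NumPy array (the 70/30 label split is made up for illustration):

import numpy as np

def entropy(y):
    p = np.bincount(y) / len(y)          # class proportions p_k
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(y):
    p = np.bincount(y) / len(y)
    return 1 - np.sum(p ** 2)

y = np.array([0] * 7 + [1] * 3)          # 70/30 class split
print(entropy(y))                        # ≈ 0.881 bits
print(gini(y))                           # ≈ 0.42

Both measures are maximized for a balanced set and reach 0 for a pure one, which is why they usually rank candidate splits the same way.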
See It Work
Watch the tree grow and partition feature space
[Interactive demo: "Decision Tree" and "Feature Space Partitions" panels. The demo starts with all 10 samples in the root node, entropy = 1.]
The Code
Bridge from mathematical formulation to Python implementation
Mathematical Formulation
If all samples have the same class, create a leaf node
Compute information gain for each candidate split
Select the feature and threshold with highest information gain
Partition the data: the left child gets samples at or below the threshold
Recursively build subtrees for each partition
Entropy, which measures the impurity of a set of labels, is computed by a small helper used inside info_gain
Python Implementation
import numpy as np
from collections import Counter, namedtuple

Leaf = namedtuple("Leaf", ["label"])                          # terminal node: predicted class
Node = namedtuple("Node", ["feature", "threshold", "left", "right"])

def entropy(y):
    if len(y) == 0:                                           # empty partition: zero entropy
        return 0.0
    p = np.bincount(y) / len(y)                               # class proportions p_k (integer labels)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def majority(y):
    return Counter(y).most_common(1)[0][0]                    # most frequent label

def build_tree(X, y, depth=0, max_depth=5):
    if len(set(y)) == 1:                                      # pure node
        return Leaf(y[0])
    if depth >= max_depth:                                    # max depth reached
        return Leaf(majority(y))
    best_feat, best_thresh, best_gain = None, None, 0.0
    for feature in range(X.shape[1]):                         # try each feature
        for thresh in np.unique(X[:, feature]):               # each observed value as a candidate threshold
            gain = info_gain(y, X[:, feature], thresh)
            if gain > best_gain:                              # track best split
                best_gain = gain
                best_feat, best_thresh = feature, thresh
    if best_feat is None:                                     # no split improves purity
        return Leaf(majority(y))
    left_mask = X[:, best_feat] <= best_thresh
    left = build_tree(X[left_mask], y[left_mask], depth + 1, max_depth)
    right = build_tree(X[~left_mask], y[~left_mask], depth + 1, max_depth)
    return Node(best_feat, best_thresh, left, right)

def info_gain(y, values, threshold):
    H = entropy(y)                                            # parent entropy
    left = y[values <= threshold]
    right = y[values > threshold]
    H_split = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
    return H - H_split                                        # information gain
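A minimal usage sketch, assuming the build_tree, Node, and Leaf definitions above and NumPy arrays with integer class labels; the toy data and the predict helper are hypothetical, for illustration only:

import numpy as np

X = np.array([[2.0, 1.0], [1.5, 2.2], [3.1, 0.5], [0.4, 1.8],
              [2.8, 2.9], [0.9, 0.3], [3.5, 2.1], [1.1, 1.1]])   # made-up 2-feature samples
y = np.array([0, 0, 1, 0, 1, 0, 1, 0])                           # made-up binary labels

tree = build_tree(X, y, max_depth=3)

def predict(node, x):
    while isinstance(node, Node):                                 # walk down until a leaf is reached
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label

print([int(predict(tree, x)) for x in X])                         # expect [0, 0, 1, 0, 1, 0, 1, 0]

On this tiny set a single split on the first feature separates the classes perfectly, so the tree reproduces the training labels.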