Neural Network
Multi-layer perceptron that learns nonlinear decision boundaries via backpropagation.
The Mathematics
Layered transformations, sigmoid activation, and backpropagation
Network Architecture
A multi-layer perceptron (MLP) stacks linear transformations with nonlinear activations. Each layer computes:
$$a^{(l)} = \sigma\left(W^{(l)} a^{(l-1)} + b^{(l)}\right)$$
where $a^{(0)} = x$ (the input). Stacking layers allows the network to compose simple functions into highly nonlinear decision boundaries.
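A minimal sketch of this composition in NumPy (the weight values here are arbitrary, chosen only to show the shape of the computation):

import numpy as np

def layer(a_prev, W, b):
    # one MLP layer: linear transform followed by a sigmoid nonlinearity
    return 1 / (1 + np.exp(-(W @ a_prev + b)))

x = np.array([0.5, -1.0])                            # input a^(0)
W1, b1 = np.array([[1.0, -2.0], [0.5, 3.0]]), np.zeros(2)
W2, b2 = np.array([[2.0, -1.0]]), np.zeros(1)

a1 = layer(x, W1, b1)                                # hidden activations a^(1)
a2 = layer(a1, W2, b2)                               # output a^(2): a composition of simple functions
print(a1, a2)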
Sigmoid Activation
The sigmoid $\sigma(z) = \frac{1}{1 + e^{-z}}$ squashes any real value into $(0, 1)$ and has a convenient derivative:
$$\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$$
The self-referential derivative means we can reuse already-computed activations during backpropagation — no redundant computation.
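A small sketch of that reuse (the helper name sigmoid_prime_from_activation is illustrative, not from the implementation below):

import numpy as np

def sigmoid(z):
    # squash to (0, 1)
    return 1 / (1 + np.exp(-z))

def sigmoid_prime_from_activation(a):
    # derivative expressed via the activation itself: sigma'(z) = a * (1 - a)
    return a * (1 - a)

z = np.array([-2.0, 0.0, 2.0])
a = sigmoid(z)                                 # forward pass stores the activation
print(sigmoid_prime_from_activation(a))        # backward pass reuses it, no extra exp() calls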
Forward Pass & Loss
For multi-class classification, the final layer uses softmax to turn logits into probabilities, and the categorical cross-entropy loss penalizes confident errors across all classes:
$$\hat{y}_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik}$$
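A quick numerical check of these two formulas on a single made-up sample:

import numpy as np

def softmax(z):
    # subtract the row-wise max for numerical stability before exponentiating
    exp_z = np.exp(z - z.max(axis=1, keepdims=True))
    return exp_z / exp_z.sum(axis=1, keepdims=True)

logits = np.array([[2.0, 0.5, -1.0, 0.0]])     # one sample, K = 4 classes (illustrative values)
y_true = np.array([[1, 0, 0, 0]])              # one-hot label: class 0

probs = softmax(logits)
loss = -np.mean(np.sum(y_true * np.log(probs), axis=1))
print(probs)   # rows sum to 1
print(loss)    # small, because the correct class already has the largest logit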
Backpropagation
Gradients flow backwards via the chain rule. The weight gradient for layer $l$ is:
$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \left(a^{(l-1)}\right)^{T}$$
where the error signal propagates backward: $\delta^{(l)} = \left(W^{(l+1)}\right)^{T} \delta^{(l+1)} \odot \sigma'\left(z^{(l)}\right)$.
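These formulas can be sanity-checked numerically. The sketch below (my own illustration, written in the row-vector layout used by the implementation further down) compares the analytic gradient of the first-layer weights against a finite-difference estimate on a tiny network:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 2))                    # one sample with 2 features (row vector)
y = np.array([[1.0, 0.0]])                     # one-hot target over 2 classes
W1, b1 = rng.normal(size=(2, 3)) * 0.3, np.zeros(3)
W2, b2 = rng.normal(size=(3, 2)) * 0.3, np.zeros(2)

def forward_loss(W1):
    h = 1 / (1 + np.exp(-(x @ W1 + b1)))       # hidden activations
    z = h @ W2 + b2                            # logits
    p = np.exp(z - z.max()); p = p / p.sum()   # softmax for a single sample
    return -np.sum(y * np.log(p)), h, p

loss, h, p = forward_loss(W1)
delta2 = p - y                                 # output error signal
delta1 = (delta2 @ W2.T) * h * (1 - h)         # error backpropagated through the sigmoid
dW1_analytic = x.T @ delta1                    # dL/dW1 from the chain rule

# finite-difference estimate of one entry of dL/dW1
eps = 1e-5
W1_pert = W1.copy()
W1_pert[0, 0] += eps
numeric = (forward_loss(W1_pert)[0] - loss) / eps
print(dW1_analytic[0, 0], numeric)             # the two estimates should agree closely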
Universal Approximation
A single hidden layer with enough neurons can approximate any continuous function on a compact domain to arbitrary precision: for any continuous $f$ on a compact set $K \subset \mathbb{R}^d$ and any $\varepsilon > 0$, there exist a width $N$, weights $w_i, b_i$, and coefficients $c_i$ such that
$$\sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} c_i \, \sigma\left(w_i^{T} x + b_i\right) \right| < \varepsilon$$
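To make this concrete, here is a small sketch (not part of the original demo) that approximates sin(x) with a single hidden layer of sigmoid units: the hidden weights are random, only the output coefficients are fit by least squares, and the error typically falls as the width grows.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 400).reshape(-1, 1)   # compact domain
f = np.sin(x).ravel()                                # target continuous function

for N in (5, 50, 500):                               # hidden width
    W = rng.normal(scale=2.0, size=(1, N))           # random hidden weights
    b = rng.uniform(-np.pi, np.pi, size=N)
    H = sigmoid(x @ W + b)                           # hidden activations, shape (400, N)
    c, *_ = np.linalg.lstsq(H, f, rcond=None)        # fit only the output coefficients
    err = np.max(np.abs(H @ c - f))                  # sup-norm error on the grid
    print(N, round(err, 4))                          # error typically shrinks as width grows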
Use ReLU in Practice
Although sigmoid is intuitive, production networks prefer $\mathrm{ReLU}(z) = \max(0, z)$ in hidden layers. ReLU avoids the vanishing gradient problem that plagues deep sigmoid networks: its derivative is 1 for positive inputs, so gradients don't saturate the way sigmoid's do when activations sit near 0 or 1.
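A toy illustration of why (ignoring weight matrices and just chaining activation derivatives through 20 layers; the value z = 1.0 is arbitrary):

import numpy as np

depth = 20
z = 1.0                                   # a pre-activation value on the forward path

sig = 1 / (1 + np.exp(-z))
sig_grad = (sig * (1 - sig)) ** depth     # product of sigmoid derivatives over 20 layers
relu_grad = 1.0 ** depth                  # ReLU derivative is 1 wherever z > 0

print(sig_grad)    # roughly 0.2^20, vanishingly small
print(relu_grad)   # stays 1.0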
See It Work
Watch the network learn to separate 4 quadrant clusters
[Interactive demo: network architecture (2 → 4 → 4), decision region, and cross-entropy loss curve. Training starts at epoch 0 with loss 1.386 after weight initialization.]
The Code
Bridge from mathematical formulation to Python implementation
Mathematical Formulation
Initialize hidden layer weights randomly: $W^{(1)} \sim 0.3 \cdot \mathcal{N}(0, I)$, $b^{(1)} = 0$
Forward pass through hidden layer with sigmoid: $H = \sigma(X W^{(1)} + b^{(1)})$
Compute raw output scores for each class: $Z = H W^{(2)} + b^{(2)}$
Convert logits to class probabilities via softmax: $\hat{Y}_{ik} = e^{Z_{ik}} / \sum_{j} e^{Z_{ij}}$
Categorical cross-entropy loss over all classes: $L = -\frac{1}{n} \sum_{i} \sum_{k} Y_{ik} \log \hat{Y}_{ik}$
Output layer error signal (averaged over batch): $\delta^{(2)} = (\hat{Y} - Y) / n$
Backpropagate error through hidden layer: $\delta^{(1)} = \delta^{(2)} (W^{(2)})^{T} \odot H \odot (1 - H)$
Update all weights via gradient descent: $W^{(2)} \leftarrow W^{(2)} - \eta\, H^{T} \delta^{(2)}$, $W^{(1)} \leftarrow W^{(1)} - \eta\, X^{T} \delta^{(1)}$ (and likewise for the biases)
Python Implementation
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    # subtract the row-wise max for numerical stability
    exp_z = np.exp(z - z.max(axis=1, keepdims=True))
    return exp_z / exp_z.sum(axis=1, keepdims=True)

def mlp_train(X, y, hidden=4, lr=0.1, epochs=200):
    n, d = X.shape
    K = 4                                     # number of classes (4 quadrant clusters)
    W1 = np.random.randn(d, hidden) * 0.3     # small random init for hidden weights
    b1 = np.zeros(hidden)
    W2 = np.random.randn(hidden, K) * 0.3     # small random init for output weights
    b2 = np.zeros(K)
    Y = np.eye(K)[y]                          # one-hot encode labels
    losses = []
    for epoch in range(epochs):
        # forward pass
        h = sigmoid(X @ W1 + b1)              # hidden activations
        logits = h @ W2 + b2                  # raw class scores
        y_hat = softmax(logits)               # class probabilities
        loss = -np.mean(np.sum(Y * np.log(y_hat + 1e-8), axis=1))
        losses.append(loss)
        # backward pass
        d_out = (y_hat - Y) / n               # output error, averaged over the batch
        dW2 = h.T @ d_out
        db2 = d_out.sum(axis=0)
        d_hid = (d_out @ W2.T) * h * (1 - h)  # backpropagate through the sigmoid
        dW1 = X.T @ d_hid
        db1 = d_hid.sum(axis=0)
        # gradient descent update
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1, W2, b2, losses
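A usage sketch, continuing from the implementation above and matching the 4-quadrant demo; the data generation here is an assumption about the demo's setup, not taken from it:

# generate 4 Gaussian clusters, one per quadrant, labeled 0-3
rng = np.random.default_rng(0)
centers = np.array([[1, 1], [-1, 1], [-1, -1], [1, -1]], dtype=float)
X = np.vstack([c + 0.3 * rng.normal(size=(50, 2)) for c in centers])
y = np.repeat(np.arange(4), 50)

W1, b1, W2, b2, losses = mlp_train(X, y, hidden=4, lr=0.1, epochs=200)

# predict by taking the argmax of the output probabilities
probs = softmax(sigmoid(X @ W1 + b1) @ W2 + b2)
accuracy = np.mean(probs.argmax(axis=1) == y)
print(f"final loss {losses[-1]:.3f}, training accuracy {accuracy:.2%}")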