Neural Network

Multi-layer perceptron that learns nonlinear decision boundaries via backpropagation.

Deep Learning
Phase 1

The Mathematics

Layered transformations, sigmoid activation, and backpropagation

Network Architecture

A multi-layer perceptron (MLP) stacks linear transformations with nonlinear activations. Each layer $l$ computes:

$h^{(l)} = \sigma\!\left(W^{(l)} h^{(l-1)} + b^{(l)}\right)$

where $h^{(0)} = x$ (the input). Stacking layers allows the network to compose simple functions into highly nonlinear decision boundaries.
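To make the layer recursion concrete, here is a minimal forward-pass sketch with illustrative widths (2 inputs, hidden widths 4 and 3), written in the row-major `X @ W` convention that the Phase 3 code uses rather than the column-vector $W h$ of the formula:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Batch of 5 samples with 2 features; hidden widths 4 and 3 are illustrative
x = rng.normal(size=(5, 2))                    # h^(0) = x
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 3)), np.zeros(3)

h1 = sigmoid(x @ W1 + b1)   # h^(1), shape (5, 4)
h2 = sigmoid(h1 @ W2 + b2)  # h^(2), shape (5, 3)
```

Each layer maps the whole batch at once; the feature dimension changes from 2 to 4 to 3 while the batch dimension stays 5.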

Sigmoid Activation

The sigmoid squashes any real value into $(0,1)$ and has a convenient derivative:

$\sigma(z) = \dfrac{1}{1+e^{-z}}, \qquad \sigma'(z) = \sigma(z)\bigl(1-\sigma(z)\bigr)$

Because the derivative is expressed in terms of $\sigma(z)$ itself, backpropagation can reuse the activations already computed in the forward pass instead of re-evaluating the exponential.
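A quick numerical check of that identity, comparing the activation-based form against a central finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4, 4, 9)
s = sigmoid(z)                       # forward-pass activations

# Reuse the cached activations: sigma'(z) = s * (1 - s)
grad_from_activation = s * (1 - s)

# Central-difference approximation of the derivative for comparison
eps = 1e-6
grad_numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
```

The two agree to roughly 1e-10, and the maximum value $\sigma'(0) = 0.25$ appears at $z = 0$.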

Forward Pass & Loss

For multi-class classification, the final layer uses softmax and the categorical cross-entropy loss penalizes confident errors across all $K$ classes:

$\mathcal{L} = -\dfrac{1}{n}\displaystyle\sum_{i=1}^{n}\sum_{k=0}^{K-1} Y_{ik}\log\hat{Y}_{ik}$
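Because the labels $Y$ are one-hot, the inner sum picks out $-\log$ of the probability assigned to the true class. A small worked example with hypothetical prediction rows:

```python
import numpy as np

# One-hot labels for 3 samples over K = 4 classes (true classes 0, 2, 3)
Y = np.eye(4)[[0, 2, 3]]

# Hypothetical predicted probability rows (each sums to 1)
Y_hat = np.array([
    [0.70, 0.10, 0.10, 0.10],   # confident and correct
    [0.25, 0.25, 0.25, 0.25],   # maximally uncertain
    [0.70, 0.10, 0.10, 0.10],   # confident and wrong (true class is 3)
])

# Per-sample loss: -log of the probability given to the true class
per_sample = -np.sum(Y * np.log(Y_hat), axis=1)
loss = per_sample.mean()
```

The uniform row costs exactly $\ln 4 \approx 1.386$, and the confident wrong prediction is penalized far more heavily than the confident correct one.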

Backpropagation

Gradients flow backwards via the chain rule. The weight gradient for layer $l$ is:

$\dfrac{\partial\mathcal{L}}{\partial W^{(l)}} = \dfrac{1}{n}\, h^{(l-1)\top} \delta^{(l)}$

where the error signal $\delta^{(l)}$ propagates backward: $\delta^{(l)} = \left(W^{(l+1)\top}\delta^{(l+1)}\right) \odot h^{(l)} \odot \left(1 - h^{(l)}\right)$, using the identity $\sigma'(z) = \sigma(z)(1-\sigma(z))$ so that the derivative is written directly in terms of the stored activations.
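The delta recursion can be verified numerically with a finite-difference check on a tiny network. This sketch uses a squared-error loss rather than the cross-entropy above, purely so the check stays self-contained; the backward recursion is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Tiny 2 -> 3 -> 2 network with random data and targets
X = rng.normal(size=(5, 2))
T = rng.normal(size=(5, 2))
W1 = rng.normal(size=(2, 3))
W2 = rng.normal(size=(3, 2))

def loss(W1, W2):
    h = sigmoid(X @ W1)
    return 0.5 * np.mean(np.sum((h @ W2 - T) ** 2, axis=1))

# Analytic gradient of W1 via the delta recursion
n = X.shape[0]
h = sigmoid(X @ W1)
delta2 = (h @ W2 - T) / n                # output error signal
delta1 = (delta2 @ W2.T) * h * (1 - h)   # backpropagated through the sigmoid
dW1 = X.T @ delta1

# Finite-difference check on a single entry of dW1
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps
W1m[0, 0] -= eps
numeric = (loss(W1p, W2) - loss(W1m, W2)) / (2 * eps)
```

If the recursion is implemented correctly, `dW1[0, 0]` and `numeric` agree to within the finite-difference error.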

Universal Approximation

A single hidden layer with enough neurons can approximate any continuous function on a compact domain to arbitrary precision:

$\forall\,\epsilon > 0,\ \exists\, N:\ \sup_x \left|f(x) - h_N(x)\right| < \epsilon$

Use ReLU in Practice

Although sigmoid is intuitive, production networks prefer $\text{ReLU}(z) = \max(0, z)$ in hidden layers. ReLU avoids the vanishing-gradient problem that plagues deep sigmoid networks: its gradient is exactly 1 on the active side, whereas the sigmoid's gradient shrinks toward zero whenever its output saturates near 0 or 1.
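A short comparison of the two gradients makes the saturation effect visible:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.5, 10.0])

# Sigmoid gradient: at most 0.25 (at z = 0), about 4.5e-5 at |z| = 10
sig_grad = sigmoid(z) * (1 - sigmoid(z))

# ReLU gradient: exactly 1 for positive inputs, 0 otherwise
relu_grad = (z > 0).astype(float)
```

Chaining many sigmoid layers multiplies the backward signal by a factor of at most 0.25 per layer, so it shrinks geometrically with depth; ReLU's unit gradient on its active side avoids that decay.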

Phase 2

See It Work

Watch the network learn to separate 4 quadrant clusters

Network Architecture (2 → 4 → 4)

[Diagram: inputs x₁, x₂ → hidden units h₁–h₄ → outputs ŷ₀–ŷ₃]

Decision Region


Cross-Entropy Loss

[Plot: CE loss vs. training iteration]

[Interactive animation, 12 steps. Step 1 of 12 initializes the weights, $W^{(1)} \sim \mathcal{N}(0, 0.3^2)$; at epoch 0 the loss is 1.386 ($= \ln 4$, the cross-entropy of a uniform guess over 4 classes).]

Phase 3

The Code

Bridge from mathematical formulation to Python implementation

Mathematical Formulation

$W^{(1)} \sim \mathcal{N}(0,\, 0.3^2),\quad b^{(1)} = \mathbf{0}$

Initialize hidden layer weights randomly

$h = \sigma(XW^{(1)} + b^{(1)})$

Forward pass through hidden layer with sigmoid

$\text{logits} = hW^{(2)} + b^{(2)}$

Compute raw output scores for each class

$\hat{Y} = \text{softmax}(\text{logits})$

Convert logits to class probabilities via softmax

$\mathcal{L} = -\dfrac{1}{n}\displaystyle\sum_{i,k} Y_{ik}\log\hat{Y}_{ik}$

Categorical cross-entropy loss over all classes

$\delta^{(2)} = (\hat{Y} - Y)/n$

Output layer error signal (averaged over batch)

$\delta^{(1)} = \left(\delta^{(2)} W^{(2)\top}\right) \odot h \odot (1 - h)$

Backpropagate error through hidden layer

$W^{(1)} \leftarrow W^{(1)} - \eta\, \nabla_{W^{(1)}} \mathcal{L}$

Update all weights via gradient descent

Python Implementation

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    # Shift by the row max for numerical stability before exponentiating
    exp_z = np.exp(z - z.max(axis=1, keepdims=True))
    return exp_z / exp_z.sum(axis=1, keepdims=True)

def mlp_train(X, y, hidden=4, lr=0.1, epochs=200):
    n, d = X.shape
    K = int(y.max()) + 1                    # number of classes
    W1 = np.random.randn(d, hidden) * 0.3   # W1 ~ N(0, 0.3^2)
    b1 = np.zeros(hidden)
    W2 = np.random.randn(hidden, K) * 0.3
    b2 = np.zeros(K)
    Y = np.eye(K)[y]                        # one-hot labels
    losses = []
    for epoch in range(epochs):
        # Forward pass
        h = sigmoid(X @ W1 + b1)
        logits = h @ W2 + b2
        y_hat = softmax(logits)
        loss = -np.mean(np.sum(Y * np.log(y_hat + 1e-8), axis=1))
        losses.append(loss)
        # Backward pass: the delta recursions from the formulation above
        d_out = (y_hat - Y) / n
        dW2 = h.T @ d_out
        db2 = d_out.sum(axis=0)
        d_hid = (d_out @ W2.T) * h * (1 - h)   # sigma'(z) = h * (1 - h)
        dW1 = X.T @ d_hid
        db1 = d_hid.sum(axis=0)
        # Gradient descent update
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1, W2, b2, losses
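As an end-to-end sanity check, the same forward/backward steps can be exercised on synthetic quadrant clusters like the ones in Phase 2. The cluster centers, spread, and hyperparameters below are illustrative assumptions, and the update loop is repeated inline so the snippet runs standalone:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Four Gaussian clusters, one per quadrant, centered on the origin
centers = np.array([[-2.0, -2.0], [2.0, -2.0], [-2.0, 2.0], [2.0, 2.0]])
X = np.vstack([c + rng.normal(scale=0.6, size=(50, 2)) for c in centers])
y = np.repeat(np.arange(4), 50)

# Same steps as mlp_train, inlined (wider hidden layer, larger lr, more epochs)
n, hidden, K, lr = len(X), 8, 4, 0.5
W1 = rng.normal(scale=0.3, size=(2, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.3, size=(hidden, K)); b2 = np.zeros(K)
Y = np.eye(K)[y]

for _ in range(3000):
    h = sigmoid(X @ W1 + b1)
    y_hat = softmax(h @ W2 + b2)
    d_out = (y_hat - Y) / n
    d_hid = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_hid); b1 -= lr * d_hid.sum(axis=0)

pred = softmax(sigmoid(X @ W1 + b1) @ W2 + b2).argmax(axis=1)
accuracy = (pred == y).mean()
```

With well-separated clusters the network should classify nearly all training points correctly, and the loss should fall well below the uniform-guess value of $\ln 4 \approx 1.386$.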