Neural Network

Multi-layer perceptron that learns nonlinear decision boundaries via backpropagation.

Deep Learning
Phase 1

The Mathematics

Layered transformations, sigmoid activation, and backpropagation

Network Architecture

A multi-layer perceptron (MLP) stacks linear transformations with nonlinear activations. Each layer $l$ computes:

$h^{(l)} = \sigma\!\left(W^{(l)} h^{(l-1)} + b^{(l)}\right)$

where $h^{(0)} = x$ (the input). Stacking layers allows the network to compose simple functions into highly nonlinear decision boundaries.
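To make the layer recursion concrete, here is a minimal forward-pass sketch with illustrative widths (2 inputs, hidden widths 4 and 3), written in the row-major `X @ W` convention that the Phase 3 code uses rather than the column-vector $W h$ of the formula:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Batch of 5 samples with 2 features; hidden widths 4 and 3 are illustrative
x = rng.normal(size=(5, 2))                    # h^(0) = x
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 3)), np.zeros(3)

h1 = sigmoid(x @ W1 + b1)   # h^(1), shape (5, 4)
h2 = sigmoid(h1 @ W2 + b2)  # h^(2), shape (5, 3)
```

Each layer maps the whole batch at once; the feature dimension changes from 2 to 4 to 3 while the batch dimension stays 5.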

Sigmoid Activation

The sigmoid squashes any real value into $(0,1)$ and has a convenient derivative:

$\sigma(z) = \dfrac{1}{1+e^{-z}}, \qquad \sigma'(z) = \sigma(z)\bigl(1-\sigma(z)\bigr)$

Because the derivative is expressed in terms of $\sigma(z)$ itself, backpropagation can reuse the activations already computed in the forward pass instead of re-evaluating the exponential.
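A quick numerical check of that identity, comparing the activation-based form against a central finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4, 4, 9)
s = sigmoid(z)                       # forward-pass activations

# Reuse the cached activations: sigma'(z) = s * (1 - s)
grad_from_activation = s * (1 - s)

# Central-difference approximation of the derivative for comparison
eps = 1e-6
grad_numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
```

The two agree to roughly 1e-10, and the maximum value $\sigma'(0) = 0.25$ appears at $z = 0$.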

Forward Pass & Loss

For multi-class classification, the final layer uses softmax and the categorical cross-entropy loss penalizes confident errors across all $K$ classes:

$\mathcal{L} = -\dfrac{1}{n}\displaystyle\sum_{i=1}^{n}\sum_{k=0}^{K-1} Y_{ik}\log\hat{Y}_{ik}$
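Because the labels $Y$ are one-hot, the inner sum picks out $-\log$ of the probability assigned to the true class. A small worked example with hypothetical prediction rows:

```python
import numpy as np

# One-hot labels for 3 samples over K = 4 classes (true classes 0, 2, 3)
Y = np.eye(4)[[0, 2, 3]]

# Hypothetical predicted probability rows (each sums to 1)
Y_hat = np.array([
    [0.70, 0.10, 0.10, 0.10],   # confident and correct
    [0.25, 0.25, 0.25, 0.25],   # maximally uncertain
    [0.70, 0.10, 0.10, 0.10],   # confident and wrong (true class is 3)
])

# Per-sample loss: -log of the probability given to the true class
per_sample = -np.sum(Y * np.log(Y_hat), axis=1)
loss = per_sample.mean()
```

The uniform row costs exactly $\ln 4 \approx 1.386$, and the confident wrong prediction is penalized far more heavily than the confident correct one.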

Backpropagation

Gradients flow backwards via the chain rule. The weight gradient for layer $l$ is:

$\dfrac{\partial\mathcal{L}}{\partial W^{(l)}} = \dfrac{1}{n}\, h^{(l-1)\top} \delta^{(l)}$

where the error signal $\delta^{(l)}$ propagates backward: $\delta^{(l)} = \left(W^{(l+1)\top}\delta^{(l+1)}\right) \odot h^{(l)} \odot \left(1 - h^{(l)}\right)$, using the identity $\sigma'(z) = \sigma(z)(1-\sigma(z))$ so that the derivative is written directly in terms of the stored activations.
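The delta recursion can be verified numerically with a finite-difference check on a tiny network. This sketch uses a squared-error loss rather than the cross-entropy above, purely so the check stays self-contained; the backward recursion is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Tiny 2 -> 3 -> 2 network with random data and targets
X = rng.normal(size=(5, 2))
T = rng.normal(size=(5, 2))
W1 = rng.normal(size=(2, 3))
W2 = rng.normal(size=(3, 2))

def loss(W1, W2):
    h = sigmoid(X @ W1)
    return 0.5 * np.mean(np.sum((h @ W2 - T) ** 2, axis=1))

# Analytic gradient of W1 via the delta recursion
n = X.shape[0]
h = sigmoid(X @ W1)
delta2 = (h @ W2 - T) / n                # output error signal
delta1 = (delta2 @ W2.T) * h * (1 - h)   # backpropagated through the sigmoid
dW1 = X.T @ delta1

# Finite-difference check on a single entry of dW1
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps
W1m[0, 0] -= eps
numeric = (loss(W1p, W2) - loss(W1m, W2)) / (2 * eps)
```

If the recursion is implemented correctly, `dW1[0, 0]` and `numeric` agree to within the finite-difference error.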

Universal Approximation

A single hidden layer with enough neurons can approximate any continuous function on a compact domain to arbitrary precision:

$\forall\,\epsilon > 0,\ \exists\, N:\ \sup_x \left|f(x) - h_N(x)\right| < \epsilon$

Use ReLU in Practice

Although sigmoid is intuitive, production networks prefer $\text{ReLU}(z) = \max(0, z)$ in hidden layers. ReLU avoids the vanishing-gradient problem that plagues deep sigmoid networks: its gradient is exactly 1 on the active side, whereas the sigmoid's gradient shrinks toward zero whenever its output saturates near 0 or 1.
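A short comparison of the two gradients makes the saturation effect visible:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.5, 10.0])

# Sigmoid gradient: at most 0.25 (at z = 0), about 4.5e-5 at |z| = 10
sig_grad = sigmoid(z) * (1 - sigmoid(z))

# ReLU gradient: exactly 1 for positive inputs, 0 otherwise
relu_grad = (z > 0).astype(float)
```

Chaining many sigmoid layers multiplies the backward signal by a factor of at most 0.25 per layer, so it shrinks geometrically with depth; ReLU's unit gradient on its active side avoids that decay.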

Phase 2

See It Work

Watch the network learn to separate 4 quadrant clusters

Network Architecture (2 → 4 → 4)

[Diagram: inputs x₁, x₂ → hidden units h₁–h₄ → outputs ŷ₀–ŷ₃]

Decision Region


Cross-Entropy Loss

[Plot: CE loss vs. training iteration]

[Interactive animation, 12 steps. Step 1 of 12 initializes the weights, $W^{(1)} \sim \mathcal{N}(0, 0.3^2)$; at epoch 0 the loss is 1.386 ($= \ln 4$, the cross-entropy of a uniform guess over 4 classes).]

Phase 3

The Code

Bridge from mathematical formulation to Python implementation

Mathematical Formulation

$W^{(1)} \sim \mathcal{N}(0,\, 0.3^2),\quad b^{(1)} = \mathbf{0}$

Initialize hidden layer weights randomly

$h = \sigma(XW^{(1)} + b^{(1)})$

Forward pass through hidden layer with sigmoid

$\text{logits} = hW^{(2)} + b^{(2)}$

Compute raw output scores for each class

$\hat{Y} = \text{softmax}(\text{logits})$

Convert logits to class probabilities via softmax

$\mathcal{L} = -\dfrac{1}{n}\displaystyle\sum_{i,k} Y_{ik}\log\hat{Y}_{ik}$

Categorical cross-entropy loss over all classes

$\delta^{(2)} = (\hat{Y} - Y)/n$

Output layer error signal (averaged over batch)

$\delta^{(1)} = \left(\delta^{(2)} W^{(2)\top}\right) \odot h \odot (1 - h)$

Backpropagate error through hidden layer

$W^{(1)} \leftarrow W^{(1)} - \eta\, \nabla_{W^{(1)}} \mathcal{L}$

Update all weights via gradient descent

Python Implementation

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    # Shift by the row max for numerical stability before exponentiating
    exp_z = np.exp(z - z.max(axis=1, keepdims=True))
    return exp_z / exp_z.sum(axis=1, keepdims=True)

def mlp_train(X, y, hidden=4, lr=0.1, epochs=200):
    n, d = X.shape
    K = int(y.max()) + 1                    # number of classes
    W1 = np.random.randn(d, hidden) * 0.3   # W1 ~ N(0, 0.3^2)
    b1 = np.zeros(hidden)
    W2 = np.random.randn(hidden, K) * 0.3
    b2 = np.zeros(K)
    Y = np.eye(K)[y]                        # one-hot labels
    losses = []
    for epoch in range(epochs):
        # Forward pass
        h = sigmoid(X @ W1 + b1)
        logits = h @ W2 + b2
        y_hat = softmax(logits)
        loss = -np.mean(np.sum(Y * np.log(y_hat + 1e-8), axis=1))
        losses.append(loss)
        # Backward pass: the delta recursions from the formulation above
        d_out = (y_hat - Y) / n
        dW2 = h.T @ d_out
        db2 = d_out.sum(axis=0)
        d_hid = (d_out @ W2.T) * h * (1 - h)   # sigma'(z) = h * (1 - h)
        dW1 = X.T @ d_hid
        db1 = d_hid.sum(axis=0)
        # Gradient descent update
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1, W2, b2, losses
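As an end-to-end sanity check, the same forward/backward steps can be exercised on synthetic quadrant clusters like the ones in Phase 2. The cluster centers, spread, and hyperparameters below are illustrative assumptions, and the update loop is repeated inline so the snippet runs standalone:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Four Gaussian clusters, one per quadrant, centered on the origin
centers = np.array([[-2.0, -2.0], [2.0, -2.0], [-2.0, 2.0], [2.0, 2.0]])
X = np.vstack([c + rng.normal(scale=0.6, size=(50, 2)) for c in centers])
y = np.repeat(np.arange(4), 50)

# Same steps as mlp_train, inlined (wider hidden layer, larger lr, more epochs)
n, hidden, K, lr = len(X), 8, 4, 0.5
W1 = rng.normal(scale=0.3, size=(2, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.3, size=(hidden, K)); b2 = np.zeros(K)
Y = np.eye(K)[y]

for _ in range(3000):
    h = sigmoid(X @ W1 + b1)
    y_hat = softmax(h @ W2 + b2)
    d_out = (y_hat - Y) / n
    d_hid = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_hid); b1 -= lr * d_hid.sum(axis=0)

pred = softmax(sigmoid(X @ W1 + b1) @ W2 + b2).argmax(axis=1)
accuracy = (pred == y).mean()
```

With well-separated clusters the network should classify nearly all training points correctly, and the loss should fall well below the uniform-guess value of $\ln 4 \approx 1.386$.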