Linear Regression

Fit a line to data by minimizing mean squared error using gradient descent.

Phase 1

The Mathematics

Linear model, MSE loss, and gradient descent optimization

The Model

Linear regression fits a line $\hat{y} = wx + b$ to data points $(x_i, y_i)$ by finding the weight $w$ and bias $b$ that best explain the relationship.

$$\hat{y}_i = w x_i + b$$

Loss Function: Mean Squared Error

We measure how wrong our predictions are using the average squared difference between the predicted $\hat{y}_i$ and the actual $y_i$:

$$\mathcal{L}(w,b) = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$$

This loss is a convex function of $w$ and $b$, so it has a single global minimum and no spurious local minima for gradient descent to get stuck in.
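To make the loss concrete, here is a minimal NumPy sketch that evaluates the MSE for a candidate $(w, b)$ on a small hand-made dataset (the arrays below are illustrative, not the demo's data):

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])    # illustrative inputs
y = np.array([3.0, 5.0, 7.0, 9.0])    # illustrative targets, generated by y = 2x + 1

def mse(w, b):
    y_hat = w * X + b                  # predictions of the linear model
    return np.mean((y_hat - y) ** 2)   # average squared error

print(mse(0.0, 0.0))   # 41.0 at the all-zero initialization
print(mse(2.0, 1.0))   # 0.0 at the parameters that generated the data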

Gradient Descent

We compute the partial derivatives and update parameters in the direction of steepest descent:

$$\frac{\partial \mathcal{L}}{\partial w} = \frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)\, x_i, \qquad \frac{\partial \mathcal{L}}{\partial b} = \frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)$$

Update rule with learning rate $\alpha$:

$$w \leftarrow w - \alpha \frac{\partial \mathcal{L}}{\partial w}, \quad b \leftarrow b - \alpha \frac{\partial \mathcal{L}}{\partial b}$$
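As a sketch of what one iteration looks like in code (the helper name gd_step is ours; X and y are assumed to be NumPy arrays):

import numpy as np

def gd_step(w, b, X, y, lr=0.01):
    y_hat = w * X + b                              # current predictions
    dw = (2 / len(X)) * np.sum((y_hat - y) * X)    # dL/dw
    db = (2 / len(X)) * np.sum(y_hat - y)          # dL/db
    return w - lr * dw, b - lr * db                # step against the gradient

Repeating this step drives $(w, b)$ toward the minimizer of the loss.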

Convergence

Since the MSE is convex, gradient descent with a suitably small learning rate converges to the global minimum. For simple linear regression, the optimum can also be found analytically via the normal equation:

$$w^* = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \quad b^* = \bar{y} - w^*\bar{x}$$
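These closed-form estimates take only a few lines to compute; a minimal sketch for comparison against gradient descent (the function name closed_form_fit is ours; X and y are assumed to be NumPy arrays):

import numpy as np

def closed_form_fit(X, y):
    x_bar, y_bar = X.mean(), y.mean()                                  # sample means
    w = np.sum((X - x_bar) * (y - y_bar)) / np.sum((X - x_bar) ** 2)   # optimal slope
    b = y_bar - w * x_bar                                              # optimal intercept
    return w, b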

Why Gradient Descent?

While the normal equation works for simple linear regression, gradient descent scales to millions of features and is the foundation for training neural networks.

Phase 2

See It Work

Watch gradient descent fit a line to data

Data & Regression Line

[Interactive scatter plot: data points and the current regression line; x from 0 to 9, y from 0 to 20]

Loss Over Iterations

[Interactive line chart: MSE loss over iterations]

Step 1 of 13: initialize $w = 0$ and $b = 0$, so the line $\hat{y} = wx + b$ starts flat at zero; MSE loss $= 121.32$.

Phase 3

The Code

Bridge from mathematical formulation to Python implementation

Mathematical Formulation

$$w = 0,\; b = 0$$

Initialize weight and bias to zero

$$\hat{y}_i = w x_i + b$$

Forward pass: compute predictions for all points

$$\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$$

Compute mean squared error loss

$$\frac{\partial \mathcal{L}}{\partial w} = \frac{2}{n}\sum(\hat{y}_i - y_i)\, x_i$$

Gradient of loss with respect to weight

$$\frac{\partial \mathcal{L}}{\partial b} = \frac{2}{n}\sum(\hat{y}_i - y_i)$$

Gradient of loss with respect to bias

$$w \leftarrow w - \alpha \frac{\partial \mathcal{L}}{\partial w}$$

Update weight using gradient descent

$$b \leftarrow b - \alpha \frac{\partial \mathcal{L}}{\partial b}$$

Update bias using gradient descent

Python Implementation

import numpy as np

def linear_regression(X, y, lr=0.01, epochs=100):
    X = np.asarray(X, dtype=float)        # accept lists or arrays
    y = np.asarray(y, dtype=float)
    w, b = 0.0, 0.0                       # Initialize parameters
    n = len(X)
    for epoch in range(epochs):           # Training loop
        y_hat = w * X + b                 # Forward pass
        loss = np.mean((y_hat - y) ** 2)  # MSE loss (useful for monitoring)
        dw = (2 / n) * np.sum((y_hat - y) * X)  # Gradient w.r.t. w
        db = (2 / n) * np.sum(y_hat - y)        # Gradient w.r.t. b
        w -= lr * dw                      # Update w
        b -= lr * db                      # Update b
    return w, b
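A quick usage sketch on synthetic data (the slope, intercept, and noise level below are illustrative, not the demo's dataset):

rng = np.random.default_rng(0)                             # reproducible noise
X = np.linspace(0.0, 9.0, 50)
y = 2.0 * X + 1.0 + rng.normal(scale=0.5, size=X.shape)    # true w = 2, b = 1, plus noise

w, b = linear_regression(X, y, lr=0.01, epochs=2000)
print(f"w = {w:.3f}, b = {b:.3f}")                         # expect roughly w ≈ 2, b ≈ 1

More epochs than the default 100 are used because, at this learning rate, the intercept tends to converge more slowly than the slope.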