Linear Regression

Fit a line to data by minimizing mean squared error using gradient descent.

Phase 1

The Mathematics

Linear model, MSE loss, and gradient descent optimization

The Model

Linear regression fits a line $\hat{y} = wx + b$ to data points $(x_i, y_i)$ by finding the weight $w$ and bias $b$ that best explain the relationship.

$$\hat{y}_i = w x_i + b$$

Loss Function: Mean Squared Error

We measure how wrong our predictions are using the average squared difference between the predicted $\hat{y}_i$ and the actual $y_i$:

$$\mathcal{L}(w,b) = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$$

This loss is a convex function of $w$ and $b$, so it has a single global minimum and no spurious local minima for gradient descent to get stuck in.
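To make the loss concrete, here is a minimal NumPy sketch that evaluates the MSE for a candidate $(w, b)$ on a small hand-made dataset (the arrays below are illustrative, not the demo's data):

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])    # illustrative inputs
y = np.array([3.0, 5.0, 7.0, 9.0])    # illustrative targets, generated by y = 2x + 1

def mse(w, b):
    y_hat = w * X + b                  # predictions of the linear model
    return np.mean((y_hat - y) ** 2)   # average squared error

print(mse(0.0, 0.0))   # 41.0 at the all-zero initialization
print(mse(2.0, 1.0))   # 0.0 at the parameters that generated the data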

Gradient Descent

We compute the partial derivatives and update parameters in the direction of steepest descent:

$$\frac{\partial \mathcal{L}}{\partial w} = \frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)\, x_i, \qquad \frac{\partial \mathcal{L}}{\partial b} = \frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)$$

Update rule with learning rate $\alpha$:

$$w \leftarrow w - \alpha \frac{\partial \mathcal{L}}{\partial w}, \quad b \leftarrow b - \alpha \frac{\partial \mathcal{L}}{\partial b}$$
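As a sketch of what one iteration looks like in code (the helper name gd_step is ours; X and y are assumed to be NumPy arrays):

import numpy as np

def gd_step(w, b, X, y, lr=0.01):
    y_hat = w * X + b                              # current predictions
    dw = (2 / len(X)) * np.sum((y_hat - y) * X)    # dL/dw
    db = (2 / len(X)) * np.sum(y_hat - y)          # dL/db
    return w - lr * dw, b - lr * db                # step against the gradient

Repeating this step drives $(w, b)$ toward the minimizer of the loss.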

Convergence

Since the MSE is convex, gradient descent with a suitably small learning rate converges to the global minimum. For simple linear regression, the optimum can also be found analytically via the normal equation:

$$w^* = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \quad b^* = \bar{y} - w^*\bar{x}$$
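These closed-form estimates take only a few lines to compute; a minimal sketch for comparison against gradient descent (the function name closed_form_fit is ours; X and y are assumed to be NumPy arrays):

import numpy as np

def closed_form_fit(X, y):
    x_bar, y_bar = X.mean(), y.mean()                                  # sample means
    w = np.sum((X - x_bar) * (y - y_bar)) / np.sum((X - x_bar) ** 2)   # optimal slope
    b = y_bar - w * x_bar                                              # optimal intercept
    return w, b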

Why Gradient Descent?

While the normal equation works for simple linear regression, gradient descent scales to millions of features and is the foundation for training neural networks.

Phase 2

See It Work

Watch gradient descent fit a line to data

Data & Regression Line

[Interactive scatter plot: data points and the current regression line; x from 0 to 9, y from 0 to 20]

Loss Over Iterations

[Interactive line chart: MSE loss over iterations]

Step 1 of 13: initialize $w = 0$ and $b = 0$, so the line $\hat{y} = wx + b$ starts flat at zero; MSE loss $= 121.32$.

Phase 3

The Code

Bridge from mathematical formulation to Python implementation

Mathematical Formulation

$$w = 0,\; b = 0$$

Initialize weight and bias to zero

$$\hat{y}_i = w x_i + b$$

Forward pass: compute predictions for all points

$$\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$$

Compute mean squared error loss

$$\frac{\partial \mathcal{L}}{\partial w} = \frac{2}{n}\sum(\hat{y}_i - y_i)\, x_i$$

Gradient of loss with respect to weight

$$\frac{\partial \mathcal{L}}{\partial b} = \frac{2}{n}\sum(\hat{y}_i - y_i)$$

Gradient of loss with respect to bias

$$w \leftarrow w - \alpha \frac{\partial \mathcal{L}}{\partial w}$$

Update weight using gradient descent

$$b \leftarrow b - \alpha \frac{\partial \mathcal{L}}{\partial b}$$

Update bias using gradient descent

Python Implementation

import numpy as np

def linear_regression(X, y, lr=0.01, epochs=100):
    X = np.asarray(X, dtype=float)        # accept lists or arrays
    y = np.asarray(y, dtype=float)
    w, b = 0.0, 0.0                       # Initialize parameters
    n = len(X)
    for epoch in range(epochs):           # Training loop
        y_hat = w * X + b                 # Forward pass
        loss = np.mean((y_hat - y) ** 2)  # MSE loss (useful for monitoring)
        dw = (2 / n) * np.sum((y_hat - y) * X)  # Gradient w.r.t. w
        db = (2 / n) * np.sum(y_hat - y)        # Gradient w.r.t. b
        w -= lr * dw                      # Update w
        b -= lr * db                      # Update b
    return w, b
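A quick usage sketch on synthetic data (the slope, intercept, and noise level below are illustrative, not the demo's dataset):

rng = np.random.default_rng(0)                             # reproducible noise
X = np.linspace(0.0, 9.0, 50)
y = 2.0 * X + 1.0 + rng.normal(scale=0.5, size=X.shape)    # true w = 2, b = 1, plus noise

w, b = linear_regression(X, y, lr=0.01, epochs=2000)
print(f"w = {w:.3f}, b = {b:.3f}")                         # expect roughly w ≈ 2, b ≈ 1

More epochs than the default 100 are used because, at this learning rate, the intercept tends to converge more slowly than the slope.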