Naive Bayes

Probabilistic classifier using Bayes' theorem with the conditional independence assumption.

Classification
Phase 1

The Mathematics

Bayes' theorem, conditional independence, and Gaussian likelihood

Bayes' Theorem

Classification by posterior probability. We want the class C_k that maximizes the probability given the observed features:

P(C_k | x) = \frac{P(x | C_k)\, P(C_k)}{P(x)}

Since P(x) is constant across classes, we only need to maximize the numerator, P(x | C_k) P(C_k).
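
To make this concrete, here is a minimal sketch with made-up priors and likelihoods for a single observation; dividing by P(x) rescales both posteriors equally and never changes which class wins.

import numpy as np

# Hypothetical priors and likelihoods for one observation x (illustrative values only)
priors = np.array([0.6, 0.4])          # P(C_0), P(C_1)
likelihoods = np.array([0.02, 0.05])   # P(x | C_0), P(x | C_1)

unnormalized = likelihoods * priors               # numerator of Bayes' theorem
posteriors = unnormalized / unnormalized.sum()    # dividing by P(x) only rescales
print(posteriors)           # [0.375 0.625]
print(posteriors.argmax())  # 1 -- same winner with or without normalization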

Conditional Independence

The “naive” assumption: features are independent given the class label. This factorizes the joint likelihood:

P(x | C_k) = \prod_{j=1}^{d} P(x_j | C_k)

This is rarely true in practice, yet Naive Bayes often performs surprisingly well — especially in high dimensions where estimating the full joint is infeasible.
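
A minimal numeric sketch of the factorization, with made-up per-feature likelihoods for one class:

import numpy as np

# Hypothetical per-feature likelihoods P(x_j | C_k) for one class, d = 3 features
per_feature = np.array([0.30, 0.10, 0.25])

joint = np.prod(per_feature)              # naive joint likelihood: 0.0075
log_joint = np.sum(np.log(per_feature))   # same quantity in log space (used below)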

Gaussian Likelihood

For continuous features we model each feature as Gaussian, x_j | C_k \sim \mathcal{N}(\mu_{kj}, \sigma_{kj}^2):

P(x_j | C_k) = \frac{1}{\sqrt{2\pi}\,\sigma_{kj}} \exp\!\left(-\frac{(x_j - \mu_{kj})^2}{2\sigma_{kj}^2}\right)

Parameters \mu_{kj} and \sigma_{kj} are estimated from the training points in class k.
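
A small sketch of the estimation and evaluation step for one feature of one class, using made-up training values (the mean and standard deviation here are the same maximum-likelihood estimates the Phase 3 implementation uses):

import numpy as np

# Hypothetical values of one feature for the training points in class k
x_k = np.array([3.1, 2.8, 3.6, 3.3])

mu = x_k.mean()    # mu_kj
sigma = x_k.std()  # sigma_kj

def gaussian_pdf(x, mu, sigma):
    # N(mu, sigma^2) density, matching the formula above
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

p = gaussian_pdf(3.5, mu, sigma)   # P(x_j = 3.5 | C_k)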

Log Posterior

To avoid numerical underflow with many features, work in log space. The prediction is:

\hat{y} = \arg\max_k \left[\log P(C_k) + \sum_{j=1}^d \log P(x_j | C_k)\right]
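
A quick sketch of why the log matters: a raw product of many small per-feature likelihoods underflows to zero, while the sum of logs stays well-behaved.

import numpy as np

small = np.full(500, 1e-3)     # 500 per-feature likelihoods of 0.001 each
print(np.prod(small))          # 0.0 -- the raw product underflows
print(np.sum(np.log(small)))   # about -3453.9 -- the log-sum is fine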

Where It Shines

Despite the naive independence assumption, Naive Bayes excels at text classification (spam filters, sentiment analysis) where features are bag-of-words counts. It also trains in O(nd) — a single pass through the data — and handles missing features gracefully.
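
As an aside, a minimal scikit-learn sketch of the text-classification use case. It uses MultinomialNB, the count-based variant suited to bag-of-words features rather than the Gaussian model developed here; the tiny corpus and labels are made up for illustration (assumes scikit-learn is installed).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy corpus and labels, purely for illustration
corpus = ["win cash now", "cheap pills offer", "meeting at noon", "lunch with the team"]
labels = [1, 1, 0, 0]   # 1 = spam, 0 = ham

vec = CountVectorizer()
X = vec.fit_transform(corpus)          # bag-of-words count features

clf = MultinomialNB().fit(X, labels)   # fitting is a single pass over the counts
print(clf.predict(vec.transform(["cheap cash offer"])))  # expected: [1]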

Phase 2

See It Work

Watch Naive Bayes estimate class distributions and classify the query point

[Interactive visualization: training data and per-class Gaussian distributions over features x₁ and x₂; log posteriors log P(C₀|x) and log P(C₁|x) for Class 0 and Class 1; query point x₁ = 3.5, x₂ = 4.0. The 9-step walkthrough evaluates P(ŷ|x) ∝ P(x|ŷ) P(ŷ).]

Dataset: 12 points, 2 classes. We will classify the query point (3.5, 4.0).

Phase 3

The Code

Bridge from mathematical formulation to Python implementation

Mathematical Formulation

P(C_k) = \frac{|\{i : y_i = k\}|}{n}

Estimate class prior from training data

\mu_k = \frac{1}{n_k}\sum_{y_i = k} x_i

Compute per-class mean for each feature

\sigma_k = \text{std}(\{x_i : y_i = k\})

Compute per-class standard deviation for each feature

\log P(x | C_k) = -\frac{1}{2}\sum_j\left[\left(\frac{x_j - \mu_{kj}}{\sigma_{kj}}\right)^2 + \log(2\pi\sigma_{kj}^2)\right]

Log-likelihood under Gaussian assumption (naive independence)

\log P(C_k)

Log of class prior probability

\log P(x | C_k)

Log-likelihood of query under class k Gaussian

\hat{y} = \arg\max_k \left[\log P(C_k) + \log P(x | C_k)\right]

Predict class with highest log-posterior

Python Implementation

import numpy as np

def gaussian_nb(X_train, y_train, x_query):
    """Gaussian Naive Bayes: fit per-class Gaussians, then classify one query point."""
    classes = np.unique(y_train)
    priors, means, stds = {}, {}, {}
    for c in classes:
        X_c = X_train[y_train == c]
        priors[c] = len(X_c) / len(X_train)   # P(C_k) = n_k / n
        means[c] = X_c.mean(axis=0)           # mu_k, per-feature mean
        stds[c] = X_c.std(axis=0) + 1e-9      # sigma_k, small epsilon avoids division by zero
    def log_likelihood(x, mu, sigma):
        # Sum of per-feature Gaussian log-densities (naive independence)
        return -0.5 * np.sum(
            ((x - mu) / sigma)**2 + np.log(2*np.pi*sigma**2)
        )
    log_posteriors = {}
    for c in classes:
        log_prior = np.log(priors[c])
        log_like = log_likelihood(x_query, means[c], stds[c])
        log_posteriors[c] = log_prior + log_like   # log P(C_k) + log P(x | C_k)
    prediction = max(log_posteriors, key=log_posteriors.get)  # argmax over classes
    return prediction, log_posteriors
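
A quick usage sketch. The training data below is randomly generated for illustration only (the 12-point dataset from Phase 2 is not reproduced here); the query point (3.5, 4.0) matches the demo.

# Illustrative usage with randomly generated two-cluster data
rng = np.random.default_rng(0)
X_train = np.vstack([
    rng.normal(loc=[2.0, 2.0], scale=0.8, size=(6, 2)),   # class 0 cluster
    rng.normal(loc=[5.0, 5.0], scale=0.8, size=(6, 2)),   # class 1 cluster
])
y_train = np.array([0] * 6 + [1] * 6)

pred, scores = gaussian_nb(X_train, y_train, np.array([3.5, 4.0]))
print(pred, scores)   # predicted class and both log-posteriors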