Naive Bayes

Probabilistic classifier using Bayes' theorem with the conditional independence assumption.

Classification
Phase 1

The Mathematics

Bayes' theorem, conditional independence, and Gaussian likelihood

Bayes' Theorem

Classification by posterior probability. We want the class C_k that maximizes the probability given the observed features:

P(C_k | x) = \frac{P(x | C_k)\, P(C_k)}{P(x)}

Since P(x) is constant across classes, we only need to maximize the numerator, P(x | C_k) P(C_k).
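
To make this concrete, here is a minimal sketch with made-up priors and likelihoods for a single observation; dividing by P(x) rescales both posteriors equally and never changes which class wins.

import numpy as np

# Hypothetical priors and likelihoods for one observation x (illustrative values only)
priors = np.array([0.6, 0.4])          # P(C_0), P(C_1)
likelihoods = np.array([0.02, 0.05])   # P(x | C_0), P(x | C_1)

unnormalized = likelihoods * priors               # numerator of Bayes' theorem
posteriors = unnormalized / unnormalized.sum()    # dividing by P(x) only rescales
print(posteriors)           # [0.375 0.625]
print(posteriors.argmax())  # 1 -- same winner with or without normalization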

Conditional Independence

The “naive” assumption: features are independent given the class label. This factorizes the joint likelihood:

P(x | C_k) = \prod_{j=1}^{d} P(x_j | C_k)

This is rarely true in practice, yet Naive Bayes often performs surprisingly well — especially in high dimensions where estimating the full joint is infeasible.
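
A minimal numeric sketch of the factorization, with made-up per-feature likelihoods for one class:

import numpy as np

# Hypothetical per-feature likelihoods P(x_j | C_k) for one class, d = 3 features
per_feature = np.array([0.30, 0.10, 0.25])

joint = np.prod(per_feature)              # naive joint likelihood: 0.0075
log_joint = np.sum(np.log(per_feature))   # same quantity in log space (used below)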

Gaussian Likelihood

For continuous features we model each feature as Gaussian, x_j | C_k \sim \mathcal{N}(\mu_{kj}, \sigma_{kj}^2):

P(x_j | C_k) = \frac{1}{\sqrt{2\pi}\,\sigma_{kj}} \exp\!\left(-\frac{(x_j - \mu_{kj})^2}{2\sigma_{kj}^2}\right)

Parameters \mu_{kj} and \sigma_{kj} are estimated from the training points in class k.
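
A small sketch of the estimation and evaluation step for one feature of one class, using made-up training values (the mean and standard deviation here are the same maximum-likelihood estimates the Phase 3 implementation uses):

import numpy as np

# Hypothetical values of one feature for the training points in class k
x_k = np.array([3.1, 2.8, 3.6, 3.3])

mu = x_k.mean()    # mu_kj
sigma = x_k.std()  # sigma_kj

def gaussian_pdf(x, mu, sigma):
    # N(mu, sigma^2) density, matching the formula above
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

p = gaussian_pdf(3.5, mu, sigma)   # P(x_j = 3.5 | C_k)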

Log Posterior

To avoid numerical underflow with many features, work in log space. The prediction is:

\hat{y} = \arg\max_k \left[\log P(C_k) + \sum_{j=1}^d \log P(x_j | C_k)\right]
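
A quick sketch of why the log matters: a raw product of many small per-feature likelihoods underflows to zero, while the sum of logs stays well-behaved.

import numpy as np

small = np.full(500, 1e-3)     # 500 per-feature likelihoods of 0.001 each
print(np.prod(small))          # 0.0 -- the raw product underflows
print(np.sum(np.log(small)))   # about -3453.9 -- the log-sum is fine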

Where It Shines

Despite the naive independence assumption, Naive Bayes excels at text classification (spam filters, sentiment analysis) where features are bag-of-words counts. It also trains in O(nd) — a single pass through the data — and handles missing features gracefully.
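
As an aside, a minimal scikit-learn sketch of the text-classification use case. It uses MultinomialNB, the count-based variant suited to bag-of-words features rather than the Gaussian model developed here; the tiny corpus and labels are made up for illustration (assumes scikit-learn is installed).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy corpus and labels, purely for illustration
corpus = ["win cash now", "cheap pills offer", "meeting at noon", "lunch with the team"]
labels = [1, 1, 0, 0]   # 1 = spam, 0 = ham

vec = CountVectorizer()
X = vec.fit_transform(corpus)          # bag-of-words count features

clf = MultinomialNB().fit(X, labels)   # fitting is a single pass over the counts
print(clf.predict(vec.transform(["cheap cash offer"])))  # expected: [1]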

Phase 2

See It Work

Watch Naive Bayes estimate class distributions and classify the query point

[Interactive visualization: training data and per-class Gaussian distributions over features x₁ and x₂; log posteriors log P(C₀|x) and log P(C₁|x) for Class 0 and Class 1; query point x₁ = 3.5, x₂ = 4.0. The 9-step walkthrough evaluates P(ŷ|x) ∝ P(x|ŷ) P(ŷ).]

Dataset: 12 points, 2 classes. We will classify the query point (3.5, 4.0).

Phase 3

The Code

Bridge from mathematical formulation to Python implementation

Mathematical Formulation

P(C_k) = \frac{|\{i : y_i = k\}|}{n}

Estimate class prior from training data

\mu_k = \frac{1}{n_k}\sum_{y_i = k} x_i

Compute per-class mean for each feature

\sigma_k = \text{std}(\{x_i : y_i = k\})

Compute per-class standard deviation for each feature

\log P(x | C_k) = -\frac{1}{2}\sum_j\left[\left(\frac{x_j - \mu_{kj}}{\sigma_{kj}}\right)^2 + \log(2\pi\sigma_{kj}^2)\right]

Log-likelihood under Gaussian assumption (naive independence)

\log P(C_k)

Log of class prior probability

\log P(x | C_k)

Log-likelihood of query under class k Gaussian

\hat{y} = \arg\max_k \left[\log P(C_k) + \log P(x | C_k)\right]

Predict class with highest log-posterior

Python Implementation

import numpy as np

def gaussian_nb(X_train, y_train, x_query):
    """Gaussian Naive Bayes: fit per-class Gaussians, then classify one query point."""
    classes = np.unique(y_train)
    priors, means, stds = {}, {}, {}
    for c in classes:
        X_c = X_train[y_train == c]
        priors[c] = len(X_c) / len(X_train)   # P(C_k) = n_k / n
        means[c] = X_c.mean(axis=0)           # mu_k, per-feature mean
        stds[c] = X_c.std(axis=0) + 1e-9      # sigma_k, small epsilon avoids division by zero
    def log_likelihood(x, mu, sigma):
        # Sum of per-feature Gaussian log-densities (naive independence)
        return -0.5 * np.sum(
            ((x - mu) / sigma)**2 + np.log(2*np.pi*sigma**2)
        )
    log_posteriors = {}
    for c in classes:
        log_prior = np.log(priors[c])
        log_like = log_likelihood(x_query, means[c], stds[c])
        log_posteriors[c] = log_prior + log_like   # log P(C_k) + log P(x | C_k)
    prediction = max(log_posteriors, key=log_posteriors.get)  # argmax over classes
    return prediction, log_posteriors
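
A quick usage sketch. The training data below is randomly generated for illustration only (the 12-point dataset from Phase 2 is not reproduced here); the query point (3.5, 4.0) matches the demo.

# Illustrative usage with randomly generated two-cluster data
rng = np.random.default_rng(0)
X_train = np.vstack([
    rng.normal(loc=[2.0, 2.0], scale=0.8, size=(6, 2)),   # class 0 cluster
    rng.normal(loc=[5.0, 5.0], scale=0.8, size=(6, 2)),   # class 1 cluster
])
y_train = np.array([0] * 6 + [1] * 6)

pred, scores = gaussian_nb(X_train, y_train, np.array([3.5, 4.0]))
print(pred, scores)   # predicted class and both log-posteriors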