Why Least-Squares? Unpacking the Probabilistic Heart of Linear Regression ❤️🎲



This content originally appeared on DEV Community and was authored by Randhir Kumar

Hey everyone! 👋 My name is Randhir, and as someone diving deep into ethical hacking, machine learning, deep learning, and web development, I’m constantly building and exploring. Right now, I’m excited to be working on my AI SaaS tool, TailorMails.dev, a personalized cold email tool that crafts outreach based on LinkedIn bios. Understanding the “why” behind core algorithms is crucial for these projects, and it’s something I love sharing.

We often use the least-squares cost function in Linear Regression, but have you ever stopped to wonder why it’s the right choice? 🤔

Today, let’s explore the powerful Probabilistic Interpretation of Linear Regression. This theoretical justification reveals the hidden statistical elegance behind our beloved least-squares objective. Get ready to connect the dots! 💡

Linear Regression: The Core Problem 🎯

Our primary goal in Linear Regression (Chapter 1, remember?) is to learn a hypothesis function, $h_\theta(x) = \theta^T x$, that can predict a continuous target variable $y$ based on input features $x$.

  • Goal: Find the optimal parameters $\theta$ for our hypothesis function.

  • How? We define a cost function 💸 (typically the least-squares cost function), which measures the squared differences between our predictions and the actual values.

  • Objective: Minimize this cost (a minimal sketch follows below)! 👇
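To make that objective concrete, here is a minimal NumPy sketch of the hypothesis $h_\theta(x) = \theta^T x$ and the least-squares cost. The names `X`, `y`, and `theta` are illustrative choices for this example, not from any particular library:

```python
import numpy as np

def hypothesis(theta, X):
    """h_theta(x) = theta^T x, applied to every row of the design matrix X."""
    return X @ theta

def least_squares_cost(theta, X, y):
    """J(theta) = (1/2) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residuals = hypothesis(theta, X) - y
    return 0.5 * np.sum(residuals ** 2)
```

Training then amounts to searching for the `theta` that makes this number as small as possible.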

The Probabilistic Lens: Key Assumptions 🔭

The core of the probabilistic interpretation rests on a specific set of assumptions about how the target variables $y$ are related to the input features $x$.

  1. Relationship with Error Term:
    It is assumed that the target variable $y^{(i)}$ for each training example $(x^{(i)}, y^{(i)})$ is related to the input features $x^{(i)}$ and parameters $\theta$ by the equation:
    $$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$

    • Here, $\epsilon^{(i)}$ represents an error term which accounts for unmodelled effects or random noise.
  2. Gaussian Error Distribution:
    A crucial assumption is that these error terms $\epsilon^{(i)}$ are Independently and Identically Distributed (IID) according to a Gaussian (Normal) distribution with a mean of zero and some variance $\sigma^2$. This can be written as:
    $$\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$$

    • This assumption implies that the conditional probability of $y^{(i)}$ given $x^{(i)}$ and $\theta$ is also Gaussian:
      $$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)$$
      This essentially states that $y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2)$. It is important to note that $\theta$ is treated as a fixed but unknown parameter, not a random variable, hence the use of "$;\theta$" rather than "$,\theta$" in the notation. (These assumptions are illustrated in the sketch just after this list.)
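Here is a small sketch of these assumptions in action: it generates synthetic data exactly as the model describes and evaluates the Gaussian conditional density. The ground-truth parameters, noise level, and sample size are illustrative values picked for the example, not anything prescribed by the theory:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative ground truth for a one-feature model with an intercept.
theta_true = np.array([2.0, -3.0])   # [intercept, slope]
sigma = 0.5                          # noise standard deviation
n = 200

# Design matrix with an intercept column, so each row is x^(i) = [1, feature].
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=n)])

# y^(i) = theta^T x^(i) + eps^(i), with eps^(i) ~ N(0, sigma^2) drawn IID.
eps = rng.normal(loc=0.0, scale=sigma, size=n)
y = X @ theta_true + eps

def gaussian_conditional(y_i, x_i, theta, sigma):
    """p(y^(i) | x^(i); theta) under the Gaussian noise assumption."""
    mean = x_i @ theta
    return np.exp(-(y_i - mean) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

print(gaussian_conditional(y[0], X[0], theta_true, sigma))
```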

Unveiling the Connection: Maximum Likelihood Estimation (MLE) 🔮

Given these probabilistic assumptions, the principle of maximum likelihood estimation (MLE) is applied to find the optimal parameters $\theta$.

  1. Likelihood Function:
    The likelihood function $L(\theta)$ views the probability of observing the training targets $\vec{y}$, given the inputs $X$, as a function of $\theta$. Because the $\epsilon^{(i)}$ terms are assumed independent, $L(\theta)$ factors into the product of the individual conditional probabilities:
    $$L(\theta) = p(\vec{y} \mid X; \theta) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta)$$

  2. Log-Likelihood:
    To simplify calculations, it is common practice to maximise the log-likelihood $\ell(\theta)$ instead of $L(\theta)$: because the logarithm is strictly increasing, maximising $\ell(\theta)$ yields the same optimal parameters. Taking the logarithm of $L(\theta)$:

    $$\ell(\theta) = \log L(\theta) = n \log\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$
  3. Equivalence to Least-Squares:
    The first term of $\ell(\theta)$ does not involve $\theta$ at all, so maximising $\ell(\theta)$ is equivalent to minimising $\frac{1}{2} \sum_{i=1}^{n} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$. This latter term is precisely the least-squares cost function $J(\theta)$ that linear regression aims to minimise (a numerical check follows this list).
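As a quick numerical sanity check, reusing the same illustrative synthetic data as above, sweeping one parameter over a grid shows that the value maximising $\ell(\theta)$ is exactly the value minimising $J(\theta)$; the argmax would also be unchanged for any other choice of `sigma`:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200, 0.5
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=n)])
y = X @ np.array([2.0, -3.0]) + rng.normal(0.0, sigma, size=n)

def log_likelihood(theta, X, y, sigma):
    """ell(theta) = n*log(1/(sqrt(2*pi)*sigma)) - (1/(2*sigma^2)) * sum_i (y^(i) - theta^T x^(i))^2"""
    residuals = y - X @ theta
    return len(y) * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma)) - np.sum(residuals ** 2) / (2 * sigma ** 2)

def least_squares_cost(theta, X, y):
    """J(theta) = (1/2) * sum_i (y^(i) - theta^T x^(i))^2"""
    return 0.5 * np.sum((y - X @ theta) ** 2)

# Sweep the slope over a grid, holding the intercept fixed for a one-dimensional comparison.
slopes = np.linspace(-5.0, 5.0, 401)
ll = np.array([log_likelihood(np.array([2.0, s]), X, y, sigma) for s in slopes])
J = np.array([least_squares_cost(np.array([2.0, s]), X, y) for s in slopes])

# The slope that maximises ell(theta) is exactly the one that minimises J(theta).
assert np.argmax(ll) == np.argmin(J)
print(slopes[np.argmax(ll)], slopes[np.argmin(J)])
```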

Therefore, the probabilistic interpretation demonstrates that under the assumption of IID Gaussian error terms, least-squares regression corresponds to finding the maximum likelihood estimate of $\theta$. This provides a strong justification for why least-squares is considered a “very natural algorithm” in this context.

Significance in Linear Regression Context 📊

This probabilistic interpretation isn’t just a theoretical exercise; it provides profound insights:

  • Foundation of Cost Function: The probabilistic interpretation gives the widely used least-squares cost function a solid theoretical underpinning. Without it, summing squared errors might seem like an arbitrary choice; with it, least-squares is shown to be statistically justified under specific, common assumptions.

  • Irrelevance of $\sigma^2$: Notably, the final choice of $\theta$ that minimises $J(\theta)$ (and thus maximises $\ell(\theta)$) does not depend on the value of $\sigma^2$. This means that even if the noise variance is unknown, the optimal $\theta$ can still be found.

  • Relationship with Generalised Linear Models (GLMs): Linear regression, viewed through this probabilistic lens, is a special case of Generalised Linear Models (GLMs). GLMs provide a unified framework for various models by assuming the conditional distribution of $y$ given $x$ belongs to the exponential family. For ordinary least squares, the Gaussian distribution is chosen for $y \mid x; \theta$, and by relating the natural parameter $\eta$ to $\theta^T x$ (i.e. $\eta = \theta^T x$), the standard linear regression hypothesis $h_\theta(x) = \theta^T x$ naturally emerges as the expected value of $y$ given $x$, $E[y \mid x; \theta]$. This highlights linear regression’s place within a broader family of statistical models.

  • Complement to Solution Methods: While the probabilistic interpretation justifies the objective function, it does not dictate the method used to minimise it. Both the LMS (gradient descent) algorithm and the Normal Equations are different approaches to the same minimisation problem for $J(\theta)$. The Normal Equations provide a direct, closed-form solution, $\theta = (X^T X)^{-1} X^T \vec{y}$, while the LMS algorithm uses iterative gradient descent. Despite their computational differences, both methods aim to find the $\theta$ that is the maximum likelihood estimate under these Gaussian assumptions (see the sketch after this list).
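Here is a rough sketch of that last point, on the same illustrative synthetic data as before. Full-batch gradient descent stands in for per-example LMS updates, and the learning rate and iteration count are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=n)])
y = X @ np.array([2.0, -3.0]) + rng.normal(0.0, 0.5, size=n)

# Normal Equations: theta = (X^T X)^{-1} X^T y (solve() is used instead of an explicit inverse).
theta_closed_form = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on J(theta) = (1/2) * sum_i (theta^T x^(i) - y^(i))^2.
theta_gd = np.zeros(X.shape[1])
learning_rate = 0.002
for _ in range(5000):
    gradient = X.T @ (X @ theta_gd - y)   # gradient of J(theta)
    theta_gd -= learning_rate * gradient

# Both routes land on (essentially) the same maximum likelihood estimate of theta.
print(theta_closed_form, theta_gd)
assert np.allclose(theta_closed_form, theta_gd, atol=1e-6)
```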

It’s important to recognise that while these probabilistic assumptions provide a compelling justification, they are “by no means necessary for least-squares to be a perfectly good and rational procedure.” Other natural assumptions can also justify the use of the least-squares cost function.

Wrapping Up 🎁

The probabilistic interpretation demystifies the least-squares cost function, revealing its deep connection to statistical principles like Maximum Likelihood Estimation. It solidifies Linear Regression’s place as a statistically robust model, giving us confidence in its results.

As I continue to build out my AI SaaS tools, TailorMails.dev (my personalized cold email tool using LinkedIn bios!), understanding these core theoretical underpinnings is just as vital as the practical implementation. It empowers me to make informed design choices and truly comprehend the magic behind the algorithms.

If you found this helpful or insightful, consider supporting my work! You can grab me a virtual coffee here: https://buymeacoffee.com/randhirbuilds. Your support helps me keep learning, building, and sharing! 💪

