The Math Behind
Machine Learning

Every mathematical concept ML depends on, explained from the ground up. No symbol introduced without being defined. No proof for proof's sake — only the math that actually gets used and why.

Chapter 1

Why Math? (And How Much?)

Machine learning is built on four branches of math: linear algebra, calculus, probability, and optimization. You don't need to be an expert in any of them. You need to understand what each one does for ML — what role it plays in the system.

Here's the shortest possible summary of each role:

Linear algebra is the language. Neural networks are made of matrices. Every forward pass is matrix multiplication. If you don't understand what a matrix multiply does, the rest is opaque.

Calculus is the learning mechanism. Backpropagation — the algorithm that trains every neural network — is just the chain rule from calculus, applied systematically. That's the entire connection.

Probability and statistics give meaning to outputs and losses. When a model outputs 0.87, that's a probability. When you compute cross-entropy, you're measuring surprise. When you regularize, you're encoding a prior belief. These concepts come from probability.

Optimization is the search algorithm. Once you have a gradient (from calculus), you need a strategy for using it. That strategy — how to walk downhill efficiently — is optimization theory.

There's also a fifth piece — information theory — which explains where cross-entropy and KL divergence actually come from. It's smaller but important for understanding why certain loss functions work the way they do.

What you don't need: You don't need to prove theorems. You don't need to hand-derive backpropagation. You don't need measure theory, functional analysis, or topology. You need to understand the operations well enough that when you see output = softmax(X @ W + b), you know exactly what every symbol does and why.

Chapter 2

Linear Algebra

The language of ML. Every piece of data, every weight, every operation is expressed in linear algebra.

Scalars, Vectors, and Matrices

Foundation

A scalar is one number. A vector is a list. A matrix is a grid.

A scalar: 7. A vector: [3, 1, 4, 1, 5] — a single ECG signal with 5 samples is a vector of length 5. A matrix: a grid of numbers with rows and columns. A batch of 32 ECG signals, each 1000 samples long, is a matrix with shape 32×1000. A tensor is the general term — a scalar is a 0D tensor, a vector is 1D, a matrix is 2D, and you can keep going (3D, 4D, etc.).

Used In ML

Everything is a tensor

Your data is a tensor. Your weights are tensors. Your outputs are tensors. The entire forward pass of a neural network is tensor operations. When code says x.shape = [32, 12, 1000], that's a 3D tensor: 32 samples, 12 leads, 1000 time steps each.
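
A minimal NumPy sketch of these shapes (the sizes mirror the ECG example above and are otherwise arbitrary):

import numpy as np

scalar = np.float32(7.0)                 # 0D tensor: one number
vector = np.array([3, 1, 4, 1, 5])       # 1D tensor, shape (5,)
batch  = np.random.randn(32, 1000)       # 2D tensor: 32 signals, 1000 samples each
leads  = np.random.randn(32, 12, 1000)   # 3D tensor: 32 samples, 12 leads, 1000 time steps

print(vector.shape, batch.shape, leads.shape)   # (5,) (32, 1000) (32, 12, 1000)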

The Dot Product

Foundation

Multiply corresponding elements, add them up

Given two vectors [2, 3, 1] and [4, 1, 5]: dot product = 2×4 + 3×1 + 1×5 = 16. One number out. That's it.

Used In ML

The fundamental operation of a neuron

A single neuron computes the dot product of its input vector and its weight vector, then adds a bias. Every neuron in every network does this. The dot product is also the core of attention in Transformers — the "relevance score" between a Query and a Key is their dot product. And convolution? A sliding dot product between the filter and each window of the signal.
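
A minimal sketch of a single neuron in NumPy (the weights and bias are made-up numbers, not from any trained model):

import numpy as np

x = np.array([2.0, 3.0, 1.0])    # input vector
w = np.array([4.0, 1.0, 5.0])    # weight vector
b = 0.5                          # bias

dot = np.dot(x, w)               # 2*4 + 3*1 + 1*5 = 16.0
output = dot + b                 # what one neuron computes: dot product + bias
print(dot, output)               # 16.0 16.5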

Matrix Multiplication

Foundation

A whole layer of neurons at once

If you have 256 neurons, each taking the same 1000-element input, you could compute 256 separate dot products. Or you can pack all 256 weight vectors into a single 1000×256 matrix and do one matrix multiplication. The result is 256 outputs — one per neuron. That's what input @ W does: it runs the entire layer in one operation.

Shape rule: (32×1000) @ (1000×256) = (32×256). The inner dimensions (1000) must match and they "collapse." You get rows from the left × columns from the right.
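
A minimal sketch of the shape rule, with random numbers standing in for a real batch and layer:

import numpy as np

X = np.random.randn(32, 1000)    # batch: 32 samples, 1000 features each
W = np.random.randn(1000, 256)   # layer: 256 neurons, each with 1000 weights
b = np.zeros(256)                # one bias per neuron

out = X @ W + b                  # the inner dimensions (1000) collapse
print(out.shape)                 # (32, 256): 256 outputs for each of the 32 samples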

Used In ML

Why GPUs exist

Matrix multiplication is the bottleneck of every neural network. A GPU is, at its core, a machine optimized for doing thousands of multiply-and-add operations in parallel — which is exactly what matrix multiply requires. The reason ML needs expensive hardware is fundamentally because of this operation.

Transpose, Norms, and Eigenvalues

Transpose

Flip rows and columns

A matrix with shape (3×5) becomes (5×3). Rows become columns, columns become rows. You see this constantly in ML code — K.T in attention, W.T when computing gradients. It's needed whenever you want to align dimensions for a matrix multiply.

Norms

The "length" of a vector

The L2 norm (Euclidean norm): square each element, add them, take the square root. It's the straight-line distance from the origin. Used in weight decay (penalizing large weights by their L2 norm), normalization (dividing by the norm to make a unit vector), and distance calculations.

The L1 norm: sum of absolute values. Used in Lasso regularization — it encourages weights to be exactly zero, which effectively does feature selection.
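
A quick sketch of both norms on a toy vector:

import numpy as np

v = np.array([3.0, -4.0])

l2 = np.linalg.norm(v)      # sqrt(3^2 + (-4)^2) = 5.0, the straight-line length
l1 = np.abs(v).sum()        # |3| + |-4| = 7.0, the sum of absolute values
unit = v / l2               # normalization: same direction, length 1
print(l2, l1, unit)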

Eigenvalues & Eigenvectors

The "natural axes" of a transformation

For a square matrix A, an eigenvector v is a direction that A merely stretches (not rotates): Av = λv, where λ is the eigenvalue (the stretching factor). PCA finds the eigenvectors of the data's covariance matrix — these are the directions of maximum variation. The eigenvalues tell you how much variation each direction captures, so you keep the top ones and discard the rest.
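
A minimal PCA sketch on toy data, using eigendecomposition of the covariance matrix:

import numpy as np

X = np.random.randn(500, 3)                # 500 samples, 3 features (toy data)
X = X - X.mean(axis=0)                     # center the data first
cov = np.cov(X, rowvar=False)              # 3x3 covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)     # eigh is for symmetric matrices
order = np.argsort(eigvals)[::-1]          # rank directions by variance captured
top2 = eigvecs[:, order[:2]]               # keep the top 2 principal directions
X_reduced = X @ top2                       # project the data onto them
print(eigvals[order], X_reduced.shape)     # sorted eigenvalues, (500, 2)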

Connection to the ML Guide: Chapter 2 (One Neuron) is a dot product + bias. Chapter 3 (MLP) is matrix multiplication + activation. Chapter 6 (Transformer) is dot products between Q and K matrices. It's all linear algebra under the hood.

Chapter 3

Calculus

The learning mechanism. Calculus is what makes "adjust the weights to be less wrong" precise rather than random.

Derivatives: The Slope of the Error

Foundation

A derivative tells you how fast something is changing

If y = x², the derivative dy/dx = 2x. At x = 3, the derivative is 6 — meaning if you nudge x up by a tiny amount, y will increase by about 6 times that amount. In ML terms: if you nudge a weight by a tiny amount, how much does the loss change? That ratio is the gradient.

Partial Derivatives

Derivatives with respect to one variable at a time

A neural network has millions of weights. The loss is a function of all of them simultaneously. A partial derivative asks: "if I change just this one weight, holding all others fixed, how does the loss change?" Compute this for every weight and you have the gradient vector — a list of partial derivatives, one per weight, pointing in the direction of steepest increase of the loss.

Used In ML

Gradient descent = walk opposite to the gradient

The gradient points uphill (toward more error). You want less error. So you step in the opposite direction: weight = weight − learning_rate × gradient. That's gradient descent. The entire training loop is: compute loss → compute gradient of loss with respect to every weight → step opposite → repeat.
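
A toy version of that loop on the one-dimensional loss y = x² (the starting point and learning rate are arbitrary choices):

x = 3.0                                # initial "weight"
learning_rate = 0.1

for step in range(50):
    gradient = 2 * x                   # derivative of x**2
    x = x - learning_rate * gradient   # step opposite to the gradient
print(x)                               # very close to 0, the minimum of x**2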

The Chain Rule: Why Backpropagation Works

Foundation

Derivatives of composed functions multiply

If y = f(g(x)), then dy/dx = f'(g(x)) × g'(x). Chain the derivatives. That's the entire rule. If you have a chain of 100 functions (100 layers), the derivative of the output with respect to the input is the product of 100 individual derivatives.

Backpropagation

The chain rule applied efficiently from output to input

A neural network is a chain of functions: input → layer 1 → activation → layer 2 → activation → ... → loss. Backpropagation starts at the loss and works backward, computing the derivative at each layer and multiplying it through. This gives you the gradient for every weight in the network in one backward pass. The key insight is efficiency: by going backward and reusing intermediate results, you compute all gradients in roughly the same time as one forward pass.

The Vanishing Gradient Problem

Multiplying many small numbers gives zero

If each layer's derivative is less than 1 (which sigmoid and tanh produce for most inputs), multiplying 100 of them together gives a number astronomically close to zero. Early layers receive essentially no gradient signal — they can't learn. This is a direct mathematical consequence of the chain rule applied over many layers, and it's why ReLU (derivative is exactly 1 for positive inputs) and skip connections (which provide "shortcut" gradient paths) were invented.
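
A quick numerical check, assuming 100 layers whose local derivative is the sigmoid's maximum value of 0.25:

product = 0.25 ** 100
print(product)   # about 6e-61: essentially no gradient reaches the earliest layers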

Automatic differentiation is how modern frameworks (PyTorch, TensorFlow) compute gradients. You don't hand-derive any derivatives. The framework records every operation during the forward pass, building a computational graph, then applies the chain rule automatically on the backward pass. When you write loss.backward(), this is what happens.
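
A minimal PyTorch sketch of this (assuming torch is installed): two recorded operations, one backward call:

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2               # forward pass: each operation is recorded in the graph
z = torch.sin(y)         # z = sin(x**2), a chain of two functions

z.backward()             # chain rule applied backward: dz/dx = cos(x**2) * 2x
print(x.grad)            # cos(9) * 6, roughly -5.47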

Chapter 4

Probability & Statistics

Gives meaning to what the model outputs and provides the theoretical foundation for loss functions, regularization, and evaluation.

Probability Distributions

Foundation

A distribution describes how likely each outcome is

Flip a fair coin: P(heads) = 0.5, P(tails) = 0.5. That's a distribution — it assigns a probability to every possible outcome, and they all add up to 1. A normal (Gaussian) distribution is the bell curve — most values cluster around the mean, with probability falling off symmetrically. Defined by two numbers: the mean (center) and the variance (width).

Used In ML

Model outputs are distributions

When a classifier outputs [0.1, 0.7, 0.2] via softmax, that's a probability distribution over 3 classes. When a VAE encoder outputs a mean and variance, it's defining a Gaussian distribution in latent space. When you initialize weights randomly, you sample from a distribution (usually Gaussian with a carefully chosen variance).
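
A sketch of softmax in NumPy (the raw scores are made up):

import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return exp / exp.sum()

logits = np.array([0.5, 2.0, 1.0])        # raw scores for 3 classes
probs = softmax(logits)
print(probs, probs.sum())                 # roughly [0.14, 0.63, 0.23], sums to 1.0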

Conditional Probability and Bayes' Theorem

Foundation

P(A|B) = "probability of A given that B happened"

P(arrhythmia | this ECG) = "how likely is arrhythmia given these specific signal features?" That's what a classifier computes. Bayes' theorem flips the conditioning: P(A|B) = P(B|A) × P(A) / P(B). It lets you update beliefs with new evidence.

Used In ML

Regularization is a prior belief

In Bayesian terms, your training data provides the likelihood, and regularization (weight decay, dropout) encodes a prior — a belief that simpler models are more likely correct. The trained model is the posterior — your updated belief after seeing the data. Weight decay specifically encodes a Gaussian prior centered at zero: "I believe, before seeing any data, that most weights should be small."

Expectation, Variance, and Covariance

Foundation

Summary statistics of a distribution

Expectation (mean): The average value you'd get if you sampled infinitely many times; a weighted sum of all possible values, each weighted by its probability.

Variance: How spread out the values are around the mean. High variance = wide spread = more uncertainty.

Covariance: Do two variables move together? Positive covariance = when one goes up, the other tends to go up. The covariance matrix captures this for all pairs of variables simultaneously.

Used In ML

Batch normalization, PCA, and weight initialization

Batch normalization computes the mean and variance of each feature within a batch, then normalizes to zero mean and unit variance — this stabilizes training. PCA finds the eigenvectors of the covariance matrix — the directions of maximum variance. Weight initialization (Xavier, He) carefully sets the variance of initial weights so that activations don't explode or vanish as they pass through layers.
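
A sketch of the core batch-norm computation (a real layer also has learned scale and shift parameters, omitted here):

import numpy as np

batch = np.random.randn(32, 256) * 5 + 10           # activations with mean ~10, std ~5

mean = batch.mean(axis=0)                           # per-feature mean over the batch
var = batch.var(axis=0)                             # per-feature variance
normalized = (batch - mean) / np.sqrt(var + 1e-5)   # small epsilon avoids division by zero
print(normalized.mean(), normalized.std())          # roughly 0.0 and 1.0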

Maximum Likelihood Estimation

Foundation

Find the parameters that make the observed data most probable

You have data. You have a model with parameters. Maximum Likelihood Estimation (MLE) says: pick the parameters that maximize the probability of having observed this specific data. It's the mathematical justification for training — you're finding the weights that make your training data most "likely" under the model.

Used In ML

Cross-entropy loss IS maximum likelihood

Minimizing cross-entropy loss is mathematically identical to maximizing the likelihood of the correct labels. This isn't a coincidence — cross-entropy was chosen as a loss function because of this equivalence. Similarly, minimizing MSE is equivalent to maximum likelihood under a Gaussian noise assumption. The loss functions aren't arbitrary; they're derived from probability theory.


Chapter 5

Optimization

You have a loss landscape — a surface where height represents error. Optimization is the art of finding the lowest valley.

The Loss Landscape

Foundation

The loss is a function of all the weights

Imagine a terrain where each coordinate represents a weight value and the height represents the loss. With 2 weights, this is a 3D surface you could visualize. With 1 million weights, it's a surface in 1,000,001-dimensional space that you can't visualize, but the math works identically. Training = finding the lowest point on this surface.

Convex vs. Non-Convex

Simple models have one valley. Neural networks have trillions.

A convex function has one global minimum — like a bowl. Linear regression's loss is convex; any downhill walk leads to the same bottom. A non-convex function has many valleys (local minima), saddle points, and flat plateaus. Neural network losses are deeply non-convex. This means gradient descent isn't guaranteed to find the best solution — but in practice, the many "pretty good" valleys it finds tend to generalize similarly well.

Gradient Descent Variants

Batch Gradient Descent

Compute gradient on ALL data, take one step

Accurate gradient, but impossibly slow for large datasets. One step might take hours.

Stochastic Gradient Descent (SGD)

Compute gradient on ONE sample, take one step

Very fast, but the gradient is noisy — one sample isn't representative. The path jitters heavily.

Mini-Batch SGD

Compute gradient on a small batch (32–512 samples)

The practical compromise everyone uses. The batch is large enough to reduce noise, small enough to be fast. When people say "SGD" in practice, they almost always mean this. The batch size is a hyperparameter — larger batches give smoother gradients but use more memory.
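
A sketch of how one epoch of mini-batch SGD walks through the data (the gradient step itself is omitted; toy arrays stand in for a real dataset):

import numpy as np

X = np.random.randn(1000, 20)                      # 1000 samples, 20 features
y = np.random.randn(1000)
batch_size = 32

indices = np.random.permutation(len(X))            # shuffle once per epoch
for start in range(0, len(X), batch_size):
    batch_idx = indices[start:start + batch_size]  # the last batch may be smaller
    X_batch, y_batch = X[batch_idx], y[batch_idx]
    # compute the loss and gradient on this batch only, then take one step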

Learning Rate and Scheduling

The Most Important Hyperparameter

Step size controls everything

Too high: you overshoot valleys and the loss explodes. Too low: training takes forever and you get stuck in shallow valleys. The learning rate schedule changes the rate during training — common strategies: start high (explore broadly), decay over time (settle into a valley). Warmup: start very low, ramp up, then decay — prevents early instability when the model hasn't seen much data yet.
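
A sketch of one common schedule, linear warmup followed by cosine decay (the peak rate and step counts are arbitrary):

import math

def lr_at(step, peak_lr=1e-3, warmup_steps=500, total_steps=10_000):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps                       # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))      # cosine decay to 0

print(lr_at(250), lr_at(500), lr_at(10_000))   # ramping up, at the peak, near zero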

Momentum, Adam, and Beyond

Momentum

Keep a running average of past gradients

Instead of using only the current gradient, accumulate a "velocity" — like a ball rolling downhill that builds speed. Mathematically: v = β × v_prev + gradient; weight -= lr × v. The β (typically 0.9) controls how much history to keep. This smooths out noise and helps push through flat regions and saddle points.

Adam = Momentum + Adaptive Rates

Per-weight learning rates based on gradient history

Adam tracks two running averages per weight: the first moment (mean of recent gradients — this is momentum) and the second moment (mean of recent squared gradients — this measures volatility). Weights with large, volatile gradients get smaller steps. Weights with small, consistent gradients get larger steps. The combination makes it robust across a wide range of problems without much tuning.
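
A toy sketch of the Adam update for a single weight on the loss w², using the common default betas (a real optimizer applies this to every weight in parallel):

import numpy as np

w = 3.0
m, v = 0.0, 0.0                                  # first and second moment estimates
lr, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8

for t in range(1, 301):
    grad = 2 * w                                 # gradient of the toy loss w**2
    m = beta1 * m + (1 - beta1) * grad           # momentum: mean of recent gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # volatility: mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)     # adaptive, per-weight step
print(w)                                         # close to the minimum at 0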

Connection to the ML Guide: This entire chapter is the theory behind Chapter 4 (Backpropagation & Optimizers). The ML Guide tells you what each optimizer does; this chapter tells you why it works mathematically.

Chapter 6

Information Theory

A small but powerful branch that explains where cross-entropy, KL divergence, and the concept of "surprise" come from.

Entropy: Measuring Surprise

Foundation

How uncertain is a distribution?

Entropy H = −Σ p(x) × log(p(x)). A fair coin has maximum entropy (most uncertain — each flip is maximally surprising). A loaded coin that always lands heads has zero entropy (no surprise). Entropy measures the average "surprise" of sampling from a distribution, where surprise of an event = −log(p(event)). Rare events are more surprising.
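
A sketch comparing a fair coin to a loaded one (base-2 logs, so entropy is measured in bits):

import numpy as np

def entropy(probs):
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return -(p * np.log2(p)).sum()

print(entropy([0.5, 0.5]))            # 1.0 bit: a fair coin, maximum uncertainty
print(entropy([0.99, 0.01]))          # ~0.08 bits: almost no surprise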

Used In ML

Decision trees split on entropy reduction

When XGBoost or a random forest chooses which feature to split on, it picks the one that reduces entropy the most — the split that makes each resulting group most "pure" (least uncertain). This is called information gain.

Cross-Entropy: Measuring Model Quality

Foundation

Average surprise when using the model's distribution instead of the true one

Cross-entropy H(p, q) = −Σ p(x) × log(q(x)), where p is the true distribution and q is the model's predicted distribution. If the model perfectly matches reality, cross-entropy equals entropy (minimum possible). If the model is wrong, cross-entropy is higher — the extra "surprise" is wasted bits from a bad model.

Used In ML

THE classification loss function

When the true distribution is a one-hot vector (all probability on the correct class), cross-entropy simplifies to −log(q(correct class)) — the negative log of the probability your model assigned to the right answer. This is exactly the cross-entropy loss from the ML Guide's loss function chapter. Now you know where it comes from: it's the information-theoretic measure of how bad your probability estimates are.
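
A sketch of that simplification with made-up predictions:

import numpy as np

q = np.array([0.1, 0.7, 0.2])             # model's predicted distribution
p = np.array([0.0, 1.0, 0.0])             # one-hot truth: class 1 is correct

full = -(p * np.log(q)).sum()             # the general cross-entropy formula
shortcut = -np.log(q[1])                  # -log(probability of the correct class)
print(full, shortcut)                     # both roughly 0.357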

KL Divergence: Distance Between Distributions

Foundation

How different are two distributions?

KL Divergence D_KL(p || q) = Σ p(x) × log(p(x) / q(x)). It's the "extra surprise" from using distribution q when the truth is p. Equivalently: KL = cross-entropy(p, q) − entropy(p). Since entropy(p) is fixed (it's the truth), minimizing cross-entropy and minimizing KL divergence are the same thing.
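
A quick numerical check of that identity, with made-up distributions:

import numpy as np

p = np.array([0.6, 0.3, 0.1])                 # "true" distribution
q = np.array([0.5, 0.25, 0.25])               # model's distribution

kl = (p * np.log(p / q)).sum()                # direct KL formula
cross_entropy = -(p * np.log(q)).sum()
entropy = -(p * np.log(p)).sum()
print(kl, cross_entropy - entropy)            # the same number both ways (~0.073)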

Not Symmetric

KL(p||q) ≠ KL(q||p)

This isn't a true "distance" — the direction matters. KL(p||q) heavily penalizes places where p has probability mass but q doesn't (the model assigns ~0 probability to something real). This asymmetry is why it's chosen for VAEs: it penalizes the model for having "holes" in its distribution where real data lives.

Used In ML

VAEs, knowledge distillation, RLHF

In VAEs: KL divergence between the encoder's learned distribution and a standard normal — keeps latent space smooth. In knowledge distillation: KL between a large teacher model's outputs and a small student model's outputs. In RLHF: KL between the fine-tuned model and the base model — prevents the model from drifting too far from its original behavior.


Chapter 7

The Map: Which Math Lives Where

Every ML component has a mathematical backbone. Here's the full mapping.

ML Component | Math Used | What It Does

Forward Pass (Making Predictions)
Neuron / Dense Layer | Dot product, matrix multiply | Computes weighted sum of inputs
Activation (ReLU, sigmoid) | Nonlinear functions | Introduces curves; enables learning beyond straight lines
Softmax output | Exponential, normalization | Converts raw scores to probabilities (a distribution)
Convolution | Sliding dot product | Detects local patterns regardless of position
Self-Attention | Matrix multiply, softmax, scaling | Computes relevance between all position pairs
Positional Encoding | Sine/cosine functions | Injects order information into orderless attention
Batch Normalization | Mean, variance, normalization | Stabilizes activations across a batch

Loss Functions (Measuring Error)
MSE | Squared difference, mean | Penalizes large prediction errors (regression)
Cross-Entropy | Negative log probability | Penalizes confident wrong answers (classification)
Dice Loss | Set overlap ratio | Measures mask overlap (segmentation)
KL Divergence | Log ratio of distributions | Distance between predicted and target distribution
Contrastive Loss | Cosine similarity, exponentials | Pushes similar items together, different apart

Training (Adjusting Weights)
Backpropagation | Chain rule (calculus) | Computes gradient for every weight efficiently
SGD | Gradient × learning rate | Fixed-speed downhill step
Adam | Running mean & variance of gradients | Adaptive per-weight learning rates
Weight Decay / L2 | L2 norm penalty | Shrinks weights toward zero (Gaussian prior)
Learning Rate Schedule | Decay functions (cosine, step) | Reduces step size over training

Data Processing & Evaluation
PCA | Eigendecomposition of covariance matrix | Finds axes of maximum variation
Weight Initialization | Variance-scaled random sampling | Prevents activation explosion/vanishing at start
Data Augmentation | Affine transforms, noise addition | Creates variation without new data
The pattern: Linear algebra handles the data flow. Calculus handles the learning. Probability gives meaning to outputs and losses. Optimization finds the best weights. Information theory justifies the loss functions. Everything connects back to these five branches.

This guide covers the math that ML practitioners encounter daily. Deeper theory (measure-theoretic probability, differential geometry for manifold learning, category theory) exists but is needed only for research at the frontier. If the concepts here are solid, you can read any ML paper and follow the math.