Built from the ground up. Each concept starts with the core idea, shows the simplest implementation, names its limitation, then shows what was invented to fix it. The final map at the end tells you what people actually use.
Every machine learning system does the same thing. No exceptions. Here it is:
It has a pile of adjustable numbers ("weights"). It multiplies your input by those numbers to produce a guess. It measures how wrong the guess was. It adjusts the numbers to be less wrong. It repeats this millions of times.
A neural network with a billion parameters and a kid adjusting their basketball shot are doing the same thing. The only variables are:
The architecture — the shape of the math (how the numbers are arranged and connected).
The loss function — the definition of "wrong" (a formula that scores the guess).
The optimizer — the adjustment strategy (how the numbers get nudged).
Everything below is specifics on these three things. That's all ML is.
The simplest possible machine learning system is one neuron. It is this:
output = input × weight + bias
That's literally y = mx + b from algebra class. The "weight" is the slope, the "bias" is the y-intercept. Training means finding the weight and bias that produce the least error on your data. With one neuron and the loss function "mean squared error" (average of (guess - truth)²), this is linear regression.
All ML starts here. One neuron = one multiplication + one addition. Checking = the loss function. Adjusting = the optimizer.
One neuron. Loss: Mean Squared Error (average of (guess - truth)², which punishes big errors more than small ones). Optimizer: you can solve this with pure algebra (called OLS — ordinary least squares) or use gradient descent. Predicts a number, not a category.
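To make this concrete, here is a minimal sketch of training one neuron with gradient descent on mean squared error — plain NumPy, synthetic data invented for illustration, and an arbitrary learning rate. Not a library API, just the loop described above.

```python
import numpy as np

# Synthetic data: y ≈ 3x + 2 plus noise (made up for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 2 + rng.normal(0, 0.1, size=100)

weight, bias = 0.0, 0.0        # the two adjustable numbers
learning_rate = 0.1

for step in range(500):
    guess = weight * x + bias          # output = input × weight + bias
    error = guess - y
    loss = np.mean(error ** 2)         # mean squared error
    grad_w = np.mean(2 * error * x)    # how the loss changes if weight moves
    grad_b = np.mean(2 * error)        # how the loss changes if bias moves
    weight -= learning_rate * grad_w   # nudge the numbers to be less wrong
    bias -= learning_rate * grad_b

print(weight, bias)  # should land near 3 and 2
```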
Real-world relationships are curved. A single neuron can never learn "heart failure risk rises slowly until age 55, then sharply." It's stuck with a constant slope.
To learn curves, you need more than one neuron. Stack them in layers — that's a Multi-Layer Perceptron (MLP). Every neuron in one layer connects to every neuron in the next layer.
But there's a catch: if all each neuron does is input × weight + bias, stacking them is pointless. A chain of linear operations is still linear — you'd still get a straight line. So after each neuron's multiplication, you pass the result through an activation function — a small nonlinear function that introduces curves.
ReLU (the default): if the number is negative, output zero; otherwise output the number unchanged. That's it. Sigmoid: squishes any number to a value between 0 and 1; used when you need a probability. tanh: squishes to between -1 and 1. Softmax: takes a list of numbers and converts them to probabilities that add up to 1 — it does this by raising e to the power of each number, then dividing each by the total. Used on the final layer when picking between categories.
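For reference, a minimal NumPy sketch of those four activations. The "subtract the max" step in softmax is a standard numerical-stability trick, not part of the formula itself.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)        # negative → 0, positive → unchanged

def sigmoid(x):
    return 1 / (1 + np.exp(-x))    # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)              # squashes to (-1, 1)

def softmax(x):
    e = np.exp(x - np.max(x))      # subtract max for numerical stability
    return e / e.sum()             # probabilities that add up to 1

print(softmax(np.array([2.0, 1.0, 0.1])))  # ≈ [0.66, 0.24, 0.10]
```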
Layers of neurons with activation functions between them can approximate any continuous function to arbitrary accuracy, given enough neurons — any shape, any curve. This is called the Universal Approximation Theorem. It's the theoretical foundation of deep learning.
Input (say, 1000 numbers from an ECG) → hidden layer of 256 neurons with ReLU → output layer with softmax. For classification, pair with cross-entropy loss (measures how surprised the model is by the right answer; if it assigned high probability to the correct class, low loss; if it assigned low probability, enormous loss — the logarithm makes confident-and-wrong catastrophically expensive).
```
hidden = ReLU(input × W1 + b1)      # 1000 inputs → 256 neurons
output = softmax(hidden × W2 + b2)  # 256 → 5 probabilities
```
The MLP treats its 1000 inputs as 1000 independent numbers. Shuffle them randomly and it has no idea anything changed. It can't learn "a spike followed by a dip" because it doesn't know what "followed by" means. It also can't learn "a spike at position 50 is the same pattern as a spike at position 500" — it treats each position as a completely separate feature.
Before we solve these limitations, we need to understand how the adjustment step works — because everything from here forward depends on it.
After the model makes a guess and the loss function scores it, you need to figure out: for every single weight in the network, should I increase it or decrease it, and by how much?
Backpropagation is the algorithm that answers this. It uses the chain rule from calculus to work backward from the loss through every layer, computing a gradient for each weight. A gradient is just a number that says "if you increased this weight by a tiny amount, the loss would change by this much." Positive gradient = increasing this weight increases error, so decrease the weight. Negative gradient = the opposite.
The optimizer then uses these gradients to update the weights. Different optimizers do this differently:
new_weight = old_weight − learning_rate × gradient. The learning rate (e.g. 0.01) controls step size. "Stochastic" means you compute the gradient on a small random batch of data rather than the entire dataset — faster, noisier, works fine. This is SGD (stochastic gradient descent), the original optimizer.
Every weight gets the same step size. Some weights need big adjustments, others need tiny ones. And SGD only looks at the current gradient — if the gradient is noisy, it jitters back and forth instead of making progress.
Momentum: keeps a running average of recent gradients. If the gradient has been pointing the same way for several steps, it builds speed — like a ball rolling downhill. Pushes through noise and flat spots. Adaptive rate: tracks how volatile each weight's gradient has been. Volatile weights get smaller, more cautious steps; stable weights get bigger steps. Every weight gets its own personalized speed. Adam is the optimizer that combines both ideas.
AdamW: same as Adam, but every weight is also shrunk slightly toward zero at each step (multiplied by something like 0.999). This prevents any weight from growing excessively large, keeping the model simpler and better at generalizing. It is the current standard optimizer for neural networks.
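A sketch of the three update rules side by side, in plain NumPy. The hyperparameter values (0.9, 0.999, 1e-8, 0.01) are common defaults, not anything mandated; a real system would use a framework optimizer rather than this hand-rolled version.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    return w - lr * grad                         # same step size for every weight

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # t counts steps starting at 1
    m = b1 * m + (1 - b1) * grad                 # momentum: running average of gradients
    v = b2 * v + (1 - b2) * grad**2              # adaptive rate: running average of gradient²
    m_hat = m / (1 - b1**t)                      # bias correction for early steps
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-weight step size
    return w, m, v

def adamw_step(w, grad, m, v, t, lr=0.001, weight_decay=0.01, **kw):
    w = w * (1 - lr * weight_decay)              # shrink every weight slightly toward zero
    return adam_step(w, grad, m, v, t, lr=lr, **kw)
```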
The MLP's core limitation was: no sense of position, no sense of sequence. Two architectures were invented to fix this — each solves a different version of the problem.
A spike in an ECG is the same spike whether it's at sample 50 or sample 500. Instead of learning separate weights for every position (like an MLP), learn one small "filter" and slide it across the entire signal. This is called translation invariance — it just means "position doesn't change the pattern."
A filter (or kernel) is a small list of weights, say 5 numbers. Place it over the first 5 samples of your signal. Multiply each filter weight by the corresponding sample. Add them up. That gives one output number. Slide one step right, repeat. One filter scans for one pattern. Use 64 filters in parallel to scan for 64 different patterns — each output list is called a channel (this is what "64 feature maps" means).
```
# Convolution = a sliding dot product. That's it.
for position in range(signal_length - filter_length + 1):
    window = signal[position : position + filter_length]
    output[position] = sum(w * x for w, x in zip(filter_weights, window))
```
After detecting patterns, shrink the output by keeping only the maximum value from every group of 2 (or 4). This makes the data smaller, makes the network care about whether a pattern exists rather than exactly where, and gives subsequent layers a wider view. Stacking conv + pool layers repeatedly builds a hierarchy: first layer detects edges, second detects shapes, third detects concepts.
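A sketch of max pooling with a window of 2, assuming the signal is a plain Python list:

```python
def max_pool(values, window=2):
    # Keep only the largest value in each non-overlapping group of `window` samples.
    return [max(values[i:i + window])
            for i in range(0, len(values) - window + 1, window)]

print(max_pool([1, 5, 2, 2, 9, 3]))  # → [5, 2, 9]
```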
Each filter only sees a small window. Stacking many layers widens the view, but the network still struggles when the answer depends on the relationship between the very beginning and the very end of a signal. It has no mechanism for directly comparing distant regions.
Language, music, time-series signals — the meaning of each step depends on what came before. You need a "memory" that accumulates as you read left to right.
The hidden state is a list of numbers that serves as memory. At each time step, combine the current input with the hidden state, produce a new hidden state. After processing the whole sequence, the hidden state summarizes everything.
```
memory = zeros
for sample in ecg_signal:
    memory = tanh(sample × W_input + memory × W_memory + bias)
```
By step 500, information from step 1 is gone. The math: when backpropagating through hundreds of steps, you multiply gradients together hundreds of times. Multiplying a number less than 1 by itself hundreds of times gives effectively zero. The gradient "vanishes" — early steps get no learning signal.
Three small neural networks ("gates") control information flow. The forget gate decides what to erase (outputs 0-to-1 per memory slot; 0 = erase, 1 = keep). The input gate decides what new information to write. The output gate decides what to expose. Key insight: by keeping the forget gate at ~1 and the input gate at ~0 for a memory slot, info can travel unchanged across hundreds of steps. That's the whole trick.
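A minimal sketch of one LSTM step in NumPy. The weight names and shapes are illustrative (not from any library), and real frameworks fuse these four matrix multiplies into one for speed.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lstm_step(x, h, c, W_f, W_i, W_o, W_c, b_f, b_i, b_o, b_c):
    z = np.concatenate([h, x])                  # previous hidden state + current input
    f = sigmoid(W_f @ z + b_f)                  # forget gate: 0 = erase, 1 = keep
    i = sigmoid(W_i @ z + b_i)                  # input gate: what new info to write
    o = sigmoid(W_o @ z + b_o)                  # output gate: what to expose
    c_new = f * c + i * np.tanh(W_c @ z + b_c)  # cell memory: kept + newly written
    h_new = o * np.tanh(c_new)                  # exposed hidden state
    return h_new, c_new
```

With the forget gate near 1 and the input gate near 0, `c_new ≈ c` — information rides through the step untouched, which is exactly the trick described above.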
You must process step 1 before step 2 before step 3. You can't parallelize this across a GPU. Training is slow on long sequences, and even LSTMs degrade over thousands of steps.
The CNN's limitation: can't directly compare distant regions. The LSTM's limitation: sequential, slow, still forgets over very long ranges. The Transformer solves both.
Instead of scanning left-to-right or through a small window, compute a "relevance score" between every pair of positions in one shot. This is called self-attention.
For every position in the input, create three vectors by multiplying by three learned weight matrices:
Query (Q): "What am I looking for?"
Key (K): "What do I contain?"
Value (V): "Here's my actual information."
Score every position against every other: dot product of one position's Q with every position's K. High score = high relevance. Normalize with softmax (now they're percentages summing to 1). Take a weighted average of all Values using those percentages.
```
Q = input × W_query    # each position: "what am I looking for?"
K = input × W_key      # each position: "what do I have?"
V = input × W_value    # each position: "here's my info"

scores  = (Q × K.transpose) / sqrt(dimension)  # scaling prevents extremes
weights = softmax(scores)                      # each row sums to 1
output  = weights × V                          # weighted blend of everyone's info
```
The sqrt(dimension) division is just a scaling factor to keep dot products from getting too large. Nothing deep.
Do the Q/K/V process 8 times (or 12, or 16), each with different weight matrices. Each "head" can learn to attend to different types of relationships (one for rhythm, one for amplitude, etc.). Concatenate all results and mix with one final weight matrix.
Since attention processes all positions simultaneously, it has no inherent sense of order. Fix: add a unique pattern of numbers to each position's input before attention. The original paper used sine waves at different frequencies; modern systems often just learn the position numbers during training.
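A sketch of the original sinusoidal scheme from "Attention Is All You Need": each position gets a unique vector of sines and cosines at different frequencies, added to its embedding before attention. Assumes an even embedding dimension.

```python
import numpy as np

def positional_encoding(num_positions, dim):
    pos = np.arange(num_positions)[:, None]   # (positions, 1)
    i = np.arange(0, dim, 2)[None, :]         # even dimension indices
    angles = pos / (10000 ** (i / dim))       # lower dims oscillate faster
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles)              # even dims: sine
    pe[:, 1::2] = np.cos(angles)              # odd dims: cosine
    return pe

# embeddings_with_position = token_embeddings + positional_encoding(seq_len, dim)
```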
Comparing every position to every other position for a sequence of length N requires N² computations. Double the sequence length, quadruple the compute. This makes very long sequences expensive. (Active research area: linear attention, sparse attention, etc.)
Some problems need structures beyond "scan locally" (CNN) or "attend globally" (Transformer). Here are the main specialized designs and the specific limitations they address.
To label every millisecond of a signal (this is the P-wave, this is the QRS complex…), you need to understand the overall concept ("what": that's a heartbeat) AND preserve precise timing ("where": it starts at sample 247). Shrinking the data (via pooling) builds understanding but destroys precise location. This is the fundamental tension.
Encoder (left side): CNN layers + pooling, progressively shrinking. Each level extracts higher-level features but at lower resolution.
Bottleneck (bottom): Smallest representation. Maximum understanding, minimum spatial detail.
Decoder (right side): Transposed convolutions (learnable up-scaling) that expand back to original size.
Skip connections (the bridges): At each level, the encoder's output (sharp but dumb) is concatenated alongside the decoder's output (blurry but smart). "Concatenated" = literally gluing two lists of numbers together along the channel dimension. A subsequent conv layer mixes them.
```
encoder_out = conv_block(input)              # sharp, detailed
bottleneck  = conv_block(pool(encoder_out))  # small, conceptual
decoder_up  = upsample(bottleneck)           # stretched, blurry

# The skip connection: paste sharp next to blurry
combined = concatenate(decoder_up, encoder_out)
output   = conv_block(combined)              # now sharp AND smart
```
If 95% of your signal is "background," a lazy model can predict "all background" and be 95% accurate. Cross-entropy per-point would barely penalize this. Dice loss instead measures overlap: (2 × area where both agree) / (total area of both). If the model misses the small foreground entirely, Dice = 0 — maximum penalty. This forces the model to actually find the small regions that matter.
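A sketch of a soft Dice loss for a binary mask, assuming the model outputs a per-point probability in [0, 1]. The small epsilon guards against division by zero; it's a common convention, not part of the definition.

```python
import numpy as np

def dice_loss(pred_probs, true_mask, eps=1e-6):
    intersection = np.sum(pred_probs * true_mask)          # area where both agree
    total = np.sum(pred_probs) + np.sum(true_mask)         # total area of both
    dice_score = (2 * intersection + eps) / (total + eps)  # 1 = perfect overlap, 0 = none
    return 1 - dice_score

# A model that predicts "all background" against a small foreground scores ≈ 1 (maximum loss).
```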
Molecules, social networks, or the 12 leads of an ECG (which have a physical spatial relationship around the heart). You need an architecture that respects "who is connected to whom."
Each node has a state (a list of numbers). Each round: every node collects its neighbors' states, aggregates them (sum or average — the order doesn't matter, which is called permutation invariance), then updates its own state using the aggregate plus its current state. After several rounds, each node's state encodes its neighborhood.
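A sketch of one round of message passing with mean aggregation. The adjacency format, feature sizes, and single pair of weight matrices are simplifications for illustration — real GNN layers (GCN, GraphSAGE, GAT) elaborate on this same loop.

```python
import numpy as np

def message_passing_round(node_states, neighbors, W_self, W_neigh):
    """node_states: (num_nodes, dim); neighbors: dict of node -> list of neighbor indices."""
    new_states = np.zeros_like(node_states)
    for node, nbrs in neighbors.items():
        # Aggregate neighbor states; the mean ignores order (permutation invariance)
        agg = node_states[nbrs].mean(axis=0) if nbrs else np.zeros(node_states.shape[1])
        # Update = own state + aggregated neighborhood, passed through a nonlinearity
        new_states[node] = np.tanh(node_states[node] @ W_self + agg @ W_neigh)
    return new_states
```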
When your data is rows and columns of features (age, heart rate, blood pressure…) rather than raw signals or images, neural networks often lose to simpler methods. The data has no spatial or sequential structure for CNNs or Transformers to exploit.
A decision tree is a flowchart: "Is heart rate > 100? Go left. Is QT interval > 450ms? Predict arrhythmia." Boosting: train tree #1, compute errors (residuals). Train tree #2 to predict those errors. Add its output to tree #1's. The remaining errors are smaller. Train tree #3 on those. Repeat 500 times.
```
prediction = 0.5                      # start with a naive guess
for round in range(500):
    error = true_labels - prediction  # what the ensemble still gets wrong
    new_tree = fit_tree(features, targets=error)
    prediction += 0.1 × new_tree.predict(features)  # 0.1 = learning rate
```
"Gradient boosted" = uses calculus (gradients) to determine the direction each tree should correct. Optimizer: Newton-style second-order method built into the tree fitting. Loss: log-loss (same math as cross-entropy, different name) for classification, MSE for regression.
A model predicting blood flow might output a value that violates conservation of mass. It doesn't know physics.
Take the model's output. Use automatic differentiation (the same mechanism as backpropagation) to compute its derivatives. Check whether those derivatives satisfy the known physical equation. The gap between what they are and what physics says they should be becomes an extra loss term. The optimizer now minimizes both data error and physics violation simultaneously.
```
prediction   = neural_net(input)
data_loss    = mean((prediction - measured)²)
dp_dt        = auto_diff(prediction, wrt=time)
physics_loss = mean((dp_dt - known_equation(prediction))²)
total_loss   = data_loss + physics_loss
```
Everything above describes what the model looks like. But how you train it depends on what kind of data you have.
"Here's an ECG, a cardiologist labeled it 'arrhythmia.'" The model learns to match input → answer. Most practical ML is this. Loss functions compare the guess to the label (cross-entropy, MSE, Dice).
"Here are 100,000 ECGs. Find the structure yourself." The model learns which signals are similar, what the important dimensions are, or how to compress and reconstruct the data. Loss functions measure reconstruction quality or cluster coherence — not correctness against a label.
Key methods: K-Means (pick K centers, assign points to nearest, move centers to the mean, repeat). SimCLR (create two distorted copies of one input, train a neural network to recognize they're the same — contrastive loss pushes same-source outputs together and different-source outputs apart). PCA (find the axes of maximum variation; linear only). VAE (encoder compresses to a small representation, decoder reconstructs; loss = MSE + KL divergence, which measures how different the learned distribution is from a standard bell curve, keeping the representation space smooth and organized).
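To make the first of these concrete, here is a sketch of the K-Means loop in plain NumPy on made-up 2-D data (the data, k=3, and 20 iterations are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))   # made-up data
k = 3
centers = points[rng.choice(len(points), k, replace=False)]  # pick K starting centers

for _ in range(20):
    # Assign each point to its nearest center
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Move each center to the mean of its assigned points (keep it if the cluster is empty)
    centers = np.array([
        points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(k)
    ])
```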
"You made 10 dosing decisions. The patient recovered. Here's a reward." The model learns a strategy (policy), not a single right answer. Bellman equation: value of current state = immediate reward + discounted value of best next state (the discount, like 0.99, means future rewards are worth slightly less). Policy gradient: if an action led to good reward, increase its probability; if bad, decrease it.
The loss function is the single most important design choice after the architecture. It defines what "wrong" means — and therefore what the model learns to care about. Every loss function takes two inputs (the model's guess and the truth) and returns one number. The entire system exists to make that number smaller.
Different tasks need fundamentally different definitions of "wrong." Here is every major loss function, what it actually computes, and when you'd use it.
What it computes: Take the difference between your prediction and the truth. Square it. Average across all samples. If you predict 72 and the truth is 80, the loss for that sample is (80 − 72)² = 64.
Why squaring: Two reasons. It makes all errors positive (no canceling out). And it punishes large errors disproportionately — being off by 10 costs 100, but being off by 20 costs 400. This forces the model to prioritize fixing its worst predictions.
When to use: Predicting any continuous number — temperature, ejection fraction, blood pressure, stock price. The most common loss in all of ML.
What it computes: The absolute difference between prediction and truth, averaged. |80 − 72| = 8. No squaring.
Why it exists: MSE's squaring makes the model obsess over outliers — one wildly wrong prediction dominates the loss. MAE treats all errors proportionally. If your data has extreme outliers you can't remove, MAE is more robust.
When to use: Regression where outlier robustness matters more than penalizing large errors.
What it computes: Your model outputs a probability (say 0.9 for "arrhythmia"). If the truth is 1 (yes, arrhythmia), the loss is −log(0.9) = 0.105 — small, the model was right and confident. If the truth is 0 (no arrhythmia), the loss is −log(1 − 0.9) = −log(0.1) = 2.3 — huge, the model was confident and wrong.
Why the logarithm: It makes "confident and wrong" catastrophically expensive. Predicting 0.99 when the truth is 0 costs −log(0.01) = 4.6. Predicting 0.5 when the truth is 0 costs only −log(0.5) = 0.69. The log creates an asymmetric penalty that teaches the model to only be confident when it's actually right.
When to use: Any binary decision — disease present/absent, normal/abnormal, fraud/legitimate.
What it computes: The model outputs probabilities across all classes via softmax (they sum to 1). The loss is simply −log(probability assigned to the correct class). If the correct class got 90% probability, loss = −log(0.9) = 0.105. If it got 2%, loss = −log(0.02) = 3.9.
Why it exists: Same logarithmic logic as BCE, extended to any number of categories. Also called "log-loss" in the context of XGBoost — same formula, different name.
When to use: Classifying into 3+ categories — arrhythmia type, image category, language identification. The most common classification loss.
What it computes: Cross-entropy, but with a multiplier that down-weights easy examples. If the model is already 95% sure of the correct class, that example's loss gets scaled close to zero. If the model is only 20% sure, the loss stays large.
Why it exists: In medical data, 98% of samples might be "normal." Standard cross-entropy spends most of its effort getting slightly better at the easy "normal" cases. Focal loss shifts attention to the hard, rare cases that matter.
When to use: Highly imbalanced classification — rare disease detection, fraud detection, defect identification.
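A sketch of binary focal loss. γ = 2 is the value commonly used in the original paper; the optional α class-weighting term is omitted to keep the core idea visible.

```python
import numpy as np

def focal_loss(pred_prob, true_label, gamma=2.0):
    # Probability the model assigned to the correct class
    p_correct = np.where(true_label == 1, pred_prob, 1 - pred_prob)
    # (1 - p)^gamma shrinks the loss of easy, confident examples toward zero
    return np.mean(-((1 - p_correct) ** gamma) * np.log(p_correct + 1e-12))

print(focal_loss(np.array([0.95]), np.array([1])))  # easy example → tiny loss
print(focal_loss(np.array([0.20]), np.array([1])))  # hard example → stays large
```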
What it computes: Dice score = (2 × area of overlap between prediction and truth) / (total area of both). Perfect overlap = 1, no overlap = 0. Dice loss = 1 − Dice score.
Why it exists: If 95% of your signal is "background," a model predicting "all background" gets 95% accuracy with cross-entropy per point. Dice loss doesn't care about the easy background — it measures whether the model found the small foreground region. A model that misses the foreground entirely scores 0, regardless of how much background it got right.
When to use: Segmentation — labeling every pixel/sample. Especially when the target region is small relative to the background, which is almost always the case in medical imaging.
What it computes: Sum of Dice loss and per-point cross-entropy. Dice handles the overlap problem. Cross-entropy provides smooth, well-behaved gradients that make training stable.
Why it exists: Dice loss alone can produce noisy gradients in early training when predictions are far off. Cross-entropy is smooth everywhere. Combining them gets the best of both.
When to use: State-of-the-art segmentation models (Swin-UNet, nnU-Net) almost universally use this combination.
What it computes: Measures the gap between two probability "shapes." If the model thinks heart types are distributed [30% normal, 70% abnormal] and the target distribution is [50%, 50%], KL divergence quantifies that difference.
Why it exists: In VAEs (Variational Autoencoders), the encoder maps inputs to a distribution in "latent space." KL divergence keeps that distribution close to a standard bell curve, preventing the model from creating a chaotic, fragmented representation. This ensures similar inputs map to nearby points — making the space smooth and useful.
When to use: Always paired with a reconstruction loss (MSE) in VAEs. Also used in knowledge distillation (training a small model to mimic a large one).
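In a VAE the encoder outputs a mean and log-variance per latent dimension, and the KL term against a standard Gaussian has a closed form. A sketch, with variable names chosen for illustration:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, sigma²) || N(0, 1) ), summed over latent dimensions
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1 - log_var)

# total_loss = reconstruction_mse + kl_to_standard_normal(mu, log_var)
```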
What it computes: Take one input. Create two distorted versions (add noise, crop differently, shift timing). Both pass through the model, producing two output vectors. The loss pulls these two vectors together (they came from the same source) while pushing them away from vectors produced by other, different inputs.
Why it exists: It teaches the model the concept of "similarity" without any human labels. After training, the model's internal representation organizes inputs by genuine similarity — not pixel-level similarity but structural, meaningful similarity.
When to use: Self-supervised pre-training when you have massive unlabeled datasets (SimCLR, CLIP). Often followed by fine-tuning with a small labeled dataset.
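A simplified sketch of an InfoNCE/NT-Xent-style contrastive loss for one positive pair against a set of negatives. Real SimCLR computes this symmetrically across a whole batch; the temperature value here is just a common default.

```python
import numpy as np

def contrastive_loss(z_a, z_b, negatives, temperature=0.1):
    """z_a, z_b: two views of the same input; negatives: (n, dim) outputs from other inputs."""
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    pos = cos(z_a, z_b) / temperature
    negs = np.array([cos(z_a, n) for n in negatives]) / temperature
    # -log softmax(positive): pull the pair together, push the negatives away
    logits = np.concatenate([[pos], negs])
    return -pos + np.log(np.sum(np.exp(logits)))
```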
What it computes: The Bellman equation says: value of current state = immediate reward + (discount × value of best next state). The "loss" is the gap between the model's current value estimate and what the Bellman equation says it should be. Training drives this gap to zero.
The discount factor (γ, typically 0.99): multiplied against future rewards, making them worth slightly less than immediate ones. A reward 100 steps in the future is worth 0.99¹⁰⁰ ≈ 0.37 of its face value. This prevents the model from chasing infinitely distant payoffs.
When to use: Q-learning and its descendants (DQN). Any RL method that estimates the value of states or state-action pairs.
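A sketch of one tabular Q-learning update driven by the Bellman target. `Q` is assumed to be a dict of dicts (state → action → value); the step size and discount are typical placeholder values.

```python
def q_learning_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    # Bellman target: immediate reward + discounted value of the best next action
    target = reward + gamma * max(Q[next_state].values())
    # Move the current estimate a small step toward the target (shrink the gap)
    Q[state][action] += alpha * (target - Q[state][action])
```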
What it computes: If an action led to good reward, increase its probability. If it led to bad reward, decrease it. Formally: adjust weights in the direction of ∇log(probability of the action taken) × reward received. The log ensures that making a rare-but-good action more likely gets a strong signal.
Why it exists: Bellman-based methods estimate values and derive a policy from those values. Policy gradient methods skip the middleman and directly optimize the policy (the decision-making function itself). This works better in continuous action spaces and high-dimensional problems.
When to use: PPO, REINFORCE, actor-critic methods. Robotics, game AI, RLHF for LLMs.
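A sketch of the basic REINFORCE update for one episode, assuming a framework-free linear softmax policy over discrete actions. The names and shapes are illustrative; PPO and actor-critic methods build clipping and a learned baseline on top of this same gradient.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def reinforce_update(theta, episode, learning_rate=0.01, gamma=0.99):
    """theta: (num_actions, state_dim) weights of a linear softmax policy.
    episode: list of (state_vector, action_index, reward)."""
    returns = 0.0
    for t in reversed(range(len(episode))):
        state, action, reward = episode[t]
        returns = reward + gamma * returns           # reward-to-go from step t
        probs = softmax(theta @ state)               # action probabilities
        # ∇ log π(action | state) for a linear softmax policy: (1[a] - probs) outer state
        grad_log = -np.outer(probs, state)
        grad_log[action] += state
        theta += learning_rate * returns * grad_log  # good return → make the action more likely
    return theta
```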
Everything above exists on a spectrum from "useful for understanding concepts" to "what practitioners actually deploy in 2026." This table maps it out. The CONCEPTUAL tier means: important for understanding, rarely used in production. PRODUCTION means: widely used in real systems today. FRONTIER means: state-of-the-art, increasingly adopted.
| Task | Tier | Model | Loss → Optimizer | When to Use |
|---|---|---|---|---|
| Supervised → Classification (input → label) | Conceptual | MLP / Logistic Regression | Cross-Entropy → SGD | Learning; tiny datasets; baseline |
| | Production | XGBoost | Log-Loss → Newton | Tabular/spreadsheet data |
| | Frontier | Vision Transformer (ViT) | Cross-Entropy → AdamW | Raw signals, images, long-range patterns |
| Supervised → Regression (input → number) | Conceptual | Linear Regression | MSE → OLS / SGD | Learning; very simple relationships |
| | Production | XGBoost / Deep NN | MSE → Newton / AdamW | Tabular or signal data |
| | Frontier | PINN | MSE + Physics → Adam | When physical laws are known |
| Supervised → Segmentation (input → label per point) | Conceptual | Sliding-Window CNN | BCE per-point → SGD | Learning; proof of concept only |
| | Production | U-Net | Dice Loss → Adam | Medical imaging/signals, most segmentation |
| | Frontier | Swin-UNet | Dice + CE → AdamW | When long-range context matters for segmentation |
| Unsupervised → Clustering (find groups in unlabeled data) | Conceptual | K-Means | Euclidean distance | Quick exploration; round-ish clusters |
| | Frontier | SimCLR + clustering | Contrastive → Adam | Large unlabeled datasets; complex structure |
| Unsupervised → Dimensionality Reduction (simplify) | Conceptual | PCA | Variance maximization (SVD) | Quick exploration; linear relationships |
| | Frontier | VAE | MSE + KL Div → Adam | Complex nonlinear structure; generative modeling |
| Reinforcement → Decision Making (learn a strategy) | Conceptual | Q-Table | Bellman Equation | Learning; tiny discrete problems |
| | Production | PPO | Policy Gradient → Adam | Robotics, game AI, RLHF for LLMs |
| | Frontier | Decision Transformer | Reward-to-Go → AdamW | When you have recorded expert trajectories |
Every term introduced in this document is resolved in the same document. If you encounter a word that feels undefined, it was explained in the section where it first appeared. The chapters build on each other: Chapter 2 depends on nothing, Chapter 3 depends on 2, and so on. Reading top-to-bottom once should close every loop.