Built from the ground up. Each concept starts with the core idea, shows the simplest implementation, names its limitation, then shows what was invented to fix it. The final map at the end tells you what people actually use.
Every machine learning system does the same thing. No exceptions. Here it is:
It has a pile of adjustable numbers ("weights"). It multiplies your input by those numbers to produce a guess. It measures how wrong the guess was. It adjusts the numbers to be less wrong. It repeats this millions of times.
A neural network with a billion parameters and a kid adjusting their basketball shot are doing the same thing. The only variables are:
The architecture — the shape of the math (how the numbers are arranged and connected).
The loss function — the definition of "wrong" (a formula that scores the guess).
The optimizer — the adjustment strategy (how the numbers get nudged).
Everything below is specifics on these three things. That's all ML is.
The simplest possible machine learning system is one neuron. It is this:
output = input × weight + bias
That's literally y = mx + b from algebra class. The "weight" is the slope, the "bias" is the y-intercept. Training means finding the weight and bias that produce the least error on your data. With one neuron and the loss function "mean squared error" (average of (guess - truth)²), this is linear regression.
All ML starts here. One neuron = one multiplication + one addition. Checking = the loss function. Adjusting = the optimizer.
One neuron. Loss: Mean Squared Error (average of (guess - truth)², which punishes big errors more than small ones). Optimizer: you can solve this with pure algebra (called OLS — ordinary least squares) or use gradient descent. Predicts a number, not a category.
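To make this concrete, here is a minimal sketch of training one neuron with gradient descent on mean squared error — plain NumPy, synthetic data invented for illustration, and an arbitrary learning rate. Not a library API, just the loop described above.

```python
import numpy as np

# Synthetic data: y ≈ 3x + 2 plus noise (made up for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 2 + rng.normal(0, 0.1, size=100)

weight, bias = 0.0, 0.0        # the two adjustable numbers
learning_rate = 0.1

for step in range(500):
    guess = weight * x + bias          # output = input × weight + bias
    error = guess - y
    loss = np.mean(error ** 2)         # mean squared error
    grad_w = np.mean(2 * error * x)    # how the loss changes if weight moves
    grad_b = np.mean(2 * error)        # how the loss changes if bias moves
    weight -= learning_rate * grad_w   # nudge the numbers to be less wrong
    bias -= learning_rate * grad_b

print(weight, bias)  # should land near 3 and 2
```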
Real-world relationships are curved. A single neuron can never learn "heart failure risk rises slowly until age 55, then sharply." It's stuck with a constant slope.
To learn curves, you need more than one neuron. Stack them in layers — that's a Multi-Layer Perceptron (MLP). Every neuron in one layer connects to every neuron in the next layer.
But there's a catch: if all each neuron does is input × weight + bias, stacking them is pointless. A chain of linear operations is still linear — you'd still get a straight line. So after each neuron's multiplication, you pass the result through an activation function — a small nonlinear function that introduces curves.
ReLU (the default): if the number is negative, output zero; otherwise output the number unchanged. That's it. Sigmoid: squishes any number to a value between 0 and 1; used when you need a probability. tanh: squishes to between -1 and 1. Softmax: takes a list of numbers and converts them to probabilities that add up to 1 — it does this by raising e to the power of each number, then dividing each by the total. Used on the final layer when picking between categories.
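For reference, a minimal NumPy sketch of those four activations. The "subtract the max" step in softmax is a standard numerical-stability trick, not part of the formula itself.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)        # negative → 0, positive → unchanged

def sigmoid(x):
    return 1 / (1 + np.exp(-x))    # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)              # squashes to (-1, 1)

def softmax(x):
    e = np.exp(x - np.max(x))      # subtract max for numerical stability
    return e / e.sum()             # probabilities that add up to 1

print(softmax(np.array([2.0, 1.0, 0.1])))  # ≈ [0.66, 0.24, 0.10]
```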
Layers of neurons with activation functions between them can approximate any continuous function to arbitrary accuracy, given enough neurons — any shape, any curve. This is called the Universal Approximation Theorem. It's the theoretical foundation of deep learning.
Input (say, 1000 numbers from an ECG) → hidden layer of 256 neurons with ReLU → output layer with softmax. For classification, pair with cross-entropy loss (measures how surprised the model is by the right answer; if it assigned high probability to the correct class, low loss; if it assigned low probability, enormous loss — the logarithm makes confident-and-wrong catastrophically expensive).
```
hidden = ReLU(input × W1 + b1)      # 1000 inputs → 256 neurons
output = softmax(hidden × W2 + b2)  # 256 → 5 probabilities
```
The MLP treats its 1000 inputs as 1000 independent numbers. Shuffle them randomly and it has no idea anything changed. It can't learn "a spike followed by a dip" because it doesn't know what "followed by" means. It also can't learn "a spike at position 50 is the same pattern as a spike at position 500" — it treats each position as a completely separate feature.
Before we solve these limitations, we need to understand how the adjustment step works — because everything from here forward depends on it.
After the model makes a guess and the loss function scores it, you need to figure out: for every single weight in the network, should I increase it or decrease it, and by how much?
Backpropagation is the algorithm that answers this. It uses the chain rule from calculus to work backward from the loss through every layer, computing a gradient for each weight. A gradient is just a number that says "if you increased this weight by a tiny amount, the loss would change by this much." Positive gradient = increasing this weight increases error, so decrease the weight. Negative gradient = the opposite.
The optimizer then uses these gradients to update the weights. Different optimizers do this differently:
new_weight = old_weight − learning_rate × gradient. The learning rate (e.g. 0.01) controls step size. "Stochastic" means you compute the gradient on a small random batch of data rather than the entire dataset — faster, noisier, works fine. This is SGD (stochastic gradient descent), the original optimizer.
Every weight gets the same step size. Some weights need big adjustments, others need tiny ones. And SGD only looks at the current gradient — if the gradient is noisy, it jitters back and forth instead of making progress.
Momentum: keeps a running average of recent gradients. If the gradient has been pointing the same way for several steps, it builds speed — like a ball rolling downhill. Pushes through noise and flat spots. Adaptive rate: tracks how volatile each weight's gradient has been. Volatile weights get smaller, more cautious steps; stable weights get bigger steps. Every weight gets its own personalized speed. Adam is the optimizer that combines both ideas.
AdamW: same as Adam, but every weight is also shrunk slightly toward zero at each step (multiplied by something like 0.999). This prevents any weight from growing excessively large, keeping the model simpler and better at generalizing. It is the current standard optimizer for neural networks.
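A sketch of the three update rules side by side, in plain NumPy. The hyperparameter values (0.9, 0.999, 1e-8, 0.01) are common defaults, not anything mandated; a real system would use a framework optimizer rather than this hand-rolled version.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    return w - lr * grad                         # same step size for every weight

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # t counts steps starting at 1
    m = b1 * m + (1 - b1) * grad                 # momentum: running average of gradients
    v = b2 * v + (1 - b2) * grad**2              # adaptive rate: running average of gradient²
    m_hat = m / (1 - b1**t)                      # bias correction for early steps
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-weight step size
    return w, m, v

def adamw_step(w, grad, m, v, t, lr=0.001, weight_decay=0.01, **kw):
    w = w * (1 - lr * weight_decay)              # shrink every weight slightly toward zero
    return adam_step(w, grad, m, v, t, lr=lr, **kw)
```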
The MLP's core limitation was: no sense of position, no sense of sequence. Two architectures were invented to fix this — each solves a different version of the problem.
A spike in an ECG is the same spike whether it's at sample 50 or sample 500. Instead of learning separate weights for every position (like an MLP), learn one small "filter" and slide it across the entire signal. This is called translation invariance — it just means "position doesn't change the pattern."
A filter (or kernel) is a small list of weights, say 5 numbers. Place it over the first 5 samples of your signal. Multiply each filter weight by the corresponding sample. Add them up. That gives one output number. Slide one step right, repeat. One filter scans for one pattern. Use 64 filters in parallel to scan for 64 different patterns — each output list is called a channel (this is what "64 feature maps" means).
```
# Convolution = a sliding dot product. That's it.
for position in range(signal_length - filter_length + 1):
    window = signal[position : position + filter_length]
    output[position] = sum(w * x for w, x in zip(filter_weights, window))
```
After detecting patterns, shrink the output by keeping only the maximum value from every group of 2 (or 4). This makes the data smaller, makes the network care about whether a pattern exists rather than exactly where, and gives subsequent layers a wider view. Stacking conv + pool layers repeatedly builds a hierarchy: first layer detects edges, second detects shapes, third detects concepts.
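A sketch of max pooling with a window of 2, assuming the signal is a plain Python list:

```python
def max_pool(values, window=2):
    # Keep only the largest value in each non-overlapping group of `window` samples.
    return [max(values[i:i + window])
            for i in range(0, len(values) - window + 1, window)]

print(max_pool([1, 5, 2, 2, 9, 3]))  # → [5, 2, 9]
```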
Each filter only sees a small window. Stacking many layers widens the view, but the network still struggles when the answer depends on the relationship between the very beginning and the very end of a signal. It has no mechanism for directly comparing distant regions.
Language, music, time-series signals — the meaning of each step depends on what came before. You need a "memory" that accumulates as you read left to right.
The hidden state is a list of numbers that serves as memory. At each time step, combine the current input with the hidden state, produce a new hidden state. After processing the whole sequence, the hidden state summarizes everything.
```
memory = zeros
for sample in ecg_signal:
    memory = tanh(sample × W_input + memory × W_memory + bias)
```
By step 500, information from step 1 is gone. The math: when backpropagating through hundreds of steps, you multiply gradients together hundreds of times. Multiplying a number less than 1 by itself hundreds of times gives effectively zero. The gradient "vanishes" — early steps get no learning signal.
Three small neural networks ("gates") control information flow. The forget gate decides what to erase (outputs 0-to-1 per memory slot; 0 = erase, 1 = keep). The input gate decides what new information to write. The output gate decides what to expose. Key insight: by keeping the forget gate at ~1 and the input gate at ~0 for a memory slot, info can travel unchanged across hundreds of steps. That's the whole trick.
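A minimal sketch of one LSTM step in NumPy. The weight names and shapes are illustrative (not from any library), and real frameworks fuse these four matrix multiplies into one for speed.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lstm_step(x, h, c, W_f, W_i, W_o, W_c, b_f, b_i, b_o, b_c):
    z = np.concatenate([h, x])                  # previous hidden state + current input
    f = sigmoid(W_f @ z + b_f)                  # forget gate: 0 = erase, 1 = keep
    i = sigmoid(W_i @ z + b_i)                  # input gate: what new info to write
    o = sigmoid(W_o @ z + b_o)                  # output gate: what to expose
    c_new = f * c + i * np.tanh(W_c @ z + b_c)  # cell memory: kept + newly written
    h_new = o * np.tanh(c_new)                  # exposed hidden state
    return h_new, c_new
```

With the forget gate near 1 and the input gate near 0, `c_new ≈ c` — information rides through the step untouched, which is exactly the trick described above.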
You must process step 1 before step 2 before step 3. You can't parallelize this across a GPU. Training is slow on long sequences, and even LSTMs degrade over thousands of steps.
The CNN's limitation: can't directly compare distant regions. The LSTM's limitation: sequential, slow, still forgets over very long ranges. The Transformer solves both.
Instead of scanning left-to-right or through a small window, compute a "relevance score" between every pair of positions in one shot. This is called self-attention.
For every position in the input, create three vectors by multiplying by three learned weight matrices:
Query (Q): "What am I looking for?"
Key (K): "What do I contain?"
Value (V): "Here's my actual information."
Score every position against every other: dot product of one position's Q with every position's K. High score = high relevance. Normalize with softmax (now they're percentages summing to 1). Take a weighted average of all Values using those percentages.
```
Q = input × W_query    # each position: "what am I looking for?"
K = input × W_key      # each position: "what do I have?"
V = input × W_value    # each position: "here's my info"

scores  = (Q × K.transpose) / sqrt(dimension)  # scaling prevents extremes
weights = softmax(scores)                      # each row sums to 1
output  = weights × V                          # weighted blend of everyone's info
```
The sqrt(dimension) division is just a scaling factor to keep dot products from getting too large. Nothing deep.
Do the Q/K/V process 8 times (or 12, or 16), each with different weight matrices. Each "head" can learn to attend to different types of relationships (one for rhythm, one for amplitude, etc.). Concatenate all results and mix with one final weight matrix.
Since attention processes all positions simultaneously, it has no inherent sense of order. Fix: add a unique pattern of numbers to each position's input before attention. The original paper used sine waves at different frequencies; modern systems often just learn the position numbers during training.
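A sketch of the original sinusoidal scheme from "Attention Is All You Need": each position gets a unique vector of sines and cosines at different frequencies, added to its embedding before attention. Assumes an even embedding dimension.

```python
import numpy as np

def positional_encoding(num_positions, dim):
    pos = np.arange(num_positions)[:, None]   # (positions, 1)
    i = np.arange(0, dim, 2)[None, :]         # even dimension indices
    angles = pos / (10000 ** (i / dim))       # lower dims oscillate faster
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles)              # even dims: sine
    pe[:, 1::2] = np.cos(angles)              # odd dims: cosine
    return pe

# embeddings_with_position = token_embeddings + positional_encoding(seq_len, dim)
```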
Comparing every position to every other position for a sequence of length N requires N² computations. Double the sequence length, quadruple the compute. This makes very long sequences expensive. (Active research area: linear attention, sparse attention, etc.)
Some problems need structures beyond "scan locally" (CNN) or "attend globally" (Transformer). Here are the main specialized designs and the specific limitations they address.
To label every millisecond of a signal (this is the P-wave, this is the QRS complex…), you need to understand the overall concept ("what": that's a heartbeat) AND preserve precise timing ("where": it starts at sample 247). Shrinking the data (via pooling) builds understanding but destroys precise location. This is the fundamental tension.
Encoder (left side): CNN layers + pooling, progressively shrinking. Each level extracts higher-level features but at lower resolution.
Bottleneck (bottom): Smallest representation. Maximum understanding, minimum spatial detail.
Decoder (right side): Transposed convolutions (learnable up-scaling) that expand back to original size.
Skip connections (the bridges): At each level, the encoder's output (sharp but dumb) is concatenated alongside the decoder's output (blurry but smart). "Concatenated" = literally gluing two lists of numbers together along the channel dimension. A subsequent conv layer mixes them.
```
encoder_out = conv_block(input)              # sharp, detailed
bottleneck  = conv_block(pool(encoder_out))  # small, conceptual
decoder_up  = upsample(bottleneck)           # stretched, blurry

# The skip connection: paste sharp next to blurry
combined = concatenate(decoder_up, encoder_out)
output   = conv_block(combined)              # now sharp AND smart
```
If 95% of your signal is "background," a lazy model can predict "all background" and be 95% accurate. Cross-entropy per-point would barely penalize this. Dice loss instead measures overlap: (2 × area where both agree) / (total area of both). If the model misses the small foreground entirely, Dice = 0 — maximum penalty. This forces the model to actually find the small regions that matter.
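A sketch of a soft Dice loss for a binary mask, assuming the model outputs a per-point probability in [0, 1]. The small epsilon guards against division by zero; it's a common convention, not part of the definition.

```python
import numpy as np

def dice_loss(pred_probs, true_mask, eps=1e-6):
    intersection = np.sum(pred_probs * true_mask)          # area where both agree
    total = np.sum(pred_probs) + np.sum(true_mask)         # total area of both
    dice_score = (2 * intersection + eps) / (total + eps)  # 1 = perfect overlap, 0 = none
    return 1 - dice_score

# A model that predicts "all background" against a small foreground scores ≈ 1 (maximum loss).
```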
Molecules, social networks, or the 12 leads of an ECG (which have a physical spatial relationship around the heart). You need an architecture that respects "who is connected to whom."
Each node has a state (a list of numbers). Each round: every node collects its neighbors' states, aggregates them (sum or average — the order doesn't matter, which is called permutation invariance), then updates its own state using the aggregate plus its current state. After several rounds, each node's state encodes its neighborhood.
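A sketch of one round of message passing with mean aggregation. The adjacency format, feature sizes, and single pair of weight matrices are simplifications for illustration — real GNN layers (GCN, GraphSAGE, GAT) elaborate on this same loop.

```python
import numpy as np

def message_passing_round(node_states, neighbors, W_self, W_neigh):
    """node_states: (num_nodes, dim); neighbors: dict of node -> list of neighbor indices."""
    new_states = np.zeros_like(node_states)
    for node, nbrs in neighbors.items():
        # Aggregate neighbor states; the mean ignores order (permutation invariance)
        agg = node_states[nbrs].mean(axis=0) if nbrs else np.zeros(node_states.shape[1])
        # Update = own state + aggregated neighborhood, passed through a nonlinearity
        new_states[node] = np.tanh(node_states[node] @ W_self + agg @ W_neigh)
    return new_states
```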
When your data is rows and columns of features (age, heart rate, blood pressure…) rather than raw signals or images, neural networks often lose to simpler methods. The data has no spatial or sequential structure for CNNs or Transformers to exploit.
A decision tree is a flowchart: "Is heart rate > 100? Go left. Is QT interval > 450ms? Predict arrhythmia." Boosting: train tree #1, compute errors (residuals). Train tree #2 to predict those errors. Add its output to tree #1's. The remaining errors are smaller. Train tree #3 on those. Repeat 500 times.
```
prediction = 0.5                      # start with a naive guess
for round in range(500):
    error = true_labels - prediction  # what the ensemble still gets wrong
    new_tree = fit_tree(features, targets=error)
    prediction += 0.1 × new_tree.predict(features)  # 0.1 = learning rate
```
"Gradient boosted" = uses calculus (gradients) to determine the direction each tree should correct. Optimizer: Newton-style second-order method built into the tree fitting. Loss: log-loss (same math as cross-entropy, different name) for classification, MSE for regression.
A model predicting blood flow might output a value that violates conservation of mass. It doesn't know physics.
Take the model's output. Use automatic differentiation (the same mechanism as backpropagation) to compute its derivatives. Check whether those derivatives satisfy the known physical equation. The gap between what they are and what physics says they should be becomes an extra loss term. The optimizer now minimizes both data error and physics violation simultaneously.
```
prediction   = neural_net(input)
data_loss    = mean((prediction - measured)²)
dp_dt        = auto_diff(prediction, wrt=time)
physics_loss = mean((dp_dt - known_equation(prediction))²)
total_loss   = data_loss + physics_loss
```
Everything above describes what the model looks like. But how you train it depends on what kind of data you have.
"Here's an ECG, a cardiologist labeled it 'arrhythmia.'" The model learns to match input → answer. Most practical ML is this. Loss functions compare the guess to the label (cross-entropy, MSE, Dice).
"Here are 100,000 ECGs. Find the structure yourself." The model learns which signals are similar, what the important dimensions are, or how to compress and reconstruct the data. Loss functions measure reconstruction quality or cluster coherence — not correctness against a label.
Key methods: K-Means (pick K centers, assign points to nearest, move centers to the mean, repeat). SimCLR (create two distorted copies of one input, train a neural network to recognize they're the same — contrastive loss pushes same-source outputs together and different-source outputs apart). PCA (find the axes of maximum variation; linear only). VAE (encoder compresses to a small representation, decoder reconstructs; loss = MSE + KL divergence, which measures how different the learned distribution is from a standard bell curve, keeping the representation space smooth and organized).
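To make the first of these concrete, here is a sketch of the K-Means loop in plain NumPy on made-up 2-D data (the data, k=3, and 20 iterations are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))   # made-up data
k = 3
centers = points[rng.choice(len(points), k, replace=False)]  # pick K starting centers

for _ in range(20):
    # Assign each point to its nearest center
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Move each center to the mean of its assigned points (keep it if the cluster is empty)
    centers = np.array([
        points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(k)
    ])
```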
"You made 10 dosing decisions. The patient recovered. Here's a reward." The model learns a strategy (policy), not a single right answer. Bellman equation: value of current state = immediate reward + discounted value of best next state (the discount, like 0.99, means future rewards are worth slightly less). Policy gradient: if an action led to good reward, increase its probability; if bad, decrease it.
The loss function is the single most important design choice after the architecture. It defines what "wrong" means — and therefore what the model learns to care about. Every loss function takes two inputs (the model's guess and the truth) and returns one number. The entire system exists to make that number smaller.
Different tasks need fundamentally different definitions of "wrong." Here is every major loss function, what it actually computes, and when you'd use it.
What it computes: Take the difference between your prediction and the truth. Square it. Average across all samples. If you predict 72 and the truth is 80, the loss for that sample is (80 − 72)² = 64.
Why squaring: Two reasons. It makes all errors positive (no canceling out). And it punishes large errors disproportionately — being off by 10 costs 100, but being off by 20 costs 400. This forces the model to prioritize fixing its worst predictions.
When to use: Predicting any continuous number — temperature, ejection fraction, blood pressure, stock price. The most common loss in all of ML.
What it computes: The absolute difference between prediction and truth, averaged. |80 − 72| = 8. No squaring.
Why it exists: MSE's squaring makes the model obsess over outliers — one wildly wrong prediction dominates the loss. MAE treats all errors proportionally. If your data has extreme outliers you can't remove, MAE is more robust.
When to use: Regression where outlier robustness matters more than penalizing large errors.
What it computes: Your model outputs a probability (say 0.9 for "arrhythmia"). If the truth is 1 (yes, arrhythmia), the loss is −log(0.9) = 0.105 — small, the model was right and confident. If the truth is 0 (no arrhythmia), the loss is −log(1 − 0.9) = −log(0.1) = 2.3 — huge, the model was confident and wrong.
Why the logarithm: It makes "confident and wrong" catastrophically expensive. Predicting 0.99 when the truth is 0 costs −log(0.01) = 4.6. Predicting 0.5 when the truth is 0 costs only −log(0.5) = 0.69. The log creates an asymmetric penalty that teaches the model to only be confident when it's actually right.
When to use: Any binary decision — disease present/absent, normal/abnormal, fraud/legitimate.
What it computes: The model outputs probabilities across all classes via softmax (they sum to 1). The loss is simply −log(probability assigned to the correct class). If the correct class got 90% probability, loss = −log(0.9) = 0.105. If it got 2%, loss = −log(0.02) = 3.9.
Why it exists: Same logarithmic logic as BCE, extended to any number of categories. Also called "log-loss" in the context of XGBoost — same formula, different name.
When to use: Classifying into 3+ categories — arrhythmia type, image category, language identification. The most common classification loss.
What it computes: Cross-entropy, but with a multiplier that down-weights easy examples. If the model is already 95% sure of the correct class, that example's loss gets scaled close to zero. If the model is only 20% sure, the loss stays large.
Why it exists: In medical data, 98% of samples might be "normal." Standard cross-entropy spends most of its effort getting slightly better at the easy "normal" cases. Focal loss shifts attention to the hard, rare cases that matter.
When to use: Highly imbalanced classification — rare disease detection, fraud detection, defect identification.
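A sketch of binary focal loss. γ = 2 is the value commonly used in the original paper; the optional α class-weighting term is omitted to keep the core idea visible.

```python
import numpy as np

def focal_loss(pred_prob, true_label, gamma=2.0):
    # Probability the model assigned to the correct class
    p_correct = np.where(true_label == 1, pred_prob, 1 - pred_prob)
    # (1 - p)^gamma shrinks the loss of easy, confident examples toward zero
    return np.mean(-((1 - p_correct) ** gamma) * np.log(p_correct + 1e-12))

print(focal_loss(np.array([0.95]), np.array([1])))  # easy example → tiny loss
print(focal_loss(np.array([0.20]), np.array([1])))  # hard example → stays large
```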
What it computes: Dice score = (2 × area of overlap between prediction and truth) / (total area of both). Perfect overlap = 1, no overlap = 0. Dice loss = 1 − Dice score.
Why it exists: If 95% of your signal is "background," a model predicting "all background" gets 95% accuracy with cross-entropy per point. Dice loss doesn't care about the easy background — it measures whether the model found the small foreground region. A model that misses the foreground entirely scores 0, regardless of how much background it got right.
When to use: Segmentation — labeling every pixel/sample. Especially when the target region is small relative to the background, which is almost always the case in medical imaging.
What it computes: Sum of Dice loss and per-point cross-entropy. Dice handles the overlap problem. Cross-entropy provides smooth, well-behaved gradients that make training stable.
Why it exists: Dice loss alone can produce noisy gradients in early training when predictions are far off. Cross-entropy is smooth everywhere. Combining them gets the best of both.
When to use: State-of-the-art segmentation models (Swin-UNet, nnU-Net) almost universally use this combination.
What it computes: Measures the gap between two probability "shapes." If the model thinks heart types are distributed [30% normal, 70% abnormal] and the target distribution is [50%, 50%], KL divergence quantifies that difference.
Why it exists: In VAEs (Variational Autoencoders), the encoder maps inputs to a distribution in "latent space." KL divergence keeps that distribution close to a standard bell curve, preventing the model from creating a chaotic, fragmented representation. This ensures similar inputs map to nearby points — making the space smooth and useful.
When to use: Always paired with a reconstruction loss (MSE) in VAEs. Also used in knowledge distillation (training a small model to mimic a large one).
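In a VAE the encoder outputs a mean and log-variance per latent dimension, and the KL term against a standard Gaussian has a closed form. A sketch, with variable names chosen for illustration:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, sigma²) || N(0, 1) ), summed over latent dimensions
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1 - log_var)

# total_loss = reconstruction_mse + kl_to_standard_normal(mu, log_var)
```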
What it computes: Take one input. Create two distorted versions (add noise, crop differently, shift timing). Both pass through the model, producing two output vectors. The loss pulls these two vectors together (they came from the same source) while pushing them away from vectors produced by other, different inputs.
Why it exists: It teaches the model the concept of "similarity" without any human labels. After training, the model's internal representation organizes inputs by genuine similarity — not pixel-level similarity but structural, meaningful similarity.
When to use: Self-supervised pre-training when you have massive unlabeled datasets (SimCLR, CLIP). Often followed by fine-tuning with a small labeled dataset.
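A simplified sketch of an InfoNCE/NT-Xent-style contrastive loss for one positive pair against a set of negatives. Real SimCLR computes this symmetrically across a whole batch; the temperature value here is just a common default.

```python
import numpy as np

def contrastive_loss(z_a, z_b, negatives, temperature=0.1):
    """z_a, z_b: two views of the same input; negatives: (n, dim) outputs from other inputs."""
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    pos = cos(z_a, z_b) / temperature
    negs = np.array([cos(z_a, n) for n in negatives]) / temperature
    # -log softmax(positive): pull the pair together, push the negatives away
    logits = np.concatenate([[pos], negs])
    return -pos + np.log(np.sum(np.exp(logits)))
```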
What it computes: The Bellman equation says: value of current state = immediate reward + (discount × value of best next state). The "loss" is the gap between the model's current value estimate and what the Bellman equation says it should be. Training drives this gap to zero.
The discount factor (γ, typically 0.99): multiplied against future rewards, making them worth slightly less than immediate ones. A reward 100 steps in the future is worth 0.99¹⁰⁰ ≈ 0.37 of its face value. This prevents the model from chasing infinitely distant payoffs.
When to use: Q-learning and its descendants (DQN). Any RL method that estimates the value of states or state-action pairs.
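A sketch of one tabular Q-learning update driven by the Bellman target. `Q` is assumed to be a dict of dicts (state → action → value); the step size and discount are typical placeholder values.

```python
def q_learning_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    # Bellman target: immediate reward + discounted value of the best next action
    target = reward + gamma * max(Q[next_state].values())
    # Move the current estimate a small step toward the target (shrink the gap)
    Q[state][action] += alpha * (target - Q[state][action])
```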
What it computes: If an action led to good reward, increase its probability. If it led to bad reward, decrease it. Formally: adjust weights in the direction of ∇log(probability of the action taken) × reward received. The log ensures that making a rare-but-good action more likely gets a strong signal.
Why it exists: Bellman-based methods estimate values and derive a policy from those values. Policy gradient methods skip the middleman and directly optimize the policy (the decision-making function itself). This works better in continuous action spaces and high-dimensional problems.
When to use: PPO, REINFORCE, actor-critic methods. Robotics, game AI, RLHF for LLMs.
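A sketch of the basic REINFORCE update for one episode, assuming a framework-free linear softmax policy over discrete actions. The names and shapes are illustrative; PPO and actor-critic methods build clipping and a learned baseline on top of this same gradient.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def reinforce_update(theta, episode, learning_rate=0.01, gamma=0.99):
    """theta: (num_actions, state_dim) weights of a linear softmax policy.
    episode: list of (state_vector, action_index, reward)."""
    returns = 0.0
    for t in reversed(range(len(episode))):
        state, action, reward = episode[t]
        returns = reward + gamma * returns           # reward-to-go from step t
        probs = softmax(theta @ state)               # action probabilities
        # ∇ log π(action | state) for a linear softmax policy: (1[a] - probs) outer state
        grad_log = -np.outer(probs, state)
        grad_log[action] += state
        theta += learning_rate * returns * grad_log  # good return → make the action more likely
    return theta
```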
Everything above exists on a spectrum from "useful for understanding concepts" to "what practitioners actually deploy in 2026." This table maps it out. The CONCEPTUAL tier means: important for understanding, rarely used in production. PRODUCTION means: widely used in real systems today. FRONTIER means: state-of-the-art, increasingly adopted.
| Task | Tier | Model | Loss → Optimizer | When to Use |
|---|---|---|---|---|
| Supervised → Classification (input → label) | Conceptual | MLP / Logistic Regression | Cross-Entropy → SGD | Learning; tiny datasets; baseline |
| | Production | XGBoost | Log-Loss → Newton | Tabular/spreadsheet data |
| | Frontier | Vision Transformer (ViT) | Cross-Entropy → AdamW | Raw signals, images, long-range patterns |
| Supervised → Regression (input → number) | Conceptual | Linear Regression | MSE → OLS / SGD | Learning; very simple relationships |
| | Production | XGBoost / Deep NN | MSE → Newton / AdamW | Tabular or signal data |
| | Frontier | PINN | MSE + Physics → Adam | When physical laws are known |
| Supervised → Segmentation (input → label per point) | Conceptual | Sliding-Window CNN | BCE per-point → SGD | Learning; proof of concept only |
| | Production | U-Net | Dice Loss → Adam | Medical imaging/signals, most segmentation |
| | Frontier | Swin-UNet | Dice + CE → AdamW | When long-range context matters for segmentation |
| Unsupervised → Clustering (find groups in unlabeled data) | Conceptual | K-Means | Euclidean distance | Quick exploration; round-ish clusters |
| | Frontier | SimCLR + clustering | Contrastive → Adam | Large unlabeled datasets; complex structure |
| Unsupervised → Dimensionality Reduction (simplify) | Conceptual | PCA | Variance maximization (SVD) | Quick exploration; linear relationships |
| | Frontier | VAE | MSE + KL Div → Adam | Complex nonlinear structure; generative modeling |
| Reinforcement → Decision Making (learn a strategy) | Conceptual | Q-Table | Bellman Equation | Learning; tiny discrete problems |
| | Production | PPO | Policy Gradient → Adam | Robotics, game AI, RLHF for LLMs |
| | Frontier | Decision Transformer | Reward-to-Go → AdamW | When you have recorded expert trajectories |
Every term introduced in this document is resolved in the same document. If you encounter a word that feels undefined, it was explained in the section where it first appeared. The chapters build on each other: Chapter 2 depends on nothing, Chapter 3 depends on 2, and so on. Reading top-to-bottom once should close every loop.