ML Tools & Frameworks

The actual software stack for machine learning. What each tool does, why it exists, what it replaced, and when to use it. Organized from hardware up to high-level platforms.

Chapter 1

The Stack

ML software is layered, like a building. Each layer depends on the one below it. Most practitioners work on only one or two layers and never touch the others. Here's the whole thing, top to bottom:

Layer 5 — Platforms

Hugging Face, MLflow, Weights & Biases

Pretrained models, experiment tracking, deployment pipelines. You work here when you're using someone else's model or managing your own in production.

Layer 4 — ML Libraries

Scikit-learn, XGBoost, LightGBM

Ready-made algorithms for classical ML. You pass in data, call .fit(), get a model. No need to define layers or write training loops.

Layer 3 — Deep Learning Frameworks

PyTorch, TensorFlow, JAX

You define your own architecture, loss function, and training loop. The framework handles automatic differentiation (backpropagation) and GPU execution. This is where most deep learning research and custom model building happens.

Layer 2 — Numerical Computing

NumPy, Pandas, SciPy

Array math, data loading, preprocessing. Everything above is built on top of these. NumPy is the foundation of Python's entire scientific ecosystem.

Layer 1 — Hardware & Drivers

CUDA, cuDNN, GPUs, TPUs

The physical chips and the low-level software that talks to them. You rarely touch this directly, but everything above depends on it.

The key insight: You don't need all layers. Building a standard classifier on tabular data? You need Layers 2 and 4 (NumPy/Pandas + Scikit-learn/XGBoost). Training a custom Transformer? You need Layers 1–3 (GPU + NumPy + PyTorch). Fine-tuning someone else's language model? Layers 2, 3, and 5 (Pandas + PyTorch + Hugging Face).

Chapter 2

Hardware: GPUs & TPUs

Why ML can't run on a normal computer — and what it uses instead.

The Problem

Neural networks are bottlenecked by matrix multiplication

A single forward pass through a modest network might involve multiplying a 512×1024 matrix by a 1024×512 matrix — that's ~268 million multiply-and-add operations. Multiply by batch size, number of layers, and training iterations, and you're doing quadrillions of arithmetic operations. A CPU does these one at a time (or a handful at a time). You need hardware that does thousands simultaneously.
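
To make the arithmetic concrete, here is a small sketch (using NumPy, with the matrix sizes from the paragraph above) that runs the multiply once and counts the multiply-and-add operations:

# Rough op count for one matrix multiply (sizes are illustrative)
import numpy as np

A = np.random.randn(512, 1024)
B = np.random.randn(1024, 512)

C = A @ B                        # one forward-pass matmul
ops = 512 * 512 * 1024           # one multiply-add per (row, col, inner) triple
print(f"{ops:,} multiply-adds")  # ≈ 268 million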

GPU (Graphics Processing Unit)

Thousands of simple cores running in parallel

A CPU has 8–64 powerful cores. A GPU has thousands of simpler cores (an NVIDIA A100 has 6,912 CUDA cores). Each core is slower than a CPU core, but they all work simultaneously on different parts of the same matrix multiply. A GPU can do a large matrix multiplication 10–100× faster than a CPU. NVIDIA dominates ML hardware. The key data-center GPUs (as of 2026): A100, H100, H200, B200 — each new architecture generation (Ampere → Hopper → Blackwell) roughly doubles throughput, while the H200 is an H100 refresh with more memory.

GPU Memory (VRAM)

The actual bottleneck in practice

The model's weights, the input data, intermediate activations, and gradients all need to fit in GPU memory simultaneously. An H100 has 80GB. A large model can easily exceed this — which forces you to split the model across multiple GPUs, use smaller batch sizes, or use memory-saving techniques. More often than raw compute speed, VRAM is what limits what you can train.
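
A back-of-the-envelope sketch makes the point; the 7-billion-parameter figure and the per-parameter byte counts below are illustrative assumptions for mixed-precision training with Adam, not a precise accounting:

# Rough VRAM estimate for training a 7B-parameter model with Adam (illustrative)
params = 7e9
weights_fp16 = 2             # half-precision weights, bytes per parameter
gradients_fp16 = 2           # half-precision gradients
adam_state_fp32 = 8          # FP32 momentum + variance
master_weights_fp32 = 4      # full-precision copy kept for the update step

total_bytes = params * (weights_fp16 + gradients_fp16 + adam_state_fp32 + master_weights_fp32)
print(f"~{total_bytes / 1e9:.0f} GB before activations")   # ≈ 112 GB, already past one 80GB H100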

TPU (Tensor Processing Unit)

Google's custom chip

Designed specifically for matrix operations. Available through Google Cloud. The key difference: TPUs are optimized for very large batch sizes and have high-bandwidth connections between chips for distributed training. Used primarily within Google's ecosystem (TensorFlow, JAX). Most of the industry uses NVIDIA GPUs; TPUs are a significant but narrower alternative.

Tensor Cores: Specialized circuits within modern NVIDIA GPUs that perform small matrix multiplies (4×4) in a single clock cycle, at reduced precision (FP16 or BF16 instead of FP32). This is why "mixed precision training" became standard — you do most computation in half-precision (faster, less memory) and keep a full-precision copy of the weights for the update step. The performance gain is roughly 2–4× over using full precision.
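
In PyTorch, mixed precision typically takes only a few extra lines around the training loop. A minimal sketch, assuming a model, optimizer, dataloader, and loss_fn already exist:

# Mixed precision training with torch.cuda.amp (sketch)
import torch

scaler = torch.cuda.amp.GradScaler()

for batch, labels in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # forward pass runs in FP16/BF16
        loss = loss_fn(model(batch), labels)
    scaler.scale(loss).backward()            # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)                   # unscale gradients, then update weights
    scaler.update()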

Chapter 3

CUDA & Low-Level Compute

The invisible layer between your Python code and the GPU silicon.

What CUDA Is

NVIDIA's programming language for GPUs

CUDA (Compute Unified Device Architecture) is a C/C++-based language and runtime that lets you write code that runs on NVIDIA GPUs. You almost never write CUDA yourself. PyTorch and TensorFlow call CUDA underneath — when you write tensor.to('cuda'), you're moving data to the GPU and all subsequent operations use CUDA-optimized code.

cuDNN

NVIDIA's library of optimized neural network operations

Convolutions, attention, pooling, normalization — all have heavily optimized CUDA implementations in cuDNN. When PyTorch runs a convolution, it's actually calling cuDNN's implementation, which has been hand-tuned for each GPU generation. This is why NVIDIA's dominance is so durable — the software optimization stack is nearly two decades deep.

Alternatives to CUDA

ROCm (AMD), Metal (Apple), OpenCL

ROCm is AMD's GPU computing platform — functional but with less library support. Metal/MPS lets you use Apple Silicon GPUs from PyTorch. OpenCL is vendor-neutral but less optimized. In practice, CUDA's ecosystem advantage means NVIDIA GPUs remain the default for serious ML work.
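
In practice, the framework hides the backend behind a device string. A minimal sketch of the usual fallback pattern in PyTorch, covering NVIDIA, Apple Silicon, and plain CPU:

# Pick the best available device: NVIDIA (CUDA), Apple Silicon (MPS), or CPU
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

x = torch.randn(1024, 1024).to(device)   # data moves to the accelerator
y = x @ x                                # this matmul now runs on that device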


Chapter 4

NumPy & Pandas

The foundation layer. Data loading, preprocessing, and array math.

NumPy

What It Is

Fast array math in Python

Python is slow for number-crunching. NumPy fixes this by implementing array operations in C under the hood. Instead of writing a Python loop over 1 million numbers, you call one NumPy function that processes all 1 million in compiled C code. The core object is the ndarray — an n-dimensional array of numbers (the Python equivalent of a tensor).
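
The speed difference is easy to see. A small sketch comparing a Python loop with the equivalent NumPy call:

# Python loop vs. vectorized NumPy (same result, very different speed)
import numpy as np

data = np.random.randn(1_000_000)

total = 0.0
for x in data:                   # interpreted Python: one number at a time
    total += x * x

total_np = np.sum(data * data)   # compiled C: the whole array at once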

Key Operations

The operations you actually use

np.array() — create arrays. np.dot() or @ — matrix multiply. .reshape() — change dimensions without changing data. .mean(), .std() — statistics. Slicing (data[:, 0:100]) — select subsets. Broadcasting — NumPy's rule for automatically handling operations between arrays of different shapes (e.g., adding a 1×10 vector to every row of a 1000×10 matrix).
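
A short example of the broadcasting rule, using the same shapes mentioned above:

# Broadcasting: combine a length-10 vector with every row of a 1000×10 matrix
import numpy as np

X = np.random.randn(1000, 10)      # 1000 samples, 10 features
col_means = X.mean(axis=0)         # shape (10,)

X_centered = X - col_means         # (1000, 10) minus (10,) broadcasts across rows
print(X_centered.shape)            # (1000, 10)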

Used In ML

Everything is built on NumPy

PyTorch tensors, Pandas DataFrames, and scikit-learn all use NumPy arrays as their interchange format. You load data as NumPy arrays, preprocess them, then convert to whatever format your framework needs. Understanding NumPy's shape system and broadcasting rules is the single most practical skill for ML in Python.

Pandas

What It Is

Spreadsheets in Python

Pandas provides the DataFrame — a table with named columns and an index, like an Excel sheet you can program. Built on top of NumPy. You use it for loading CSVs, filtering rows, grouping, merging tables, and computing statistics.
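
A minimal sketch of the typical workflow; the file name and column names are placeholders:

# Load a CSV, filter rows, and compute grouped statistics (illustrative columns)
import pandas as pd

df = pd.read_csv("sensors.csv")                  # table with named columns
recent = df[df["year"] >= 2020]                  # filter rows
summary = recent.groupby("site")["temp"].mean()  # grouped statistic per site
df["temp_norm"] = (df["temp"] - df["temp"].mean()) / df["temp"].std()  # new feature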

Used In ML

Data loading and exploration, not training

Pandas is for the stage before model training: loading data, inspecting distributions, handling missing values, feature engineering. Once data is clean and ready, you convert to NumPy arrays or PyTorch tensors for actual model work. Pandas is too slow for anything inside a training loop.


Chapter 5

PyTorch & TensorFlow

The deep learning frameworks — where you define architectures, train models, and run inference.

What They Solve

Building a neural network from NumPy alone is possible but painful. You'd need to manually implement every operation, hand-code backpropagation, and write your own GPU kernels. A deep learning framework gives you three things: a tensor library (like NumPy but GPU-accelerated), automatic differentiation (backprop computed automatically), and a library of layers, losses, and optimizers.
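
Automatic differentiation is the core service. A tiny sketch of what "backprop computed automatically" means in PyTorch:

# Autograd: PyTorch records the operations and differentiates them for you
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x**2 + 2 * x          # y = x^2 + 2x

y.backward()              # walk the recorded graph backward
print(x.grad)             # dy/dx = 2x + 2 = 8.0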

PyTorch

Created by Meta (Facebook), 2016

The current standard for research and increasingly for production

Define-by-run (eager execution): you write normal Python and PyTorch records operations as they happen, building the computational graph on the fly. This means you can use Python if-statements, loops, and debuggers normally — the graph is dynamic. When you call loss.backward(), it walks backward through the recorded operations to compute all gradients.

Core Concepts

The objects you work with

torch.Tensor — like a NumPy array but can live on GPU and tracks gradients. nn.Module — base class for all layers and models; you subclass it to define your architecture. nn.Linear, nn.Conv1d, nn.Transformer — pre-built layers. optim.Adam — optimizers. DataLoader — feeds batches of data to your training loop efficiently.
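
A minimal sketch of what the MyNetwork class used in the training loop below might look like; the layer sizes are illustrative:

# A small nn.Module subclass: two linear layers with a ReLU in between
import torch.nn as nn

class MyNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(784, 128)   # input features -> hidden units
        self.output = nn.Linear(128, 10)    # hidden units -> class scores

    def forward(self, x):                   # called when you write model(x)
        return self.output(nn.functional.relu(self.hidden(x)))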

# Minimal PyTorch training loop
from torch.optim import Adam
from torch.nn.functional import cross_entropy

model = MyNetwork()           # your nn.Module subclass
optimizer = Adam(model.parameters(), lr=0.001)

for batch in dataloader:
    prediction = model(batch.input)               # forward pass
    loss = cross_entropy(prediction, batch.label) # compute loss
    loss.backward()                               # compute all gradients
    optimizer.step()                              # update weights
    optimizer.zero_grad()                         # reset gradients for next batch

TensorFlow / Keras

Created by Google, 2015

First mover, now second to PyTorch in research

Originally used define-and-run (static graph): you built the entire computation graph first, then executed it. This was faster at runtime but painful to debug. TensorFlow 2.0 (2019) switched to eager execution by default, converging with PyTorch's approach. Keras is TensorFlow's high-level API — simpler syntax, less control.
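
For comparison, the Keras API is mostly declarative: you list the layers, compile, and call fit instead of writing a loop. A minimal sketch, assuming X_train and y_train are existing arrays and the layer sizes are illustrative:

# Keras: declare the layers, compile, and fit (no explicit training loop)
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, batch_size=32)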

When to Use Which

PyTorch for flexibility, TensorFlow for deployment

PyTorch dominates research (>80% of papers), has a more Pythonic API, and has become the default for most new projects. TensorFlow has stronger production/deployment tools (TensorFlow Serving, TensorFlow Lite for mobile), and TPU integration is smoother. JAX (Google, newer) is a third option favored by some researchers for its functional style and automatic vectorization — think "NumPy with autograd and GPU support."


Chapter 6

Scikit-learn & XGBoost

When you don't need deep learning — the tools for classical ML and tabular data.

Scikit-learn

The Swiss Army Knife

Every classical algorithm in one consistent API

Logistic regression, random forests, SVMs, K-means, PCA, preprocessing, cross-validation, metrics — all with the same interface: model.fit(X, y) to train, model.predict(X) to infer. It doesn't support GPUs or deep learning, but for tabular data and classical methods, it's the first tool you reach for.

# Scikit-learn: train → predict → evaluate in 3 lines
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

Used For

Baselines, classical ML, preprocessing pipelines

Even on deep learning projects, scikit-learn is used for data splitting (train_test_split), scaling (StandardScaler), evaluation (classification_report, roc_auc_score), and cross-validation. It's also where you start when you're unsure if a problem even needs deep learning — train a random forest in 10 seconds, get a baseline, then decide if a neural network is worth the complexity.
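
A sketch of that supporting role: splitting, scaling, and evaluating around whatever model you end up using. X and y are assumed to be a feature matrix and label vector:

# Split, scale, train a quick baseline, and report metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)        # fit scaling on training data only
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

baseline = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print(classification_report(y_test, baseline.predict(X_test)))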

XGBoost / LightGBM / CatBoost

Gradient Boosted Trees

The production standard for tabular data

XGBoost, LightGBM (Microsoft), and CatBoost (Yandex) are all implementations of the same idea — gradient boosted decision trees — with different engineering tradeoffs. LightGBM is fastest for large datasets. CatBoost handles categorical features natively. XGBoost is the most established. All three routinely beat neural networks on spreadsheet-style data.
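
The API mirrors scikit-learn's. A minimal XGBoost sketch with illustrative hyperparameters; LightGBM and CatBoost look almost identical:

# Gradient boosted trees on tabular data with the scikit-learn-style API
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=500,      # number of boosting rounds (trees)
    learning_rate=0.05,    # shrinkage applied to each tree's contribution
    max_depth=6,           # depth of each tree
)
model.fit(X_train, y_train)
preds = model.predict_proba(X_test)[:, 1]   # probability of the positive class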

These connect directly to the ML Guide's XGBoost section — the architecture is explained there; these are the tools that implement it.


Chapter 7

Hugging Face & Pretrained Models

The shift from "build models from scratch" to "start from someone else's trained model and adapt it."

The Problem

Training large models from scratch is expensive

Training GPT-scale models costs millions of dollars in compute. Training a Vision Transformer from scratch on your 10,000-image dataset would take weeks and produce mediocre results. The solution: transfer learning. Start with a model that someone else already trained on massive data. Fine-tune it on your smaller, specific dataset. The pretrained model already understands general patterns; you just teach it your specific task.

Hugging Face

The app store for pretrained models

A platform hosting hundreds of thousands of pretrained models (text, vision, audio, multimodal) with a unified Python library (transformers) to load and use them. Download a state-of-the-art model in three lines of code. The datasets library provides ready-to-use datasets. The Trainer class handles fine-tuning boilerplate.

# Load a pretrained model and classify text
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("This movie was fantastic!")
# → [{'label': 'POSITIVE', 'score': 0.9998}]
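
Fine-tuning uses the same library: load a pretrained checkpoint, then hand it to the Trainer class mentioned above. A hedged sketch, where the model name and arguments are illustrative and train_ds / eval_ds are assumed to be already-tokenized datasets:

# Fine-tune a pretrained model on your own labeled data (sketch)
from transformers import (AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=16)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
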
ONNX

Framework-agnostic model format

ONNX (Open Neural Network Exchange) lets you export a model trained in PyTorch and run it in TensorFlow, or in specialized inference engines (TensorRT, ONNX Runtime) that are faster than either framework. It's the "PDF of models" — a portable format that decouples training from deployment.
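
A minimal sketch of the round trip, assuming a trained PyTorch model; the input shape and file name are illustrative:

# Export a PyTorch model to ONNX, then run it with ONNX Runtime
import torch
import onnxruntime as ort

dummy_input = torch.randn(1, 3, 224, 224)              # example input shape
torch.onnx.export(model, dummy_input, "model.onnx")    # trace and save the graph

session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: dummy_input.numpy()})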


Chapter 8

Scaling: Spark, Ray & Distributed Training

When your data or your model doesn't fit on one machine.

Scaling Data Processing

The Problem

Pandas dies on big data

Pandas loads everything into RAM. If your dataset is 500GB, Pandas can't open it. You need tools that distribute data across many machines and process it in chunks.

Apache Spark (PySpark)

Distributed data processing across clusters

Spark splits your data across hundreds of machines, processes it in parallel, and gives you a DataFrame API similar to Pandas. Used for ETL (Extract, Transform, Load) — cleaning and preparing massive datasets before they go to a training pipeline. Spark MLlib includes distributed versions of classical ML algorithms, but it's not used for deep learning.
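
A short PySpark sketch of that ETL role; the file path and column names are placeholders:

# Distributed ETL: read, filter, and aggregate a large dataset with Spark
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl").getOrCreate()

df = spark.read.csv("s3://bucket/events/*.csv", header=True, inferSchema=True)
clean = df.filter(F.col("value").isNotNull())
daily = clean.groupBy("date").agg(F.mean("value").alias("mean_value"))
daily.write.parquet("s3://bucket/daily_stats/")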

Polars / Dask

Modern alternatives

Polars: a Rust-based DataFrame library that's 10–100× faster than Pandas on a single machine. Handles datasets that would choke Pandas without needing a cluster. Increasingly the default for data that's "big but fits on one beefy machine." Dask: parallelizes Pandas across cores or machines, keeping the familiar API.
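
A small Polars sketch of the same kind of query using its lazy API; the file and columns are placeholders, and group_by is the method name in recent Polars versions:

# Polars lazy query: scan, filter, aggregate; nothing runs until .collect()
import polars as pl

result = (
    pl.scan_csv("events.csv")                  # lazy: nothing is read yet
      .filter(pl.col("value").is_not_null())
      .group_by("date")
      .agg(pl.col("value").mean().alias("mean_value"))
      .collect()                               # plan is optimized, then executed
)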

Scaling Model Training

The Problem

One GPU isn't enough

Large models don't fit in one GPU's memory. Even models that do fit train faster across multiple GPUs. You need to distribute the work.

Data Parallelism

Same model on each GPU, different data

Copy the model to 4 GPUs. Each GPU processes a different batch. After the forward and backward pass, average the gradients across all GPUs and update the weights identically. Each step now covers 4× as much data, so training is roughly 4× faster, minus the cost of synchronizing gradients. PyTorch: DistributedDataParallel (DDP). This is the simplest and most common form of scaling.
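
A hedged sketch of the DDP setup, assuming the script is launched with torchrun --nproc_per_node=4 train.py so each GPU gets its own process, and reusing the MyNetwork class from earlier:

# Data parallelism with DistributedDataParallel (one process per GPU)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                 # NCCL backend for GPU communication
local_rank = int(os.environ["LOCAL_RANK"])      # set by torchrun for each process
torch.cuda.set_device(local_rank)

model = MyNetwork().to(local_rank)
model = DDP(model, device_ids=[local_rank])     # gradients are averaged across GPUs

# ...the training loop itself is unchanged: forward, backward, step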

Model Parallelism & FSDP

Split the model across GPUs

When the model itself doesn't fit on one GPU: Pipeline parallelism puts different layers on different GPUs. Tensor parallelism splits individual layers across GPUs. FSDP (Fully Sharded Data Parallel) shards model weights, gradients, and optimizer state across all GPUs, gathering them only when needed. This is how models with hundreds of billions of parameters are trained.

DeepSpeed & FSDP

Libraries that manage the complexity

DeepSpeed (Microsoft) and PyTorch's built-in FSDP handle the sharding, communication, and memory optimization automatically. You write mostly normal PyTorch code and these libraries handle the distributed plumbing. Ray Train provides a higher-level API for distributed training across both GPUs and cloud clusters.


Chapter 9

MLOps & Experiment Tracking

Training a model is one thing. Managing hundreds of experiments, deploying the best one, and monitoring it in production is another.

The Problem

"You trained 200 models. Which one was best? What settings did it use?"

Without tracking, ML research degenerates into a mess of unnamed checkpoints and forgotten hyperparameters. You need systematic logging.

Weights & Biases (W&B)

Experiment tracking and visualization

Add a few lines to your training loop. W&B logs every metric, hyperparameter, system stat, and model checkpoint to a web dashboard. You can compare runs, visualize loss curves side-by-side, and reproduce any experiment. The current industry standard for experiment tracking.
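
A minimal sketch of what "a few lines" looks like; the project name and hyperparameters are placeholders, and train_one_epoch is an assumed helper:

# Log hyperparameters and metrics to a W&B dashboard
import wandb

wandb.init(project="my-experiments", config={"lr": 1e-3, "batch_size": 64})

for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, dataloader)    # assumed helper function
    wandb.log({"epoch": epoch, "train_loss": train_loss})

wandb.finish()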

MLflow

Open-source experiment + model management

Similar to W&B but self-hosted. Tracks experiments, packages models, and provides a model registry (versioned storage of production models). Stronger on the deployment/registry side than W&B.
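
The MLflow equivalent is similar; a minimal sketch with illustrative parameter and metric names:

# Track a run with MLflow: parameters in, metrics out
import mlflow

with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 6)
    mlflow.log_metric("val_accuracy", 0.93)   # whatever you computed this run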

Deployment

Getting models into production

TorchServe / TF Serving: serve models behind an API endpoint. Triton Inference Server (NVIDIA): high-performance, supports multiple frameworks, handles batching and GPU scheduling. Docker + Kubernetes: containerize the model and orchestrate scaling. Edge deployment: TensorFlow Lite (mobile), ONNX Runtime (any device), CoreML (Apple devices).


Chapter 10

The Map: Which Tool for Which Job

One table. Find your task, get the toolchain.

Task | Tools | Why These

Data Preparation
Load & explore data | Pandas, Polars | DataFrames with grouping, filtering, stats
Preprocess & scale | Scikit-learn, NumPy | StandardScaler, train_test_split, pipelines
Large-scale ETL | Spark (PySpark), Dask | Distributed processing across clusters

Classical ML (Tabular Data)
Classification / Regression | XGBoost, LightGBM, Scikit-learn | Gradient boosted trees beat NNs on tabular
Clustering, PCA | Scikit-learn | K-Means, DBSCAN, PCA all built in
Quick baseline | Scikit-learn | Random forest in 3 lines; benchmark first

Deep Learning (Signals, Images, Text)
Build custom model | PyTorch | Flexible, Pythonic, dominant in research
Quick prototype | PyTorch Lightning, Keras | Less boilerplate, same power
Fine-tune pretrained model | Hugging Face + PyTorch | Thousands of models, one-line downloads
Train on multiple GPUs | PyTorch DDP / FSDP, DeepSpeed | Data & model parallelism

Production & Operations
Track experiments | W&B, MLflow | Compare runs, reproduce results
Serve model via API | Triton, TorchServe, TF Serving | GPU inference at scale
Deploy to mobile/edge | ONNX Runtime, TF Lite, CoreML | Optimized for constrained devices
Export across frameworks | ONNX | Train in PyTorch, deploy anywhere

The practical starting point: For most people starting in ML, the stack is: Pandas (load data) → Scikit-learn (baseline model) → PyTorch (if you need deep learning) → Hugging Face (if a pretrained model exists for your task). Add complexity only when you've hit the limits of the simpler tool.

This guide covers the major tools as of 2026. The ecosystem moves fast — new frameworks appear regularly — but the layers of the stack (hardware → compute → arrays → frameworks → platforms) are stable. Understanding the layers matters more than memorizing specific tools.