The actual software stack for machine learning. What each tool does, why it exists, what it replaced, and when to use it. Organized from hardware up to high-level platforms.
ML software is layered, like a building. Each layer depends on the one below it. Most practitioners work on only one or two layers and never touch the others. Here's the whole thing, top to bottom:
Pretrained models, experiment tracking, deployment pipelines. You work here when you're using someone else's model or managing your own in production.
Ready-made algorithms for classical ML. You pass in data, call .fit(), get a model. No need to define layers or write training loops.
You define your own architecture, loss function, and training loop. The framework handles automatic differentiation (backpropagation) and GPU execution. This is where most deep learning research and custom model building happens.
Array math, data loading, preprocessing. Everything above is built on top of these. NumPy is the foundation of Python's entire scientific ecosystem.
The physical chips and the low-level software that talks to them. You rarely touch this directly, but everything above depends on it.
Why ML can't run on a normal computer — and what it uses instead.
A single forward pass through a modest network might involve multiplying a 512×1024 matrix by a 1024×512 matrix — that's ~268 million multiply-and-add operations. Multiply by batch size, number of layers, and training iterations, and you're doing quadrillions of arithmetic operations. A CPU does these one at a time (or a handful at a time). You need hardware that does thousands simultaneously.
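A rough sanity check of that arithmetic in NumPy (the shapes mirror the example above; this is illustrative, not a benchmark):

```python
import numpy as np

A = np.random.rand(512, 1024)
B = np.random.rand(1024, 512)
C = A @ B                          # one matrix multiply; output is 512×512

# Each of the 512*512 output entries sums 1024 products:
multiply_adds = 512 * 512 * 1024
print(f"{multiply_adds:,}")        # 268,435,456 — roughly 268 million multiply-adds
```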
A CPU has 8–64 powerful cores. A GPU has thousands of simpler cores (an NVIDIA A100 has 6,912 CUDA cores). Each core is slower than a CPU core, but they all work simultaneously on different parts of the same matrix multiply. A GPU can do a large matrix multiplication 10–100× faster than a CPU. NVIDIA dominates ML hardware. The key GPUs (as of 2026): A100, H100/H200, B200 — each architecture generation roughly doubles throughput (the H200 is an H100 refresh with more, faster memory).
The model's weights, the input data, intermediate activations, and gradients all need to fit in GPU memory simultaneously. An H100 has 80GB. A large model can easily exceed this — which forces you to split the model across multiple GPUs, use smaller batch sizes, or use memory-saving techniques. More often than raw compute speed, VRAM is what limits what you can train.
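A back-of-the-envelope sketch of why. The numbers below assume a hypothetical 7-billion-parameter model trained in full precision with Adam; mixed precision and memory-saving techniques change them substantially:

```python
params = 7e9                      # hypothetical 7B-parameter model
bytes_per_value = 4               # float32

weights    = params * bytes_per_value        # ~28 GB
gradients  = params * bytes_per_value        # ~28 GB
adam_state = 2 * params * bytes_per_value    # ~56 GB (two moments per weight)

total_gb = (weights + gradients + adam_state) / 1e9
print(f"~{total_gb:.0f} GB before activations")   # ~112 GB — already past an 80GB H100
```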
Designed specifically for matrix operations. Available through Google Cloud. The key difference: TPUs are optimized for very large batch sizes and have high-bandwidth connections between chips for distributed training. Used primarily within Google's ecosystem (TensorFlow, JAX). Most of the industry uses NVIDIA GPUs; TPUs are a significant but narrower alternative.
The invisible layer between your Python code and the GPU silicon.
CUDA (Compute Unified Device Architecture) is a C/C++-based language and runtime that lets you write code that runs on NVIDIA GPUs. You almost never write CUDA yourself. PyTorch and TensorFlow call CUDA underneath — when you write tensor.to('cuda'), you're moving data to the GPU and all subsequent operations use CUDA-optimized code.
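A minimal sketch of what that looks like from Python (assumes a PyTorch build with CUDA support):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(512, 1024)                 # created in CPU memory
x = x.to(device)                           # copied to GPU memory (if available)
w = torch.randn(1024, 512, device=device)  # created directly on the GPU

y = x @ w                                  # executes as a CUDA kernel on the GPU
print(y.device)                            # e.g. cuda:0
```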
Convolutions, attention, pooling, normalization — all have heavily optimized CUDA implementations in cuDNN. When PyTorch runs a convolution, it's actually calling cuDNN's implementation, which has been hand-tuned for each GPU generation. This is why NVIDIA's dominance is so durable — the software optimization stack is nearly two decades deep.
ROCm is AMD's GPU computing platform — functional but with less library support. Metal/MPS lets you use Apple Silicon GPUs from PyTorch. OpenCL is vendor-neutral but less optimized. In practice, CUDA's ecosystem advantage means NVIDIA GPUs remain the default for serious ML work.
The foundation layer. Data loading, preprocessing, and array math.
Python is slow for number-crunching. NumPy fixes this by implementing array operations in C under the hood. Instead of writing a Python loop over 1 million numbers, you call one NumPy function that processes all 1 million in compiled C code. The core object is the ndarray — an n-dimensional array of numbers (the Python equivalent of a tensor).
np.array() — create arrays. np.dot() or @ — matrix multiply. .reshape() — change dimensions without changing data. .mean(), .std() — statistics. Slicing (data[:, 0:100]) — select subsets. Broadcasting — NumPy's rule for automatically handling operations between arrays of different shapes (e.g., adding a 1×10 vector to every row of a 1000×10 matrix).
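A quick illustration of vectorization and broadcasting (shapes match the example above):

```python
import numpy as np

data = np.random.rand(1000, 10)       # 1000 samples, 10 features

col_means = data.mean(axis=0)         # shape (10,)
col_stds  = data.std(axis=0)          # shape (10,)

# Broadcasting: the (10,) row is "stretched" across all 1000 rows —
# no Python loop, everything runs in compiled C code.
standardized = (data - col_means) / col_stds

subset = data[:, 0:5]                 # slicing: first 5 columns of every row
print(standardized.shape, subset.shape)   # (1000, 10) (1000, 5)
```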
PyTorch tensors, Pandas DataFrames, and scikit-learn all use NumPy arrays as their interchange format. You load data as NumPy arrays, preprocess them, then convert to whatever format your framework needs. Understanding NumPy's shape system and broadcasting rules is the single most practical skill for ML in Python.
Pandas provides the DataFrame — a table with named columns and an index, like an Excel sheet you can program. Built on top of NumPy. You use it for loading CSVs, filtering rows, grouping, merging tables, and computing statistics.
Pandas is for the stage before model training: loading data, inspecting distributions, handling missing values, feature engineering. Once data is clean and ready, you convert to NumPy arrays or PyTorch tensors for actual model work. Pandas is too slow for anything inside a training loop.
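A typical hand-off from Pandas to the modeling layer might look like this (the file name and column names are made up for illustration):

```python
import pandas as pd
import torch

df = pd.read_csv("sensor_readings.csv")      # hypothetical file
df = df.dropna()                              # handle missing values
df["ratio"] = df["peak"] / df["baseline"]     # simple feature engineering

X = df[["peak", "baseline", "ratio"]].to_numpy()   # Pandas → NumPy
y = df["label"].to_numpy()

X_tensor = torch.tensor(X, dtype=torch.float32)    # NumPy → PyTorch tensor
```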
The deep learning frameworks — where you define architectures, train models, and run inference.
Building a neural network from NumPy alone is possible but painful. You'd need to manually implement every operation, hand-code backpropagation, and write your own GPU kernels. A deep learning framework gives you three things: a tensor library (like NumPy but GPU-accelerated), automatic differentiation (backprop computed automatically), and a library of layers, losses, and optimizers.
Define-by-run (eager execution): you write normal Python and PyTorch records operations as they happen, building the computational graph on the fly. This means you can use Python if-statements, loops, and debuggers normally — the graph is dynamic. When you call loss.backward(), it walks backward through the recorded operations to compute all gradients.
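A tiny autograd example showing the dynamic graph in action:

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()      # y = x1² + x2²; operations are recorded as they run

y.backward()            # walk the recorded graph backward
print(x.grad)           # tensor([4., 6.]) — the gradient dy/dx = 2x
```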
torch.Tensor — like a NumPy array but can live on GPU and tracks gradients. nn.Module — base class for all layers and models; you subclass it to define your architecture. nn.Linear, nn.Conv1d, nn.Transformer — pre-built layers. optim.Adam — optimizers. DataLoader — feeds batches of data to your training loop efficiently.
```python
# Minimal PyTorch training loop
from torch.nn.functional import cross_entropy
from torch.optim import Adam

model = MyNetwork()                                  # your nn.Module subclass
optimizer = Adam(model.parameters(), lr=0.001)

for batch in dataloader:
    prediction = model(batch.input)                  # forward pass
    loss = cross_entropy(prediction, batch.label)    # compute loss
    loss.backward()                                  # compute all gradients
    optimizer.step()                                 # update weights
    optimizer.zero_grad()                            # reset gradients for next batch
```
TensorFlow originally used define-and-run (static graphs): you built the entire computation graph first, then executed it. This was faster at runtime but painful to debug. TensorFlow 2.0 (2019) switched to eager execution by default, converging with PyTorch's approach. Keras is TensorFlow's high-level API — simpler syntax, less control.
PyTorch dominates research (>80% of papers), has a more Pythonic API, and has become the default for most new projects. TensorFlow has stronger production/deployment tools (TensorFlow Serving, TensorFlow Lite for mobile), and TPU integration is smoother. JAX (Google, newer) is a third option favored by some researchers for its functional style and automatic vectorization — think "NumPy with autograd and GPU support."
When you don't need deep learning — the tools for classical ML and tabular data.
Logistic regression, random forests, SVMs, K-means, PCA, preprocessing, cross-validation, metrics — all with the same interface: model.fit(X, y) to train, model.predict(X) to infer. It doesn't support GPUs or deep learning, but for tabular data and classical methods, it's the first tool you reach for.
```python
# Scikit-learn: train → predict → evaluate in 3 lines
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```
Even on deep learning projects, scikit-learn is used for data splitting (train_test_split), scaling (StandardScaler), evaluation (classification_report, roc_auc_score), and cross-validation. It's also where you start when you're unsure if a problem even needs deep learning — train a random forest in 10 seconds, get a baseline, then decide if a neural network is worth the complexity.
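Those utilities in one short, self-contained sketch (the synthetic dataset is just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # fit scaling on training data only
X_test = scaler.transform(X_test)         # apply the same scaling to test data

model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```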
XGBoost, LightGBM (Microsoft), and CatBoost (Yandex) are all implementations of the same idea — gradient boosted decision trees — with different engineering tradeoffs. LightGBM is fastest for large datasets. CatBoost handles categorical features natively. XGBoost is the most established. All three routinely beat neural networks on spreadsheet-style data.
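A sketch of XGBoost's scikit-learn-style API, reusing the split from the example above (the hyperparameter values are illustrative, not recommendations):

```python
from xgboost import XGBClassifier

# Same fit/score interface as scikit-learn
model = XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.05)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
```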
These tools connect directly to the ML Guide's XGBoost section: the architecture is explained there, and these are the libraries that implement it.
The shift from "build models from scratch" to "start from someone else's trained model and adapt it."
Training GPT-scale models costs millions of dollars in compute. Training a Vision Transformer from scratch on your 10,000-image dataset would take weeks and produce mediocre results. The solution: transfer learning. Start with a model that someone else already trained on massive data. Fine-tune it on your smaller, specific dataset. The pretrained model already understands general patterns; you just teach it your specific task.
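A common transfer-learning pattern, sketched here with torchvision's ResNet-18 (torchvision isn't discussed above — it's just one illustrative choice of pretrained backbone):

```python
import torch.nn as nn
from torchvision import models

# Start from ImageNet-pretrained weights someone else already paid to train
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained backbone...
for param in model.parameters():
    param.requires_grad = False

# ...and replace the final layer with one sized for your task (e.g. 5 classes).
# Only this new layer is trained; the backbone's general features are reused.
model.fc = nn.Linear(model.fc.in_features, 5)
```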
Hugging Face is a platform hosting hundreds of thousands of pretrained models (text, vision, audio, multimodal), with a unified Python library (transformers) to load and use them. You can download a state-of-the-art model in three lines of code. The datasets library provides ready-to-use datasets. The Trainer class handles fine-tuning boilerplate.
```python
# Load a pretrained model and classify text
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("This movie was fantastic!")
# → [{'label': 'POSITIVE', 'score': 0.9998}]
```
ONNX (Open Neural Network Exchange) lets you export a model trained in PyTorch and run it in TensorFlow, or in specialized inference engines (TensorRT, ONNX Runtime) that are faster than either framework. It's the "PDF of models" — a portable format that decouples training from deployment.
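A minimal export-and-run sketch, assuming you already have a trained PyTorch model (`model`) and the onnxruntime package installed; the input shape is illustrative:

```python
import torch
import onnxruntime as ort

# Export: trace the model with a dummy input of the right shape
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["output"])

# Run: load the exported file in ONNX Runtime — no PyTorch needed at inference time
session = ort.InferenceSession("model.onnx")
outputs = session.run(None, {"input": dummy_input.numpy()})
```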
When your data or your model doesn't fit on one machine.
Pandas loads everything into RAM. If your dataset is 500GB, Pandas can't open it. You need tools that distribute data across many machines and process it in chunks.
Spark splits your data across hundreds of machines, processes it in parallel, and gives you a DataFrame API similar to Pandas. Used for ETL (Extract, Transform, Load) — cleaning and preparing massive datasets before they go to a training pipeline. Spark MLlib includes distributed versions of classical ML algorithms, but it's not used for deep learning.
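A small PySpark sketch of that kind of ETL job (the file paths and column names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl").getOrCreate()

# Reads are distributed across the cluster; nothing is pulled into one machine's RAM
events = spark.read.parquet("s3://my-bucket/raw_events/")        # hypothetical path

clean = (events
         .filter(F.col("duration") > 0)                          # drop bad rows
         .groupBy("user_id")
         .agg(F.count("*").alias("n_events"),
              F.avg("duration").alias("avg_duration")))

clean.write.parquet("s3://my-bucket/features/")                  # hypothetical path
```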
Polars: a Rust-based DataFrame library that's 10–100× faster than Pandas on a single machine. Handles datasets that would choke Pandas without needing a cluster. Increasingly the default for data that's "big but fits on one beefy machine." Dask: parallelizes Pandas across cores or machines, keeping the familiar API.
Large models don't fit in one GPU's memory. Even models that do fit train faster across multiple GPUs. You need to distribute the work.
Data parallelism: copy the model to 4 GPUs. Each GPU processes a different batch. After the forward and backward pass, average the gradients across all GPUs and update the weights identically. Each GPU sees 1/4 of the data per step, so training is ~4× faster. PyTorch: DistributedDataParallel (DDP). This is the simplest and most common form of scaling.
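A minimal DDP sketch. It assumes the script is launched with `torchrun` (which sets the LOCAL_RANK environment variable) and uses `MyNetwork` as a placeholder model:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyNetwork().to(local_rank)               # placeholder model
model = DDP(model, device_ids=[local_rank])

# The training loop itself is unchanged: during backward(), DDP averages
# gradients across all GPUs so every copy applies the same weight update.
```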
When the model itself doesn't fit on one GPU: Pipeline parallelism puts different layers on different GPUs. Tensor parallelism splits individual layers across GPUs. FSDP (Fully Sharded Data Parallel) shards model weights, gradients, and optimizer state across all GPUs, gathering them only when needed. This is how models with hundreds of billions of parameters are trained.
DeepSpeed (Microsoft) and PyTorch's built-in FSDP handle the sharding, communication, and memory optimization automatically. You write mostly normal PyTorch code and these libraries handle the distributed plumbing. Ray Train provides a higher-level API for distributed training across both GPUs and cloud clusters.
Training a model is one thing. Managing hundreds of experiments, deploying the best one, and monitoring it in production is another.
Without tracking, ML research degenerates into a mess of unnamed checkpoints and forgotten hyperparameters. You need systematic logging.
Add a few lines to your training loop. W&B logs every metric, hyperparameter, system stat, and model checkpoint to a web dashboard. You can compare runs, visualize loss curves side-by-side, and reproduce any experiment. The current industry standard for experiment tracking.
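The "few lines" look roughly like this (the project name, config, and helper functions are placeholders for illustration):

```python
import wandb

wandb.init(project="my-experiments",                       # hypothetical project name
           config={"lr": 0.001, "batch_size": 64})         # hyperparameters to record

for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, dataloader)        # placeholder training step
    wandb.log({"epoch": epoch, "train_loss": train_loss})  # appears live on the dashboard

wandb.finish()
```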
MLflow is similar to W&B but can be self-hosted. It tracks experiments, packages models, and provides a model registry (versioned storage of production models). It's stronger on the deployment/registry side than W&B.
TorchServe / TF Serving: serve models behind an API endpoint. Triton Inference Server (NVIDIA): high-performance, supports multiple frameworks, handles batching and GPU scheduling. Docker + Kubernetes: containerize the model and orchestrate scaling. Edge deployment: TensorFlow Lite (mobile), ONNX Runtime (any device), CoreML (Apple devices).
One table. Find your task, get the toolchain.
| Task | Tools | Why These |
|---|---|---|
| Data Preparation | | |
| Load & explore data | Pandas, Polars | DataFrames with grouping, filtering, stats |
| Preprocess & scale | Scikit-learn, NumPy | StandardScaler, train_test_split, pipelines |
| Large-scale ETL | Spark (PySpark), Dask | Distributed processing across clusters |
| Classical ML (Tabular Data) | | |
| Classification / Regression | XGBoost, LightGBM, Scikit-learn | Gradient boosted trees beat NNs on tabular |
| Clustering, PCA | Scikit-learn | K-Means, DBSCAN, PCA all built in |
| Quick baseline | Scikit-learn | Random forest in 3 lines; benchmark first |
| Deep Learning (Signals, Images, Text) | | |
| Build custom model | PyTorch | Flexible, Pythonic, dominant in research |
| Quick prototype | PyTorch Lightning, Keras | Less boilerplate, same power |
| Fine-tune pretrained model | Hugging Face + PyTorch | Thousands of models, one-line downloads |
| Train on multiple GPUs | PyTorch DDP / FSDP, DeepSpeed | Data & model parallelism |
| Production & Operations | | |
| Track experiments | W&B, MLflow | Compare runs, reproduce results |
| Serve model via API | Triton, TorchServe, TF Serving | GPU inference at scale |
| Deploy to mobile/edge | ONNX Runtime, TF Lite, CoreML | Optimized for constrained devices |
| Export across frameworks | ONNX | Train in PyTorch, deploy anywhere |
This guide covers the major tools as of 2026. The ecosystem moves fast — new frameworks appear regularly — but the layers of the stack (hardware → compute → arrays → frameworks → platforms) are stable. Understanding the layers matters more than memorizing specific tools.