You trained a machine learning model. It works well. The accuracy is good. But when you deploy it and real users start making requests, it is slow. Each prediction takes too long. Your server costs are too high. Users are waiting.

This is one of the most common problems in production ML. Building a model that works is one challenge. Making that same model run fast and cheaply at scale is a completely different challenge.

This guide covers the most impactful techniques for speeding up model inference: batching, GPU acceleration, pruning, quantisation, plus a few quick wins such as disabling gradient tracking and exporting the model to an optimised runtime. Each one is explained from scratch with simple analogies and working Python code.

What Is Inference

When you train a model, you are teaching it. You feed it thousands of examples and it slowly adjusts its internal numbers until it gets good at making predictions.

When you run inference, you are using the trained model. You give it new input — a photo, a sentence, a row of data — and it gives you a prediction back. This is what happens every time a user interacts with your deployed model.

Training happens once and can take hours or days. Inference happens constantly and must be fast. A user asking your app to classify an image expects an answer in milliseconds, not seconds.

ℹ️ Latency vs throughput: Latency is how long one prediction takes. Throughput is how many predictions per second you can handle. Some techniques reduce latency. Others increase throughput. Often you need both.
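The difference is easy to see with a little arithmetic. The numbers below are made up for illustration; real values depend entirely on your model and hardware:

```python
# Hypothetical numbers: one prediction takes 50 ms
latency_s = 0.050

# Serving one request at a time, throughput is just 1 / latency
sequential_throughput = 1 / latency_s  # about 20 predictions per second

# Suppose a batch of 32 takes only 80 ms in total: each user waits
# slightly longer, but throughput jumps
batch_size = 32
batch_time_s = 0.080
batched_throughput = batch_size / batch_time_s  # about 400 predictions per second

print(f'Sequential: {sequential_throughput:.0f} preds/s')
print(f'Batched:    {batched_throughput:.0f} preds/s')
```

This is why batching is usually framed as a throughput technique: each individual prediction is not faster, but the server handles far more of them per second.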

Why Models Feel Slow

A machine learning model is essentially a very large collection of numbers called weights, and a set of mathematical operations that transform your input using those weights to produce an output. A small model might have millions of weights. A large model can have billions.

Every time you run inference, all of those mathematical operations happen. The more weights, the more operations, the longer it takes. Three main things make inference slow:

  • Model size — too many weights, too many operations to run
  • Hardware mismatch — running on a CPU when a GPU would be far faster for this kind of maths
  • Underutilisation — processing one request at a time when the hardware can handle many at once

Each of the techniques in this guide targets one or more of these three problems.
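To make "more weights means more operations" concrete, here is a back-of-the-envelope count for a single fully connected layer. The layer sizes are arbitrary examples:

```python
# Cost of one fully connected layer mapping 512 inputs to 256 outputs
in_features, out_features = 512, 256

# Parameters: one weight per input-output pair, plus one bias per output
weights = in_features * out_features + out_features

# Work per inference: one multiply-add per weight in the matrix
mult_adds = in_features * out_features

print(f'Parameters: {weights:,}')               # 131,328
print(f'Multiply-adds per input: {mult_adds:,}')  # 131,072
```

A real model stacks dozens or hundreds of such layers, so every inference request triggers millions to billions of these multiply-adds.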


Batching Inputs

How Batching Works

Imagine you are a chef making pizza. You could bake one pizza at a time, wait for it to finish, then bake the next one. Or you could put ten pizzas in the oven at once and cook them all in the same amount of time it would take to cook one.

That is exactly what batching does for model inference. Instead of running your model on one input at a time, you collect a group of inputs together and run the model once on the whole group. The model processes them all in parallel, and you get all the results back at once.

This works especially well on GPUs, because a GPU is built to do many operations at the same time. Running a batch of 32 images through a model on a GPU takes roughly the same time as running one image — but you get 32 results instead of one.

ℹ️ Batch size matters: a batch size that is too small wastes your GPU's parallel power. A batch size that is too large fills up your GPU memory and causes errors. The sweet spot depends on your model and your hardware. Common starting points are 16, 32 or 64.
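One way to reason about the upper limit is the memory the batch itself consumes. This sketch counts only the input tensor; activations inside the model add substantially more on top:

```python
# Memory needed just to hold the input batch on the GPU
batch_size = 32
channels, height, width = 3, 224, 224  # a typical image input shape
bytes_per_value = 4                    # float32 is 4 bytes per number

batch_bytes = batch_size * channels * height * width * bytes_per_value
print(f'Input batch alone: {batch_bytes / 1e6:.1f} MB')  # 19.3 MB
```

Doubling the batch size doubles this figure, which is why out-of-memory errors appear suddenly once you push the batch size too far.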

Batching in Python with PyTorch

Python — running inference with a batch vs one at a time
```python
import torch
import time

# Pretend this is your trained model
model = torch.hub.load('pytorch/vision', 'resnet18', pretrained=True)
model.eval()

# Pretend these are 64 images (each 3 channels, 224x224 pixels)
images = [torch.randn(3, 224, 224) for _ in range(64)]

# SLOW — process one image at a time
start = time.time()
results = []
for img in images:
    output = model(img.unsqueeze(0))  # add batch dimension
    results.append(output)
print(f'One at a time: {time.time() - start:.2f}s')

# FAST — stack all images into a single batch and run once
start = time.time()
batch = torch.stack(images)  # shape: [64, 3, 224, 224]
batch_output = model(batch)  # all 64 at once
print(f'Batched: {time.time() - start:.2f}s')

# Batched is typically 5 to 20x faster on GPU
```
Python — a simple request queue for dynamic batching
```python
import asyncio

import torch

# In a real server you collect incoming requests into a queue
# then process them together every N milliseconds
class BatchingServer:
    def __init__(self, model, max_batch_size=32, wait_ms=10):
        self.model = model
        self.max_batch_size = max_batch_size
        self.wait_ms = wait_ms
        self.queue = []

    async def add_request(self, input_data):
        self.queue.append(input_data)
        # Wait a short time to collect more requests
        await asyncio.sleep(self.wait_ms / 1000)
        # Process whatever has accumulated — a full batch or a partial one —
        # so no request is left waiting in the queue forever
        return await self.process_batch()

    async def process_batch(self):
        if not self.queue:
            return None
        batch = self.queue[:self.max_batch_size]
        self.queue = self.queue[self.max_batch_size:]
        inputs = torch.stack(batch)
        outputs = self.model(inputs)
        return outputs
```

GPU Acceleration

CPU vs GPU — What Is the Difference

A CPU (the main processor in your computer) is very good at complex tasks that need to be done one at a time in a specific order. It has a small number of very powerful cores — usually 8 to 16 on a modern machine.

A GPU (a graphics card) was originally designed to render video game graphics. It has thousands of smaller, simpler cores that all run at the same time. Each core is far less capable than a CPU core, but together they can do thousands of simple maths operations in parallel.

Machine learning is basically just a huge amount of simple maths done in parallel — multiplying and adding matrices over and over. This maps perfectly onto what a GPU does. A task that takes 10 seconds on a CPU can sometimes take 0.2 seconds on a GPU.

Moving Your Model to the GPU in PyTorch

Python — sending model and data to the GPU
```python
import torch

# Check if a GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using device: {device}')

# Move the model to the GPU
model = model.to(device)

# Move input data to the SAME device as the model
# The model and data must always be on the same device
input_tensor = input_tensor.to(device)

# Now run inference — happens on the GPU
output = model(input_tensor)

# If you need the result as a regular Python number
# move it back to CPU first
result = output.cpu().detach().numpy()

# For Apple Silicon (M1, M2, M3 chips) use 'mps' instead of 'cuda'
device = (
    'cuda' if torch.cuda.is_available()
    else 'mps' if torch.backends.mps.is_available()
    else 'cpu'
)
```
Write device-agnostic code: always use device = 'cuda' if torch.cuda.is_available() else 'cpu' instead of hardcoding 'cuda'. This way your code works on machines without a GPU too — useful for local development and testing.

Model Pruning — Removing the Parts That Do Not Help

When a neural network learns, many of its weights end up being very close to zero. These near-zero weights do almost nothing to the output — they are not contributing to the model's accuracy in any meaningful way. Pruning removes them.

Think of it like editing a book. After the first draft, you read through and remove all the sentences that do not actually add anything. The story is the same, the meaning is the same, but the book is shorter and easier to read quickly. Pruning does this to a neural network.

After pruning, the model has fewer active connections. This means fewer operations at inference time, which means faster predictions and a smaller file on disk.

Pruning in Python with PyTorch

Python — unstructured pruning with torch.nn.utils.prune
```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A simple model with two linear layers
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(512, 256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = SimpleModel()

# Prune 30% of the smallest weights in the first layer
prune.l1_unstructured(model.fc1, name='weight', amount=0.3)

# Check how many weights are now zero
total_weights = model.fc1.weight.nelement()
pruned_weights = float(torch.sum(model.fc1.weight == 0))
print(f'Sparsity in fc1: {100 * pruned_weights / total_weights:.1f}%')

# Make the pruning permanent (remove the mask, simplify the model)
prune.remove(model.fc1, 'weight')

# Prune all layers at once with global pruning
parameters_to_prune = (
    (model.fc1, 'weight'),
    (model.fc2, 'weight'),
)
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.4,  # remove 40% of all weights globally
)
```
⚠️ Always test after pruning. Removing weights can reduce accuracy. The standard approach is to prune gradually (10 to 20% at a time), then fine-tune the model with a small amount of training to recover accuracy, then prune again. This is called iterative pruning.

Quantisation — Using Smaller Numbers

By default, most neural network weights are stored as 32-bit floating point numbers (called float32). Each weight takes up 4 bytes of memory. Quantisation means switching to smaller number formats — like 8-bit integers (int8) — which use only 1 byte each.

Think of it like converting a high-resolution photo to a slightly lower resolution. You lose a tiny bit of detail, but the file is four times smaller and loads much faster. For most models, the accuracy difference is barely noticeable but the speed improvement is significant.

Quantisation gives you three benefits at once: the model file is smaller, it loads faster, and inference is faster because 8-bit maths is cheaper than 32-bit maths on most hardware.
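The size benefit is simple arithmetic. Assuming a model with roughly 11.7 million weights (about the size of ResNet-18), the weights alone shrink as follows:

```python
# Memory footprint of the weights, before and after quantisation
num_weights = 11_700_000  # roughly ResNet-18 sized (assumption)

fp32_bytes = num_weights * 4  # float32: 4 bytes per weight
int8_bytes = num_weights * 1  # int8: 1 byte per weight

print(f'float32: {fp32_bytes / 1e6:.1f} MB')  # 46.8 MB
print(f'int8:    {int8_bytes / 1e6:.1f} MB')  # 11.7 MB
```

The same four-to-one ratio applies to memory bandwidth at inference time, which is often the real bottleneck on CPUs.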

Quantisation in Python with PyTorch

Python — dynamic quantisation with PyTorch
```python
import torch
import torch.quantization

# Dynamic quantisation — the easiest way to start
# Converts linear and LSTM layers to int8 automatically
quantised_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},   # which layer types to quantise
    dtype=torch.qint8    # use 8-bit integers
)

# Compare file sizes
torch.save(model.state_dict(), 'model_fp32.pt')
torch.save(quantised_model.state_dict(), 'model_int8.pt')
# model_int8.pt is typically 3 to 4 times smaller than model_fp32.pt

# You use the quantised model exactly the same way as the original
output = quantised_model(input_tensor)
```
Python — measuring the speedup from quantisation
```python
import time

import torch

def benchmark(m, inp, runs=100):
    m.eval()
    with torch.no_grad():
        start = time.time()
        for _ in range(runs):
            m(inp)
    return (time.time() - start) / runs

sample = torch.randn(1, 512)
fp32_time = benchmark(model, sample)
int8_time = benchmark(quantised_model, sample)
print(f'Original (fp32): {fp32_time * 1000:.2f}ms per call')
print(f'Quantised (int8): {int8_time * 1000:.2f}ms per call')
print(f'Speedup: {fp32_time / int8_time:.1f}x faster')
```

Turn Off Gradient Tracking at Inference Time

During training, PyTorch keeps track of all the mathematical steps it took to produce each output. It needs this information to update the model's weights. This bookkeeping uses extra memory and time.

At inference time, you are not training — you are just making predictions. You do not need that bookkeeping at all. Turning it off with torch.no_grad() is one of the simplest wins available.

Python — always use no_grad for inference
```python
import torch

# Without no_grad — PyTorch tracks every operation for backprop
# This wastes memory and slows things down
output = model(input_tensor)  # slow at inference time

# With no_grad — no bookkeeping, just the forward pass
with torch.no_grad():
    output = model(input_tensor)  # faster, uses less memory

# Also call model.eval() before inference
# This turns off Dropout layers and makes BatchNorm use running stats
model.eval()

# Full correct inference pattern
model.eval()
with torch.no_grad():
    predictions = model(batch.to(device))
```
Two lines that always go together: model.eval() and torch.no_grad(). Put them at the start of every inference function you write. Together they can reduce inference time by 20 to 40% with zero effort.

TorchScript and ONNX — Exporting for Speed

When you run a regular PyTorch model, Python itself adds some overhead to every operation. You can remove this overhead entirely by compiling the model into a format that runs without Python involved at all.

Two popular options are TorchScript (stays in the PyTorch ecosystem) and ONNX (a universal format that works with many different runtimes including TensorRT, OpenVINO and CoreML).

Python — exporting with TorchScript and ONNX
```python
import torch

# Option 1: TorchScript — compiles the model to a static graph
model.eval()
example_input = torch.randn(1, 3, 224, 224)

# Trace the model with an example input
scripted_model = torch.jit.trace(model, example_input)
scripted_model.save('model_scripted.pt')

# Load and run — no Python overhead
loaded = torch.jit.load('model_scripted.pt')
output = loaded(example_input)

# Option 2: ONNX — export to universal format
torch.onnx.export(
    model,
    example_input,
    'model.onnx',
    input_names=['input'],
    output_names=['output'],
    opset_version=17
)

# Run with ONNX Runtime — often 2 to 5x faster than PyTorch
import onnxruntime as ort

session = ort.InferenceSession('model.onnx')
outputs = session.run(None, {'input': example_input.numpy()})
```

Technique Comparison

| Technique | Typical Speedup | Accuracy Loss | Effort |
|---|---|---|---|
| model.eval() and no_grad() | 20 to 40% | None | 2 lines of code |
| Batching inputs | 5 to 20x on GPU | None | Low |
| Move to GPU | 10 to 50x | None | Low (one .to(device) call) |
| Quantisation (int8) | 2 to 4x | Less than 1% | Low to Medium |
| Pruning (40 to 60%) | 1.5 to 3x | Small — needs fine-tuning | Medium |
| TorchScript export | 1.5 to 2x | None | Low |
| ONNX Runtime | 2 to 5x | None | Medium |
ℹ️ Stack the techniques: these methods work together. A model that is pruned, quantised, exported to ONNX and batched on a GPU will be significantly faster than one using only one of these techniques. Start with the easiest wins (no_grad and batching) and add more from there.

⚡ Key Takeaways
  • Inference is when you use your trained model to make predictions. It must be fast because it runs every time a user makes a request.
  • Models are slow because of three things: too many weights, running on the wrong hardware, or processing one input at a time when the hardware can handle many at once.
  • Always call model.eval() and use torch.no_grad() for inference. This turns off training-only features and stops bookkeeping you do not need. It is free performance.
  • Batching groups multiple inputs together and runs them through the model at once. On a GPU this can be 5 to 20 times faster than processing one input at a time.
  • GPUs have thousands of cores that run simple maths in parallel. Use .to(device) to move both your model and your input data to the GPU. Always use the same device for both.
  • Pruning removes near-zero weights that contribute almost nothing to predictions. Always test accuracy after pruning and use iterative pruning with fine-tuning for best results.
  • Quantisation switches weights from 32-bit floats to 8-bit integers. The model becomes 3 to 4 times smaller and inference becomes faster with minimal accuracy impact.
  • TorchScript and ONNX compile your model into formats that run without Python overhead. ONNX Runtime in particular can be 2 to 5 times faster than standard PyTorch on CPU.
  • Stack the techniques. Start with eval() and no_grad(), then add batching, then GPU, then quantisation. Each step stacks on top of the previous one.