People in ML recently started using “inference” to mean running a model: getting an output given a concrete input.
This clashes pretty hard with the conventional usage (from statistics), which is much more like training. We used to infer the values of unobserved parameters \(\theta\) from observed \(X\); now we “infer” the output \(X\) of a set of already-fixed parameters \(\theta\). How did that happen?
In stats
Before stats, it was mostly just another word for logical deduction. In logic it means “solving”.
“Inference” first got its modern meaning during the great formalisation of the 1920s: at the latest by 1925, in Fisher’s insanely influential “Statistical Methods for Research Workers”.
Bishop’s classic 2006 ML-and-stats textbook uses “inference” strictly to mean training, and uses “decision” for concrete predictions. Murphy (2012) is much the same with “prediction”. The 1995 Helmholtz machine paper is an interesting example of using “inference” for a generative model, but it’s still properly probabilistic.
(The ML framework people, constrained perhaps by taste or backwards-compatibility, still call the relevant function .predict() or .eval() or .forward() rather than .infer().)
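To make the naming collision concrete, here’s a toy sketch (a hypothetical class, not any real framework’s API) of how the “run the model” step is exposed under those older names:

```python
# Illustrative sketch: a fixed, already-trained model y = 2x + 1.
# The parameters (w, b) play the role of theta; running it is what
# ML now calls "inference", but the methods keep the older names.
class TinyModel:
    def __init__(self):
        self.w, self.b = 2.0, 1.0  # theta, already fixed by "training"

    def forward(self, x):          # PyTorch-style name
        return self.w * x + self.b

    def predict(self, x):          # sklearn-style name
        return self.forward(x)

    __call__ = forward             # model(x) routes to forward()

m = TinyModel()
print(m.predict(3.0))  # → 7.0  (the "inference" step)
```

All three spellings do the same thing; none of them is called `infer`.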
When did this shift happen?
Can we get a rough look with Google Trends? No:
But while suggestive, this doesn’t actually distinguish the stats and ML usages, and it entirely misses the c. 2015 shift, when ML papers started using the word.
^ This is no better.
I can at least upper bound the date to 2012 and blame Jeff Dean:
- 2012: His DistBelief paper (Google) uses “inference” casually, as if it were established usage. In the 2007 n-gram paper, they say “apply” instead.
- 2015: An NVIDIA architect, Andersch, writes a splashy blogpost.
- 2016: NVIDIA GPU Inference Engine and popularisations.
- 2017: The TensorFlow paper cements it. Also the “TensorFlow Inference” library.
- 2025: huge spike in usage, maybe from outsiders catching on to inference scaling, maybe because reasoning models are so cool and pretraining scaling has lately been underwhelming. (But o1 was September 2024!)
(Jen-Hsun’s inequality: a tech CEO can blindly overrule centuries of usage and you can’t.)
You can see language degrading further in Google’s recent pop retcons. There we are said to be inferring “what’s going to happen” (unobserved) from the past (observed). Inferring the next token. Meh.
There’s one way to retcon it into making sense: we’re inferring the unobserved output distribution \(O\) for a given model \(M\) and input prompt \(P\).
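One way to write that retcon down (my gloss, not Google’s): for a language model with fixed weights \(\theta\) and prompt \(P\), the “inferred” object is the next-token distribution

\[
p(O = o \mid \theta, P) = \mathrm{softmax}\big(f_\theta(P)\big)_o ,
\]

where \(f_\theta\) maps the prompt to a vector of logits. The awkward part remains: \(\theta\) is fully observed here, so the only “unobserved” thing is the output we haven’t computed yet.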
Why?
“Inference” sounds cool, way better than “prediction”. Maybe because it’s borrowing status from logic more than stats.
Since last year, the number of tokens used per LLM query can vary by a factor of millions, so it makes sense to have a word for “more runtime compute”.
Original sin
What’s the problem?
The old problem - which is the statisticians’ fault - is that “infer” is ambiguous between “infer correctly” and “make some guess, right or wrong”. (“Predict” has a similar issue but less; we all know that predictions fail.)
I should say it’s a good thing that stats and ML are distinct. It allowed us to move to powerful black-box methods, sidestepping the powerful prejudice against them. It’s totally fine to say “training” instead of “inference”. But namespace collisions are lamentable and could have easily been avoided.
- See also "decision", "testing", "forward pass", "[function] evaluation", "forward propagation", "apply", "simulation". The original TensorFlow API called it `.run()`. So really it was a mess waiting for some consensus term.
