Inference Scaling
Why inference — running trained models to serve user requests — is where AI's recurring revenue will come from, and the dynamics driving its growth.
Inference is where AI's recurring revenue will come from. Training a model is a large, periodic cost; serving that model to millions of users is an ongoing, scalable revenue stream.
Key Dynamics
Rapidly Falling Costs
Inference costs are falling approximately 90% per year. This dramatic cost reduction is driven by hardware improvements, software optimisation, model distillation, and architectural innovation. The pace of cost reduction is extraordinary even by technology industry standards.
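To see how quickly a 90% annual decline compounds, here is a minimal sketch. The starting price of $10 per million tokens is an illustrative assumption, not a quoted rate:

```python
# Hypothetical illustration: a 90% annual decline leaves 10% of the
# price each year, so costs fall by three orders of magnitude in three years.
start_price = 10.0       # $ per million tokens (assumed, for illustration)
annual_retention = 0.10  # a 90% decline means 10% of the price survives

for year in range(5):
    price = start_price * annual_retention ** year
    print(f"Year {year}: ${price:.4f} per million tokens")
```

Under these assumptions, a workload that cost $10 to serve in year zero costs a cent by year three, which is why pricing strategy matters more than today's unit cost.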
Growth Over Margin (For Now)
AI companies are currently passing cost savings to users as lower prices, prioritising growth over margin. This is a deliberate strategic choice: expand the user base and embed AI into workflows while costs are falling, then capture margin as the market matures and switching costs increase.
Scaling With User Growth
As the inference business scales with user growth, it will become the dominant revenue driver for AI companies. Every new user, every new application, every new workflow that incorporates AI generates inference demand. The revenue compounds as adoption deepens.
Test-Time Compute
The shift from one-shot prompting to test-time compute — models "thinking longer" on harder problems — means more compute cycles per query, not fewer. When a model reasons through a complex problem step by step, it consumes significantly more inference compute than a simple response. This dynamic pushes per-query compute consumption upward even as per-unit costs fall.
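A rough sketch makes the scale of this effect concrete. It uses the common approximation of about 2 × parameters FLOPs per generated token for a decoder-only model; the model size and token counts below are illustrative assumptions:

```python
# Why reasoning inflates per-query compute: a long chain-of-thought trace
# multiplies the tokens generated, and compute scales with tokens generated.

def query_flops(params: float, tokens_generated: int) -> float:
    """Approximate forward-pass FLOPs: ~2 * parameters per generated token."""
    return 2 * params * tokens_generated

params = 70e9  # assumed 70B-parameter model, for illustration

direct_answer = query_flops(params, 200)      # short one-shot response
reasoning_answer = query_flops(params, 8000)  # extended reasoning trace

print(f"Direct answer:    {direct_answer:.2e} FLOPs")
print(f"Reasoning answer: {reasoning_answer:.2e} FLOPs")
print(f"Ratio: {reasoning_answer / direct_answer:.0f}x")
```

Under these assumed token counts, a reasoning query consumes 40 times the compute of a direct answer, so even steep per-token cost declines can be offset at the per-query level.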
This is analogous to how cloud computing evolved. Infrastructure costs fell steadily, usage expanded dramatically, and the platforms became enormously profitable over time. Early cloud sceptics focused on falling per-unit prices and missed the exponential growth in total usage. The same dynamic is playing out with AI inference.
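The sceptics' arithmetic error can be shown in a toy model. The 90% annual price decline comes from the text above; the 20x annual usage growth is an assumption chosen for illustration:

```python
# Toy model of the cloud-style dynamic: per-unit price falls sharply,
# but usage grows faster, so total revenue still climbs.
price = 1.0  # price per unit of inference (arbitrary starting index)
usage = 1.0  # total units consumed (arbitrary starting index)

for year in range(1, 5):
    price *= 0.10  # 90% annual price decline (from the text)
    usage *= 20.0  # assumed 20x annual usage growth
    print(f"Year {year}: revenue index = {price * usage:.1f}")
```

With these assumed rates, revenue doubles every year even as the unit price collapses: watching the price alone misses the growth in the product.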
The Large-Scale Provider Advantage
The economics of inference will eventually favour large-scale providers with the infrastructure already in place. Running inference efficiently requires:
- Massive GPU clusters with high utilisation rates
- Sophisticated orchestration and load balancing
- Global distribution for low-latency serving
- The capital to invest continuously in next-generation hardware
These requirements naturally favour the hyperscalers and established AI companies that have already built this infrastructure at scale.
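A back-of-envelope sketch shows why utilisation in particular dominates the economics. The GPU hourly cost and throughput figures below are illustrative assumptions, not benchmarks:

```python
# Idle GPU capacity is paid for but earns nothing, so effective cost
# per token scales inversely with utilisation.

def cost_per_million_tokens(gpu_hour_cost: float,
                            tokens_per_second: float,
                            utilisation: float) -> float:
    """Effective serving cost for a given utilisation rate."""
    tokens_per_hour = tokens_per_second * 3600 * utilisation
    return gpu_hour_cost / tokens_per_hour * 1e6

# Same assumed hardware ($2/hour, 1000 tokens/s peak), different utilisation.
for u in (0.2, 0.5, 0.9):
    c = cost_per_million_tokens(gpu_hour_cost=2.0,
                                tokens_per_second=1000,
                                utilisation=u)
    print(f"Utilisation {u:.0%}: ${c:.2f} per million tokens")
```

Under these assumptions, a provider running at 90% utilisation serves the same tokens at less than a quarter of the cost of one running at 20%, which is exactly the edge that large, well-orchestrated fleets have over subscale operators.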
Inference is the recurring revenue engine of AI. Companies positioned to serve inference at scale — with the infrastructure, customer relationships, and capital to maintain their position — are the primary beneficiaries of AI adoption growth regardless of which specific models or applications win.
Related
- The Infrastructure Layer — The hyperscalers building the infrastructure inference runs on
- Jevons Paradox in AI — Why falling inference costs drive more demand, not less
- The Models Layer — How frontier model companies are building inference-driven revenue