The AI industry is shifting from a "bigger is better" strategy to cost optimization. Within the next 12–18 months, up to 80% of workloads are predicted to move to models that are 99 times cheaper than current flagships.

What Happened

Tech companies are adopting hybrid architectures to reduce inference costs. A prime example is the legal service Harvey, which cut task execution costs to a third by combining Claude Opus for complex problem-solving with the GLM 5.1 model, served through Fireworks AI, for routine operations.

Context

A fundamental shift is under way: from unbounded scaling of model capacity to "right-sizing", the deliberate selection of the optimal model for a specific task. Typical patterns include router-based inference and the use of small models for intermediate stages in complex pipelines.
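Router-based inference can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tier names, the relative costs, the keyword heuristic, and the threshold are hypothetical and do not describe how Harvey or any vendor actually routes requests. Production routers usually rely on a trained classifier or a small model rather than keyword matching.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ModelTier:
    name: str                      # hypothetical label, not a real model ID
    relative_cost: float           # cost per request relative to the flagship (= 1.0)
    handler: Callable[[str], str]  # stand-in for a real inference call


def estimate_complexity(prompt: str) -> float:
    """Toy heuristic returning a score in [0.0, 1.0].

    Longer prompts and reasoning-style keywords push the score up.
    """
    keywords = ("analyze", "draft", "compare", "prove", "strategy")
    keyword_hits = sum(word in prompt.lower() for word in keywords)
    length_score = min(len(prompt) / 2000, 1.0)
    return min(0.5 * length_score + 0.25 * keyword_hits, 1.0)


def route(prompt: str, small: ModelTier, flagship: ModelTier,
          threshold: float = 0.5) -> ModelTier:
    """Send a request to the flagship only when it looks complex enough."""
    if estimate_complexity(prompt) >= threshold:
        return flagship
    return small


# Hypothetical tiers; the handlers only echo which tier answered.
SMALL = ModelTier("small-model", 1 / 99, lambda p: f"[small] {p[:40]}")
FLAGSHIP = ModelTier("flagship-model", 1.0, lambda p: f"[flagship] {p[:40]}")

if __name__ == "__main__":
    for prompt in (
        "Extract the filing date from this header.",
        "Analyze and compare the indemnification clauses in these contracts "
        "and draft a negotiation strategy.",
    ):
        tier = route(prompt, SMALL, FLAGSHIP)
        print(tier.name, "->", tier.handler(prompt))
```

The design choice worth noting is that the router itself must be far cheaper than the flagship call it avoids, otherwise the savings disappear.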

Why It Matters for the Industry

Market competition is changing shape: the battle is no longer only between proprietary and open-weight solutions, but also between heavy flagships and highly efficient small models. In the long term, this could put significant pressure on the margins of major laboratories such as OpenAI and Anthropic, because a smaller share of queries would reach their high-margin primary models.

Why It Matters for Users

For developers and businesses, this makes scaling AI products considerably simpler. Lower inference costs allow faster experimentation with new use cases without the risk of massive API bills, and make it practical to integrate complex agentic systems and multi-level workflows into real business processes.
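As a rough check on what the headline prediction would mean for a bill, the sketch below computes a blended cost relative to an all-flagship baseline. It assumes that every workload consumes the same volume of tokens and that the 99x price ratio applies uniformly; both are simplifications, and the function name and parameters are hypothetical.

```python
def blended_relative_cost(cheap_share: float, price_ratio: float) -> float:
    """Cost of a mixed deployment relative to running everything on the flagship.

    cheap_share: fraction of workloads moved to the cheaper model (0.0 to 1.0).
    price_ratio: how many times cheaper the small model is than the flagship.
    """
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    if price_ratio <= 0:
        raise ValueError("price_ratio must be positive")
    # Flagship portion keeps its full price; the rest is divided by the ratio.
    return (1.0 - cheap_share) + cheap_share / price_ratio


# Figures from the prediction: 80% of workloads, models 99 times cheaper.
relative_cost = blended_relative_cost(0.80, 99.0)
savings_factor = 1.0 / relative_cost

print(f"relative cost: {relative_cost:.3f}")    # about 0.208
print(f"savings factor: {savings_factor:.1f}x")  # about 4.8x
```

Under these assumptions the remaining 20% of flagship traffic dominates the bill, so the achievable saving is capped near 5x regardless of how cheap the small model becomes.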

What Remains Unknown / Limitations

Expert discussion of optimization remains broad, ranging from purely engineering questions to economic and legal implications.

Author

Look at AI, Editorial Team