The Brute-Force Era of AI (and What Comes After)
I think we are in the brute-force era of AI.
By that I mean that progress is coming less from fundamentally new ideas and more from applying massive amounts of data and compute to existing architectures. Larger pretraining corpora, longer context windows, more GPUs — more of everything.
And to be clear, this approach works. Today’s frontier models are materially better than their predecessors. They reason more coherently, generalize more broadly, and fail less often. But when you ask why they are better, the answer is usually some version of “we scaled it,” not “we discovered something fundamentally new.”
In other words, we are getting more, not different.
The Bitter Lesson
This pattern maps cleanly to Richard Sutton’s Bitter Lesson, which argues that the biggest long-run gains in AI come not from building in human knowledge through clever, hand-designed features, but from general methods that scale with compute.
That lesson has proven remarkably durable. It explains the transition from symbolic AI to machine learning, from feature engineering to deep learning, and from narrow models to large language models. And today, we are applying it aggressively, perhaps even reflexively. When in doubt, scale it.
Bittersweet Models
But the latest generation of systems complicates the story in an interesting way. Models such as DeepSeek V4 are simultaneously enormous and increasingly efficient.
On paper, these models are huge. They operate at trillion-parameter scale, with massive context windows and extensive training runs. But at inference time, they do not behave like monolithic, fully activated networks. Instead, they rely on a set of techniques designed to reduce the amount of computation required per query.
Mixture-of-experts architectures activate only a subset of parameters for any given token. Sparse connectivity reduces unnecessary computation. Quantization lowers precision in ways that preserve accuracy while dramatically improving efficiency. Compression techniques optimize how model weights and intermediate representations are stored and accessed.
The result is a system that is large in principle but relatively efficient in practice. Training remains brute force. Inference increasingly does not.
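To make the mixture-of-experts point concrete, here is a minimal sketch of top-k expert routing. The sizes, the router, and the expert weights are illustrative assumptions rather than any particular model’s architecture; the point is simply that each token touches only a small fraction of the total parameters.

```python
# Minimal sketch of top-k mixture-of-experts routing. All sizes and weights
# are illustrative; real systems add load balancing, batching, and fused kernels.
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 64, 8, 2
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    """Route one token vector through only top_k of the n_experts."""
    logits = x @ router                      # score every expert for this token
    chosen = np.argsort(logits)[-top_k:]     # keep only the top-k experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                 # softmax over the chosen experts only
    # Only the chosen experts' parameters are read and multiplied; the rest of
    # the model exists but contributes no compute for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.standard_normal(d_model)
out = moe_forward(token)
print(out.shape, f"active experts: {top_k} of {n_experts}")
```

Even in this toy version, the per-token compute scales with top_k rather than n_experts, which is the mechanism behind models that are enormous on paper yet comparatively cheap to serve.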
Constraint as a Driver of Efficiency
It is not coincidental that much of this efficiency work is emerging from environments where compute is constrained. When you cannot simply add more GPUs, you are forced to extract more value from the ones you have.
That dynamic has a long history in engineering. Systems built under constraint tend to be more efficient, more elegant, and more disciplined. Anyone who has written low-level code on constrained hardware has seen this firsthand. When you only have kilobytes of memory, you learn to think differently about how software is structured.
When capital and compute are abundant, that discipline is often deferred. It is faster, and often more effective in the short term, to follow the Bitter Lesson and scale.
A Two-Phase Pattern
All of this suggests a refinement to the original thesis. We are not simply in a brute-force era. We are in the first phase of a two-phase cycle.
In the first phase, brute force dominates because it is the fastest way to discover what works. In the second phase, efficiency becomes the priority because it is the only way to make those discoveries economically viable at scale.
We have seen this pattern repeatedly in other domains. Early cloud infrastructure was heavily overprovisioned before cost optimization became a discipline. Early web applications prioritized functionality over performance before latency and efficiency became central concerns. Early software systems accumulated features before being refactored into more coherent architectures.
AI appears to be following the same trajectory.
The Role of Capital and Competition
The current emphasis on scale is also a function of the competitive and financial environment. When capital is available, it is rational to prioritize speed over efficiency. Scaling works, it produces visible results, and it is easier to fund.
That is why the leading labs — OpenAI, Anthropic, Google, Meta, and Microsoft — continue to push the frontier outward. Larger models win benchmarks, attract attention, and help establish early leadership.
At the same time, the competitive landscape is shifting quickly, and it is not yet clear how durable any of these advantages will be. Unlike in earlier platform battles, switching costs in AI may prove lower, particularly at the model layer. If that is true, then efficiency, cost structure, and system design could become more important sources of differentiation over time.
The Bitterest Lesson
The original Bitter Lesson argues that scaling wins. A more uncomfortable extension of that idea is that scaling alone is not sufficient.
The reason is not that scaling stops working. It is that scaling does not solve for efficiency. Training costs remain high, inference costs matter at scale, and latency and energy consumption are real constraints in production systems.
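A rough back-of-envelope makes the inference-cost point tangible. Every figure below is an assumption chosen only for illustration, not a measured number from any provider.

```python
# Hypothetical inference-cost arithmetic; every number here is assumed.
tokens_per_query = 1_500           # assumed average prompt plus completion
queries_per_day = 50_000_000       # assumed traffic for a large consumer product
cost_per_million_tokens = 2.00     # assumed blended dollar cost of serving

daily_cost = tokens_per_query * queries_per_day / 1e6 * cost_per_million_tokens
print(f"${daily_cost:,.0f} per day, about ${daily_cost * 365 / 1e6:,.0f}M per year")
```

Under those assumptions, halving inference cost through quantization, sparsity, or better serving is worth tens of millions of dollars a year, which is exactly when latency, energy, and cost stop being afterthoughts.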
At some point, those factors become first-order concerns. When they do, architecture, optimization, and system design reassert themselves.
That leads to what might be called the bitterest lesson: the Bitter Lesson itself is not infinitely scalable because it is not inherently efficient.
Which raises a natural question. Have we become too “Bitter Lesson–pilled”? Have we overlearned the idea that scaling is the answer to every problem, and underinvested in the kinds of engineering discipline that make systems practical and economical?
Likely Direction
None of this implies that scaling is over. Frontier models will continue to improve, and there will be meaningful advances along that path. But it does suggest that the long-term shape of the market will not be defined by brute force alone.
More likely, we will see a combination of approaches: models that are large at training time, increasingly efficient at inference, and embedded in systems that rely on retrieval, tools, and composition to deliver results.
The brute-force era does not end so much as it evolves. What begins as pure scaling gradually incorporates efficiency, specialization, and better system design.
And historically, that second phase is where a great deal of the economic value is created.