Toward the end of 2024, I shared my perspective on the ongoing debate about whether AI’s “scaling laws” were encountering a real-world technical barrier. My argument was that the question of whether these scaling laws are hitting a wall is less critical than many believe. The reality is that we already have AI systems powerful enough to profoundly transform our world. The next few years will undoubtedly be shaped by AI advancements, regardless of whether scaling laws continue to hold.
Predicting the future of AI is always risky because the field evolves so rapidly. It’s embarrassing enough when your predictions for the year ahead don’t come true. But when your predictions for the *week* ahead are proven wrong? That’s a whole new level of humbling.
Less than a week after I published that piece, OpenAI unveiled their latest large language model (LLM), o3, as part of their end-of-year announcements. While o3 doesn’t entirely disprove the idea that scaling laws are becoming less effective in driving AI progress, it definitively refutes the claim that AI advancement is hitting a wall.
o3 is nothing short of extraordinary. To fully grasp its capabilities, we need to take a brief detour into how we measure AI systems.
### Standardized Tests for AI
Comparing two language models requires evaluating their performance on a set of problems they haven’t encountered before. This is easier said than done, as these models are trained on vast amounts of text, meaning they’ve likely seen most standard tests already.
To address this, machine learning researchers create benchmarks—standardized tests designed to compare AI systems directly against each other and against human performance across various tasks, such as math, programming, and text interpretation. For a while, AIs were tested on challenges like the US Math Olympiad and advanced problems in physics, biology, and chemistry.
The issue, however, is that AI systems have been improving so rapidly that they keep rendering these benchmarks obsolete. Once an AI performs exceptionally well on a benchmark, we say the benchmark has become “saturated,” meaning it no longer effectively distinguishes between AI capabilities because nearly all models achieve near-perfect scores.
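To make the idea of saturation concrete, here is a minimal, purely illustrative sketch of how one might flag a saturated benchmark. The model names, scores, and the 95% cutoff are hypothetical assumptions for illustration, not any lab’s actual evaluation methodology.

```python
# Illustrative sketch: a benchmark is "saturated" once nearly every model
# scores near the ceiling, so it no longer distinguishes between them.
# All models, scores, and the 0.95 threshold below are hypothetical.

SATURATION_THRESHOLD = 0.95  # assumed cutoff for "near-perfect" performance


def accuracy(model_answers, reference_answers):
    """Fraction of benchmark questions a model answered correctly."""
    correct = sum(a == ref for a, ref in zip(model_answers, reference_answers))
    return correct / len(reference_answers)


def is_saturated(model_scores, threshold=SATURATION_THRESHOLD):
    """The benchmark stops being informative once every model clears the bar."""
    return all(score >= threshold for score in model_scores.values())


# Hypothetical scores on a held-out question set
scores = {"model_a": 0.97, "model_b": 0.96, "model_c": 0.98}
if is_saturated(scores):
    print("Benchmark saturated: scores no longer distinguish the models.")
```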
By 2024, benchmark after benchmark for AI capabilities had become as saturated as the Pacific Ocean. For example, GPQA, a benchmark for physics, biology, and chemistry that was once so challenging that even PhD students in those fields struggled to score above 70%, has now been mastered by AI systems that outscore those human experts. Similarly, in the Math Olympiad qualifier, AI models now perform on par with top human competitors. Even the MMLU benchmark, designed to measure language understanding across multiple domains, has been saturated by the best models. ARC-AGI, a benchmark intended to measure general humanlike intelligence, saw o3 achieve a staggering 88% score when fine-tuned for the task.
We can always create new benchmarks (ARC-AGI-2 is on the horizon and promises to be significantly harder), but given the pace of AI progress, each new benchmark only remains relevant for a few years at best. More importantly, as AI systems continue to advance, benchmarks increasingly need to measure performance on tasks that even humans cannot perform, just to keep up with what these systems are capable of.
Yes, AIs still make frustrating and seemingly nonsensical errors. But if it’s been six months since you last engaged with the latest AI systems, or if you’ve only experimented with free, outdated versions of language models, you’re likely overestimating how often they make mistakes and underestimating their ability to handle complex, intellectually demanding tasks.
### The Invisible Wall
In a recent *Time* article, Garrison Lovely argued that AI progress hasn’t so much “hit a wall” as it has become invisible. Improvements are happening in areas that most people don’t notice or interact with. For instance, I’ve never tried using an AI to solve elite-level programming, biology, mathematics, or physics problems, and even if I did, I wouldn’t be able to verify the accuracy of the results.
The progress of a 5-year-old learning arithmetic versus a high schooler learning calculus is easy to see and understand. But the difference between a first-year math undergraduate and a world-renowned mathematician is far less tangible to most of us. Similarly, AI’s progress from solving basic problems to tackling advanced, specialized tasks has been less visible—but no less significant.
This progress is a big deal. AI is poised to transform our world by automating a vast amount of intellectual work previously done by humans. Three key factors will drive this transformation:
1. **Cost Reduction**: o3 delivers astonishing results, but it can cost over $1,000 to tackle a single complex problem. However, the recent release of China’s DeepSeek suggests that high-quality AI performance might soon become far more affordable.
2. **Improved Interfaces**: There’s widespread confidence that innovation in how we interact with AI systems—how they verify their work, and how we assign tasks to them—will unlock significant potential. Imagine a system where a mid-tier chatbot handles most tasks but can seamlessly call in a more advanced (and expensive) model when needed; a rough sketch of that routing pattern follows this list. This kind of product development, as opposed to pure technical advancement, is what I warned in December would reshape our world even if AI progress stalled.
3. **Increased Intelligence**: Despite claims that AI is hitting a wall, the latest systems are clearly getting smarter. They’re better at reasoning, problem-solving, and approaching expertise in a wide range of fields. In fact, we’re still figuring out how to measure their intelligence now that they’ve surpassed human performance on many benchmarks.
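To make the escalation idea from point 2 concrete, here is a minimal sketch of that routing pattern. The function names, the confidence signal, and the 0.8 cutoff are assumptions for illustration, not how any particular product actually works; the two `ask_*` functions stand in for real API calls.

```python
# Minimal sketch of the escalation pattern: a cheap model answers first,
# and a more capable, costlier model is called only when needed.
# ask_cheap_model / ask_expensive_model are hypothetical placeholders.

CONFIDENCE_CUTOFF = 0.8  # assumed threshold below which we escalate


def ask_cheap_model(prompt: str) -> tuple[str, float]:
    """Placeholder: return (answer, self-reported confidence) from a mid-tier model."""
    return "draft answer", 0.6


def ask_expensive_model(prompt: str) -> str:
    """Placeholder: return an answer from a slower, pricier reasoning model."""
    return "carefully reasoned answer"


def route(prompt: str) -> str:
    answer, confidence = ask_cheap_model(prompt)
    if confidence >= CONFIDENCE_CUTOFF:
        return answer                       # cheap path covers most requests
    return ask_expensive_model(prompt)      # escalate only the hard minority


print(route("Prove this novel number-theory conjecture."))
```

The appeal of this design is economic: if the cheaper model confidently handles the bulk of requests, the expensive model’s cost is spread over only the small fraction of genuinely hard problems.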
These three forces—cost reduction, improved interfaces, and increased intelligence—will define the next few years of AI development. Like it or not (and I, for one, am not entirely comfortable with how this world-changing transition is being managed), none of these forces are hitting a wall. Any one of them alone would be enough to permanently alter the world we live in. Together, they promise a future that will be shaped profoundly by AI, whether we’re ready for it or not.