
🌁#76: Rethinking Scaling Laws (when plateau is actually a fork)

We discuss Compound AI and test-time compute to challenge the scaling-plateau narrative and pave the way for smarter AI systems, along with our usual collection of interesting articles, relevant news, and research papers. Dive in!

This Week in Turing Post:

  • Wednesday, AI 101, Concepts: Get ready for the next set of ML Flashcards

  • Friday, Agentic Workflows series.

If you like Turing Post, consider becoming a paid subscriber or sharing this digest with a friend. It helps us keep Monday digests free →

The main topic – When plateau is actually a fork

Last week was a whirlpool of discussions around scaling laws. The recent performance of OpenAI's "Orion," showing only modest improvements over GPT-4, and rumors of Google’s Gemini falling short of expectations, have sparked conversations about an AI plateau. Marc Andreessen noted that multiple models are "hitting the same ceiling on capabilities," while Ilya Sutskever reflected that "the 2010s were the age of scaling; now we’re back in the age of wonder and discovery." That set off a wave of commentary from media and analysts about generative AI reaching a plateau.

Let’s be nerdy and look into the meaning of the word “plateau.” In science, a plateau phase refers to a steady state in a process. In psychology, a plateau can describe a stage where growth or learning appears stagnant, requiring new strategies or approaches to break through.

By those definitions, generative AI both is and isn’t on a plateau: growth from scaling alone looks stagnant, but we are hardly in a steady state. So what we need are new strategies and approaches to break through, and many of them already exist or are emerging.

Today, I want to highlight a few important approaches that might be relevant to a breakthrough.

What is Compound AI?

Compound AI systems offer a practical way to address scaling law limitations. Instead of relying solely on larger models, these systems improve efficiency and performance by optimizing resource use and tailoring components to specific tasks. The first instances of "Compound AI" principles – combining multiple models, systems, or tools to solve complex tasks – date back to early research in multi-agent systems and ensemble learning, long before the term "Compound AI" was popularized. These ideas evolved from:

  • 1990s: Ensemble learning (e.g., random forests) and multi-agent systems introduced collaborative and model-combining techniques.

  • 2010s: Pipeline systems like IBM Watson combined NLP and retrieval models for complex tasks.

  • 2020s: Tool-integrated models like Codex and AlphaCode refined these ideas with external tools and iterative approaches.

In February 2024, BAIR formally spotlighted Compound AI in “The Shift from Models to Compound AI Systems”, framing it as a system-level paradigm for efficiency and scalability. I was reminded of it by today’s news about F1 and F1-mini, compound AI models that excel at complex reasoning. Early testing indicates that F1 matches or surpasses many closed frontier models in areas such as coding, mathematics, and logic puzzles. Promising, indeed.
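
To make the idea more concrete, here is a minimal, purely illustrative Python sketch of a compound system: a cheap router model classifies each query, dispatches it to a specialized component (retrieval, a code tool, or a larger reasoner), and a verifier pass checks the draft. The function names (`call_llm`, `search_docs`, `run_python`) and model names are hypothetical placeholders, not any particular vendor's API.

```python
# Illustrative sketch of a compound AI system: instead of relying on one
# monolithic model, a cheap router picks the right specialized component
# for each query and a verifier double-checks the result.
# `call_llm`, `search_docs`, and `run_python` are hypothetical stand-ins
# for whatever model endpoints and tools your own stack provides.

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for an LLM call (router, reasoner, verifier, ...)."""
    raise NotImplementedError("wire this to your model provider")

def search_docs(query: str) -> str:
    """Placeholder for a retrieval component (vector store, search API, ...)."""
    raise NotImplementedError

def run_python(code: str) -> str:
    """Placeholder for a sandboxed code-execution tool."""
    raise NotImplementedError

def compound_answer(question: str) -> str:
    # 1. Route: a small, cheap model classifies the query instead of sending
    #    everything straight to the largest model available.
    route = call_llm(
        "small-router",
        f"Classify this question as LOOKUP, MATH, or REASONING:\n{question}",
    ).strip().upper()

    # 2. Dispatch to the component that is actually good at the sub-task.
    if route == "LOOKUP":
        context = search_docs(question)
        draft = call_llm(
            "small-model",
            f"Answer using only this context:\n{context}\n\nQ: {question}",
        )
    elif route == "MATH":
        code = call_llm("code-model", f"Write Python that computes: {question}")
        draft = run_python(code)
    else:
        draft = call_llm("large-reasoner", question)

    # 3. Verify: spend a little extra inference to check the draft rather than
    #    a lot of extra parameters to avoid mistakes in the first place.
    verdict = call_llm(
        "small-verifier",
        f"Q: {question}\nA: {draft}\nIs this answer correct and supported? YES or NO",
    )
    return draft if verdict.strip().upper().startswith("YES") else call_llm("large-reasoner", question)
```

The point is not the specific routing rules but the system-level framing: each piece can be swapped, audited, or scaled independently, which is exactly what monolithic scaling doesn't give you.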

Next, What Are We Scaling?

One of the goals of scaling laws is to identify where additional resources yield the greatest improvements. Remember how everybody talked about test-time compute when OpenAI’s o1 first launched? They demonstrated that allowing models to "think longer" during inference significantly improved their reasoning performance across complex tasks, such as achieving human-expert accuracy on PhD-level science questions and competitive programming challenges. That’s because test-time compute provides an efficient way to boost performance without significantly increasing model size or data volume, addressing cost-performance trade-offs and pushing the boundaries of what existing models can achieve. They covered it in detail in “Learning to Reason with LLMs.”

Two more important papers about test-time compute are worth checking out if you want to dive deeper:

  1. “Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters” by Google DeepMind and UC Berkeley.

  2. “Training Verifiers to Solve Math Word Problems” by OpenAI from 2021, where they tackled the challenge of multi-step mathematical reasoning by introducing verifiers, significantly reducing dependence on brute-force parameter scaling and emphasizing efficient test-time compute strategies.

Instead of concentrating all resources on the training phase, it’s time to optimize and scale inference.
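
To see what "scaling inference" can mean in practice, here is a minimal sketch of best-of-N sampling with a verifier, one of the simplest test-time-compute strategies: sample several candidate solutions from the same model and keep the one a verifier scores highest. `generate` and `verifier_score` are hypothetical placeholders for your own model and verifier calls, not anyone's published implementation.

```python
# Best-of-N sampling, a simple test-time-compute strategy: instead of a bigger
# model, sample N candidate solutions and keep the one a verifier likes best.
# `generate` and `verifier_score` are hypothetical placeholders for your own
# model and verifier calls.

def generate(problem: str, temperature: float = 0.8) -> str:
    """Placeholder: sample one candidate solution from the base model."""
    raise NotImplementedError("call your model here")

def verifier_score(problem: str, candidate: str) -> float:
    """Placeholder: estimate how likely the candidate is correct (0.0 to 1.0)."""
    raise NotImplementedError("call your trained verifier here")

def best_of_n(problem: str, n: int = 16) -> str:
    # More samples means more inference compute and, typically, better answers,
    # with model size and training data held fixed.
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda c: verifier_score(problem, c))
```

The knob here is `n`: you buy accuracy with inference compute at answer time, rather than with parameters and data at training time.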

As for the Steady State

We need that as well. Not exponentially growing models but actual utility. Systems need not only reasoning but also the ability to act on their reasoning, leveraging external tools or workflows.

So here’s the fork in the road:

On one side, new approaches to scaling – like test-time compute – show us where additional resources can unlock meaningful gains. On the other side, the age of scaling is giving way to an age of integration, where reasoning meets action through systems that leverage external tools and workflows.

Far from a plateau, this is a transition. We’re moving into uncharted territory, where breakthroughs will come not from growing models indefinitely but from building systems that are smarter, more efficient, and deeply integrated. Sutskever’s right: we’ve stepped out of the shadow of pure scaling and back into the age of wonder and discovery.

(Speaking of integration, here is a very fresh paper, “The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use”. Researchers from the National University of Singapore put Claude 3.5 to the test as a GUI automation agent, tackling 20 real-world desktop tasks – from web navigation to gaming. It nailed planning, GUI actions, and dynamic adaptation, handling tasks like adding ANC headphones under $100 to an Amazon cart.)

Twitter library

Weekly recommendation from an AI practitioner 👍🏼

LLMs work better with markdown, and so far https://github.com/JohannesKaufmann/html-to-markdown does the best job of converting an entire HTML page to markdown, rather than just the top few paragraphs, which is the norm for other tools.

Not a subscriber yet? Subscribe to receive our digests and articles:

Top Research


  • That’s a super fun paper: Game-theoretic LLM: Agent Workflow for Negotiation Games

    Researchers investigated whether AI language models could negotiate and play strategic games like humans. They found that while these AIs can be incredibly sophisticated negotiators, they sometimes act irrationally - not because they're flawed, but because they're too trusting! When two AIs negotiate, they tend to prioritize cooperation over self-interest, unlike humans who are typically more strategic →read the paper

  • Stronger Models Are Not Stronger Teachers for Instruction Tuning

    Researchers from the University of Washington and Allen Institute for AI examined if larger models improve smaller models during instruction tuning. They introduced the "Larger Models’ Paradox," finding larger models are not always better teachers than medium-sized ones →read the paper

  • Generative Agent Simulations of 1,000 People

    The study built AI agents to mimic the behaviors of 1,052 people based on interviews and surveys, hitting 85% accuracy in replicating responses. These agents can predict personality traits and social experiment outcomes, reducing bias compared to simpler models. With applications in policymaking and research, this project offers a safe way for scientists to explore human-like simulations while keeping participant data secure →read the paper

  • Toward Modular Models: Collaborative AI Development Enables Model Accountability and Continuous Learning

    Researchers from Microsoft propose modular AI models to address monolithic architecture limitations, enabling flexibility, transparency, and efficiency. They emphasize "MoErging," a taxonomy for routing tasks using expert models categorized by design (classifier-based, embedding-based, task-specific, or nonrouter). Benefits include privacy-compliant contributions, improved extensibility, accountability, and reduced compute costs →read the paper
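
As a rough illustration of what embedding-based routing can look like (a sketch under assumptions, not the paper's actual method), the snippet below embeds an incoming task, compares it against a short specialty description for each expert module, and dispatches to the closest match. The `embed` function and the expert names are hypothetical.

```python
# Rough sketch of embedding-based routing across expert modules (illustrative
# only, not the paper's reference implementation). Each hypothetical expert is
# described by a short specialty string; a task is sent to the closest match.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: any sentence encoder that returns a fixed-size vector."""
    raise NotImplementedError

EXPERT_SPECIALTIES = {
    "sql-expert": "translate natural-language questions into SQL",
    "summarizer": "summarize long documents",
    "math-solver": "solve math word problems step by step",
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def route(task: str) -> str:
    """Send the task to the expert whose specialty embedding is closest."""
    task_vec = embed(task)
    return max(
        EXPERT_SPECIALTIES,
        key=lambda name: cosine(task_vec, embed(EXPERT_SPECIALTIES[name])),
    )
```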

You can find the rest of the curated research at the end of the newsletter.

We are reading

News from The Usual Suspects ©

  • Musk’s Rocketing Valuations

    • Elon Musk’s SpaceX preps a $250B valuation share sale, while xAI raises $5B, doubling its worth to $45B. With investors flocking and AI ambitions soaring, Musk also finds favor in Washington, joining Trump’s “efficiency team.” A tale of satellites, supercomputers, and political intrigue, Musk’s empire seems to orbit beyond Earth—and Silicon Valley.

  • CoreWeave’s ambitions are stacking up faster than its GPUs

    • CoreWeave, the AI chip infrastructure powerhouse, snagged an additional $650M in a secondary offering. Heavyweights like Cisco and Pure Storage are onboard, signaling confidence in CoreWeave’s role as a backbone for AI’s rapid expansion. We covered them in detail here, a fascinating read.

    • They also introduced the NVIDIA GB200 NVL72 with Quantum InfiniBand, thanks to collaborations with Dell and Switch, advancing AI infrastructure.

More interesting research papers from last week

Advanced Language Models

Model Optimization & Alignment

Multimodal & Vision-Language Models

Hardware & Efficiency

Counterfactuals & Reasoning

Network Automation & Specialized Models

Narrative & Media Processing

Leave a review!


Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve. You will get a 1-month subscription!
