🌁#76: Rethinking Scaling Laws (when plateau is actually a fork)
We discuss Compound AI and test-time compute to challenge the idea of a scaling plateau and pave the way for smarter AI systems, along with our usual collection of interesting articles, relevant news, and research papers. Dive in!
This Week in Turing Post:
Wednesday, AI 101, Concepts: Get ready for the next set of ML Flashcards
Friday, Agentic Workflows series.
If you like Turing Post, consider becoming a paid subscriber or sharing this digest with a friend. It helps us keep Monday digests free →
The main topic – When plateau is actually a fork
Last week was a whirlpool of discussions around scaling laws. The recent performance of OpenAI's "Orion," showing only modest improvements over GPT-4, and rumors of Google’s Gemini falling short of expectations have sparked conversations about an AI plateau. Marc Andreessen noted that multiple models are "hitting the same ceiling on capabilities," while Ilya Sutskever reflected that "the 2010s were the age of scaling; now we’re back in the age of wonder and discovery." These remarks prompted many media outlets and analysts to declare that generative AI has reached a plateau.
Let’s be nerdy and look into the meaning of the word “plateau.” In science, a plateau phase refers to a steady state in a process. In psychology, a plateau can describe a stage where growth or learning appears stagnant, requiring new strategies or approaches to break through.
With generative AI, we are on a plateau in the psychological sense but not in the scientific one: progress may look stagnant, yet we are nowhere near a steady state. So what we need are new strategies and approaches to break through. And many of them already exist or are emerging.
Today, I want to highlight a few important approaches that might be relevant to a breakthrough.
What is Compound AI?
Compound AI systems offer a practical way to address scaling law limitations. Instead of relying solely on larger models, these systems improve efficiency and performance by optimizing resource use and tailoring components to specific tasks. The first instances of "Compound AI" principles – combining multiple models, systems, or tools to solve complex tasks – date back to early research in multi-agent systems and ensemble learning, long before the term "Compound AI" was popularized. These ideas evolved from:
1990s: Ensemble learning (e.g., random forests) and multi-agent systems introduced collaborative and model-combining techniques.
2010s: Pipeline systems like IBM Watson combined NLP and retrieval models for complex tasks.
2020s: Tool-integrated models like Codex and AlphaCode refined these ideas with external tools and iterative approaches.
In February 2024, BAIR formally spotlighted Compound AI in their much-cited blog post “The Shift from Models to Compound AI Systems,” framing it as a system-level paradigm for efficiency and scalability. I was reminded of it by today’s news about F1 and F1-mini, compound AI models excelling in complex reasoning. Early testing indicates that F1 matches or surpasses many closed frontier models in areas such as coding, mathematics, and logic puzzles. Promising, indeed.
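To make the compound idea concrete, here is a minimal sketch in Python, assuming a toy keyword router, two specialized components, and a verifier. Every function below (the router, the model calls, the verifier) is a hypothetical stand-in, not the actual F1 or BAIR implementation:

```python
# A minimal sketch of a compound AI system: a router dispatches each query
# to a specialized component, and a verifier checks the output before it is
# returned. All model calls are hypothetical stand-ins, not a real API.

def route(query: str) -> str:
    """Naive keyword router; production systems often use a small classifier."""
    if any(kw in query.lower() for kw in ("def ", "bug", "compile")):
        return "code"
    if any(ch.isdigit() for ch in query):
        return "math"
    return "general"

def call_code_model(query: str) -> str:   # stand-in for a code-specialized LLM
    return f"[code model] answer to: {query}"

def solve_math(query: str) -> str:        # stand-in for an LLM plus a calculator tool
    return f"[math tool] answer to: {query}"

def call_general_llm(query: str) -> str:  # stand-in for a general-purpose LLM
    return f"[general LLM] answer to: {query}"

def verify(answer: str) -> bool:
    """Placeholder verifier; real systems run tests, checks, or a judge model."""
    return bool(answer.strip())

def compound_answer(query: str) -> str:
    handler = {"code": call_code_model, "math": solve_math}.get(route(query), call_general_llm)
    answer = handler(query)
    # Fall back to the general model if verification fails.
    return answer if verify(answer) else call_general_llm(query)

print(compound_answer("What is 17 * 24?"))  # routed to the math component
```

The point of the design is separation of concerns: routing, solving, and verifying can each be improved, swapped, or scaled independently, which is exactly the system-level flexibility a single monolithic model lacks.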
Next, What Are We Scaling?
One of the goals of scaling laws is to identify where additional resources yield the greatest improvements. Remember how everybody talked about test-time compute when OpenAI’s o1 launched? OpenAI demonstrated that allowing models to "think longer" during inference significantly improved reasoning performance on complex tasks, such as achieving human-expert accuracy on PhD-level science questions and competitive programming challenges. That’s because test-time compute provides an efficient way to boost performance without significantly increasing model size or data volume, strategically addressing cost-performance trade-offs and pushing the boundaries of what existing models can achieve. OpenAI covered it in detail in “Learning to Reason with LLMs.”
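OpenAI has not published o1’s internal mechanism, but one well-documented test-time compute technique is easy to sketch: self-consistency, where you sample several reasoning chains at nonzero temperature and take a majority vote over the final answers. `sample_chain` below is a hypothetical stand-in for an LLM call:

```python
# A minimal sketch of self-consistency, one test-time compute technique:
# sample several reasoning chains and return the majority answer.
import random
from collections import Counter

def sample_chain(question: str) -> str:
    """Stand-in for sampling one reasoning chain from an LLM at temperature > 0."""
    return random.choice(["42", "42", "41"])  # simulated final answers

def self_consistency(question: str, n_samples: int = 16) -> str:
    answers = [sample_chain(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]  # most frequent answer wins

print(self_consistency("What is 6 * 7?"))
```

Spending more samples at inference buys accuracy without touching the weights, which is the whole appeal of scaling this axis.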
Two more important papers about test-time compute are worth checking out if you want to dive deeper:
“Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters” by Google DeepMind and UC Berkeley.
“Training Verifiers to Solve Math Word Problems” by OpenAI from 2021, where they tackled the challenge of multi-step mathematical reasoning by introducing verifiers, significantly reducing dependence on brute-force parameter scaling and emphasizing efficient test-time compute strategies.
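In the spirit of that verifiers paper, here is a minimal best-of-N sketch: sample N candidate solutions from a generator and return the one a verifier scores highest. Both `generate` and `verifier_score` are hypothetical stand-ins for model calls, not OpenAI’s actual setup:

```python
# A minimal sketch of verifier-based best-of-N: more samples at inference
# plus a learned scorer, instead of a bigger model.
import random

def generate(problem: str) -> str:
    """Stand-in for sampling one candidate solution from a generator LLM."""
    return f"candidate solution #{random.randint(0, 9)}"

def verifier_score(problem: str, solution: str) -> float:
    """Stand-in for a verifier model estimating P(solution is correct)."""
    return random.random()

def best_of_n(problem: str, n: int = 100) -> str:
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda s: verifier_score(problem, s))

print(best_of_n("A train leaves the station at 3 pm..."))
```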
Instead of concentrating all resources on the training phase, it’s time to optimize and scale inference.
As for the Steady State
We need that as well. Not exponentially growing models but actual utility. Systems need not only reasoning but also the ability to act on their reasoning, leveraging external tools or workflows.
So here’s the fork in the road:
On one side, new approaches to scaling – like test-time compute – show us where additional resources can unlock meaningful gains. On the other side, the age of scaling is giving way to an age of integration, where reasoning meets action through systems that leverage external tools and workflows.
Far from a plateau, this is a transition. We’re moving into uncharted territory, where breakthroughs will come not from growing models indefinitely but from building systems that are smarter, more efficient, and deeply integrated. Sutskever’s right: we’ve stepped out of the shadow of pure scaling and back into the age of wonder and discovery.
(Speaking of integration, here is a very fresh paper, “The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use.” Researchers from the National University of Singapore put Claude 3.5 to the test as a GUI automation agent, tackling 20 real-world desktop tasks, from web navigation to gaming. It nailed planning, GUI actions, and dynamic adaptation, handling tasks like adding ANC headphones under $100 to an Amazon cart.)
Twitter library
Weekly recommendation from an AI practitioner 👍🏼
LLMs work better with Markdown, and so far https://github.com/JohannesKaufmann/html-to-markdown does the best job of converting an entire HTML page to Markdown, rather than just the top few paragraphs, which is the norm for other tools.
Not a subscriber yet? Subscribe to receive our digests and articles:
Top Research
Autoregressive Models in Vision: A Survey →read the paper
Image Credit: the original paper
That’s a super fun paper: Game-theoretic LLM: Agent Workflow for Negotiation Games
Researchers investigated whether AI language models could negotiate and play strategic games like humans. They found that while these AIs can be incredibly sophisticated negotiators, they sometimes act irrationally - not because they're flawed, but because they're too trusting! When two AIs negotiate, they tend to prioritize cooperation over self-interest, unlike humans who are typically more strategic →read the paper
Stronger Models Are Not Stronger Teachers for Instruction Tuning
Researchers from the University of Washington and the Allen Institute for AI examined whether larger models make better teachers for smaller models during instruction tuning. They introduced the "Larger Models’ Paradox," finding that larger models are not always better teachers than medium-sized ones →read the paper
Generative Agent Simulations of 1,000 People
The study built AI agents to mimic the behaviors of 1,052 people based on interviews and surveys, hitting 85% accuracy in replicating responses. These agents can predict personality traits and social experiment outcomes, reducing bias compared to simpler models. With applications in policymaking and research, this project offers a safe way for scientists to explore human-like simulations while keeping participant data secure →read the paper
Toward Modular Models: Collaborative AI Development Enables Model Accountability and Continuous Learning
Researchers from Microsoft propose modular AI models to address monolithic architecture limitations, enabling flexibility, transparency, and efficiency. They emphasize "MoErging," a taxonomy for routing tasks using expert models categorized by design (classifier-based, embedding-based, task-specific, or nonrouter). Benefits include privacy-compliant contributions, improved extensibility, accountability, and reduced compute costs →read the paper
You can find the rest of the curated research at the end of the newsletter.
We are reading
MUST READ: OpenAI Email Archives (from Musk v. Altman) from LessWrong is like a treasure chest from the past (2015-2019) – so many telling things about our present to analyze.
In “A Chance to Build,” Ben Thompson discusses whether the U.S. faces a rare opportunity to rebuild its manufacturing base through automation and modularity, or whether the entrenched dominance of the U.S.-Asia tech partnership will remain unshakable.
Jeffrey Ding’s latest ChinAI Newsletter highlights IT Juzi data showing that foreign tech companies like Microsoft and Google play a significant role in cultivating Chinese AI entrepreneurs, ranking alongside major domestic firms like Baidu.
39 Lessons on Building ML Systems, Scaling, Execution, and More by Eugene Yan
News from The Usual Suspects ©
Musk’s Rocketing Valuations
Elon Musk’s SpaceX preps a $250B valuation share sale, while xAI raises $5B, doubling its worth to $45B. With investors flocking and AI ambitions soaring, Musk also finds favor in Washington, joining Trump’s “efficiency team.” A tale of satellites, supercomputers, and political intrigue, Musk’s empire seems to orbit beyond Earth—and Silicon Valley.
CoreWeave’s ambitions are stacking up faster than its GPUs
CoreWeave, the AI chip infrastructure powerhouse, snagged an additional $650M in a secondary offering. Heavyweights like Cisco and Pure Storage are onboard, signaling confidence in CoreWeave’s role as a backbone for AI’s rapid expansion. We covered them in detail here, a fascinating read.
They also introduced the NVIDIA GB200 NVL72 with Quantum InfiniBand, thanks to collaborations with Dell and Switch, advancing AI infrastructure.
More interesting research papers from last week
Advanced Language Models
New AI Model Gemini Experimental 1114 Debuts On Google AI Studio
Demonstrates strong reasoning skills with a 32k context window, outperforming competitors on benchmarks, despite slower problem-solving speed.
CamemBERT 2.0: A Smarter French Language Model
Tackles concept drift in French NLP with improved tokenization, excelling in QA and domain-specific tasks like biomedical NER.
Qwen2.5-Coder Series: Powerful, Diverse, Practical
Excels in coding and multi-language repair tasks, rivaling GPT-4o in 40+ programming languages with open innovation for developers.
Llava-o1: Let Vision Language Models Reason Step-By-Step
Enhances multimodal reasoning through structured, multi-stage processes, achieving superior benchmark performance.
Large Language Models Can Self-Improve In Long-Context Reasoning
Uses self-improvement via ranking model outputs, improving performance in long-context reasoning tasks without external datasets.
Model Optimization & Alignment
Direct Preference Optimization Using Sparse Feature-Level Constraints
Combines efficiency and stability to achieve better alignment in LLMs with reduced computational overhead.
Cut Your Losses In Large-Vocabulary Language Models
Reduces memory use for large-scale training, enabling up to 10x larger batch sizes without sacrificing performance.
Sparsing Law: Towards Large Language Models With Greater Activation Sparsity
Explores neuron sparsity in LLMs to enhance efficiency while preserving interpretability.
Multimodal & Vision-Language Models
Edify Image: High-Quality Image Generation With Pixel Space Laplacian Diffusion Models
Generates high-resolution photorealistic images with advanced diffusion techniques and controllable output mechanisms.
Llava-o1: Let Vision Language Models Reason Step-By-Step
Improves multimodal reasoning by incorporating structured reasoning stages and a custom fine-tuning dataset.
Language Models Are Hidden Reasoners
Unlocks latent reasoning capabilities in pre-trained LLMs using a self-rewarding framework.
Hardware & Efficiency
Balancing Pipeline Parallelism With Vocabulary Parallelism
Improves training efficiency by balancing memory and computation across large vocabularies in pipeline parallelism.
Hardware And Software Platform Inference
Infers GPU architecture and software configurations from ML model behaviors, promoting transparency in cloud services.
Counterfactuals & Reasoning
Counterfactual Generation From Language Models
Generates meaningful counterfactuals using reformulated LLMs, enabling nuanced reasoning and intervention analysis.
Can Sparse Autoencoders Be Used To Decompose And Interpret Steering Vectors?
Investigates challenges in interpreting steering vectors with sparse autoencoders, proposing recalibrated decomposition methods.
Network Automation & Specialized Models
Hermes: A Large Language Model Framework On The Journey To Autonomous Networks
Automates cellular network operations using modular LLM chains, achieving high accuracy in diverse tasks like energy-saving policy evaluation.
Watermark Anything With Localized Messages
Introduces a robust watermarking framework for localized image encoding and message extraction under various transformations.
Narrative & Media Processing
Extracting Narrative Arcs From Media Collections (with Jürgen Schmidhuber as co-author)
Combines contrastive learning and evolutionary algorithms to reorder media into coherent story structures, enhancing storytelling automation.
Leave a review!
Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve. You will get a 1-month subscription!