Topic 26: What is test-time compute and how to scale it?
We dive into test-time compute and discuss five-plus open-source methods for scaling it effectively to deepen models' step-by-step reasoning.
For a long time, many AI and ML researchers and users preferred models that generate outputs immediately. But the recent shift to slow thinking, introduced with OpenAI’s o1 model, turned everything upside down. Since this breakthrough, it has become clear how remarkable a model’s reasoning capabilities can be when it is not “in a hurry” and has time to “think” through multiple steps – a process known as Chain-of-Thought reasoning. These aspects tie into a fascinating topic: test-time compute. Today, we’ll take a broader look at test-time compute, discussing five methods to scale it and how it can enhance AI models’ reasoning. This article is a true gem!
In today’s episode, we will cover:
The core idea behind OpenAI’s o1 model
What exactly is Test-Time Compute (TTC)?
DeepSeek-R1’s way of scaling test-time compute
Test-time compute scaling meets multimodality
What if we use long-form text examples for training MLLMs?
Collective learning with Collective Monte Carlo Tree Search
Generating images with CoT
Search-o1: Enhancing retrieval and agentic capabilities
A brief overview of 3 more research papers
Not without limitations
Conclusion: What does the future hold for test-time compute?
Resources to dive deeper
The core idea behind OpenAI’s o1 model
While many developers were chasing faster input processing and near-instant outputs, OpenAI bet on deeper “thinking” in its o1 model, which meant increasing test-time compute. The concept of test-time compute aligns with what's now known as "System-2 thinking," which involves slow, deliberate, and logical reasoning, as opposed to "System-1 thinking," which is fast and intuitive.
What exactly is Test-Time Compute (TTC)?
TTC refers to the amount of computational power used by an AI model when it is generating a response or performing a task after it has been trained. In simple terms, it's the processing power and time required when the model is actually being used, rather than when it is being trained.
Key aspects of Test-Time Compute (TTC):
Inference process: When you input a question or a prompt into a model, it processes the input and generates a response. The computational cost of this process is called test-time compute.
Scaling at test time: Some advanced AI models, like OpenAI's o1 series, dynamically increase their reasoning time during inference. This means they can spend more time thinking for complex questions, improving accuracy at the cost of higher compute usage.
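To make the “scaling at test time” idea concrete, here is a minimal Python sketch of one generic way to spend more compute at inference: sampling several independent reasoning chains and majority-voting over their final answers (self-consistency). The `llm_generate` helper is a hypothetical stand-in for whatever model or API you use; this illustrates the principle, not o1's actual mechanism.

```python
from collections import Counter

def llm_generate(prompt: str, temperature: float = 0.8) -> str:
    """Hypothetical stand-in for a call to any LLM inference API.
    Assumed to return one sampled completion ending with 'Answer: <value>'."""
    raise NotImplementedError("plug in your model or API client here")

def extract_answer(completion: str) -> str:
    # Assumes the model was prompted to end with a line like "Answer: 42".
    return completion.rsplit("Answer:", 1)[-1].strip()

def answer_with_more_compute(question: str, n_samples: int = 16) -> str:
    """Scale test-time compute by sampling several reasoning chains and
    returning the most common final answer (self-consistency voting)."""
    prompt = f"{question}\nThink step by step, then finish with 'Answer: <value>'."
    answers = [extract_answer(llm_generate(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

The more chains you sample, the more compute you spend per question, and, up to a point, the more reliable the voted answer tends to be.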
By allocating more computational resources during inference, o1 models can perform deeper reasoning, leading to more accurate and thoughtful responses. o1 reasons step by step, using the Chain-of-Thought method, before arriving at a final answer. Thanks to this, the o1 model excels at tasks that require complex problem-solving.
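Parallel sampling is not the only option. Another generic way to increase test-time compute, closer in spirit to long chains of thought, is sequential: let the model draft a step-by-step solution, then revise its own reasoning for a few rounds before committing to an answer. The sketch below reuses the hypothetical `llm_generate` and `extract_answer` helpers from the previous snippet and is, again, only an illustration of the idea.

```python
def answer_with_revision(question: str, n_rounds: int = 3) -> str:
    """Scale test-time compute sequentially: draft a chain of thought,
    then ask the model to check and revise it for several rounds."""
    draft = llm_generate(
        f"{question}\nThink step by step, then finish with 'Answer: <value>'."
    )
    for _ in range(n_rounds):
        # Each extra round spends more inference compute on the same question.
        draft = llm_generate(
            f"Question: {question}\n"
            f"Current reasoning:\n{draft}\n"
            "Check each step for mistakes, correct them, and give a revised "
            "step-by-step solution that ends with 'Answer: <value>'."
        )
    return extract_answer(draft)
```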
Since o1 is powerful but closed, it has pushed other developers to build new models based on o1's principles, trying to scale TTC, uncover o1's secrets, and bring these techniques to the community. Let's dive into five research papers that explore, use, and expand o1's core idea to make it accessible to developers. →
DeepSeek-R1’s way of scaling test-time compute