Topic 34: Things You Need to Know About Inference
We dive into the core aspects of AI inference, explore how it works, and look at what can make it faster.
Powerful reasoning models that take significant time to generate a final response remain a major focus in the AI community. They use test-time scaling to achieve state-of-the-art performance, often at the cost of slower response speeds. Our article on test-time compute gained a lot of interest from our readers, which is why we decided to dive deeper into a closely related topic – inference.
All chatbots, image generators, voice assistants, and recommendation systems rely on inference; it’s what users actually interact with. That’s why understanding the core aspects of inference is important for everyone, not just developers. Once a model is trained, inference is the stage where it demonstrates the knowledge it has learned. Making inference faster and enabling models to process many inputs simultaneously are among the most important challenges in optimization. Today, we’ll explore the fundamentals of inference – everything you need to know to feel confident with this part of a model’s lifecycle, including key concepts, workflows, types, optimization techniques, and, of course, some of the latest developments in the field.
This article isn’t meant to be exhaustive but aims to clarify why inference matters so much in AI today.
What’s in today’s episode?
What is AI inference?
Key concepts: Latency, throughput and a little bit more
How Inference Works in Transformers
Inference in Diffusion Models
Types of Inference
Optimization Techniques
Current Trends in Hardware
Conclusion
Sources and further reading
What is AI inference?
During training, a model learns essential patterns from data. Once it's properly trained, it's time for the model to demonstrate its acquired “knowledge” in real-world tasks. The process that begins with prompting and ends when the model generates an answer to a query is called inference. In other words, inference is the process of using a trained model to make predictions or generate outputs.
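As a minimal illustration, here is what the inference step looks like in PyTorch. This is only a sketch: the tiny linear layer stands in for whatever network has already been trained, and the key points are switching the model to evaluation mode and running a forward pass without gradients.

```python
# Minimal sketch of an inference step in PyTorch.
# The Linear layer is only a stand-in for a trained model.
import torch

model = torch.nn.Linear(16, 4)   # pretend this is a model we already trained
model.eval()                     # disable training-only behavior such as dropout

x = torch.randn(1, 16)           # one preprocessed input sample

with torch.no_grad():            # gradients are not needed at inference time
    logits = model(x)            # forward pass: this is the inference step
    prediction = logits.argmax(dim=-1)

print(prediction.item())         # the model's "answer" for this input
```

In production, this same pattern sits behind batching, caching, and serving infrastructure, but the core step is always a forward pass over a frozen, already-trained model.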
Achieving efficient inference is crucial: as IBM once estimated in a survey, up to 90% of an AI model’s lifetime compute consumption happens during inference rather than training.
Inference has even given rise to talk of new scaling laws. In one of his recent presentations, NVIDIA CEO Jensen Huang distinguished three scaling laws shaping modern AI:
Pre-training scaling (training on large data sets)
Post-training scaling (fine-tuning the model)
Test-time scaling (inference-time compute)
Jensen Huang emphasized that inference scaling is becoming increasingly critical. Models now perform complex, multi-step reasoning at inference – an approach called "test-time compute" or "inference scaling." By significantly increasing the computational resources allocated at this stage, AI systems achieve better performance, accuracy, and reliability, particularly in challenging applications requiring careful thought or nuanced reasoning.
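One simple, widely used form of inference-time scaling is best-of-N sampling: spend more compute at inference by drawing several candidate answers and keeping the one a scorer prefers. The sketch below is purely illustrative and not tied to any specific system mentioned above; `generate` and `score` are placeholders, not a real model API.

```python
# Toy illustration of test-time scaling via best-of-N sampling.
# `generate` and `score` are placeholders for an LLM and a verifier/reward model.
import random

def generate(prompt: str) -> str:
    # Placeholder: a real system would sample a candidate answer from an LLM.
    return f"candidate-{random.randint(0, 9)} for: {prompt}"

def score(candidate: str) -> float:
    # Placeholder: a real system might score candidates with a reward model or verifier.
    return random.random()

def best_of_n(prompt: str, n: int) -> str:
    candidates = [generate(prompt) for _ in range(n)]  # n times more inference compute
    return max(candidates, key=score)                  # keep the best-scoring answer

print(best_of_n("Explain why the sky is blue.", n=8))
```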
To be fluent in AI inference, we need to understand its main aspects: the concepts used to measure the effectiveness of inference from different perspectives, and how the inference process works within the model. Let's start with the concepts.
Key concepts: Latency, throughput and a little bit more
Key concepts of inference include:
Inference time is the raw measure of how long the model takes to compute the output.
Total generation time = full time from prompt to complete response. It is closely related to one of the most important aspects of inference →
Latency – time to get one complete result. It starts when the request (prompt) is sent and ends when the first output token or the full response is received. Latency is usually measured in milliseconds (ms). In general, to compute latency we need:
Time to First Token (TTFT): It measures how long it takes for the model to start responding. It’s the time from when you send a prompt to the model until the first generated token is returned.
Time Per Output Token (TPOT): It is the average time it takes to generate each output token after the first one is generated. It shows how fast the model generates answers once it starts.
So, approximately: Latency = TTFT + (TPOT Ă— number of output tokens). See the measurement sketch after this list.
This breakdown applies in streaming settings, where the model starts sending tokens as soon as it generates them, one by one or in small chunks.
In non-streaming settings, where the model generates the entire response first and only then sends it back to the user, TTFT equals total latency, and total generation time is usually the same as latency.
Throughput – the number of successful inferences (or generated tokens) the system can complete per second. It shows how much the system can handle overall. Throughput is often measured in requests per second (RPS) or tokens per second (TPS).
However, there is often a trade-off: optimizing for latency can hurt throughput and vice versa. The important task for developers is therefore to balance the two, minimizing latency while maximizing throughput.
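To make these definitions concrete, here is a small sketch of how TTFT, TPOT, latency, and throughput can be measured around a streaming generator. `fake_stream` is a stand-in for a real model that yields tokens one by one; the sleep calls only simulate prefill and per-token decode delays.

```python
# Sketch: measuring TTFT, TPOT, latency, and single-request throughput.
# `fake_stream` simulates a model streaming tokens one by one.
import time

def fake_stream(prompt: str, n_tokens: int = 20):
    time.sleep(0.20)              # simulated prefill delay before the first token
    for i in range(n_tokens):
        time.sleep(0.02)          # simulated decode time per token
        yield f"tok{i}"

start = time.perf_counter()
ttft = None
tokens = []

for tok in fake_stream("Explain inference in one sentence."):
    if ttft is None:
        ttft = time.perf_counter() - start         # Time to First Token
    tokens.append(tok)

latency = time.perf_counter() - start              # total generation time
tpot = (latency - ttft) / max(len(tokens) - 1, 1)  # avg Time Per Output Token after the first
throughput = len(tokens) / latency                 # tokens per second for this single request

print(f"TTFT={ttft:.3f}s  TPOT={tpot:.3f}s  "
      f"latency={latency:.3f}s  throughput={throughput:.1f} tok/s")
```

System-level throughput (requests or tokens per second across many concurrent users) is measured the same way, just aggregated over all requests the server completes in a given time window.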
Other important concepts of inference, which developers also pay attention to, are:
Upgrade if you want to be the first to receive the full articles with detailed explanations and curated resources directly in your inbox. Simplify your learning journey →