You might have heard this from us before, but sometimes when you can’t find a solution, it helps to look back. That’s exactly what a team from Sea AI Lab and the National University of Singapore did when facing a long-standing reinforcement learning (RL) problem.

It’s a well-known issue researchers have been trying to overcome: models often behave differently during training than they do in real use. The gap shows up in the numbers that represent the policy’s decisions – they don’t match between training and inference, which makes RL fine-tuning unstable.

Instead of designing new algorithms or adding complex adjustments, the team revisited something fundamental: numerical precision. They discovered that a simple shift back from the newer BF16 format to the older FP16 could restore stability and consistency.

Today we’ll explore what BF16 and FP16 actually are, how precision affects RL fine-tuning, and why this seemingly small change has caught the attention of developers everywhere – including Andrej Karpathy, who used it for nanochat.

In today’s episode, we will cover:

  • The origins of RL instability: why it all comes down to precision

  • What is FP16 precision and why does it really help?

  • Results of the BF16 → FP16 switch

  • Advantages of using FP16 precision format

  • Not without limitations

  • Early cases of implementation

  • Conclusion

  • Sources and further reading

Why RL Fine-Tuning Is Unstable: The Precision Problem

In many RL setups used for fine-tuning large models, training and inference run on different computation paths:

  • one for training, to compute gradients and update model parameters

  • one for inference, when the model generates text

In theory, both engines should behave the same and produce the same mathematical results. But in practice, small numerical differences emerge because of rounding errors and hardware optimizations. This causes what’s called a training-inference mismatch – which is a major source of RL instability.

This mismatch leads to two main problems:

  1. Biased gradient: When the model trains, it tries to achieve higher rewards. The problem is that it learns from samples generated in inference mode, where numbers are handled slightly differently – enough to throw off the gradient that guides each step of learning. Hence – biased gradient.

  2. Deployment gap: After training, the model used for text generation is not exactly the same as the one optimized during training. The parameters that perform best in the training engine may not be optimal in the inference engine, so performance drops.

To address this mismatch between training and inference, researchers often turn to a technique called importance sampling. This technique reweights each sample’s contribution by the ratio between the training and inference probabilities to keep the gradient estimate unbiased. Some researchers tried to fix the mismatch through engineering rather than algorithmic changes – for example, using higher precision such as FP32 (32-bit floating point) for certain layers or manually aligning the training and inference code paths – but these adjustments still failed to prevent training collapse. Many approaches also ended up optimizing the model for the training engine rather than the inference one, which is not what we need.
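To make the reweighting idea concrete, here is a minimal sketch of per-token importance weights. It is an illustration under our own assumptions, not the paper's code: the function name `importance_weights` and the truncation threshold `clip_max` are hypothetical, and the truncation step is just one common way to keep the weights from exploding.

```python
import math

def importance_weights(train_logprobs, rollout_logprobs, clip_max=2.0):
    """Per-token ratio pi_train / pi_rollout, truncated at clip_max.

    clip_max is a hypothetical threshold chosen for illustration only.
    """
    weights = []
    for lp_train, lp_roll in zip(train_logprobs, rollout_logprobs):
        # Ratio of probabilities, computed from log-probabilities
        ratio = math.exp(lp_train - lp_roll)
        weights.append(min(ratio, clip_max))
    return weights

# Token 1: both engines agree, so the weight is 1.
# Token 2: the engines disagree strongly, so the weight is truncated.
weights = importance_weights([-1.0, -0.5], [-1.0, -3.0])
print(weights)
```

When the two engines agree exactly, every weight equals 1 and the correction does nothing. The larger the mismatch, the further the weights move from 1 and the noisier the gradient estimate becomes.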

The problem remains a problem: training and inference operate differently.

Researchers from Sea AI Lab and the National University of Singapore aimed to find the root cause of this mismatch, and it appeared to be floating-point precision. But why?

  • During RL fine-tuning, the model’s policy is updated through floating-point calculations on probability values.

  • During inference, the same policy is used, but it isn’t updated – it simply runs forward to generate outputs.

  • If the numerical precision (the level of detail in how numbers are represented) isn’t consistent, the policy can behave slightly differently across these two stages.

A floating-point number (or “float”) is a way computers store real numbers, for example, 3.14 or 0.001, using bits. Some bits represent the value’s size (the exponent), and others represent its precision (the fraction). The more bits used, the more precisely the number can be stored and calculated.

Most RL fine-tuning methods now use a format called BF16 (bfloat16, or brain floating point). It’s a 16-bit floating-point format that keeps the same wide range of values as 32-bit floats but spends fewer bits on precision. Its 16 bits are split as follows:

  • 1 bit for the sign (positive or negative),

  • 8 bits for the exponent (range of values),

  • 7 bits for the fraction, or mantissa (precision). Watch the mantissa bits!

Image Credit: ZipNN: Lossless Compression for AI Models

But again, the problem is that BF16’s short mantissa introduces rounding errors, and these small deviations are enough to make the training and inference policies disagree.
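To see what that rounding looks like, here is a small sketch that emulates BF16 in plain Python. It assumes round-to-nearest-even and relies on the fact that BF16 has the same sign and exponent layout as a 32-bit float, so it can be emulated by keeping only the top 16 bits. The helper `to_bf16` is our own illustrative name, not a library function, and it only handles ordinary finite values.

```python
import struct

def to_bf16(x):
    """Round x to the nearest BF16 value (round-half-to-even), as a Python float.

    Illustrative emulation: BF16 is the top 16 bits of a float32.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a rounding bias, then clear the 16 low bits of the float32 pattern
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits = (bits >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

p = 0.3                      # an example probability value
p_bf16 = to_bf16(p)
print(p_bf16)                # 0.30078125
print(abs(p_bf16 - p) / p)   # relative rounding error, about 0.26%
```

A quarter of a percent on one value looks harmless, but every probability the policy produces goes through this kind of rounding.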

To fix it, researchers from Sea AI Lab and the National University of Singapore offered a method that turned out to be surprisingly simple – just switch from BF16 to the earlier FP16 format during RL fine-tuning. Mindblowing 🙂

Now, let’s look closer at why such a small change could make such a big difference – and how it fixed what years of patches couldn’t.

FP16 vs BF16: Key Differences

Both FP16 and BF16 use 16 bits to represent numbers, but they distribute those bits differently – and that small design choice changes everything:

Image Credit: Defeating the Training-Inference Mismatch via FP16

  • FP16 dedicates 10 bits to the mantissa (the part that stores numerical detail) and 5 bits to the exponent (which defines the range). That gives FP16 high precision – it can tell small numbers apart accurately – but at the cost of a limited range. If values grow too large or too tiny, they can overflow or vanish to zero.

  • BF16 flips the priorities. It keeps 8 exponent bits – the same range as full FP32 – but only 7 mantissa bits for detail. The result is a format that’s tolerant to extreme values yet rougher in precision. It rarely crashes, but it blurs fine distinctions.
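The trade-off between the two formats follows directly from the bit budgets. The sketch below derives the largest finite value, the smallest normal value, and the spacing around 1.0 from the exponent and mantissa widths alone, assuming the usual IEEE-style layout where the top exponent code is reserved for infinity and NaN. `format_stats` is a name we made up for illustration.

```python
def format_stats(exp_bits, man_bits):
    """Range and resolution of a binary float format with the given bit widths."""
    bias = 2 ** (exp_bits - 1) - 1
    # The all-ones exponent code is reserved for inf/NaN
    max_exp = (2 ** exp_bits - 2) - bias
    return {
        "max": (2 - 2.0 ** -man_bits) * 2.0 ** max_exp,  # largest finite value
        "min_normal": 2.0 ** (1 - bias),                 # smallest normal value
        "epsilon": 2.0 ** -man_bits,                     # gap between 1.0 and the next value
    }

fp16 = format_stats(exp_bits=5, man_bits=10)
bf16 = format_stats(exp_bits=8, man_bits=7)

print(fp16["max"], fp16["epsilon"])        # 65504.0 0.0009765625
print(bf16["max"], bf16["epsilon"])        # about 3.39e38, 0.0078125
print(bf16["epsilon"] / fp16["epsilon"])   # 8.0
```

FP16 tops out at 65504 but resolves steps eight times finer than BF16, which reaches roughly the same maximum as FP32.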

During large-scale pre-training, this trade-off made sense. Models need to handle vast and unstable values, so BF16’s wide range was a lifesaver – and it removed the need for tricks like loss scaling.

Loss scaling is a simple workaround used with FP16:

  • Multiply the loss by a large constant (say 1 000 or 10 000) to make gradients big enough not to underflow.

  • Compute gradients at this boosted scale.

  • Then divide them back down before updating weights.

Modern frameworks such as PyTorch, DeepSpeed, and Megatron handle this automatically, so FP16 training stays stable.
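The three steps above can be sketched in a few lines. This is a toy illustration, not how any framework implements it: the gradient value and the scale of 1024 are hypothetical, and we multiply the gradient directly (gradients are linear in the loss, so scaling the loss has the same effect). The cast uses the half-precision codec from Python's standard `struct` module.

```python
import struct

def to_fp16(x):
    """Round x to the nearest FP16 value and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

tiny_grad = 1e-8   # hypothetical gradient, below FP16's smallest subnormal (~6e-8)
scale = 1024.0     # hypothetical loss scale; a power of two keeps the scaling exact

unscaled = to_fp16(tiny_grad)          # stored directly: underflows to zero
scaled = to_fp16(tiny_grad * scale)    # stored at the boosted scale: survives
recovered = scaled / scale             # divided back down in higher precision

print(unscaled)    # 0.0
print(recovered)   # close to 1e-8
```

Without scaling, the gradient simply disappears. With scaling, it comes back within a fraction of a percent of its true value.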

Okay, that makes sense for pre-training, but we are in RL fine-tuning now. And here the situation reverses. After pre-training, model values are already stable, so we are no longer afraid of overflow; precision loss is what matters. Each rollout depends on long chains of probability estimates, where even tiny rounding errors can compound over time. Here, every bit of detail matters.

With only 7 mantissa bits, BF16 gradually drifts: errors accumulate across autoregressive steps, and training and inference start to diverge.
FP16, with 10 mantissa bits, keeps those values aligned. That’s roughly eight times more numerical precision – $2^{10}$ vs $2^{7}$ – enough to preserve consistency between training and inference and prevent instability from spiraling.
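A toy experiment gives a feel for the gap. The sketch below rounds 2,000 synthetic per-token log-probabilities to each format and measures how far the rounded values drift from the originals. Everything here is an assumption for illustration: the log-probabilities are random numbers, not model outputs, the helpers `to_fp16` and `to_bf16` are our own emulations, and this is not the mismatch metric used in the paper.

```python
import random
import struct

def to_fp16(x):
    """Round x to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x):
    """Round x to the nearest BF16 value (emulated as the top 16 bits of float32)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round-half-to-even bias
    bits = (bits >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

def mean_abs_rounding_error(values, cast):
    """Average distance between each value and its rounded version."""
    return sum(abs(cast(v) - v) for v in values) / len(values)

rng = random.Random(0)  # fixed seed so the run is repeatable
token_logprobs = [rng.uniform(-5.0, -0.01) for _ in range(2000)]

bf16_err = mean_abs_rounding_error(token_logprobs, to_bf16)
fp16_err = mean_abs_rounding_error(token_logprobs, to_fp16)

# Drift of the whole sequence log-probability after rounding each token
exact_sum = sum(token_logprobs)
bf16_drift = abs(sum(to_bf16(v) for v in token_logprobs) - exact_sum)
fp16_drift = abs(sum(to_fp16(v) for v in token_logprobs) - exact_sum)

print(bf16_err / fp16_err)     # ratio of per-token errors
print(bf16_drift, fp16_drift)  # sequence-level drift in each format
```

In this range every BF16 value is also an FP16 value, so FP16 rounding can never land further from the original than BF16 rounding does. The per-token error ratio should come out near the factor of eight that the mantissa widths predict.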

Switching from BF16 to FP16 doesn’t require new algorithms or elaborate tuning. Aside from the small trick of loss scaling, it simply restores mathematical honesty to the process – the model behaves the same way when it learns as when it acts.

And that’s why this “step back” to FP16 feels like such a leap forward.

BF16 to FP16 Switch: Benchmark Results

  1. Offline performance check. On DeepSeek-R1-Distill-Qwen-1.5B, raw inference performance was similar across FP16, BF16, and FP32. The key difference was elsewhere: FP16 shrinks the training–inference mismatch dramatically – about 24× smaller than BF16 – so rollout and training probabilities align much more closely.

Image Credit: Defeating the Training-Inference Mismatch via FP16

  2. Sanity test. The paper’s sanity test uses a filtered set of 1,460 solvable but non-trivial MATH problems where the model can, in principle, reach 100% training accuracy; ≥95% counts as a pass. Results across RL algorithms:

    • Under BF16, runs were unstable. Vanilla GRPO collapsed early; token-level TIS lasted a bit longer but still failed; sequence-level MIS was more stable yet slow and plateaued around 95%.

Image Credit: Defeating the Training-Inference Mismatch via FP16

    • Switching both rollout and training to FP16 made training stable and faster, and in some cases pushed training accuracy to ~99%. Even simple policy-gradient baselines outperformed BF16 with complex corrections.

Image Credit: Defeating the Training-Inference Mismatch via FP16

Why corrections become unnecessary. With FP16, the mismatch is so small that optimization behaves nearly on-policy, reducing the need for TIS/MIS patches.

Across settings.

  • MoE RL (Qwen3-30B-A3B-Base): FP16 was consistently more stable and achieved higher rewards than BF16.

  • LoRA RL: BF16-based LoRA collapsed after ~600 steps; FP16 LoRA remained stable.

  • Large dense models (Qwen3-14B-Base): FP16 converged faster and reached higher accuracy than BF16.

  • Other families (OctoThinker-3B on Llama-3.2-3B): BF16 destabilized after ~150 steps as rounding errors accumulated, while FP16 stayed stable throughout.

Image Credit: Defeating the Training-Inference Mismatch via FP16

Ablation on precision. Using BF16 training + FP32 inference improved stability but made inference about 3× slower. FP16 for both training and inference delivered the best balance: stability, speed, and near-100% training accuracy.

Bias-variance under BF16 vs FP16.

  • In BF16, methods like GRPO or token-level TIS have lower variance but higher bias – they learn quickly, then collapse. Methods like PG-Seq-IS or GRPO-Seq-MIS are less biased but high-variance – stable yet slow.

  • In FP16, this trade-off largely fades. Higher precision reduces the mismatch-induced bias and tames variance in IS corrections, so even basic algorithms converge smoothly.

Bottom line. Use FP16 for both rollout and training during RL fine-tuning. You get stability, efficiency, and top accuracy without algorithmic band-aids, aside from standard loss scaling that frameworks already handle.

Many of FP16’s strengths are already clear, but let’s quickly sum up all the benefits.

Benefits of FP16 for RL Fine-Tuning

  • The change from BF16 to FP16 is simple – just a few lines of code and no algorithmic changes.

  • FP16 works in all major frameworks, and doesn’t require altering the model or training process.

  • FP16’s higher precision lets both engines, training and inference, produce nearly identical numerical results.

  • Reduces rounding errors.

  • Precision improves by roughly 8× (10 mantissa bits vs 7).

  • With FP16, RL fine-tuning becomes more stable and efficient, the model learns faster, and performance improves across different tasks and setups (dense, LoRA, and Mixture-of-Experts models).

  • FP16 achieves high stability without extra computation.

It’s a rare combination of big benefits and simplicity in one method, but don’t forget about the trade-offs.

Not without limitations

  • First of all, switching from BF16 → FP16 trades range for precision. FP16 can overflow or underflow more easily because it has only 5 exponent bits (vs. 8 in BF16). This may be an issue for very large models or extreme gradient values.

  • For massive-scale pre-training, BF16’s wider range is still safer and easier to use.

  • And again, for very large distributed systems, developers might need extra engineering adjustments to manage overflow and synchronization.
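The range limits are easy to reproduce. The sketch below casts a few values to FP16 with Python's standard `struct` module. One caveat that is specific to this emulation: `struct` raises `OverflowError` for values beyond the FP16 range, so the hypothetical helper `fp16_roundtrip` maps that case to infinity, which is what a typical hardware cast would produce.

```python
import struct

def fp16_roundtrip(x):
    """Cast x to FP16 and back; out-of-range values become +/- infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # Python's struct refuses to pack values beyond the FP16 range
        return float("inf") if x > 0 else float("-inf")

print(fp16_roundtrip(65504.0))   # 65504.0, the largest finite FP16 value
print(fp16_roundtrip(70000.0))   # inf, overflow
print(fp16_roundtrip(1e-8))      # 0.0, underflow
```

BF16 would represent both 70000 and 1e-8 (coarsely), which is why its wider range remains attractive when values can swing wildly.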

So FP16 isn’t a go-to format for every training stage, but in RL fine-tuning it has already found use among developers, including some well-known ones →

Early cases of implementation

After the authors published their paper about switching to FP16, Twitter users began sharing it widely, admiring the achievements.

Others said it might be overhyped (a typical situation for something that’s getting a wave of popularity), and some also claimed that this research says BF16 is bad – but that’s not true. The researchers actually acknowledge the strengths of BF16 and point out where it works better than FP16.

The researchers were excited that Andrej Karpathy also noticed the paper and immediately applied the BF16→FP16 switch in nanochat, his new, super-cheap, open small model that’s perfect for experiments. (We wrote about it here.)

Image Credit: Zichen Liu’s X

Nathan Lambert also joined and encouraged the wave of implementations, emphasizing FP16’s outstanding error-reduction capabilities.

When big AI voices find something special in a method, it usually means it’s worth trying to see what it can really do. Since “Defeating the Training-Inference Mismatch via FP16” is a very recent paper, we think there will be many more examples and waves of applications to come.

Conclusion

Switching from BF16 to FP16 trades range for precision – and that trade works well in RL fine-tuning, where numerical precision, not algorithm design, is often the main source of instability. BF16 remains a solid choice for earlier training stages that involve large value fluctuations, but it is less suited to the stable, precision-sensitive phase of reinforcement learning.

The work from Sea AI Lab and the National University of Singapore carries a clear lesson: each stage of model training deserves its own precision and approach. It’s also a fortunate turn for the community that the researchers recognized this simple fix for RL fine-tuning, now one of the most widely used post-training methods, and shared it openly. Their results may well refine how we think about, and measure, every stage of the training process.

At Turing Post, we often return to earlier ideas to understand the present. This research is a good reminder that progress sometimes means revisiting what once worked and seeing it in a new light. The modest FP16 format – introduced years ago and long overshadowed – has become the key to stability and accuracy in today’s advanced RL methods such as GRPO, PPO, and others.

Sources and further reading

From Turing Post:
