In 2026, reinforcement learning (RL) is a whole industry, with a huge variety of methods created to help AI models stay on track and reason correctly. The new landscape is mainly shaped by GRPO (Group Relative Policy Optimization), RLVR (reinforcement learning with verifiable rewards), critic-free optimization, DPO (Direct Preference Optimization) variants, agentic policy optimization, and test-time diversity methods.

This guide maps the classic baselines and the newest 2026 methods for training models that reason, verify, search, self-correct, and improve with every optimization step.

TL;DR: Modern reasoning RL is shifting from expensive PPO-style pipelines toward cheaper, critic-free, group-relative, and preference-based methods. GRPO, DPO, DAPO, GSPO, ARPO, VPO, and newer DPO variants define the 2026 toolkit for reinforcement learning, agentic training, and reasoning optimization.

Now to the list!

| Method | Status | Why it matters | Use for |
| --- | --- | --- | --- |
| GRPO – Group Relative Policy Optimization | 2024–2026, mainstream | Critic-free PPO alternative; central RLVR baseline | Reasoning RL, math/code, verifiable rewards |
| DPO – Direct Preference Optimization | 2023–2026, classic | Direct preference training without reward-model RL | Offline alignment, chosen/rejected datasets |
| REINFORCE++ | 2025–2026, practical | Simple critic-free RL with normalized advantage | Lightweight RLHF/RLVR baselines |
| DAPO – Dynamic sAmpling Policy Optimization | 2025–2026, hot | More stable GRPO with dynamic sampling and clipping fixes | Long-CoT and large-scale reasoning RL |
| Dr. GRPO | 2025–2026, corrective | Fixes GRPO length bias in loss normalization | Token-efficient long-reasoning training |
| GSPO – Group Sequence Policy Optimization | 2025–2026, important | Optimizes sequence-level ratios, not token-level ones | Sequence rewards, Mixture-of-Experts RL stability |
| DHPO – Dynamic Hybrid Policy Optimization | 2026, new | Blends token-level GRPO and sequence-level GSPO | Hybrid reasoning RL optimization |
| EP-GRPO – Entropy-Progress Aligned GRPO | 2026, new | Reweights tokens using entropy-progress signals | Better credit assignment in reasoning |
| TR-GRPO – Token-Regulated GRPO | 2025–2026, new | Regulates token contributions by reward relevance | Math, logic, agentic reasoning |
| DPPO – Dynamic Pruning Policy Optimization | 2026, efficiency-focused | Prunes redundant rollouts with unbiased correction | Faster GRPO-style training |
| ARPO – Agentic Reinforced Policy Optimization | 2025–2026, agentic | Agentic PO, optimizes multi-turn agent steps | Tool-use and agentic LLMs |
| VPO – Vector Policy Optimization | 2026, new | Trains diverse solution sets with reward vectors | Test-time search, best@k/pass@k |
| InSPO – Intrinsic Self-reflective Preference Optimization | 2025–2026, DPO-family | Adds self-reflection to preference optimization | Reflective DPO-style alignment |
| TI-DPO – Token-Importance Guided DPO | 2025–2026, notable | Adds token-importance weights to DPO | Fine-grained preference learning |
| RAPPO – Reliable Alignment for Preference PO | 2026, reliable | Filters ambiguous pairs that hurt DPO generalization | Noisy preference datasets |

Core RL Baselines: GRPO, DPO, REINFORCE++

GRPO

GRPO (Group Relative Policy Optimization) is the foundation of the current wave of RLVR and reasoning RL, and by 2026 it is the central reference point. Responses to the same prompt are compared within a group, and each response's advantage is computed relative to that group instead of coming from a separate value critic. Dropping the critic makes GRPO cheaper than classic PPO (Proximal Policy Optimization). → Read more
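To make "compared within a group" concrete, here is a minimal sketch of a group-relative advantage. It assumes one scalar reward per response and normalization by the group's mean and standard deviation; the function name and the `eps` term are illustrative choices, not taken from any specific GRPO codebase.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each response relative to the other responses to the same prompt.

    Assumption: the advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt: two correct (1.0), two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct responses get a positive advantage, wrong ones a negative advantage.
```

Note that a group where every response gets the same reward yields all-zero advantages, so it contributes no learning signal.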

DPO

DPO (Direct Preference Optimization) is already a classic “RLHF without full RL” method: it uses human preference data but avoids training a separate reward model. DPO trains the model directly on preference pairs – a chosen and a rejected response to the same prompt. It updates the model so the chosen response becomes more likely than the rejected one, while keeping the model close to a reference model, usually the original supervised fine-tuned model. DPO is now the main reference point for offline preference optimization, because it is simple, stable, and cheaper than PPO-style RLHF. → Read more
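As a rough illustration, the DPO loss for a single preference pair can be sketched as below. The inputs are assumed to be summed log-probabilities of each full response under the policy and under the frozen reference model; the argument names and the default `beta` are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: -log sigmoid(beta * margin)."""
    # How much more the policy likes each response than the reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_shift - rejected_shift)
    # Numerically stable form of -log(sigmoid(logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Policy identical to the reference: the loss sits at log(2).
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy has moved toward the chosen response: the loss drops.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The reference log-probabilities are what keep the policy anchored: the loss only rewards shifts relative to the reference, scaled by `beta`.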

REINFORCE++

It matters as a “simple is strong again” method: it’s critic-free policy optimization that updates the model based on the reward for the full generated response, reinforcing more successful trajectories through a normalized advantage. It’s often placed next to GRPO and RLOO as a simple RLVR/RLHF baseline without PPO-level complexity. → Read more
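A minimal sketch of that idea, under simplifying assumptions: one scalar reward per response, advantages normalized across the whole batch (not per prompt group), and a plain REINFORCE-style surrogate. KL penalties and clipping are left out, and both function names are illustrative.

```python
from statistics import mean, pstdev


def batch_normalized_advantages(rewards, eps=1e-6):
    """Normalize rewards over the whole batch, across all prompts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reinforce_loss(sequence_logps, advantages):
    """Surrogate loss: minus the mean of advantage times response log-prob."""
    terms = [a * lp for a, lp in zip(advantages, sequence_logps)]
    return -mean(terms)


# One batch of four responses; only the first one earned a reward.
advantages = batch_normalized_advantages([1.0, 0.0, 0.0, 0.0])
loss = reinforce_loss([-1.0, -2.0, -2.0, -2.0], advantages)
```

Minimizing this loss raises the log-probability of responses with positive advantage and lowers it for the rest, with no critic involved.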

GRPO Variants in 2026: DAPO, GSPO, DHPO and More

DAPO

One of the main GRPO-successor methods. DAPO (Dynamic sAmpling Policy Optimization) keeps the GRPO-style group comparison workflow, but fixes several practical issues to make training more stable: it decouples the lower and upper clipping bounds, filters out prompts whose sampled responses carry no learning signal so that batches stay informative, and tunes several rollout-level details. The authors report 50 points on AIME 2024 with Qwen2.5-32B and released an open-source large-scale RL system alongside the method. → Read more
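Two of these ingredients are easy to sketch. The snippet below assumes a PPO-style clipped objective with separate lower and upper bounds, and a filter that drops prompt groups where every response received the same reward. The function names and the default bounds are illustrative.

```python
def decoupled_clip_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped term with separate lower and upper clip bounds."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def keep_prompt(group_rewards):
    """Dynamic sampling filter: keep a prompt only if its rewards differ."""
    return len(set(group_rewards)) > 1


# A larger upper bound lets low-probability tokens grow a bit further.
objective = decoupled_clip_objective(ratio=1.5, advantage=1.0)
# All-correct or all-wrong groups give zero group-relative advantage.
informative = [g for g in ([1, 1, 1, 1], [1, 0, 1, 1], [0, 0, 0, 0]) if keep_prompt(g)]
```

In practice the filter means sampling continues until the batch is filled with prompts that still separate good responses from bad ones.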

Dr. GRPO

It matters as “GRPO done right” and a fix for token efficiency: Dr. GRPO fixes GRPO’s length-related bias by correcting how advantages and normalization are computed across tokens and responses. Instead of dividing each response’s loss by its own length, it normalizes by a fixed constant such as the maximum completion length, so shorter answers don’t get artificially larger per-token updates and long incorrect answers are not penalized too lightly. → Read more
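The length effect is easiest to see in the per-token weights. This sketch assumes the GRPO-style loss averages over tokens within each response and then over responses, while the corrected version divides by a fixed `max_len`; both function names are illustrative.

```python
def grpo_token_weights(lengths):
    """Per-token weight when each response is averaged over its own length."""
    group_size = len(lengths)
    return [1.0 / (group_size * n) for n in lengths]


def fixed_length_token_weights(lengths, max_len):
    """Per-token weight when every response is divided by the same constant."""
    group_size = len(lengths)
    return [1.0 / (group_size * max_len) for _ in lengths]


# A 10-token response and a 100-token response in the same group.
biased = grpo_token_weights([10, 100])
unbiased = fixed_length_token_weights([10, 100], max_len=100)
```

With per-response averaging, each token of the short response carries ten times the weight of a token in the long one; with a fixed divisor, every token counts the same.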

GSPO

A very important shift to sequence-level likelihood ratios. GSPO (Group Sequence Policy Optimization) computes the importance ratio over the whole generated sequence, then clips and optimizes this sequence-level ratio so the update aligns more directly with the final response-level reward. It is especially stable for Mixture-of-Experts RL training. → Read more
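A minimal sketch of a sequence-level ratio, assuming it is the length-normalized likelihood ratio of the whole response (the exponential of the mean per-token log-ratio) and that it is clipped PPO-style. The function names and the `eps` default are illustrative.

```python
import math


def sequence_ratio(new_token_logps, old_token_logps):
    """Length-normalized likelihood ratio of the whole response."""
    diffs = [n - o for n, o in zip(new_token_logps, old_token_logps)]
    return math.exp(sum(diffs) / len(diffs))


def sequence_clipped_objective(new_token_logps, old_token_logps,
                               advantage, eps=0.2):
    """Clip one ratio per response, not one ratio per token."""
    ratio = sequence_ratio(new_token_logps, old_token_logps)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


# Token-level shifts that cancel out leave the sequence ratio at 1.
ratio = sequence_ratio([-0.5, -1.5], [-1.0, -1.0])
```

Because one ratio covers the whole response, a few noisy token-level ratios cannot dominate the update on their own.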

DHPO

A very fresh 2026 method: DHPO (Dynamic Hybrid Policy Optimization) combines GRPO’s token-level ratios to guide local corrections and GSPO’s sequence-level importance ratio to keep the whole-response optimization aligned with the final reward. In the end, GRPO gives you fine-grained credit assignment, GSPO better matches sequence-level rewards, and DHPO tries to get the best of both. → Read more

EP-GRPO

A fresh GRPO variant. It targets some of GRPO’s credit assignment failures: uniform token-level granularity, wrong polarity on reasoning steps, and zero-variance collapse. EP-GRPO (Entropy-Progress Aligned GRPO) tracks entropy changes across reasoning steps and uses this “progress” signal to reweight token advantages, so updates focus more on tokens that actually move the solution forward instead of treating every token equally. → Read more

TR-GRPO

One more GRPO variant, this one regulating token contributions. TR-GRPO (Token-Regulated GRPO) assigns different weights to tokens based on their estimated contribution to the final reward. This reduces noisy or unhelpful token updates while preserving stronger learning signals for important reasoning/action tokens. → Read more

DPPO

A fresh efficiency-focused method for group-based policy optimization. DPPO (Dynamic Pruning Policy Optimization) makes GRPO-style training faster through dynamic pruning: it prunes low-value or redundant rollouts during group-based training, then applies an importance-sampling correction so the faster update still estimates the original GRPO-style gradient without bias. → Read more
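The "without bias" part can be illustrated with generic inverse-probability reweighting. This is not the paper's actual pruning rule: `keep_probs` is a hypothetical per-rollout keep probability, and `values` stands in for per-rollout gradient contributions.

```python
import random


def pruned_estimate(values, keep_probs, rng):
    """Keep rollout i with probability keep_probs[i], reweight it by 1/p.

    In expectation this equals the plain mean over all rollouts.
    """
    total = 0.0
    for value, p in zip(values, keep_probs):
        if rng.random() < p:
            total += value / p
    return total / len(values)


rng = random.Random(0)
values = [1.0, 2.0, 3.0, 4.0]
keep_probs = [0.5, 1.0, 0.25, 1.0]  # hypothetical pruning probabilities
estimates = [pruned_estimate(values, keep_probs, rng) for _ in range(20000)]
average = sum(estimates) / len(estimates)  # close to the full mean of 2.5
```

Each pruned update is noisier than the full one, but the reweighting keeps its expectation equal to the unpruned estimate.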

Agentic & Test-Time Methods: ARPO, VPO

ARPO

Very important for agentic and tool-use models. ARPO (Agentic Reinforced Policy Optimization) proposes an RL algorithm designed specifically for multi-turn LLM agents. ARPO samples and optimizes at the agent-step level – across intermediate tool calls, observations, and decisions – and the model learns which actions improve the whole multi-turn trajectory instead of only rewarding the final answer. → Read more

VPO

This is one of the most interesting new methods. VPO (Vector Policy Optimization) trains the model to produce diverse solution sets under different reward vectors, which is important for test-time search, best@k, and pass@k. → Read more

Preference Optimization Methods: DPO Variants

InSPO

InSPO (Intrinsic Self-reflective Preference Optimization) is conceptually interesting: it brings self-reflection directly into preference optimization by conditioning the policy not only on the context, but also on an alternative response. It is a plug-and-play enhancement for DPO-family algorithms. → Read more

TI-DPO

TI-DPO (Token-Importance Guided DPO) is one of the most notable DPO variants. Standard DPO is coarse-grained: it scores whole responses, even though not all tokens matter equally. TI-DPO introduces token-importance weights and a triplet loss, so the model can focus on the parts of the response that actually drive the preference. → Read more

RAPPO

A good fresh DPO variant that uses order-aware preference learning – “keep the best, forget the rest”. RAPPO (Reliable Alignment for Preference PO) ranks multiple candidate responses by preference order, keeps the strongest one as the main positive signal, and downweights or discards weaker alternatives. → Read more

FAQ

What is GRPO in reinforcement learning?

GRPO, or Group Relative Policy Optimization, is a critic-free reinforcement learning method where multiple responses to the same prompt are compared within a group. Instead of training a separate value model, GRPO uses group-relative rewards to estimate advantage, making reasoning RL and RLVR cheaper than PPO-style training.

What is RLVR?

RLVR means reinforcement learning with verifiable rewards. It trains models on tasks where answers can be checked automatically, such as math, coding, logic, or structured reasoning problems. Instead of relying only on human preference labels, RLVR uses rule-based or programmatic verification to reward correct reasoning outcomes.
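A rule-based verifier can be as small as the sketch below. It assumes a math task where the last number in the response is treated as the final answer; the function name and the parsing rule are illustrative, and real verifiers are usually stricter about answer format.

```python
import re


def math_reward(response, reference_answer):
    """Return 1.0 if the last number in the response matches the reference."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(reference_answer) else 0.0


reward = math_reward("6 times 7 gives a total of 42.", "42")
```

The reward depends only on whether the checkable result is right, not on how convincing the response sounds.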

GRPO vs PPO: what is the difference?

PPO usually relies on a value critic to estimate advantages during reinforcement learning. GRPO removes the separate critic and compares responses within a sampled group instead. This makes GRPO simpler and often cheaper for large language model reasoning training, especially when rewards are verifiable.

What are GRPO, DPO, RLVR, DAPO, GSPO, ARPO, and VPO used for?

GRPO is used for cheaper critic-free reasoning RL; DPO for offline preference alignment; RLVR for tasks with verifiable answers like math or coding; DAPO for more stable GRPO-style training; GSPO for sequence-level rewards; ARPO for multi-turn agents and tool use; and VPO for diverse test-time search.

Why do RLVR methods matter for reasoning models?

RLVR methods matter because they help models improve on tasks with objectively checkable answers. They are central to training stronger reasoning models for math, coding, tool use, and multi-step problem solving, where the model needs not only to sound plausible but to reach a correct result.
