GRPO, DPO & RLVR Explained: Reasoning RL Methods in 2026

In 2026, reinforcement learning (RL) has become an industry of its own, with a wide variety of methods built to keep AI models on track and reasoning correctly. The landscape is shaped mainly by GRPO (Group Relative Policy Optimization), RLVR (reinforcement learning with verifiable rewards), critic-free optimization, DPO (Direct Preference Optimization) variants, agentic policy optimization, and test-time diversity methods.

This guide maps the classic baselines and the newest 2026 methods for training models that reason, verify, search, self-correct, and improve with every optimization step.

TL;DR: Modern reasoning RL is shifting from expensive PPO-style pipelines toward cheaper, critic-free, group-relative, and preference-based methods. GRPO, DPO, DAPO, GSPO, ARPO, VPO, and newer DPO variants define the 2026 toolkit for reinforcement learning, agentic training, and reasoning optimization.

Now to the list!

Method	Status	Why it matters	Use for
GRPO – Group Relative Policy Optimization	2024–2026, mainstream	Critic-free PPO alternative; central RLVR baseline	Reasoning RL, math/code, verifiable rewards
DPO – Direct Preference Optimization	2023–2026, classic	Direct preference training without reward-model RL	Offline alignment, chosen/rejected datasets
REINFORCE++	2025–2026, practical	Simple critic-free RL with normalized advantage	Lightweight RLHF/RLVR baselines
DAPO – Dynamic sAmpling Policy Optimization	2025–2026, hot	More stable GRPO with dynamic sampling and clipping fixes	Long-CoT and large-scale reasoning RL
Dr. GRPO	2025–2026, corrective	Fixes GRPO length bias in loss normalization	Token-efficient long-reasoning training
GSPO – Group Sequence Policy Optimization	2025–2026, important	Optimizes sequence-level ratios, not token-level ones	Sequence rewards, Mixture-of-Experts RL stability
DHPO – Dynamic Hybrid Policy Optimization	2026, new	Blends token-level GRPO and sequence-level GSPO	Hybrid reasoning RL optimization
EP-GRPO – Entropy-Progress Aligned GRPO	2026, new	Reweights tokens using entropy-progress signals	Better credit assignment in reasoning
TR-GRPO – Token-Regulated GRPO	2025–2026, new	Regulates token contributions by reward relevance	Math, logic, agentic reasoning
DPPO – Dynamic Pruning Policy Optimization	2026, efficiency-focused	Prunes redundant rollouts with unbiased correction	Faster GRPO-style training
ARPO – Agentic Reinforced Policy Optimization	2025–2026, agentic	Optimizes multi-turn agent steps	Tool-use and agentic LLMs
VPO – Vector Policy Optimization	2026, new	Trains diverse solution sets with reward vectors	Test-time search, best@k/pass@k
InSPO – Intrinsic Self-reflective Preference Optimization	2025–2026, DPO-family	Adds self-reflection to preference optimization	Reflective DPO-style alignment
TI-DPO – Token-Importance Guided DPO	2025–2026, notable	Adds token-importance weights to DPO	Fine-grained preference learning
RAPPO – Reliable Alignment for Preference PO	2026, reliable	Filters ambiguous pairs that hurt DPO generalization	Noisy preference datasets

Core RL Baselines: GRPO, DPO, REINFORCE++

GRPO

The foundation of the current wave of RLVR and reasoning RL, and by 2026 its central reference point. GRPO (Group Relative Policy Optimization) samples several responses to the same prompt and scores each one relative to the rest of its group, so it needs no separate value critic. That makes it cheaper than classic PPO (Proximal Policy Optimization).
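
A minimal sketch of the group-relative advantage idea, assuming the common choice of normalizing each reward by its group's mean and standard deviation. The function name and the `eps` value are illustrative, not taken from any particular library.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Score each response against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one prompt, rewarded 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # correct responses get positive values, wrong ones negative
```

Note that a group where every response earns the same reward yields all-zero advantages, so that prompt contributes no learning signal.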

DPO

DPO (Direct Preference Optimization) is already a classic “RLHF without full RL” method: it uses human preference data but avoids a separate reward model. DPO trains the model directly on preference pairs, each made of a chosen and a rejected response to the same prompt. It updates the model so the chosen response becomes more likely, while keeping the model close to the original supervised fine-tuned model. DPO is now the main reference point for offline preference optimization, because it is simple, stable, and cheaper than PPO-style RLHF.
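
The per-pair DPO loss can be sketched as follows, assuming you already have summed log-probabilities of each response under the trained policy and under the frozen reference (SFT) model. The argument names and the `beta` value are illustrative.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable softplus(-x), which equals -log(sigmoid(x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

# Policy identical to the reference: no preference margin yet.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has moved toward the chosen response: the loss is lower.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

Because the loss depends only on log-ratios against the reference model, the reference acts as the anchor that keeps the policy close to its starting point.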

REINFORCE++

It matters as a “simple is strong again” method: it’s critic-free policy optimization that updates the model based on the reward for the full generated response, reinforcing more successful trajectories through a normalized advantage. It’s often placed next to GRPO and RLOO as a simple RLVR/RLHF baseline without PPO-level complexity.
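
A rough sketch of a critic-free update with a normalized advantage. It assumes the normalization runs over the whole batch rather than per prompt group, and it leaves out any KL penalty or clipping. Both function names are hypothetical.

```python
from statistics import mean, pstdev

def batch_normalized_advantages(rewards, eps=1e-6):
    """Normalize response-level rewards across the whole batch (assumed scope)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def reinforce_surrogate_loss(sequence_logps, advantages):
    """Negative advantage-weighted log-likelihood, averaged over the batch."""
    return -mean([a * lp for a, lp in zip(advantages, sequence_logps)])

rewards = [2.0, 0.0, 1.0, 1.0]          # one scalar reward per full response
logps = [-5.0, -7.0, -6.0, -6.5]        # summed token log-probs per response
advantages = batch_normalized_advantages(rewards)
print(reinforce_surrogate_loss(logps, advantages))
```

Minimizing this loss raises the likelihood of responses with above-average reward and lowers it for below-average ones.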

GRPO Variants in 2026: DAPO, GSPO, DHPO and More

DAPO

One of the main GRPO-successor methods. DAPO (Dynamic sAmpling Policy Optimization) keeps the GRPO-style group comparison workflow but fixes several practical issues with it: it makes training more stable by decoupling the lower and upper clipping bounds, filtering and resampling toward more informative prompts, and tuning several rollout-level details. DAPO reports a score of 50 on AIME 2024 with Qwen2.5-32B and comes with an open-source large-scale RL system.
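
Two of these ideas can be sketched in a few lines, assuming a PPO-style clipped objective with separate lower and upper bounds and a filter that drops prompt groups with no reward variation. The function names and the epsilon defaults are illustrative.

```python
def decoupled_clip_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped term with independent lower and upper clip ranges."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

def keep_informative_groups(reward_groups):
    """Drop groups whose rewards are all identical: their advantages are all zero."""
    return [group for group in reward_groups if len(set(group)) > 1]

print(decoupled_clip_objective(1.5, 1.0))    # capped by the upper bound
print(decoupled_clip_objective(0.5, -1.0))   # capped by the lower bound
print(keep_informative_groups([[1, 1, 1], [0, 1, 0], [0, 0, 0]]))
```

A wider upper bound lets low-probability tokens grow more before clipping kicks in, while the filter keeps each batch full of prompts that still carry a gradient.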

Dr. GRPO

It matters as “GRPO done right” and a fix for token efficiency: Dr. GRPO fixes GRPO’s length-related bias by correcting how advantages and the loss are normalized across tokens and responses. It normalizes by a fixed constant, such as the maximum completion length, instead of each response’s own length, so shorter answers don’t get artificially larger updates and long incorrect responses are not under-penalized.
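
To see where the length bias comes from, compare the weight each token receives under the two normalizations. This is a simplified sketch that assumes every token in a response shares that response's advantage. The function names are hypothetical.

```python
def grpo_token_weights(lengths):
    """Per-response mean, then group mean: each token is weighted 1 / (G * own length)."""
    group_size = len(lengths)
    return [1.0 / (group_size * n) for n in lengths]

def constant_token_weights(lengths, max_len):
    """Fixed normalizer: every token is weighted 1 / (G * max_len), whatever the length."""
    group_size = len(lengths)
    return [1.0 / (group_size * max_len) for _ in lengths]

lengths = [10, 100]  # a short and a long response to the same prompt
print(grpo_token_weights(lengths))
print(constant_token_weights(lengths, max_len=100))
```

With per-response averaging, each token of the 10-token response carries ten times the weight of a token in the 100-token response. With a constant normalizer the per-token weight is the same for both.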

GSPO

A very important shift to sequence-level likelihood ratios. GSPO (Group Sequence Policy Optimization) computes the importance ratio over the whole generated sequence, then clips and optimizes this sequence-level ratio so the update aligns more directly with the final response-level reward. It is reported to be especially helpful for stabilizing Mixture-of-Experts RL training.
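
A sketch of a sequence-level importance ratio, assuming it is length-normalized (the geometric mean of the per-token ratios) and then clipped PPO-style. The function names and the `eps` value are illustrative only.

```python
import math

def sequence_ratio(new_token_logps, old_token_logps):
    """exp(mean(log pi_new - log pi_old)) over all tokens of one response."""
    diffs = [n - o for n, o in zip(new_token_logps, old_token_logps)]
    return math.exp(sum(diffs) / len(diffs))

def clipped_sequence_objective(ratio, advantage, eps=0.2):
    """One clip decision per response instead of one per token."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

old = [-1.0, -2.0, -0.5]
new = [-0.9, -1.9, -0.4]   # every token became slightly more likely
ratio = sequence_ratio(new, old)
print(ratio, clipped_sequence_objective(ratio, advantage=1.0))
```

Because the whole response shares one ratio, a single noisy token cannot be clipped or amplified independently of the rest of the sequence.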

DHPO

A very fresh 2026 method: DHPO (Dynamic Hybrid Policy Optimization) combines GRPO’s token-level ratios, which guide local corrections, with GSPO’s sequence-level importance ratio, which keeps whole-response optimization aligned with the final reward. GRPO gives you fine-grained credit assignment, GSPO better matches sequence-level rewards, and DHPO tries to get the best of both.

EP-GRPO

A fresh GRPO variant. It targets some of GRPO’s credit assignment failures: uniform token-level granularity, wrong polarity on reasoning steps, and zero-variance collapse. EP-GRPO (Entropy-Progress Aligned GRPO) tracks entropy changes across reasoning steps and uses this “progress” signal to reweight token advantages, so updates focus more on tokens that actually move the solution forward instead of treating every token equally.

TR-GRPO

One more GRPO variant, this one regulating token contributions. TR-GRPO (Token-Regulated GRPO) assigns different weights to tokens based on their estimated contribution to the final reward. This reduces noisy or unhelpful token updates while preserving stronger learning signals for important reasoning and action tokens.

DPPO

A fresh efficiency-focused method for group-based policy optimization. DPPO (Dynamic Pruning Policy Optimization) makes GRPO-style training faster through dynamic pruning: it prunes low-value or redundant rollouts during group-based training, then applies an importance-sampling correction so the faster update still estimates the original GRPO-style gradient without bias.

Agentic & Test-Time Methods: ARPO, VPO

ARPO

Very important for agentic and tool-use models. ARPO (Agentic Reinforced Policy Optimization) proposes an RL algorithm designed specifically for multi-turn LLM agents. ARPO samples and optimizes at the agent-step level – across intermediate tool calls, observations, and decisions – and the model learns which actions improve the whole multi-turn trajectory instead of only rewarding the final answer.

VPO

This is one of the most interesting new methods. VPO (Vector Policy Optimization) trains the model to produce diverse solution sets under different reward vectors, which is important for test-time search, best@k, and pass@k.
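
For context on why diversity matters, here is the widely used unbiased pass@k estimator from code-generation evaluation. It is shown only to illustrate the metric and is not part of VPO itself. It assumes `n` samples per problem, of which `c` are correct.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n is correct."""
    if n - c < k:
        return 1.0  # too few wrong samples to fill k draws without a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples per problem, 2 of them correct.
print(pass_at_k(10, 2, 1))
print(pass_at_k(10, 2, 5))
```

pass@k rises with k only if the samples differ from each other, which is why a policy that returns near-identical answers gains little from test-time search.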

Preference Optimization Methods: DPO Variants

InSPO

InSPO (Intrinsic Self-reflective Preference Optimization) is conceptually interesting: it brings self-reflection directly into preference optimization by conditioning the policy not only on the context, but also on an alternative response. It is a plug-and-play enhancement for DPO-family algorithms.

TI-DPO

TI-DPO (Token-Importance Guided DPO) is one of the most notable DPO variants. Standard DPO is coarse-grained: it scores whole responses, even though not all tokens matter equally. TI-DPO introduces token-importance weights and a triplet loss so the model can focus more on the parts of the response that actually drive the preference.

RAPPO

A good fresh DPO variant that uses order-aware preference learning – “keep the best, forget the rest”. RAPPO (Reliable Alignment for Preference PO) ranks multiple candidate responses by preference order, keeps the strongest one as the main positive signal, and downweights or discards weaker alternatives.

FAQ

What is GRPO in reinforcement learning?

GRPO, or Group Relative Policy Optimization, is a critic-free reinforcement learning method where multiple responses to the same prompt are compared within a group. Instead of training a separate value model, GRPO uses group-relative rewards to estimate advantage, making reasoning RL and RLVR cheaper than PPO-style training.

What is RLVR?

RLVR means reinforcement learning with verifiable rewards. It trains models on tasks where answers can be checked automatically, such as math, coding, logic, or structured reasoning problems. Instead of relying only on human preference labels, RLVR uses rule-based or programmatic verification to reward correct reasoning outcomes.
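
A toy verifier makes this concrete. The sketch below assumes a hypothetical output convention in which the model ends its response with a line like `Answer: 42`. The function name and the format are illustrative only.

```python
import re

def math_reward(response, expected):
    """Return 1.0 if the last 'Answer: <number>' in the response matches expected."""
    matches = re.findall(r"Answer:\s*(-?\d+(?:\.\d+)?)", response)
    if not matches:
        return 0.0  # no parseable final answer, so no reward
    return 1.0 if float(matches[-1]) == float(expected) else 0.0

print(math_reward("2 + 2 is four. Answer: 4", "4"))
print(math_reward("I think it is five. Answer: 5", "4"))
print(math_reward("Not sure.", "4"))
```

Real verifiers for code or proofs are more involved (unit tests, sandboxes, symbolic checkers), but they play the same role: a programmatic check replaces a learned reward model.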

GRPO vs PPO: what is the difference?

PPO usually relies on a value critic to estimate advantages during reinforcement learning. GRPO removes the separate critic and compares responses within a sampled group instead. This makes GRPO simpler and often cheaper for large language model reasoning training, especially when rewards are verifiable.

What are GRPO, DPO, RLVR, DAPO, GSPO, ARPO, and VPO used for?

GRPO is used for cheaper critic-free reasoning RL; DPO for offline preference alignment; RLVR for tasks with verifiable answers like math or coding; DAPO for more stable GRPO-style training; GSPO for sequence-level rewards; ARPO for multi-turn agents and tool use; and VPO for diverse test-time search.

Why do RLVR methods matter for reasoning models?

RLVR methods matter because they help models improve on tasks with objectively checkable answers. They are central to training stronger reasoning models for math, coding, tool use, and multi-step problem solving, where the model needs not only to sound plausible but to reach a correct result.
