Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

Qualitative Comparison

Columns: FLUX.1-dev, Flow-GRPO, Flow-CPS, GRPO-Guard, Flow-DPPO + CPS

Prompts:
"seven green croissants"
"a blue dog on top of three white sheeps behind seven white candles"
"a blue giraffe behind seven pink clocks to the right of a elephant"

Figure 1. Qualitative comparison on FLUX.1-dev with GenEval2 prompts. Flow-DPPO achieves competitive compositional accuracy with notably less image quality degradation compared to Flow-GRPO, Flow-CPS, and GRPO-Guard, reflecting superior KL-proximal efficiency.

Abstract

Recent work has demonstrated that online reinforcement learning (RL) can substantially improve the quality and alignment of flow matching models for image and video generation. Methods such as Flow-GRPO and CPS cast the denoising process as a Markov Decision Process and apply PPO-style ratio clipping to enforce a trust region. However, we argue that ratio clipping is structurally ill-suited for flow models: the probability ratio between new and old policies is a noisy, single-sample estimate of the true policy divergence, leading to over-constraining in some regions of the trajectory and under-constraining in others.

We propose Flow-DPPO (Flow Divergence Proximal Policy Optimization), which replaces ratio clipping with a divergence proximal constraint. A key observation is that the per-step policy in flow models is Gaussian, enabling exact and cheap computation of the KL divergence between old and new policies. Flow-DPPO employs an asymmetric divergence mask that blocks gradient updates only when they simultaneously move away from the trusted region and violate the divergence threshold.

Experiments show that Flow-DPPO achieves higher rewards with better KL-proximal efficiency, alleviates catastrophic forgetting, promotes balanced multi-objective optimization, and enables stable multi-epoch training where ratio clipping degrades.

Key Problem

The probability ratio used for clipping is a noisy, single-sample estimate of policy divergence: its noise has the same magnitude as the signal in high dimensions.

Our Solution

Compute exact KL divergence leveraging Gaussian policy structure — deterministic, zero additional cost, theoretically grounded.

Why Ratio Clipping Fails in Flow Models

For Gaussian policies in flow models, the log probability ratio decomposes as:

$$\log r_t^i(\theta) = \frac{\varepsilon^\top d}{\sigma} - \frac{\|d\|^2}{2\sigma^2}$$

where $d = \mu_\theta - \mu_{\theta_{\text{old}}}$ is the policy shift and $\varepsilon \sim \mathcal{N}(0, I)$ is the sampling noise.

Critical Insight

The first term $\varepsilon^\top d / \sigma$ is zero-mean noise with standard deviation $\|d\|/\sigma$, which is of the same order of magnitude as the signal term $\|d\|^2/(2\sigma^2)$ and exceeds it whenever $\|d\| < 2\sigma$. Clipping decisions are therefore dominated by random noise rather than by the true policy divergence.
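A quick Monte Carlo check makes the noise-versus-signal comparison concrete. The sketch below assumes an isotropic Gaussian step policy with a shared $\sigma$ and samples the log ratio from the decomposition above; the dimension, $\sigma$, and shift size are illustrative values, not settings from the paper, and `log_ratio_samples` is a hypothetical helper name.

```python
import math
import random

def log_ratio_samples(d, sigma, n_samples, seed=0):
    """Sample log r = eps.d / sigma - ||d||^2 / (2 sigma^2) with eps ~ N(0, I)."""
    rng = random.Random(seed)
    # Deterministic signal term: equals the exact per-step KL divergence.
    signal = sum(x * x for x in d) / (2.0 * sigma ** 2)
    samples = []
    for _ in range(n_samples):
        # Zero-mean noise term driven by the sampling noise eps.
        noise = sum(rng.gauss(0.0, 1.0) * x for x in d) / sigma
        samples.append(noise - signal)
    return samples, signal

# Illustrative values (assumptions): 16-dim policy shift with ||d|| = 0.1, sigma = 1.
dim = 16
d = [0.1 / math.sqrt(dim)] * dim
sigma = 1.0
samples, signal = log_ratio_samples(d, sigma, n_samples=20000)

mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
frac_positive = sum(s > 0 for s in samples) / len(samples)

print(f"signal ||d||^2 / (2 sigma^2): {signal:.4f}")
print(f"empirical mean of log r:      {mean:.4f}")
print(f"empirical std of log r:       {std:.4f}")
print(f"fraction of samples with r>1: {frac_positive:.3f}")
```

With these values the noise standard deviation ($\|d\|/\sigma = 0.1$) is about 20 times the signal ($0.005$), so close to half of the samples report $r > 1$ even though the underlying policy shift is identical for every sample.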

Ratio Clipping (PPO-style)

Noisy single-sample estimate
Over-constrains some regions
Under-constrains others
Degrades with multi-epoch training

Divergence Constraint (Ours)

Exact, deterministic KL divergence
Uniform trust region enforcement
Zero additional computational cost
Stable across multiple epochs

Method

01. Exact KL Computation

For Gaussian per-step policies, the KL divergence is computed in closed form:

$$\text{KL}(\pi_{\text{old}} \| \pi_\theta) = \frac{\|\mu_{\theta_{\text{old}}}(x_t, t) - \mu_\theta(x_t, t)\|^2}{2\sigma^2}$$

Both means are already computed during training — zero additional cost.
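The closed form can be sketched in a few lines. This is a minimal illustration assuming both policies share the same isotropic standard deviation $\sigma$ and that the means are available as flat lists of floats; `gaussian_step_kl` is a hypothetical name, not a function from the authors' code.

```python
def gaussian_step_kl(mu_old, mu_new, sigma):
    """KL(N(mu_old, sigma^2 I) || N(mu_new, sigma^2 I)) = ||mu_old - mu_new||^2 / (2 sigma^2)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if len(mu_old) != len(mu_new):
        raise ValueError("means must have the same dimension")
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_old, mu_new))
    return sq_dist / (2.0 * sigma ** 2)

# Illustrative means (assumed values, not from the paper).
mu_old = [0.0, 1.0, -0.5]
mu_new = [0.1, 0.8, -0.5]
kl = gaussian_step_kl(mu_old, mu_new, sigma=0.5)
print(f"KL = {kl:.4f}")
```

Because the two Gaussians share the same covariance, this quantity is symmetric in the two means and involves no sampling noise.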

02. Asymmetric Divergence Mask

We block gradient updates only when they simultaneously move away from the old policy and exceed the divergence threshold:

$$M_t^i = \begin{cases} 0 & \text{if } (\hat{A}^i > 0 \wedge r_t^i > 1 \wedge D_t > \delta) \\ 0 & \text{if } (\hat{A}^i < 0 \wedge r_t^i < 1 \wedge D_t > \delta) \\ 1 & \text{otherwise} \end{cases}$$

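Here $D_t$ denotes the per-step divergence between the old and new policies (the closed-form KL above) and $\delta$ is the divergence threshold. Corrective updates, which move the policy back toward the old policy, are never blocked, preserving PPO's beneficial asymmetry.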
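The case analysis for the mask can be written as a small scalar function. This is a sketch for a single (sample, timestep) pair; the function name `divergence_mask` and the example threshold are assumptions for illustration, and a practical implementation would presumably vectorize this over the batch.

```python
def divergence_mask(advantage, ratio, divergence, delta):
    """Return 0.0 if the update moves further from the old policy AND the
    divergence exceeds delta; return 1.0 otherwise."""
    # Positive advantage with ratio > 1: the update raises an already raised probability.
    pushes_up = advantage > 0 and ratio > 1
    # Negative advantage with ratio < 1: the update lowers an already lowered probability.
    pushes_down = advantage < 0 and ratio < 1
    if divergence > delta and (pushes_up or pushes_down):
        return 0.0
    return 1.0

delta = 0.01  # illustrative threshold, not a value from the paper
print(divergence_mask(advantage=1.0, ratio=1.2, divergence=0.05, delta=delta))   # blocked
print(divergence_mask(advantage=1.0, ratio=0.8, divergence=0.05, delta=delta))   # corrective, kept
print(divergence_mask(advantage=1.0, ratio=1.2, divergence=0.005, delta=delta))  # within threshold, kept
```

The second call shows the asymmetry: the divergence exceeds the threshold, but the update pulls the ratio back toward 1, so it passes through.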

03. Trust Region Guarantee

The policy improvement bound for flow models guarantees monotonic improvement:

$$J(\pi_\theta) - J(\pi_{\theta_{\text{old}}}) \geq L'_{\theta_{\text{old}}}(\pi_\theta) - 2\xi(K\!-\!1)(K\!-\!2)\left[D_{\text{TV}}^{\max}\right]^2$$

The KL constraint upper-bounds the TV divergence via Pinsker's inequality.
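As a worked instance, assuming $D_{\text{TV}}$ is normalized to $[0, 1]$ and the KL divergence is measured in nats, combining Pinsker's inequality with the closed-form Gaussian KL gives a per-step bound whenever the divergence stays within a threshold $\delta$:

$$D_{\text{TV}}(\pi_{\text{old}}, \pi_\theta) \leq \sqrt{\tfrac{1}{2}\,\text{KL}(\pi_{\text{old}} \| \pi_\theta)} = \frac{\|\mu_{\theta_{\text{old}}} - \mu_\theta\|}{2\sigma} \leq \sqrt{\frac{\delta}{2}}$$

For example, an illustrative threshold $\delta = 0.02$ (not a value from the paper) would cap the per-step TV divergence at $\sqrt{0.01} = 0.1$.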

Flow-DPPO Objective

$$\mathcal{L}^{\text{Flow-DPPO}}(\theta) = \mathbb{E}\left[\sum_{i}\frac{1}{G}\sum_{t}\frac{1}{T}\left[M_t^i \cdot r_t^i(\theta) \cdot \hat{A}^i - \beta\, \text{KL}(\pi_\theta \| \pi_{\text{ref}})\right]\right]$$
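To make the averaging structure explicit, the objective can be evaluated for one group of $G$ samples with $T$ steps each. This is a plain-Python sketch under stated assumptions: the ratios, masks, advantages, and reference KL values are supplied as precomputed numbers, and `flow_dppo_objective` is a hypothetical name. In an autodiff framework the ratio would carry the gradient and the mask would act as a constant.

```python
def flow_dppo_objective(ratios, advantages, masks, kl_ref, beta):
    """Average over samples i and steps t of M * r * A - beta * KL(pi_theta || pi_ref).

    ratios, masks, kl_ref: nested lists of shape [G][T]; advantages: list of length G.
    """
    num_samples = len(ratios)
    total = 0.0
    for i in range(num_samples):
        num_steps = len(ratios[i])
        per_sample = 0.0
        for t in range(num_steps):
            surrogate = masks[i][t] * ratios[i][t] * advantages[i]
            per_sample += surrogate - beta * kl_ref[i][t]
        total += per_sample / num_steps  # the 1/T average over steps
    return total / num_samples           # the 1/G average over samples

# Toy inputs (assumed values): G = 2 samples, T = 2 steps.
ratios = [[1.0, 1.2], [0.9, 1.0]]
advantages = [1.0, -1.0]
masks = [[1.0, 0.0], [1.0, 1.0]]  # the second step of sample 0 is masked out
kl_ref = [[0.01, 0.01], [0.01, 0.01]]
value = flow_dppo_objective(ratios, advantages, masks, kl_ref, beta=0.1)
print(f"objective = {value:.4f}")
```

Note that the mask multiplies only the surrogate term, so the KL penalty toward the reference policy still applies at masked steps, as in the formula above.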

Experimental Results

Performance comparison under multi-reward RL fine-tuning (GDPO with equal weights). Bold = best, underline = second best.

	In-Domain (GenEval2)				Out-of-Domain (PickScore)
Method	GenEval2 ↑	CLIP ↑	PickScore ↑	HPSv2 ↑	CLIP ↑	PickScore ↑	HPSv2 ↑
SD3.5-medium (pretrained: GenEval2 12.4)
Flow-GRPO	39.9	0.358	25.09	0.399	0.273	22.07	0.349
Flow-CPS	44.6	0.359	25.51	0.407	0.265	22.08	0.343
GRPO-Guard	47.8	0.353	25.64	0.409	0.272	22.32	0.354
Diffusion-NFT	42.5	0.334	25.30	0.394	0.269	22.52	0.355
Flow-DPPO	48.1	0.345	25.63	0.409	0.273	22.58	0.360
Flow-DPPO + CPS	51.6	0.369	25.72	0.415	0.279	22.51	0.361
FLUX2-klein-base-9B (pretrained: GenEval2 25.4)
Flow-GRPO	46.8	0.371	25.61	0.412	0.277	22.62	0.357
Flow-CPS	47.1	0.361	25.70	0.416	0.276	22.85	0.364
GRPO-Guard	49.0	0.375	25.27	0.411	0.269	21.99	0.349
Diffusion-NFT	47.3	0.336	24.87	0.389	0.274	22.47	0.351
Flow-DPPO	57.7	0.364	25.76	0.418	0.282	22.90	0.368
Flow-DPPO + CPS	55.2	0.386	26.15	0.427	0.287	22.97	0.370

Distribution drift from pretrained model (KL divergence ×10^-3, lower is better). Flow-DPPO achieves significantly less drift while attaining higher rewards.

	FLUX2-9B			SD3.5
Method	Single	Multi	+CFG	Single	Multi
Flow-SDE schedule
Flow-GRPO	0.77	0.79	1.36	2.34	3.81
GRPO-Guard	1.07	1.01	1.63	2.05	3.33
Flow-DPPO	0.17	0.49	0.51	1.16	2.49
CPS schedule
Flow-CPS	0.24	1.66	1.51	2.41	3.18
Flow-DPPO + CPS	0.68	0.70	0.83	1.60	2.52

6.3× better KL efficiency on FLUX2-9B (single-reward, Flow-SDE schedule: 0.17 for Flow-DPPO vs 1.07 for GRPO-Guard)
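The headline factor follows directly from the single-reward FLUX2-9B column of the drift table under the Flow-SDE schedule. The short calculation below reproduces it and reports the same ratio against Flow-GRPO for context; the dictionary simply transcribes the table values.

```python
# Drift from the pretrained model (KL x 10^-3), FLUX2-9B, single reward, Flow-SDE schedule.
drift = {"Flow-GRPO": 0.77, "GRPO-Guard": 1.07, "Flow-DPPO": 0.17}

# How many times larger each baseline's drift is compared with Flow-DPPO.
drift_ratio = {
    method: value / drift["Flow-DPPO"]
    for method, value in drift.items()
    if method != "Flow-DPPO"
}

for method, ratio in drift_ratio.items():
    print(f"{method}: {ratio:.1f}x the drift of Flow-DPPO")
```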

Flow-DPPO enables stable multi-epoch training (sample reuse), critical for expensive generation scenarios like video. Ratio-clipping methods degrade with multiple inner loops.

Panels: Flow-SDE Schedule, CPS Schedule

Training configurations:
G64-I1: Standard: 64 groups, 1 inner loop
G32-I2: Half rollouts, 2 inner loops (saves compute)
G64-I2: Full rollouts, 2× training intensity
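The relationship between the three configurations can be sketched as a small bookkeeping example. It assumes that "G" counts rollout groups per iteration and "I" counts inner optimization loops over those rollouts, as the descriptions above suggest; the `Schedule` class and its `group_updates` property are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Schedule:
    name: str
    groups: int       # rollout groups sampled per iteration (generation cost)
    inner_loops: int  # optimization passes over each batch of rollouts

    @property
    def group_updates(self):
        """Number of group-level gradient passes per iteration."""
        return self.groups * self.inner_loops

schedules = [
    Schedule("G64-I1", groups=64, inner_loops=1),
    Schedule("G32-I2", groups=32, inner_loops=2),
    Schedule("G64-I2", groups=64, inner_loops=2),
]

baseline = schedules[0]
for s in schedules:
    rollout_cost = s.groups / baseline.groups
    training_intensity = s.group_updates / baseline.group_updates
    print(f"{s.name}: rollouts x{rollout_cost:.1f}, training x{training_intensity:.1f}")
```

Under this reading, G32-I2 matches the standard setting's number of gradient passes with half the generation cost, while G64-I2 keeps the generation cost and doubles the training intensity.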

Qualitative Comparison

Qualitative comparison on FLUX2-9B (single-reward, controlled seeds, same training iteration). Flow-DPPO and Flow-DPPO + CPS retain competitive in-domain performance with less reward hacking while exhibiting notably less catastrophic forgetting on out-of-domain prompts.

Panels: In-Domain (GenEval2), Out-of-Domain (PickScore)

Models shown: FLUX2-9B, Flow-GRPO, Flow-CPS, GRPO-Guard, Flow-DPPO, Flow-DPPO + CPS

Key Takeaways

+17.7%

Higher Reward

Flow-DPPO achieves 57.7 on GenEval2 vs 49.0 for the best baseline (GRPO-Guard) on FLUX2-9B, a gain of 8.7 points.

6.3×

Better KL Efficiency

Significantly less distribution drift from the pretrained model, preserving generation quality.

Stable

Multi-Epoch Training

Enables efficient sample reuse where ratio clipping degrades — critical for video generation.

Balanced

Multi-Objective

Competitive across all metrics simultaneously, with less reward hacking and catastrophic forgetting than ratio-clipping baselines.

Citation

@misc{ping2026flowdppodivergenceproximalpolicy,
      title={Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models}, 
      author={Bowen Ping and Xiangxin Zhou and Penghui Qi and Minnan Luo and Liefeng Bo and Tianyu Pang},
      year={2026},
      eprint={2606.11025},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2606.11025}, 
}
