Flow-DPPO:
Divergence Proximal Policy Optimization
for Flow Matching Models

Bowen Ping1,2,* Xiangxin Zhou2,*,¶ Penghui Qi3 Minnan Luo1,‡ Liefeng Bo2 Tianyu Pang2,‡
1Xi'an Jiaotong University 2Tencent Hunyuan 3National University of Singapore

* Equal contribution    ¶ Project Lead    ‡ Corresponding author

Qualitative Comparison

Columns: FLUX.1-dev baseline, Flow-GRPO, Flow-CPS, GRPO-Guard, Flow-DPPO + CPS

Prompts (one row each):

  • "seven green croissants"
  • "a blue dog on top of three white sheeps behind seven white candles"
  • "a blue giraffe behind seven pink clocks to the right of a elephant"

Figure 1. Qualitative comparison on FLUX.1-dev with GenEval2 prompts. Flow-DPPO achieves competitive compositional accuracy with notably less image quality degradation than Flow-GRPO, Flow-CPS, and GRPO-Guard, reflecting superior KL-proximal efficiency.

Abstract

Recent work has demonstrated that online reinforcement learning (RL) can substantially improve the quality and alignment of flow matching models for image and video generation. Methods such as Flow-GRPO and CPS cast the denoising process as a Markov Decision Process and apply PPO-style ratio clipping to enforce a trust region. However, we argue that ratio clipping is structurally ill-suited for flow models: the probability ratio between new and old policies is a noisy, single-sample estimate of the true policy divergence, leading to over-constraining in some regions of the trajectory and under-constraining in others.

We propose Flow-DPPO (Flow Divergence Proximal Policy Optimization), which replaces ratio clipping with a divergence proximal constraint. A key observation is that the per-step policy in flow models is Gaussian, enabling exact and cheap computation of the KL divergence between old and new policies. Flow-DPPO employs an asymmetric divergence mask that blocks gradient updates only when they simultaneously move away from the trusted region and violate the divergence threshold.

Experiments show that Flow-DPPO achieves higher rewards with better KL-proximal efficiency, alleviates catastrophic forgetting, promotes balanced multi-objective optimization, and enables stable multi-epoch training where ratio clipping degrades.

Key Problem

The probability ratio that clipping relies on is a noisy, single-sample estimate of policy divergence, and in high dimensions its noise has the same magnitude as the signal.

Our Solution

Compute exact KL divergence leveraging Gaussian policy structure — deterministic, zero additional cost, theoretically grounded.

Why Ratio Clipping Fails in Flow Models

For Gaussian policies in flow models, the log probability ratio decomposes as:

$$\log r_t^i(\theta) = \frac{\varepsilon^\top d}{\sigma} - \frac{\|d\|^2}{2\sigma^2}$$

where $d = \mu_\theta - \mu_{\theta_{\text{old}}}$ is the policy shift and $\varepsilon \sim \mathcal{N}(0, I)$ is the sampling noise.
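The decomposition can be checked numerically. The sketch below assumes both policies are isotropic Gaussians sharing the same $\sigma$ and that the sample is drawn from the old policy as $x = \mu_{\theta_{\text{old}}} + \sigma\varepsilon$; the function names, dimension, and values are illustrative and not taken from the paper's code.

```python
import numpy as np

def log_ratio_direct(x, mu_new, mu_old, sigma):
    """log pi_new(x) - log pi_old(x) for isotropic Gaussians with a shared sigma.
    The normalizing constants cancel because both policies use the same sigma."""
    return (np.sum((x - mu_old) ** 2) - np.sum((x - mu_new) ** 2)) / (2.0 * sigma**2)

def log_ratio_decomposed(eps, d, sigma):
    """Noise term minus signal term: eps^T d / sigma - ||d||^2 / (2 sigma^2)."""
    return eps @ d / sigma - d @ d / (2.0 * sigma**2)

rng = np.random.default_rng(0)
dim, sigma = 16, 0.5                      # illustrative values
mu_old = rng.normal(size=dim)
mu_new = mu_old + 0.05 * rng.normal(size=dim)
eps = rng.normal(size=dim)
x = mu_old + sigma * eps                  # sample drawn from the old policy
d = mu_new - mu_old                       # policy shift

direct = log_ratio_direct(x, mu_new, mu_old, sigma)
decomposed = log_ratio_decomposed(eps, d, sigma)
print(direct, decomposed)                 # the two values agree up to rounding
```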

Critical Insight

The first term $\varepsilon^\top d / \sigma$ is zero-mean noise with standard deviation $\|d\|/\sigma$, while the signal term is $D = \|d\|^2/(2\sigma^2)$. The noise standard deviation therefore equals $\sqrt{2D}$, which is at least as large as the signal whenever $D \le 2$, the regime of small policy shifts. This means clipping decisions are dominated by random noise, not the true policy divergence.
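A small Monte Carlo sketch makes the noise-versus-signal comparison concrete. It fixes a policy shift $d$, samples many $\varepsilon$, and measures how often the ratio exceeds 1 even though the true divergence is identical for every sample. The dimension, $\|d\|$, and sample count are arbitrary illustrative choices.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
dim, sigma, n_samples = 32, 1.0, 100_000   # illustrative values

d = rng.normal(size=dim)
d *= 0.2 / np.linalg.norm(d)               # fix ||d|| = 0.2

signal = d @ d / (2.0 * sigma**2)          # ||d||^2 / (2 sigma^2), the same for every sample
eps = rng.normal(size=(n_samples, dim))    # sampling noise, one row per sample
noise = eps @ d / sigma                    # zero-mean term with std ||d|| / sigma
log_r = noise - signal

noise_std = noise.std()
frac_ratio_above_one = np.mean(log_r > 0.0)
# Closed form: P(Z > ||d|| / (2 sigma)) for a standard normal Z.
predicted = 0.5 * math.erfc(np.linalg.norm(d) / (2.0 * sigma) / math.sqrt(2.0))

print(signal, noise_std, frac_ratio_above_one, predicted)
```

In this setting the noise standard deviation is ten times the signal, and the sign of $\log r$ is close to a coin flip, so a clip on $r$ mostly reacts to which $\varepsilon$ happened to be drawn.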

Ratio Clipping (PPO-style)

  • Noisy single-sample estimate
  • Over-constrains some regions
  • Under-constrains others
  • Degrades with multi-epoch training

Divergence Constraint (Ours)

  • Exact, deterministic KL divergence
  • Uniform trust region enforcement
  • Zero additional computational cost
  • Stable across multiple epochs

Method

01 Exact KL Computation

For Gaussian per-step policies, the KL divergence is computed in closed form:

$$\text{KL}(\pi_{\text{old}} \| \pi_\theta) = \frac{\|\mu_{\theta_{\text{old}}}(x_t, t) - \mu_\theta(x_t, t)\|^2}{2\sigma^2}$$

Both means are already computed during training — zero additional cost.
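As a minimal sketch, the closed form reduces to a squared distance between the two predicted means. The helper name `gaussian_step_kl` and the toy inputs are hypothetical; the sketch assumes both policies share the same scalar $\sigma$ at the given step.

```python
import numpy as np

def gaussian_step_kl(mu_old, mu_new, sigma):
    """KL(N(mu_old, sigma^2 I) || N(mu_new, sigma^2 I)), summed over the last axis.
    With equal covariances the KL reduces to ||mu_old - mu_new||^2 / (2 sigma^2)."""
    diff = np.asarray(mu_old) - np.asarray(mu_new)
    return np.sum(diff**2, axis=-1) / (2.0 * sigma**2)

# Two toy samples: the first mean shifts by (0.3, 0.4), the second does not move.
mu_old = np.array([[0.0, 0.0], [1.0, -1.0]])
mu_new = np.array([[0.3, 0.4], [1.0, -1.0]])
kl = gaussian_step_kl(mu_old, mu_new, sigma=0.5)
print(kl)  # one KL value per sample
```

Unlike the probability ratio, this quantity does not depend on the sampled noise, so it is the same for every rollout that visits the same state.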

02 Asymmetric Divergence Mask

We block gradient updates only when they simultaneously move away from the old policy and exceed the divergence threshold:

$$M_t^i = \begin{cases} 0 & \text{if } (\hat{A}^i > 0 \wedge r_t^i > 1 \wedge D_t > \delta) \\ 0 & \text{if } (\hat{A}^i < 0 \wedge r_t^i < 1 \wedge D_t > \delta) \\ 1 & \text{otherwise} \end{cases}$$

Corrective updates are never blocked, preserving PPO's beneficial asymmetry.
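The mask can be sketched as two boolean conditions. The function name and the toy inputs below are hypothetical; `delta` stands for the threshold $\delta$ and `divergence` for $D_t$.

```python
import numpy as np

def asymmetric_divergence_mask(advantage, ratio, divergence, delta):
    """Return 0 where an update would push further from the old policy
    while the divergence already exceeds delta, and 1 everywhere else."""
    over_budget = divergence > delta
    push_up = (advantage > 0) & (ratio > 1) & over_budget      # raising an already-raised probability
    push_down = (advantage < 0) & (ratio < 1) & over_budget    # lowering an already-lowered probability
    return np.where(push_up | push_down, 0.0, 1.0)

advantage = np.array([1.0, 1.0, 1.0, -1.0, -1.0])
ratio = np.array([1.2, 1.2, 0.8, 0.8, 1.2])
divergence = np.array([0.05, 0.005, 0.05, 0.05, 0.05])
mask = asymmetric_divergence_mask(advantage, ratio, divergence, delta=0.01)
print(mask)
```

In the toy inputs, the third and fifth entries exceed the threshold but are kept because the update moves the ratio back toward 1, which is the corrective case.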

03 Trust Region Guarantee

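The policy improvement bound for flow models lower-bounds the true improvement by a surrogate objective minus a penalty in the maximum TV divergence, so keeping that divergence small supports monotonic improvement: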

$$J(\pi_\theta) - J(\pi_{\theta_{\text{old}}}) \geq L'_{\theta_{\text{old}}}(\pi_\theta) - 2\xi(K\!-\!1)(K\!-\!2)\left[D_{\text{TV}}^{\max}\right]^2$$

A KL constraint upper-bounds the TV divergence via Pinsker's inequality.
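For reference, Pinsker's inequality in its standard form (KL measured in nats) reads as follows. The second inequality assumes the per-step divergence stays within the threshold $\delta$ used by the mask, which is an illustrative assumption rather than a statement from the source:

$$D_{\text{TV}}(\pi_{\text{old}}, \pi_\theta) \leq \sqrt{\tfrac{1}{2}\,\text{KL}(\pi_{\text{old}} \| \pi_\theta)} \leq \sqrt{\tfrac{\delta}{2}}, \qquad \text{hence} \qquad \left[D_{\text{TV}}(\pi_{\text{old}}, \pi_\theta)\right]^2 \leq \tfrac{\delta}{2}$$

Under that assumption, the squared TV term in the improvement bound is controlled linearly by $\delta$.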

Flow-DPPO Objective

$$\mathcal{L}^{\text{Flow-DPPO}}(\theta) = \mathbb{E}\left[\sum_{i}\frac{1}{G}\sum_{t}\frac{1}{T}\left[M_t^i \cdot r_t^i(\theta) \cdot \hat{A}^i - \beta\, \text{KL}(\pi_\theta \| \pi_{\text{ref}})\right]\right]$$
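A minimal NumPy sketch of the bracketed term for a single group of $G$ samples and $T$ steps is shown below. The outer expectation is omitted, and the array shapes, the function name, and the toy values are assumptions made for illustration. In an autodiff framework the mask would presumably be treated as a constant, which is also an assumption here.

```python
import numpy as np

def flow_dppo_objective(ratio, advantage, mask, kl_ref, beta):
    """Objective for one group.

    ratio, mask, kl_ref: arrays of shape (G, T); advantage: array of shape (G,).
    Taking the mean over both axes matches sum_i (1/G) sum_t (1/T).
    """
    per_step = mask * ratio * advantage[:, None] - beta * kl_ref
    return per_step.mean()

ratio = np.array([[1.0, 1.2], [0.9, 1.0]])
advantage = np.array([1.0, -1.0])          # one advantage per sample, shared across steps
mask = np.array([[1.0, 0.0], [1.0, 1.0]])  # the (0, 1) entry is blocked by the divergence mask
kl_ref = np.full((2, 2), 0.01)             # KL to the reference policy at each step
objective = flow_dppo_objective(ratio, advantage, mask, kl_ref, beta=0.1)
print(objective)
```

Masked entries contribute no policy-gradient term but still carry the reference-KL penalty, which follows the placement of $M_t^i$ in the objective above.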

Experimental Results

Performance comparison under multi-reward RL fine-tuning (GDPO with equal weights). Bold = best, underline = second best.

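In the table below, the first four metric columns are In-Domain (GenEval2) and the last three are Out-of-Domain (PickScore).

| Method | GenEval2 ↑ | CLIP ↑ | PickScore ↑ | HPSv2 ↑ | CLIP ↑ | PickScore ↑ | HPSv2 ↑ |
|---|---|---|---|---|---|---|---|
| SD3.5-medium (pretrained: GenEval2 12.4) | | | | | | | |
| Flow-GRPO | 39.9 | 0.358 | 25.09 | 0.399 | 0.273 | 22.07 | 0.349 |
| Flow-CPS | 44.6 | 0.359 | 25.51 | 0.407 | 0.265 | 22.08 | 0.343 |
| GRPO-Guard | 47.8 | 0.353 | 25.64 | 0.409 | 0.272 | 22.32 | 0.354 |
| Diffusion-NFT | 42.5 | 0.334 | 25.30 | 0.394 | 0.269 | 22.52 | 0.355 |
| Flow-DPPO | 48.1 | 0.345 | 25.63 | 0.409 | 0.273 | 22.58 | 0.360 |
| Flow-DPPO + CPS | 51.6 | 0.369 | 25.72 | 0.415 | 0.279 | 22.51 | 0.361 |
| FLUX2-klein-base-9B (pretrained: GenEval2 25.4) | | | | | | | |
| Flow-GRPO | 46.8 | 0.371 | 25.61 | 0.412 | 0.277 | 22.62 | 0.357 |
| Flow-CPS | 47.1 | 0.361 | 25.70 | 0.416 | 0.276 | 22.85 | 0.364 |
| GRPO-Guard | 49.0 | 0.375 | 25.27 | 0.411 | 0.269 | 21.99 | 0.349 |
| Diffusion-NFT | 47.3 | 0.336 | 24.87 | 0.389 | 0.274 | 22.47 | 0.351 |
| Flow-DPPO | 57.7 | 0.364 | 25.76 | 0.418 | 0.282 | 22.90 | 0.368 |
| Flow-DPPO + CPS | 55.2 | 0.386 | 26.15 | 0.427 | 0.287 | 22.97 | 0.370 |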

Distribution drift from pretrained model (KL divergence $\times 10^{-3}$, lower is better). Flow-DPPO achieves significantly less drift while attaining higher rewards.

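| Method | FLUX2-9B Single | FLUX2-9B Multi | FLUX2-9B +CFG | SD3.5 Single | SD3.5 Multi |
|---|---|---|---|---|---|
| Flow-SDE schedule | | | | | |
| Flow-GRPO | 0.77 | 0.79 | 1.36 | 2.34 | 3.81 |
| GRPO-Guard | 1.07 | 1.01 | 1.63 | 2.05 | 3.33 |
| Flow-DPPO | 0.17 | 0.49 | 0.51 | 1.16 | 2.49 |
| CPS schedule | | | | | |
| Flow-CPS | 0.24 | 1.66 | 1.51 | 2.41 | 3.18 |
| Flow-DPPO + CPS | 0.68 | 0.70 | 0.83 | 1.60 | 2.52 |
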
6.3× better KL efficiency on FLUX2-9B (single-reward drift: 0.17 for Flow-DPPO vs 1.07 for GRPO-Guard)
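The headline factor is a plain ratio of the reported single-reward drift values on FLUX2-9B under the Flow-SDE schedule. The short sketch below recomputes it against both ratio-clipping baselines, using only the numbers from the drift table; the variable names are illustrative.

```python
# Single-reward drift on FLUX2-9B, Flow-SDE schedule (KL x 1e-3), as reported in the drift table.
drift = {"Flow-GRPO": 0.77, "GRPO-Guard": 1.07, "Flow-DPPO": 0.17}

# How many times larger each baseline's drift is than Flow-DPPO's.
reduction = {
    name: value / drift["Flow-DPPO"]
    for name, value in drift.items()
    if name != "Flow-DPPO"
}
for name, factor in reduction.items():
    print(f"{name}: {factor:.1f}x the drift of Flow-DPPO")
```

The 6.3× figure corresponds to the comparison with GRPO-Guard; against Flow-GRPO the same calculation gives a smaller factor.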

Flow-DPPO enables stable multi-epoch training (sample reuse), critical for expensive generation scenarios like video. Ratio-clipping methods degrade with multiple inner loops.

Figure panels: multi-epoch training on SD3.5 with the Flow-SDE schedule, and multi-epoch training on SD3.5 with the CPS schedule.

Training configurations:

  • G64-I1: standard setting, 64 groups, 1 inner loop
  • G32-I2: half the rollouts, 2 inner loops (saves compute)
  • G64-I2: full rollouts, 2 inner loops (2× training intensity)

Key Takeaways

  • +17.7% higher reward: Flow-DPPO reaches 57.7 on GenEval2 versus 49.0 for the best baseline (GRPO-Guard) on FLUX2-9B, an absolute gain of 8.7 points; the headline figure is the relative improvement.
  • 6.3× better KL efficiency: significantly less distribution drift from the pretrained model, preserving generation quality.
  • Stable multi-epoch training: enables efficient sample reuse where ratio clipping degrades, which is critical for video generation.
  • Balanced multi-objective optimization: competitive across all metrics simultaneously without reward hacking or catastrophic forgetting.

Citation

BibTeX will be available once the arXiv preprint is released.