Offline Reinforcement Learning with Generative Trajectory Policies

Xinsong Feng, Leshu Tang, Chenan Wang, Haipeng Chen

ICML 2026

Overview

Offline RL needs policies that can model complex, multimodal behavior while staying fast enough for practical action generation. Existing generative policies tend to sit at opposite ends of this trade-off: diffusion policies are expressive but slow, while consistency-style policies are fast but can lose policy quality.

Generative Trajectory Policies (GTP) resolve this trade-off by learning the full solution map of a continuous-time generative ODE. The resulting policy class preserves trajectory information while allowing short, deterministic sampling paths at inference time.

Contributions

A unified ODE view that connects diffusion, flow matching, consistency models, and consistency trajectory models as trajectory-learning policies.
Generative Trajectory Policies, which learn the ODE solution map directly instead of choosing between slow iterative sampling and brittle one-step shortcuts.
Two offline RL adaptations: stable score approximation for efficient supervision, and advantage-weighted generative training for value-guided policy improvement.

Highlights

89.0

D4RL Gym average

Highest average among the methods in the paper's main offline RL comparison.

80.6

D4RL AntMaze average

Large gain on sparse-reward, long-horizon navigation tasks.

100

AntMaze U-Maze

Perfect normalized score in the full policy-improvement setting.

0.67 ms

2-step inference

Short-horizon GTP keeps strong returns with lower latency.

Abstract

Generative models have emerged as a powerful class of policies for offline reinforcement learning (RL) due to their ability to capture complex, multi-modal behaviors. However, existing methods face a stark trade-off: slow, iterative models like diffusion policies are computationally expensive, while fast, single-step models like consistency policies often suffer from degraded performance. In this paper, we demonstrate that it is possible to bridge this gap. The key to moving beyond the limitations of individual methods, we argue, lies in a unifying perspective that views modern generative models—including diffusion, flow matching, and consistency models—as specific instances of learning a continuous-time generative trajectory governed by an Ordinary Differential Equation (ODE). This principled foundation provides a clearer design space for generative policies in RL and allows us to propose Generative Trajectory Policies (GTPs), a new and more general policy paradigm that learns the entire solution map of the underlying ODE. To make this paradigm practical for offline RL, we further introduce two key theoretically principled adaptations. Empirical results demonstrate that GTP achieves state-of-the-art performance on D4RL benchmarks — it significantly outperforms prior generative policies, achieving perfect scores on several notoriously hard AntMaze tasks.

Method

GTP is built around three choices: learn the trajectory map directly, stabilize supervision with a data-anchored score approximation, and guide the generative objective with critic advantages.

Core 1

Learn a trajectory-level policy

\Phi(x_t,t,s)=x_t+\int_t^s f(x_\tau,\tau)\,d\tau=x_s

\Phi_\theta(x_t,t,s)\approx \Phi(x_t,t,s)

GTP learns the ODE solution map itself. This lets the policy preserve trajectory information while still generating actions with only a few learned jumps.
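To make the solution map concrete, the sketch below uses a toy ODE whose map is known in closed form. It is an illustration under stated assumptions, not the paper's implementation: the drift f(x, tau) = (x - a) / tau, the fixed target `A_TARGET`, and all function names are hypothetical. For this drift, \Phi(x_t,t,s) = a + (s/t)(x_t - a), so the code can check that one direct jump, two composed jumps, and a many-step Euler integration all reach the same point.

```python
import numpy as np

A_TARGET = np.array([0.5, -1.0])  # hypothetical target action, chosen for illustration


def drift(x, tau):
    """Toy drift f(x, tau) = (x - a) / tau; an assumption, not a learned field."""
    return (x - A_TARGET) / tau


def solution_map(x_t, t, s):
    """Closed-form Phi(x_t, t, s) for the toy drift above."""
    return A_TARGET + (s / t) * (x_t - A_TARGET)


def euler_solve(x_t, t, s, n_steps=200):
    """Approximate x_t + int_t^s f(x_tau, tau) d tau with small Euler steps."""
    taus = np.linspace(t, s, n_steps + 1)
    x = np.array(x_t, dtype=float)
    for tau0, tau1 in zip(taus[:-1], taus[1:]):
        x = x + (tau1 - tau0) * drift(x, tau0)
    return x


x_start = np.array([2.0, 3.0])  # noisy starting point at time t = 1.0

# One direct jump from t = 1.0 to s = 0.1.
one_jump = solution_map(x_start, 1.0, 0.1)

# Two shorter jumps through an intermediate time; they compose to the same point.
two_jumps = solution_map(solution_map(x_start, 1.0, 0.4), 0.4, 0.1)

# Iterative integration of the same ODE, the slow route a solution map replaces.
slow = euler_solve(x_start, 1.0, 0.1)

print(one_jump, two_jumps, slow)
```

In this toy setting a learned \Phi_\theta would stand in for `solution_map`, so sampling needs only as many network calls as there are jumps rather than one call per integration step.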

Core 2

Use score approximation for stable supervision

a_t=a+t z,\qquad \tilde f(a_t,t)=\frac{a_t-a}{t}

Training a full trajectory map from scratch can become self-referential and expensive. GTP avoids repeated inner-loop ODE solves by anchoring the training signal to offline data through a closed-form score surrogate.
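The closed-form surrogate above can be sketched in a few lines. This is a minimal illustration, assuming Gaussian noise z and made-up action values; `jump_target` is a hypothetical helper that takes a single Euler step along the surrogate drift, and the way the paper turns this signal into a training loss is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)


def noised_action(a, t, z):
    """Perturb a dataset action: a_t = a + t * z."""
    return a + t * z


def approx_drift(a_t, a, t):
    """Data-anchored surrogate f~(a_t, t) = (a_t - a) / t."""
    return (a_t - a) / t


def jump_target(a_t, a, t, s):
    """Hypothetical regression target for Phi_theta(a_t, t, s): one Euler step
    from time t to time s along the surrogate drift (illustrative assumption)."""
    return a_t + (s - t) * approx_drift(a_t, a, t)


a = np.array([0.2, -0.7, 0.4])     # dataset action (made-up values)
z = rng.standard_normal(a.shape)   # noise sample (Gaussian assumed)
t, s = 0.8, 0.3

a_t = noised_action(a, t, z)
f_tilde = approx_drift(a_t, a, t)
target = jump_target(a_t, a, t, s)

print(np.allclose(f_tilde, z))           # the surrogate recovers the noise direction
print(np.allclose(target, a + s * z))    # the step lands on the same path at time s
```

Because the surrogate depends only on the dataset action and the sampled noise, no inner ODE solve is needed to produce a supervision signal for a jump from t to s.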

Core 3

Turn imitation into value-guided improvement

w(s,a)=\exp\left(\eta\frac{\max(0,A(s,a))}{\operatorname{std}(A)+\epsilon}\right)

The actor still learns from dataset actions, but high-advantage actions receive larger generative weights. This keeps policy improvement data-supported instead of relying on unstable unconstrained Q maximization.
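The weight formula above translates directly into code. The sketch below is a hedged illustration: the values of `eta` and `eps`, the use of the batch standard deviation for std(A), and the `weighted_loss` helper that averages weighted per-sample losses are all assumptions rather than details confirmed by this page.

```python
import numpy as np


def advantage_weights(advantages, eta=1.0, eps=1e-8):
    """w = exp(eta * max(0, A) / (std(A) + eps)), with std taken over the batch."""
    adv = np.asarray(advantages, dtype=float)
    scaled = np.maximum(0.0, adv) / (adv.std() + eps)
    return np.exp(eta * scaled)


def weighted_loss(per_sample_loss, weights):
    """Hypothetical reduction: mean of weight * per-sample generative loss."""
    return float(np.mean(weights * np.asarray(per_sample_loss, dtype=float)))


advantages = np.array([-2.0, 0.0, 1.0, 3.0])  # made-up critic advantages
weights = advantage_weights(advantages, eta=1.0)
loss = weighted_loss(np.ones_like(advantages), weights)

print(weights)
print(loss)
```

Actions with non-positive advantage keep a weight of exactly 1, so they are still imitated, while positive-advantage actions are up-weighted exponentially. Dividing by the standard deviation keeps the exponent on a comparable scale across datasets with different reward magnitudes.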

Figure: Stable score approximation diagram for GTP training.

A data-anchored approximate score provides stable trajectory supervision without repeatedly solving the ODE.

Figure: Value-driven guidance diagram for GTP policy improvement.

Advantage weighting shifts the learned trajectory toward higher-value actions while staying aligned with data.

Results

Setting	GTP	Other models	Takeaway
Behavior cloning
Gym	82.3	D-BC 76.3, C-BC 69.7	Best average behavior-cloning policy across the locomotion suite.
AntMaze	66.3	D-BC 41.2, C-BC 44.1	Much stronger imitation on multimodal, long-horizon datasets.
Policy improvement
Gym	89.0	D-QL 87.9, BDM 87.3, QGPO 86.6	Highest average return in the main D4RL offline RL comparison.
AntMaze	80.6	QGPO 78.3, IDQL-A 79.1, D-QL 69.6	Best average on the sparse-reward AntMaze suite.

Scores are normalized D4RL averages reported in the paper. Full per-task results, ablations, and additional baselines are in the PDF.
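To read the table at a glance, the short script below computes GTP's margin over the strongest baseline in each row. It uses only the averages listed on this page, so the margins cover these baselines and not the additional ones in the PDF; the dictionary layout is just an illustrative choice.

```python
# Normalized D4RL averages copied from the table above: (GTP score, baselines).
results = {
    ("Behavior cloning", "Gym"): (82.3, {"D-BC": 76.3, "C-BC": 69.7}),
    ("Behavior cloning", "AntMaze"): (66.3, {"D-BC": 41.2, "C-BC": 44.1}),
    ("Policy improvement", "Gym"): (89.0, {"D-QL": 87.9, "BDM": 87.3, "QGPO": 86.6}),
    ("Policy improvement", "AntMaze"): (80.6, {"QGPO": 78.3, "IDQL-A": 79.1, "D-QL": 69.6}),
}

margins = {}
for key, (gtp_score, baselines) in results.items():
    best_name = max(baselines, key=baselines.get)  # strongest listed baseline
    margins[key] = (best_name, round(gtp_score - baselines[best_name], 1))

for (setting, suite), (name, margin) in margins.items():
    print(f"{setting} / {suite}: +{margin} over {name}")
```

The margins are largest in the behavior-cloning rows and narrower in the policy-improvement rows, where the listed baselines already use value guidance.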

Policy Visualization

A 2D multi-goal environment gives a qualitative view of policy expressiveness. The target dataset is multimodal, and the learned policy should preserve all goal-reaching modes rather than collapsing to a single direction.

Figure: Original multi-goal dataset trajectories.

The behavior data contains four clear modes toward the four goals.

Figure: Diffusion-QL trajectories in the multi-goal environment.

Diffusion-style sampling covers the modes, but trajectories are visibly noisy.

Figure: Consistency-AC trajectories in the multi-goal environment.

Fast consistency sampling can distort the multimodal structure.

Figure: GTP trajectories in the multi-goal environment.

GTP preserves all four goal-reaching modes with compact trajectories.

The key point is mode preservation: in this toy environment, a strong generative policy should not collapse to one goal or smear probability mass across the space. GTP keeps the four-way structure while avoiding long iterative diffusion sampling.

Additional Evidence

Sampling steps

K=2

The two-step variant achieves a Gym average of 88.7 versus 89.0 for the five-step variant, showing that GTP does not rely on long ODE rollouts.

Compute profile

0.67 ms

The K = 2 variant reports 3.8 hours of training and 0.67 ms inference, compared with 7.1 hours and 1.16 ms for Diffusion-QL under the matched setup.
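The relative savings implied by these figures can be worked out directly. The snippet below is plain arithmetic on the four numbers quoted above; the variable names are illustrative.

```python
# Figures quoted above for the matched setup.
gtp = {"train_hours": 3.8, "inference_ms": 0.67}            # GTP, K = 2
diffusion_ql = {"train_hours": 7.1, "inference_ms": 1.16}   # Diffusion-QL

inference_speedup = diffusion_ql["inference_ms"] / gtp["inference_ms"]
training_speedup = diffusion_ql["train_hours"] / gtp["train_hours"]

latency_reduction_pct = 100 * (1 - gtp["inference_ms"] / diffusion_ql["inference_ms"])
training_reduction_pct = 100 * (1 - gtp["train_hours"] / diffusion_ql["train_hours"])

print(f"inference: {inference_speedup:.2f}x faster ({latency_reduction_pct:.1f}% lower latency)")
print(f"training:  {training_speedup:.2f}x faster ({training_reduction_pct:.1f}% less time)")
```

On these numbers the two-step variant is roughly 1.7 times faster at inference and about 1.9 times faster to train than Diffusion-QL.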

Beyond D4RL

Additional OGBench and visual-observation experiments show that the same policy formulation can extend beyond low-dimensional state inputs.

Citation

@inproceedings{feng2026offline,
  title = {Offline Reinforcement Learning with Generative Trajectory Policies},
  author = {Feng, Xinsong and Tang, Leshu and Wang, Chenan and Chen, Haipeng},
  booktitle = {International Conference on Machine Learning},
  year = {2026}
}

Xinsong Feng (冯欣淞)