Paper Notes - Diffusion Policies creating a Trust Region for Offline Reinforcement Learning

This paper proposes training a one-step policy guided by a diffusion model via a trust region loss, enabling fast inference without iterative sampling.
  • We introduce a dual policy approach, Diffusion Trusted Q-Learning (DTQL), which comprises a diffusion policy for pure behavior cloning and a practical one-step policy.
  • We bridge the two policies with a newly introduced diffusion trust region loss.
  • The diffusion policy maintains expressiveness, while the trust region loss directs the one-step policy to explore freely and seek modes within the region defined by the diffusion policy.

1 Introduction

  • Behavior-regularized policy optimization techniques are employed to constrain the divergence between the learned and in-sample policies during training
  • Distilling the iterative denoising process of diffusion models into a one-step generator.

This paper

  • introduces a diffusion trust region loss that moves away from distribution matching and instead emphasizes establishing a safe, in-sample behavior region.
  • simultaneously trains two cooperative policies: a diffusion policy for pure behavior cloning and a one-step policy for actual deployment.
  • The one-step policy is optimized based on two objectives: the diffusion trust region loss, which ensures safe policy exploration, and the maximization of the Q-value function, guiding the policy to generate actions in high-reward regions.

2 Diffusion Trusted Q-Learning

Diffusion Policy

This paper trains a diffusion model but avoids using it for inference, significantly reducing both training and inference times.

The objective function of the diffusion model aims to train a predictor for denoising noisy samples back to clean samples, represented by the optimization problem:

$$\min_\phi\ \mathbb{E}_{t,\boldsymbol{x}_0,\boldsymbol{\varepsilon}\sim\mathcal{N}(0,\boldsymbol{I})}\left[w(t)\|\mu_\phi(\boldsymbol{x}_t,t)-\boldsymbol{x}_0\|_2^2\right]$$

where $w(t)$ is a weighting function that depends only on $t$.

In offline RL, the training data consists of state-action pairs, so we train a diffusion policy using a conditional diffusion model:

$$\mathcal{L}(\phi)=\mathbb{E}_{t,\boldsymbol{\varepsilon}\sim\mathcal{N}(0,\boldsymbol{I}),(\boldsymbol{a}_0,\boldsymbol{s})\sim\mathcal{D}}\left[w(t)\|\mu_\phi(\boldsymbol{a}_t,t\mid\boldsymbol{s})-\boldsymbol{a}_0\|_2^2\right]$$

where $\boldsymbol{a}_0,\boldsymbol{s}$ are the action and state samples from the offline dataset $\mathcal{D}$, and $\boldsymbol{a}_t=\alpha_t\boldsymbol{a}_0+\sigma_t\boldsymbol{\varepsilon}$.
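A minimal sketch of this behavior-cloning objective is given below. It assumes a hypothetical denoiser network `mu_phi(a_t, t, s)` that predicts the clean action, plus schedule helpers `alpha(t)`, `sigma(t)`, and `weight(t)` that are not specified by these notes.

```python
import torch

def diffusion_bc_loss(mu_phi, states, actions, alpha, sigma, weight):
    """Conditional denoising loss L(phi) (Eq. 2); all helper names are assumptions."""
    t = torch.rand(actions.shape[0], device=actions.device)       # t ~ U(0, 1)
    eps = torch.randn_like(actions)                                # eps ~ N(0, I)
    a_t = alpha(t)[:, None] * actions + sigma(t)[:, None] * eps    # a_t = alpha_t a_0 + sigma_t eps
    a0_pred = mu_phi(a_t, t, states)                               # predict the clean action a_0
    return (weight(t) * ((a0_pred - actions) ** 2).sum(-1)).mean()
```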

The ELBO Objective. The ELBO for continuous-time diffusion models can be simplified to the following expression (adopted in our setting):

$$\log p(\boldsymbol{a}_0\mid\boldsymbol{s})\geq\mathrm{ELBO}(\boldsymbol{a}_0\mid\boldsymbol{s})=-\frac{1}{2}\mathbb{E}_{t\sim\mathcal{U}(0,1),\boldsymbol{\varepsilon}\sim\mathcal{N}(0,\boldsymbol{I})}\left[w(t)\|\mu_\phi(\boldsymbol{a}_t,t\mid\boldsymbol{s})-\boldsymbol{a}_0\|_2^2\right]+c,$$

where $\boldsymbol{a}_t=\alpha_t\boldsymbol{a}_0+\sigma_t\boldsymbol{\varepsilon}$, $w(t)=-\frac{\mathrm{d}\,\mathrm{SNR}(t)}{\mathrm{d}t}$, the signal-to-noise ratio is $\mathrm{SNR}(t)=\frac{\alpha_t^2}{\sigma_t^2}$, and $c$ is a constant that does not depend on $\phi$.

  • Since $\mathrm{SNR}(t)$ is assumed to be strictly monotonically decreasing in $t$, it follows that $w(t)>0$ (a small numerical check of this appears after the list). The validity of the ELBO is maintained regardless of the schedule of $\alpha_t$ and $\sigma_t$.
  • Kingma and Gao [2024] generalized this result: if the weighting function is $w(t)=-v(t)\frac{\mathrm{d}\,\mathrm{SNR}(t)}{\mathrm{d}t}$, where $v(t)$ is a monotonically increasing function of $t$, then the weighted diffusion denoising loss remains equivalent to the ELBO as defined in Equation 3.
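As a quick sanity check of the $w(t)>0$ claim, here is a tiny sketch that computes $w(t)=-\frac{\mathrm{d}\,\mathrm{SNR}(t)}{\mathrm{d}t}$ by automatic differentiation. The cosine schedule $\alpha_t=\cos(\pi t/2)$, $\sigma_t=\sin(\pi t/2)$ is my own choice for illustration, not the EDM schedule used by the paper.

```python
import torch

def snr(t):
    # Cosine schedule (an assumption): alpha_t = cos(pi t / 2), sigma_t = sin(pi t / 2)
    return (torch.cos(0.5 * torch.pi * t) / torch.sin(0.5 * torch.pi * t)) ** 2

def weight(t):
    t = t.clone().requires_grad_(True)
    (dsnr_dt,) = torch.autograd.grad(snr(t).sum(), t)
    return -dsnr_dt  # positive, since SNR(t) is strictly decreasing in t

print(weight(torch.tensor([0.2, 0.5, 0.8])))  # all entries > 0
```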

Diffusion Trust Region Loss

For any given $\boldsymbol{s}$ and a fixed diffusion model $\mu_\phi$, the goal is to find the one-step generation policy $\pi_\theta(\cdot\mid\boldsymbol{s})$ that minimizes the diffusion-based trust region (TR) loss:

$$\mathcal{L}_{\mathrm{TR}}(\theta)=\mathbb{E}_{t,\boldsymbol{\varepsilon}\sim\mathcal{N}(0,\boldsymbol{I}),\boldsymbol{s}\sim\mathcal{D},\boldsymbol{a}_\theta\sim\pi_\theta(\cdot\mid\boldsymbol{s})}\left[w(t)\|\mu_\phi(\alpha_t\boldsymbol{a}_\theta+\sigma_t\boldsymbol{\varepsilon},t\mid\boldsymbol{s})-\boldsymbol{a}_\theta\|_2^2\right],$$

where $\pi_\theta(\boldsymbol{a}\mid\boldsymbol{s})$ is a one-step generation policy, such as a Gaussian policy.
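A minimal sketch of this loss, reusing the assumed helpers from the earlier snippet (`mu_phi`, `alpha`, `sigma`, `weight`) plus a one-step policy callable `pi_theta(s)`. Only $\theta$ is meant to be updated here, with $\mu_\phi$ held fixed.

```python
import torch

def trust_region_loss(mu_phi, pi_theta, states, alpha, sigma, weight):
    """Diffusion trust region loss L_TR (Eq. 4); phi is frozen, only theta is trained."""
    a_theta = pi_theta(states)                                    # a_theta ~ pi_theta(.|s)
    t = torch.rand(a_theta.shape[0], device=a_theta.device)
    eps = torch.randn_like(a_theta)
    a_t = alpha(t)[:, None] * a_theta + sigma(t)[:, None] * eps   # perturb the generated action
    a0_pred = mu_phi(a_t, t, states)                              # frozen denoiser reconstructs it
    # Penalizes actions the diffusion policy cannot reconstruct, i.e. out-of-region actions.
    return (weight(t) * ((a0_pred - a_theta) ** 2).sum(-1)).mean()
```

Because the gradient flows through both the noised input and the reconstruction target, the one-step policy is pushed toward actions the behavior-cloning diffusion model assigns high density.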


Theorem 1. If the diffusion policy $\mu_\phi$ satisfies the ELBO condition of Equation 3, then minimizing the diffusion trust region loss maximizes a lower bound on the distribution mode $\max_{\boldsymbol{a}_0}\log p(\boldsymbol{a}_0\mid\boldsymbol{s})$ for any given $\boldsymbol{s}$.

Proof. For any given state $\boldsymbol{s}$,

$$\begin{aligned}
\max_{\boldsymbol{a}_0}\log p(\boldsymbol{a}_0\mid\boldsymbol{s})
&\geq\max_\theta\mathbb{E}_{\boldsymbol{a}_\theta\sim\pi_\theta(\cdot\mid\boldsymbol{s})}\left[\log p(\boldsymbol{a}_\theta\mid\boldsymbol{s})\right]
\geq\max_\theta\mathbb{E}_{\boldsymbol{a}_\theta\sim\pi_\theta(\cdot\mid\boldsymbol{s})}\left[\mathrm{ELBO}(\boldsymbol{a}_\theta\mid\boldsymbol{s})\right]\\
&=-\min_\theta\frac{1}{2}\mathbb{E}_{t\sim\mathcal{U}(0,1),\boldsymbol{\varepsilon}\sim\mathcal{N}(0,\boldsymbol{I}),\boldsymbol{a}_\theta\sim\pi_\theta(\cdot\mid\boldsymbol{s})}\left[w(t)\|\mu_\phi(\alpha_t\boldsymbol{a}_\theta+\sigma_t\boldsymbol{\varepsilon},t\mid\boldsymbol{s})-\boldsymbol{a}_\theta\|_2^2\right]+c
\end{aligned}$$

Then, during training, we consider all states $\boldsymbol{s}$ in $\mathcal{D}$. Thus, by taking the expectation over $\boldsymbol{s}\sim\mathcal{D}$ on both sides and setting $t\sim\mathcal{U}(0,1)$, we derive the loss described in Equation 4.


Unlike other diffusion models, which generate diverse modes by optimizing $\phi$ to learn the data distribution, our method optimizes $\theta$ so that the generated actions lie in the high-density region of the data manifold specified by $\mu_\phi$.

The idea is similar to Langevin Dynamics.

Thus, the loss effectively creates a trust region defined by the diffusion-based behavior-cloning policy, within which the one-step policy πθ\pi_{\theta} can move freely. If the generated action deviates significantly from this trust region, it will be heavily penalized.


Remark 2. This loss is also closely connected to Diffusion-GAN and EB-GAN, where the discriminator loss is:

$$D(\boldsymbol{a}_\theta\mid\boldsymbol{s})=\|\mathrm{Dec}(\mathrm{Enc}(\boldsymbol{a}_\theta)\mid\boldsymbol{s})-\boldsymbol{a}_\theta\|_2^2$$

In our model, the noising process $\alpha_t\boldsymbol{a}_\theta+\sigma_t\boldsymbol{\varepsilon}$ functions as an encoder, and $\mu_\phi(\cdot\mid\boldsymbol{s})$ acts as a decoder. Thus, this loss can also be viewed as a discriminator loss that determines whether the generated action $\boldsymbol{a}_\theta$ resembles the training data.


This approach makes the generated action $\boldsymbol{a}_\theta$ resemble in-sample actions and penalizes those that differ, thereby enforcing behavior regularization.

Diffusion Trusted Q-Learning

  • Introduce a dual-policy approach Diffusion Trusted Q-Learning (DTQL): a diffusion policy for pure behavior cloning and a one-step policy for actual deployment.
  • Bridge the two policies through our newly introduced diffusion trust region loss.
  • The trust region loss is optimized efficiently through each diffusion timestep without requiring the inference of the diffusion policy.
  • DTQL not only maintains an expressive exploration region but also facilitates efficient optimization.

Policy Learning

  • We utilize an unlimited number of timesteps and construct the diffusion policy $\mu_\phi$ in a continuous-time setting, based on the schedule outlined in EDM.
  • We instantiate the one-step policy $\pi_\theta(\boldsymbol{a}\mid\boldsymbol{s})$ in two ways: a Gaussian policy $\pi_\theta(\boldsymbol{a}\mid\boldsymbol{s})=\mathcal{N}(\mu_\theta(\boldsymbol{s}),\sigma_\theta(\boldsymbol{s}))$, or an implicit policy $\boldsymbol{a}_\theta=\pi_\theta(\boldsymbol{s},\boldsymbol{\varepsilon})$ with $\boldsymbol{\varepsilon}\sim\mathcal{N}(0,\boldsymbol{I})$.
  • Then, we optimize $\pi_\theta$ by minimizing the introduced diffusion trust region loss together with the usual Q-value maximization, as follows (a code sketch of this objective appears after the list).
$$\mathcal{L}_\pi(\theta)=\alpha\cdot\mathcal{L}_{\mathrm{TR}}(\theta)-\mathbb{E}_{\boldsymbol{s}\sim\mathcal{D},\boldsymbol{a}_\theta\sim\pi_\theta(\boldsymbol{a}\mid\boldsymbol{s})}\left[Q_\eta(\boldsymbol{s},\boldsymbol{a}_\theta)\right],$$

where $\mathcal{L}_{\mathrm{TR}}(\theta)$ serves primarily as a behavior-regularization term, and maximizing the Q-value function enables the model to preferentially sample actions associated with higher values.

  • Here we use the double Q-learning trick, where $Q_\eta(\boldsymbol{s},\boldsymbol{a}_\theta)=\min(Q_{\eta_1}(\boldsymbol{s},\boldsymbol{a}_\theta),Q_{\eta_2}(\boldsymbol{s},\boldsymbol{a}_\theta))$.
  • If a Gaussian policy is employed, an entropy term $-\mathbb{E}_{\boldsymbol{s},\boldsymbol{a}\sim\mathcal{D}}[\log\pi_\theta(\boldsymbol{a}\mid\boldsymbol{s})]$ is added to maintain an exploratory nature during training.
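A hedged sketch of the combined objective in Eq. 5, assuming the `trust_region_loss` sketch above, twin critics `q1`/`q2`, and an implicit one-step policy; `alpha_coef` stands for the regularization weight $\alpha$. For a Gaussian policy the entropy term mentioned above would be subtracted as well.

```python
import torch

def policy_loss(mu_phi, pi_theta, q1, q2, states, alpha, sigma, weight, alpha_coef):
    """One-step policy objective L_pi (Eq. 5): trust region term minus Q-value term."""
    l_tr = trust_region_loss(mu_phi, pi_theta, states, alpha, sigma, weight)
    a_theta = pi_theta(states)
    q_min = torch.min(q1(states, a_theta), q2(states, a_theta))   # double Q-learning trick
    return alpha_coef * l_tr - q_min.mean()
```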

Q-Learning. We utilize Implicit Q-Learning (IQL) to train the Q function, maintaining two Q-functions $(Q_{\eta_1},Q_{\eta_2})$ and one value function $V_\psi$. The loss for the value function $V_\psi$ is defined as:

$$\mathcal{L}_V(\psi)=\mathbb{E}_{(\boldsymbol{s},\boldsymbol{a})\sim\mathcal{D}}\left[L_2^\tau\left(\min(Q_{\eta_1'}(\boldsymbol{s},\boldsymbol{a}),Q_{\eta_2'}(\boldsymbol{s},\boldsymbol{a}))-V_\psi(\boldsymbol{s})\right)\right],$$

where $\tau\in[0,1]$ is a quantile and $L_2^\tau(u)=|\tau-\mathbf{1}(u<0)|u^2$. When $\tau=0.5$, $L_2^\tau$ simplifies to the $L_2$ loss; when $\tau>0.5$, $\mathcal{L}_V$ encourages learning the $\tau$-quantile values of $Q$. The loss function for updating the Q-functions $Q_{\eta_i}$ is given by:

$$\mathcal{L}_Q(\eta_i)=\mathbb{E}_{(\boldsymbol{s},\boldsymbol{a},\boldsymbol{s}')\sim\mathcal{D}}\left[\|r(\boldsymbol{s},\boldsymbol{a})+\gamma V_\psi(\boldsymbol{s}')-Q_{\eta_i}(\boldsymbol{s},\boldsymbol{a})\|^2\right],$$

where $\gamma$ denotes the discount factor. This setup minimizes the error between the predicted Q-values and the targets derived from the value function $V_\psi$ and the rewards. A sketch of these two losses is given below, followed by the full algorithm.
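A small sketch of the value and critic losses, assuming a value net `v_psi`, critics `q1`/`q2`, and target critics `q1_t`/`q2_t` (hypothetical names):

```python
import torch

def l2_tau_loss(u, tau):
    """Asymmetric loss L_2^tau(u) = |tau - 1(u < 0)| * u^2, averaged over the batch."""
    return (torch.abs(tau - (u < 0).float()) * u ** 2).mean()

def value_loss(v_psi, q1_t, q2_t, states, actions, tau=0.7):
    """Eq. 6: fit V_psi to min(Q_eta1', Q_eta2') under the asymmetric L_2^tau loss."""
    with torch.no_grad():
        q_min = torch.min(q1_t(states, actions), q2_t(states, actions))
    return l2_tau_loss(q_min - v_psi(states), tau)

def q_loss(q, v_psi, states, actions, rewards, next_states, gamma=0.99):
    """Eq. 7: regress Q_eta_i onto r(s, a) + gamma * V_psi(s')."""
    with torch.no_grad():
        target = rewards + gamma * v_psi(next_states)
    return ((q(states, actions) - target) ** 2).mean()
```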


Algorithm 1 Diffusion Trusted Q-Learning

Initialize policy networks $\pi_\theta,\mu_\phi$, critic networks $Q_{\eta_1}$ and $Q_{\eta_2}$, target networks $Q_{\eta_1'}$ and $Q_{\eta_2'}$, and value function $V_\psi$.

for each iteration do
Sample a transition mini-batch $\mathcal{B}=\{(\boldsymbol{s}_t,\boldsymbol{a}_t,r_t,\boldsymbol{s}_{t+1})\}\sim\mathcal{D}$.

  1. Q-value function learning: Update $Q_{\eta_1}, Q_{\eta_2}$ and $V_\psi$ by $\mathcal{L}_Q$ and $\mathcal{L}_V$ (Eqs. 6 and 7).
  2. Diffusion policy learning: Update $\mu_\phi$ by $\mathcal{L}(\phi)$ (Eq. 2).
  3. Diffusion trust region policy learning: Sample $\boldsymbol{a}_\theta\sim\pi_\theta(\boldsymbol{a}\mid\boldsymbol{s})$ and update $\pi_\theta$ by $\mathcal{L}_\pi(\theta)$ (Eq. 5).
  4. Update target networks: $\eta_i'=\rho\eta_i'+(1-\rho)\eta_i$ for $i\in\{1,2\}$.
end for
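Putting the pieces together, one possible shape of the training loop in Algorithm 1, sketched under the same assumptions as the earlier snippets; the optimizers `v_opt`, `q_opt`, `phi_opt`, `theta_opt`, the `dataset` iterator, and hyperparameters `alpha_coef`, `rho` are assumed to be defined elsewhere.

```python
import torch

for s, a, r, s_next in dataset:                       # mini-batch B = {(s_t, a_t, r_t, s_{t+1})}
    # 1. Q-value function learning (Eqs. 6 and 7)
    v_opt.zero_grad()
    value_loss(v_psi, q1_t, q2_t, s, a).backward()
    v_opt.step()

    q_opt.zero_grad()
    (q_loss(q1, v_psi, s, a, r, s_next) + q_loss(q2, v_psi, s, a, r, s_next)).backward()
    q_opt.step()

    # 2. Diffusion policy learning (Eq. 2)
    phi_opt.zero_grad()
    diffusion_bc_loss(mu_phi, s, a, alpha, sigma, weight).backward()
    phi_opt.step()

    # 3. Diffusion trust region policy learning (Eq. 5); theta_opt holds only pi_theta's parameters
    theta_opt.zero_grad()
    policy_loss(mu_phi, pi_theta, q1, q2, s, alpha, sigma, weight, alpha_coef).backward()
    theta_opt.step()

    # 4. Soft update of the target critics: eta' <- rho * eta' + (1 - rho) * eta
    with torch.no_grad():
        for q_target, q_online in ((q1_t, q1), (q2_t, q2)):
            for p_t, p in zip(q_target.parameters(), q_online.parameters()):
                p_t.mul_(rho).add_((1 - rho) * p)
```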

3 Mode-Seeking Behavior Regularization Comparison

Another approach to accelerate training and inference in diffusion-based policy learning involves utilizing distillation techniques.

  • using a trained diffusion model alongside another diffusion network to minimize the KL divergence between the two models.

In our experimental setup, this strategy is employed for behavior regularization by

$$\mathcal{L}_{\mathrm{KL}}(\theta)=D_{\mathrm{KL}}\left[\pi_\theta(\cdot\mid\boldsymbol{s})\,\|\,\mu_\phi(\cdot\mid\boldsymbol{s})\right]=\mathbb{E}_{\boldsymbol{\varepsilon}\sim\mathcal{N}(0,\boldsymbol{I}),\boldsymbol{s}\sim\mathcal{D},\boldsymbol{a}_\theta=\pi_\theta(\boldsymbol{s},\boldsymbol{\varepsilon})}\left[\log\frac{p_{\mathrm{fake}}(\boldsymbol{a}_\theta\mid\boldsymbol{s})}{p_{\mathrm{real}}(\boldsymbol{a}_\theta\mid\boldsymbol{s})}\right]$$

where $\pi_\theta(\boldsymbol{s},\boldsymbol{\varepsilon})$ is instantiated as a one-step implicit policy.

As we do not have access to the log densities of the fake and real conditional action distributions, the loss itself cannot be computed directly; however, its gradients can be estimated. The gradient of $\log p_{\mathrm{real}}(\boldsymbol{a}_\theta\mid\boldsymbol{s})$ can be estimated by the diffusion model $\mu_\phi(\cdot\mid\boldsymbol{s})$, and the gradient of $\log p_{\mathrm{fake}}(\boldsymbol{a}_\theta\mid\boldsymbol{s})$ can be estimated by an auxiliary diffusion model trained on the fake action data $\boldsymbol{a}_\theta$.
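Below is a rough, assumption-heavy sketch of how such a gradient estimator could be wired up in the style of diffusion-distillation methods. `score_real` and `score_fake` are hypothetical networks returning $\nabla_{\boldsymbol{a}}\log p_{\mathrm{real}}$ and $\nabla_{\boldsymbol{a}}\log p_{\mathrm{fake}}$ at a noised action; the surrogate simply propagates their difference into $\theta$, and the exact estimator used for this baseline may differ.

```python
import torch

def kl_policy_grad_surrogate(pi_theta, score_real, score_fake, states, act_dim, alpha, sigma):
    """Surrogate loss whose theta-gradient approximates the gradient of KL[pi_theta || mu_phi]."""
    eps = torch.randn(states.shape[0], act_dim, device=states.device)
    a_theta = pi_theta(states, eps)                               # implicit one-step policy
    t = torch.rand(states.shape[0], device=states.device)
    a_t = alpha(t)[:, None] * a_theta + sigma(t)[:, None] * torch.randn_like(a_theta)
    with torch.no_grad():
        g = score_fake(a_t, t, states) - score_real(a_t, t, states)   # grad log p_fake - grad log p_real
    # d(loss)/d(theta) = g * d(a_theta)/d(theta), a descent direction for the KL
    return (g * a_theta).sum(-1).mean()
```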