- We introduce a dual policy approach, Diffusion Trusted Q-Learning (DTQL), which comprises a diffusion policy for pure behavior cloning and a practical one-step policy.
- We bridge the two policies with a newly introduced diffusion trust region loss.
- The diffusion policy maintains expressiveness, while the trust region loss directs the one-step policy to explore freely and seek modes within the region defined by the diffusion policy.
1 Introduction
- Behavior-regularized policy optimization techniques constrain the divergence between the learned policy and the in-sample behavior policy during training.
- Distillation techniques compress the iterative denoising process of diffusion models into a one-step generator.
This paper
- introduces a diffusion trust region loss that moves away from distribution matching and instead emphasizes establishing a safe, in-sample behavior region.
- simultaneously trains two cooperative policies: a diffusion policy for pure behavior cloning and a one-step policy for actual deployment.
- The one-step policy is optimized based on two objectives: the diffusion trust region loss, which ensures safe policy exploration, and the maximization of the Q-value function, guiding the policy to generate actions in high-reward regions.
2 Diffusion Trusted Q-Learning
Diffusion Policy
This paper trains a diffusion model for behavior cloning only and never uses it for inference, thereby significantly reducing both training and inference time.
The objective function of the diffusion model is to train a predictor that denoises noisy samples back to clean samples, i.e., to solve the optimization problem

$$\min_\phi \; \mathbb{E}_{t,\, x_0,\, \epsilon \sim \mathcal{N}(0, I)}\left[ w(t)\, \big\|\mu_\phi(x_t, t) - x_0\big\|_2^2 \right], \tag{1}$$

where $w(t)$ is a weighting function that depends only on $t$.
In offline RL, the training data consists of state-action pairs, so we train a diffusion policy with a conditional diffusion model:

$$\mathcal{L}(\phi) = \mathbb{E}_{t,\, \epsilon \sim \mathcal{N}(0, I),\, (a_0, s) \sim \mathcal{D}}\left[ w(t)\, \big\|\mu_\phi(a_t, t \mid s) - a_0\big\|_2^2 \right], \tag{2}$$

where $a_0$ and $s$ are action and state samples from the offline dataset $\mathcal{D}$, and $a_t = \alpha_t a_0 + \sigma_t \epsilon$.
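For concreteness, here is a minimal PyTorch-style sketch of this conditional denoising objective. The denoiser interface `mu_phi(a_t, t, s)`, the schedule helper `alpha_sigma(t)`, and `weight_fn(t)` are assumed names, not the authors' implementation.

```python
import torch

def diffusion_bc_loss(mu_phi, actions, states, alpha_sigma, weight_fn):
    """Eq. (2): weighted denoising loss for the behavior-cloning diffusion policy.

    mu_phi(a_t, t, s) -> predicted clean action a_0   (assumed network interface)
    alpha_sigma(t)    -> (alpha_t, sigma_t)           (assumed schedule helper)
    weight_fn(t)      -> per-sample weight w(t)
    """
    batch = actions.shape[0]
    t = torch.rand(batch, device=actions.device)                # t ~ U(0, 1)
    eps = torch.randn_like(actions)                             # eps ~ N(0, I)
    alpha_t, sigma_t = alpha_sigma(t)
    a_t = alpha_t[:, None] * actions + sigma_t[:, None] * eps   # noised action a_t
    pred_a0 = mu_phi(a_t, t, states)                            # denoised prediction
    per_sample = weight_fn(t) * ((pred_a0 - actions) ** 2).sum(dim=-1)
    return per_sample.mean()
```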
The ELBO Objective
The ELBO for continuous-time diffusion models can be simplified to the following expression (adopted in our setting):

$$\log p(a_0 \mid s) \;\ge\; \mathrm{ELBO}(a_0 \mid s) = -\tfrac{1}{2}\, \mathbb{E}_{t \sim U(0,1),\, \epsilon \sim \mathcal{N}(0, I)}\left[ w(t)\, \big\|\mu_\phi(a_t, t \mid s) - a_0\big\|_2^2 \right] + c, \tag{3}$$

where $a_t = \alpha_t a_0 + \sigma_t \epsilon$, $w(t) = -\frac{d\,\mathrm{SNR}(t)}{dt}$, the signal-to-noise ratio is $\mathrm{SNR}(t) = \alpha_t^2 / \sigma_t^2$, and $c$ is a constant that does not depend on $\phi$.
- Since SNR(t) is assumed to be strictly monotonically decreasing in $t$, we have $w(t) > 0$, and the validity of the ELBO holds regardless of the particular schedule of $\alpha_t$ and $\sigma_t$ (a concrete schedule is checked below).
- Kingma and Gao [2024] generalize this result: if the weighting function is $w(t) = -v(t)\,\frac{d\,\mathrm{SNR}(t)}{dt}$ for a monotonically increasing function $v(t)$ of $t$, then the weighted diffusion denoising loss remains equivalent to the ELBO defined in Equation 3.
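As an illustrative check of these conditions (a simple schedule chosen for exposition, not the EDM schedule adopted later), take $\alpha_t = \sqrt{1 - t}$ and $\sigma_t = \sqrt{t}$ for $t \in (0, 1)$:

$$\mathrm{SNR}(t) = \frac{\alpha_t^2}{\sigma_t^2} = \frac{1 - t}{t}, \qquad w(t) = -\frac{d\,\mathrm{SNR}(t)}{dt} = \frac{1}{t^2} > 0,$$

so the SNR is strictly decreasing, the weighting is positive, and the bound in Equation 3 applies.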
Diffusion Trust Region Loss
For any given $s$ and a fixed diffusion model $\mu_\phi$, the goal is to find the generation policy $\pi_\theta(\cdot \mid s)$ that minimizes the diffusion-based trust region (TR) loss:

$$\mathcal{L}_{TR}(\theta) = \mathbb{E}_{t,\, \epsilon \sim \mathcal{N}(0, I),\, s \sim \mathcal{D},\, a_\theta \sim \pi_\theta(\cdot \mid s)}\left[ w(t)\, \big\|\mu_\phi(\alpha_t a_\theta + \sigma_t \epsilon, t \mid s) - a_\theta\big\|_2^2 \right], \tag{4}$$

where $\pi_\theta(a \mid s)$ is a one-step generation policy, such as a Gaussian policy.
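Continuing the PyTorch-style sketch above (same assumed `mu_phi`, `alpha_sigma`, and `weight_fn` interfaces), the trust region loss scores policy samples with the frozen diffusion model:

```python
def trust_region_loss(mu_phi, pi_theta, states, alpha_sigma, weight_fn):
    """Eq. (4): penalize actions from the one-step policy that the frozen
    diffusion policy mu_phi fails to reconstruct after noising.

    mu_phi's parameters are assumed to be excluded from the policy optimizer,
    so only pi_theta is updated; gradients flow through the noised action a_t.
    """
    a_theta = pi_theta(states)                                  # a_theta ~ pi_theta(.|s), reparameterized
    t = torch.rand(a_theta.shape[0], device=a_theta.device)     # t ~ U(0, 1)
    eps = torch.randn_like(a_theta)                             # eps ~ N(0, I)
    alpha_t, sigma_t = alpha_sigma(t)
    a_t = alpha_t[:, None] * a_theta + sigma_t[:, None] * eps   # "encoder": add noise
    recon = mu_phi(a_t, t, states)                              # "decoder": denoise
    per_sample = weight_fn(t) * ((recon - a_theta) ** 2).sum(dim=-1)
    return per_sample.mean()
```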
Theorem 1. If the diffusion policy $\mu_\phi$ satisfies the ELBO condition of Equation 3, then minimizing the diffusion trust region loss maximizes a lower bound on the distribution mode $\max_{a_0} \log p(a_0 \mid s)$ for any given $s$.
Proof. For any given state $s$,

$$\max_{a_0} \log p(a_0 \mid s) \;\ge\; \max_\theta\, \mathbb{E}_{a_\theta \sim \pi_\theta(\cdot \mid s)}\big[\log p(a_\theta \mid s)\big] \;\ge\; \max_\theta\, \mathbb{E}_{a_\theta \sim \pi_\theta(\cdot \mid s)}\big[\mathrm{ELBO}(a_\theta \mid s)\big] = \min_\theta \tfrac{1}{2}\, \mathbb{E}_{t \sim U(0,1),\, \epsilon \sim \mathcal{N}(0, I),\, a_\theta \sim \pi_\theta(\cdot \mid s)}\left[ w(t)\, \big\|\mu_\phi(\alpha_t a_\theta + \sigma_t \epsilon, t \mid s) - a_\theta\big\|_2^2 \right].$$

During training we consider all states $s$ in $\mathcal{D}$; taking the expectation over $s \sim \mathcal{D}$ on both sides then yields the loss described in Equation 4.
Unlike standard diffusion models, which optimize $\phi$ to learn the full (multi-modal) data distribution, our method optimizes $\theta$ so that the generated actions reside in the high-density region of the data manifold specified by $\mu_\phi$.
The idea is similar to Langevin Dynamics.
Thus, the loss effectively creates a trust region defined by the diffusion-based behavior-cloning policy, within which the one-step policy πθ can move freely. If the generated action deviates significantly from this trust region, it will be heavily penalized.
Remark 2. This loss is also closely connected with Diffusion-GAN and EB-GAN, where the discriminator loss takes the form

$$D(a_\theta \mid s) = \big\|\mathrm{Dec}(\mathrm{Enc}(a_\theta) \mid s) - a_\theta\big\|_2^2.$$

In our model, the noising process $\alpha_t a_\theta + \sigma_t \epsilon$ functions as an encoder, and $\mu_\phi(\cdot \mid s)$ acts as a decoder. The loss can thus be viewed as a discriminator loss that judges whether the generated action $a_\theta$ resembles the training data.
This makes the generated action $a_\theta$ look similar to in-sample actions and penalizes actions that differ, thereby effecting behavior regularization.
Diffusion Trusted Q-Learning
- Introduce a dual-policy approach, Diffusion Trusted Q-Learning (DTQL): a diffusion policy for pure behavior cloning and a one-step policy for actual deployment.
- Bridge the two policies through the newly introduced diffusion trust region loss.
- The trust region loss is optimized efficiently through each diffusion timestep without requiring the inference of the diffusion policy.
- DTQL not only maintains an expressive exploration region but also facilitates efficient optimization.
Policy Learning
- We construct the diffusion policy $\mu_\phi$ in a continuous-time setting (effectively an unlimited number of timesteps), based on the noise schedule outlined in EDM.
- We instantiate the one-step policy $\pi_\theta(a \mid s)$ in two typical ways: a Gaussian policy $\pi_\theta(a \mid s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta(s))$, or an implicit policy $a_\theta = \pi_\theta(s, \epsilon)$ with $\epsilon \sim \mathcal{N}(0, I)$.
- We then optimize $\pi_\theta$ by minimizing the diffusion trust region loss together with standard Q-value maximization (a code sketch follows this list):

$$\mathcal{L}_\pi(\theta) = \alpha \cdot \mathcal{L}_{TR}(\theta) - \mathbb{E}_{s \sim \mathcal{D},\, a_\theta \sim \pi_\theta(\cdot \mid s)}\big[Q_\eta(s, a_\theta)\big], \tag{5}$$

where $\mathcal{L}_{TR}(\theta)$ serves primarily as a behavior-regularization term, and maximizing the Q-value function steers the policy toward sampling actions with higher values.
- Here we use the double Q-learning trick, with $Q_\eta(s, a_\theta) = \min\big(Q_{\eta_1}(s, a_\theta), Q_{\eta_2}(s, a_\theta)\big)$.
- If a Gaussian policy is employed, an entropy term $-\mathbb{E}_{s, a \sim \mathcal{D}}[\log \pi_\theta(a \mid s)]$ is added to maintain an exploratory policy during training.
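A sketch of this policy objective in the same pseudocode, reusing `trust_region_loss` from above. The coefficient `alpha`, the critic interfaces `q1`/`q2`, and the optional `log_prob_fn`/`ent_coef` arguments for the Gaussian case are assumptions, as is the exact placement of the entropy term.

```python
def policy_loss(pi_theta, mu_phi, q1, q2, states, alpha,
                alpha_sigma, weight_fn, log_prob_fn=None, ent_coef=0.0):
    """Eq. (5): alpha * L_TR(theta) - E[min(Q_1, Q_2)], plus an optional
    entropy-style term for the Gaussian-policy case."""
    a_theta = pi_theta(states)                                   # reparameterized policy sample
    l_tr = trust_region_loss(mu_phi, pi_theta, states,           # behavior regularization (Eq. 4)
                             alpha_sigma, weight_fn)
    q_min = torch.min(q1(states, a_theta), q2(states, a_theta))  # double Q-learning trick
    loss = alpha * l_tr - q_min.mean()
    if log_prob_fn is not None:                                  # Gaussian case only
        entropy = -log_prob_fn(states, a_theta).mean()           # H = -E[log pi_theta(a|s)]
        loss = loss - ent_coef * entropy                         # encourage exploration
    return loss
```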
Q-Learning
We train the Q-function with Implicit Q-Learning (IQL), maintaining two Q-functions ($Q_{\eta_1}, Q_{\eta_2}$) and one value function $V_\psi$. The loss for the value function $V_\psi$ is defined as

$$\mathcal{L}_V(\psi) = \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[ L_2^\tau\big(\min(Q_{\eta_1'}(s, a), Q_{\eta_2'}(s, a)) - V_\psi(s)\big) \right], \tag{6}$$

where $\tau \in [0, 1]$ is a quantile and $L_2^\tau(u) = |\tau - \mathbb{1}(u < 0)|\, u^2$. When $\tau = 0.5$, $L_2^\tau$ reduces to the L2 loss; when $\tau > 0.5$, $\mathcal{L}_V(\psi)$ encourages learning the $\tau$-quantile of the Q-values. The loss for updating the Q-functions $Q_{\eta_i}$ is

$$\mathcal{L}_Q(\eta_i) = \mathbb{E}_{(s, a, s') \sim \mathcal{D}}\left[ \big\| r(s, a) + \gamma\, V_\psi(s') - Q_{\eta_i}(s, a) \big\|_2^2 \right], \tag{7}$$

where $\gamma$ denotes the discount factor. This minimizes the error between the predicted Q-values and targets formed from the rewards and the value function $V_\psi$. The full procedure is summarized in Algorithm 1:
Algorithm 1 Diffusion Trusted Q-Learning
Initialize policy networks $\pi_\theta$ and $\mu_\phi$, critic networks $Q_{\eta_1}$ and $Q_{\eta_2}$, target networks $Q_{\eta_1'}$ and $Q_{\eta_2'}$, and value function $V_\psi$
for each iteration do
Sample a transition mini-batch $B = \{(s_t, a_t, r_t, s_{t+1})\} \sim \mathcal{D}$.
- Q-value function learning: update $Q_{\eta_1}$, $Q_{\eta_2}$, and $V_\psi$ by $\mathcal{L}_V$ and $\mathcal{L}_Q$ (Eqs. 6 and 7).
- Diffusion policy learning: update $\mu_\phi$ by $\mathcal{L}(\phi)$ (Eq. 2).
- Diffusion trust region policy learning: sample $a_\theta \sim \pi_\theta(\cdot \mid s)$ and update $\pi_\theta$ by $\mathcal{L}_\pi(\theta)$ (Eq. 5).
- Update target networks: $\eta_i' \leftarrow \rho\, \eta_i' + (1 - \rho)\, \eta_i$ for $i \in \{1, 2\}$.
end for
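Putting the pieces together, here is a condensed sketch of one iteration of Algorithm 1 in the same pseudocode. The containers `nets`, `opts`, and `hps` (holding the networks, their optimizers, and hyperparameters such as `tau`, `gamma`, `rho`, and `alpha`) are assumptions, as is updating $V_\psi$ and both Q-functions with a single optimizer step.

```python
def expectile_loss(u, tau):
    """L_2^tau(u) = |tau - 1(u < 0)| * u^2, the asymmetric loss in Eq. (6)."""
    return (torch.abs(tau - (u < 0).float()) * u ** 2).mean()


def dtql_iteration(batch, nets, opts, hps):
    """One training iteration of Algorithm 1 (sketch; batch shapes assumed to be (B, ...))."""
    s, a, r, s_next = batch

    # 1) Q-value function learning: update V_psi, Q_eta1, Q_eta2 (Eqs. 6 and 7).
    with torch.no_grad():
        q_target = torch.min(nets.q1_targ(s, a), nets.q2_targ(s, a))   # target critics
        td_target = r + hps.gamma * nets.v(s_next)                     # r + gamma * V(s')
    v_loss = expectile_loss(q_target - nets.v(s), hps.tau)
    q_loss = ((td_target - nets.q1(s, a)) ** 2).mean() \
           + ((td_target - nets.q2(s, a)) ** 2).mean()
    opts.critic.zero_grad(); (v_loss + q_loss).backward(); opts.critic.step()

    # 2) Diffusion policy learning: update mu_phi by the BC loss (Eq. 2).
    bc_loss = diffusion_bc_loss(nets.mu_phi, a, s, hps.alpha_sigma, hps.weight_fn)
    opts.mu_phi.zero_grad(); bc_loss.backward(); opts.mu_phi.step()

    # 3) Diffusion trust region policy learning: update pi_theta by Eq. (5).
    pi_loss = policy_loss(nets.pi_theta, nets.mu_phi, nets.q1, nets.q2, s,
                          hps.alpha, hps.alpha_sigma, hps.weight_fn)
    opts.pi_theta.zero_grad(); pi_loss.backward(); opts.pi_theta.step()

    # 4) Soft update of target critics: eta' <- rho * eta' + (1 - rho) * eta.
    for targ, src in [(nets.q1_targ, nets.q1), (nets.q2_targ, nets.q2)]:
        for p_t, p in zip(targ.parameters(), src.parameters()):
            p_t.data.mul_(hps.rho).add_((1 - hps.rho) * p.data)
```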
3 Mode-seeking behavior regularization comparison
Another approach to accelerate training and inference in diffusion-based policy learning involves utilizing distillation techniques.
- A trained diffusion model is used alongside another diffusion network to minimize the KL divergence between the two models.
In our experimental setup, this strategy is employed for behavior regularization via

$$\mathcal{L}_{KL}(\theta) = D_{KL}\big[\pi_\theta(\cdot \mid s)\,\|\,\mu_\phi(\cdot \mid s)\big] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I),\, s \sim \mathcal{D},\, a_\theta = \pi_\theta(s, \epsilon)}\left[ \log \frac{p_{\text{fake}}(a_\theta \mid s)}{p_{\text{real}}(a_\theta \mid s)} \right],$$

where $\pi_\theta(s, \epsilon)$ is instantiated as a one-step implicit policy.
Since we do not have access to the log densities of the fake and real conditional action distributions, the loss itself cannot be computed directly; however, its gradients can be estimated. The gradient of $\log p_{\text{real}}(a_\theta \mid s)$ is estimated by the diffusion model $\mu_\phi(\cdot \mid s)$, and the gradient of $\log p_{\text{fake}}(a_\theta \mid s)$ is estimated by a second diffusion model trained on the generated (fake) action data $a_\theta$.
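Below is a sketch of how such a gradient is typically estimated via the score difference (in the spirit of this distillation baseline; the exact implementation may differ). Here `score_real` and `score_fake` are assumed approximations of $\nabla_a \log p_{\text{real}}(a \mid s)$ and $\nabla_a \log p_{\text{fake}}(a \mid s)$ obtained from the two diffusion models; evaluating them on noised actions over sampled timesteps, as is common in practice, is omitted for brevity.

```python
def kl_distill_grad_step(pi_theta, score_real, score_fake, states, action_dim, opt):
    """One gradient step on D_KL[pi_theta || mu_phi] via the score-difference trick:
    grad_theta KL = E[(grad_a log p_fake(a|s) - grad_a log p_real(a|s)) * da/d theta]."""
    eps = torch.randn(states.shape[0], action_dim, device=states.device)
    a_theta = pi_theta(states, eps)                      # implicit one-step policy a = pi_theta(s, eps)
    with torch.no_grad():                                # scores enter as fixed coefficients
        coef = score_fake(a_theta, states) - score_real(a_theta, states)
    surrogate = (coef * a_theta).sum(dim=-1).mean()      # backward() reproduces the KL gradient w.r.t. theta
    opt.zero_grad(); surrogate.backward(); opt.step()
```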