Paper Notes - Efficient Diffusion Policies for Offline Reinforcement Learning

Notes for this paper.

Diffusion-QL suffers from two critical limitations:

  • computationally inefficient to forward and backward through the whole Markov chain during training
  • incompatible with maximum likelihood-based RL algorithms (e.g., policy gradient methods) as the likelihood of diffusion models is intractable.

EDP approximately constructs actions from corrupted ones at training time to avoid running the sampling chain.

Efficient Diffusion Policy

The paper presents a novel algorithm termed Reinforcement-Guided Diffusion Policy Learning (RGDPL).

Diffusion Policy

We use the reverse process of a conditional diffusion model as a parametric policy:

$$\pi_\theta(\boldsymbol{a}|\boldsymbol{s})=p_\theta(\boldsymbol{a}^{0:K}|\boldsymbol{s})=p(\boldsymbol{a}^K)\prod_{k=1}^K p_\theta(\boldsymbol{a}^{k-1}|\boldsymbol{a}^k,\boldsymbol{s}),$$

where $\boldsymbol{a}^K\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I})$.
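Concretely, drawing one action means running the entire $K$-step reverse chain, which is what makes naive Q-guided training expensive. A minimal PyTorch-style sketch, assuming a noise-prediction network `eps_model(a_k, k, s)` and standard DDPM schedules (all names here are illustrative, not the paper's code):

```python
import torch

@torch.no_grad()
def sample_action(eps_model, s, action_dim, K, alphas, alpha_bars, betas):
    """Draw an action by running the full K-step reverse chain (DDPM ancestral sampling).

    alphas/alpha_bars/betas are the usual DDPM schedules as length-K tensors
    (entry k-1 corresponds to step k). Names are illustrative.
    """
    a = torch.randn(s.shape[0], action_dim, device=s.device)          # a^K ~ N(0, I)
    for k in range(K, 0, -1):
        k_idx = torch.full((s.shape[0],), k, device=s.device)
        eps = eps_model(a, k_idx, s)
        # posterior mean: (a^k - beta_k / sqrt(1 - alpha_bar_k) * eps) / sqrt(alpha_k)
        a = (a - betas[k - 1] / (1 - alpha_bars[k - 1]).sqrt() * eps) / alphas[k - 1].sqrt()
        if k > 1:
            a = a + betas[k - 1].sqrt() * torch.randn_like(a)          # add noise except at the final step
    return a
```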

Given a dataset, we can easily and efficiently train a diffusion policy in a behavior-cloning manner as we only need to forward and backward through the network once each iteration.
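A minimal sketch of this behavior-cloning step, assuming the same illustrative `eps_model` and schedule names as above; only the noise-matching loss is computed, so each iteration costs a single forward/backward pass:

```python
import torch
import torch.nn.functional as F

def diffusion_bc_loss(eps_model, a0, s, K, alpha_bars):
    """Behavior-cloning loss for the diffusion policy (one forward/backward per iteration).

    a0: dataset actions, s: states, alpha_bars: cumulative noise-schedule products
    (length-K tensor, illustrative names). No reverse chain is run here.
    """
    B = a0.shape[0]
    k = torch.randint(1, K + 1, (B,), device=a0.device)       # random diffusion step per sample
    eps = torch.randn_like(a0)                                 # target noise
    ab = alpha_bars[k - 1].unsqueeze(-1)                       # \bar{alpha}^k, shape (B, 1)
    a_k = ab.sqrt() * a0 + (1 - ab).sqrt() * eps               # forward-noised action
    return F.mse_loss(eps_model(a_k, k, s), eps)               # ||eps - eps_theta(a^k, k; s)||^2
```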

Reinforcement-Guided Diffusion Policy Learning

How can we efficiently use $Q_\phi$ to guide the diffusion policy training procedure?

  • We now show that this can be achieved without sampling actions from diffusion policies.

Using the reparameterization trick, we are able to connect $\boldsymbol{a}^k$, $\boldsymbol{a}^0$, and $\boldsymbol{\epsilon}$ by:

$$\boldsymbol{a}^k=\sqrt{\bar{\alpha}^k}\,\boldsymbol{a}^0+\sqrt{1-\bar{\alpha}^k}\,\boldsymbol{\epsilon},\quad\boldsymbol{\epsilon}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I}).$$

Recall that our diffusion policy is parameterized to predict $\boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon}_\theta(\boldsymbol{a}^k,k;\boldsymbol{s})$. By replacing $\boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon}_\theta(\boldsymbol{a}^k,k;\boldsymbol{s})$, we obtain the approximated action:

$$\hat{\boldsymbol{a}}^0=\frac{1}{\sqrt{\bar{\alpha}^k}}\boldsymbol{a}^k-\frac{\sqrt{1-\bar{\alpha}^k}}{\sqrt{\bar{\alpha}^k}}\boldsymbol{\epsilon}_\theta(\boldsymbol{a}^k,k;\boldsymbol{s}).$$

Accordingly, the policy improvement for diffusion policies is modified as follows:

$$L_\pi(\theta)=-\mathbb{E}_{\boldsymbol{s}\sim\mathcal{D},\,\hat{\boldsymbol{a}}^0}\left[Q_\phi(\boldsymbol{s},\hat{\boldsymbol{a}}^0)\right].$$
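Putting the action approximation and the Q-maximization together, a sketch of the resulting policy-improvement step (same illustrative names as above; `q_net(s, a)` stands in for $Q_\phi$):

```python
import torch

def rgdpl_policy_loss(eps_model, q_net, a0, s, K, alpha_bars):
    """Reinforcement-guided policy loss without running the sampling chain (sketch).

    A dataset action a0 is corrupted to a^k, the clean action \hat{a}^0 is recovered
    in one step from eps_theta, and Q is maximized by backpropagating through \hat{a}^0.
    """
    B = a0.shape[0]
    k = torch.randint(1, K + 1, (B,), device=a0.device)
    eps = torch.randn_like(a0)
    ab = alpha_bars[k - 1].unsqueeze(-1)
    a_k = ab.sqrt() * a0 + (1 - ab).sqrt() * eps                            # corrupt the dataset action
    a0_hat = (a_k - (1 - ab).sqrt() * eps_model(a_k, k, s)) / ab.sqrt()     # one-step estimate of a^0
    return -q_net(s, a0_hat).mean()                                         # L_pi = -E[Q_phi(s, a^0_hat)]
```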

To improve the efficiency of policy evaluation, we propose to replace the DDPM sampling with DPM-Solver [20], which is an ODE-based sampler.
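The exact DPM-Solver update is not reproduced here; as a rough illustration of how a few-step deterministic sampler replaces the full chain at evaluation time, a DDIM-style update (an assumption for illustration, not the paper's DPM-Solver call) might look like:

```python
import torch

@torch.no_grad()
def fast_sample_action(eps_model, s, action_dim, steps, alpha_bars):
    """Few-step deterministic sampling (DDIM-style update; NOT the actual DPM-Solver API).

    `steps` is a short decreasing list of diffusion indices, e.g. [K, 3*K//4, K//2, K//4, 1].
    """
    a = torch.randn(s.shape[0], action_dim, device=s.device)
    for k, k_prev in zip(steps[:-1], steps[1:]):
        k_idx = torch.full((s.shape[0],), k, device=s.device)
        eps = eps_model(a, k_idx, s)
        ab, ab_prev = alpha_bars[k - 1], alpha_bars[k_prev - 1]
        a0_hat = (a - (1 - ab).sqrt() * eps) / ab.sqrt()               # predict the clean action
        a = ab_prev.sqrt() * a0_hat + (1 - ab_prev).sqrt() * eps       # deterministic jump to step k_prev
    # final step: return the clean-action prediction at the last index
    k_idx = torch.full((s.shape[0],), steps[-1], device=s.device)
    eps = eps_model(a, k_idx, s)
    ab = alpha_bars[steps[-1] - 1]
    return (a - (1 - ab).sqrt() * eps) / ab.sqrt()
```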

Generalization to Various RL Algorithms

Direct policy optimization. It maximizes Q values and directly backpropagates the gradients from the Q network to the policy network:

$$\nabla_\theta L_\pi(\theta)=-\frac{\partial Q_\phi(\boldsymbol{s},\boldsymbol{a})}{\partial\boldsymbol{a}}\frac{\partial\boldsymbol{a}}{\partial\theta}.$$

This is only applicable to cases where $\frac{\partial \boldsymbol{a}}{\partial \theta}$ is tractable, e.g., when a deterministic policy $\boldsymbol{a}=\pi_{\theta}(\boldsymbol{s})$ is used or when the sampling process can be reparameterized.
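For reference, a generic sketch of direct policy optimization with a deterministic policy, where autograd supplies $\frac{\partial Q}{\partial \boldsymbol{a}}\frac{\partial \boldsymbol{a}}{\partial \theta}$ (names are illustrative):

```python
def direct_policy_loss(policy, q_net, s):
    """Direct policy optimization: backprop dQ/da * da/dtheta through a = pi_theta(s)."""
    a = policy(s)                  # differentiable in the policy parameters theta
    return -q_net(s, a).mean()     # minimizing this maximizes Q
```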

Likelihood-based policy optimization. It tries to distill the knowledge from the Q network into the policy network indirectly by performing weighted regression or weighted maximum likelihood:

$$\max_{\theta}\quad\mathbb{E}_{(\boldsymbol{s},\boldsymbol{a})\sim\mathcal{D}}\left[f(Q_{\phi}(\boldsymbol{s},\boldsymbol{a}))\log\pi_{\theta}(\boldsymbol{a}|\boldsymbol{s})\right],$$

where $f(Q_\phi(\boldsymbol{s},\boldsymbol{a}))$ is a monotonically increasing function that assigns a weight to each state-action pair in the dataset. This objective requires the log-likelihood of the policy to be tractable and differentiable.

In this paper, two realizations of this objective are proposed for diffusion policies.

First, instead of computing the likelihood, we turn to a lower bound for $\log\pi_{\theta}(\boldsymbol{a}|\boldsymbol{s})$ introduced in DDPM. By discarding the constant term that does not depend on $\theta$, we obtain the objective:

$$\mathbb{E}_{k,\boldsymbol{\epsilon},(\boldsymbol{a},\boldsymbol{s})}\left[\frac{\beta^k\cdot f(Q_\phi(\boldsymbol{s},\boldsymbol{a}))}{2\alpha^k(1-\bar{\alpha}^{k-1})}\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\left(\boldsymbol{a}^k,k;\boldsymbol{s}\right)\right\|^2\right].$$

Second, instead of directly optimizing $\log\pi_\theta(\boldsymbol{a}|\boldsymbol{s})$, we propose to replace it with an approximated policy $\hat{\pi}_\theta(\boldsymbol{a}|\boldsymbol{s})\triangleq\mathcal{N}(\hat{\boldsymbol{a}}^0,\boldsymbol{I})$. Then, we get the following objective:

$$\mathbb{E}_{k,\boldsymbol{\epsilon},(\boldsymbol{a},\boldsymbol{s})}\left[f(Q_\phi(\boldsymbol{s},\boldsymbol{a}))\left\|\boldsymbol{a}-\hat{\boldsymbol{a}}^0\right\|^2\right].$$
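A sketch of this second realization, with a precomputed per-sample weight `w = f(Q_phi(s, a))` (e.g., from one of the schemes described below); same illustrative names as before:

```python
import torch

def weighted_regression_loss(eps_model, a0, s, w, K, alpha_bars):
    """Weighted regression onto the approximated policy: f(Q) * ||a - a^0_hat||^2 (sketch)."""
    B = a0.shape[0]
    k = torch.randint(1, K + 1, (B,), device=a0.device)
    eps = torch.randn_like(a0)
    ab = alpha_bars[k - 1].unsqueeze(-1)
    a_k = ab.sqrt() * a0 + (1 - ab).sqrt() * eps                          # corrupt the dataset action
    a0_hat = (a_k - (1 - ab).sqrt() * eps_model(a_k, k, s)) / ab.sqrt()   # one-step estimate of a^0
    return (w * ((a0 - a0_hat) ** 2).sum(-1)).mean()
```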

Empirically, we find these two choices perform similarly, but the latter is easier to implement, so we report results mainly based on the second realization. In our experiments, we consider two offline RL algorithms under this category, i.e., CRR and IQL. They use two weighting schemes: $f_{\mathrm{CRR}}=\exp\left[\left(Q_\phi(\boldsymbol{s},\boldsymbol{a})-\mathbb{E}_{\boldsymbol{a}'\sim\hat{\pi}(\boldsymbol{a}|\boldsymbol{s})}Q(\boldsymbol{s},\boldsymbol{a}')\right)/\tau_{\mathrm{CRR}}\right]$ and $f_{\mathrm{IQL}}=\exp\left[\left(Q_\phi(\boldsymbol{s},\boldsymbol{a})-V_\psi(\boldsymbol{s})\right)/\tau_{\mathrm{IQL}}\right]$, where $\tau$ refers to the temperature parameter and $V_\psi(\boldsymbol{s})$ is an additional value network parameterized by $\psi$.
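A sketch of the two weighting schemes, assuming a critic `q_net(s, a)`, a value network `v_net(s)`, and a function `sample_action_fn(s)` that draws actions from the current policy (all illustrative names):

```python
import torch

@torch.no_grad()
def crr_weight(q_net, sample_action_fn, s, a, tau, n_samples=4):
    """CRR weight: exp[(Q(s, a) - E_{a'~pi}[Q(s, a')]) / tau], baseline estimated by sampling."""
    baseline = torch.stack([q_net(s, sample_action_fn(s)) for _ in range(n_samples)]).mean(0)
    return torch.exp((q_net(s, a) - baseline) / tau)

@torch.no_grad()
def iql_weight(q_net, v_net, s, a, tau):
    """IQL weight: exp[(Q(s, a) - V_psi(s)) / tau]."""
    return torch.exp((q_net(s, a) - v_net(s)) / tau)
```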