Paper Notes - Energy-Guided Diffusion Sampling for Offline-to-Online Reinforcement Learning
Replaying offline data directly in the online phase suffers from:
- data distribution shift
- inefficiency in online fine-tuning
Introduce Energy-guided Diffusion Sampling (EDIS)
- uses a diffusion model to extract prior knowledge from the offline dataset
- employs energy functions to distill this knowledge for enhanced data generation in the online phase
1 Introduction
However, utilizing a diffusion model trained on an offline dataset introduces a challenge—it can only generate samples adhering to the dataset distribution, thus still being susceptible to distribution shift issues.
The desired distribution for RL has three crucial characteristics:
- the state distribution should align with that in the online training phase
- actions should be consistent with the current policy
- the next states should conform to the transition function
To achieve this, we formulate three distinct energy functions to guide the diffusion sampling process, ensuring alignment with the aforementioned features.
2 EDIS: Energy-Guided Diffusion Sampling
To extract prior knowledge from the offline dataset and generate samples that conform to the online data distribution, we introduce our approach, named Energy-guided Diffusion Sampling (EDIS).
At the heart of our method is accurately generating the desired online data distribution, denoted as $d^{\text{on}}(s, a, s')$, from pre-gathered data. The distribution does not include the reward because we assume the reward function is accessible, either directly or by learning it from the dataset.
To achieve this, we have integrated a diffusion model into our framework, capitalizing on its exceptional capability for modeling complex distributions.
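As a concrete illustration of this component, here is a minimal sketch, assuming a PyTorch MLP denoiser over concatenated $(s, a, s')$ vectors trained with the standard DDPM noise-prediction objective; the names `TransitionDenoiser` and `denoiser_loss` are illustrative, not from the paper's code.

```python
import torch
import torch.nn as nn

class TransitionDenoiser(nn.Module):
    """Predicts the noise added to a flattened (s, a, s') sample at diffusion time t."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        x_dim = 2 * state_dim + action_dim              # (s, a, s') concatenated
        self.net = nn.Sequential(
            nn.Linear(x_dim + 1, hidden), nn.SiLU(),    # +1 input for the normalized timestep
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, x_dim),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x_t: [B, x_dim] noisy transitions; t: [B, 1] diffusion time in [0, 1]
        return self.net(torch.cat([x_t, t], dim=-1))


def denoiser_loss(model, x0, alphas_cumprod):
    """Standard DDPM noise-prediction loss on a batch of offline transitions x0: [B, x_dim]."""
    n_steps = alphas_cumprod.shape[0]
    t = torch.randint(0, n_steps, (x0.shape[0],))
    a_bar = alphas_cumprod[t].unsqueeze(-1)                  # \bar{alpha}_t per sample
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps     # forward noising
    t_in = t.float().unsqueeze(-1) / n_steps
    return ((model(x_t, t_in) - eps) ** 2).mean()            # predict the injected noise
```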
2.1 Distribution Adjustment via Energy Guidance
One challenge in this process is the inherent limitation of directly training a diffusion model on an offline dataset. Such a model typically yields the offline data distribution $d^{\text{off}}(s, a, s')$, which does not align perfectly with the online data and causes distribution-shift issues.
To address this, our method needs to guide the diffusion sampling process towards the online distribution. This is achieved by decomposing the online data distribution into the following form:

$$d^{\text{on}}(s, a, s') \propto p_\theta(s, a, s')\, e^{-\mathcal{E}(s, a, s')},$$

where $p_\theta$ is the distribution generated by the denoiser network, parameterized by $\theta$, and $\mathcal{E}$ is the energy function, which serves as the guidance to bridge the gap between the generated distribution and the online data distribution. The following theorem shows that such an energy function exists.
Theorem 3.1. Let $p_\theta(s)$ be the marginal distribution of $s$ under $p_\theta$, and let $p_\theta(a \mid s)$ and $p_\theta(s' \mid s, a)$ be the conditional distributions of $a$ given $s$ and of $s'$ given $s$ and $a$. The decomposition above is valid if the energy function is structured as follows:

$$\mathcal{E}(s, a, s') = \mathcal{E}_1(s) + \mathcal{E}_2(s, a) + \mathcal{E}_3(s, a, s'),$$

such that $e^{-\mathcal{E}_1(s)} \propto \frac{d^{\text{on}}(s)}{p_\theta(s)}$, $e^{-\mathcal{E}_2(s, a)} \propto \frac{\pi(a \mid s)}{p_\theta(a \mid s)}$, and $e^{-\mathcal{E}_3(s, a, s')} \propto \frac{T(s' \mid s, a)}{p_\theta(s' \mid s, a)}$, where $\pi$ is the current policy and $T$ is the transition function.
Each part is responsible for aligning the generated distribution with a different aspect of the online data (a code sketch of this composite energy follows the list):
- $\mathcal{E}_1$: the online state distribution
- $\mathcal{E}_2$: the current policy's action distribution
- $\mathcal{E}_3$: the environmental dynamics
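A minimal sketch of how the three-term energy of Thm. 3.1 could be parameterized as a sum of three small networks; the class name, architecture, and MLP sizes are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

def mlp(in_dim: int, hidden: int = 256) -> nn.Module:
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.SiLU(),
                         nn.Linear(hidden, hidden), nn.SiLU(),
                         nn.Linear(hidden, 1))

class CompositeEnergy(nn.Module):
    """E(s, a, s') = E1(s) + E2(s, a) + E3(s, a, s'), one network per term."""
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.e1 = mlp(state_dim)                      # aligns the state marginal
        self.e2 = mlp(state_dim + action_dim)         # aligns actions with the current policy
        self.e3 = mlp(2 * state_dim + action_dim)     # aligns next states with the dynamics

    def forward(self, s, a, s_next):
        return (self.e1(s)
                + self.e2(torch.cat([s, a], dim=-1))
                + self.e3(torch.cat([s, a, s_next], dim=-1))).squeeze(-1)
```

In this sketch each term would be trained with its own contrastive objective from Sec. 2.2, since the ratios in Thm. 3.1 factorize term by term.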
2.2 Learning Energy Guidance by Contrastive Energy Prediction
- the energy is estimated using a neural network, denoted as $\mathcal{E}_\phi$ with parameters $\phi$
- Let $N$ and $K$ be two positive integers. Given $N$ i.i.d. positive samples $\{x_i\}_{i=1}^{N}$ drawn from the target distribution, and $K$ negative samples $\{x_{i,j}\}_{j=1}^{K}$ for each $x_i$, we employ the Information Noise Contrastive Estimation (InfoNCE) loss (see the code sketch after this list):

  $$\mathcal{L}(\phi) = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{-\mathcal{E}_\phi(x_i)}}{e^{-\mathcal{E}_\phi(x_i)} + \sum_{j=1}^{K} e^{-\mathcal{E}_\phi(x_{i,j})}}$$
- Then, we devise positive and negative samples to achieve the target energy function established by Thm. 3.1.
- Suppose the distribution of positive samples is $p_{\text{pos}}$ and the distribution of negative samples is $p_{\text{neg}}$; the optimum of the above loss satisfies $e^{-\mathcal{E}_\phi(x)} \propto \frac{p_{\text{pos}}(x)}{p_{\text{neg}}(x)}$. Compared with the functions indicated by Thm. 3.1, the target energies can therefore be obtained by choosing $p_{\text{pos}}$ and $p_{\text{neg}}$ to be the corresponding numerator and denominator distributions (e.g., $d^{\text{on}}(s)$ and $p_\theta(s)$ for $\mathcal{E}_1$).
- Following the approach of Sinha et al. (2022) and Liu et al. (2021), we construct a positive buffer containing only a small set of trajectories from very recent policies. The data distribution in this buffer can be viewed as an approximation of the on-policy distribution $d^{\text{on}}$, while $p_\theta$ is the distribution of the data generated during the denoising steps.
- Therefore, the positive samples are drawn from the positive buffer and the negative samples are drawn from the denoiser.
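A hedged sketch of the contrastive objective above, assuming one positive per row (from the positive buffer) and $K$ denoiser-generated negatives; `energy` stands for any single learned term $\mathcal{E}_\phi$, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def infonce_energy_loss(energy, x_pos, x_neg):
    """
    energy: network mapping samples [*, D] -> scalar energies
    x_pos:  [B, D]    positives drawn from the recent-policy ("positive") buffer
    x_neg:  [B, K, D] negatives generated by the denoiser
    Minimizing this pushes exp(-E(x)) toward p_pos(x) / p_neg(x) (up to a constant).
    """
    B, K, D = x_neg.shape
    e_pos = energy(x_pos).reshape(B)                          # [B]
    e_neg = energy(x_neg.reshape(B * K, D)).reshape(B, K)     # [B, K]
    logits = torch.cat([-e_pos.unsqueeze(1), -e_neg], dim=1)  # [B, 1 + K]; positive at index 0
    labels = torch.zeros(B, dtype=torch.long)                 # "class" 0 = the positive sample
    return F.cross_entropy(logits, labels)
```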
2.3 Sampling under Energy Guidance
Score function in the sampling process:
In the denoising process, we need the score function at each timestep. Denote the forward (noising) distribution at time $t$ starting from $x_0$ as $q_t(x_t \mid x_0)$, and let $p_t$ be the corresponding marginal of the denoiser's distribution. Recall that the denoiser model is trained to match the score via

$$\epsilon_\theta(x_t, t) \approx -\sigma_t \nabla_{x_t} \log p_t(x_t).$$

Thus, we can obtain the gradient $\nabla_{x_t} \log p_t(x_t)$ through the denoiser model; by the decomposition in Sec. 2.1, subtracting the gradient of the learned energy, $\nabla_{x_t} \mathcal{E}(x_t)$, then yields the score of the desired online distribution.
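Putting the two pieces together, a sketch of how the guided score could be assembled at a given noise level; the noise parameterization (`sigma_t`) is an assumption, and `energy` is assumed to act on the flattened $(s, a, s')$ sample (e.g., a thin wrapper around the composite energy). In the full method the guidance energy is also learned on noisy samples at each level, a detail glossed over here.

```python
import torch

def guided_score(denoiser, energy, x_t, t, sigma_t):
    """Approximate score of the target distribution: grad log p_theta(x_t) - grad E(x_t)."""
    x_t = x_t.detach().requires_grad_(True)
    # The denoiser predicts the injected noise, so the model score is -eps_theta / sigma_t.
    score_model = -denoiser(x_t, t) / sigma_t
    # The energy gradient steers the sample toward the online distribution.
    grad_energy = torch.autograd.grad(energy(x_t).sum(), x_t)[0]
    return (score_model - grad_energy).detach()
```

This guided score can then replace the unguided one inside whatever reverse-diffusion or Langevin update the sampler uses.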