Paper Notes - S2AC: Energy-Based Reinforcement Learning with Stein Soft Actor Critic
1 Introduction
MaxEnt RL learns a stochastic policy that captures the intricacies of the action space.
- better exploration
- better robustness to environmental perturbations
- learning policies that maximize the sum of expected future reward and expected future entropy.
- estimating the entropy of such expressive policies is the main difficulty.
1.1 Related Work
Existing methods either work around the entropy computation or make limiting assumptions about the policy, which leads to:
- poor scalability
- convergence to suboptimal solutions.
SQL (Haarnoja et al., 2017)
- implicitly incorporates entropy in the Q-function computation
- using importance sampling
- high variance, and hence poor training stability and limited scalability to high-dimensional action spaces.
SAC (Haarnoja et al., 2018a)
- fits a Gaussian distribution to the EBM policy --- closed-form evaluation of the entropy
- suboptimal solutions when the true action distribution is multimodal.
IAPO (Marino et al., 2021)
- models the policy as a uni-modal Gaussian
- achieves multimodal policies by learning a collection of parameter estimates (mean, variance) through different initializations for different policies.
1.2 Proposed Method
To achieve expressivity, S2AC models the policy as a Stein Variational Gradient Descent (SVGD) (Liu, 2017) sampler from an EBM over Q-values (target distribution).
SVGD proceeds by first sampling a set of particles from an initial distribution, and then iteratively transforming these particles via a sequence of updates to fit the target distribution.
To compute a closed-form estimate of the entropy of such policies, we use the change-of-variable formula for pdfs (Devore et al., 2012).
To improve scalability, we model the initial distribution of the SVGD sampler as an isotropic Gaussian and learn its parameters
- faster convergence to the target distribution
Beyond RL, the backbone of S2AC is a new variational inference algorithm with a more expressive and scalable distribution characterized by a closed-form entropy estimate.
2 Preliminaries
2.1 Samplers for energy-based models
- SVGD is a particle-based Bayesian inference algorithm.
- SVGD samples a set of m particles from an initial distribution which it then transforms through a sequence of updates to fit the target distribution.
- SVGD applies a form of functional gradient descent that minimizes the KL divergence from the proposal distribution induced by the particles to the target distribution (a minimal sketch follows).
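A minimal numpy sketch of one SVGD update with an RBF kernel (the function name, fixed bandwidth h, and step size eps are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def svgd_update(particles, grad_log_p, eps=0.1, h=1.0):
    """One SVGD step on an (m, d) array of particles.

    grad_log_p maps an (m, d) array to the (m, d) scores grad log p(x_i)
    of the target density. eps is the step size, h the RBF bandwidth.
    """
    m = particles.shape[0]
    diff = particles[:, None, :] - particles[None, :, :]   # x_i - x_j, shape (m, m, d)
    k = np.exp(-np.sum(diff ** 2, axis=-1) / h)             # k(x_i, x_j), shape (m, m)
    grad_k = -2.0 / h * diff * k[..., None]                 # grad_{x_i} k(x_i, x_j), (m, m, d)
    score = grad_log_p(particles)                           # grad log p(x_i), shape (m, d)
    # phi(x_j) = (1/m) sum_i [ k(x_i, x_j) * score_i + grad_{x_i} k(x_i, x_j) ]
    phi = (k.T @ score + grad_k.sum(axis=0)) / m
    return particles + eps * phi
```

For example, starting 64 two-dimensional particles at an offset and calling svgd_update a few hundred times with grad_log_p = lambda a: -a drives them toward a standard normal.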
2.2 Maximum-entropy RL
MaxEnt RL learns a policy that, instead of maximizing only the expected future reward, maximizes the sum of the expected future reward and the expected future entropy.
This is equivalent to approximating the policy, modeled as an EBM over Q-values, by a variational distribution (see the formulas below).
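For reference, the MaxEnt RL objective and the induced EBM policy, in standard notation (which may differ slightly from the paper's):

$$\pi^{*} = \arg\max_{\pi}\sum_{t}\mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\big[r(s_t,a_t) + \alpha\,\mathcal{H}(\pi(\cdot\mid s_t))\big], \qquad \pi(a_t\mid s_t)\propto\exp\!\big(Q(s_t,a_t)/\alpha\big),$$

so the variational view is $\min_{q} D_{\mathrm{KL}}\big(q(\cdot\mid s_t)\,\|\,\exp(Q(s_t,\cdot)/\alpha)/Z(s_t)\big)$.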
SAC: actor-critic algorithm
policy evaluation:
the Q-target does not include the entropy of the current state (only the entropies of future states)
policy improvement:
SAC models the policy as an isotropic Gaussian, i.e., $\pi(\cdot \mid s_t) = \mathcal{N}(\mu_\theta(s_t), \sigma_\theta(s_t) I)$ (the standard updates are written out after the weaknesses below).
If we use a diffusion model to learn the policy, how do we compute its entropy?
Weaknesses
- over-simplification of the true action distribution
- cannot represent complex distributions, e.g., multimodal distributions.
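For reference, the standard SAC updates (Haarnoja et al., 2018a), written here in a common form ($\alpha$ is the temperature and $\bar\theta$ the target-network parameters; this is standard notation, not copied from the notes):

$$J_Q(\theta) = \mathbb{E}_{(s_t,a_t)\sim\mathcal{D}}\Big[\tfrac{1}{2}\big(Q_\theta(s_t,a_t) - r(s_t,a_t) - \gamma\,\mathbb{E}_{s_{t+1}}[V_{\bar\theta}(s_{t+1})]\big)^2\Big], \qquad V(s_t) = \mathbb{E}_{a_t\sim\pi}\big[Q(s_t,a_t) - \alpha\log\pi(a_t\mid s_t)\big],$$

$$\pi_{\text{new}} = \arg\min_{\pi'} D_{\mathrm{KL}}\Big(\pi'(\cdot\mid s_t)\,\Big\|\,\frac{\exp\!\big(Q^{\pi_{\text{old}}}(s_t,\cdot)/\alpha\big)}{Z^{\pi_{\text{old}}}(s_t)}\Big).$$

Note that the Q-target only involves the entropies of future states, which is the point made in the "policy evaluation" bullet above.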
SQL
works around the entropy computation by defining a soft version of the value function (see the definitions after this list)
This leads to a recursive expression for the soft Q-value.
SQL follows soft value iteration, which alternates between updates of the soft Q- and value functions
- let the soft Q- and value functions converge first
- uses amortized SVGD to learn a stochastic sampling network that maps noise samples to action samples from the EBM policy distribution
- the sampling network is obtained by minimizing a loss that pushes its outputs along the SVGD update direction
- the integral in the soft value function is approximated via importance sampling --- high-variance estimates and hence poor scalability to high-dimensional action spaces
- amortized generation is usually unstable and prone to mode collapse
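For reference, the soft value and soft Q-functions from SQL (Haarnoja et al., 2017), as referenced in the bullets above:

$$V_{\text{soft}}(s_t) = \alpha\log\int_{\mathcal{A}}\exp\!\Big(\tfrac{1}{\alpha}Q_{\text{soft}}(s_t,a')\Big)\,da', \qquad Q_{\text{soft}}(s_t,a_t) = r(s_t,a_t) + \gamma\,\mathbb{E}_{s_{t+1}}\big[V_{\text{soft}}(s_{t+1})\big],$$

with the optimal MaxEnt policy $\pi(a_t\mid s_t) = \exp\!\big(\tfrac{1}{\alpha}(Q_{\text{soft}}(s_t,a_t) - V_{\text{soft}}(s_t))\big)$. The integral in $V_{\text{soft}}$ is the one approximated with importance sampling.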
3 Approach
S2AC: a new actor-critic MaxEnt RL algorithm
- uses SVGD as the underlying actor to generate action samples from policies represented using EBMs. (expressivity)
- derive a closed-form entropy estimate of the SVGD-induced distribution
- propose a parameterized version of SVGD to enable scalability to high-dimensional action spaces and non-smooth Q-function landscapes.
3.1 Stein Soft Actor Critic
- model the actor as a parameterized sampler from an EBM.
Critic
the critic is trained by minimizing the soft Bellman residual $J_Q(\theta) = \mathbb{E}_{(s_t,a_t)\sim\mathcal{D}}\big[\big(Q_\theta(s_t,a_t) - \hat{Q}(s_t,a_t)\big)^2\big]$, where the target is $\hat{Q}(s_t,a_t) = r(s_t,a_t) + \gamma\,\mathbb{E}_{s_{t+1}}\big[\mathbb{E}_{a_{t+1}\sim\pi}\big[Q_{\bar\theta}(s_{t+1},a_{t+1})\big] + \alpha\,\mathcal{H}(\pi(\cdot\mid s_{t+1}))\big]$.
Actor as an EBM sampler
- samples a set of particles from an initial distribution (e.g., Gaussian).
- These particles are then updated over several iterations $l$ via the SVGD update $a^{l+1} = a^{l} + \epsilon\,\phi(a^{l})$.
If the initial distribution $q^{0}$ is tractable and each update step is invertible, it is possible to compute a closed-form expression for the distribution of the particles at the $l$-th iteration via the change-of-variables formula (see the recursion after this list).
- The policy is represented by the particle distribution at the final step $L$ of the sampler dynamics, i.e., $\pi(\cdot \mid s) = q^{L}(\cdot \mid s)$.
- The entropy can be estimated by averaging over a set of particles.
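Concretely, writing one SVGD step as $f(a^{l}) = a^{l} + \epsilon\,\phi(a^{l})$, the change-of-variables formula gives the log-density recursion (a sketch of the derivation the notes refer to, in standard notation rather than the paper's exact equations):

$$\log q^{l+1}(a^{l+1}) = \log q^{l}(a^{l}) - \log\big|\det\!\big(I + \epsilon\,\nabla_{a^{l}}\phi(a^{l})\big)\big|,$$

valid whenever the map $f$ is invertible (e.g., for a sufficiently small step size $\epsilon$; see Section 3.3).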
Parameterized initialization
- To speed up convergence, the initial distribution is modeled as a parameterized isotropic Gaussian, i.e., $q^{0}(\cdot \mid s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta(s) I)$, and its parameters are learned.
- To deal with the non-smooth nature of deep Q-function landscapes, the magnitude of the particle updates is kept bounded (a sketch of the sampling loop follows this list).
- $\mathcal{D}$ denotes the replay buffer (used in the objectives above).
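A minimal numpy sketch of the resulting sampling loop, assuming the svgd_update helper from the Section 2.1 sketch, a learned (mu, sigma) for the initial Gaussian, a caller-supplied score grad_log_target(a) ≈ ∇_a Q(s, a)/α, and simple element-wise clipping as a hypothetical stand-in for the paper's bounding scheme:

```python
import numpy as np

def sample_actions(mu, sigma, grad_log_target, n_particles=32, n_steps=10,
                   eps=0.05, max_step=0.5, rng=None):
    """Draw particles from q^0 = N(mu, sigma^2 I) and refine them with SVGD.

    Assumes the svgd_update helper sketched in Section 2.1; grad_log_target(a)
    should return grad_a Q(s, a) / alpha for the current state s. All names
    and defaults are illustrative, not the paper's implementation.
    """
    rng = np.random.default_rng() if rng is None else rng
    a = mu + sigma * rng.normal(size=(n_particles, mu.shape[-1]))  # a^0 ~ q^0
    for _ in range(n_steps):
        proposed = svgd_update(a, grad_log_target, eps=eps)
        # Bound each particle update to cope with non-smooth Q landscapes
        # (assumption: simple element-wise clipping of the per-step update).
        a = a + np.clip(proposed - a, -max_step, max_step)
    return a
```

In the paper, the parameters of the initial distribution are trained through the sampler; this sketch only covers the forward sampling pass.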
3.2 A Closed-form Expression of the Policy’s entropy
A critical challenge in MaxEnt RL is how to efficiently compute the entropy term of the policy.
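Unrolling the change-of-variables recursion from Section 3.1 and taking an expectation over the initial particles yields an estimate computable by averaging over the $m$ particles; for small $\epsilon$, $\log|\det(I + \epsilon\,\nabla\phi)| \approx \epsilon\,\mathrm{tr}(\nabla\phi)$ (a first-order approximation, given here as a sketch of the derivation rather than the paper's exact formula):

$$\mathcal{H}(q^{L}) \approx -\frac{1}{m}\sum_{i=1}^{m}\Big[\log q^{0}(a_i^{0}) - \sum_{l=0}^{L-1}\log\big|\det\!\big(I + \epsilon\,\nabla_{a_i^{l}}\phi(a_i^{l})\big)\big|\Big] \approx -\frac{1}{m}\sum_{i=1}^{m}\Big[\log q^{0}(a_i^{0}) - \epsilon\sum_{l=0}^{L-1}\mathrm{tr}\big(\nabla_{a_i^{l}}\phi(a_i^{l})\big)\Big].$$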
3.3 Invertible Policies
4 Results
Entropy Evaluation
- compare the estimated entropy for distributions (with known ground truth entropy or log-likelihoods) using different samplers.
- study the sensitivity of the formula to different samplers’ parameters.
Multi-goal Experiments
- To check if S2AC learns a better solution to the max-entropy objective, we design a new multi-goal environment.
Mujoco Experiments
- Performance and sample efficiency
- run-time