Conservative Q-Learning for Offline Reinforcement Learning

Aviral Kumar, Aurick Zhou, George Tucker, Sergey Levine
2020

Abstract

Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.
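To make the abstract's description concrete, here is a minimal, hypothetical sketch of the CQL(H) regularizer added to a standard TD loss for the discrete-action case; the names `q_net`, `target_q_net`, and the batch tensors are illustrative placeholders, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, obs, actions, rewards, next_obs, dones,
             gamma=0.99, cql_alpha=1.0):
    """Standard TD loss plus the CQL(H) penalty for discrete actions."""
    # Q-values for every action, and for the actions taken in the dataset.
    q_values = q_net(obs)                                   # [B, num_actions]
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Standard Bellman target computed from the static dataset.
    with torch.no_grad():
        next_q = target_q_net(next_obs).max(dim=1).values
        td_target = rewards + gamma * (1.0 - dones) * next_q
    td_loss = F.mse_loss(q_taken, td_target)

    # CQL(H) regularizer: push down a soft maximum (logsumexp) of the
    # Q-values over all actions, push up the Q-values of dataset actions.
    cql_penalty = (torch.logsumexp(q_values, dim=1) - q_taken).mean()

    return td_loss + cql_alpha * cql_penalty
```

In the continuous-control experiments, the logsumexp over actions cannot be computed exactly; the paper instead approximates it by sampling actions from the current policy and a uniform distribution.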


Code References

▶ ray-project/ray
3 files
▶ doc/source/rllib/rllib-algorithms.rst
`[paper] <https://arxiv.org/abs/2006.04779>`__
▶ rllib/algorithms/cql/README.md
[CQL](https://arxiv.org/abs/2006.04779) is an offline RL algorithm that mitigates the overestimation of Q-values outside the dataset distribution via conservative critic estimates. CQL does this by adding a simple Q regularizer loss to the standard Bellman update loss. This ensures that the critic does not output overly optimistic Q-values and can be added on top of any off-policy Q-learning algorithm (in this case, we use SAC).
▶ rllib/algorithms/cql/torch/cql_torch_learner.py
# actions (from the mu distribution as named in Kumar et al. (2020)).