Pre-trained large language models have advanced the state of the art on many NLP benchmarks and have been shown to be strong few-shot learners. Since offline RL can be framed as a sequence modeling problem, it stands to reason that offline RL could benefit from such pre-training. The authors of this paper show that pre-trained language models can indeed help models learn offline RL tasks.

Like Decision Transformer, the goal is to autoregressively model trajectories by representing them as a sequence of states, actions, and rewards. Instead of initializing the transformer randomly, the authors experiment with two pre-trained models: the GPT2-small model and “ChibiT”, a model of roughly the same size as Decision Transformer, pre-trained on Wikipedia articles. In addition, to align the state, action, and reward embeddings with the language embeddings, the authors add an auxiliary loss term that encourages these input embeddings to be similar to the language token embeddings (see the sketch below).
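Below is a minimal PyTorch sketch (not the authors' code) of this setup: a Decision Transformer-style model whose backbone is initialized from pre-trained GPT-2 weights, plus an illustrative embedding-alignment term based on cosine similarity. The class name `PretrainedDT`, the `alignment_loss` helper, the anchor-sampling scheme, and the loss weighting are hypothetical choices made here for illustration; only the overall idea (LM-initialized backbone, trajectory tokens, similarity-based auxiliary loss) follows the description above.

```python
# Illustrative sketch only: a Decision Transformer-style model that reuses a
# pre-trained GPT-2 backbone and adds a hypothetical embedding-alignment loss.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import GPT2Model


class PretrainedDT(nn.Module):
    def __init__(self, state_dim, act_dim, hidden=768):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained("gpt2")  # pre-trained LM weights
        self.embed_return = nn.Linear(1, hidden)
        self.embed_state = nn.Linear(state_dim, hidden)
        self.embed_action = nn.Linear(act_dim, hidden)
        self.predict_action = nn.Linear(hidden, act_dim)
        # frozen copy of the LM's token embeddings, used only for the alignment loss
        self.lm_embeddings = self.backbone.get_input_embeddings().weight.detach()

    def forward(self, returns, states, actions):
        # returns: (B, T, 1), states: (B, T, state_dim), actions: (B, T, act_dim)
        r = self.embed_return(returns)
        s = self.embed_state(states)
        a = self.embed_action(actions)
        # interleave as (R_1, s_1, a_1, R_2, s_2, a_2, ...)
        B, T, H = s.shape
        tokens = torch.stack((r, s, a), dim=2).reshape(B, 3 * T, H)
        hidden = self.backbone(inputs_embeds=tokens).last_hidden_state
        # predict each action from the hidden state at the preceding state token
        action_preds = self.predict_action(hidden[:, 1::3])
        return action_preds, tokens

    def alignment_loss(self, tokens, n_anchors=1024):
        # Hypothetical auxiliary term: push trajectory embeddings toward a random
        # subset of language token embeddings (maximize the best cosine similarity).
        idx = torch.randint(0, self.lm_embeddings.size(0), (n_anchors,))
        anchors = self.lm_embeddings[idx].to(tokens.device)  # (n_anchors, H)
        sims = F.cosine_similarity(
            tokens.unsqueeze(2),                      # (B, 3T, 1, H)
            anchors.view(1, 1, n_anchors, -1),        # (1, 1, n_anchors, H)
            dim=-1,
        )                                             # (B, 3T, n_anchors)
        return -sims.max(dim=-1).values.mean()
```

A training step would then combine the usual action-prediction loss with this auxiliary term, e.g. `loss = F.mse_loss(action_preds, actions) + lam * model.alignment_loss(tokens)`, where `lam` is an assumed weighting hyperparameter.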

On most tasks from the DQN-replay Atari dataset and the D4RL benchmark, models initialized with ChibiT or GPT2 weights achieve performance similar to or better than Decision Transformer and other offline RL algorithms.

Learn More

The authors also experimented with vision-pre-trained models (CLIP, ImageGPT) and performed various analyses and ablation studies. Check the paper to learn more!

Relevant Resources