Improved Contrastive Divergence Training of Energy Based Models

Yilun Du1    Shuang Li1    Joshua Tenenbaum1    Igor Mordatch2

1 MIT CSAIL    2 Google Brain

Paper | Pytorch Code


We propose several techniques to improve contrastive divergence training of energy-based models (EBMs). We first show that a gradient term neglected in the popular contrastive divergence formulation is both tractable to estimate and important for avoiding the training instabilities seen in previous models. We further highlight how data augmentation, multi-scale processing, and reservoir sampling can be used to improve model robustness and generation quality. Finally, we empirically evaluate the stability of model architectures and show improved performance on a host of benchmarks and use cases, such as image generation, OOD detection, and compositional generation.



arxiv, 2020.


Yilun Du, Shuang Li, Joshua Tenenbaum, Igor Mordatch. "Improved Contrastive Divergence Training of Energy Based Models", arxiv.


Energy-based models (EBMs) represent a probability distribution over data by assigning an unnormalized probability scalar (or "energy") to each input data point. This provides significant modeling flexibility: any model that outputs a real number can be used as an energy model. A difficulty, however, is that EBMs are hard to train, because properly maximizing the likelihood of an energy function requires drawing samples from the energy model. In this work, we present a set of improvements to contrastive divergence training of EBMs, enabling more stable, high-resolution generation with EBMs. In particular, we propose to:

Contrastive Divergence

A common objective used to train EBMs is contrastive divergence. Contrastive divergence consists of the following objective:

where we minimize the difference between two KL divergences: the KL divergence between the data distribution and the EBM distribution, and the KL divergence between the distribution obtained by running a finite number of MCMC steps on the data distribution and the EBM distribution. This objective has a key gradient term (highlighted in red) that is often ignored.
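For reference, the objective described above can be written out as follows. The notation is an assumption of this sketch rather than something taken from this page: p_D is the data distribution, p_θ(x) ∝ exp(−E_θ(x)) is the EBM distribution, and q_θ is the distribution obtained after t MCMC steps started from p_D. In the gradient, the first two terms treat the MCMC samples as fixed; the last term arises because q_θ itself depends on θ, and it is the one that is usually dropped.

```latex
\mathcal{L}_{\mathrm{CD}}
  = \mathrm{KL}\big(p_D(x) \,\|\, p_\theta(x)\big)
  - \mathrm{KL}\big(q_\theta(x) \,\|\, p_\theta(x)\big),
\qquad q_\theta = \Pi_\theta^{t}\, p_D

\nabla_\theta \mathcal{L}_{\mathrm{CD}}
  = \mathbb{E}_{p_D(x)}\!\left[\nabla_\theta E_\theta(x)\right]
  - \mathbb{E}_{q_\theta(x')}\!\left[\nabla_\theta E_\theta(x')\right]
  - \frac{\partial q_\theta(x')}{\partial \theta}\,
    \frac{\partial\, \mathrm{KL}\big(q_\theta(x') \,\|\, p_\theta(x')\big)}{\partial q_\theta(x')}
```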

We present a loss to capture this gradient (see our paper for details), and find that this missing gradient contributes substantially to the overall training gradient of an EBM. We further show below that adding a KL term to contrastive divergence training of energy models greatly stabilizes overall training. In the graph below, we investigate the energy difference between real images and generated samples with and without the KL term. Stable EBM training corresponds to an energy difference of around 0. We find that adding the KL loss term enables the incorporation of architectural blocks such as self-attention and normalization, while without the KL term, spectral normalization is necessary to train models stably.
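As a point of comparison, here is a minimal sketch of the standard two-term contrastive divergence update on a toy 1-D problem. It deliberately omits the KL term discussed above and is not the authors' implementation: the quadratic energy, the function names (`langevin`, `train_cd`), and all step sizes are illustrative assumptions.

```python
import random


def d_energy_dx(x, mu):
    # Gradient in x of the hypothetical toy energy E(x; mu) = 0.5 * (x - mu)**2.
    return x - mu


def d_energy_dmu(x, mu):
    # Gradient of the same toy energy with respect to its parameter mu.
    return mu - x


def langevin(x, mu, rng, steps=20, step_size=0.1):
    # Short-run Langevin chain: follow the negative energy gradient, add noise.
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - 0.5 * step_size * d_energy_dx(x, mu) + step_size ** 0.5 * noise
    return x


def train_cd(data, mu=0.0, lr=0.05, epochs=200, seed=0):
    rng = random.Random(seed)
    for _ in range(epochs):
        # Negative samples: a finite number of MCMC steps started from the data.
        samples = [langevin(x, mu, rng) for x in data]
        # Two-term CD gradient: E_data[dE/dmu] - E_samples[dE/dmu],
        # with the samples treated as constants (no gradient through the chain).
        grad = (sum(d_energy_dmu(x, mu) for x in data)
                - sum(d_energy_dmu(x, mu) for x in samples)) / len(data)
        mu -= lr * grad
    return mu


if __name__ == "__main__":
    data_rng = random.Random(1)
    data = [data_rng.gauss(3.0, 1.0) for _ in range(200)]
    print("learned mu:", train_cd(data))
```

In this toy setting the update pushes energy down on data and up on samples until the two roughly balance, which is the "energy difference of around 0" regime described above.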

Data Augmentation

We illustrate the overall image generation process with EBMs below. We intersperse Langevin sampling with data augmentation transitions, enabling image sampling chains from our model to traverse a large number of modes in the energy landscape.
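The idea of interleaving Langevin steps with augmentation transitions can be sketched on a toy problem. The example below is an assumption-laden illustration, not the paper's sampler: it uses a hypothetical 1-D double-well energy with modes at -1 and +1, and a sign flip stands in for an image augmentation such as a horizontal flip.

```python
import random


def grad_energy(x):
    # Gradient of a hypothetical double-well energy E(x) = 8 * (x**2 - 1)**2,
    # whose two modes (x = -1 and x = +1) are separated by a high barrier.
    return 32.0 * x * (x * x - 1.0)


def flip(x):
    # Stand-in for a data augmentation (think: horizontal flip of an image).
    return -x


def sample(x0, rng, steps=200, step_size=0.01, augment_every=None):
    x = x0
    for t in range(1, steps + 1):
        # Langevin update: gradient step on the energy plus Gaussian noise.
        noise = rng.gauss(0.0, 1.0)
        x = x - 0.5 * step_size * grad_energy(x) + step_size ** 0.5 * noise
        # Every `augment_every` steps, apply the augmentation with probability 0.5.
        if augment_every and t % augment_every == 0 and rng.random() < 0.5:
            x = flip(x)
    return x


def fraction_in_left_mode(augment_every, n_chains=200, seed=0):
    # All chains start from the same point (the mode at +1).
    rng = random.Random(seed)
    finals = [sample(1.0, rng, augment_every=augment_every)
              for _ in range(n_chains)]
    return sum(1 for x in finals if x < 0) / n_chains


if __name__ == "__main__":
    print("without augmentation:", fraction_in_left_mode(None))
    print("with augmentation:   ", fraction_in_left_mode(20))
```

Plain Langevin chains rarely cross the barrier in so few steps, while the augmentation transition moves chains between modes directly; this is the toy analogue of sampling chains traversing many modes.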

When comparing samples initialized from the same random noise value, we find that with data augmentation to aid sampling, we get significantly more diverse samples than without.

Zero Shot Compositional Generation

EBMs can be composed independently with other EBMs, allowing us to flexibly compose generation across separate models. We show that our approach enables compositional generation across different domains. We independently train EBMs for the CelebA factors of age, gender, smiling, and wavy hair. We show below that by adding each energy model during generation, we are able to gradually construct and change generations to exhibit each desired factor, as encoded by an individual energy function.

We can further independently train EBMs on the rendering attributes of shape, size, position, and rotation. By adding each independent energy model during generation, we are also able to gradually construct generations that exhibit each desired factor.
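Composition by adding energies can be sketched in a few lines. The example below is a toy illustration under stated assumptions, not the trained models described above: each hypothetical "attribute" energy is a quadratic that constrains one coordinate of a 2-D point, and sampling runs Langevin dynamics on the sum of the energies' gradients.

```python
import random


def make_factor(index, target, strength=25.0):
    # Hypothetical attribute energy E(v) = 0.5 * strength * (v[index] - target)**2.
    # Returns the gradient function of that energy.
    def grad(v):
        g = [0.0] * len(v)
        g[index] = strength * (v[index] - target)
        return g
    return grad


def sample_composed(grads, dim=2, steps=200, step_size=0.05, seed=0):
    rng = random.Random(seed)
    v = [0.0] * dim
    for _ in range(steps):
        # The gradient of a sum of energies is the sum of their gradients.
        total = [0.0] * dim
        for grad in grads:
            for i, g in enumerate(grad(v)):
                total[i] += g
        v = [vi - 0.5 * step_size * gi + step_size ** 0.5 * rng.gauss(0.0, 1.0)
             for vi, gi in zip(v, total)]
    return v


def mean_sample(grads, n_chains=50):
    # Average the final state over several independently seeded chains.
    finals = [sample_composed(grads, seed=s) for s in range(n_chains)]
    return [sum(f[i] for f in finals) / n_chains for i in range(len(finals[0]))]


if __name__ == "__main__":
    factor_x = make_factor(0, 1.0)    # "attribute" 1: first coordinate near 1
    factor_y = make_factor(1, -2.0)   # "attribute" 2: second coordinate near -2
    print("x only:", mean_sample([factor_x]))
    print("x + y: ", mean_sample([factor_x, factor_y]))
```

With only the first factor, the second coordinate is unconstrained; adding the second energy pulls samples toward points that satisfy both factors at once, without retraining either one.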

Out of Distribution Detection

We further find that our approach significantly outperforms past energy-based models at out-of-distribution detection when simply using the likelihood of the EBM as the detection score.
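A minimal sketch of likelihood-based out-of-distribution scoring follows. It assumes a hypothetical 1-D quadratic energy in place of a trained EBM, and uses the energy (higher energy means lower unnormalized likelihood) as the OOD score, evaluated with a simple pairwise AUROC. The data and the `energy` and `auroc` helpers are illustrative, not results or code from the paper.

```python
import random


def energy(x, mu=0.0):
    # Hypothetical energy of a model fit to in-distribution data centred at mu.
    # Lower energy corresponds to higher unnormalized likelihood.
    return 0.5 * (x - mu) ** 2


def auroc(in_scores, out_scores):
    # Probability that a random OOD example gets a higher score (energy) than
    # a random in-distribution example; ties count as half.
    wins = 0.0
    for o in out_scores:
        for i in in_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(in_scores) * len(out_scores))


if __name__ == "__main__":
    rng = random.Random(0)
    in_dist = [rng.gauss(0.0, 1.0) for _ in range(200)]   # toy in-distribution data
    ood = [rng.gauss(4.0, 1.0) for _ in range(200)]       # toy OOD data
    print("AUROC:", auroc([energy(x) for x in in_dist], [energy(x) for x in ood]))
```

An AUROC of 0.5 means the energy cannot separate the two sets, and values near 1 mean OOD inputs almost always receive higher energy.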

Our Additional Work on Energy-Based Models

If interested, here are additional works from us on utilizing energy models: