Our visual understanding of the world is factorized and compositional. From just a single visual observation, we can deduce both global and local scene attributes, such as materials, weather, lighting, and the underlying objects present. These attributes are highly compositional and can be combined in various ways to create new representations of the world. This website introduces Decomp Diffusion, an unsupervised method for decomposing images into a set of underlying compositional factors, each represented by a different diffusion model. We demonstrate how each decomposed diffusion model captures a different factor of the scene, ranging from global scene descriptors (e.g., shadows, foreground, facial expression) to local scene descriptors (e.g., constituent objects). Furthermore, we show how these inferred factors can be flexibly composed and recombined both within and across different image datasets.
Decomp Diffusion is an unsupervised approach that discovers compositional concepts from images; these concepts can be flexibly combined both within and across different image modalities. In particular, it leverages the close connection between Energy-Based Models and diffusion models to decompose a scene into a set of factors, each represented as a separate diffusion model.
Our method decomposes inputs into $K$ components, where $K$ is a hyperparameter. We learn a set of $K$ denoising functions to recover a training image $\boldsymbol{x}_0$. Each denoising function is conditioned on a latent $\boldsymbol{z}_k$, which is inferred by a neural network encoder as $\boldsymbol{z}_k = \text{Enc}_\theta(\boldsymbol{x}_0)[k]$. Once these denoising functions are trained, we sample from compositions of different factors using the standard noisy sampling procedure of diffusion models.
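To make this concrete, below is a minimal PyTorch sketch of one training step under a few illustrative assumptions: the `Encoder`, `Denoiser`, latent size, and linear noise schedule are toy stand-ins rather than the actual architecture, and the $K$ per-factor noise predictions are averaged into a single composed prediction that is trained to match the added noise.

```python
# Minimal training sketch (hypothetical names and hyperparameters).
import torch
import torch.nn as nn
import torch.nn.functional as F

K, Z_DIM, T = 4, 64, 1000                     # number of factors, latent size, diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)         # linear noise schedule (assumed)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class Encoder(nn.Module):
    """Infers K latent factors z_1..z_K from a clean image x_0."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
                                 nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(64, K * Z_DIM))
    def forward(self, x0):
        return self.net(x0).view(-1, K, Z_DIM)            # (B, K, Z_DIM)

class Denoiser(nn.Module):
    """Toy conditional noise predictor eps_theta(x_t, t, z_k); a U-Net in practice."""
    def __init__(self):
        super().__init__()
        self.cond = nn.Linear(Z_DIM + 1, 32)
        self.net = nn.Sequential(nn.Conv2d(3 + 32, 64, 3, 1, 1), nn.ReLU(),
                                 nn.Conv2d(64, 3, 3, 1, 1))
    def forward(self, x_t, t, z_k):
        c = self.cond(torch.cat([z_k, t.float().unsqueeze(-1) / T], dim=-1))
        c = c[:, :, None, None].expand(-1, -1, x_t.shape[2], x_t.shape[3])
        return self.net(torch.cat([x_t, c], dim=1))

enc, eps_theta = Encoder(), Denoiser()
opt = torch.optim.Adam(list(enc.parameters()) + list(eps_theta.parameters()), lr=1e-4)

def training_step(x0):
    B = x0.shape[0]
    z = enc(x0)                                            # infer K factors from the clean image
    t = torch.randint(0, T, (B,))
    noise = torch.randn_like(x0)
    a = alphas_bar[t].view(B, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise           # forward diffusion
    # Composed prediction: average the K per-factor predictions (an assumed composition rule).
    eps_hat = torch.stack([eps_theta(x_t, t, z[:, k]) for k in range(K)]).mean(0)
    loss = F.mse_loss(eps_hat, noise)                      # composed prediction should recover the noise
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```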
We illustrate our approach's ability to decompose images into both global and local concepts, as well as reconstruct the original image by recombining concepts.
Below, we showcase additional examples of decomposition and reconstruction. We also show that discovered concepts generalize well and can compose across different modalities of data.
Given a set of input images, Decomp Diffusion can capture a set of global scene descriptors such as lighting and background, and recombine them to construct image variations.
Our method enables global factor decomposition and reconstruction on CelebA-HQ (Left) and Virtual KITTI 2 (Right). Note that we name the inferred concepts for ease of understanding.
Global factor decomposition on CelebA-HQ and Virtual KITTI 2. On the left, we decompose CelebA-HQ images into inferred components of facial features, hair color, skin tone, and hair shape, and combine these factors to reconstruct the original image. On the right, we decompose Virtual KITTI 2 images into inferred factors of shadow, lighting, landscape, and objects.
Our approach can also recombine factors to produce new images. In Falcor3D (Left), we produce variations on a source image by varying a target factor, such as lighting intensity, while preserving its other factors. In CelebA-HQ (Right), we recombine factors from two different inputs to generate novel face combinations.
Global factor recombination on Falcor3D and CelebA-HQ. On the left, we vary inferred factors of lighting intensity, camera position, and lighting position for a particular Falcor3D input. On the right, we recombine an inferred facial features factor from the 1st input and hair color, color temperature, and hair shape factors from the 2nd input to generate novel face combinations.
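These recombinations amount to mixing inferred latents from different inputs before sampling. The sketch below illustrates this with a DDPM-style ancestral sampler, reusing the hypothetical `enc`, `eps_theta`, and noise schedule from the training sketch above; averaging the per-factor predictions at every step is again an illustrative assumption.

```python
# Recombination sketch (hypothetical; reuses enc, eps_theta, betas, alphas_bar from above).
@torch.no_grad()
def recombine(x_a, x_b, idx_from_a, shape=(1, 3, 64, 64)):
    z_a, z_b = enc(x_a), enc(x_b)                          # (1, K, Z_DIM) each
    keep = torch.tensor([k in idx_from_a for k in range(K)]).view(1, K, 1)
    z = torch.where(keep, z_a, z_b)                        # chosen factors from A, the rest from B
    x = torch.randn(shape)
    for t in reversed(range(T)):
        tt = torch.full((shape[0],), t)
        eps = torch.stack([eps_theta(x, tt, z[:, k]) for k in range(K)]).mean(0)
        a_bar, beta = alphas_bar[t], betas[t]
        x = (x - beta / (1 - a_bar).sqrt() * eps) / (1 - beta).sqrt()   # DDPM posterior mean
        if t > 0:
            x = x + beta.sqrt() * torch.randn_like(x)
    return x

# e.g., take factors {0, 2} from image A and factors {1, 3} from image B:
# new_img = recombine(x_a, x_b, idx_from_a={0, 2})
```

Varying a single target factor while preserving the others, as in the Falcor3D example, corresponds to swapping just one latent index in the same way.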
Given an input image with multiple objects, e.g., a purple cylinder and a green cube, Decomp Diffusion can decompose the input image into individual object components, akin to object-level segmentation.
Our method can extract individual object components that can be reused for image reconstruction on CLEVR (Left) and Tetris (Right).
Local factor decomposition on CLEVR and Tetris. On the left, our approach decomposes CLEVR images into individual object components. Similarly, on the right, our method decomposes Tetris images into individual Tetris blocks.
We recombine local factors from 2 images to generate compositions of inferred object factors. On both CLEVR and Tetris (Left, Middle), we recombine the inferred object components highlighted in the bounding boxes to generate novel object compositions. On CLEVR (Right), we compose all inferred factors to generalize to scenes with up to 8 objects, even though training images contain only 4 objects.
Local factor recombination on CLEVR and Tetris. On CLEVR and Tetris (Left, Middle), we combine object components from each of 2 input images, labeled within the bounding boxes, to generate compositions that contain all desired objects. On CLEVR (Right), we compose all 4 inferred object factors from each of 2 inputs to create a composition with 8 objects, even though training images contain only 4 objects. This demonstrates that our method can generalize to unseen scenarios.
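One way to read "compose all inferred factors" in code is to condition sampling on the union of latents from both inputs, so that more factors are composed at test time ($2K$) than were used during training ($K$). The sketch below shows this under the same illustrative assumptions as the earlier snippets.

```python
# Composing all factors from two inputs (hypothetical; reuses enc, eps_theta, schedule from above).
@torch.no_grad()
def compose_all(x_a, x_b, shape=(1, 3, 64, 64)):
    z = torch.cat([enc(x_a), enc(x_b)], dim=1)             # (1, 2K, Z_DIM): every factor of both inputs
    x = torch.randn(shape)
    for t in reversed(range(T)):
        tt = torch.full((shape[0],), t)
        eps = torch.stack([eps_theta(x, tt, z[:, k]) for k in range(z.shape[1])]).mean(0)
        a_bar, beta = alphas_bar[t], betas[t]
        x = (x - beta / (1 - a_bar).sqrt() * eps) / (1 - beta).sqrt()
        if t > 0:
            x = x + beta.sqrt() * torch.randn_like(x)
    return x
```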
Finally, we show how Decomp Diffusion can extract and combine concepts across multiple datasets. We demonstrate the recombination of factors in multi-modal datasets, and the combination of factors from distinct models trained on different datasets.
We show that our method can capture a set of global factors shared across hybrid datasets, such as KITTI and Virtual KITTI 2 scenes (Left), and CelebA-HQ and Anime faces (Right). Inferred concepts are named for ease of understanding.
Multi-modal decomposition on road scenes and faces. On the left, we apply our method on a dataset of both KITTI and Virtual KITTI images to extract inferred factors of background, background texture, objects, and foreground. On the right, our method learns to decompose a joint dataset of photorealistic CelebA-HQ faces and Anime faces into factors of face shape, head shape, color temperature, and facial details.
Decomp Diffusion can recombine inferred factors from hybrid datasets to generate novel compositions. On a hybrid KITTI and Virtual KITTI dataset (Top), we recombine factors from a KITTI image and Virtual KITTI image to produce novel road scenes. On a hybrid CelebA-HQ and Anime dataset (Bottom), we combine hair shapes and colors from a CelebA-HQ human face image with face shape and facial features from an Anime face image to generate unique anime-like faces.
Multi-modal recombination on road scenes and faces. On top, we recombine inferred background and background lighting factors from the 1st KITTI input, as well as foreground and shadow factors from the 2nd Virtual KITTI input, to generate a novel combination. On the bottom, we combine inferred hair shape and hair color factors from the 1st CelebA-HQ image, plus face shape and facial detail factors from the 2nd Anime image, to create anime-like faces.
Our method can recombine factors across 2 distinct models trained on different datasets. Below, we combine factors across one model trained on CLEVR and another model trained on CLEVR Toy to generate unseen compositions.
Cross-dataset recombination on CLEVR and CLEVR Toy. We combine 2 object factors from one model trained on CLEVR and 2 object factors from a different model trained on CLEVR Toy to generate novel compositions containing both geometric objects and toy objects.
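Conceptually, cross-dataset recombination only requires that both models expose the same noise-prediction interface: at every sampling step, the selected per-factor predictions from the two models are combined. The sketch below illustrates this with two hypothetical (encoder, denoiser) pairs standing in for the CLEVR and CLEVR Toy models, under the same assumptions as the earlier snippets.

```python
# Cross-model composition sketch (hypothetical; reuses Encoder, Denoiser, schedule from above).
# Untrained stand-ins here; in practice these are the separately trained CLEVR and CLEVR Toy models.
enc_clevr, eps_clevr = Encoder(), Denoiser()
enc_toy, eps_toy = Encoder(), Denoiser()

@torch.no_grad()
def cross_model_compose(x_clevr, x_toy, idx_clevr, idx_toy, shape=(1, 3, 64, 64)):
    z1, z2 = enc_clevr(x_clevr), enc_toy(x_toy)            # factors inferred by each model separately
    x = torch.randn(shape)
    for t in reversed(range(T)):
        tt = torch.full((shape[0],), t)
        preds = [eps_clevr(x, tt, z1[:, k]) for k in idx_clevr] + \
                [eps_toy(x, tt, z2[:, k]) for k in idx_toy]
        eps = torch.stack(preds).mean(0)                   # combine predictions across the two models
        a_bar, beta = alphas_bar[t], betas[t]
        x = (x - beta / (1 - a_bar).sqrt() * eps) / (1 - beta).sqrt()
        if t > 0:
            x = x + beta.sqrt() * torch.randn_like(x)
    return x

# e.g., two object factors from the CLEVR model and two from the CLEVR Toy model:
# img = cross_model_compose(x_clevr, x_toy, idx_clevr=[0, 1], idx_toy=[2, 3])
```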