Our visual understanding of the world is factorized and compositional. With just a single visual observation, we can deduce both global and local scene attributes, such as materials, weather, lighting, and underlying objects in the scene. These attributes are highly compositional and can be combined in various ways to create new representations of the world. This website introduces Decomp Diffusion, an unsupervised method for decomposing images into a set of underlying compositional factors, each represented by a different diffusion model. We demonstrate how each decomposed diffusion model captures a different factor of the scene, ranging from global scene descriptors (e.g. shadows, foreground, facial expression) to local scene descriptors (e.g. constituent objects). Furthermore, we show how these inferred factors can be flexibly composed and recombined both within and across different image datasets.
Decomp Diffusion is an unsupervised approach that discovers compositional concepts from images, which may be flexibly combined both within and across different image modalities. In particular, it leverages the close connection between Energy-Based Models and diffusion models to decompose a scene into a set of factors, each represented by a separate diffusion model.
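Schematically, this connection can be summarized as follows (a sketch of the intuition rather than a full derivation): the noise prediction of a diffusion model behaves like the gradient of an implicitly parameterized energy function, so composing factors by summing their energies corresponds, up to scaling, to combining the per-factor noise predictions, where $\boldsymbol{z}_k$ denotes the latent describing the $k$-th factor:

$$\epsilon_\theta(\boldsymbol{x}_t, t \mid \boldsymbol{z}_k) \,\propto\, \nabla_{\boldsymbol{x}_t} E_\theta(\boldsymbol{x}_t, t \mid \boldsymbol{z}_k), \qquad \epsilon_\theta^{\text{comp}}(\boldsymbol{x}_t, t) \,=\, \sum_{k=1}^{K} \epsilon_\theta(\boldsymbol{x}_t, t \mid \boldsymbol{z}_k).$$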
Our method decomposes inputs into $K$ components, where $K$ is a hyperparameter. We learn a set of $K$ denoising functions that jointly recover a training image $\boldsymbol{x}_0$. Each denoising function is conditioned on a latent $\boldsymbol{z}_k = \text{Enc}_\theta(\boldsymbol{x}_0)[k]$ inferred by a neural network encoder. Once these denoising functions are trained, we can sample from compositions of different factors using the standard diffusion sampling procedure.
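A minimal training sketch of this setup is shown below, assuming a DDPM-style noise schedule and toy `Encoder`/`Denoiser` architectures; the module names, sizes, and the choice to average the per-factor noise predictions are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K, z_dim, img_ch, T = 4, 64, 3, 1000  # illustrative hyperparameters

class Encoder(nn.Module):
    """Map an image x_0 to K latent factors z_1..z_K."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(img_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, K * z_dim),
        )

    def forward(self, x0):
        return self.net(x0).view(-1, K, z_dim)  # (B, K, z_dim)

class Denoiser(nn.Module):
    """Predict the noise in x_t, conditioned on timestep t and a single latent z_k."""
    def __init__(self):
        super().__init__()
        self.cond = nn.Linear(z_dim + 1, 64)
        self.net = nn.Sequential(
            nn.Conv2d(img_ch + 64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, img_ch, 3, padding=1),
        )

    def forward(self, x_t, t, z):
        c = self.cond(torch.cat([z, t.float().unsqueeze(-1) / T], dim=-1))
        c = c[:, :, None, None].expand(-1, -1, x_t.shape[2], x_t.shape[3])
        return self.net(torch.cat([x_t, c], dim=1))

enc, eps_model = Encoder(), Denoiser()
opt = torch.optim.Adam(list(enc.parameters()) + list(eps_model.parameters()), lr=1e-4)
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def train_step(x0):
    B = x0.shape[0]
    t = torch.randint(0, T, (B,))
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t][:, None, None, None]
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # forward diffusion
    z = enc(x0)                                            # K latent factors per image
    # Composite denoiser: combine per-factor noise predictions (here, by averaging).
    eps_hat = torch.stack([eps_model(x_t, t, z[:, k]) for k in range(K)]).mean(0)
    loss = F.mse_loss(eps_hat, noise)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# e.g. loss = train_step(torch.randn(8, img_ch, 32, 32))
```

The point of the sketch is that all $K$ factors must cooperate through a single composite noise prediction to reconstruct $\boldsymbol{x}_0$, which pushes the encoder to split the image into complementary factors.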
We illustrate our approach's ability to decompose images into both global and local concepts, as well as reconstruct the original image by recombining concepts.
Click on an image to view its decomposition and reconstruction. Hover over a decomposed component to view its description.
Below, we showcase additional examples of decomposition and reconstruction. We also show that discovered concepts generalize well and can compose across different modalities of data.
Given a set of input images, Decomp Diffusion can capture global scene descriptors such as lighting and background, and recombine them to construct image variations.
Our method enables global factor decomposition and reconstruction on CelebA-HQ (Left) and Virtual KITTI 2 (Right). Note that inferred concepts are named for easier understanding.
Our approach can also recombine factors to produce new images. In Falcor3D (Left), we produce variations on a source image by varying a target factor, such as lighting intensity, while preserving its other factors. In CelebA-HQ (Right), we recombine factors from two different inputs to generate novel face combinations.
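A hedged sketch of how such recombination could look at sampling time, reusing the `enc`, `eps_model`, `betas`, `alphas_bar`, `K`, and `T` objects from the training sketch above (the DDPM sampler and the latent-swapping scheme below are illustrative, not the exact implementation):

```python
import torch

@torch.no_grad()
def sample_from_factors(z_set, shape=(1, 3, 32, 32)):
    """Reverse diffusion driven by the combined per-factor noise predictions."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        tt = torch.full((shape[0],), t, dtype=torch.long)
        eps = torch.stack([eps_model(x, tt, z) for z in z_set]).mean(0)
        alpha, a_bar = 1.0 - betas[t], alphas_bar[t]
        x = (x - (1 - alpha) / (1 - a_bar).sqrt() * eps) / alpha.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x

# Recombination: take factor 0 (say, "lighting") from image x_b and keep the
# remaining factors of image x_a. The inputs here are placeholders.
x_a, x_b = torch.randn(1, 3, 32, 32), torch.randn(1, 3, 32, 32)
z_a, z_b = enc(x_a), enc(x_b)                     # each (1, K, z_dim)
mixed = [z_b[:, 0]] + [z_a[:, k] for k in range(1, K)]
x_new = sample_from_factors(mixed)
```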
Given an input image with multiple objects, e.g., a purple cylinder and a green cube, Decomp Diffusion can decompose it into individual object components, akin to object-level segmentation.
Our method extracts individual object components that can be reused for image reconstruction on CLEVR (Left) and Tetris (Right).
We recombine local factors from two images to generate compositions of inferred object factors. On both CLEVR and Tetris (Left, Middle), we recombine the inferred object components in the bounding box to generate novel object compositions. On CLEVR (Right), we compose all inferred factors to generalize to scenes with up to 8 objects, even though training images contain only 4 objects.
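Since the sampler above simply combines per-factor noise predictions, nothing in it is tied to the $K$ used during training. A hedged sketch of composing all object factors from two images, reusing the helpers defined earlier:

```python
# Compose all 2K object factors inferred from two CLEVR images (illustrative;
# assumes z_a, z_b, K, and sample_from_factors from the sketches above).
all_objects = [z_a[:, k] for k in range(K)] + [z_b[:, k] for k in range(K)]
x_composed = sample_from_factors(all_objects)  # scene combining up to 2K objects
```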
Finally, we show how Decomp Diffusion can extract and combine concepts across multiple datasets. We demonstrate recombination of factors within hybrid, multi-modal datasets, as well as combination of factors from distinct models trained on different datasets.
We show that our method can capture a set of global factors shared across hybrid datasets, such as a mixture of KITTI and Virtual KITTI 2 scenes (Left), or of CelebA-HQ and Anime faces (Right). Inferred concepts are named for better understanding.
Decomp Diffusion can recombine inferred factors from hybrid datasets to generate novel compositions. On a hybrid KITTI and Virtual KITTI 2 dataset (Top), we recombine factors from a KITTI image and a Virtual KITTI 2 image to produce novel road scenes. On a hybrid CelebA-HQ and Anime dataset (Bottom), we combine the hair shape and color from a CelebA-HQ face image with the face shape and facial features of an Anime face image to generate unique anime-like faces.
Our method can also recombine factors across two distinct models trained on different datasets. Below, we combine factors from one model trained on CLEVR with factors from another model trained on CLEVR Toy to generate unseen compositions.
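A hedged sketch of cross-model composition, assuming both models share the same noise schedule (`betas`, `alphas_bar`, `T` from the earlier sketches); `eps_a`/`eps_b` stand in for the two hypothetical denoisers and `z_a_list`/`z_b_list` for the latents inferred by their respective encoders:

```python
import torch

@torch.no_grad()
def sample_across_models(z_a_list, z_b_list, eps_a, eps_b, shape=(1, 3, 32, 32)):
    """Reverse diffusion combining noise predictions from two separately trained denoisers."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        tt = torch.full((shape[0],), t, dtype=torch.long)
        preds = [eps_a(x, tt, z) for z in z_a_list] + [eps_b(x, tt, z) for z in z_b_list]
        eps = torch.stack(preds).mean(0)   # combine factors across both models
        alpha, a_bar = 1.0 - betas[t], alphas_bar[t]
        x = (x - (1 - alpha) / (1 - a_bar).sqrt() * eps) / alpha.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```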