Our visual understanding of the world is factorized and compositional. From just a single visual observation, we can deduce both global and local scene attributes, such as materials, weather, lighting, and the underlying objects present. These attributes are highly compositional and can be combined in various ways to create new representations of the world. This website introduces Decomp Diffusion, an unsupervised method for decomposing images into a set of underlying compositional factors, each represented by a different diffusion model. We demonstrate how each decomposed diffusion model captures a different factor of the scene, ranging from global scene descriptors (e.g., shadows, foreground, facial expression) to local scene descriptors (e.g., constituent objects). Furthermore, we show how these inferred factors can be flexibly composed and recombined both within and across different image datasets.
Decomp Diffusion is an unsupervised approach that discovers compositional concepts from images; these concepts can be flexibly combined both within and across different image modalities. In particular, it leverages the close connection between Energy-Based Models and diffusion models to decompose a scene into a set of factors, each represented as a separate diffusion model.
Our method decomposes inputs into $K$ components, where $K$ is a hyperparameter. We learn a set of $K$ denoising functions to recover a training image $\boldsymbol{x}_0$. Each denoising function is conditioned on a latent $\boldsymbol{z}_k$, which is inferred by a neural network encoder as $\boldsymbol{z}_k = \text{Enc}_\theta(\boldsymbol{x}_0)[k]$. Once these denoising functions are trained, we sample from compositions of different factors using the standard noisy sampling procedure of diffusion models.
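To make this concrete, below is a minimal PyTorch sketch of one training step under a few illustrative assumptions: the `Encoder`, `Denoiser`, latent size, and linear noise schedule are toy stand-ins rather than the actual architecture, and the $K$ per-factor noise predictions are averaged into a single composed prediction that is trained to match the added noise.

```python
# Minimal training sketch (hypothetical names and hyperparameters).
import torch
import torch.nn as nn
import torch.nn.functional as F

K, Z_DIM, T = 4, 64, 1000                     # number of factors, latent size, diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)         # linear noise schedule (assumed)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class Encoder(nn.Module):
    """Infers K latent factors z_1..z_K from a clean image x_0."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
                                 nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(64, K * Z_DIM))
    def forward(self, x0):
        return self.net(x0).view(-1, K, Z_DIM)            # (B, K, Z_DIM)

class Denoiser(nn.Module):
    """Toy conditional noise predictor eps_theta(x_t, t, z_k); a U-Net in practice."""
    def __init__(self):
        super().__init__()
        self.cond = nn.Linear(Z_DIM + 1, 32)
        self.net = nn.Sequential(nn.Conv2d(3 + 32, 64, 3, 1, 1), nn.ReLU(),
                                 nn.Conv2d(64, 3, 3, 1, 1))
    def forward(self, x_t, t, z_k):
        c = self.cond(torch.cat([z_k, t.float().unsqueeze(-1) / T], dim=-1))
        c = c[:, :, None, None].expand(-1, -1, x_t.shape[2], x_t.shape[3])
        return self.net(torch.cat([x_t, c], dim=1))

enc, eps_theta = Encoder(), Denoiser()
opt = torch.optim.Adam(list(enc.parameters()) + list(eps_theta.parameters()), lr=1e-4)

def training_step(x0):
    B = x0.shape[0]
    z = enc(x0)                                            # infer K factors from the clean image
    t = torch.randint(0, T, (B,))
    noise = torch.randn_like(x0)
    a = alphas_bar[t].view(B, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise           # forward diffusion
    # Composed prediction: average the K per-factor predictions (an assumed composition rule).
    eps_hat = torch.stack([eps_theta(x_t, t, z[:, k]) for k in range(K)]).mean(0)
    loss = F.mse_loss(eps_hat, noise)                      # composed prediction should recover the noise
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```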
We illustrate our approach's ability to decompose images into both global and local concepts, as well as reconstruct the original image by recombining concepts.
Below, we showcase additional examples of decomposition and reconstruction. We also show that discovered concepts generalize well and can compose across different modalities of data.
Given a set of input images, Decomp Diffusion can capture a set of global scene descriptors such as lighting and background, and recombine them to construct image variations.
Our method enables global factor decomposition and reconstruction on CelebA-HQ (Left) and Virtual KITTI 2 (Right). Note that we name the inferred concepts for ease of understanding.
Global factor decomposition on CelebA-HQ and Virtual KITTI 2. On the left, we decompose CelebA-HQ images into inferred components of facial features, hair color, skin tone, and hair shape, and combine these factors to reconstruct the original image. On the right, we decompose Virtual KITTI 2 images into inferred factors of shadow, lighting, landscape, and objects.
Our approach can also recombine factors to produce new images. In Falcor3D (Left), we produce variations on a source image by varying a target factor, such as lighting intensity, while preserving its other factors. In CelebA-HQ (Right), we recombine factors from two different inputs to generate novel face combinations.
Global factor recombination on Falcor3D and CelebA-HQ. On the left, we vary inferred factors of lighting intensity, camera position, and lighting position for a particular Falcor3D input. On the right, we recombine an inferred facial features factor from the 1st input and hair color, color temperature, and hair shape factors from the 2nd input to generate novel face combinations.
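These recombinations amount to mixing inferred latents from different inputs before sampling. The sketch below illustrates this with a DDPM-style ancestral sampler, reusing the hypothetical `enc`, `eps_theta`, and noise schedule from the training sketch above; averaging the per-factor predictions at every step is again an illustrative assumption.

```python
# Recombination sketch (hypothetical; reuses enc, eps_theta, betas, alphas_bar from above).
@torch.no_grad()
def recombine(x_a, x_b, idx_from_a, shape=(1, 3, 64, 64)):
    z_a, z_b = enc(x_a), enc(x_b)                          # (1, K, Z_DIM) each
    keep = torch.tensor([k in idx_from_a for k in range(K)]).view(1, K, 1)
    z = torch.where(keep, z_a, z_b)                        # chosen factors from A, the rest from B
    x = torch.randn(shape)
    for t in reversed(range(T)):
        tt = torch.full((shape[0],), t)
        eps = torch.stack([eps_theta(x, tt, z[:, k]) for k in range(K)]).mean(0)
        a_bar, beta = alphas_bar[t], betas[t]
        x = (x - beta / (1 - a_bar).sqrt() * eps) / (1 - beta).sqrt()   # DDPM posterior mean
        if t > 0:
            x = x + beta.sqrt() * torch.randn_like(x)
    return x

# e.g., take factors {0, 2} from image A and factors {1, 3} from image B:
# new_img = recombine(x_a, x_b, idx_from_a={0, 2})
```

Varying a single target factor while preserving the others, as in the Falcor3D example, corresponds to swapping just one latent index in the same way.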
Given an input image with multiple objects, e.g., a purple cylinder and a green cube, Decomp Diffusion can decompose the input image into individual object components, akin to object-level segmentation.
Our method can extract individual object components that can be reused for image reconstruction on CLEVR (Left) and Tetris (Right).
Local factor decomposition on CLEVR and Tetris. On the left, our approach decomposes CLEVR images into individual object components. Similarly, on the right, our method decomposes Tetris images into individual Tetris blocks.
We recombine local factors from 2 images to generate compositions of inferred object factors. On both CLEVR and Tetris (Left, Middle), we recombine the inferred object components highlighted in the bounding boxes to generate novel object compositions. On CLEVR (Right), we compose all inferred factors to generalize to scenes with up to 8 objects, even though training images contain only 4 objects.
Local factor recombination on CLEVR and Tetris. On CLEVR and Tetris (Left, Middle), we combine object components from each of 2 input images, labeled within the bounding boxes, to generate compositions that contain all desired objects. On CLEVR (Right), we compose all 4 inferred object factors from each of 2 inputs to create a composition with 8 objects, even though training images contain only 4 objects. This demonstrates that our method can generalize to unseen scenarios.
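One way to read "compose all inferred factors" in code is to condition sampling on the union of latents from both inputs, so that more factors are composed at test time ($2K$) than were used during training ($K$). The sketch below shows this under the same illustrative assumptions as the earlier snippets.

```python
# Composing all factors from two inputs (hypothetical; reuses enc, eps_theta, schedule from above).
@torch.no_grad()
def compose_all(x_a, x_b, shape=(1, 3, 64, 64)):
    z = torch.cat([enc(x_a), enc(x_b)], dim=1)             # (1, 2K, Z_DIM): every factor of both inputs
    x = torch.randn(shape)
    for t in reversed(range(T)):
        tt = torch.full((shape[0],), t)
        eps = torch.stack([eps_theta(x, tt, z[:, k]) for k in range(z.shape[1])]).mean(0)
        a_bar, beta = alphas_bar[t], betas[t]
        x = (x - beta / (1 - a_bar).sqrt() * eps) / (1 - beta).sqrt()
        if t > 0:
            x = x + beta.sqrt() * torch.randn_like(x)
    return x
```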
Finally, we show how Decomp Diffusion can extract and combine concepts across multiple datasets. We demonstrate the recombination of factors in multi-modal datasets, and the combination of factors from distinct models trained on different datasets.
We show that our method can capture a set of global factors shared across hybrid datasets, such as KITTI and Virtual KITTI 2 scenes (Left), and CelebA-HQ and Anime faces (Right). Inferred concepts are named for ease of understanding.
Multi-modal decomposition on road scenes and faces. On the left, we apply our method on a dataset of both KITTI and Virtual KITTI images to extract inferred factors of background, background texture, objects, and foreground. On the right, our method learns to decompose a joint dataset of photorealistic CelebA-HQ faces and Anime faces into factors of face shape, head shape, color temperature, and facial details.
Decomp Diffusion can recombine inferred factors from hybrid datasets to generate novel compositions. On a hybrid KITTI and Virtual KITTI dataset (Top), we recombine factors from a KITTI image and Virtual KITTI image to produce novel road scenes. On a hybrid CelebA-HQ and Anime dataset (Bottom), we combine hair shapes and colors from a CelebA-HQ human face image with face shape and facial features from an Anime face image to generate unique anime-like faces.
Multi-modal recombination on road scenes and faces. On top, we recombine inferred background and background lighting factors from the 1st KITTI input, as well as foreground and shadow factors from the 2nd Virtual KITTI input, to generate a novel combination. On the bottom, we combine inferred hair shape and hair color factors from the 1st CelebA-HQ image, plus face shape and facial detail factors from the 2nd Anime image, to create anime-like faces.
Our method can recombine factors across 2 distinct models trained on different datasets. Below, we combine factors across one model trained on CLEVR and another model trained on CLEVR Toy to generate unseen compositions.
Cross-dataset recombination on CLEVR and CLEVR Toy. We combine 2 object factors from one model trained on CLEVR and 2 object factors from a different model trained on CLEVR Toy to generate novel compositions containing both geometric objects and toy objects.
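Conceptually, cross-dataset recombination only requires that both models expose the same noise-prediction interface: at every sampling step, the selected per-factor predictions from the two models are combined. The sketch below illustrates this with two hypothetical (encoder, denoiser) pairs standing in for the CLEVR and CLEVR Toy models, under the same assumptions as the earlier snippets.

```python
# Cross-model composition sketch (hypothetical; reuses Encoder, Denoiser, schedule from above).
# Untrained stand-ins here; in practice these are the separately trained CLEVR and CLEVR Toy models.
enc_clevr, eps_clevr = Encoder(), Denoiser()
enc_toy, eps_toy = Encoder(), Denoiser()

@torch.no_grad()
def cross_model_compose(x_clevr, x_toy, idx_clevr, idx_toy, shape=(1, 3, 64, 64)):
    z1, z2 = enc_clevr(x_clevr), enc_toy(x_toy)            # factors inferred by each model separately
    x = torch.randn(shape)
    for t in reversed(range(T)):
        tt = torch.full((shape[0],), t)
        preds = [eps_clevr(x, tt, z1[:, k]) for k in idx_clevr] + \
                [eps_toy(x, tt, z2[:, k]) for k in idx_toy]
        eps = torch.stack(preds).mean(0)                   # combine predictions across the two models
        a_bar, beta = alphas_bar[t], betas[t]
        x = (x - beta / (1 - a_bar).sqrt() * eps) / (1 - beta).sqrt()
        if t > 0:
            x = x + beta.sqrt() * torch.randn_like(x)
    return x

# e.g., two object factors from the CLEVR model and two from the CLEVR Toy model:
# img = cross_model_compose(x_clevr, x_toy, idx_clevr=[0, 1], idx_toy=[2, 3])
```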