Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models

Abstract

Text-to-image generative models have enabled high-resolution image synthesis across different domains, but require users to specify the content they wish to generate. In this paper, we consider the inverse problem: given a collection of different images, can we discover the generative concepts that represent each image? We present an unsupervised approach to discover generative concepts from a collection of images, disentangling different art styles in paintings, objects and lighting in kitchen scenes, and image classes in ImageNet images. We show how such generative concepts can accurately represent the content of images, be recombined and composed to generate new artistic and hybrid images, and be further used as a representation for downstream classification tasks.

Method

We discover a set of compositional concepts given a dataset of unlabeled images. Score functions representing each concept $\{c^1, \dots, c^K\}$ are composed together to form a compositional score function that is trained to denoise images. The inferred concepts can be used to generate new images.
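The training objective described above can be illustrated with a minimal sketch. It assumes the per-concept noise predictions are combined by a weighted sum (uniform by default) and compared to the true noise with a mean-squared error. The function `toy_concept_denoiser`, the scalar concept values, and the `alpha_bar` noise level are hypothetical stand-ins for illustration, not the authors' network or schedule.

```python
def toy_concept_denoiser(x_t, concept):
    """Hypothetical stand-in for a concept-conditioned noise predictor.

    A real model would be a neural network conditioned on a learned
    concept embedding; here each concept is just a scalar multiplier.
    """
    return [concept * v for v in x_t]


def composed_noise_prediction(x_t, concepts, weights=None):
    """Combine the K per-concept predictions with a weighted sum.

    Uniform weights are an assumption made for this sketch.
    """
    k = len(concepts)
    weights = weights or [1.0 / k] * k
    preds = [toy_concept_denoiser(x_t, c) for c in concepts]
    return [sum(w * p[i] for w, p in zip(weights, preds))
            for i in range(len(x_t))]


def denoising_loss(x_0, noise, concepts, alpha_bar=0.5):
    """Mean-squared error between the true noise and the composed prediction."""
    # Forward diffusion step: mix the clean image with Gaussian noise.
    x_t = [alpha_bar ** 0.5 * a + (1.0 - alpha_bar) ** 0.5 * n
           for a, n in zip(x_0, noise)]
    pred = composed_noise_prediction(x_t, concepts)
    return sum((n - p) ** 2 for n, p in zip(noise, pred)) / len(noise)
```

In an actual training loop, the concept representations would be the quantities being optimized, so that minimizing this loss over the image collection forces the K concepts to jointly explain each image.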

Object Decomposition

Our method can decompose a set of unlabeled images into objects without using any labels.

Indoor Scene Decomposition

Our method can decompose kitchen scenes into a kitchen range (i.e., stove and microwave), a kitchen island, and lighting effects. Note that we name each concept based on its attention responses for easy visualization.

Concepts shown: Kitchen Range, Kitchen Island, Lighting Effects, Silver.

Artistic Concept Decomposition

Our method can decompose paintings from artists into artistic components.

Composing Discovered Concepts

After a set of factors is discovered from a collection of images, our method further enables compositional generation using the compositional operators from composable diffusion. Concept names shown without quotation marks are provided by us for readability, since the concepts are discovered without any labels.
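As a rough illustration of such an operator, the sketch below implements the conjunction (AND) rule as we understand it from composable diffusion: the unconditional noise prediction is shifted by a weighted sum of the differences between each concept-conditioned prediction and the unconditional one. The function name `compose_and` and the list-based vectors are assumptions made for this example, and the predictions would come from a diffusion model in practice.

```python
def compose_and(eps_uncond, eps_conds, weights):
    """Conjunction of concepts in noise-prediction space.

    eps_hat = eps_uncond + sum_i w_i * (eps_cond_i - eps_uncond)

    eps_uncond: unconditional noise prediction (list of floats)
    eps_conds:  one noise prediction per discovered concept
    weights:    one guidance weight per concept
    """
    out = list(eps_uncond)
    for eps_c, w in zip(eps_conds, weights):
        for i, (c, u) in enumerate(zip(eps_c, eps_uncond)):
            out[i] += w * (c - u)
    return out
```

At sampling time, the composed prediction would replace the single-prompt prediction at every denoising step, so concepts discovered from different image collections can be mixed in one generated image.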

Related Projects

Check out our related papers on compositional generation and energy-based models.

Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC

We propose new samplers, inspired by MCMC, to enable successful compositional generation. Further, we propose an energy-based parameterization of diffusion models which enables the use of new compositional operators and more sophisticated, Metropolis-corrected samplers.

Unsupervised Learning of Compositional Energy Concepts

We propose COMET, which discovers and represents concepts as separate energy functions, enabling us to represent both global concepts as well as objects under a unified framework. COMET discovers energy functions through recomposing the input image, which we find captures independent factors without additional supervision.

Compositional Visual Generation with Composable Diffusion Models

We present a method to compose different diffusion models together, drawing on the close connection between diffusion models and EBMs. We illustrate how compositional operators make it possible to compose multiple sets of objects together as well as to generate images subject to complex text prompts.

Learning to Compose Visual Relations

The visual world around us can be described as a structured set of objects and their associated relations. In this work, we propose to represent each relation as an unnormalized density (an energy-based model), enabling us to compose separate relations in a factorized manner. We show that such a factorized decomposition allows the model to both generate and edit scenes that have multiple sets of relations more faithfully.

Compositional Visual Generation with Energy Based Models

We present a set of compositional operators that enable EBMs to exhibit zero-shot compositional visual generation, composing visual concepts through conjunction, disjunction, or negation. Our approach enables us to generate faces given a description ((Smiling AND Female) OR (NOT Smiling AND Male)) or to combine several different objects together.
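The logical operators can be sketched on scalar energies, under the usual convention that a density is proportional to exp(-E). Under that assumption, conjunction multiplies densities (energies add), disjunction sums densities (a negative log-sum-exp of negated energies), and negation divides one density by a power of another. The function names and the `alpha` parameter below are illustrative choices, not the paper's code.

```python
import math


def conjunction(*energies):
    """AND: product of densities, so the energies add."""
    return sum(energies)


def disjunction(*energies):
    """OR: sum of densities, i.e. -log(sum_i exp(-E_i)), computed stably."""
    m = min(energies)
    return m - math.log(sum(math.exp(-(e - m)) for e in energies))


def negation(e_keep, e_negate, alpha=1.0):
    """Keep one concept while negating another: E_keep - alpha * E_negate."""
    return e_keep - alpha * e_negate


def face_energy(e_smiling, e_female, e_male, alpha=1.0):
    """Energy for (Smiling AND Female) OR (NOT Smiling AND Male)."""
    return disjunction(
        conjunction(e_smiling, e_female),
        negation(e_male, e_smiling, alpha),
    )
```

Sampling from the composed distribution would then run an MCMC procedure such as Langevin dynamics on the gradient of this combined energy with respect to the image.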

BibTeX

@InProceedings{Liu_2023_ICCV,
    author    = {Liu, Nan and Du, Yilun and Li, Shuang and Tenenbaum, Joshua B. and Torralba, Antonio},
    title     = {Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {2085-2095}
}