Composing Ensembles of Pre-trained Models via Iterative Consensus

Shuang Li*¹, Yilun Du*¹, Joshua B. Tenenbaum¹, Antonio Torralba¹, Igor Mordatch²

¹ MIT ² Google Brain

*indicates equal contribution. Shuang Li did all the experiments on image generation, video question answering, and mathematical reasoning. Yilun Du did all the experiments on robot manipulation.

ICLR 2023

Paper

Code

A unified framework for composing pre-trained models.

Abstract

Large pre-trained models exhibit distinct and complementary capabilities dependent on the data they are trained on. Language models such as GPT-3 are capable of textual reasoning but cannot understand visual information, while vision models such as DALL-E can generate photorealistic photos but fail to understand complex language descriptions.

In this work, we propose a unified framework for composing ensembles of different pre-trained models -- combining the strengths of each individual model to solve various multimodal problems in a zero-shot manner. We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization. The generator constructs proposals and the scorers iteratively provide feedback to refine the generated result. Such closed-loop communication enables models to correct errors caused by other models, significantly boosting performance on downstream tasks, e.g. improving accuracy on grade school math problems by 7.5%, without requiring any model finetuning. We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer, by leveraging the strengths of each expert model. Results show that the proposed method can be used as a general purpose framework for a wide range of zero-shot multimodal tasks, such as image generation, video question answering, mathematical reasoning, and robotic manipulation.

Method

The proposed framework composes a "generator" and an ensemble of "scorers" through iterative consensus, which enables zero-shot generalization across a variety of multimodal tasks.

Overview of the proposed unified framework. Dashed lines are omitted for certain tasks. Orange lines represent the components used to refine the generated result.
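The closed loop between a generator and an ensemble of scorers can be sketched with a toy example. This is a minimal illustration under stated assumptions, not the implementation used in the paper: proposals are plain floats, the "generator" perturbs the current result with Gaussian noise, and consensus is taken to be the mean of the scorer outputs. All names here (`consensus_score`, `iterative_consensus`, `make_toy_generator`, `toy_scorers`) are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[float], float]
Proposer = Callable[[float, int], List[float]]


def consensus_score(candidate: float, scorers: Sequence[Scorer]) -> float:
    """Average the feedback of all scorers for one candidate (assumed rule)."""
    return sum(score(candidate) for score in scorers) / len(scorers)


def iterative_consensus(
    propose: Proposer,
    scorers: Sequence[Scorer],
    init: float,
    steps: int = 200,
    num_candidates: int = 8,
) -> Tuple[float, float]:
    """Propose around the current result and keep what the ensemble prefers."""
    best, best_score = init, consensus_score(init, scorers)
    for _ in range(steps):
        for candidate in propose(best, num_candidates):
            score = consensus_score(candidate, scorers)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score


def make_toy_generator(seed: int = 0, noise: float = 0.5) -> Proposer:
    """Toy stand-in for a pre-trained generator: noisy variations of the input."""
    rng = random.Random(seed)

    def propose(current: float, n: int) -> List[float]:
        return [current + rng.gauss(0.0, noise) for _ in range(n)]

    return propose


# Two toy scorers that disagree: one prefers 2.0, the other prefers 4.0.
toy_scorers = [lambda x: -(x - 2.0) ** 2, lambda x: -(x - 4.0) ** 2]
result, score = iterative_consensus(make_toy_generator(), toy_scorers, init=0.0)
print(f"refined result: {result:.2f}, consensus score: {score:.2f}")
```

Because the two toy scorers disagree, neither one alone determines the outcome; the loop settles near the point that the averaged score favors.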

Image generation: A pre-trained diffusion model is used as the generator, and multiple scorers, such as CLIP and image classifiers, are used to provide feedback to the generator.
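For a continuous generator, one way scorers can provide feedback is through gradients. The sketch below assumes this form of feedback and uses a toy setting: the "image" is a short vector, each scorer is a quadratic energy with its own target, and the update averages the scorer gradients. It does not model a diffusion process or CLIP, and `quadratic_scorer_grad` and `refine` are hypothetical names.

```python
from typing import Callable, List, Sequence

Vector = List[float]
GradFn = Callable[[Vector], Vector]


def quadratic_scorer_grad(target: Vector) -> GradFn:
    """Gradient of the toy energy E(x) = 0.5 * ||x - target||^2."""

    def grad(x: Vector) -> Vector:
        return [xi - ti for xi, ti in zip(x, target)]

    return grad


def refine(
    x: Vector,
    scorer_grads: Sequence[GradFn],
    steps: int = 100,
    lr: float = 0.1,
) -> Vector:
    """Repeatedly move x against the mean gradient of all scorer energies."""
    for _ in range(steps):
        grads = [g(x) for g in scorer_grads]
        mean_grad = [sum(column) / len(grads) for column in zip(*grads)]
        x = [xi - lr * gi for xi, gi in zip(x, mean_grad)]
    return x


# Two toy scorers pulling toward different targets.
scorer_grads = [quadratic_scorer_grad([1.0, 0.0]), quadratic_scorer_grad([0.0, 1.0])]
refined = refine([5.0, -3.0], scorer_grads)
print([round(v, 3) for v in refined])
```

With equally weighted quadratic energies the refined vector approaches the mean of the two targets, which is the toy analogue of the scorers reaching a consensus.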

Video question answering: GPT-2 is used as the generator, and a set of CLIP models are used as scorers.

Grade school math: GPT-2 is used as the generator, and a set of question-solution classifiers are used as scorers.
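For the two text tasks, where a language model is the generator, one plausible way to combine generator and scorers is to re-rank beam search candidates at every step. The sketch below assumes that scheme: each partial sequence is ranked by its generator log-probability plus a weighted average of scorer feedback. The two-token vocabulary, the scorers, and the name `beam_search_with_scorers` are all made up for illustration and do not come from the paper.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Tokens = Tuple[str, ...]
NextTokenFn = Callable[[Tokens], Dict[str, float]]
SeqScorer = Callable[[Tokens], float]


def beam_search_with_scorers(
    next_token_logprobs: NextTokenFn,
    scorers: Sequence[SeqScorer],
    max_len: int,
    beam_size: int = 2,
    weight: float = 1.0,
) -> Tokens:
    """Beam search where scorer feedback re-ranks the generator's proposals."""
    beams: List[Tuple[Tokens, float]] = [((), 0.0)]

    def total(item: Tuple[Tokens, float]) -> float:
        tokens, gen_logprob = item
        feedback = sum(s(tokens) for s in scorers) / len(scorers)
        return gen_logprob + weight * feedback

    for _ in range(max_len):
        expanded = [
            (tokens + (tok,), gen_logprob + logprob)
            for tokens, gen_logprob in beams
            for tok, logprob in next_token_logprobs(tokens).items()
        ]
        expanded.sort(key=total, reverse=True)
        beams = expanded[:beam_size]
    return beams[0][0]


def toy_generator(tokens: Tokens) -> Dict[str, float]:
    """Toy stand-in for a language model that slightly prefers 'x'."""
    return {"x": math.log(0.6), "y": math.log(0.4)}


toy_text_scorers = [
    lambda tokens: float(tokens.count("y")),  # rewards every 'y'
    lambda tokens: 1.0 if tokens[-1] == "y" else 0.0,  # rewards ending in 'y'
]

with_feedback = beam_search_with_scorers(toy_generator, toy_text_scorers, max_len=2)
without_feedback = beam_search_with_scorers(
    toy_generator, toy_text_scorers, max_len=2, weight=0.0
)
print(with_feedback, without_feedback)
```

Setting `weight=0.0` recovers plain beam search over the generator alone, which makes it easy to compare the output with and without scorer feedback.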

Robot manipulation: MPC+World model is used as the generator, and a pre-trained image segmentation model is used to compute the scores from multiple camera views to select the best action.
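Selecting an action from scores computed over several camera views can be sketched as follows. This is a toy under stated assumptions: the world model is a simple additive function, each "view" observes only one coordinate of the predicted state, and scores are summed across views. The names `select_action`, `toy_world_model`, `view_x`, and `view_y` are hypothetical, and no segmentation model is involved.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[float, float]
Action = Tuple[float, float]
WorldModel = Callable[[State, Action], State]
ViewScorer = Callable[[State], float]


def select_action(
    state: State,
    candidate_actions: Sequence[Action],
    world_model: WorldModel,
    view_scorers: Sequence[ViewScorer],
) -> Action:
    """Pick the action whose predicted outcome scores best summed over views."""

    def consensus(action: Action) -> float:
        predicted = world_model(state, action)
        return sum(score(predicted) for score in view_scorers)

    return max(candidate_actions, key=consensus)


def toy_world_model(state: State, action: Action) -> State:
    """Toy dynamics: the action is added to the current position."""
    return (state[0] + action[0], state[1] + action[1])


# The goal is (1, 1); each toy view observes only one coordinate.
def view_x(state: State) -> float:
    return -(state[0] - 1.0) ** 2


def view_y(state: State) -> float:
    return -(state[1] - 1.0) ** 2


actions = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0)]
best = select_action((0.0, 0.0), actions, toy_world_model, [view_x, view_y])
single_view = select_action((0.0, 0.0), actions, toy_world_model, [view_x])
print(best, single_view)
```

In this toy, a single view cannot distinguish between actions that look identical from its angle, while the summed score over both views singles out the action that reaches the goal.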

Video Question Answering Results

Grade School Math Results

Image Generation

Robot Manipulation Results

Related Projects

Check out a list of our related papers on compositionality. A full list can be found here!

Compositional Visual Generation with Composable Diffusion Models

We present a method to compose different diffusion models together, drawing on the close connection between diffusion models and EBMs. We illustrate how compositional operators make it possible to compose multiple sets of objects and to generate images subject to complex text prompts.

Learning to Compose Visual Relations

The visual world around us can be described as a structured set of objects and their associated relations. In this work, we propose to represent each relation as an unnormalized density (an energy-based model), enabling us to compose separate relations in a factorized manner. We show that such a factorized decomposition allows the model to both generate and edit scenes that have multiple sets of relations more faithfully.

Compositional Visual Generation with Energy Based Models

We present a set of compositional operators that enable EBMs to exhibit zero-shot compositional visual generation, composing visual concepts through conjunction, disjunction, or negation. Our approach enables us to generate faces given a description such as ((Smiling AND Female) OR (NOT Smiling AND Male)), or to combine several different objects.

Team

Shuang Li

MIT

Yilun Du

MIT

Joshua Tenenbaum

MIT

Antonio Torralba

MIT

Igor Mordatch

Google Brain
