TL;DR: We introduce a test-time iterative refinement strategy that uses a VLM feedback critic and an image editor in a loop to improve compositional text-to-image generation. Our method significantly outperforms compute-matched parallel sampling across various models (including GPT-Image-1, NanoBanana, and Qwen-Image) on compositional generation benchmarks and is preferred by human evaluators 59% of the time.
Text-to-image (T2I) models have achieved remarkable progress, yet they continue to struggle with complex prompts that require simultaneously handling multiple objects, relations, and attributes. Existing inference-time strategies, such as parallel sampling with verifiers or simply increasing denoising steps, can improve prompt alignment but remain inadequate for richly compositional settings where many constraints must be satisfied. Inspired by the success of chain-of-thought reasoning in large language models, we propose an iterative test-time strategy in which a T2I model progressively refines its generations across multiple steps, guided by feedback from a vision-language model as the critic in the loop. Our approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models. Empirically, we demonstrate consistent gains on image generation across benchmarks: a 16.9% improvement in all-correct rate on ConceptMix (k=7) and a 13.8% improvement on T2I-CompBench (3D-Spatial category) compared to compute-matched parallel sampling. Beyond quantitative gains, iterative refinement produces more faithful generations by decomposing complex prompts into sequential corrections, with human evaluators preferring our method 58.7% of the time over 41.3% for the parallel baseline. Together, these findings highlight iterative self-correction as a broadly applicable principle for compositional image generation.
Iterative inference-time refinement yields stronger gains than compute-matched parallel inference across multiple state-of-the-art text-to-image models.
Large language models (LLMs) have achieved remarkable progress in recent years, driven in large part by simply scaling test-time compute. A particularly influential development has been the use of chain-of-thought (CoT) prompting, where models are instructed to "think step by step". Despite its simplicity, this strategy enables models to exhibit sophisticated behaviors such as self-correction, error checking, and iterative refinement.
The success of CoT reasoning in LLMs is closely tied to their pre-training data. During training, LLMs are exposed to large volumes of text that naturally contain traces of human step-by-step reasoning. This internet-scale supervision implicitly provides the prior that chain-of-thought prompting later exploits.
By contrast, text-to-image (T2I) models are trained on large-scale datasets of image-caption pairs that lack such structured reasoning traces. As a result, these models do not inherently develop capabilities like self-correction or iterative refinement.
Our central idea is to leverage complementary modules that together mimic the iterative reasoning process observed in LLMs. Concretely, our framework integrates four components: (i) a text-to-image (T2I) model to generate an initial image, (ii) a vision-language model (VLM) critic to identify corrections by comparing the generated image with the target prompt, (iii) an image editor to apply the suggested edits, and (iv) a verifier to evaluate alignment between the final image and the desired description.
Our iterative refinement framework combines a generator, feedback critic, editor, and verifier for progressive image improvement.
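To make the loop concrete, here is a minimal Python sketch of the refinement procedure; it is an illustration, not the released implementation, and generate, critique, edit, and verify are hypothetical callables standing in for the T2I model, the VLM critic, the image editor, and the verifier.

def iterative_refine(prompt, generate, critique, edit, verify, max_steps=8):
    """Refine an image until the verifier is satisfied or the step budget runs out."""
    image = generate(prompt)                               # (i) initial T2I generation
    best_image, best_score = image, verify(image, prompt)  # (iv) verifier scores prompt alignment
    for _ in range(max_steps - 1):
        feedback = critique(image, prompt)                 # (ii) VLM critic lists needed corrections
        if not feedback:                                   # nothing left to fix: stop early
            break
        image = edit(image, feedback)                      # (iii) editor applies the suggested edits
        score = verify(image, prompt)
        if score > best_score:                             # keep the best image seen so far
            best_image, best_score = image, score
    return best_image, best_score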
We evaluate our approach against the widely adopted strategy of parallel sampling, where multiple images are generated independently and the best one is selected using a verifier. While parallel sampling increases diversity, it does not fundamentally change the underlying generation process.
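For comparison, the best-of-N parallel baseline can be sketched in the same style, reusing the hypothetical generate and verify callables from the sketch above.

def parallel_sample(prompt, generate, verify, n_samples=16):
    """Draw n independent samples and return the one the verifier scores highest."""
    candidates = [generate(prompt) for _ in range(n_samples)]   # independent generations
    scores = [verify(image, prompt) for image in candidates]    # verifier-based selection
    best = max(range(n_samples), key=lambda i: scores[i])
    return candidates[best], scores[best]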
Our iterative approach consistently outperforms parallel-only baselines across multiple benchmarks, with gains most pronounced on complex compositional tasks.
Performance comparison of parallel sampling, iterative refinement, and combined strategies across three state-of-the-art text-to-image models on ConceptMix and T2I-CompBench. Our iterative approach (Iter.) and combined iterative+parallel strategy (Iter.+Par.) consistently outperform traditional parallel-sampling baselines, with gains most pronounced on complex compositional tasks (ConceptMix k=4-7) and precise spatial and numeric reasoning (T2I-CompBench spatial, 3D spatial, and numeracy categories).
As shown, our method consistently outperforms prior compositional generation methods that utilize tools (GenArtist) or regional priors (RPG), and scales better at higher binding complexities.
Performance on ConceptMix from k=1 to k=7 for different models. As shown, our method consistently improves over parallel sampling across models and prompt complexities.
Breakdown of performance across different ConceptMix categories.
Comparison of iterative and parallel compute allocations. Given a test-time budget of 16 steps, allocations that favor iteration (e.g., 8 iterative steps with 2 parallel samples) outperform both purely parallel sampling and purely iterative refinement.
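As a rough illustration of such a budget split (an assumption about the call accounting, not the paper's exact protocol), the sketch below runs a few refinement chains in parallel and keeps the verifier's top pick, reusing the hypothetical iterative_refine function sketched earlier.

def hybrid_search(prompt, generate, critique, edit, verify, budget=16, n_chains=2):
    """Split a fixed call budget across parallel chains of iterative refinement."""
    steps_per_chain = budget // n_chains   # e.g. 16 // 2 = 8 refinement steps per chain
    results = [
        iterative_refine(prompt, generate, critique, edit, verify, max_steps=steps_per_chain)
        for _ in range(n_chains)           # chains are independent and could run concurrently
    ]
    return max(results, key=lambda pair: pair[1])  # (image, score) pair with the best score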
Human evaluators preferred our iterative method 58.7% of the time over parallel sampling.
The gallery contains examples of our method compared to baseline approaches across different types of compositional prompts, and provides step-by-step reasoning traces from the refinement process.
Gallery of iterative refinement vs. baseline methods with detailed reasoning traces.
Text-to-Image Inference-Time Strategies. Recent advances in text-to-image (T2I) generation have demonstrated impressive capabilities. However, complex prompts with multiple objects, relations, and fine-grained attributes remain challenging. Inference-time strategies such as classifier-free guidance, parallel sampling, and grounding-based methods improve prompt fidelity but often fail to scale to richly compositional prompts.
Chain-of-Thought Reasoning in Large Language Models. Chain-of-thought (CoT) prompting has been shown to elicit multi-step reasoning and improve performance on complex language tasks. Drawing inspiration from these strategies, our method applies a similar iterative reasoning paradigm to T2I generation: the critic functions analogously to a CoT process, enabling high-fidelity compositional image synthesis.
Our method is subject to the limitations of its component models, which primarily manifest in two error modes: the image editor may fail to correctly apply a suggested correction, and the critic may fail to detect a remaining misalignment between the image and the prompt. We show examples of these failure modes below.
Prompt: "The image features a heart-shaped giraffe, a tiny pink screwdriver, and a huge robot. The screwdriver is positioned at the bottom of the robot, touching it."
Our Result
Giraffe not exactly heart-shaped (and this error was not detected by the critic).
Reasoning Trace
Step 1
The image features a heart-shaped giraffe, a tiny pink screwdriver...
Step 2
Reposition the tiny pink screwdriver so that it is clearly touching...
Step 3
Ensure the tiny pink screwdriver is visibly touching...
Step 4
Introduce a tiny pink screwdriver lying on the ground...
Step 5
Refine the image to ensure the tiny pink screwdriver...
Prompt: "A pixel art image shows a tiny pine tree with two carrots positioned in front of it. There is also a house in the scene."
Our Result
Carrots not placed directly in front of the pine tree by the editor (and this error was not detected by the critic).
Reasoning Trace
Step 1
A pixel art image shows a tiny pine tree...
Step 2
Edit the pixel art image by making the large pine tree...
Step 3
Edit the image to clearly show exactly two distinct...
Step 4
Edit the image to reposition the two distinct carrots...
@article{jaiswal2026iterative,
  author  = {Jaiswal, Shantanu and Prabhudesai, Mihir and Bhardwaj, Nikash and Qin, Zheyang and Zadeh, Amir and Li, Chuan and Fragkiadaki, Katerina and Pathak, Deepak},
  title   = {Iterative Refinement Improves Compositional Image Generation},
  journal = {arXiv preprint arXiv:2601.15286},
  year    = {2026},
}