Inference-Time Scaling for Diffusion Models
This paper focuses on enhancing the quality of images generated by diffusion models. Rather than simply increasing the number of refinement steps, it introduces a search-based approach to find the best "noise" to start from and the best way to refine it, guided by "verifiers" that judge the quality of the image at each step.
TECH DRIVEN FUTURE
Snehanshu Jena
1/27/2025 · 4 min read
Diffusion models have emerged as a groundbreaking class of generative models, achieving remarkable success in generating high-quality images, audio and even videos. These models operate by learning to reverse a process of noise injection, progressively denoising data to create realistic samples.
While diffusion models offer flexibility in allocating inference-time compute budget through adjusting the number of denoising steps, previous research has shown that performance gains tend to plateau after a certain point. This research paper delves into the exciting possibilities of scaling diffusion model inference beyond simply increasing denoising steps.
The Purpose and Problem
The primary purpose of this research is to explore the inference-time scaling behavior of diffusion models beyond the conventional approach of increasing denoising steps. The authors investigate how generation quality can be further enhanced by strategically allocating additional computation during inference.
The problem this research tackles is the performance plateau encountered when solely relying on scaling denoising steps. The authors aim to discover alternative avenues for utilizing inference-time compute resources to unlock further performance gains.
Key Findings
This research uncovers several key findings:
Search for Better Noises: The researchers propose a search-based framework for identifying better noises during the diffusion sampling process. They show that not all initial noises are created equal: some lead to significantly better generation quality than others.
Verifiers and Algorithms: The research identifies two key design axes for the search framework: verifiers and algorithms. Verifiers provide feedback on the quality of noise candidates, while algorithms guide the search process.
No Universal Solution: The study reveals that there is no one-size-fits-all solution for inference-time scaling. The optimal combination of verifiers and algorithms depends on the specific task and dataset.
Effectiveness of Scaling: The researchers demonstrate that scaling inference-time compute through search leads to substantial improvements in generation quality across various tasks and model sizes.
Elaboration on Inference-Time Scaling Beyond Denoising Steps
The Search-Based Framework
The core idea behind this research is to treat the diffusion sampling process as a search problem. Instead of simply increasing denoising steps, the researchers propose actively searching for better noises that lead to higher-quality generations.
This search-based framework involves two main components:
Verifiers: Verifiers act as evaluators, providing feedback on the quality of noise candidates. They can be pre-trained models or metrics that assess various aspects of the generated images, such as visual quality, text alignment or human preferences.
Algorithms: Algorithms guide the search process, exploring the space of possible noises to identify promising candidates. The research explores different algorithms, including random search, zero-order search and search over paths.
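The interaction between these two components can be sketched as a simple selection loop. This is a minimal illustration, not the paper's implementation: `denoise` and `verifier` below are hypothetical placeholders standing in for a real diffusion sampler and a learned scorer (e.g., a CLIP- or preference-based model).

```python
import numpy as np

def denoise(noise):
    """Placeholder sampler: stands in for running the full denoising trajectory."""
    return np.tanh(noise)

def verifier(sample):
    """Placeholder verifier: higher score = better sample (illustrative only)."""
    return -np.abs(sample).mean()

def search_best_noise(n_candidates, shape, rng):
    """Random search: sample noises, denoise each, keep the highest-scoring one."""
    best_noise, best_score = None, -np.inf
    for _ in range(n_candidates):
        noise = rng.standard_normal(shape)
        score = verifier(denoise(noise))
        if score > best_score:
            best_noise, best_score = noise, score
    return best_noise, best_score
```

The algorithm decides *which* noises to try (here, uniform random sampling), while the verifier decides *which* result to keep; swapping either component changes the search without touching the other.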
Verifier-Task Alignment
The research emphasizes the importance of aligning verifiers with the specific requirements of the generation task. Different tasks may prioritize different aspects of image quality, and using a misaligned verifier can lead to suboptimal results.
For instance, in the ImageNet class-conditional generation task, using an oracle verifier that directly optimizes for FID or IS leads to significant improvements. However, in the more complex text-conditioned image generation task, a more holistic evaluation is necessary, and using a combination of verifiers that assess different aspects of image quality proves more effective.
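Combining verifiers can be as simple as a weighted sum of their scores. The component scorers below are hypothetical stand-ins for real models (the paper's setup draws on, e.g., CLIP-based and human-preference verifiers); only the combination pattern is the point.

```python
def aesthetic_score(sample):
    """Placeholder for a learned aesthetic-quality scorer."""
    return sum(sample) / len(sample)

def text_alignment_score(sample):
    """Placeholder for a text-image alignment scorer."""
    return max(sample)

def ensemble_verifier(sample, weights=(0.5, 0.5)):
    """Weighted combination of verifier scores for a more holistic judgment."""
    scores = (aesthetic_score(sample), text_alignment_score(sample))
    return sum(w * s for w, s in zip(weights, scores))
```

Adjusting the weights lets the same search framework prioritize different aspects of quality for different tasks.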
Search Algorithms
The research explores different search algorithms for finding better noise candidates:
Random Search: This is the simplest algorithm, where a fixed set of noise candidates is randomly sampled, and the best one is selected based on the verifier's score.
Zero-Order Search: This algorithm iteratively refines noise candidates based on feedback from the verifier. It approximates the gradient direction without explicitly calculating gradients, making it more computationally efficient.
Search over Paths: This algorithm explores the space of possible diffusion sampling trajectories, searching for paths that lead to higher-quality generations.
The choice of algorithm depends on factors such as the complexity of the task, the computational budget and the nature of the verifier.
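Zero-order search, the second algorithm above, can be sketched as repeated local perturbation: propose neighbors of the current best noise, score them, and move to any improvement. This is a simplified illustration with a toy `verifier` scoring the noise directly; the paper's version scores the denoised output.

```python
import numpy as np

def verifier(sample):
    """Placeholder verifier: higher is better (illustrative only)."""
    return -np.sum(sample ** 2)

def zero_order_search(noise, n_iters, n_neighbors, step, rng):
    """Iteratively move to the best-scoring neighbor of the current noise.
    No gradients are computed: the verifier is only queried on perturbed
    candidates, which is what makes the search 'zero-order'."""
    best, best_score = noise, verifier(noise)
    for _ in range(n_iters):
        for _ in range(n_neighbors):
            candidate = best + step * rng.standard_normal(best.shape)
            score = verifier(candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score
```

Because it only ever needs verifier scores, not gradients, this works with black-box verifiers, at the cost of more function evaluations than a gradient-based method would need.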
Axes of Inference Compute Investment
The research also investigates different ways to invest inference-time compute resources within the search framework:
Number of Search Iterations: Increasing the number of search iterations allows the algorithm to explore more noise candidates and potentially find better solutions.
Compute per Search Iteration: This refers to the number of denoising steps performed for each noise candidate evaluated during the search process.
Compute of Final Generation: This refers to the number of denoising steps used for the final generation of the image, after the best noise candidate has been identified.
The optimal allocation of compute resources across these axes depends on the specific task and model.
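A rough accounting of these axes makes the trade-off concrete. The formula below is a simplified sketch (the paper measures compute in number of function evaluations); it shows how the same total budget can be spent on a wide, shallow search or a narrow, deep one.

```python
def total_compute(n_candidates, steps_per_candidate, final_steps):
    """Total denoising steps spent: search cost plus one final generation.
    (Simplified accounting for illustration.)"""
    return n_candidates * steps_per_candidate + final_steps

# Two ways to spend the same budget of 210 steps:
wide_search = total_compute(n_candidates=16, steps_per_candidate=10, final_steps=50)
deep_search = total_compute(n_candidates=4, steps_per_candidate=40, final_steps=50)
```

Which allocation wins is exactly the empirical question the paper studies: it depends on the task, the verifier, and the model.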
Implications for the Future of AI
This research has significant implications for the future of AI:
More Efficient Generative Models: The findings suggest that we can achieve higher-quality generations from diffusion models without solely relying on increasing model size or training data. This opens doors to developing more computationally efficient generative models.
Task-Specific Customization: The research highlights the importance of tailoring inference-time compute strategies to the specific requirements of the task. This could lead to more customizable and adaptable generative models.
Improved Human Alignment: By carefully selecting verifiers that capture human preferences, we can guide diffusion models to generate images that are more aligned with human aesthetic sensibilities or other desired criteria.
New Applications: The ability to scale inference-time compute could enable new applications of diffusion models, such as interactive image generation or personalized content creation.
This research pushes the boundaries of diffusion model inference, paving the way for more powerful and versatile generative AI systems.
Summary: Inference-Time Scaling for Diffusion Models Beyond Scaling Denoising Steps
Focus: Enhancing the quality of images generated by diffusion models by using more computation during the image generation process.
Key Idea: Diffusion models work by progressively refining a noisy image until it becomes clear. This paper goes beyond simply increasing the number of refinement steps and looks at how to make those steps more effective.
How it Works: The paper introduces a search-based approach to find the best "noise" to start with and the best way to refine it. This involves:
Verifiers: Tools that judge the quality of the image at each step.
Algorithms: Methods to explore different ways of refining the image.
Impact: This could lead to diffusion models that generate even more realistic and high-quality images, opening up new possibilities in areas like art, design and entertainment.