CFG-Zero★: Improved Classifier-Free Guidance for Flow Matching Models

1S-Lab, Nanyang Technological University      2Department of Computer Science, Purdue University
Corresponding Author.

Abstract

Classifier-Free Guidance (CFG) is a widely adopted technique in diffusion/flow models to improve image fidelity and controllability. In this work, we first analytically study the effect of CFG on flow matching models trained on Gaussian mixtures where the ground-truth flow can be derived. We observe that in the early stages of training, when the flow estimation is inaccurate, CFG directs samples toward incorrect trajectories. Building on this observation, we propose CFG-Zero*, an improved CFG with two contributions: (a) optimized scale, where a scalar is optimized to correct for the inaccuracies in the estimated velocity, hence the * in the name; and (b) zero-init, which involves zeroing out the first few steps of the ODE solver. Experiments on both text-to-image (Lumina-Next, Stable Diffusion 3, and Flux) and text-to-video (Wan-2.1) generation demonstrate that CFG-Zero* consistently outperforms CFG, highlighting its effectiveness in guiding Flow Matching models. (Code is available at github.com/WeichenFan/CFG-Zero-star)

1. Methodology

We analyze the sources of error in classifier-free guidance (CFG) for flow-matching models and propose a novel approach to mitigate inaccuracies in the predicted velocity field. Our method, CFG-Zero*, combines an optimized scaling strategy with a zero-initialization scheme. Empirical results demonstrate that CFG-Zero* improves sample quality, particularly when the model is underfitted. Extensive experiments show that CFG-Zero* achieves competitive performance across both discrete and continuous conditional generation tasks, establishing it as an effective alternative to standard CFG.


Optimized Scale

We introduce an optimized scaling factor to improve Classifier-Free Guidance (CFG) for flow matching models. When the velocity field is underfitted, CFG may not accurately approximate the ground-truth flow. To address this, we propose a dynamic scaling parameter s that corrects inaccuracies by reweighting the conditional and unconditional velocity fields. We derive an upper bound for the error term and minimize it, leading to a more stable and effective guidance mechanism. Empirical validation on a Gaussian mixture dataset shows that this approach significantly reduces discrepancies between the estimated and optimal velocity, improving sample quality. The optimized scale parameter can be seamlessly integrated into existing CFG implementations with minimal computational overhead.
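The reweighting described above can be sketched in a few lines. This is a minimal illustration, not the released implementation (function names are hypothetical; see the repository for the exact formulation): the scale s is taken as the least-squares coefficient that best explains the conditional velocity with the unconditional one, i.e. the projection of v_cond onto v_uncond, and the usual CFG combination is then formed around s·v_uncond instead of v_uncond.

```python
import numpy as np

def optimized_scale(v_cond, v_uncond, eps=1e-8):
    # Least-squares scale s* minimizing ||v_cond - s * v_uncond||^2,
    # i.e. the projection coefficient of v_cond onto v_uncond.
    dot = np.sum(v_cond * v_uncond)
    norm_sq = np.sum(v_uncond * v_uncond) + eps
    return dot / norm_sq

def cfg_zero_star_velocity(v_cond, v_uncond, guidance_scale):
    # Standard CFG with the unconditional branch rescaled by s*.
    s = optimized_scale(v_cond, v_uncond)
    return s * v_uncond + guidance_scale * (v_cond - s * v_uncond)
```

Note that with guidance_scale = 1 the combination collapses to v_cond regardless of s, matching plain conditional sampling; the scale only matters when guidance is active.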


Zero-Init

We analyze the velocity error on a synthetic Gaussian mixture and compare the error terms, as shown in the figure below. We observe that when the model is underfitted, a zero velocity (doing nothing) can improve sample quality.
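In sampler terms, zero-init simply replaces the velocity in the first few solver steps with zero. A minimal sketch with a plain Euler ODE solver (names hypothetical; the actual solver and schedule depend on the model):

```python
import numpy as np

def sample_with_zero_init(velocity_fn, x, timesteps, zero_init_steps=1):
    # Euler ODE solver over the given timesteps; the first few velocity
    # evaluations are replaced by zero (a no-op step), per zero-init.
    for i in range(len(timesteps) - 1):
        dt = timesteps[i + 1] - timesteps[i]
        if i < zero_init_steps:
            v = np.zeros_like(x)  # zero-init: skip the unreliable early velocity
        else:
            v = velocity_fn(x, timesteps[i])
        x = x + dt * v
    return x
```

Since only the first step or two is zeroed, the change adds no extra model evaluations (it actually saves the skipped ones) and drops into an existing sampling loop unchanged.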

[Figure: velocity error comparison on the synthetic Gaussian mixture]

2. Results

Quantitative Evaluation

As shown in the table below, we evaluate text-to-image generation using Lumina-Next, Stable Diffusion 3, Stable Diffusion 3.5, and Flux, with Aesthetic Score and CLIP Score as the key metrics. Results indicate that CFG-Zero* consistently enhances image quality and improves alignment with textual prompts across all models.

[Table: Aesthetic Score and CLIP Score comparison across text-to-image models]

This evaluation, conducted on T2I-CompBench, utilizes Lumina-Next, Stable Diffusion 3, and Stable Diffusion 3.5. Compared to CFG, CFG-Zero* demonstrates consistent improvements across all evaluated aspects.

[Table: T2I-CompBench results for Lumina-Next, Stable Diffusion 3, and Stable Diffusion 3.5]

Comparisons of Text-to-Image Generation

Comparisons of Text-to-Video Generation