CFG-Zero★: Improved Classifier-Free Guidance for Flow Matching Models

Weichen Fan¹, Amber Yijia Zheng², Raymond A. Yeh², Ziwei Liu¹✉

S-Lab, Nanyang Technological University¹; Department of Computer Science, Purdue University²
✉ Corresponding Author.

Abstract

Classifier-Free Guidance (CFG) is a widely adopted technique in diffusion/flow models to improve image fidelity and controllability. In this work, we first analytically study the effect of CFG on flow matching models trained on Gaussian mixtures where the ground-truth flow can be derived. We observe that in the early stages of training, when the flow estimation is inaccurate, CFG directs samples toward incorrect trajectories. Building on this observation, we propose CFG-Zero*, an improved CFG with two contributions: (a) optimized scale, where a scalar is optimized to correct for the inaccuracies in the estimated velocity, hence the * in the name; and (b) zero-init, which involves zeroing out the first few steps of the ODE solver. Experiments on both text-to-image (Lumina-Next, Stable Diffusion 3, and Flux) and text-to-video (Wan-2.1) generation demonstrate that CFG-Zero* consistently outperforms CFG, highlighting its effectiveness in guiding Flow Matching models. (Code is available at github.com/WeichenFan/CFG-Zero-star)

1. Methodology

We analyze the sources of error in classifier-free guidance (CFG) for flow-matching models and propose a novel approach to mitigate inaccuracies in the predicted velocity field. Our method, CFG-Zero*, combines an optimized scaling strategy with a zero-initialization scheme. Empirical results demonstrate that CFG-Zero* improves sample quality, particularly when the model is underfitted. Extensive experiments show that CFG-Zero* achieves competitive performance across both discrete and continuous conditional generation tasks, establishing it as an effective alternative to standard CFG.

Optimized Scale

We introduce an optimized scaling factor to improve Classifier-Free Guidance (CFG) for flow matching models. When the velocity field is underfitted, CFG may not accurately approximate the ground-truth flow. To address this, we propose a dynamic scaling parameter s that corrects inaccuracies by reweighting the conditional and unconditional velocity fields. We derive an upper bound for the error term and minimize it, leading to a more stable and effective guidance mechanism. Empirical validation on a Gaussian mixture dataset shows that this approach significantly reduces discrepancies between the estimated and optimal velocity, improving sample quality. The optimized scale parameter can be seamlessly integrated into existing CFG implementations with minimal computational overhead.
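As a rough illustration of how such a scale could be computed, the sketch below assumes that s is the least-squares projection coefficient of the conditional velocity onto the unconditional one, s = (v_c · v_u) / ||v_u||², and that the guided velocity is s·v_u + w·(v_c − s·v_u). This page does not spell out the closed form, so both formulas are assumptions, and the names `optimized_scale`, `guided_velocity`, `w`, and `eps` are hypothetical. Vectors are plain Python lists to keep the example dependency-free.

```python
from typing import List, Sequence


def optimized_scale(v_cond: Sequence[float], v_uncond: Sequence[float],
                    eps: float = 1e-8) -> float:
    """Assumed form of the scale s: projection of v_cond onto v_uncond.

    eps is a hypothetical stabilizer that avoids division by zero when the
    unconditional velocity is (near) zero.
    """
    dot = sum(c * u for c, u in zip(v_cond, v_uncond))
    sq_norm = sum(u * u for u in v_uncond)
    return dot / (sq_norm + eps)


def guided_velocity(v_cond: Sequence[float], v_uncond: Sequence[float],
                    w: float) -> List[float]:
    """CFG-style combination with the unconditional branch rescaled by s.

    With s fixed to 1 this reduces to standard CFG:
    v_uncond + w * (v_cond - v_uncond).
    """
    s = optimized_scale(v_cond, v_uncond)
    return [s * u + w * (c - s * u) for c, u in zip(v_cond, v_uncond)]


if __name__ == "__main__":
    # Toy 2-D velocities; w plays the role of the guidance scale.
    print(guided_velocity([2.0, 0.0], [1.0, 0.0], w=3.0))
```

In this sketch the only extra work over standard CFG is one dot product and one squared norm per sample, which is in line with the minimal overhead described above.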

Zero-Init

We analyze the velocity error on a synthetic Gaussian mixture and compare the error terms, as shown in the accompanying figure. We observe that when the model is underfitted, using zero velocity (i.e., leaving the sample unchanged) for the first few solver steps can improve sample quality.
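A minimal sketch of the zero-init idea, assuming a fixed-step Euler solver over t in [0, 1]: for the first few steps the velocity is treated as zero, so the sample is left unchanged and the model is not queried. The function name `euler_sample_zero_init`, the parameter `zero_init_steps`, and the `velocity_fn(x, t)` callback are hypothetical and stand in for whatever solver and model are actually used.

```python
from typing import Callable, List, Sequence

VelocityFn = Callable[[Sequence[float], float], Sequence[float]]


def euler_sample_zero_init(velocity_fn: VelocityFn, x0: Sequence[float],
                           num_steps: int, zero_init_steps: int = 1) -> List[float]:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps.

    The first `zero_init_steps` steps use zero velocity, so x is unchanged
    and velocity_fn is not called for them.
    """
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        if i < zero_init_steps:
            v = [0.0] * len(x)       # zero-init: skip the model call
        else:
            v = velocity_fn(x, t)    # e.g. a guided velocity estimate
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    # Toy constant velocity field, used only to show the effect of zero-init.
    def constant(x, t):
        return [1.0] * len(x)

    print(euler_sample_zero_init(constant, [0.0], num_steps=4, zero_init_steps=1))
    print(euler_sample_zero_init(constant, [0.0], num_steps=4, zero_init_steps=0))
```

Setting `zero_init_steps=0` recovers the plain Euler loop, which makes it easy to compare the two behaviours on the same velocity function.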

2. Results

Quantitative Evaluation

As shown in the table, we evaluate text-to-image generation using Lumina-Next, Stable Diffusion 3, Stable Diffusion 3.5, and Flux, with Aesthetic Score and CLIP Score as the key metrics. The results indicate that CFG-Zero* consistently enhances image quality and improves alignment with text prompts across models.

We also evaluate on T2I-CompBench using Lumina-Next, Stable Diffusion 3, and Stable Diffusion 3.5. Compared to CFG, CFG-Zero* improves consistently across all evaluated aspects.

Comparisons of Text-to-Image Generation

Comparisons of Text-to-Video Generation

A horse running to join a herd of its kind.

An astronaut flying in space, featuring a steady and smooth perspective.

A person is playing chess.

A robot dancing in Times Square.

A person is crying.

An astronaut flying in space, zoom in.

An elephant spraying itself with water using its trunk to cool down.

A person is ice skating.

A cute happy Corgi playing in a park, sunset, watercolor painting.

A cat grooming itself meticulously with its tongue.

A horse galloping across an open field.

A bear catching a salmon in its powerful jaws.

A bicycle gliding through a snowy field.

An epic tornado attacking above a glowing city at night, the tornado is made of smoke.