Classifier-Free Guidance (CFG) is a widely adopted technique in diffusion/flow models to improve image fidelity and controllability. In this work, we first analytically study the effect of CFG on flow matching models trained on Gaussian mixtures where the ground-truth flow can be derived. We observe that in the early stages of training, when the flow estimation is inaccurate, CFG directs samples toward incorrect trajectories. Building on this observation, we propose CFG-Zero*, an improved CFG with two contributions: (a) optimized scale, where a scalar is optimized to correct for the inaccuracies in the estimated velocity, hence the * in the name; and (b) zero-init, which involves zeroing out the first few steps of the ODE solver. Experiments on both text-to-image (Lumina-Next, Stable Diffusion 3, and Flux) and text-to-video (Wan-2.1) generation demonstrate that CFG-Zero* consistently outperforms CFG, highlighting its effectiveness in guiding Flow Matching models. (Code is available at github.com/WeichenFan/CFG-Zero-star)
We analyze the sources of error in classifier-free guidance (CFG) for flow-matching models and propose a novel approach to mitigate inaccuracies in the predicted velocity field. Our method, CFG-Zero*, combines an optimized scaling strategy with a zero-initialization scheme. Empirical results demonstrate that CFG-Zero* improves sample quality, particularly when the model is underfitted. Extensive experiments show that CFG-Zero* achieves competitive performance across both discrete and continuous conditional generation tasks, establishing it as an effective alternative to standard CFG.
We introduce an optimized scaling factor to improve Classifier-Free Guidance (CFG) for flow matching models. When the velocity field is underfitted, CFG may not accurately approximate the ground-truth flow. To address this, we propose a dynamic scaling parameter s that corrects inaccuracies by reweighting the conditional and unconditional velocity fields. We derive an upper bound for the error term and minimize it, leading to a more stable and effective guidance mechanism. Empirical validation on a Gaussian mixture dataset shows that this approach significantly reduces discrepancies between the estimated and optimal velocity, improving sample quality. The optimized scale parameter can be seamlessly integrated into existing CFG implementations with minimal computational overhead.
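As a rough illustration of the optimized scale, the sketch below assumes that minimizing the error bound yields the least-squares projection of the conditional velocity onto the unconditional one, s = <v_cond, v_uncond> / ||v_uncond||^2, and that s rescales the unconditional branch of the usual CFG combination. The function names, the `eps` stabilizer, and the use of flat Python lists in place of model tensors are assumptions made for this example, not the released implementation.

```python
def optimized_scale(v_cond, v_uncond, eps=1e-8):
    """Scalar s minimizing ||v_cond - s * v_uncond||^2 (assumed closed form).

    v_cond and v_uncond are flat sequences of floats standing in for the
    flattened per-sample velocity predictions of a flow matching model.
    """
    dot = sum(c * u for c, u in zip(v_cond, v_uncond))
    sq_norm = sum(u * u for u in v_uncond)
    # eps is an assumed guard against division by zero.
    return dot / (sq_norm + eps)


def guided_velocity(v_cond, v_uncond, w, use_optimized_scale=True):
    """Combine the two predictions with guidance weight w.

    With s = 1 this reduces to standard CFG:
        v_uncond + w * (v_cond - v_uncond).
    With the optimized scale, v_uncond is replaced by s * v_uncond.
    """
    s = optimized_scale(v_cond, v_uncond) if use_optimized_scale else 1.0
    return [(1.0 - w) * s * u + w * c for c, u in zip(v_cond, v_uncond)]


if __name__ == "__main__":
    v_c, v_u = [2.0, 0.0], [1.0, 0.0]
    print("standard CFG:  ", guided_velocity(v_c, v_u, w=3.0, use_optimized_scale=False))
    print("optimized scale:", guided_velocity(v_c, v_u, w=3.0))
```

The only extra work per step in this sketch is one dot product and one squared norm over predictions that CFG already computes, which matches the claim of minimal overhead.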
We analyze the velocity error on a synthetic Gaussian mixture and compare the error terms, as shown in the figure below. We observe that when the model is underfitted, using zero velocity (that is, leaving the sample unchanged) can improve sample quality.
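A minimal sketch of the zero-init idea with a fixed-step Euler solver follows. It assumes time runs from t = 0 (noise) to t = 1 (data) on a uniform grid. The names `sample_euler_zero_init`, `velocity_fn`, and `zero_init_steps` are hypothetical, and plain Python lists stand in for latent tensors.

```python
def sample_euler_zero_init(velocity_fn, x0, num_steps=10, zero_init_steps=1):
    """Euler integration of dx/dt = v(x, t) with the first steps zeroed out.

    velocity_fn(x, t) is a placeholder for the (guided) velocity prediction.
    For the first `zero_init_steps` steps the velocity is treated as zero,
    so the sample is left unchanged and the model is not called.
    """
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        if k < zero_init_steps:
            continue  # zero velocity: x stays where it is
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    # Toy constant velocity field, used only to show the effect of skipping.
    constant_field = lambda x, t: [1.0 for _ in x]
    print(sample_euler_zero_init(constant_field, [0.0], num_steps=10, zero_init_steps=2))
```

In this sketch the skipped steps also skip the model evaluation, so zero-init does not add compute. How many steps to zero out is left as a parameter here, since the text only says "the first few".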
As shown in the table, we evaluate text-to-image generation using Lumina-Next, Stable Diffusion 3, Stable Diffusion 3.5, and Flux, with Aesthetic Score and CLIP Score as the key metrics. The results indicate that CFG-Zero* consistently improves image quality and alignment with the text prompt across models.
On T2I-CompBench, we evaluate Lumina-Next, Stable Diffusion 3, and Stable Diffusion 3.5. Compared to CFG, CFG-Zero* shows consistent improvements across all evaluated aspects.
A horse running to join a herd of its kind.
An astronaut flying in space, featuring a steady and smooth perspective.
A person is playing chess.
A robot dancing in Times Square.
A person is crying.
An astronaut flying in space, zoom in.
An elephant spraying itself with water using its trunk to cool down.
A person is ice skating.
A cute happy Corgi playing in a park, sunset, watercolor painting.
A cat grooming itself meticulously with its tongue.
A horse galloping across an open field.
A bear catching a salmon in its powerful jaws.
A bicycle gliding through a snowy field.
An epic tornado attacking above a glowing city at night, the tornado is made of smoke.