Z-Image: An Efficient Single-Stream Diffusion Transformer for Image Generation
Introducing Z-Image, a powerful and efficient 6B-parameter image generation model. Discover its variants—Turbo for sub-second inference, Base for fine-tuning, and Edit for creative image transformations—and its state-of-the-art performance on leading benchmarks.
Welcome to the official repository for the Z-Image project!
Z-Image is a powerful and highly efficient image generation model featuring 6 billion parameters. It currently offers three distinct variants:
- Z-Image-Turbo: This distilled version of Z-Image matches or surpasses leading competitors with only 8 NFEs (number of function evaluations). It delivers sub-second inference latency on enterprise-grade H800 GPUs and runs comfortably on consumer devices with 16GB VRAM. Z-Image-Turbo excels in photorealistic image generation, bilingual text rendering (English & Chinese), and robust instruction adherence.
- Z-Image-Base: The non-distilled foundational model. Released to empower community-driven fine-tuning and custom development, unlocking its full potential.
- Z-Image-Edit: A variant fine-tuned specifically for image editing tasks. It supports creative image-to-image generation with impressive instruction-following capabilities, enabling precise edits based on natural language prompts.
Recent Updates
- December 8, 2025: Z-Image-Turbo achieved 8th overall ranking on the Artificial Analysis Text-to-Image Leaderboard, securing its position as the #1 open-source model.
- December 1, 2025: Our technical report for Z-Image is now available on arXiv.
- November 26, 2025: Z-Image-Turbo has been released! Model checkpoints are available on Hugging Face and ModelScope. An online demo is also available.
Model Availability
| Model | Hugging Face | ModelScope |
|---|---|---|
| Z-Image-Turbo | Checkpoint | Online Demo |
| Z-Image-Base | To be released | To be released |
| Z-Image-Edit | To be released | To be released |
Showcase
Photorealistic Quality
Z-Image-Turbo delivers strong photorealistic image generation while maintaining excellent aesthetic quality.

Accurate Bilingual Text Rendering
Z-Image-Turbo excels at accurately rendering complex Chinese and English text.

Prompt Enhancing & Reasoning
Prompt Enhancer empowers the model with reasoning capabilities, enabling it to transcend surface-level descriptions and tap into underlying world knowledge.

Creative Image Editing
Z-Image-Edit demonstrates a strong understanding of bilingual editing instructions, enabling imaginative and flexible image transformations.

Model Architecture
Z-Image adopts a Scalable Single-Stream DiT (S3-DiT) architecture. In this setup, text, visual semantic tokens, and image VAE tokens are concatenated at the sequence level to serve as a unified input stream. This approach maximizes parameter efficiency compared to traditional dual-stream methods.
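The single-stream layout can be illustrated with a toy tensor sketch. All dimensions, token counts, and variable names below are invented for illustration and do not reflect the real model's configuration:

```python
import torch

# Hypothetical sizes, chosen only for illustration.
d_model = 64                      # shared hidden width of the transformer
n_text, n_sem, n_vae = 8, 4, 16   # token counts per modality

# Each modality is first projected into the same hidden width...
text_tokens = torch.randn(1, n_text, d_model)  # encoded prompt tokens
sem_tokens = torch.randn(1, n_sem, d_model)    # visual semantic tokens
vae_tokens = torch.randn(1, n_vae, d_model)    # patchified image VAE latents

# ...then concatenated along the sequence axis into one unified stream,
# so a single stack of transformer blocks (one shared set of parameters)
# attends over all modalities at once, unlike dual-stream designs that
# maintain separate branches per modality.
stream = torch.cat([text_tokens, sem_tokens, vae_tokens], dim=1)
print(stream.shape)  # torch.Size([1, 28, 64])
```

Because every token lives in the same sequence, one set of transformer weights serves all modalities, which is where the parameter-efficiency gain over dual-stream designs comes from.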

Performance
Z-Image-Turbo's performance has been rigorously validated on multiple independent benchmarks, consistently demonstrating state-of-the-art results, particularly as a leading open-source model.
Artificial Analysis Text-to-Image Leaderboard
On the highly competitive Artificial Analysis Leaderboard, Z-Image-Turbo ranked 8th overall and secured the top position as the #1 Open-Source Model, outperforming all other open-source alternatives.
Artificial Analysis Leaderboard
Artificial Analysis Leaderboard (Open-Source Model Only)
Alibaba AI Arena Text-to-Image Leaderboard
According to the Elo-based Human Preference Evaluation on Alibaba AI Arena, Z-Image-Turbo also achieves state-of-the-art results among open-source models and exhibits highly competitive performance against leading proprietary models.
Alibaba AI Arena Text-to-Image Leaderboard
Quick Start
(1) PyTorch Native Inference
First, create a virtual environment and install the dependencies:

```shell
pip install -e .
```

Then run the following command to generate an image:

```shell
python inference.py
```
(2) Diffusers Inference
To use Diffusers, install it from source: Z-Image support (pull requests #12703 and #12715) has been merged into the official diffusers repository, but you need a source install until it lands in a stable release.

```shell
pip install git+https://github.com/huggingface/diffusers
```
Then, try the following Python code to generate an image:
```python
import torch
from diffusers import ZImagePipeline

# 1. Load the pipeline
# Use bfloat16 for optimal performance on supported GPUs
pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.to("cuda")

# [Optional] Attention Backend
# Diffusers uses SDPA by default. Switch to Flash Attention for better efficiency if supported:
# pipe.transformer.set_attention_backend("flash")    # Enable Flash-Attention-2
# pipe.transformer.set_attention_backend("_flash_3") # Enable Flash-Attention-3

# [Optional] Model Compilation
# Compiling the DiT model accelerates inference, but the first run will take longer to compile.
# pipe.transformer.compile()

# [Optional] CPU Offloading
# Enable CPU offloading for memory-constrained devices.
# pipe.enable_model_cpu_offload()

prompt = "Young Chinese woman in red Hanfu, intricate embroidery. Impeccable makeup, red floral forehead pattern. Elaborate high bun, golden phoenix headdress, red flowers, beads. Holds round folding fan with lady, trees, bird. Neon lightning-bolt lamp (⚡️), bright yellow glow, above extended left palm. Soft-lit outdoor night background, silhouetted tiered pagoda (西安大雁塔), blurred colorful distant lights."

# 2. Generate Image
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=9,  # This actually results in 8 DiT forwards
    guidance_scale=0.0,     # Guidance should be 0 for the Turbo models
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("example.png")
```
Decoupled-DMD: The Acceleration Magic Behind Z-Image
Decoupled-DMD is the core few-step distillation algorithm powering the 8-step Z-Image model. Our key insight is that the success of existing Distribution Matching Distillation (DMD) methods stems from two independent, collaborating mechanisms:
- CFG Augmentation (CA): The primary engine driving the distillation process, a factor largely overlooked in previous work.
- Distribution Matching (DM): Acts more as a regularizer, ensuring the stability and quality of the generated output.
By recognizing and decoupling these mechanisms, we were able to study and optimize each in isolation, leading to an improved distillation process that significantly enhances few-step generation quality.
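The decoupling can be caricatured in a few lines of PyTorch. Everything below (function names, the exact DM surrogate, the loss weights) is a hypothetical sketch for intuition only, not the implementation from the paper:

```python
import torch

def decoupled_dmd_loss(x_student, teacher_cond, teacher_uncond,
                       real_score, fake_score,
                       cfg_scale=3.0, dm_weight=0.25):
    """Toy sketch: CA drives the distillation, DM regularizes it."""
    # CFG Augmentation (CA): regress the student output onto a
    # classifier-free-guidance-augmented teacher target -- the "spear"
    # providing the primary learning signal.
    ca_target = teacher_uncond + cfg_scale * (teacher_cond - teacher_uncond)
    ca_loss = torch.mean((x_student - ca_target) ** 2)
    # Distribution Matching (DM): a surrogate whose gradient w.r.t.
    # x_student points along (fake_score - real_score), the reverse-KL
    # direction used in DMD -- the "shield" keeping outputs stable.
    dm_loss = torch.mean(x_student * (fake_score - real_score).detach())
    return ca_loss + dm_weight * dm_loss
```

Treating the two terms separately like this is what lets each be weighted and scheduled independently during distillation.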

DMDR: Fusing DMD with Reinforcement Learning
Building upon the strong foundation of Decoupled-DMD, our 8-step Z-Image model already demonstrates exceptional capabilities. To further enhance semantic alignment, aesthetic quality, and structural coherence—while producing images with richer high-frequency details—we introduce DMDR.
Our core insight behind DMDR is that Reinforcement Learning (RL) and Distribution Matching Distillation (DMD) can be synergistically integrated during the post-training of few-step models. We demonstrate that:
- RL unlocks the performance of DMD.
- DMD effectively regularizes RL.
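As a rough sketch of how the two signals might be combined in a single objective (the surrogate below is our own illustrative guess, not the DMDR algorithm from the paper):

```python
import torch

def dmdr_loss(x_student, reward, log_prob,
              real_score, fake_score, dm_weight=0.5):
    """Toy sketch: RL unlocks quality, DMD regularizes the RL update."""
    # RL term (REINFORCE-style surrogate): raises the likelihood of
    # high-reward samples, improving semantic alignment and aesthetics.
    rl_loss = -(reward.detach() * log_prob).mean()
    # DMD term: the distribution-matching direction keeps generations on
    # the data manifold, guarding the RL update against reward hacking.
    dm_loss = torch.mean(x_student * (fake_score - real_score).detach())
    return rl_loss + dm_weight * dm_loss
```

The interplay is symmetric: without the DMD term the RL signal can drift off-distribution, and without the RL term the distilled model plateaus below its reward-optimal quality.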

Community Contributions
The Z-Image project benefits from various community-driven integrations and acceleration methods:
- Cache-DiT: Offers inference acceleration support for Z-Image with DBCache, Context Parallelism, and Tensor Parallelism. Visit their example for more details.
- stable-diffusion.cpp: A pure C++ diffusion model inference engine that supports fast and memory-efficient Z-Image inference across multiple platforms (CUDA, Vulkan, etc.). It allows generating images with Z-Image on machines with as little as 4GB of VRAM. For more information, refer to How to Use Z-Image on a GPU with Only 4GB VRAM.
- LeMiCa: Provides a training-free, timestep-level acceleration method that conveniently speeds up Z-Image inference. For more details, see LeMiCa4Z-Image.
- ComfyUI ZImageLatent: Provides an easy-to-use latent for official Z-Image resolutions.
- DiffSynth-Studio: Offers comprehensive support for Z-Image, including LoRA training, full training, distillation training, and low-VRAM inference. Please refer to the DiffSynth-Studio documentation.
- vllm-omni: A framework for fast inference and serving of omni-modality models, which now supports Z-Image.
- SGLang-Diffusion: Brings SGLang's state-of-the-art performance to accelerate image and video generation for diffusion models, now supporting Z-Image.
Citation
If you find our work useful in your research, please consider citing the following:
```bibtex
@article{team2025zimage,
  title={Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer},
  author={Z-Image Team},
  journal={arXiv preprint arXiv:2511.22699},
  year={2025}
}

@article{liu2025decoupled,
  title={Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield},
  author={Dongyang Liu and Peng Gao and David Liu and Ruoyi Du and Zhen Li and Qilong Wu and Xin Jin and Sihan Cao and Shifeng Zhang and Hongsheng Li and Steven Hoi},
  journal={arXiv preprint arXiv:2511.22677},
  year={2025}
}

@article{jiang2025distribution,
  title={Distribution Matching Distillation Meets Reinforcement Learning},
  author={Jiang, Dengyang and Liu, Dongyang and Wang, Zanyi and Wu, Qilong and Jin, Xin and Liu, David and Li, Zhen and Wang, Mengmeng and Gao, Peng and Yang, Harry},
  journal={arXiv preprint arXiv:2511.13649},
  year={2025}
}
```
Career Opportunities
We are actively seeking Research Scientists, Engineers, and Interns to contribute to foundational generative models and their applications. Interested candidates are encouraged to send their resumes to: jingpeng.gp@alibaba-inc.com.