Transformers v5: Simple Model Definitions Powering the AI Ecosystem


Discover Transformers v5, a significant release focusing on simplicity, training, inference, and production. Learn about its modular design, code reduction, PyTorch-centric approach, and enhanced interoperability across the AI ecosystem.

Transformers v4.0.0rc-1, the initial release candidate for version 4, launched on November 19, 2020. Five years later, we are proud to introduce v5.0.0rc-0.

Today, with the launch of v5, Transformers sees over 3 million daily installs via pip, a substantial increase from 20,000/day in v4. The library has now surpassed an astounding 1.2 billion total installs.

The ecosystem has grown exponentially, expanding from 40 model architectures in v4 to over 400 today. Furthermore, the community has contributed more than 750,000 model checkpoints on the Hub compatible with Transformers, a massive leap from roughly 1,000 at the time of v4.

This incredible growth is fueled by the rapid evolution of the AI field and the mainstream adoption of artificial intelligence. As a leading library for model definitions within this ecosystem, Transformers must continuously evolve and adapt to remain relevant and impactful; in AI, reinvention is essential for longevity.

We are fortunate to collaborate with numerous libraries and applications built upon Transformers, including (in no specific order): llama.cpp, MLX, onnxruntime, Jan, LMStudio, vLLM, SGLang, Unsloth, LlamaFactory, dLLM, MaxText, TensorRT, Argmax, among many other valued partners.

For v5, our efforts concentrated on several key areas: simplicity, training, inference, and production. This post delves into the work undertaken in each of these aspects.

Simplicity

The team's primary focus was simplicity. At Transformers, we view code as the product. Our goal is to ensure model integrations are clean, allowing the broader ecosystem to confidently depend on our model definitions, understand the nuances between models, and grasp the key features of each new architecture. Simplicity fosters wider standardization, generality, and broader support.

Model Additions

Transformers serves as the backbone for hundreds of thousands of projects, including Unsloth.

"We build on Transformers to help people fine-tune and train models efficiently, whether that’s BERT, text-to-speech (TTS), or others; to run fast inference for reinforcement learning (RL) even when models aren’t yet supported in other libraries. We're excited for Transformers v5 and are super happy to be working with the Hugging Face team!"

— Michael Han at Unsloth

At its core, Transformers remains a model architecture toolkit. We strive to include all recent architectures and serve as the "source of truth" for model definitions. We have consistently added between 1 and 3 new models every week for the past five years, as illustrated in the timeline below:

We have significantly improved our model-addition process.

Modular Approach

Over the past year, we have heavily advocated for our modular design as a substantial leap forward. This approach enables easier maintenance, faster integration, and enhanced collaboration across the community.

For a deeper dive, please refer to our blog post, "Maintain the Unmaintainable". In brief, our aim is to achieve a much simpler model contribution process and reduce the maintenance burden. A notable metric is the significant reduction in the number of lines of code to contribute (and review) when a modular approach is utilized:
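The line-count reduction comes from declaring a new model as a diff against an existing one: everything is inherited, and only the changed pieces are written out. A toy sketch of that pattern (hypothetical classes for illustration, not the actual Transformers machinery):

```python
# Illustrative sketch of the modular idea: a new architecture reuses an
# existing one via inheritance and overrides only what actually changed.
class ExistingAttention:
    def forward(self, hidden_states):
        return f"attention({hidden_states})"

class ExistingModel:
    def __init__(self):
        self.attention = ExistingAttention()

    def forward(self, hidden_states):
        return self.attention.forward(hidden_states)

class NewAttention(ExistingAttention):
    # Only the modified behavior is spelled out; the rest is inherited.
    def forward(self, hidden_states):
        return f"new_attention({hidden_states})"

class NewModel(ExistingModel):
    # The contribution is just the diff: a new attention class wired in.
    def __init__(self):
        self.attention = NewAttention()
```

In the real library, tooling then expands such a "modular" definition into a complete, self-contained modeling file, so readers still get the full picture while contributors and reviewers handle far fewer lines.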

While we respect the "One model, one file" philosophy, we continue to introduce abstractions that simplify the management of common helpers. A prime example is the introduction of the AttentionInterface, which provides a centralized abstraction for various attention methods. The eager method will remain in the modeling file, while others, such as FA1/2/3, FlexAttention, or SDPA, are moved to this interface.
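The mechanism behind such an interface can be pictured as a simple registry that maps implementation names to callables, so the modeling file keeps only the eager reference and optimized backends plug in by name. The sketch below is a conceptual stand-in, not the actual Transformers API:

```python
# Conceptual sketch of an attention interface: a central registry of
# attention implementations, looked up by name at call time.
from typing import Callable, Dict

ATTENTION_REGISTRY: Dict[str, Callable] = {}

def register_attention(name: str):
    """Decorator that registers an attention implementation under a name."""
    def decorator(fn: Callable) -> Callable:
        ATTENTION_REGISTRY[name] = fn
        return fn
    return decorator

@register_attention("eager")
def eager_attention(q, k, v):
    # Stand-in for the reference (eager) path that stays in the modeling file.
    return "eager-output"

@register_attention("sdpa")
def sdpa_attention(q, k, v):
    # Stand-in for an optimized backend such as SDPA or FlashAttention.
    return "sdpa-output"

def run_attention(impl: str, q, k, v):
    # Modeling code dispatches to whichever implementation was requested.
    return ATTENTION_REGISTRY[impl](q, k, v)
```

With this shape, adding a new attention backend means registering one function, with no edits to any modeling file.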

"Over the past couple of years, the increasing amount of 0-day support for new model architectures and standardization of attention handling has helped to simplify our support for post-training modern LLMs."

— Wing Lian, Axolotl

Tooling for Model Conversion

We are developing tooling to help identify existing model architectures that a new model resembles. This feature leverages machine learning to find code similarities between independent modeling files. Our long-term goal is to automate the conversion process by opening a draft pull request for the model's integration into our Transformers format, thereby reducing manual effort and ensuring consistency.
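The core of such tooling is a similarity search over modeling code. As a much simpler stand-in for the learned similarity model described above, the idea can be sketched with a plain text-similarity ratio (the architecture snippets here are hypothetical):

```python
# Toy sketch: score a new modeling file against known architectures and
# suggest the closest match as a starting point for a modular conversion.
# difflib is a crude stand-in for the actual learned similarity model.
import difflib

EXISTING = {
    "llama_like": "class Attention:\n    def forward(self, q, k, v): ...",
    "bert_like": "class SelfAttention:\n    def forward(self, x, mask): ...",
}

def closest_architecture(new_code: str) -> str:
    """Return the name of the most similar known architecture."""
    scores = {
        name: difflib.SequenceMatcher(None, new_code, code).ratio()
        for name, code in EXISTING.items()
    }
    return max(scores, key=scores.get)
```

The suggested match then seeds a draft pull request that converts the new model into the modular Transformers format.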

Code Reduction

Streamlining Modeling & Tokenization/Processing Files

We have significantly refactored the modeling and tokenization files. Modeling files have been greatly enhanced thanks to the modular approach described above, in addition to standardization across models. This standardization helps abstract away most tools not essential to the model itself, ensuring that the modeling code contains only the relevant parts for a model's forward/backward passes.

Alongside this work, we are simplifying the tokenization and processing files. Moving forward, we will exclusively focus on the tokenizers backend, effectively removing the distinction between "Fast" and "Slow" tokenizers.

tokenizers will be our primary tokenization backend, mirroring its use for PyTorch-based models. We will offer alternatives for Sentencepiece or MistralCommon-backed tokenizers, which, while not default, will be fully supported. Image processors will now only exist in their fast variant, which relies on the torchvision backend.

Finally, we are sunsetting our Flax/TensorFlow support and concentrating on PyTorch as the sole backend. However, we are actively collaborating with partners in the JAX ecosystem to ensure compatibility between our models and their frameworks.

"With its v5 release, Transformers is going all in on PyTorch. Transformers acts as a source of truth and foundation for modeling across the field; we've been working with the team to ensure good performance across the stack. We're excited to continue pushing for this in the future across training, inference, and deployment."

— Matt White, Executive Director, PyTorch Foundation. GM of AI, Linux Foundation

Training

Training remains a major focus for the team heading into v5. While our previous emphasis was primarily on fine-tuning, rather than large-scale pre-training or full training, we have recently made significant strides to improve our support for the latter.

Pre-training at scale

Supporting pre-training required a rework of our model initialization, ensuring seamless operation at scale with diverse parallelism paradigms, and shipping support for optimized kernels for both forward and backward passes.

Looking ahead, we are excited to extend compatibility to torchtitan, megatron, nanotron, and any other pre-training tool interested in collaborating with us.

Fine-tuning & Post-training

We continue to collaborate closely with all fine-tuning tools in the Python ecosystem. Our goal is to consistently provide model implementations compatible with Unsloth, Axolotl, LlamaFactory, TRL, and other tools within the PyTorch ecosystem. Additionally, we are working with tools like MaxText in the JAX ecosystem to ensure robust interoperability between their frameworks and Transformers.

All fine-tuning and post-training tools can now rely on Transformers for model definitions, further enabling Agentic use-cases through OpenEnv or the Prime Environment Hub.

Inference

Inference is a significant focus for v5, introducing several paradigm shifts: specialized kernels, cleaner defaults, new APIs, and support for optimized inference engines.

Similar to our training efforts, we have invested in packaging kernels to ensure their automatic utilization when your hardware and software permit. If you are new to kernels, we recommend exploring this documentation.

Alongside this effort, we are releasing two new APIs specifically for inference:

  1. Continuous batching and paged attention mechanisms: These have been used internally for some time, and we are now finalizing the rough edges and preparing usage guides.
  2. transformers serve: This is the new Transformers-specific serving system, which deploys an OpenAI API-compatible server.
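The scheduling idea behind continuous batching can be pictured in a few lines: finished requests free their slot immediately and waiting requests join mid-flight, instead of padding a static batch to its longest sequence. This is a toy model of the scheduling loop, not the Transformers implementation:

```python
# Toy continuous-batching scheduler. Each request is (id, decode_steps);
# slots are refilled from the waiting queue as soon as a request finishes.
from collections import deque

def continuous_batching(requests, max_batch_size=2):
    waiting = deque(requests)
    running = {}            # request_id -> remaining decode steps
    completed_order = []

    while waiting or running:
        # Admit waiting requests whenever a slot is free.
        while waiting and len(running) < max_batch_size:
            rid, steps = waiting.popleft()
            running[rid] = steps
        # One decode step for every request currently in the batch.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]          # slot freed immediately
                completed_order.append(rid)
    return completed_order
```

For example, with a batch size of 2, a short request queued behind a long one ("c" behind "b" below) completes without waiting for the long one to finish, which is exactly the property that benefits evaluation workloads with many concurrent requests.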

We view this as a major step forward for use-cases like evaluation, where numerous inference requests are processed simultaneously. Our aim is not to replicate the specialized optimizations of dedicated inference engines (vLLM, SGLang, TensorRT LLM). Instead, we strive for perfect inter-compatibility with these engines, as detailed in the next section.

"The Transformers backend in vLLM has been very enabling to get more architectures, like BERT and other encoders, available to more users. We've been working with the Transformers team to ensure many models are available across modalities with the best performance possible. This is just the start of our collaboration: we're happy to see the Transformers team will have this as a focus going into version 5."

— Simon Mo, Harry Mellor at vLLM

"Standardization is key to accelerating AI innovation. Transformers v5 empowers the SGLang team to spend less time on model reimplementation and more time on kernel optimization. We look forward to building a more efficient and unified AI ecosystem together!"

— Chenyang Zhao at SGLang

Production & Local

Recently, we have collaborated closely with the most popular inference engines to encourage their use of Transformers as a backend. The value proposition is significant: as soon as a model is added to Transformers, it becomes available in these inference engines, leveraging each engine's unique strengths, such as inference optimizations, specialized kernels, and dynamic batching.

We have also worked intimately with ONNXRuntime, llama.cpp, and MLX to ensure excellent interoperability between Transformers and these modeling libraries. For example, thanks to a substantial community effort, it is now very easy to load GGUF files in Transformers for further fine-tuning. Conversely, Transformers models can be easily converted to GGUF files for use with llama.cpp.

"The Transformers framework is the go-to place for reference AI model implementations. The framework plays a crucial role in enabling modern AI across the entire stack. The team and the community behind the project truly understand and embrace the spirit of the open-source development and collaboration."

— Georgi Gerganov, ggml-org

The same interoperability holds true for MLX, where Transformers' safetensors files are directly compatible with MLX's models.
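Part of what makes this direct compatibility possible is how simple the safetensors format is: an 8-byte little-endian header length, a JSON header describing each tensor (dtype, shape, byte offsets), then the raw tensor bytes. The sketch below builds and re-parses a minimal file of that shape with the standard library only (a simplified illustration of the format, not the safetensors library itself):

```python
# Minimal sketch of the safetensors layout:
#   [8-byte LE header length][JSON header][raw tensor bytes]
import json
import struct

def build_safetensors(tensors):
    """tensors: dict of name -> (dtype, shape, raw_bytes)."""
    header, payload, offset = {}, b"", 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + len(raw)],
        }
        payload += raw
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload

def read_header(blob):
    """Parse only the JSON header, without touching the tensor bytes."""
    (n,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8:8 + n])

# A single 2-element float32 tensor named "w".
blob = build_safetensors({"w": ("F32", [2], struct.pack("<2f", 1.0, 2.0))})
meta = read_header(blob)
```

Because the header is plain JSON and tensors are raw contiguous bytes, any framework (MLX, PyTorch, JAX) can map the same file without conversion.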

"It’s hard to overstate the importance of Transformers (and datasets, tokenizers, etc) to the open-source and overall AI ecosystem. I can’t count the number of times I’ve personally used Transformers as a source-of-truth."

— Awni Hannun, MLX

Finally, we are pushing the boundaries of local inference and working hand-in-hand with the executorch team to make Transformers models available on-device. We are expanding coverage to multimodal models (vision, audio) through Optimum.

Quantization

Quantization is rapidly emerging as the standard for state-of-the-art model development. Many SOTA models are now released in low-precision formats such as 8-bit and 4-bit (e.g., gpt-oss, Kimi-K2, Deepseek-r1), hardware is increasingly optimized for low-precision workloads, and the community actively shares high-quality quantized checkpoints. In v5, we are making quantization a central focus of Transformers support, ensuring full compatibility with all major features, and delivering a reliable framework for training and inference.

We introduce a significant change to how we load weights in our models, elevating quantization to a first-class citizen.
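As a refresher on what low-precision storage means in practice, here is a minimal absmax int8 round trip: weights are scaled into the int8 range and a per-tensor scale is kept to dequantize on the fly. This is a conceptual sketch, not how Transformers or any particular quantization backend implements it:

```python
# Absmax int8 quantization sketch: store int8 values plus one scale,
# reconstruct approximate float weights by multiplying back.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    return [x * scale for x in q]

w = [0.42, -1.27, 0.008, 0.635]
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)   # close to w, within one scale step
```

The storage win is the point: each weight shrinks from 4 bytes (float32) to 1 byte, at the cost of a bounded rounding error, which is why first-class loading support for such formats matters for 8-bit and 4-bit checkpoints.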

"Our collaboration with the Transformers team was highly productive, marked by their proactive code reviews, feedback, and technical expertise. Their support was crucial in integrating TorchAO, expanding quantization features, and improving documentation for broader adoption in v5."

— Jerry Zhang at TorchAO

"We're excited that v5 has made quantization a first-class citizen. It provides the foundation for bitsandbytes to better support key features like TP and MoEs, and also makes it easier to integrate new quantization methods."

— Matthew Douglas & Titus von Koeller, bitsandbytes

Conclusion

The overarching theme of this version 5 release is "interoperability." All refactors, performance improvements, and standardization efforts align with this theme. V5 seamlessly integrates end-to-end with the growing ecosystem: train a model with Unsloth/Axolotl/LlamaFactory/MaxText, deploy it with vLLM/SGLang, and export it to llama.cpp/executorch/MLX to run locally!

Version 5 is undeniably an accomplishment achieved over the past five years by a vast number of people within our community. We also view it as a promise and a beacon guiding our future direction.

We seized this opportunity to streamline the toolkit and isolate its core essentials, providing a clean slate upon which to build. Thanks to numerous changes from the community and the team, shipping improvements in performance, usability, and readability will now be simpler.

Now that the first Release Candidate of v5.0.0 is available, we eagerly await your input. Please consult our release notes for all the technical details, and share your feedback in our GitHub issues.