Announcing Magika 1.0: Faster, Smarter, and Rebuilt in Rust

Open Source

Google unveils Magika 1.0, its AI-powered file type detection system. Rebuilt in Rust for speed and security, it now supports over 200 file types with enhanced accuracy and new features.

Announcing Magika 1.0: Faster, Smarter, and Rebuilt in Rust

By Elie Bursztein, Julien Cretin, and Yanick Fratantonio, Applied Cybersecurity Research

Early last year, we open-sourced Magika, Google's AI-powered file type detection system. Since its alpha release, Magika has seen significant adoption within open-source communities, boasting over one million monthly downloads. Today, we are excited to announce the release of Magika 1.0, the first stable version, introducing new features and a host of major improvements since our last announcement. Here are the highlights:

  • Expanded file type support for more than 200 types (doubled from approximately 100).
  • A brand-new, high-performance engine rewritten from the ground up in Rust.
  • A native Rust command-line client for maximum speed and security.
  • Improved accuracy for challenging text-based formats like code and configuration files.
  • Revamped Magika Python and TypeScript modules for even easier integrations.

Smarter Detection: Doubling Down on File Types

Magika 1.0 now identifies more than 200 content types, doubling the number of file types supported from its initial release. This expansion isn't merely about a larger number; it unlocks far more granular and useful identification, particularly for specialized, modern file types.

Some of the notable new file types detected include:

  • Data Science & ML: We've added support for formats such as Jupyter Notebooks (ipynb), Numpy arrays (npy, npz), PyTorch models (pytorch), ONNX (onnx) files, Apache Parquet (parquet), and HDF5 (h5).
  • Modern Programming & Web: The model now recognizes dozens of languages and frameworks. Key additions include Swift (swift), Kotlin (kotlin), TypeScript (typescript), Dart (dart), Solidity (solidity), Web Assembly (wasm), and Zig (zig).
  • DevOps & Configuration: We've expanded detection for critical infrastructure and build files, such as Dockerfiles (dockerfile), TOML (toml), HashiCorp HCL (hcl), Bazel (bazel) build files, and YARA (yara) rules.
  • Databases & Graphics: We also added support for common formats like SQLite (sqlite) databases, AutoCAD (dwg, dxf) drawings, Adobe Photoshop (psd) files, and modern web fonts (woff, woff2).

Enhanced Granularity

Magika is now smarter at differentiating similar formats that might have previously been grouped together. For example, it can now distinguish:

  • JSONL (jsonl) vs. generic JSON (json)
  • TSV (tsv) vs. CSV (csv)
  • Apple binary plists (applebplist) from regular XML plists (appleplist)
  • C++ (cpp) vs. C (c)
  • JavaScript (javascript) vs. TypeScript (typescript)

Expanding Magika's detection capabilities introduced two significant technical hurdles: data volume and data scarcity.

First, the scale of the data required for training was a key consideration. Our training dataset grew to over 3TB when uncompressed, which required an efficient processing pipeline. To handle this, we leveraged our recently released SedPack dataset library. This tool allows us to stream and decompress this large dataset directly to memory during training, bypassing potential I/O bottlenecks and making the process feasible.

Second, while common file types are plentiful, many of the new, specialized, or legacy formats presented a data scarcity challenge. It is often not feasible to find thousands of real-world samples for every file type. To overcome this, we turned to generative AI. We leveraged Gemini to create a high-quality, synthetic training set by translating existing code and other structured files from one format to another. This technique, combined with advanced data augmentation, allowed us to build a robust training set, ensuring Magika performs reliably even on file types for which public samples are not readily available.

The complete list of all 200+ supported file types is available in our revamped documentation.

Under the Hood: A High-Performance Rust Engine

We completely rewrote Magika's core in Rust to provide native, fast, and memory-safe content identification. This engine is at the heart of the new Magika native command-line tool that can safely scan hundreds of files per second.

Output of the new Magika Rust-based command-line tool

Magika is able to identify hundreds of files per second on a single core and easily scale to thousands per second on modern multi-core CPUs thanks to the use of the high-performance ONNX Runtime for model inference and Tokio for asynchronous parallel processing. For example, as visible in the chart below, on a MacBook Pro (M4), Magika processes nearly 1,000 files per second.

Getting Started

Ready to try it out? Getting started with the native command-line client is as simple as typing a single command:

On Linux and MacOS:

curl -LsSf https://securityresearch.google/magika/install.sh | sh

On Windows (PowerShell):

powershell -ExecutionPolicy ByPass -c "irm https://securityresearch.google/magika/install.ps1 | iex"

Alternatively, the new Rust command-line client is also included in the magika Python package, which you can install with:

pipx install magika

For developers looking to integrate Magika as a library into their own applications in Python, JavaScript/TypeScript, Rust, or other languages, head over to our comprehensive developer documentation to get started.

What's Next

We're incredibly excited to see what you will build using Magika's enhanced file detection capabilities. We invite you to join the community:

  • Try Magika: Install it and run it on your files, or try it out in our web demo.
  • Integrate Magika into your software: Visit our documentation to get started.
  • Give us a star on GitHub to show your support.
  • Report issues or suggest new file types you'd like to see by opening a feature request.
  • Contribute new features and bindings by opening a pull request.

Thank you to everyone who has contributed, provided feedback, and used Magika over the past year. We can't wait to see what the future holds.

Acknowledgements

Magika's continued success was made possible by the help and support of many people, including: Ange Albertini, Loua Farah, Francois Galilee, Giancarlo Metitieri, Alex Petit-Bianco, Kurt Thomas, Luca Invernizzi, Lenin Simicich, and Amanda Walker.