Perplexity AI's TransferEngine Solves Cloud Lock-in and Hardware Challenges for Trillion-Parameter Models
Perplexity AI introduces TransferEngine, an open-source tool enabling trillion-parameter AI models to run on older GPUs across multiple cloud providers. It eliminates vendor lock-in and reduces the need for costly new hardware, enhancing performance and portability for LLMs.
Perplexity AI has introduced TransferEngine, an open-source software tool designed to tackle two significant challenges for enterprises deploying AI systems: vendor lock-in to specific cloud providers and the intensive hardware requirements for running massive AI models.

TransferEngine enables high-speed, cross-provider communication for large language models (LLMs). This innovation allows companies to deploy trillion-parameter models, such as DeepSeek V3 and Kimi K2, on readily available older H100 and H200 GPU systems, circumventing the need for expensive, scarce next-generation hardware. Perplexity detailed its findings in a research paper and made the tool publicly available on GitHub.
The researchers stated the problem directly: "Existing implementations are locked to specific Network Interface Controllers, hindering integration into inference engines and portability across hardware providers."
Addressing Vendor Lock-in
The root of vendor lock-in lies in fundamental technical incompatibilities. Cloud providers employ distinct networking protocols for high-speed GPU communication; Nvidia’s ConnectX chips utilize one standard, while AWS’s Elastic Fabric Adapter (EFA) uses a proprietary protocol. Previous solutions were siloed, supporting only one system or the other, forcing companies into a single cloud ecosystem or accepting severely degraded performance.
This issue is particularly pronounced with modern Mixture-of-Experts (MoE) models, like DeepSeek V3 (671 billion parameters) and Kimi K2 (a full trillion parameters). These models are too large for single eight-GPU systems. While Nvidia's new GB200 systems (72-GPU servers) offer a powerful solution, they are costly, in short supply, and not universally accessible. In contrast, H100 and H200 systems are more abundant and affordable.
However, running large models across multiple older systems traditionally incurred significant performance penalties. As the research team noted, "There are no viable cross-provider solutions for LLM inference," with existing libraries either lacking AWS support or exhibiting severe performance degradation on Amazon's hardware. TransferEngine seeks to overcome these limitations. "TransferEngine enables portable point-to-point communication for modern LLM architectures, avoiding vendor lock-in while complementing collective libraries for cloud-native deployments," the researchers stated.
How TransferEngine Works
TransferEngine functions as a universal translator for GPU-to-GPU communication. It establishes a common interface compatible with diverse networking hardware by identifying shared core functionalities across various systems.
The technology leverages Remote Direct Memory Access (RDMA), which facilitates direct data transfer between graphics cards without involving the main processor—essentially creating a dedicated, high-speed channel between chips. Perplexity’s implementation achieved a throughput of 400 gigabits per second on both Nvidia ConnectX-7 and AWS EFA, matching existing single-platform solutions. Furthermore, TransferEngine supports aggregating bandwidth by utilizing multiple network cards per GPU, enhancing communication speed.
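The bandwidth-aggregation idea can be illustrated with a small striping sketch: a buffer is split into chunks assigned round-robin across several NIC queues, so transfers proceed in parallel and aggregate throughput scales with the number of cards. The function name, chunk size, and NIC count below are illustrative, not TransferEngine's actual interface.

```python
def stripe(buf: bytes, num_nics: int, chunk: int = 4):
    """Split a buffer into fixed-size chunks and assign them round-robin
    to per-NIC send queues, so each card carries a share of the traffic."""
    chunks = [buf[i:i + chunk] for i in range(0, len(buf), chunk)]
    queues = [[] for _ in range(num_nics)]
    for i, c in enumerate(chunks):
        queues[i % num_nics].append(c)
    return queues

# A 16-byte payload striped across two hypothetical NICs:
queues = stripe(b"0123456789ABCDEF", num_nics=2)
print([b"".join(q) for q in queues])  # [b'012389AB', b'4567CDEF']
```

The receiver reassembles by interleaving the chunks back in order; real RDMA striping works on registered memory regions rather than Python byte strings, but the load-balancing logic is the same.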
The paper explained, "We address portability by leveraging the common functionality across heterogeneous RDMA hardware," by creating "a reliable abstraction without ordering guarantees" over underlying protocols.
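The shape of such an abstraction can be sketched as a single API over interchangeable backends: one-sided writes with a per-transfer completion callback, and no ordering guarantee between submitted transfers (the lowest common denominator across NICs). All class and method names here are hypothetical, not TransferEngine's published Python bindings.

```python
from abc import ABC, abstractmethod

class RdmaBackend(ABC):
    """Common denominator across heterogeneous RDMA NICs: one-sided
    writes that signal completion individually, with no cross-transfer
    ordering guarantee."""

    @abstractmethod
    def write(self, dest: str, buf: bytes, on_done) -> None: ...

class ConnectXBackend(RdmaBackend):
    def write(self, dest, buf, on_done):
        # Real code would post an RDMA WRITE work request via the
        # verbs interface; here we just signal completion.
        on_done(dest, len(buf))

class EfaBackend(RdmaBackend):
    def write(self, dest, buf, on_done):
        # Real code would use EFA's transport, whose delivery is
        # unordered -- hence the abstraction promises no ordering.
        on_done(dest, len(buf))

class PortableEngine:
    """Callers see one point-to-point API regardless of which NIC sits
    underneath; completion is tracked per transfer, never by order."""

    def __init__(self, backend: RdmaBackend):
        self.backend = backend
        self.completed = []

    def submit(self, dest: str, buf: bytes):
        self.backend.write(dest, buf,
                           lambda d, n: self.completed.append((d, n)))

engine = PortableEngine(EfaBackend())
engine.submit("gpu-node-1", b"kv-cache-page")
engine.submit("gpu-node-2", b"expert-activations")
print(engine.completed)
```

Swapping `EfaBackend` for `ConnectXBackend` changes nothing for the caller, which is the portability property the paper describes.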
Live in Production Environments
TransferEngine is not merely a theoretical concept; Perplexity has deployed it in production to power its AI search engine across three key systems. For disaggregated inference, it manages high-speed transfer of cached data between servers, enabling dynamic scaling of AI services. It also supports Perplexity's reinforcement learning system, completing weight updates for trillion-parameter models in just 1.3 seconds.
Crucially, TransferEngine is implemented for Mixture-of-Experts (MoE) routing. MoE models direct different requests to specialized "experts" within the model, generating significantly more network traffic than conventional models. While DeepSeek developed its DeepEP framework for this purpose, it was restricted to Nvidia ConnectX hardware. TransferEngine matched DeepEP's "state-of-the-art latency" on ConnectX-7 while establishing "the first viable implementation compatible with AWS EFA."
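Why MoE routing is so network-heavy can be seen in a toy dispatch loop: each token is sent to its top-k experts, which may live on different nodes, so every batch fans out into many small point-to-point transfers. The random top-2 gating below stands in for a learned gating network, and all names and shard sizes are illustrative.

```python
import random

NUM_EXPERTS = 8
EXPERTS_PER_NODE = 4  # toy sharding: 8 experts across 2 nodes

def route(token_id: int, k: int = 2):
    """Toy top-k gating: pick k distinct experts per token.
    Real MoE models use a learned gating network instead."""
    rng = random.Random(token_id)  # deterministic for the demo
    return rng.sample(range(NUM_EXPERTS), k)

def dispatch(batch):
    """Group (token, expert) pairs by destination node, as a
    point-to-point layer would before posting transfers."""
    per_node = {}
    for tok in batch:
        for expert in route(tok):
            node = expert // EXPERTS_PER_NODE
            per_node.setdefault(node, []).append((tok, expert))
    return per_node

sends = dispatch(range(16))
total = sum(len(v) for v in sends.values())
print(total)  # 16 tokens x top-2 experts = 32 point-to-point sends
```

Every decoding step repeats this fan-out, which is why MoE inference stresses point-to-point latency far more than the bulk collectives dense models rely on.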
Tests with DeepSeek V3 and Kimi K2 on AWS H200 instances demonstrated substantial performance gains when models were distributed across multiple nodes, particularly at medium batch sizes—the optimal range for production serving.
The Open-Source Strategy
Perplexity’s decision to open-source its production infrastructure is a notable departure from competitors like OpenAI and Anthropic, who maintain proprietary technical implementations. The company released the complete library, including code, Python bindings, and benchmarking tools, under an open license. This move echoes Meta’s strategy with PyTorch: fostering an industry standard through open-source contributions. Perplexity continues to optimize the technology for AWS, collaborating on updates to Amazon’s networking libraries for further latency reduction.