From H.264 to JPEGs: How 15-Year-Old Screen Sharing Technology Solved Modern Streaming Challenges

Software Engineering

Discover how a sophisticated H.264 WebCodecs streaming pipeline was ultimately replaced by simple JPEG screenshot polling to achieve reliable screen sharing on challenging enterprise networks.

In 2025, our team embarked on a three-month journey to develop a sophisticated, hardware-accelerated, 60fps H.264 streaming pipeline using WebCodecs over WebSockets. Ironically, this advanced solution was eventually supplanted by a simple HTTP-based JPEG screenshot polling method when network conditions became unreliable.

The Challenge: Enterprise Network Constraints

We are developing Helix, an AI platform that enables autonomous coding agents to operate within cloud sandboxes. A core requirement is for users to observe their AI assistants in real time: essentially screen sharing, except the one writing the code is a robot. Previously, we detailed our transition from WebRTC to a custom WebSocket streaming pipeline. This week, we explain why even that sophisticated solution proved insufficient.

Our primary constraint was ensuring compatibility with enterprise networks. These environments typically impose strict limitations, often restricting traffic exclusively to HTTP/HTTPS on port 443. Common protocols and configurations are frequently blocked or deprioritized:

  • UDP: Blocked, deprioritized, or dropped due to perceived security risks.
  • WebRTC: Relies on TURN servers, which often necessitate UDP, leading to blockages.
  • Custom ports: Typically disallowed by firewalls.
  • STUN/ICE: NAT traversal mechanisms are rarely permitted within corporate networks.
  • Non-standard protocols: Generally denied by policy.

Our initial attempt with WebRTC performed well in development and cloud environments. However, upon deployment to an enterprise customer, connectivity issues arose. Network diagnostics revealed outbound UDP blockage, unreachable TURN servers, and failing ICE negotiation. Rather than engaging in lengthy battles with IT departments to configure proxies and TURN servers, we accepted the reality: all traffic had to flow over HTTPS on port 443.

This led us to develop a pure WebSocket video pipeline:

  • H.264 encoding powered by GStreamer and VA-API for hardware acceleration.
  • Binary frames transmitted over WebSocket (Layer 7), ensuring compatibility through any proxy.
  • Browser-side hardware decoding leveraging the WebCodecs API.
  • 60fps at 40Mbps with sub-100ms latency.

We were immensely proud of this pipeline, crafted with Rust and TypeScript, featuring a custom binary protocol optimized for microsecond-level performance.
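For a sense of what the browser half of that pipeline looked like, here is a rough sketch of parsing an incoming binary frame before handing it to a WebCodecs VideoDecoder. The header layout shown (one keyframe-flag byte, a u64 microsecond timestamp, then the H.264 payload) is an illustrative assumption, not our actual wire format:

```typescript
// Sketch: parsing a (hypothetical) binary video frame received over WebSocket.
// Layout assumed for illustration: byte 0 = keyframe flag, bytes 1-8 = u64
// presentation timestamp in microseconds, bytes 9+ = raw H.264 NAL units.
interface ParsedFrame {
  type: 'key' | 'delta'
  timestampUs: number
  payload: Uint8Array
}

function parseFrame(buf: ArrayBuffer): ParsedFrame {
  const view = new DataView(buf)
  return {
    type: view.getUint8(0) === 1 ? 'key' : 'delta', // IDR frames flagged as keyframes
    timestampUs: Number(view.getBigUint64(1)),      // presentation timestamp
    payload: new Uint8Array(buf, 9),                // H.264 payload for the decoder
  }
}

// In the browser, each parsed frame becomes an EncodedVideoChunk:
//   decoder.decode(new EncodedVideoChunk({ type, timestamp, data: payload }))
```

The decode path itself runs on a `VideoDecoder` configured with `optimizeForLatency: true`, which is what keeps a healthy connection under 100ms glass-to-glass.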

Performance Issues and Denial

Despite our pipeline's robustness on ideal networks, real-world conditions presented new challenges. When tested from a coffee shop with unstable Wi-Fi, the video stream froze, and user input became unresponsive. The stream quickly accumulated a significant delay, showing actions that occurred 30 seconds or more in the past.

High-bandwidth video streams like our 40Mbps H.264 pipeline are acutely sensitive to network conditions. Under congestion:

  • Frames accumulate in the TCP/WebSocket buffer.
  • While still delivered in order, they experience increasing delays.
  • The video feed progressively falls behind real time, to the point where observing an AI's actions 45 seconds late renders feedback useless.

Lowering the bitrate to 10Mbps only resulted in a blocky, low-quality stream that remained significantly delayed.
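The arithmetic behind that backlog is unforgiving. A toy model (illustrative numbers, not measurements from our pipeline) shows how quickly an in-order TCP stream falls behind once the encoder outpaces the link:

```typescript
// Sketch: how a TCP-buffered stream falls behind real time.
// Bits the link cannot carry queue up in the TCP/WebSocket buffer, and the
// viewer's delay is that queue divided by the link's drain rate.
function backlogAfter(seconds: number, streamMbps: number, linkMbps: number): number {
  const excessMbit = Math.max(0, streamMbps - linkMbps) * seconds // queued megabits
  return excessMbit / linkMbps // extra seconds of delay the viewer sees
}

// A 40 Mbps stream on a 10 Mbps link: after just 10s the viewer is 30s behind.
backlogAfter(10, 40, 10) // → 30
```

Because TCP delivers everything in order, no frame is ever dropped to catch up; the delay only grows until the connection improves or the stream resets.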

Exploring Alternatives and Frustrations

Desperate for a solution, we explored various strategies:

Keyframe-Only Streaming

Our initial thought was to send only H.264 keyframes (IDR frames), which are self-contained and don't rely on previous frames. This approach aimed to provide approximately 1fps of corruption-free video, suitable for low-bandwidth scenarios. We implemented a keyframes_only flag, adjusted the decoder to check for FrameType::Idr, and set the Group of Pictures (GOP) size to 60 (one keyframe per second at 60fps).

Upon testing, we received a single, perfect 1080p IDR frame, followed by silence. Despite the encoder running and GStreamer producing frames, nothing else came through. Our WebSocket streaming layer, built atop the Moonlight protocol (reverse-engineered from NVIDIA GameStream), seemed to have an internal logic that prevented further frame delivery if P-frames were not being consumed. Without deep dives into the Moonlight protocol, this proved insurmountable; the protocol demanded all frames or none.

Other ideas, such as implementing sophisticated TCP congestion control, were quickly dismissed due to their complexity. Facing the limitations of enterprise network throttling, frustration mounted.

The JPEG Revelation

During a late-night debugging session, while troubleshooting another frozen stream, a developer instinctively opened our internal screenshot debugging endpoint in a browser:

GET /api/v1/external-agents/abc123/screenshot?format=jpeg&quality=70

The result was immediate: a pristine, 150KB JPEG image of the remote desktop, crystal clear, artifact-free, with no reliance on keyframes or complex decoder state. Subsequent refreshes produced instant images. Rapidly hitting refresh yielded a consistent 5 FPS of perfect screenshots.

This simple, robust behavior stood in stark contrast to our elaborate WebCodecs pipeline. Initially, the idea of abandoning our sophisticated video codecs for HTTP-based individual frame requests, reminiscent of early 2000s web development, felt professionally unacceptable.

Acceptance and Implementation

Despite initial reluctance, the effectiveness of JPEG polling was undeniable. We implemented a basic polling mechanism:

// Poll screenshots sequentially (capped below 10 FPS)
const fetchScreenshot = async () => {
  const response = await fetch(`/api/v1/external-agents/${sessionId}/screenshot`)
  const blob = await response.blob()
  const previousUrl = screenshotImg.src
  screenshotImg.src = URL.createObjectURL(blob)
  if (previousUrl) URL.revokeObjectURL(previousUrl) // avoid leaking blob URLs
  setTimeout(fetchScreenshot, 100) // wait 100ms after each frame completes
}

This straightforward approach delivered perfect results.

The Advantages of JPEG Screenshots

Comparing our H.264 streaming pipeline with the JPEG polling method reveals compelling advantages:

Property                  | H.264 Stream                | JPEG Polling
--------------------------|-----------------------------|--------------------------------------
Bandwidth                 | 40 Mbps (constant)          | 100-500 Kbps (varies with complexity)
State                     | Stateful (corruptible)      | Stateless (each frame independent)
Latency Sensitivity       | Very high                   | Low
Packet Loss Recovery      | Wait for keyframe (seconds) | Next frame (100ms)
Implementation Complexity | 3 months of Rust            | fetch() in a loop

JPEG screenshots are inherently self-contained. Each image either arrives completely or not at all, eliminating issues like partial decodes, reliance on keyframes, or decoder state corruption. Under poor network conditions, the only consequence is a reduced frame rate, with each delivered frame remaining perfectly clear.

Furthermore, a 70% quality 1080p JPEG desktop screenshot typically ranges from 100-150KB, often less than a single H.264 keyframe (200-500KB). This means we're transmitting less data per frame while achieving superior reliability.

The Hybrid Approach: Adaptive Switching

Our H.264 pipeline was not entirely discarded. Instead, we developed an adaptive switching mechanism:

  • Good connection (RTT < 150ms): The system utilizes the full 60fps H.264 stream with hardware decoding for a smooth experience.
  • Bad connection detected: The video is paused, and the system switches to screenshot polling.
  • Connection recovers: The user is prompted to click and retry video streaming.

A crucial insight was the continued necessity of WebSockets for user input. Keyboard and mouse events, being minimal (around 10 bytes each), are handled perfectly by WebSockets even on highly congested networks. The goal was simply to halt the transmission of large video frames.
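To illustrate why input survives when video does not, here is a hypothetical compact event encoding. The exact layout is invented for illustration (our real protocol differs), but it shows the scale involved:

```typescript
// Sketch: a compact mouse-event encoding, invented for illustration.
// 1 type byte + two u16 coordinates + 1 button byte = 6 bytes total,
// versus hundreds of kilobytes for a single 1080p video frame.
function encodeMouseEvent(x: number, y: number, button: number): Uint8Array {
  const buf = new Uint8Array(6)
  const view = new DataView(buf.buffer)
  view.setUint8(0, 0x01)   // event type: mouse
  view.setUint16(1, x)     // x coordinate
  view.setUint16(3, y)     // y coordinate
  view.setUint8(5, button) // button state
  return buf
}

encodeMouseEvent(640, 360, 1).length // 6 bytes — trivial even on a saturated link
```

Even a congested link with seconds of video backlog can deliver a 6-byte message almost immediately, which is why input stayed responsive long after the video froze.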

We implemented a simple control message:

{"set_video_enabled": false}

Upon receiving this, the server ceases sending video frames, and the client initiates screenshot polling. Input flow remains uninterrupted. This core logic was implemented in approximately 15 lines of Rust:

if !video_enabled.load(Ordering::Relaxed) {
    continue; // Skips frame, enabling screenshot mode
}
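The client side of the same switch can be sketched like this. The `ControlChannel` interface and `startScreenshotPolling` callback are illustrative names, not our actual API:

```typescript
// Sketch of the client-side fallback: tell the server to stop sending video
// frames over the existing socket, then begin screenshot polling.
interface ControlChannel {
  send(message: string): void
}

function fallBackToScreenshots(ws: ControlChannel, startScreenshotPolling: () => void): void {
  ws.send(JSON.stringify({ set_video_enabled: false })) // server's loop starts skipping frames
  startScreenshotPolling()                              // input keeps flowing on the same socket
}
```

Note that the WebSocket is never closed: it keeps carrying keyboard and mouse events while the heavy video traffic moves to HTTP.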

Addressing the Oscillation Problem

An interesting bug emerged: when video frames stopped, WebSocket traffic shrank to just input events and pings, and measured latency dropped dramatically. Our adaptive mode would then falsely detect a "recovered" connection and switch back to video, instantly flooding the network, spiking latency, and triggering a switch back to screenshots. This created a continuous 2-second oscillation loop.

The solution was straightforward: once the system falls back to screenshots, it remains in that state until the user explicitly initiates a video retry. This is managed by:

setAdaptiveLockedToScreenshots(true) // Prevents oscillation

A visual cue, such as an amber icon and a message like "Video paused to save bandwidth. Click to retry," informs the user and puts them in control, eliminating the infinite loop.

Ubuntu's grim and Its Missing JPEG Support

A surprising hurdle arose with grim, a Wayland screenshot tool ideal for our requirements due to its JPEG output capability for smaller file sizes. However, standard Ubuntu distributions compile grim without libjpeg support.

$ grim -t jpeg screenshot.jpg
error: jpeg support disabled

To overcome this, our Dockerfile now includes a build stage to compile grim from source with JPEG support enabled:

FROM ubuntu:25.04 AS grim-build
RUN apt-get update && apt-get install -y meson ninja-build libjpeg-turbo8-dev ...
RUN git clone https://git.sr.ht/~emersion/grim && \
    cd grim && \
    meson setup build -Djpeg=enabled && \
    ninja -C build

This ensures JPEG capabilities for our screenshot functionality, even if it requires building a screenshot tool from source in 2025.

The Final Adaptive Architecture

Our refined architecture combines the strengths of both approaches:

┌─────────────────────────────────────────────────────────────┐
│ User's Browser                                              │
├─────────────────────────────────────────────────────────────┤
│ WebSocket (always connected)                                │
│ ├── Video frames (H.264) ──────────── when RTT < 150ms      │
│ ├── Input events (keyboard/mouse) ── always                 │
│ └── Control messages ─────────────── {"set_video_enabled"}  │
│                                                             │
│ HTTP (screenshot polling) ─────────── when RTT > 150ms      │
│ └── GET /screenshot?quality=70                              │
└─────────────────────────────────────────────────────────────┘
  • Good Connection: Provides 60fps H.264, hardware-accelerated, high-quality video.
  • Bad Connection: Switches to 2-10fps JPEGs, offering reliable performance across challenging networks.

Screenshot quality is also adaptively managed:

  • If a frame takes >500ms to transmit, quality is decreased by 10%.
  • If a frame takes <300ms, quality is increased by 5%.
  • The target is a minimum of 2 FPS consistently.
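That control loop fits in a few lines. A sketch, interpreting the 10% and 5% steps multiplicatively and clamping to an assumed 20-90 JPEG quality range:

```typescript
// Sketch of the adaptive quality loop: back off quality on slow frames,
// recover it on fast ones, hold steady in between. The 20-90 clamp range
// is an illustrative assumption.
function nextQuality(current: number, frameTimeMs: number): number {
  if (frameTimeMs > 500) return Math.max(20, Math.round(current * 0.9))  // too slow: -10%
  if (frameTimeMs < 300) return Math.min(90, Math.round(current * 1.05)) // fast: +5%
  return current // in the 300-500ms band, hold steady
}
```

The asymmetric step sizes (fast decrease, slow increase) bias the loop toward staying under the 500ms budget, which is what keeps the 2 FPS floor achievable on a degrading link.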

Key Lessons Learned

  1. Simplicity Often Prevails: Complex, cutting-edge solutions do not always outperform simpler, established methods. A late-night pivot to screenshots after three months of H.264 development dramatically improved reliability.
  2. Graceful Degradation is Essential: User experience should prioritize functionality over raw technical prowess. Users primarily care about seeing the screen and providing input, regardless of the underlying codec.
  3. WebSockets for Input, Not Always Video: While WebSockets are excellent for low-latency, small-payload data like user input, they are not always the optimal choice for large video streams, especially on unreliable networks.
  4. Verify Package Features: Standard distribution packages (like Ubuntu's grim) may omit crucial features, necessitating compilation from source.
  5. Measure Before Optimizing: Assumptions about the "best" technical approach should be challenged with real-world data and user experience considerations.

Helix is an open-source AI infrastructure designed for real-world reliability, even under difficult network conditions. Our journey involved first replacing WebRTC, then refining our own streaming solution, and ultimately demonstrating that sometimes the most effective solution is one that has existed for over a decade.