Exploring the Causes and Impact of Large HTML Documents on the Web


Uncover the surprising reasons behind oversized HTML documents, from embedded media and extensive inline styles to client-side application state. Learn how these factors impact page speed and discover strategies for optimizing web performance.

Most HTML documents are relatively small, serving as a foundational starting point for other resources on a page. However, some websites load several megabytes of HTML code. This often isn't due to extensive page content, but rather the embedding of various resource types directly within the document. This article explores examples of large HTML documents across the web, examining their code to identify the factors contributing to their size.

HTML on the web is full of surprises. During the research for this article, the HTML Size Analyzer was significantly rebuilt to support complex nested structures, such as scripts containing JSON that contains HTML, which in turn contains CSS and even embedded images.

Embedded Images

Base64 encoding allows images to be converted into text, enabling their direct embedding within text files like HTML or CSS. This method offers a significant advantage: the browser no longer needs to make separate requests to display the image. However, for large files, this approach can lead to problems. Embedded images cannot be cached independently, and they are prioritized for download in the same way as the document content, even though images typically don't require immediate loading.

A typical case is a set of PNG files embedded in HTML using data URLs.
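
To make the mechanics concrete, here is a minimal Python sketch of how a binary file becomes a data URL. The helper name `to_data_url` and the placeholder bytes are assumptions made for illustration, not part of any particular site's build process.

```python
import base64


def to_data_url(raw: bytes, mime_type: str = "image/png") -> str:
    """Encode raw bytes as a Base64 data URL."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes standing in for a real PNG file (an assumption for this sketch).
fake_png = bytes(range(256)) * 12  # 3,072 bytes
data_url = to_data_url(fake_png)
img_tag = f'<img src="{data_url}" alt="icon">'

print(len(fake_png))  # size of the binary payload in bytes
print(len(data_url))  # size in characters once encoded as text
```

Because Base64 represents every 3 bytes as 4 characters, the embedded copy is about a third larger than the original file before compression.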

Different variations of this pattern exist:

  • Sometimes a single multi-megabyte image is accidentally included.
  • Other times, hundreds of small icons accumulate over time.
  • One observed site used responsive images with data URLs. Because every version is embedded in the HTML, this defeats the purpose of responsive images, which is to load only the minimum necessary resolution.

Indirectly embedded images also contribute, such as:

  • Inline SVGs that are thin wrappers around PNG or JPEG files.
  • Background images specified in inlined CSS stylesheets.
  • Images nested within JSON data (a topic we'll revisit).

One example is a style tag containing 201 rules with embedded background images.
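
One rough way to find embedded images of either kind is to scan the document for Base64 data URLs. The sketch below is a heuristic, not the method used by the HTML Size Analyzer. The regular expression and the `find_data_urls` helper are assumptions, and the pattern will miss data URLs that carry extra parameters or are escaped inside JSON.

```python
import re

# Matches Base64 data URLs such as data:image/png;base64,iVBORw0...
DATA_URL_RE = re.compile(r"data:([\w.+-]+/[\w.+-]+);base64,([A-Za-z0-9+/=]+)")


def find_data_urls(html: str) -> list[tuple[str, int]]:
    """Return (mime_type, length_in_characters) per data URL, largest first."""
    found = [(m.group(1), len(m.group(0))) for m in DATA_URL_RE.finditer(html)]
    return sorted(found, key=lambda item: item[1], reverse=True)


# Tiny hypothetical document with one direct and one CSS-embedded image.
sample = (
    '<img src="data:image/png;base64,AAAABBBBCCCC">'
    "<style>.logo { background: url(data:image/jpeg;base64,AAAA) }</style>"
)
for mime_type, size in find_data_urls(sample):
    print(f"{mime_type}: {size} characters")
```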

Inline CSS

Large inline CSS is frequently attributed to embedded images. However, long selectors resulting from deeply nested CSS also significantly increase both CSS and HTML size.

In one instance, an HTML document contained 20 inline style tags with similar content (e.g., variations like “header,” “header-mobile,” and “header-desktop”). Most selectors exceeded 200 characters, so selectors alone accounted for 47% of the overall stylesheet content, leaving barely half for the actual style declarations.

Despite this, the HTML compressed efficiently due to repetition within the selectors, reducing the size from 20.5 megabytes to only 2.3 megabytes after GZIP compression.
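
The reduction from 20.5 to 2.3 megabytes is roughly a 9x compression ratio. The following sketch shows why repeated selectors compress so well. The selector text and rule count are invented for illustration, so the ratio it prints will differ from the real document's.

```python
import gzip

# Hypothetical deeply nested selector; the real ones exceeded 200 characters.
base_selector = (
    ".page .layout .header{variant} .navigation .menu .menu-list "
    ".menu-item .menu-link .menu-icon .menu-label"
)
variants = ["", "-mobile", "-desktop"]

rules = []
for variant in variants:
    for index in range(500):
        selector = base_selector.format(variant=variant)
        rules.append(f"{selector}.item-{index} {{ color: #333; margin: 0 }}")

css = "\n".join(rules).encode("utf-8")
compressed = gzip.compress(css)

print(f"raw: {len(css)} bytes, gzip: {len(compressed)} bytes")
print(f"ratio: {len(css) / len(compressed):.1f}x")
```

Compression only reduces transfer size. The browser still has to decompress and parse the full uncompressed document.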

Embedded Fonts

Like images, fonts are also sometimes encoded as Base64. For one or two small fonts, this can work well, allowing text to render with the proper font immediately. However, when many fonts are embedded, visitors must wait for these fonts to finish downloading before any page content can render.

Client-Side Application State

Many modern websites are built as JavaScript applications. It would be slow to display content only after all JavaScript and required data have loaded, so during the initial page load, the HTML is often rendered on the server. Once the client-side application code loads, this static HTML is “hydrated”: the page content becomes interactive with JavaScript, and client-side code takes control of future content updates.

Typically, client-side code makes fetch requests to backend API endpoints to load necessary data. However, since the initial client-side render requires the same data as the server-side rendering process, servers embed this “hydration state” directly into the final HTML. This allows client-side hydration to occur immediately after loading all JavaScript, without requiring additional API requests.

As you might expect, this hydration state can become quite large! It can often be identified within script tags referencing framework-specific keywords, such as:

  • Next.js: self.__next_f.push or __NEXT_DATA__
  • Nuxt: __NUXT_DATA__
  • Redux: __PRELOADED_STATE__
  • Apollo: __APOLLO_STATE__
  • Angular: ng-state
  • Or similar __INITIAL_STATE__ or __INITIAL_DATA__ in many custom setups.
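
As a rough illustration, the keywords listed above can be used to locate hydration state in a saved HTML file and measure how much of the document it occupies. This is a heuristic sketch: the regular expression and the `hydration_scripts` helper are assumptions, and a simple pattern like this will not handle every edge case that a real HTML parser would.

```python
import re

# Marker strings taken from the list above.
MARKERS = [
    "self.__next_f.push", "__NEXT_DATA__", "__NUXT_DATA__",
    "__PRELOADED_STATE__", "__APOLLO_STATE__", "ng-state",
    "__INITIAL_STATE__", "__INITIAL_DATA__",
]

SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script>", re.DOTALL | re.IGNORECASE)


def hydration_scripts(html: str) -> list[tuple[str, int]]:
    """Return (marker, script_length) for each script tag mentioning a marker."""
    results = []
    for match in SCRIPT_RE.finditer(html):
        script = match.group(0)  # includes the opening tag, so id attributes count
        for marker in MARKERS:
            if marker in script:
                results.append((marker, len(script)))
                break  # count each script only once
    return results


# Hypothetical two-script document for demonstration.
sample = (
    '<script id="__NEXT_DATA__" type="application/json">{"props":{}}</script>'
    '<script>console.log("unrelated")</script>'
)
print(hydration_scripts(sample))
```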

In a local development environment with minimal data, the size of the hydration state might not be noticeable. However, as more data is added to the production database, the hydration state grows accordingly. For instance, a list of hotels might reference 3,561 different images (which, thankfully, are not Base64 embedded in this particular example).
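
The sketch below illustrates this growth with a simplified server-side render that embeds the same data it renders. The record shape, the `example.com` URLs, and the function names are hypothetical. Real frameworks serialize their state differently.

```python
import html
import json


def render_page(hotels: list[dict]) -> str:
    """Server-render a list and embed the same data as hydration state."""
    items = "".join(f"<li>{html.escape(hotel['name'])}</li>" for hotel in hotels)
    # Escape "<" so the JSON cannot prematurely close the script tag.
    state = json.dumps({"hotels": hotels}).replace("<", "\\u003c")
    return (
        f"<ul>{items}</ul>"
        f'<script id="__INITIAL_STATE__" type="application/json">{state}</script>'
    )


def make_hotels(count: int) -> list[dict]:
    # Hypothetical record shape, invented for this sketch.
    return [
        {"name": f"Hotel {i}", "image": f"https://example.com/hotels/{i}.jpg"}
        for i in range(count)
    ]


for count in (10, 3561):
    size = len(render_page(make_hotels(count)))
    print(f"{count} hotels -> {size} characters of HTML")
```

Every field of every record is serialized into the state, whether or not the rendered page displays it, which is why the state tends to grow faster than the visible content.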

If Base64 images are passed into front-end components, they will also inevitably end up in the hydration state. One website was found to have 42 images embedded within the JSON data inside its HTML document, with the largest image weighing in at 2.5 megabytes.

There’s a surprising amount of nesting that can occur. In the previous example, we saw images in JSON within a script inside the HTML. But it can go even deeper. Consider this next example:

Upon inspecting the hydration state, 52 products were discovered, each with a judgmeWidget property. The value of this property was, itself, an HTML fragment!

Analyzing one of these values in the HTML Size Analyzer revealed that most of its HTML was actually embedded JSON code, presented this time as a data-json attribute on a div. The largest property within that JSON was amusingly named body_html.
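
The layering described here (JSON containing an HTML fragment, whose data-json attribute contains JSON, which contains HTML again) can be unwrapped step by step. The sketch below builds a tiny hypothetical payload with the same shape and then peels it apart. Only the names judgmeWidget, data-json, and body_html come from the example; everything else is invented.

```python
import html
import json
from html.parser import HTMLParser


class DataJsonExtractor(HTMLParser):
    """Collect the values of data-json attributes found in an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.payloads: list[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-json" and value is not None:
                # HTMLParser has already unescaped entities in the value.
                self.payloads.append(value)


# Build a hypothetical hydration state with the same layering:
# JSON -> HTML fragment -> data-json attribute -> JSON -> HTML.
inner = {"body_html": "<p>Great product</p>"}
widget = f'<div class="widget" data-json="{html.escape(json.dumps(inner))}"></div>'
state_json = json.dumps({"products": [{"judgmeWidget": widget}]})

# Unwrap the layers again and report the size at each level.
state = json.loads(state_json)
fragment = state["products"][0]["judgmeWidget"]
extractor = DataJsonExtractor()
extractor.feed(fragment)
recovered = json.loads(extractor.payloads[0])

print(len(state_json), len(fragment), len(extractor.payloads[0]), len(recovered["body_html"]))
```

Each layer of wrapping adds its own escaping (backslashes in JSON, entities in HTML attributes), so the same content takes up more space the deeper it is nested.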

Other Causes of Large HTML

Several other examples observed during research include:

  • A 4-megabyte inline script.
  • Unexpected metadata originating from Figma.
  • A mega-menu featuring over 7,000 items and 1,300 inline SVGs.
  • Responsive images configured with 180 supported sizes.

Some large websites still do not apply GZIP or Brotli compression to their HTML, resulting in significant transfer sizes despite having relatively little code. Another common frustration is encountering a 53-kilobyte NREUM script: many websites embed New Relic’s end-user monitoring script directly into the document’s <head>. This is unfortunate, because a tool that measures user experience should avoid degrading it.

How Does HTML Size Impact Page Speed?

HTML code must be downloaded and parsed as part of the page load process. The longer this takes, the longer visitors must wait for content to appear. Browsers also assign a high priority to HTML content, assuming all of it is essential. This can mean that non-critical hydration state is downloaded before render-blocking stylesheets and JavaScript files are loaded.

This pattern shows up in request waterfalls from website speed tests: while the browser may be aware of other files early on, all available bandwidth is consumed by the document itself.

Embedding images or fonts directly in the HTML also prevents these files from being cached and reused across pages. Instead, they must be re-downloaded for every page load on the website.

Time spent parsing HTML is also a concern. On a MacBook, parsing one megabyte of HTML code takes about 6 milliseconds. In contrast, a low-end phone used for testing takes approximately 80 milliseconds per megabyte. For very large documents, CPU processing thus becomes a significant factor.
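
To put the per-megabyte figures in perspective, here is a back-of-the-envelope estimate for the 20.5-megabyte document from the inline CSS example. It assumes parse time scales linearly with uncompressed size and treats the whole document as HTML, which is a simplification.

```python
# Per-megabyte parse costs quoted above.
MACBOOK_MS_PER_MB = 6
LOW_END_PHONE_MS_PER_MB = 80


def estimated_parse_ms(html_megabytes: float, ms_per_mb: float) -> float:
    """Linear estimate of parse time; assumes cost scales with document size."""
    return html_megabytes * ms_per_mb


size_mb = 20.5  # uncompressed size of the inline CSS example
print(f"MacBook: {estimated_parse_ms(size_mb, MACBOOK_MS_PER_MB):.0f} ms")
print(f"Low-end phone: {estimated_parse_ms(size_mb, LOW_END_PHONE_MS_PER_MB):.0f} ms")
```

Under these assumptions the estimate is about 123 milliseconds on the MacBook and about 1,640 milliseconds on the low-end phone.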

Websites with Large HTML Can Still Be Fast

While there may be a strong focus on HTML size, it's important to evaluate whether it truly poses a problem for many real visitors. Most visitors today likely have reasonably fast connections and devices, and often other web performance issues (such as the actual execution of JavaScript application code that uses hydration state) prove to be more pressing.

Pages also do not need to download the full HTML document before they can begin rendering. The document and important stylesheets can load in parallel, allowing main content to render before the document is fully loaded.

Real visitor data from Google’s Chrome User Experience Report (CrUX) indicates that such websites often render in under 2 seconds, even on mobile devices.

Conclusion and Recommendations

Despite potential mitigating factors, a large document undoubtedly slows a page down. One indicator of this is when the Largest Contentful Paint (LCP) image does not appear immediately after the image itself has loaded. CrUX might report a render delay, suggesting that render-blocking stylesheets, which compete with other resources on the main server, are loading more slowly than images fetched from a different server.

It is highly recommended to periodically review your website's HTML to understand its contents. Often, quick, high-impact fixes can be implemented. While inlining images in HTML or CSS is frequently intended as a performance optimization, a convenient setup can inadvertently lead to an accumulation of embedded files without proper oversight. Consider implementing guardrails within your Continuous Integration (CI) build processes to detect and prevent unintended jumps in file size.
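
As one possible shape for such a guardrail, the sketch below checks built HTML files against a byte budget and returns a non-zero status when any file exceeds it. The script name `check_html_size.py`, its arguments, and the function names are hypothetical. Adapt the budget and file list to your own build.

```python
from pathlib import Path


def check_html_budget(paths: list[Path], max_bytes: int) -> list[str]:
    """Return one message per file that exceeds the size budget."""
    failures = []
    for path in paths:
        size = path.stat().st_size
        if size > max_bytes:
            failures.append(f"{path}: {size} bytes exceeds budget of {max_bytes} bytes")
    return failures


def main(argv: list[str]) -> int:
    """Hypothetical CLI: check_html_size.py <max_bytes> <file> [<file> ...]"""
    if len(argv) < 3:
        print("usage: check_html_size.py <max_bytes> <file> [<file> ...]")
        return 2
    max_bytes = int(argv[1])
    failures = check_html_budget([Path(p) for p in argv[2:]], max_bytes)
    for message in failures:
        print(message)
    # A non-zero status makes the CI step fail.
    return 1 if failures else 0
```

In a CI job, the script's entry point would call `sys.exit(main(sys.argv))` so that an oversized file fails the build. Measuring the uncompressed size is a deliberate choice in this sketch, since both parse time and embedded assets scale with it.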