Scaling ArgoCD Diffs for Large-Scale Kubernetes Management
Discover how monday.com manages Kubernetes cluster state with GitOps and ArgoCD, focusing on a custom diffing mechanism for reviewing changes at scale, addressing complexities of hierarchical overlays.

At monday.com, we use GitOps and ArgoCD to manage the state of our Kubernetes clusters, and this post focuses on how we review changes to that state. With GitOps, the desired cluster state is defined declaratively in a Git repository, and ArgoCD continuously fetches that state and applies it to the target clusters. Our state sources are primarily Helm charts with associated configuration values. To stay DRY ('Don't Repeat Yourself'), we organize those values as a hierarchy of configuration files (overlays), which gives us granular control over resources at various levels, such as region-specific or environment-wide.
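To make the overlay idea concrete, here is a minimal sketch of merging value layers from least to most specific. The layer names and values are hypothetical, and the merge rule (maps merged recursively, everything else replaced) only approximates how Helm combines multiple values files; it is not our actual tooling.

```python
from copy import deepcopy


def merge_overlays(*layers: dict) -> dict:
    """Merge value layers in order; later (more specific) layers win."""
    result: dict = {}
    for layer in layers:
        _merge_into(result, layer)
    return result


def _merge_into(base: dict, overlay: dict) -> None:
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            # Both sides are maps: merge key by key.
            _merge_into(base[key], value)
        else:
            # Scalars, lists, and new keys replace whatever was there.
            base[key] = deepcopy(value)


# Hypothetical layers, from least to most specific.
global_values = {"replicas": 2, "resources": {"cpu": "500m", "memory": "512Mi"}}
env_values = {"resources": {"memory": "1Gi"}}
region_values = {"replicas": 4}

merged = merge_overlays(global_values, env_values, region_values)
print(merged)
```

Even in this three-layer toy case, the final value of each key depends on which layer touched it last, which is exactly what becomes hard to see by eye as the hierarchy grows.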
While hierarchical overlays offer a flexible abstraction for scaling resource modifications and adhering to DRY principles, they introduce several challenges:
- Large blast radius: Modifying less specific configuration values can impact numerous resources across multiple environments.
- Complexity: It becomes difficult to visually determine the merged result of overlays and, consequently, the final applied manifest.
- Onboarding hurdles: New developers often lack confidence in the changes being introduced due to this complexity.
To mitigate these issues, many critical development paths in our GitOps repository are now automated. Bots from continuous deployment (CD) pipelines or developer portal backends manage state changes, with developers interacting through user-friendly UIs. These automated changes are smaller, less error-prone, and incorporate validation checks. However, for paths still requiring manual intervention, we recognized the need for a robust diffing mechanism to clearly understand the final applied state.

A crucial question arose: should we adopt the 'rendered manifests' pattern? This approach involves pre-rendering all Helm charts, storing the output directly in the Git repository, and letting ArgoCD sync these explicit manifests. This would address the root cause of the complexity, reduce ArgoCD's load, and make changes more transparent, because what is in Git matches what is applied to the Kubernetes cluster.
Ultimately, we opted against this path, primarily because of the substantial migration effort required. Our existing structure is deeply integrated with extensive tooling and automation, and migrating a large number of critical applications to new source definitions posed a significant incident risk. Furthermore, rendered manifests alone would not have solved the review problem: standard pull request UIs proved inadequate for browsing diffs at our scale. With numerous resource changes across many clusters, we needed a more sophisticated diff view capable of grouping changes by cluster, by environment, or even by similarity (diff hash).
Given that manual GitOps state changes are initiated via pull requests, our ideal workflow involved opening a PR and reviewing the diff directly within its context. Our chosen solution involves rendering manifests on-the-fly for both the target and head branches within our CI system. These rendered manifests are then compared to generate a diff artifact, which is displayed in a dedicated UI. A link to this UI is automatically posted as a comment on the pull request thread, allowing users to access and approve the diff. The solution comprises three core components:
- Render CLI
- Backend for storing and approving diff artifacts
- UI for diff browsing

A critical aspect of our solution is the manifest rendering process, which adheres to several key constraints:
- High Accuracy: The rendering process must account for Kubernetes cluster versions and API capabilities.
- Efficient Rendering: Rendering times need to be reasonable.
- Minimal Load: It should not impose additional burden on existing ArgoCD instances.
- Local Testing: Developers require the ability to test Helm chart template changes by locally overriding fetching logic to use local charts instead of chart museums or VCS.
- Custom CRD Support: The system must support custom resource definitions.
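The local testing constraint can be pictured as a lookup that runs before any fetch. The sketch below is an illustration only: the function name, the override map, and the URLs are all hypothetical, and the real CLI may resolve sources differently.

```python
def resolve_source(repo_url: str, overrides: dict[str, str]) -> tuple[str, str]:
    """Return ("local", path) if the source is overridden, else ("remote", url)."""

    def normalize(url: str) -> str:
        # Treat "repo", "repo/" and "repo.git" as the same source.
        return url.rstrip("/").removesuffix(".git")

    wanted = normalize(repo_url)
    for source_url, local_path in overrides.items():
        if normalize(source_url) == wanted:
            return ("local", local_path)
    return ("remote", repo_url)


# Hypothetical override: render a chart from a working copy instead of VCS.
overrides = {"https://example.com/charts/web": "/home/dev/charts/web"}

print(resolve_source("https://example.com/charts/web.git", overrides))
print(resolve_source("https://example.com/charts/db.git", overrides))
```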
We determined that the diff should compare the current desired state against the new desired state, rather than against the live state (which would involve commands like argocd app diff). Comparing against live state would require runtime access to numerous ArgoCD instances, leading to unpredictable results, potential network errors, issues with broken or syncing applications, and excessively long render times, especially when dealing with hundreds of applications per pull request. Addressing drift between live and desired states falls under a separate set of tools and processes.
Initially, we attempted to spin up an ArgoCD instance with Kind (Kubernetes-in-Docker) within our CI system for rendering. However, this approach proved too slow and complex for implementing local overrides, as it required on-the-fly modification of source URLs. Given our applications' numerous sources at various revisions across diverse clusters with differing versions and capabilities, we ultimately decided to develop a custom rendering tool built upon the core helm template command, aiming to closely mimic ArgoCD's behavior.
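As a rough illustration of what rendering per cluster involves, the sketch below assembles a helm template invocation that carries a cluster's Kubernetes version and API capabilities. The flags used (--namespace, --kube-version, --api-versions, --values) are standard helm template options; the ClusterInfo type, the helper name, and all example values are assumptions for illustration, not our tool's actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterInfo:
    """Hypothetical record of what was pre-fetched for a target cluster."""
    kube_version: str
    api_versions: list[str] = field(default_factory=list)


def helm_template_args(release: str, chart_path: str, namespace: str,
                       values_files: list[str], cluster: ClusterInfo) -> list[str]:
    """Build the argv for `helm template` without executing it."""
    args = ["helm", "template", release, chart_path,
            "--namespace", namespace,
            "--kube-version", cluster.kube_version]
    # Each capability becomes its own flag so that chart conditionals on
    # .Capabilities.APIVersions evaluate as they would for the real cluster.
    for api_version in cluster.api_versions:
        args += ["--api-versions", api_version]
    # Order matters: later values files override earlier ones.
    for values_file in values_files:
        args += ["--values", values_file]
    return args


cluster = ClusterInfo("1.29.0", ["monitoring.coreos.com/v1"])
argv = helm_template_args("web", "./charts/web", "prod",
                          ["values/global.yaml", "values/prod.yaml"], cluster)
print(" ".join(argv))
```

The resulting list could be passed to subprocess.run; keeping command construction separate from execution makes the rendering inputs easy to log and test.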

The core algorithm compares two checkouts of the same GitOps repository, one at the head branch and one at the target branch. Manifests are rendered for both: ArgoCD application manifests are placed in a rendering queue, and processing continues for as long as the queue is not empty. Once the queue is empty, a diff artifact is generated by comparing resources individually, excluding any fields listed in an application's 'ignore differences' section. The tool caches VCS repositories extensively, pre-fetches Kubernetes cluster versions and capabilities from live ArgoCD instances, and includes a mechanism for overriding VCS and chart museum sources with local content.
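The loop described above can be sketched as follows. This is a simplified illustration under several assumptions: render_app is a hypothetical callable standing in for the real renderer, rendered Application resources are assumed to be fed back into the queue, and ignored fields are given as a flat list of JSON pointers over object keys rather than scoped per resource kind as ArgoCD allows.

```python
from collections import deque
from copy import deepcopy


def resource_key(manifest: dict) -> tuple:
    meta = manifest.get("metadata", {})
    return (manifest.get("apiVersion", ""), manifest.get("kind", ""),
            meta.get("namespace", ""), meta.get("name", ""))


def render_all(root_apps: list, render_app) -> dict:
    """Render applications until the queue is empty; returns key -> manifest."""
    queue, seen, rendered = deque(root_apps), set(), {}
    while queue:
        app = queue.popleft()
        if resource_key(app) in seen:
            continue  # Never render the same application twice.
        seen.add(resource_key(app))
        for manifest in render_app(app):
            rendered[resource_key(manifest)] = manifest
            if manifest.get("kind") == "Application":
                queue.append(manifest)
    return rendered


def strip_ignored(manifest, pointers):
    """Copy a manifest and drop fields addressed by JSON pointers (keys only)."""
    if manifest is None:
        return None
    result = deepcopy(manifest)
    for pointer in pointers:
        parts = [p.replace("~1", "/").replace("~0", "~")
                 for p in pointer.split("/")[1:]]
        if not parts:
            continue
        node = result
        for part in parts[:-1]:
            node = node.get(part) if isinstance(node, dict) else None
        if isinstance(node, dict):
            node.pop(parts[-1], None)
    return result


def diff_resources(target: dict, head: dict, ignore_pointers=()) -> dict:
    """Compare resources one by one; returns key -> added/removed/modified."""
    changes = {}
    for key in sorted(set(target) | set(head)):
        old = strip_ignored(target.get(key), ignore_pointers)
        new = strip_ignored(head.get(key), ignore_pointers)
        if old != new:
            changes[key] = ("added" if old is None
                            else "removed" if new is None else "modified")
    return changes


# Hypothetical fixtures: a root application that renders a child application.
def app(name):
    return {"apiVersion": "argoproj.io/v1alpha1", "kind": "Application",
            "metadata": {"name": name, "namespace": "argocd"}}


def deployment(replicas, team):
    return {"apiVersion": "apps/v1", "kind": "Deployment",
            "metadata": {"name": "web", "namespace": "prod",
                         "labels": {"team": team}},
            "spec": {"replicas": replicas}}


def renderer(child_output):
    outputs = {"root": [app("child")], "child": child_output}
    return lambda a: outputs[a["metadata"]["name"]]


target = render_all([app("root")], renderer([deployment(2, "core")]))
head = render_all([app("root")], renderer([deployment(5, "core")]))
print(diff_resources(target, head))
print(diff_resources(target, head, ["/spec/replicas"]))
```

In the example, the only change between branches is the replica count, so the resource shows up as modified in the plain comparison and disappears from the diff once /spec/replicas is ignored.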
Upon generation, the diff artifact is uploaded to a dedicated diff service backend. The pull request author is then redirected to a specialized frontend UI via a link in the PR comment. This UI offers robust capabilities for searching, sorting, and grouping resources. The 'group by' functionality is particularly powerful for large-scale operations, allowing users to:
- Group by environment label, ensuring modifications are restricted to specific environments (e.g., pre-production).
- Group by cluster, verifying that changes only affect intended clusters.
- Group by diff hash, confirming that a specific label change is the sole modification across all affected resources.
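One way to picture grouping by diff hash: reduce each resource's change to a set of (path, old value, new value) entries and hash that set, so that resources with different names but identical changes land in the same group. The sketch below is an assumption about how such a hash could be computed, not a description of our backend; the function names and sample resources are hypothetical, and empty maps or lists are ignored for brevity.

```python
import hashlib
import json
from collections import defaultdict

MISSING = "<missing>"  # Placeholder for a path absent on one side.


def flatten(obj, prefix=""):
    """Yield (path, scalar) pairs for every leaf of a nested structure."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}/{key}")
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from flatten(value, f"{prefix}/{index}")
    else:
        yield prefix, obj


def change_set(old: dict, new: dict) -> list:
    a, b = dict(flatten(old)), dict(flatten(new))
    return [(path, a.get(path, MISSING), b.get(path, MISSING))
            for path in sorted(set(a) | set(b))
            if a.get(path, MISSING) != b.get(path, MISSING)]


def diff_hash(old: dict, new: dict) -> str:
    # Only changed paths are hashed, so unchanged fields such as the
    # resource name do not influence the result.
    payload = json.dumps(change_set(old, new))
    return hashlib.sha256(payload.encode()).hexdigest()[:12]


def group_by_diff_hash(pairs: dict) -> dict:
    groups = defaultdict(list)
    for name, (old, new) in pairs.items():
        groups[diff_hash(old, new)].append(name)
    return dict(groups)


def resource(name, team, replicas=2):
    return {"metadata": {"name": name, "labels": {"team": team}},
            "spec": {"replicas": replicas}}


# Hypothetical change: a label rename, plus one accidental replica bump.
pairs = {
    "cluster-a/web": (resource("web-a", "core"), resource("web-a", "platform")),
    "cluster-b/web": (resource("web-b", "core"), resource("web-b", "platform")),
    "cluster-c/web": (resource("web-c", "core"), resource("web-c", "platform", 3)),
}
groups = group_by_diff_hash(pairs)
for digest, names in groups.items():
    print(digest, names)
```

Here the first two resources share a hash because they differ only by the label change, while the third forms its own group and immediately stands out as carrying an extra modification.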

The UI supports grouping based on any label defined on a resource, as well as common Kubernetes resource properties such as Kind, ApiVersion, or Name.
Our solution delivers fast and accurate results, leveraging actual Kubernetes cluster capabilities and mimicking ArgoCD's rendering algorithm, which significantly enhances the review experience compared to standard pull request UIs. We extensively utilize local overrides to assess the impact of custom Helm chart changes on existing applications. This tool has positively contributed to:
- Lower Incident Risk: Achieved through improved change visibility.
- Enhanced Productivity: By enabling 'shift-left' validation of templating logic; if the diff artifact cannot be rendered, the change is blocked.
- Faster Onboarding: New users can focus on understanding Kubernetes resources rather than the complexities of hierarchical abstraction.
However, custom tools inherently come with trade-offs. This solution is vulnerable to changes in ArgoCD's core rendering logic, which could lead to discrepancies in diff results. Future mitigation efforts may involve directly extracting ArgoCD modules responsible for rendering, rather than reimplementing them externally.