The Unstable Path of Grafana: Why I Can't Recommend It Anymore


An engineer details growing frustration with Grafana's rapid, often breaking changes across its product suite (Agent, Mimir, OnCall), leading to instability and complexity.

Published: November 14, 2025 · Read time: 6 minutes


Disclaimer: This article outlines my personal experiences with Grafana products, supplemented with some factual observations. While individual experiences may vary, I welcome your perspectives.

My career began at a small software company affiliated with my university. We developed and operated websites and web services for numerous clients. With multiple responsibilities falling on each team member, the company heavily relied on interns and new graduates – a situation that presented both challenges and opportunities.

For me, it was largely beneficial, providing a steep learning curve.

At one point, we identified a need for a modern monitoring solution. Traditional tools like Zabbix felt out of place in the new, declarative world of containers and Docker. I was tasked with finding a suitable alternative. My research led me to consider Loki/Prometheus with Grafana, and Elastic with Kibana. Elastic, however, proved to be an overwhelming beast – heavy, difficult to run, resource-intensive, and overly complex. In contrast, Loki and Prometheus were an ideal fit at the time.

I quickly set up a docker-compose.yaml file encompassing Loki, Prometheus, and Grafana. Operating within an internal Docker network, these services required no authentication between them; Grafana itself was only exposed via an SSH tunnel. With a single static scrape configuration and the Loki Docker logging driver plugin, our observability stack was operational. For logs originating outside Docker containers, we leveraged Promtail.

Loki and Prometheus resided on the same machine, requiring only a local volume mount, and maintaining minimal load.
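A minimal sketch of what that compose file looked like. All image tags, ports, and paths here are illustrative, not the original file:

```yaml
# Illustrative only: service names, versions, and paths are assumptions.
services:
  prometheus:
    image: prom/prometheus:v2.45.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml  # single static scrape config
      - prom-data:/prometheus                            # local volume, same machine

  loki:
    image: grafana/loki:2.9.0
    volumes:
      - loki-data:/loki

  grafana:
    image: grafana/grafana:10.1.0
    ports:
      - "127.0.0.1:3000:3000"  # bound to localhost, reached via an SSH tunnel

volumes:
  prom-data:
  loki-data:
```

With everything on one Docker network, the services talk to each other by service name and nothing needs to be exposed publicly.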

ℹ️ A crucial lesson learned here was to avoid turning every log parameter into a label purely for easier selection in the Grafana UI. A label like 'latency', with a virtually unbounded range of values, creates a separate stream per value; each stream gets its own chunk files, so Loki's Cortex-derived storage can rapidly exhaust disk inodes.
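In Promtail terms, the fix is to extract such fields without promoting them to labels. A hedged sketch (the field names are hypothetical):

```yaml
# Hypothetical Promtail pipeline: field names are illustrative.
pipeline_stages:
  - json:
      expressions:
        level: level      # e.g. "info", "error" - only a handful of values
        latency: latency  # unbounded numeric value
  - labels:
      level:              # low cardinality: safe to index as a label
  # 'latency' is deliberately NOT listed under labels; filter on it at
  # query time with LogQL instead of indexing it as a stream label.
```

Low-cardinality fields like a log level are fine as labels; anything unbounded belongs in the log line itself.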

I also discovered Grafana Labs' cloud offering, which included a generous free tier. I even utilized this for personal projects and had a consistently positive experience with their services.

As time progressed, I transitioned to a new role, where Kubernetes became our primary orchestration platform. The Prometheus container, now migrating across nodes, highlighted challenges with roaming storage, particularly as our workload significantly increased. We also faced a requirement for long-term data storage, specifically 13 months. This led me to explore solutions like Thanos and Mimir.

Given my prior positive interactions with Grafana products, I opted for Mimir. Its basis in Cortex, similar to Loki, suggested a smooth integration. At this point, our reliance on Prometheus diminished, as we primarily used its remote_write functionality. Grafana offered a solution: the Grafana Agent, a single binary capable of shipping both logs and metrics to a remote destination. This seemed like an obvious choice.
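A static-mode Grafana Agent configuration of that era looked roughly like this. The endpoints and names are placeholders, and the exact schema varied between versions:

```yaml
# Sketch of a static-mode Grafana Agent config; URLs are placeholders.
metrics:
  configs:
    - name: default
      scrape_configs:
        - job_name: node
          static_configs:
            - targets: ["localhost:9100"]
      remote_write:
        - url: https://mimir.example.internal/api/v1/push

logs:
  configs:
    - name: default
      clients:
        - url: https://loki.example.internal/loki/api/v1/push
```

One binary, one file, both signals shipped remotely: exactly the appeal described above.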

However, as time went on, Grafana evolved the Grafana Agent setup into Grafana Agent Flow Mode. While requiring some adjustments, this was understandable – software naturally changes. Yet, Grafana demonstrated a consistent inclination for change.

They embarked on building their own comprehensive observability platform, seemingly aiming to attract customers from competitors like DataDog. This involved creating Grafana OnCall, their proprietary notification system. Beyond that, they made substantial investments in Helm charts and general starter templates. The promise was alluring: install metric/log shippers and integrate with Grafana Cloud in just two commands. Even for users unable or unwilling to use Grafana Cloud, Helm charts were provided for deploying Mimir, Loki, and Tempo. To further simplify, an 'umbrella chart' was introduced (which, in its default state, rendered to 6,000 lines of configuration). Alternatively, the Grafana Operator was offered to manage Grafana installations, or at least components of them.

As many in the software industry experience, maintenance challenges tend to emerge with age and rapid evolution.

Grafana OnCall was deprecated. The Grafana Agent and its successor, Agent Flow, were deprecated within 2-3 years of their inception. Some of the once 'easy-to-use' Helm charts are no longer maintained. Furthermore, Grafana deprecated Angular-based plugin support in its own product in favor of React, which broke many existing dashboards that relied on Angular panels.

On the very same day the Grafana Agent was deprecated, Grafana Alloy was announced – billed as the all-in-one replacement. It promised support for logs, metrics, traces (Zipkin & Jaeger), and OpenTelemetry (OTEL) – truly a universal solution!

Grafana Alloy's launch was somewhat turbulent, characterized by initial bugs. However, it gradually matured. Predictably, the Alloy Operator also made its debut.

ℹ️ A noteworthy decision was their choice to use a custom configuration language for Alloy, resembling HCL. While I understand the rationale for moving away from YAML, I remain unconvinced. Not every component necessitates its own Domain Specific Language (DSL).
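For a sense of that syntax, a minimal Alloy pipeline reads like this (the component labels and the endpoint URL are placeholders):

```hcl
// Minimal Alloy configuration sketch; names and URL are placeholders.
prometheus.scrape "default" {
  targets    = [{ "__address__" = "localhost:9100" }]
  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.remote_write "mimir" {
  endpoint {
    url = "https://mimir.example.internal/api/v1/push"
  }
}
```

Components are wired together by referencing each other's exports, which is elegant, but it is still one more bespoke syntax to learn, lint, and template.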

This sounds like a happy ending, right? Not quite.

The 'all-in-one' solution doesn't universally support everything. While Grafana was busy constructing its monitoring empire, the kube-prometheus community continued its organic, steady development. The Prometheus Operator, with its ServiceMonitor and PodMonitor Custom Resource Definitions (CRDs), became the de facto standard in Kubernetes. Alloy does support parts of the monitoring.coreos.com API group: it natively consumes ServiceMonitor and PodMonitor objects. PrometheusRules, however, require additional configuration, and AlertmanagerConfig – which would ideally be handled by Mimir – is not supported at all. This is partly because Mimir bundles its own Alertmanager, with version differences and minor incompatibilities.
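For context, the ServiceMonitor CRD that became the de facto standard looks like this (names, labels, and the port are illustrative):

```yaml
# Illustrative ServiceMonitor: names, labels, and port are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-app        # matches the Service to scrape
  endpoints:
    - port: metrics      # named port on the Service
      interval: 30s
```

Because so many Helm charts ship objects like this out of the box, any collector that wants to replace Prometheus in a cluster effectively has to speak this API.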

Despite these hurdles, I managed to get everything working. I could finally stop justifying yearly monitoring stack restructures to my management.

Then, Grafana released Mimir 3.0. This version features a re-architected ingestion logic for enhanced scalability and now fundamentally requires Apache Kafka to function.

Individually, none of these changes would be a decisive reason to abandon Grafana's products. Set aside, too, that they've made it increasingly difficult to locate the ingestion endpoints for Grafana Cloud, seemingly to push users towards their new fleet-config management service. It is the cumulative effect of these rapid, often breaking changes that makes me hesitant to recommend Grafana's offerings.

I simply cannot anticipate what will change next.

For my monitoring infrastructure, I crave stability; I want it to be 'boring.' That is precisely what Grafana is failing to provide. The pace inside Grafana Labs appears too fast for many organizations, and I suspect it is partly driven by career-oriented development. Grafana employs brilliant people, but not every customer has the technical capacity, or the desire, to prioritize constant monitoring-stack adjustments. As we've seen repeatedly, complexity kills projects.

ℹ️ To be clear, Mimir, Loki, and Grafana are technically excellent software products, and I generally still appreciate them. My reservations stem from the way these products are managed and continuously altered.

Occasionally, I wonder how my perspective might differ had I chosen the ELK stack at my first job. I also wonder whether the OpenShift approach – kube-prometheus-stack combined with Thanos for long-term storage – is in fact the most time-stable solution. My ultimate hope is that OpenTelemetry (OTEL) will soon stabilize, become 'boring,' and let me choose whatever backend I prefer. Because, frankly, I'm exhausted by monitoring. My priority is to support our applications, not to revisit the monitoring setup every few weeks. Monitoring is a necessity, not the core product – at least for most companies.