Observability Cost Control: Cardinality, Rollups, and What Actually Works

Published on May 6, 2026
While this article focuses primarily on metrics cardinality and rollups, where cost dynamics are most acute, cost control in practice spans logs, traces, and query activity as well. Kloudfuse provides rate control and governance across all telemetry streams, not just metrics.
The Bill Nobody Expected
Every observability cost conversation starts the same way. Engineering leadership looks at the invoice, asks why it grew 40 percent in one quarter, and nobody can explain which teams, services, or decisions drove the increase. The platform team suspects cardinality. The vendor says the contract is usage-based. And the application teams who generated the telemetry have no visibility into what they are consuming.
This is not a pricing problem. It is a measurement problem. And most observability platforms are structurally unable to solve it, because they separate the act of generating telemetry from the act of paying for it.
Observability costs are driven by three things: the volume of data ingested, the cardinality of metrics generated, and the frequency and volume at which data is queried. Most cost discussions focus on the first — how many GBs of logs, how many spans, how many data points. But the second and third are where costs compound in ways that are difficult to predict and harder to control.
This problem is about to get worse. AI-driven SRE workflows (MCP servers, agentic troubleshooting bots, copilot-style interfaces) are changing what gets queried and how. Traditional dashboards and alerts run canned queries against known metrics with known label combinations. An AI agent investigating a latency spike runs exploratory, ad hoc queries across whatever metrics and labels exist, because it does not know in advance which series will be relevant to the investigation. That fundamentally changes the cost-control calculus.
Vendor approaches that manage cost by controlling what gets indexed and what gets dropped, by selecting which tags to keep at ingestion time, assume someone knows in advance which labels matter. In a world where AI agents explore your telemetry dynamically during incident response, that assumption breaks down. You cannot drop the label that an AI agent will need tomorrow during a production incident you have not had yet. The shift from dashboard-driven to agent-driven observability makes cardinality governance and multi-resolution rollups more important, not less, because the query patterns are no longer predictable.
Cardinality: The Cost Driver Nobody Manages
Cardinality is the number of unique time series generated by a metric. A metric named http_request_duration with labels for method, status_code, and endpoint might generate 50 time series. Add a label for user_id or request_id, and it generates 50,000. Add pod_name in a Kubernetes environment with rolling deployments, and it generates a new time series every time a pod restarts.
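A back-of-envelope way to see how quickly this compounds: the worst-case series count is the product of the number of distinct values for each label. The sketch below illustrates the arithmetic behind the numbers above; the label-value counts are illustrative assumptions, not measurements from any particular system.

```python
# Back-of-envelope cardinality: roughly one time series per unique label combination.
# Worst case is the product of distinct values per label; real counts are lower when
# not every combination actually occurs. Label counts below are illustrative assumptions.
def estimated_series(label_value_counts: dict[str, int]) -> int:
    total = 1
    for distinct_values in label_value_counts.values():
        total *= distinct_values
    return total

base = {"method": 2, "status_code": 5, "endpoint": 5}   # ~50 series
with_user_id = {**base, "user_id": 1_000}               # ~50,000 series

print(estimated_series(base))          # 50
print(estimated_series(with_user_id))  # 50000
```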
High cardinality is not inherently bad. Some analytical use cases require it. The problem is that most teams do not know they are generating it. An engineer adds a label to improve debugging, does not realize it creates 10,000 new time series, and the cost does not surface until the next billing cycle.
Here is how the major vendors handle this:
Datadog's Metrics without Limits decouples ingested and indexed custom metrics. Per Datadog's billing documentation, ingested custom metrics exceeding the account allotment are billed at a per-100-metric rate, while indexed custom metric pricing depends on the customer's contract terms. The architecture gives teams more control over which tags are queryable, but it also requires ongoing tag-configuration discipline — teams must decide at ingestion time which labels to index, and that decision directly affects both cost and query capability.
Grafana's Adaptive Metrics identifies underused telemetry and reduces metrics cost by aggregating or dropping low-value series. The approach works for telemetry optimization — reducing what you store when you know what you do not need. But it is still a telemetry-selection approach: it assumes someone (or something) can determine in advance which metrics and labels are low-value. In an environment where AI agents run exploratory queries during incident response, that assumption is harder to make confidently.
Chronosphere treats cardinality as a governance problem. Its Control Plane and Metrics Quotas let platform teams allocate and limit persisted metric writes across pools that can map to services or teams. That is a strong framing for metrics governance. The quota model gives platform teams direct control over who can write how much — which is closer to the kind of operational control this article argues teams need. The limitation is that it remains metrics-centric rather than a unified cost-governance model across all telemetry streams.
Elastic's 2026 observability research found that 96% of surveyed organizations are actively taking steps to reduce observability cost, with 51% consolidating existing toolsets (Source: Elastic, 2026 Global Observability Trends). The finding underscores how widespread cost pressure has become — and how most responses focus on toolset consolidation rather than platform-level cost governance.
Comparison: Cost Control Approaches
| Capability | Datadog | Grafana | Chronosphere | Kloudfuse |
| --- | --- | --- | --- | --- |
| Cardinality visibility | Tag-level config | ML recommendations | Per-team quotas | Real-time explorer |
| Prevention vs. cleanup | After the fact | After the fact | Preventive quotas | Rate control at ingestion |
| Multi-resolution rollups | Metrics without Limits (tag selection) | Adaptive aggregation | Smart aggregation | Automatic resolution selection |
| Per-team cost attribution | Limited | Global only | Yes | Built-in dashboards |
| Cross-stream coverage | Per-product pricing | Metrics + Logs + Traces + Profiles | Metrics-focused | All 5 streams unified |
| Deployment model | Hosted SaaS | Hosted SaaS | Hosted SaaS | Self-SaaS (your VPC) |
Rollups: The Feature Everyone Has, Nobody Uses Well
Every time-series database supports some form of downsampling or rollup. Store raw data for recent queries, aggregate into lower-resolution data for longer time ranges. The concept is simple. The implementation is where vendors diverge.
Most platforms offer a single rollup tier, configured globally. Raw data for 15 days, 5-minute aggregates after that. This is a blunt instrument. A platform team managing 20 application teams has 20 different retention and resolution requirements. Some teams need raw data for 7 days. Others need 30. Some metrics require 1-minute resolution for real-time alerting. Others are fine at 1-hour resolution for capacity planning.
Kloudfuse supports multi-rollup resolution with automatic query selection. Metrics are stored at multiple resolutions simultaneously: raw, 5-minute, 1-hour, and custom rollups configurable by stream or label. When a user queries a 24-hour time range, the query engine automatically selects 5-minute rollups. A 7-day range uses hourly rollups. A 15-minute investigation uses raw data. The user does not choose. The platform optimizes based on the query.
This matters for cost because query cost is a function of the amount of data scanned. With raw data at 10-second resolution, each hourly rollup point stands in for 360 raw samples, so a 30-day dashboard query that scans raw data is roughly 360 times more expensive in compute than the same query against hourly rollups. Most platforms force this tradeoff onto the user or the platform team. Kloudfuse makes it automatic.
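To make the resolution-selection behavior concrete, here is a minimal sketch of how a query engine might pick a rollup tier from the query's time range. The tiers mirror the ones described above; the point budget and selection rule are illustrative assumptions, not Kloudfuse's actual implementation.

```python
from datetime import timedelta

# Available resolutions, finest first: (name, step).
ROLLUPS = [
    ("raw", timedelta(seconds=10)),
    ("5m",  timedelta(minutes=5)),
    ("1h",  timedelta(hours=1)),
]

def pick_rollup(time_range: timedelta, max_points: int = 1_500) -> str:
    """Choose the finest resolution whose point count fits the scan budget."""
    for name, step in ROLLUPS:
        if time_range / step <= max_points:
            return name
    return ROLLUPS[-1][0]  # fall back to the coarsest tier

print(pick_rollup(timedelta(minutes=15)))  # raw
print(pick_rollup(timedelta(hours=24)))    # 5m
print(pick_rollup(timedelta(days=7)))      # 1h
```

The user writes the same query either way; only the tier it runs against changes, which is what keeps long-range dashboards from scanning raw data.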
Rate Control: Prevention, Not Cleanup
The most expensive observability cost is the one you did not see coming. An application team deploys a new service with high-cardinality labels. A log pipeline starts emitting debug-level messages in production. A trace instrumentation change captures every database query parameter. By the time anyone notices, the data is ingested, stored, indexed, and billed.
Cost cleanup is reactive. You find the waste, remove it, and negotiate a credit with the vendor if you are lucky. Cost prevention is architectural. You set boundaries at the ingestion layer so that cost surprises cannot happen.
Kloudfuse provides stream-level rate controls. Platform teams set per-stream ingestion limits for metrics, logs, traces, events, and RUM. Each stream type has independent thresholds. When an application team exceeds their allocation, the rate control activates: excess data is either dropped, sampled, or queued depending on the policy. The platform team gets notified. Other teams are unaffected.
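A minimal sketch of the per-stream policy model described above, assuming a simple fixed-window limit. The stream names and the drop/sample/queue actions follow the description; the policy shape, thresholds, and field names are illustrative, not Kloudfuse's configuration schema.

```python
import random
from collections import deque
from dataclasses import dataclass, field

@dataclass
class StreamPolicy:
    limit_per_window: int        # max items accepted per window
    on_excess: str = "drop"      # "drop", "sample", or "queue"
    sample_rate: float = 0.1     # applied only when on_excess == "sample"
    accepted: int = 0
    deferred: deque = field(default_factory=deque)

    def admit(self, item) -> bool:
        if self.accepted < self.limit_per_window:
            self.accepted += 1
            return True
        if self.on_excess == "sample" and random.random() < self.sample_rate:
            return True                 # keep a sampled fraction of the excess
        if self.on_excess == "queue":
            self.deferred.append(item)  # hold until the next window
        return False                    # over the limit: dropped or deferred

# Independent thresholds per stream; the numbers and actions are illustrative.
policies = {
    "metrics": StreamPolicy(limit_per_window=1_000_000, on_excess="drop"),
    "logs":    StreamPolicy(limit_per_window=500_000,   on_excess="sample"),
    "traces":  StreamPolicy(limit_per_window=200_000,   on_excess="queue"),
}

accepted = policies["logs"].admit({"msg": "debug noise"})
```

The point of the sketch is the isolation property: each stream carries its own counter and policy, so one team exceeding its logs allocation cannot consume another team's metrics budget.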
This is fundamentally different from what hosted vendors offer. Datadog does not provide per-team ingestion limits. Grafana's cost controls operate after ingestion, recommending what to remove. New Relic's Compute Capacity Unit (CCU) model explicitly ties platform cost to actions such as queries, alert evaluations, API calls, and page loads. That makes query behavior a direct input to the cost model, which is transparent, but it also means that exploratory AI-driven queries during incident response contribute to cost in ways that are harder to predict or budget. The CCU model creates cost awareness, but it does not by itself provide the platform-level rate control and per-team isolation that this article argues platform engineering teams need to govern observability economics proactively. That control has to happen at the ingestion layer, before the data is stored and billed.
The Cardinality Explorer: See It Before You Pay For It
Most cardinality problems are invisible until the bill arrives. An engineer adds a label. A Helm chart change introduces a new tag. A migration generates temporary high-cardinality data. The platform team finds out 30 days later.
Kloudfuse's Metrics Cardinality Explorer is a real-time analytical interface that shows cardinality by metric name, by label, by service, and by team. Platform teams can see exactly which labels are driving cardinality growth, identify which team deployed the change, and decide whether to keep the label, aggregate it, or drop it — before the next billing cycle.
This is not a recommendation engine. It does not use machine learning to suggest what you might want to remove. It is a diagnostic tool that gives platform teams the same visibility into their observability costs that they have into their cloud compute costs. Exact numbers. Exact attribution. Real-time.
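The underlying diagnostic is easy to illustrate: given the active series for a metric, count distinct values per label to see which label drives the growth. The sketch below runs on an in-memory list of label sets and is only an illustration of the idea, not the Explorer's implementation; the sample data is invented.

```python
from collections import defaultdict

# Each active series is represented as a dict of label -> value. Sample data is illustrative.
active_series = [
    {"method": "GET",  "endpoint": "/cart",     "pod_name": "web-7f9c-abcde"},
    {"method": "GET",  "endpoint": "/cart",     "pod_name": "web-7f9c-fghij"},
    {"method": "POST", "endpoint": "/checkout", "pod_name": "web-7f9c-abcde"},
]

def distinct_values_per_label(series: list[dict]) -> dict[str, int]:
    values = defaultdict(set)
    for labels in series:
        for label, value in labels.items():
            values[label].add(value)
    return {label: len(vals) for label, vals in values.items()}

# Labels with the most distinct values are the usual cardinality drivers.
for label, count in sorted(distinct_values_per_label(active_series).items(),
                           key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {count} distinct values")
```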
This matters more in an agent-driven observability environment. When AI agents run ad hoc exploratory queries, they interact with whatever label combinations exist in the data. A cardinality problem that was invisible behind canned dashboards becomes a cost and performance problem the moment an agent starts traversing high-cardinality series during incident investigation. The Explorer gives platform teams the visibility to identify and resolve these issues before they affect both human and AI-driven query performance.
The Self-SaaS Cost Advantage
There is a structural cost difference between running observability in a vendor’s cloud and running it in your own VPC. Hosted observability pricing includes not only the underlying compute, storage, and network required to process your data, but also the vendor’s platform and service layer on top of that infrastructure. For high-volume environments, that additional cost layer can materially change total cost of ownership, especially when it scales with data volume regardless of how much of that data is actually useful.
Kloudfuse runs in your VPC on your cloud infrastructure. You pay cloud provider rates for compute and storage, plus the Kloudfuse software license. For organizations processing more than a few terabytes per day, the total cost of ownership difference is significant. The exact number depends on volume, but the structural economics are clear: removing the vendor's infrastructure margin changes the cost equation fundamentally.
This also eliminates a second hidden cost: data transfer. Hosted observability vendors require you to send all telemetry data out of your VPC to their infrastructure, and at scale those egress charges are material. With Kloudfuse, data never leaves your VPC, which eliminates the recurring egress pattern of shipping observability data to a vendor-hosted platform. There is no per-GB charge for sending telemetry to a third party; the data moves within your infrastructure, governed by your own cloud networking costs.
A Cost Control Checklist for Platform Teams
Do you know your current cardinality by team and by service? If not, you are optimizing blind.
Can you set per-team ingestion limits? If not, one team can spike the bill for everyone.
Are your dashboards using rollups automatically? If every query scans raw data, you are overpaying for compute.
Can you attribute cost by team, by stream, by time period? If not, you cannot have a productive conversation with engineering leadership about the budget.
Does your data stay in your VPC? If not, you are paying a vendor margin plus egress charges on every byte of telemetry.
Cost control is not about spending less on observability. It is about knowing exactly what you spend, why you spend it, and having the architectural controls to make intentional decisions instead of reacting to bills.
