How We Built Multi-Resolution Rollups to Make Dashboard Queries 10x Faster

Published on
Table of Contents
Pull up a 7-day dashboard for 200 metrics at 15-second raw resolution. That's 8 million data points your query engine needs to scan, aggregate, and render. Now do it during an incident, when three teams have the same dashboard open and your alerting system is evaluating the same queries in the background. The query layer buckles, dashboards spin, and the people who need data most are staring at loading indicators.
This is not a hypothetical. It's the daily reality for platform teams running large-scale Prometheus-based monitoring. The standard workaround is recording rules: pre-computed queries that run on a schedule and store results as new time series. They work, until they don't. Recording rules accumulate over time (hundreds across dozens of services), each one a potential source of drift, staleness, or misconfiguration. They require manual maintenance as services evolve. And they never quite cover the ad hoc query someone needs during an incident at 3 AM.
This post covers how we built multi-resolution rollups in Kloudfuse to solve this problem at the architecture level. We'll walk through the Kafka-Pinot pipeline that computes rollups in parallel with raw ingestion, how the query service automatically selects the right resolution, and why this approach drastically reduces your reliance on recording rules. The idea originated from a conversation with Benjamin at Zscaler, where the scale of their monitoring infrastructure made the recording rule problem impossible to ignore.
What Problem Do Multi-Resolution Rollups Actually Solve?
The core issue is a mismatch between data resolution and query intent. Raw metrics at 15-second or 30-second intervals are essential for real-time troubleshooting. But for a 30-day capacity planning dashboard, nobody needs 2.1 million data points per metric. They need the trend, not the individual samples.
Traditional approaches force engineers to choose upfront: either query raw data (slow for long ranges) or create recording rules that pre-aggregate (fast but inflexible). Multi-resolution rollups remove this choice by maintaining multiple pre-aggregated views automatically.
Here's the concrete impact for a single metric at 15-second raw intervals:
Query duration | Raw, 15s | 5min rollup | 10min rollup |
|---|---|---|---|
6 hours | 1,440 records | 72 records | 36 records |
7 days | 40,320 records | 2,016 records | 1,008 records |
30 days | 172,800 records | 8,640 records | 4,320 records |
1 year | 2,102,400 records | 105,120 records | 52,560 records |
At enterprise scale, the difference becomes significant. With 200 metrics, a 7-day query drops from roughly 8 million raw records to about 403,000 records with 5-minute rollups, or 201,000 records with 10-minute rollups. That is a 20x reduction with 5-minute rollups and a 40x reduction with 10-minute rollups when compared with 15-second raw data. For 30-second raw intervals, a 5-minute rollup gives a clean 10x reduction.
These improvements compound. Every dashboard load, every alert evaluation, every ad hoc investigation benefits from the reduced scan volume.
How Does the Rollup Pipeline Work Under the Hood?
The architecture uses a Kafka-Pinot pipeline with two parallel processing paths that branch from the same ingestion point.
Raw metrics path: Time series data arrives from agents (Datadog, OpenTelemetry, Prometheus remote write) and enters the Ingester Service, which publishes to Kafka as kf_metrics_topic. Kafka forwards this directly to Pinot via the Metrics Decoder, writing to the kf_metrics table with columns: name, timestamp, labels, value, le.
Rollup metrics path: The same kf_metrics_topic feeds a Metrics Transformer that computes aggregations at multiple configured resolutions. The transformer publishes results to kf_metrics_rollup_topic, which Pinot ingests into the kf_metrics_rollup table with columns: name, timestamp, labels, sum, count, min, max, counter, first, first_ts, last, le (for classic histograms), and histogram (for native histograms).
The two paths run in parallel from the same Kafka topic. Rollup computation adds zero latency to raw metric ingestion. The Metrics Transformer handles all configured resolutions (default: 5min, 10min, 30min, 1hr, 4hr) and publishes them as a single rollup topic.
Why Store sum, count, min, max, counter, last, and histogram Separately?
Each rollup bucket preserves enough statistical information to reconstruct any common aggregation at query time:
sum and count together allow computing averages without losing precision
min and max preserve extremes that would be lost in a simple average
counter stores the last counter value accounting for resets within the rollup window, preserving Prometheus counter semantics
first, first_ts, and last record the first value, its timestamp, and the last value in the rollup window for accurate rate and delta calculations across counter resets
le preserves classic histogram bucket boundaries through the rollup process
histogram stores native histogram data, supporting Prometheus's newer native histogram format
This design choice means you don't sacrifice query flexibility for speed. A rate() function works correctly across rollup boundaries because the counter reset information is preserved. A histogram_quantile() works because the le boundaries are intact. You get the performance of pre-aggregated data with the accuracy of raw data for any supported aggregation.
How Does the Query Service Select the Right Resolution?
The Query Service is the intelligent routing layer that makes rollups transparent to users. When a query arrives, the service evaluates the step size and time range, then selects the most appropriate data source:
Short time range or small step size: reads from the raw
kf_metricstable to preserve fine-grained detailLonger time ranges: selects the coarsest rollup resolution that still meets the query's step size and range vector selector requirements
The selection is entirely automatic. No configuration changes. No query modifications. Existing PromQL queries work without changes because the resolution selection happens below the query language layer.
What About Partial Rollup Coverage?
This is where an important edge case appears. If rollups were recently enabled, or if data is being backfilled, a query might span a time range where only part of the data has rollup coverage. The Query Service handles this by automatically splitting the query: it uses the rollup table for the portion with rollup data and the raw table for the remainder, then combines the results transparently.
This means enabling rollups on an existing deployment is seamless. There's no migration step. No backfill job to run. The system uses rollups where they exist and falls back to raw data where they don't, converging to full rollup coverage as time passes.
Moving Beyond Recording Rules
Moving Beyond Recording Rules
Recording rules exist because Prometheus's storage model doesn't support multi-resolution queries natively. If you need a 30-day SLO burn rate, you create a recording rule that pre-computes it every minute and stores the result as a new time series. Over time, teams accumulate hundreds of these rules:
SLO burn rates at 7-day and 30-day windows
Aggregated error rates by service and region
P99 latency percentiles across cluster boundaries
Capacity utilization summaries for planning dashboards
Each recording rule is a configuration artifact that needs to match the current service topology, use the correct label selectors, and be maintained as services evolve. When they drift, the symptoms are subtle: an SLO dashboard showing stale data, an alert evaluating against an outdated aggregation, a capacity report missing a newly deployed service.
But the problems go deeper than maintenance burden. Recording rules introduce their own performance and correctness issues that are easy to miss.
Serial execution creates cascading delays. Recording rules within the same group run sequentially, not in parallel, with unaligned time windows. A group with 50 rules means the last rule starts only after the first 49 complete. During a metrics spike, when you need results most, the evaluation pipeline falls behind.
Recording rules are steady-state load. Each rule is a PromQL query that runs every evaluation interval, whether anyone is looking at the dashboard or not. Complex rules using expensive functions like
histogram_quantileor multi-metric aggregations create constant background query pressure. If alerts evaluate at the same interval as the recording rule, you've saved nothing.Teams create redundant rules for the same metric. It's common to find 20 recording rules for the same underlying metric with different time windows and filter combinations, one per dashboard panel. The dashboard looks clean because each panel queries a single pre-computed metric, but the system is paying the compute and storage cost 20 times over. And when someone needs a filter combination that wasn't pre-configured, they're back to querying raw data anyway.
Aggregation without label reduction doubles cardinality. If a recording rule aggregates but preserves all original labels including high-cardinality ones, the output has nearly the same series count as the raw metric. Instead of reducing load, the rule has effectively doubled the metric's storage footprint.
Unsupervised rules cause repeated damage. A recording rule that nobody reviews can execute expensive or misleading queries on every evaluation cycle. A rule like
topk(1000, sum(rate(http_requests_total[1h])) by (user_id))aggregates across a high-cardinality label over a wide window. It might look reasonable at creation time, but it's expensive to evaluate and the results may be unreliable due to unaligned metric boundaries. Unlike a one-off bad query, a recording rule runs this on repeat.
With Kloudfuse's intelligent query routing, you no longer have to rely on recording rules to keep dashboards fast. A 30-day capacity query automatically scans the 1-hour rollup table instead of 30 days of raw data, significantly reducing load. For highly precise calculations like SLOs, Kloudfuse computes 7-day and 30-day burn rates at query time directly from raw data, improving accuracy and eliminating the need for SLO recording rules entirely. The result? No configuration to update, no drift to detect, and perfectly accurate SLOs.
The tradeoff is explicit: rollup tables consume additional storage and in-memory resources. For large historical retention windows, this may require additional disk capacity. We considered this acceptable because the operational cost of maintaining hundreds of recording rules across dozens of services far exceeds the infrastructure cost of additional storage.
How Does This Interact with Dashboard Rendering?
Kloudfuse supports both implicit and explicit rollups in charts:
Implicit rollups activate automatically when the query's time range exceeds what raw resolution can serve efficiently. For a 7-day query, the system selects a coarser rollup interval based on the overall timeframe rather than rendering all 40,320 raw data points per metric. The default aggregation method is average.
Explicit rollups let users configure both the rollup interval and the aggregation method (avg, sum, min, max, count) per chart panel. This is useful when the aggregation semantics matter: a sum for request counts, a max for peak latency, a min for available capacity.
The combination means dashboards render faster without manual tuning, while power users retain full control over aggregation behavior.
What Does This Look Like at Zscaler's Scale?
Zscaler operates one of the largest cloud security platforms in the world, processing traffic for thousands of enterprise customers. Their observability infrastructure generates telemetry volumes where the recording rule problem is not an inconvenience but an operational blocker.
Michael Kuperman, Chief Reliability Officer at Zscaler, described the value in terms of unified production visibility: the ability to see how systems actually behave, with the performance to render those views interactively, changes how teams investigate and plan.
The multi-resolution rollup concept emerged from conversations with their engineering team about what it would take to make week-long and month-long dashboard queries interactive rather than batch-scheduled. The answer was architectural: the query engine needs to serve queries at the right resolution automatically, not force engineers to pre-configure the answer.
Configuration and Rollout
Rollups are enabled by default since Kloudfuse 4.0. The default resolutions (5min, 10min, 30min, 1hr, 4hr) cover the most common dashboard time ranges. Administrators can customize resolutions or disable rollups entirely via cluster configuration.
For existing deployments upgrading to a version with rollups, the transition is gradual. Rollup tables begin populating from the moment of upgrade. Queries automatically use rollups for the time ranges where they're available. No migration window, no coordinated cutover.
With workload isolation in Kloudfuse 4.0, the rollup computation (part of the ingestion pipeline) scales independently from the query layer. Heavy rollup computation during a metrics spike doesn't compete with dashboard query resources.
The Design Decision: Why Not Just Downsample?
Downsampling, the approach used by Thanos and some Prometheus long-term storage backends, reduces resolution by discarding data points. This is simpler to implement but irreversible. Once you've downsampled 15-second data to 5-minute resolution, the original data points are gone. If an incident investigation needs the original resolution for a time range that's been downsampled, you're out of luck.
Multi-resolution rollups keep the raw data alongside the rollup tables. The query service selects the appropriate resolution per query, but the raw data is always available for drill-down. This costs more storage, but preserves the ability to investigate at full resolution for any time range within the retention window.
The alternative we considered, compute-on-read aggregation without pre-materialized rollups, preserves flexibility but transfers the cost to query time. For a platform where dozens of teams run dashboards concurrently, the query-time cost is unacceptable. Pre-materialized rollups move the cost to ingestion time, where it can be amortized and scaled independently.
Next Steps
The metrics rollup configuration guide covers resolution customization and storage planning. For teams evaluating how rollups interact with their existing PromQL queries, the rollup aggregation documentation details how implicit and explicit rollups work in charts.
What's your team's recording rule situation? We've heard everything from "we have 12 and they work fine" to "we have 400 and nobody knows which ones are still valid." We'd be curious where you fall on that spectrum.
