By Pralav Dessai and Harold Lim
Published on Sep 5, 2024
In our previous blog post, we explored techniques for optimizing cardinality analysis to address resource constraints, enhance query performance, and streamline data processing. We discussed various data shaping strategies aimed at effectively managing high cardinality data and its implications for observability systems. In this post, we turn to the storage considerations for high cardinality data.
To optimize the storage and retrieval of observability data, it's advantageous to use different indexing strategies for low and high cardinality attributes.
Understanding Low Cardinality vs. High Cardinality Data
Before diving into indexing strategies, let’s briefly recap what we mean by cardinality:
Low Cardinality: Attributes with a limited number of distinct values. For example, an attribute like “status” in a monitoring system might only have a few possible values such as “error,” “warning,” and “success.”
High Cardinality: Attributes with a large number of distinct values. For instance, a user ID in a large-scale application where each user has a unique identifier would exhibit high cardinality.
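To make the distinction concrete, here is a minimal Python sketch that measures the cardinality of each attribute in a batch of events. The `attribute_cardinality` helper and the sample events are hypothetical and only illustrate the idea. Exact counting like this keeps every distinct value in memory, which is fine for a small sample but does not scale to very large datasets.

```python
from collections import defaultdict


def attribute_cardinality(records):
    """Return {attribute: number of distinct values} for a list of dict records."""
    distinct = defaultdict(set)
    for record in records:
        for attribute, value in record.items():
            distinct[attribute].add(value)
    return {attribute: len(values) for attribute, values in distinct.items()}


# Hypothetical sample events: "status" repeats, "user_id" is unique per event.
events = [
    {"status": "error", "user_id": "u-1001"},
    {"status": "success", "user_id": "u-1002"},
    {"status": "success", "user_id": "u-1003"},
    {"status": "warning", "user_id": "u-1004"},
    {"status": "error", "user_id": "u-1005"},
]

cardinality = attribute_cardinality(events)
print(cardinality)
```

In this sample, "status" has three distinct values no matter how many events arrive, while the number of distinct "user_id" values grows with the number of users.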
Indexing Strategies for Low Cardinality Attributes
For low cardinality attributes, using an inverted index is most beneficial. An inverted index is a data structure that maps each unique value of an attribute to a list of records containing that value. It's particularly efficient for filtering and aggregation operations. When querying data based on low cardinality attributes, such as categories or statuses, an inverted index allows for rapid lookup and retrieval of relevant records.
Example of Low Cardinality Indexing
Consider a monitoring system where you need to query records based on the “status” attribute. If you use an inverted index for this attribute, it looks like this:
Status: Error → [Record ID 1, Record ID 7, Record ID 9]
Status: Warning → [Record ID 3, Record ID 5]
Status: Success → [Record ID 2, Record ID 4, Record ID 6, Record ID 8]
When you need to query all records with the status “error,” the inverted index allows for quick retrieval of Record IDs 1, 7, and 9 without scanning through the entire dataset.
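The lookup above can be sketched in a few lines of Python, assuming the records fit in memory as a dict keyed by record ID. The `build_inverted_index` helper and the sample records are illustrative and not taken from any particular observability product.

```python
from collections import defaultdict


def build_inverted_index(records, attribute):
    """Map each distinct value of `attribute` to the list of record IDs holding it."""
    index = defaultdict(list)
    for record_id, record in records.items():
        index[record[attribute]].append(record_id)
    return dict(index)


# Sample records mirroring the status example above.
records = {
    1: {"status": "error"},
    2: {"status": "success"},
    3: {"status": "warning"},
    4: {"status": "success"},
    5: {"status": "warning"},
    6: {"status": "success"},
    7: {"status": "error"},
    8: {"status": "success"},
    9: {"status": "error"},
}

status_index = build_inverted_index(records, "status")

# A filter becomes a single dictionary lookup instead of a full scan.
error_ids = status_index.get("error", [])
print(error_ids)
```

Because the attribute has only three distinct values, the index holds three posting lists regardless of how many records are added, which is what keeps this structure compact for low cardinality attributes.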
Indexing Strategies for High Cardinality Attributes
High cardinality attributes require a different indexing approach, as traditional indexing techniques may not scale well due to the large number of unique values. An inverted index, for example, would need one entry per distinct value, so the index itself can grow nearly as large as the data it describes. High cardinality indexes are useful for operations like unique counts, where you need to determine the count of distinct values for a particular attribute.
Advanced techniques such as data sketches can be used for high cardinality indexes. Data sketches are a class of streaming algorithms that process each item once and keep only a small summary of the data. Open-source implementations are available in the Apache DataSketches library. Sketches aim to expedite the analysis of large data volumes that are otherwise challenging to scale, because exact results demand extensive compute resources and time. In exchange, they return approximate answers with known error bounds.
Example of High Cardinality Indexing
By harnessing Data Sketches, observability solutions can effectively manage high cardinality attributes with significantly reduced computational and storage overhead compared to conventional methods. Some examples include:
HyperLogLog: Used for approximate distinct counting. Suppose you need to count the number of unique user IDs in a large dataset. Instead of maintaining a list of all IDs (which is resource-intensive), HyperLogLog provides a compact representation and an approximate count.
Count-Min Sketch: Useful for frequency estimation. If you want to determine how often each user ID appears in a dataset, Count-Min Sketch can approximate these frequencies without storing the exact counts for every unique ID.
Quantile Sketch: This method is used to estimate quantiles, such as the median or percentiles, of a dataset. For example, you can use Quantile Sketch to estimate the 95th percentile of response times when monitoring latency in a distributed system.
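To illustrate the first two sketches, here is a simplified, educational Python version of HyperLogLog and Count-Min Sketch. It is a sketch under stated assumptions, not a production implementation: the class names, the `_hash64` helper, the default sizes, and the synthetic user IDs are all choices made for this example, and a maintained library is the better option for real workloads.

```python
import hashlib
import math


def _hash64(value, seed=0):
    """64-bit hash of a value; the seed yields independent hash functions."""
    digest = hashlib.blake2b(
        str(value).encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


class HyperLogLog:
    """Approximate distinct counter using 2**precision small registers."""

    def __init__(self, precision=12):
        self.p = precision
        self.m = 1 << precision
        self.registers = [0] * self.m

    def add(self, value):
        h = _hash64(value)
        index = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of the leftmost 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)         # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


class CountMinSketch:
    """Approximate frequency table; estimates can overcount but never undercount."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, value):
        # One cell per row, each chosen by a differently seeded hash.
        return [(row, _hash64(value, seed=row + 1) % self.width)
                for row in range(self.depth)]

    def add(self, value, count=1):
        for row, col in self._cells(value):
            self.table[row][col] += count

    def estimate(self, value):
        return min(self.table[row][col] for row, col in self._cells(value))


hll = HyperLogLog()
cms = CountMinSketch()
for i in range(60_000):
    user_id = f"user-{i % 30_000}"  # 30,000 distinct IDs, each seen twice
    hll.add(user_id)
    cms.add(user_id)

distinct_estimate = hll.count()
frequency_estimate = cms.estimate("user-42")
print(round(distinct_estimate), frequency_estimate)
```

With these assumed defaults, the HyperLogLog keeps 4,096 registers and the Count-Min Sketch keeps 4 rows of 2,048 counters, and neither grows as more user IDs arrive. The trade-off is accuracy: the distinct count is an estimate, and the frequency estimate may be inflated by hash collisions.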
Using different indexes for low and high cardinality attributes optimizes data storage and query performance. Low cardinality indexes facilitate efficient filtering and aggregation, speeding up common query operations. High cardinality indexes built on sketches enable approximate unique counts, frequency estimates, and quantiles on attributes with a large number of unique values, while minimizing resource usage.