The Making of Kloudfuse 3.5: Building the Kloudfuse MCP Server for Observability Intelligence

The Kloudfuse MCP server exposes unified observability intelligence through natural language, enabling AI agents to troubleshoot like engineers do.

When we started planning Kloudfuse 3.5, our team kept returning to a fundamental question: how do engineers actually interact with observability data? They write PromQL queries. They filter traces. They search logs. They correlate timestamps across dashboards. It works, but it requires expertise and cognitive overhead.

The Model Context Protocol (MCP) emerged as a standard for connecting AI agents to external data sources. We saw an opportunity not just to expose our APIs, but to fundamentally rethink how engineers ask questions about their systems.

The Problem with API Wrappers

Most MCP implementations we studied followed a straightforward approach: expose APIs for metrics, logs, and traces, then let the AI figure out how to use them. On the surface, this seems reasonable. But as we dug deeper, we realized this approach creates significant problems.

The AI sees disconnected data. It might fetch metrics showing high latency, then separately fetch traces, then separately search logs. All this correlation happens in the AI's context window, which burns tokens and slows down responses. More importantly, the AI doesn't understand observability workflows the way engineers do. When you're troubleshooting, you follow patterns: identify the affected service, check its dependencies, look for recent changes, correlate signals across telemetry types. A basic API wrapper doesn't encode any of this domain knowledge.

Performance suffers too. Multiple sequential API calls with large responses create latency and cost issues at scale. And the results are often inconsistent because the AI might not always make optimal API call sequences.

We needed something different: an MCP server that exposes intelligence, not just data.

Service-Centric Architecture

The core architectural decision was organizing observability data around services as the primary entity. When you query about a service in Kloudfuse, you see unified context: metrics, traces, logs, dependencies, and infrastructure mapping automatically correlated.

Kloudfuse MCP server exposes this same unified view. Ask about a service, and you receive:

  • Current health metrics (throughput, latency, error rates)

  • Recent traces showing actual request flows

  • Logs with errors or warnings from relevant timeframes

  • Upstream and downstream service dependencies

  • Infrastructure context (pods, nodes, clusters)

This isn't mere convenience. It's how troubleshooting actually works: engineers think in terms of services and their relationships, not isolated metrics.
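To make the shape of this unified view concrete, here is a minimal sketch of what a correlated service-context payload might look like. The field names and values are illustrative assumptions, not the actual Kloudfuse MCP response schema:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a unified service context; every field name
# and value here is an illustrative assumption, not the real schema.
@dataclass
class ServiceContext:
    service: str
    health: dict                                        # throughput, latency, error rate
    recent_traces: list = field(default_factory=list)   # sampled request flows
    error_logs: list = field(default_factory=list)      # errors/warnings in window
    upstream: list = field(default_factory=list)        # callers
    downstream: list = field(default_factory=list)      # dependencies
    infrastructure: dict = field(default_factory=dict)  # pods, nodes, cluster

ctx = ServiceContext(
    service="checkout-service",
    health={"rps": 120.0, "p99_ms": 850, "error_rate": 0.04},
    downstream=["payment-service", "inventory-service"],
    infrastructure={"pods": ["checkout-7d9f"], "node": "node-3"},
)
```

The point of a single structured object like this is that the AI agent receives the correlation already done, rather than reassembling it from separate metric, trace, and log calls.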

Multi-Layer Dependencies

Modern distributed systems have complex dependency graphs. A slow API might be caused by a downstream service, which calls a database, which runs on a Kubernetes node experiencing resource contention.

We built the Kloudfuse MCP server to expose three dependency layers:

  • Service dependencies from distributed tracing show microservice call graphs. If checkout-service calls payment-service calls database-service, these relationships are explicit.

  • Workload dependencies map services to Kubernetes deployments and pods. When investigating payment-service performance, you need to know which specific pods are affected.

  • Infrastructure dependencies connect workloads to nodes, storage volumes, and network interfaces. If a pod is slow because the underlying node is CPU-constrained, this relationship matters.

Exposing this three-layer dependency graph through MCP enables AI agents to follow the same troubleshooting paths engineers use in the Kloudfuse UI.
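The three-layer walk can be sketched with a toy graph. The data and the helper below are hypothetical stand-ins, not the Kloudfuse API; they only illustrate following a service down through its workloads to the nodes underneath:

```python
# Toy three-layer dependency data; all names are illustrative.
service_deps = {"checkout-service": ["payment-service"],
                "payment-service": ["database-service"]}
workloads = {"payment-service": ["payment-7d9f-abc12"]}  # service -> pods
infra = {"payment-7d9f-abc12": "node-3"}                 # pod -> node

def trace_to_infra(service):
    """Follow service -> downstream services -> pods -> nodes."""
    downstream = service_deps.get(service, [])
    pods = [p for s in [service] + downstream
            for p in workloads.get(s, [])]
    return {
        "service": service,
        "downstream": downstream,
        "pods": pods,
        "nodes": sorted({infra[p] for p in pods if p in infra}),
    }
```

A call like `trace_to_infra("checkout-service")` surfaces `payment-service`, its pod, and `node-3` in one hop, which is exactly the path an engineer walks when a slow API turns out to be a CPU-constrained node two layers down.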

Maintaining Conversation Context

Real investigations are conversational. You ask a question, get an answer, then ask a follow-up based on what you learned. "What services have high error rates?" leads to "Show me traces from checkout-service" leads to "What are its dependencies?"

The Kloudfuse MCP server maintains this conversation context. When you ask "show me the previous hour," it remembers you were investigating checkout-service and adjusts the time window. When you ask "what about its dependencies," it knows which service you mean.

This temporal and entity tracking enables natural troubleshooting workflows without requiring every query to restate full context.
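A minimal sketch of that entity and time-window tracking, with a fixed "now" so the example is deterministic (the class and method names are assumptions, not the server's internals):

```python
from datetime import datetime, timedelta

# Minimal sketch of conversational context tracking; names and
# behavior are illustrative assumptions.
class ConversationContext:
    def __init__(self):
        self.entity = None
        self.end = datetime(2025, 1, 1, 12, 0)  # fixed "now" for the example
        self.window = timedelta(hours=1)

    def focus(self, service):
        # "Show me traces from checkout-service"
        self.entity = service

    def previous_window(self):
        # "show me the previous hour" -> slide the window back
        self.end -= self.window
        return (self.end - self.window, self.end)

ctx = ConversationContext()
ctx.focus("checkout-service")
start, end = ctx.previous_window()
```

With state like this, a follow-up such as "what about its dependencies" resolves against `ctx.entity` instead of forcing the user to restate the service and time range.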

Tools Beyond Simple Queries

The MCP server exposes several specialized tools that AI agents can use:

  • Prometheus queries for metric aggregation and label discovery, enabling questions like "which services have the highest memory usage?"

  • APM tools for service discovery, tracing, and dependency mapping, answering questions like "what dependencies does checkout-service have?"

  • Kubernetes entity queries across pods, nodes, clusters, and workloads, correlating application performance with infrastructure health.

  • Log search using FuseQL and LogQL, enabling both simple searches and complex aggregations.

  • Alert management for retrieving alert history and configurations, answering questions like "when was this service last alerted?"

  • Event retrieval with filtering and facet discovery, correlating deployments and configuration changes with performance shifts.

Each tool is designed to return structured, correlated data optimized for AI interpretation, not raw API responses requiring extensive parsing.
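One way to picture such a tool surface is a registry mapping tool names to handlers that return structured results. This is a hedged sketch: the tool names echo the list above, but the signatures, decorator, and return shapes are assumptions for illustration:

```python
# Hypothetical tool registry; signatures and return shapes are
# illustrative assumptions, not the Kloudfuse MCP interface.
TOOLS = {}

def tool(name):
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("query_metrics")
def query_metrics(promql, lookback="5m"):
    # A real server would execute PromQL against Kloudfuse;
    # here we return a canned, structured result.
    return {"query": promql, "lookback": lookback,
            "series": [{"service": "checkout-service", "value": 0.04}]}

@tool("get_service_dependencies")
def get_service_dependencies(service):
    return {"service": service, "downstream": ["payment-service"]}

result = TOOLS["get_service_dependencies"]("checkout-service")
```

The design choice worth noting is that each handler returns a small, typed-feeling dict rather than a raw API response, so the agent spends tokens on reasoning, not on parsing.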

Query Intelligence

The Kloudfuse MCP server includes query intelligence that interprets natural language intent. Ask "Why is checkout-service returning errors?" and the server:

  1. Identifies checkout-service as the target entity

  2. Recognizes "errors" as error rate metrics

  3. Fetches recent error rate trends

  4. Retrieves traces with error status codes

  5. Searches logs for error messages in relevant timeframes

  6. Checks dependencies for cascading failures

  7. Returns unified, correlated context with execution metadata

Responses include the actual time ranges queried, the FuseQL/PromQL queries executed, and pagination tokens for large result sets. This transparency lets AI agents explain exactly where their answers came from, building trust in automated troubleshooting.

The AI agent then synthesizes this into natural language, but the heavy lifting happens in the Kloudfuse MCP server's intelligence layer.
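The seven steps above can be sketched as one orchestration function. Every helper and field name here is a hypothetical stand-in; the sketch only shows the shape of the correlated context plus execution metadata:

```python
# Sketch of the seven-step pipeline; all helpers and canned values
# are hypothetical stand-ins for the server's intelligence layer.
def investigate(question):
    # 1. identify the target entity; 2. recognize the signal of interest
    service = "checkout-service" if "checkout-service" in question else None
    signal = "error_rate" if "error" in question else "latency"
    context = {
        "service": service,
        "metric_trend": {"metric": signal, "values": [0.01, 0.03, 0.04]},  # 3
        "error_traces": [{"trace_id": "abc123", "status": 500}],           # 4
        "log_hits": ["payment timeout after 2s"],                          # 5
        "dependency_errors": {"payment-service": 0.12},                    # 6
    }
    # 7. execution metadata makes the answer auditable
    context["meta"] = {
        "time_range": "now-1h..now",
        "queries": [f'sum(rate(errors{{service="{service}"}}[5m]))'],
    }
    return context

ctx = investigate("Why is checkout-service returning errors?")
```

Returning the executed queries and time range alongside the data is what lets the agent say not just "payment-service is failing" but exactly which query and window support that claim.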

Standards-Based Implementation

We've built Kloudfuse on open standards (OpenTelemetry, open query languages) to prevent vendor lock-in. The same philosophy guided our MCP implementation.

We implemented the Model Context Protocol specification exactly as designed, with no proprietary extensions. This means our MCP server works with any MCP-compatible AI agent: Claude Desktop, ChatGPT with MCP plugins, custom agents using MCP SDKs, or future MCP-compatible tools.

Customers aren't locked into a specific LLM provider. The same MCP server works across providers, and your observability data remains accessible through standard protocols.

Getting Started with the Kloudfuse MCP Server

Setting up the Kloudfuse MCP server takes minutes. Generate a service account token in Kloudfuse with appropriate RBAC permissions. Configure your MCP client (Claude Desktop, custom agent, or MCP SDK) with three parameters: your Kloudfuse API endpoint, the service account token, and optional stream-specific access controls.

The Kloudfuse MCP server authenticates using bearer tokens, inheriting the same RBAC policies as human users. A service account with read-only metrics access can query Prometheus but not modify alert configurations. This security model means you control exactly what AI agents can see and do.
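As one illustration, a Claude Desktop configuration along these lines could point an MCP client at a Kloudfuse endpoint. The `mcpServers` block is the standard Claude Desktop format, but the endpoint URL, the `mcp-remote` bridge, and the token variable name are assumptions, not documented Kloudfuse settings:

```json
{
  "mcpServers": {
    "kloudfuse": {
      "command": "npx",
      "args": ["-y", "mcp-remote", "https://your-stack.example.com/mcp"],
      "env": {
        "KF_API_TOKEN": "<service-account-token>"
      }
    }
  }
}
```

Consult the Kloudfuse documentation for the actual endpoint path and token configuration for your deployment.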

Performance and Security

Building an MCP server on top of a high-volume observability platform required careful optimization. We implemented smart caching for service metadata, efficient query patterns using scheduled views and pre-aggregated data, response size limits to respect AI token constraints, and streaming for large result sets.

Security follows Kloudfuse's existing model: RBAC enforcement scopes MCP queries to user permissions, stream-specific access controls what data users can view, audit logging tracks all MCP queries for compliance, and service account authentication enables programmatic access.

AI agents inherit the same security boundaries as human users.

Looking Forward

Natural language queries aren't a replacement for dashboards or query languages. They're a complementary interface that reduces cognitive overhead for common troubleshooting workflows.

The Kloudfuse MCP server makes observability accessible to AI agents, enabling automated incident response, proactive monitoring, and knowledge accumulation. As the Model Context Protocol standard evolves, we'll continue exposing Kloudfuse's unified intelligence through this open interface.

The goal isn't to replace engineers. It's to give them a more natural way to interact with increasingly complex distributed systems.

The Kloudfuse MCP server is available in Kloudfuse 3.5. Learn more in our launch announcement.

Observe. Analyze. Automate.

© Kloudfuse 2025. All Rights Reserved.

Terms and Conditions
