Day 17: Building Topology Visualization with AI-Assisted Health Monitoring



This content originally appeared on DEV Community and was authored by Clay Roach

The Strategic Pivot That Paid Off

Sometimes the best architectural decision is knowing when to pivot. Today, instead of continuing with the planned infrastructure work, we made a strategic call: implement the topology visualization feature that had been on our roadmap. The result? A complete, production-ready feature delivered in under 4 hours.

This wasn’t luck. This was the payoff from 16 days of infrastructure investment.

[Figure: Full Topology Visualization]

AI-Powered Insights in Action

The topology visualization is just the visual layer. The real power comes from the AI analysis that provides actionable insights:

[Figure: AI-Powered Insights with Claude]

Each model brings a different perspective:

  • Claude: Architectural pattern analysis and system design insights
  • GPT-4: Performance optimization opportunities
  • Llama: Resource utilization and scalability analysis
  • Local Statistical: Pure metrics-based anomaly detection

Why the Pivot Worked: The Infrastructure Foundation

The decision to pause other work and focus on topology visualization succeeded because of four key infrastructure investments:

1. AI Agent Infrastructure (Inspired by @ColeMedin)

A special shoutout to Cole Medin whose YouTube videos on AI-assisted development inspired today’s tooling improvements. After reviewing his content this morning, we created the code-implementation-agent – a specialized Claude Code agent that transforms design documents into production-ready Effect-TS code with strong typing and comprehensive tests.

This agent was instrumental in today’s rapid implementation:

# .claude/agents/code-implementation-agent.md
Purpose: Transform design documents into Effect-TS code
Tools: Read, Write, Edit, MultiEdit, Glob, Grep
Capabilities:
  - Creates interfaces and schemas first
  - Implements services with Effect patterns
  - Generates unit and integration tests
  - Ensures no "any" types or eslint issues

The agent-based approach meant we could focus on architecture while the AI handled boilerplate and implementation details.

2. Comprehensive Test Infrastructure (Days 5-7)

pnpm test:e2e
# ✓ 13 tests passing
# Total time: 31.3s

Our e2e test suite caught issues immediately:

  • TypeScript errors flagged before runtime
  • Component integration issues detected early
  • Real data flow validation with OpenTelemetry demo

3. CI/CD Pipeline (Days 10-12)

The automated pipeline caught and fixed:

  • Missing type definitions
  • ESLint violations
  • Unused imports and variables
  • Breaking changes in real-time

4. Real Data Integration (Day 14)

Having the OpenTelemetry demo integrated meant:

  • Immediate validation with 13 real services
  • Realistic performance metrics
  • Edge cases we wouldn’t have imagined

The 4-Hour Implementation Sprint

Here’s how we delivered a complete feature in less than half a workday:

Hour 0.5: Agent Setup & Planning

  • Reviewed Cole Medin’s AI workflow videos
  • Created code-implementation-agent for Effect-TS patterns
  • Set up ADR-013 as the design document

Hour 1: Core Visualization (with code-implementation-agent)

  • Agent generated ECharts force-directed graph setup
  • Automated node and edge data structures
  • Initial health color mapping with proper TypeScript types

Hour 2: Intelligence Layer

  • Service-specific thresholds implementation
  • LLM health explanations with Effect-TS schemas
  • Context-aware recommendations system

Hour 3: UI Polish & Integration

  • Tooltip positioning fixes (caught by e2e tests)
  • Service panel layout optimization
  • Interactive health filters with state management

Hour 3.5: Testing & Refinement

  • All 13 e2e tests passing
  • TypeScript errors resolved by CI/CD
  • Production ready with zero “any” types

The Challenge: Context-Aware Health Monitoring

Not all services are created equal. A 500ms response time might be perfectly acceptable for a reporting service but catastrophic for a payment gateway. Traditional monitoring treats every service the same, leading to alert fatigue and missed critical issues.

The Solution: Dynamic Health Visualization with AI Insights

We’ve built a topology visualization that displays service health dynamically, with the foundation for intelligent monitoring that will learn from your system over time.

Current Implementation: Visual Health Indicators

For now, we use basic thresholds to provide immediate visual feedback:

// Temporary thresholds for visualization
// These will be replaced by autoencoder-learned patterns
errorStatus: node.metrics.errorRate > 5 ? 2 : node.metrics.errorRate > 1 ? 1 : 0,
durationStatus: node.metrics.duration > 500 ? 2 : node.metrics.duration > 200 ? 1 : 0,
rateStatus: node.metrics.rate < 1 ? 1 : node.metrics.rate > 200 ? 1 : 0
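The fragment above can be wrapped in a standalone function for experimentation. This is a minimal sketch: the `RawMetrics` and `MetricStatuses` shapes and the function name `classifyMetrics` are hypothetical, and the units are assumptions inferred from the cut-offs, since the post does not show the real type definitions.

```typescript
// Hypothetical shapes; the real definitions are not shown in the post.
interface RawMetrics {
  rate: number      // request rate (unit assumed)
  errorRate: number // percent (assumed from the > 5 and > 1 cut-offs)
  duration: number  // milliseconds (assumed)
}

type Status = 0 | 1 | 2 // 0 = healthy, 1 = warning, 2 = critical

interface MetricStatuses {
  errorStatus: Status
  durationStatus: Status
  rateStatus: Status
}

// Same cut-offs as the temporary thresholds, expressed as a pure function.
function classifyMetrics(m: RawMetrics): MetricStatuses {
  return {
    errorStatus: m.errorRate > 5 ? 2 : m.errorRate > 1 ? 1 : 0,
    durationStatus: m.duration > 500 ? 2 : m.duration > 200 ? 1 : 0,
    // Very low and very high traffic both map to a warning;
    // with these cut-offs the rate status never reaches 2.
    rateStatus: m.rate < 1 || m.rate > 200 ? 1 : 0
  }
}

const example = classifyMetrics({ rate: 50, errorRate: 2.5, duration: 650 })
console.log(example) // { errorStatus: 1, durationStatus: 2, rateStatus: 0 }
```

Keeping the classification pure makes it easy to unit test and to swap out later for learned baselines.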

These are intentionally simple because the real intelligence will come from learned patterns, described next.

Next Steps: Autoencoder-Based Learning

The next phase involves implementing the autoencoder for pattern learning:

  1. 📊 Pattern Learning: The autoencoder will learn normal behavior patterns for each service over time
  2. 🎯 Anomaly Detection: Deviations from learned patterns will trigger alerts, not arbitrary thresholds
  3. 📈 Adaptive Thresholds: Each service gets its own learned baseline based on historical data
  4. 🔄 Continuous Learning: The system adapts as your architecture evolves

Why we’re not using hard-coded service-type rules:

  • Every deployment is different: Your payment service != someone else’s payment service
  • Context matters: A service’s “normal” depends on time of day, load, dependencies
  • Evolution over time: Services change, thresholds should adapt automatically
  • Avoid assumptions: Let the data tell us what’s normal, not our preconceptions
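To make the idea of a per-service learned baseline concrete, here is a deliberately simplified sketch. It is not the planned autoencoder: it swaps in a running mean and standard deviation (Welford's algorithm) with a z-score check, purely to illustrate how the same latency can be normal for one service and anomalous for another. The class name, the service names, and the threshold of 3 standard deviations are all assumptions.

```typescript
// Simplified stand-in for the planned autoencoder: each service keeps a
// running mean and variance (Welford's algorithm) of one metric.
class ServiceBaseline {
  private count = 0
  private mean = 0
  private m2 = 0 // running sum of squared deviations from the mean

  // Fold one observation (for example a latency sample) into the baseline.
  observe(value: number): void {
    this.count += 1
    const delta = value - this.mean
    this.mean += delta / this.count
    this.m2 += delta * (value - this.mean)
  }

  // Number of sample standard deviations between `value` and the learned mean.
  // Returns 0 when there is too little history or no spread to compare against.
  zScore(value: number): number {
    if (this.count < 2) return 0
    const std = Math.sqrt(this.m2 / (this.count - 1))
    return std === 0 ? 0 : (value - this.mean) / std
  }

  isAnomalous(value: number, threshold = 3): boolean {
    return Math.abs(this.zScore(value)) > threshold
  }
}

// Illustrative history (milliseconds); the numbers are made up.
const reporting = new ServiceBaseline()
const payment = new ServiceBaseline()
for (const ms of [480, 510, 495, 520, 505]) reporting.observe(ms)
for (const ms of [48, 52, 50, 47, 53]) payment.observe(ms)

console.log(reporting.isAnomalous(500)) // false: 500ms is normal here
console.log(payment.isAnomalous(500))   // true: far outside the learned range
```

An autoencoder would learn joint patterns across many metrics rather than one metric at a time, but the contract is the same: compare the current value against what this particular service usually does.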

AI-Powered Health Explanations

We didn’t stop at thresholds, though. Each service gets a generated health explanation that provides context and actionable recommendations:

export function generateHealthExplanation(
  serviceName: string,
  metrics?: ServiceMetricsDetail
): HealthExplanation {
  // Excerpt: the declarations of `status`, `criticalMetrics`, and
  // `recommendations` are omitted here.

  // Analyze each metric with context
  const impactedMetrics: HealthExplanation['impactedMetrics'] = []

  // Smart analysis based on metric combinations
  // (`metrics` is optional, so guard before reading its fields)
  if (metrics && metrics.errorStatus >= 1 && metrics.durationStatus >= 1) {
    recommendations.push('Combined high errors and latency suggest infrastructure or dependency issues')
  }

  if (metrics && metrics.rateStatus === 2 && metrics.errorStatus === 0) {
    recommendations.push('High traffic with low errors indicates successful scaling - monitor resource usage')
  }

  return {
    status,
    summary: `${serviceName} is experiencing critical issues with ${criticalMetrics.join(', ')}. Immediate action required.`,
    // De-duplicate recommendations before returning them
    recommendations: [...new Set(recommendations)]
  }
}
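The excerpt above leaves out how `status`, `criticalMetrics`, and `recommendations` are derived. The following self-contained sketch fills those gaps with assumptions so the logic can be run end to end: the type shapes, the `'unknown'` status for missing metrics, the metric labels, and the non-critical summary strings are all hypothetical, and the function is renamed `explainHealth` to avoid implying it is the project's actual implementation.

```typescript
type Status = 0 | 1 | 2 // 0 = healthy, 1 = warning, 2 = critical

// Hypothetical shapes; the real definitions are not shown in the post.
interface ServiceMetricsDetail {
  rateStatus: Status
  errorStatus: Status
  durationStatus: Status
}

interface HealthExplanation {
  status: 'healthy' | 'warning' | 'critical' | 'unknown'
  summary: string
  impactedMetrics: string[]
  recommendations: string[]
}

function explainHealth(serviceName: string, metrics?: ServiceMetricsDetail): HealthExplanation {
  // Assumption: a service without metrics is reported as 'unknown'.
  if (!metrics) {
    return {
      status: 'unknown',
      summary: `No metrics available for ${serviceName}.`,
      impactedMetrics: [],
      recommendations: []
    }
  }

  const labelled: Array<[string, Status]> = [
    ['error rate', metrics.errorStatus],
    ['latency', metrics.durationStatus],
    ['request rate', metrics.rateStatus]
  ]
  const impactedMetrics = labelled.filter(([, s]) => s >= 1).map(([name]) => name)
  const criticalMetrics = labelled.filter(([, s]) => s === 2).map(([name]) => name)

  // The two combination rules are taken from the excerpt. Note that with the
  // temporary thresholds shown earlier, rateStatus never reaches 2, so the
  // second rule stays dormant until the rate classification changes.
  const recommendations: string[] = []
  if (metrics.errorStatus >= 1 && metrics.durationStatus >= 1) {
    recommendations.push('Combined high errors and latency suggest infrastructure or dependency issues')
  }
  if (metrics.rateStatus === 2 && metrics.errorStatus === 0) {
    recommendations.push('High traffic with low errors indicates successful scaling - monitor resource usage')
  }

  const status =
    criticalMetrics.length > 0 ? 'critical' : impactedMetrics.length > 0 ? 'warning' : 'healthy'
  const summary =
    status === 'critical'
      ? `${serviceName} is experiencing critical issues with ${criticalMetrics.join(', ')}. Immediate action required.`
      : status === 'warning'
        ? `${serviceName} shows elevated ${impactedMetrics.join(', ')}.`
        : `${serviceName} is healthy.`

  return { status, summary, impactedMetrics, recommendations: Array.from(new Set(recommendations)) }
}

const result = explainHealth('checkout', { rateStatus: 0, errorStatus: 2, durationStatus: 1 })
console.log(result.status) // 'critical'
console.log(result.summary)
```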

User Experience Features

Interactive Health Filtering

Click any health badge to filter the topology:

const handleHealthFilter = (status: string) => {
  setFilteredHealthStatuses(prev => {
    if (prev.includes(status)) {
      return prev.filter(s => s !== status)
    } else {
      return [...prev, status]
    }
  })
}
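The handler above only toggles the filter state; how that state is applied to the graph is not shown. A minimal sketch of both steps as pure functions follows. The `TopologyNode` shape, the status names, and the rule that an empty filter shows every node are assumptions, and the service ids are illustrative.

```typescript
type HealthStatus = 'healthy' | 'warning' | 'critical'

// Hypothetical node shape for illustration.
interface TopologyNode {
  id: string
  health: HealthStatus
}

// Pure version of the toggle: remove the status if present, otherwise add it.
function toggleStatus(active: readonly HealthStatus[], status: HealthStatus): HealthStatus[] {
  return active.includes(status) ? active.filter(s => s !== status) : [...active, status]
}

// Assumption: an empty filter list means "show everything".
function visibleNodes(nodes: readonly TopologyNode[], active: readonly HealthStatus[]): TopologyNode[] {
  return active.length === 0 ? [...nodes] : nodes.filter(n => active.includes(n.health))
}

const nodes: TopologyNode[] = [
  { id: 'frontend', health: 'healthy' },
  { id: 'checkout', health: 'critical' },
  { id: 'cart', health: 'warning' }
]

let active: HealthStatus[] = []
active = toggleStatus(active, 'critical')
console.log(visibleNodes(nodes, active).map(n => n.id)) // [ 'checkout' ]

active = toggleStatus(active, 'critical') // clicking the same badge again clears it
console.log(visibleNodes(nodes, active).length) // 3
```

Because `toggleStatus` returns a new array instead of mutating its input, it can be passed directly to a React state updater such as `setFilteredHealthStatuses(prev => toggleStatus(prev, status))`.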

Smart Tooltip Positioning

No more tooltips covering important information:

tooltip: {
  trigger: 'item',
  position: function(point: number[]) {
    // Place the tooltip's top-left corner 10px left of and 10px below the cursor
    return [point[0] - 10, point[1] + 10]
  },
  // Keep the tooltip inside the chart container
  confine: true
}

Service Details Panel

When you click a node, you get:

  • 📊 Real-time RED metrics (Rate, Errors, Duration)
  • 🤖 AI-powered health analysis
  • 💡 Specific recommendations
  • 📈 Historical trending graphs

Real-World Integration

Connected to the OpenTelemetry demo, our visualization monitors 13 real services generating hundreds of thousands of spans:

import axios from 'axios'

// `params` and `enrichWithIntelligentThresholds` are not shown in this excerpt.
const response = await axios.post(
  'http://localhost:4319/api/ai-analyzer/topology-visualization',
  { timeRange: params }
)

// Transform and enrich with intelligent thresholds
const transformedData = {
  ...response.data,
  nodes: response.data.nodes?.map((node: any) => ({
    ...node,
    metrics: enrichWithIntelligentThresholds(node.metrics)
  }))
}

Performance at Scale

The visualization handles large topologies efficiently:

  • Force-directed layout: Automatic organization of complex service meshes
  • Dynamic filtering: Instantly filter 100+ services by health status
  • Optimized rendering: Smooth interactions even with heavy data
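For readers who have not used ECharts graphs before, a force-directed topology comes down to a single `graph` series with `layout: 'force'`. The sketch below builds such an option object as plain data. The `ServiceNode` and `ServiceEdge` shapes, the function name, and the specific `repulsion`, `edgeLength`, and `symbolSize` values are assumptions; the colors are the ones used in this post.

```typescript
type Status = 0 | 1 | 2 // 0 = healthy, 1 = warning, 2 = critical

// Hypothetical input shapes for illustration.
interface ServiceNode {
  id: string
  status: Status
}

interface ServiceEdge {
  source: string
  target: string
}

const STATUS_COLORS: Record<Status, string> = {
  0: '#52c41a', // healthy - green
  1: '#faad14', // warning - yellow
  2: '#f5222d'  // critical - red
}

// Builds an ECharts option with one force-directed graph series.
function buildTopologyOption(nodes: ServiceNode[], edges: ServiceEdge[]) {
  return {
    tooltip: { trigger: 'item', confine: true },
    series: [
      {
        type: 'graph',
        layout: 'force',
        roam: true,      // allow pan and zoom
        draggable: true, // allow dragging individual nodes
        label: { show: true },
        force: { repulsion: 200, edgeLength: 120 }, // tuning values are guesses
        data: nodes.map(n => ({
          id: n.id,
          name: n.id,
          symbolSize: 40,
          itemStyle: { color: STATUS_COLORS[n.status] }
        })),
        links: edges.map(e => ({ source: e.source, target: e.target }))
      }
    ]
  }
}

const option = buildTopologyOption(
  [
    { id: 'frontend', status: 0 },
    { id: 'checkout', status: 2 }
  ],
  [{ source: 'frontend', target: 'checkout' }]
)
console.log(JSON.stringify(option.series[0].data))
```

In a browser, the result would be passed to a chart instance created with `echarts.init(container)` via `chart.setOption(option)`.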

Key Technical Innovations

1. Visual Health Representation

// Color-coded health status for immediate visual feedback
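const getNodeOverallHealthColor = (metrics?: ServiceMetricsDetail): string => {
  // `metrics` is optional; treating a missing value as healthy (status 0)
  // is an assumption made here so the function cannot throw on undefined.
  const statuses = [
    metrics?.rateStatus ?? 0,
    metrics?.errorStatus ?? 0,
    metrics?.durationStatus ?? 0
  ]
  // The worst individual status decides the node color
  const maxStatus = Math.max(...statuses)

  if (maxStatus === 2) return '#f5222d' // Critical - red
  if (maxStatus === 1) return '#faad14' // Warning - yellow
  return '#52c41a' // Healthy - green
}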

2. Interactive Filtering

// Click health badges to filter topology view
const handleHealthFilter = (status: string) => {
  setFilteredHealthStatuses(prev => 
    prev.includes(status) 
      ? prev.filter(s => s !== status)
      : [...prev, status]
  )
}

3. Edge Intelligence

Show operation-level breakdowns on service connections:

operations: [
  { name: 'GET /api/products', count: 45, errorRate: 0.001, avgDuration: 35 },
  { name: 'POST /api/checkout', count: 45, errorRate: 0.005, avgDuration: 55 }
]
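An edge usually also needs a single summary value for its tooltip or line style. One way to derive it from the operation breakdown is a call-count-weighted average, sketched below. Whether the platform aggregates this way is not stated in the post, so treat `summarizeEdge` and the `EdgeSummary` shape as hypothetical.

```typescript
interface OperationStats {
  name: string
  count: number
  errorRate: number   // same unit as the input data
  avgDuration: number // same unit as the input data
}

// Hypothetical edge-level roll-up.
interface EdgeSummary {
  totalCount: number
  errorRate: number
  avgDuration: number
}

// Weights each operation by its call count so busy operations dominate.
function summarizeEdge(operations: OperationStats[]): EdgeSummary {
  const totalCount = operations.reduce((sum, op) => sum + op.count, 0)
  if (totalCount === 0) return { totalCount: 0, errorRate: 0, avgDuration: 0 }

  const weighted = (pick: (op: OperationStats) => number): number =>
    operations.reduce((sum, op) => sum + pick(op) * op.count, 0) / totalCount

  return {
    totalCount,
    errorRate: weighted(op => op.errorRate),
    avgDuration: weighted(op => op.avgDuration)
  }
}

const summary = summarizeEdge([
  { name: 'GET /api/products', count: 45, errorRate: 0.001, avgDuration: 35 },
  { name: 'POST /api/checkout', count: 45, errorRate: 0.005, avgDuration: 55 }
])
console.log(summary) // totalCount 90, avgDuration 45, errorRate approximately 0.003
```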

Testing & Quality

All 13 e2e tests pass, validating:

  • ✅ Topology rendering and interactions
  • ✅ Health filtering functionality
  • ✅ Service panel display
  • ✅ Tooltip positioning
  • ✅ Real data integration

pnpm test:e2e
# ✓ 13 passed (31.3s)

Lessons Learned

  1. Infrastructure Investment Pays Dividends: The 16 days spent on testing, CI/CD, and real data integration made this 4-hour sprint possible.

  2. Strategic Pivots Can Accelerate Progress: Sometimes the best plan is to capitalize on momentum and deliver value now.

  3. Start Simple, Build Intelligence: Basic thresholds today, autoencoder-learned patterns tomorrow. Ship value now, add intelligence iteratively.

  4. AI Enhances, Not Replaces: LLM explanations complement visual data; they don’t replace good visualization.

  5. Real Data Matters: Testing with the OpenTelemetry demo revealed edge cases mock data would miss.

  6. UX Details Count: Small improvements like tooltip positioning significantly impact usability.

Validating the 4-Hour Workday Approach

This implementation demonstrates that with proper infrastructure and AI assistance, we can deliver complete features in focused 4-hour sessions. The key isn’t working longer—it’s building the foundation that enables rapid delivery.

Consider what made this possible:

  • Automated Testing: Caught issues before they became problems
  • TypeScript + ESLint: Prevented entire categories of bugs
  • Real Data Pipeline: Validated against production-like scenarios
  • AI Code Generation: Accelerated boilerplate and implementation
  • Modular Architecture: Allowed focused feature development

We didn’t just build a feature today. We proved that the infrastructure investments of the past 16 days have created a platform for rapid, high-quality feature delivery.

What’s Next?

Tomorrow we’re focusing on:

  • Predictive Analytics: Use ML to predict issues before they happen
  • Custom Dashboards: Let users define their own service categories and thresholds
  • Alert Integration: Connect health monitoring to PagerDuty/Slack
  • Performance Optimization: Handle 1000+ service topologies

Try It Yourself

# Clone the repository
git clone https://github.com/clayroach/otel-ai.git
cd otel-ai

# Start the platform
pnpm dev:up

# Start the OpenTelemetry demo
pnpm demo:up

# Open the UI
open http://localhost:5173

# Navigate to AI Analyzer → Topology Graph

The Big Picture

We’re not just building another monitoring tool. We’re creating an AI-native observability platform that understands your architecture, learns from your patterns, and helps you make better decisions. The topology visualization is just the beginning.

Every service is different. Your monitoring should know that.

Building in public, learning in public. Follow the journey as we compress 12 months of enterprise development into 30 days with AI.

Day 17 Status: ✅ Topology visualization complete with intelligent health monitoring
Lines of Code: ~500 added
Tests Passing: 13/13
Services Monitored: 13 real services
Time Invested: <4 focused hours
AI Agents Created: 1 (code-implementation-agent)

Special thanks to Cole Medin’s YouTube channel for AI development workflow inspiration!


