This content originally appeared on DEV Community and was authored by Clay Roach
Day 17: Building Topology Visualization with AI-Assisted Health Monitoring
The Strategic Pivot That Paid Off
Sometimes the best architectural decision is knowing when to pivot. Today, instead of continuing with the planned infrastructure work, we made a strategic call: implement the topology visualization feature that had been on our roadmap. The result? A complete, production-ready feature delivered in under 4 hours.
This wasn’t luck. This was the payoff from 16 days of infrastructure investment.
AI-Powered Insights in Action
The topology visualization is just the visual layer. The real power comes from the AI analysis that provides actionable insights.
Each model brings different perspectives:
- Claude: Architectural pattern analysis and system design insights
- GPT-4: Performance optimization opportunities
- Llama: Resource utilization and scalability analysis
- Local Statistical: Pure metrics-based anomaly detection
Why the Pivot Worked: The Infrastructure Foundation
The decision to pause other work and focus on topology visualization succeeded because of four key infrastructure investments:
1. AI Agent Infrastructure (Inspired by @ColeMedin)
A special shoutout to Cole Medin, whose YouTube videos on AI-assisted development inspired today’s tooling improvements. After reviewing his content this morning, we created the code-implementation-agent, a specialized Claude Code agent that transforms design documents into production-ready Effect-TS code with strong typing and comprehensive tests.
This agent was instrumental in today’s rapid implementation:
# .claude/agents/code-implementation-agent.md
Purpose: Transform design documents into Effect-TS code
Tools: Read, Write, Edit, MultiEdit, Glob, Grep
Capabilities:
- Creates interfaces and schemas first
- Implements services with Effect patterns
- Generates unit and integration tests
- Ensures no "any" types or eslint issues
The agent-based approach meant we could focus on architecture while the AI handled boilerplate and implementation details.
2. Comprehensive Test Infrastructure (Days 5-7)
pnpm test:e2e
# ✓ 13 tests passing
# Total time: 31.3s
Our e2e test suite caught issues immediately:
- TypeScript errors flagged before runtime
- Component integration issues detected early
- Real data flow validation with OpenTelemetry demo
3. CI/CD Pipeline (Days 10-12)
The automated pipeline caught and fixed:
- Missing type definitions
- ESLint violations
- Unused imports and variables
- Breaking changes in real-time
4. Real Data Integration (Day 14)
Having the OpenTelemetry demo integrated meant:
- Immediate validation with 13 real services
- Realistic performance metrics
- Edge cases we wouldn’t have imagined
The 4-Hour Implementation Sprint
Here’s how we delivered a complete feature in less than half a workday:
Hour 0.5: Agent Setup & Planning
- Reviewed Cole Medin’s AI workflow videos
- Created code-implementation-agent for Effect-TS patterns
- Set up ADR-013 as the design document
Hour 1: Core Visualization (with code-implementation-agent)
- Agent generated ECharts force-directed graph setup
- Automated node and edge data structures
- Initial health color mapping with proper TypeScript types
Hour 2: Intelligence Layer
- Service-specific thresholds implementation
- LLM health explanations with Effect-TS schemas
- Context-aware recommendations system
Hour 3: UI Polish & Integration
- Tooltip positioning fixes (caught by e2e tests)
- Service panel layout optimization
- Interactive health filters with state management
Hour 3.5: Testing & Refinement
- All 13 e2e tests passing
- TypeScript errors resolved by CI/CD
- Production ready with zero “any” types
The Challenge: Context-Aware Health Monitoring
Not all services are created equal. A 500ms response time might be perfectly acceptable for a reporting service but catastrophic for a payment gateway. Traditional monitoring treats every service the same, leading to alert fatigue and missed critical issues.
The Solution: Dynamic Health Visualization with AI Insights
We’ve built a topology visualization that displays service health dynamically, with the foundation for intelligent monitoring that will learn from your system over time.
Current Implementation: Visual Health Indicators
For now, we use basic thresholds to provide immediate visual feedback:
// Temporary thresholds for visualization
// These will be replaced by autoencoder-learned patterns
errorStatus: node.metrics.errorRate > 5 ? 2 : node.metrics.errorRate > 1 ? 1 : 0,
durationStatus: node.metrics.duration > 500 ? 2 : node.metrics.duration > 200 ? 1 : 0,
rateStatus: node.metrics.rate < 1 ? 1 : node.metrics.rate > 200 ? 1 : 0
These are intentionally simple because the real intelligence will come from learned patterns rather than fixed cutoffs.
Next Steps: Autoencoder-Based Learning
The next phase involves implementing the autoencoder for pattern learning:
- Pattern Learning: The autoencoder will learn normal behavior patterns for each service over time
- Anomaly Detection: Deviations from learned patterns will trigger alerts, not arbitrary thresholds
- Adaptive Thresholds: Each service gets its own learned baseline based on historical data
- Continuous Learning: The system adapts as your architecture evolves
Why we’re not using hard-coded service-type rules:
- Every deployment is different: Your payment service != someone else’s payment service
- Context matters: A service’s “normal” depends on time of day, load, dependencies
- Evolution over time: Services change, thresholds should adapt automatically
- Avoid assumptions: Let the data tell us what’s normal, not our preconceptions
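The autoencoder is not built yet, so the sketch below only illustrates the "own learned baseline per service" idea with something much simpler: a running mean and standard deviation per service (Welford's algorithm) and a z-score check. This is plain statistics, not an autoencoder, and every name here (`ServiceBaseline`, `baselineFor`, the 2-sigma and 3-sigma cutoffs) is hypothetical rather than taken from the project.

```typescript
// Hypothetical stand-in for a learned baseline: running mean/variance per service.
// This is simple statistics (Welford's algorithm), NOT the planned autoencoder.
type Status = 0 | 1 | 2 // 0 = healthy, 1 = warning, 2 = critical

class ServiceBaseline {
  private count = 0
  private mean = 0
  private m2 = 0 // running sum of squared deviations from the mean

  // Fold one observation (for example a latency sample in ms) into the baseline.
  observe(value: number): void {
    this.count += 1
    const delta = value - this.mean
    this.mean += delta / this.count
    this.m2 += delta * (value - this.mean)
  }

  // Sample standard deviation; 0 until there are at least two observations.
  stdDev(): number {
    return this.count > 1 ? Math.sqrt(this.m2 / (this.count - 1)) : 0
  }

  // Classify a value by its distance from this service's own history.
  // The 2-sigma and 3-sigma cutoffs are illustrative assumptions.
  classify(value: number): Status {
    const sd = this.stdDev()
    if (sd === 0) return 0 // no variation observed yet, nothing to compare against
    const z = Math.abs(value - this.mean) / sd
    return z > 3 ? 2 : z > 2 ? 1 : 0
  }
}

const baselines = new Map<string, ServiceBaseline>()

// Look up (or lazily create) the baseline for one service.
function baselineFor(service: string): ServiceBaseline {
  let baseline = baselines.get(service)
  if (!baseline) {
    baseline = new ServiceBaseline()
    baselines.set(service, baseline)
  }
  return baseline
}

// Made-up latency samples: a slow reporting service and a fast payment service.
for (const ms of [480, 500, 520, 510, 490]) baselineFor('reporting').observe(ms)
for (const ms of [40, 50, 60, 45, 55]) baselineFor('payment').observe(ms)
```

With these made-up samples, a 500ms response sits at the mean for the reporting service but far outside the payment service's history, which is the behavior a fixed 500ms threshold cannot express.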
AI-Powered Health Explanations
But we didn’t stop at thresholds. Each service gets an AI-generated health explanation that provides context and actionable recommendations:
export function generateHealthExplanation(
  serviceName: string,
  metrics?: ServiceMetricsDetail
): HealthExplanation {
  // Analyze each metric with context
  const impactedMetrics: HealthExplanation['impactedMetrics'] = []
  // (excerpt: assumes `metrics` is defined from here on; the code that populates
  // `status`, `criticalMetrics`, and `recommendations` is omitted)

  // Smart analysis based on metric combinations
  if (metrics.errorStatus >= 1 && metrics.durationStatus >= 1) {
    recommendations.push('Combined high errors and latency suggest infrastructure or dependency issues')
  }
  if (metrics.rateStatus === 2 && metrics.errorStatus === 0) {
    recommendations.push('High traffic with low errors indicates successful scaling - monitor resource usage')
  }
  return {
    status,
    summary: `${serviceName} is experiencing critical issues with ${criticalMetrics.join(', ')}. Immediate action required.`,
    // De-duplicate recommendations before returning them
    recommendations: [...new Set(recommendations)]
  }
}
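The excerpt above leaves out several declarations, so here is a self-contained sketch of how such a function could fit together. The `ServiceMetricsDetail` and `HealthExplanation` shapes, the `'unknown'` status for missing metrics, and the warning and healthy summaries are all assumptions made for illustration; only the two recommendation rules and the critical summary come from the excerpt.

```typescript
// Hypothetical shapes: the article does not show the real type definitions.
type Status = 0 | 1 | 2
interface ServiceMetricsDetail {
  rateStatus: Status
  errorStatus: Status
  durationStatus: Status
}
interface HealthExplanation {
  status: 'healthy' | 'warning' | 'critical' | 'unknown'
  summary: string
  impactedMetrics: string[]
  recommendations: string[]
}

function generateHealthExplanation(
  serviceName: string,
  metrics?: ServiceMetricsDetail
): HealthExplanation {
  // Assumption: a service without metrics is reported as 'unknown'.
  if (!metrics) {
    return {
      status: 'unknown',
      summary: `${serviceName} has no metrics yet.`,
      impactedMetrics: [],
      recommendations: []
    }
  }
  const named: Array<[string, Status]> = [
    ['request rate', metrics.rateStatus],
    ['error rate', metrics.errorStatus],
    ['latency', metrics.durationStatus]
  ]
  const impactedMetrics = named.filter(([, s]) => s >= 1).map(([name]) => name)
  const criticalMetrics = named.filter(([, s]) => s === 2).map(([name]) => name)

  // The two combination rules shown in the excerpt.
  const recommendations: string[] = []
  if (metrics.errorStatus >= 1 && metrics.durationStatus >= 1) {
    recommendations.push('Combined high errors and latency suggest infrastructure or dependency issues')
  }
  if (metrics.rateStatus === 2 && metrics.errorStatus === 0) {
    recommendations.push('High traffic with low errors indicates successful scaling - monitor resource usage')
  }

  const status: HealthExplanation['status'] =
    criticalMetrics.length > 0 ? 'critical' : impactedMetrics.length > 0 ? 'warning' : 'healthy'
  const summary =
    status === 'critical'
      ? `${serviceName} is experiencing critical issues with ${criticalMetrics.join(', ')}. Immediate action required.`
      : status === 'warning'
        ? `${serviceName} shows elevated ${impactedMetrics.join(', ')}.`
        : `${serviceName} is healthy.`

  return { status, summary, impactedMetrics, recommendations: Array.from(new Set(recommendations)) }
}
```

Handling the missing-metrics case up front keeps the rest of the function free of optional chaining, which matters because the signature declares `metrics` as optional.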
User Experience Features
Interactive Health Filtering
Click any health badge to filter the topology:
const handleHealthFilter = (status: string) => {
  setFilteredHealthStatuses(prev => {
    if (prev.includes(status)) {
      return prev.filter(s => s !== status)
    } else {
      return [...prev, status]
    }
  })
}
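The handler only toggles which statuses are selected; it does not show how the selection is applied to the graph. A minimal sketch of both halves as pure functions follows. The `TopologyNode` shape, the `HealthStatus` names, and the rule that an empty selection shows everything are assumptions, not details confirmed by the article.

```typescript
// Hypothetical node shape: only the fields needed for filtering are modeled.
type HealthStatus = 'healthy' | 'warning' | 'critical'
interface TopologyNode {
  id: string
  status: HealthStatus
}

// Pure version of the toggle: remove the status if selected, otherwise add it.
function toggleStatus(selected: readonly HealthStatus[], status: HealthStatus): HealthStatus[] {
  return selected.includes(status)
    ? selected.filter(s => s !== status)
    : [...selected, status]
}

// Assumption: an empty selection means "no filter", so every node stays visible.
function visibleNodes(nodes: readonly TopologyNode[], selected: readonly HealthStatus[]): TopologyNode[] {
  return selected.length === 0 ? [...nodes] : nodes.filter(n => selected.includes(n.status))
}
```

Keeping the toggle pure makes it easy to unit test and lets the React state setter stay a one-liner that passes `prev` through `toggleStatus`.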
Smart Tooltip Positioning
No more tooltips covering important information:
tooltip: {
  trigger: 'item',
  position: function(point: number[]) {
    // The returned pair is the tooltip's top-left corner:
    // 10px left of and 10px below the cursor
    return [point[0] - 10, point[1] + 10]
  },
  // Keep the tooltip inside the chart container
  confine: true
}
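ECharts also passes the tooltip and chart sizes to the `position` callback (a `size` argument with `contentSize` and `viewSize`, each a `[width, height]` pair), which makes it possible to place the whole tooltip box to the lower left of the cursor instead of only offsetting its corner. The helper below is a sketch of that calculation; the function name, the 10px gap, and the flip rule are assumptions rather than the project's code.

```typescript
// Pure helper taking the cursor point plus the size information ECharts provides
// to a tooltip `position` callback. Name, gap, and flip rule are illustrative.
type Pair = [number, number]
interface TooltipSize {
  contentSize: Pair // tooltip box [width, height]
  viewSize: Pair // chart container [width, height]
}

function tooltipBelowLeft(point: Pair, size: TooltipSize, gap = 10): Pair {
  const [cursorX, cursorY] = point
  const [tipW, tipH] = size.contentSize
  const [viewW, viewH] = size.viewSize

  // Preferred spot: right edge `gap` px left of the cursor, top edge `gap` px below it.
  let x = cursorX - tipW - gap
  let y = cursorY + gap

  // Flip to the opposite side of the cursor when the preferred spot would overflow.
  if (x < 0) x = cursorX + gap
  if (y + tipH > viewH) y = cursorY - tipH - gap

  // Final clamp keeps the box on screen even when the tooltip is very large.
  x = Math.max(0, Math.min(x, viewW - tipW))
  y = Math.max(0, Math.min(y, viewH - tipH))
  return [x, y]
}
```

Inside an option object, the callback would forward its first argument and its `size` argument to this helper. With `confine: true` already set, the clamp is redundant in practice, but it keeps the helper correct on its own.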
Service Details Panel
When you click a node, you get:
- Real-time RED metrics (Rate, Errors, Duration)
- AI-powered health analysis
- Specific recommendations
- Historical trending graphs
Real-World Integration
Connected to the OpenTelemetry demo, our visualization monitors 13 real services generating hundreds of thousands of spans:
import axios from 'axios'

// Request the topology for the selected time range
const response = await axios.post(
  'http://localhost:4319/api/ai-analyzer/topology-visualization',
  { timeRange: params }
)
// Transform and enrich with intelligent thresholds
const transformedData = {
  ...response.data,
  nodes: response.data.nodes?.map((node: any) => ({
    ...node,
    metrics: enrichWithIntelligentThresholds(node.metrics)
  }))
}
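The body of `enrichWithIntelligentThresholds` is not shown. One plausible reading, and it is only an assumption, is that it applies the temporary thresholds listed earlier to each node's raw metrics. The sketch below does exactly that; the `RawMetrics` and `EnrichedMetrics` shapes and the units are hypothetical.

```typescript
// Hypothetical metric shapes. Units are assumed: errorRate in percent,
// duration in ms, rate in requests per second.
interface RawMetrics {
  rate: number
  errorRate: number
  duration: number
}
type Status = 0 | 1 | 2 // 0 = healthy, 1 = warning, 2 = critical
interface EnrichedMetrics extends RawMetrics {
  rateStatus: Status
  errorStatus: Status
  durationStatus: Status
}

// Assumed implementation: attach status codes using the temporary thresholds.
function enrichWithIntelligentThresholds(m: RawMetrics): EnrichedMetrics {
  return {
    ...m,
    errorStatus: m.errorRate > 5 ? 2 : m.errorRate > 1 ? 1 : 0,
    durationStatus: m.duration > 500 ? 2 : m.duration > 200 ? 1 : 0,
    // As in the original thresholds, an unusual rate only ever raises a warning.
    rateStatus: m.rate < 1 || m.rate > 200 ? 1 : 0
  }
}
```

Typing the input this way would also let the `map` callback drop its `any` annotation, in keeping with the zero-`any` goal stated earlier.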
Performance at Scale
The visualization handles large topologies efficiently:
- Force-directed layout: Automatic organization of complex service meshes
- Dynamic filtering: Instantly filter 100+ services by health status
- Optimized rendering: Smooth interactions even with heavy data
Key Technical Innovations
1. Visual Health Representation
// Color-coded health status for immediate visual feedback
const getNodeOverallHealthColor = (metrics?: ServiceMetricsDetail): string => {
  // `metrics` is optional; treating a missing value as healthy (0) is an assumption
  const statuses = [metrics?.rateStatus ?? 0, metrics?.errorStatus ?? 0, metrics?.durationStatus ?? 0]
  // The worst individual status decides the node color
  const maxStatus = Math.max(...statuses)
  if (maxStatus === 2) return '#f5222d' // Critical - red
  if (maxStatus === 1) return '#faad14' // Warning - yellow
  return '#52c41a' // Healthy - green
}
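To show where a health color ends up, here is a sketch of a builder for a force-directed ECharts `graph` series. It returns a plain options object and does not import ECharts, so it can be inspected on its own. The `ServiceNode` and `ServiceEdge` shapes, the function name, and the `repulsion` and `edgeLength` values are assumptions, not the project's actual configuration.

```typescript
// Hypothetical input shapes for the sketch.
interface ServiceNode {
  id: string
  color: string // for example the result of a health-color function
}
interface ServiceEdge {
  source: string
  target: string
}

// Builds a plain object in the shape ECharts expects for a force-directed
// `graph` series. The force values are guesses to be tuned per topology.
function buildTopologyOption(nodes: ServiceNode[], edges: ServiceEdge[]) {
  return {
    tooltip: { trigger: 'item', confine: true },
    series: [
      {
        type: 'graph',
        layout: 'force',
        roam: true, // allow pan and zoom
        label: { show: true },
        force: { repulsion: 200, edgeLength: 120 },
        // Each node carries its own color through itemStyle
        data: nodes.map(n => ({ id: n.id, name: n.id, itemStyle: { color: n.color } })),
        links: edges.map(e => ({ source: e.source, target: e.target }))
      }
    ]
  }
}
```

The resulting object would be handed to a chart instance's `setOption`; setting `name` equal to `id` keeps the link endpoints unambiguous.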
2. Interactive Filtering
// Click health badges to filter topology view
const handleHealthFilter = (status: string) => {
  setFilteredHealthStatuses(prev =>
    prev.includes(status)
      ? prev.filter(s => s !== status)
      : [...prev, status]
  )
}
3. Edge Intelligence
Show operation-level breakdowns on service connections:
operations: [
  { name: 'GET /api/products', count: 45, errorRate: 0.001, avgDuration: 35 },
  { name: 'POST /api/checkout', count: 45, errorRate: 0.005, avgDuration: 55 }
]
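An edge usually also needs a single summary (for line width or color) in addition to the per-operation rows. One way to derive it, sketched here under the assumption that `count` is a call count and the other two fields are per-operation averages, is a count-weighted mean. The `summarizeEdge` name and the type names are hypothetical.

```typescript
// Field names follow the sample above; the type names are hypothetical.
interface OperationStats {
  name: string
  count: number
  errorRate: number
  avgDuration: number
}
interface EdgeSummary {
  count: number
  errorRate: number
  avgDuration: number
}

// Roll operation-level stats up to one edge-level summary, weighting each
// operation's averages by how many calls it contributed.
function summarizeEdge(operations: OperationStats[]): EdgeSummary {
  const count = operations.reduce((sum, op) => sum + op.count, 0)
  if (count === 0) return { count: 0, errorRate: 0, avgDuration: 0 }
  const weighted = (pick: (op: OperationStats) => number): number =>
    operations.reduce((sum, op) => sum + pick(op) * op.count, 0) / count
  return {
    count,
    errorRate: weighted(op => op.errorRate),
    avgDuration: weighted(op => op.avgDuration)
  }
}
```

Weighting by call count prevents a rarely used but slow operation from dominating the edge's reported latency.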
Testing & Quality
All 13 e2e tests pass, validating:
- Topology rendering and interactions
- Health filtering functionality
- Service panel display
- Tooltip positioning
- Real data integration
pnpm test:e2e
# ✓ 13 passed (31.3s)
Lessons Learned
- Infrastructure Investment Pays Dividends: The 16 days spent on testing, CI/CD, and real data integration made this 4-hour sprint possible.
- Strategic Pivots Can Accelerate Progress: Sometimes the best plan is to capitalize on momentum and deliver value now.
- Start Simple, Build Intelligence: Basic thresholds today, autoencoder-learned patterns tomorrow. Ship value now, add intelligence iteratively.
- AI Enhances, Not Replaces: LLM explanations complement visual data; they don’t replace good visualization.
- Real Data Matters: Testing with the OpenTelemetry demo revealed edge cases mock data would miss.
- UX Details Count: Small improvements like tooltip positioning significantly impact usability.
Validating the 4-Hour Workday Approach
This implementation demonstrates that with proper infrastructure and AI assistance, we can deliver complete features in focused 4-hour sessions. The key isn’t working longer—it’s building the foundation that enables rapid delivery.
Consider what made this possible:
- Automated Testing: Caught issues before they became problems
- TypeScript + ESLint: Prevented entire categories of bugs
- Real Data Pipeline: Validated against production-like scenarios
- AI Code Generation: Accelerated boilerplate and implementation
- Modular Architecture: Allowed focused feature development
We didn’t just build a feature today. We proved that the infrastructure investments of the past 16 days have created a platform for rapid, high-quality feature delivery.
What’s Next?
Tomorrow we’re focusing on:
- Predictive Analytics: Use ML to predict issues before they happen
- Custom Dashboards: Let users define their own service categories and thresholds
- Alert Integration: Connect health monitoring to PagerDuty/Slack
- Performance Optimization: Handle 1000+ service topologies
Try It Yourself
# Clone the repository
git clone https://github.com/clayroach/otel-ai.git
cd otel-ai
# Start the platform
pnpm dev:up
# Start the OpenTelemetry demo
pnpm demo:up
# Open the UI
open http://localhost:5173
# Navigate to AI Analyzer → Topology Graph
The Big Picture
We’re not just building another monitoring tool. We’re creating an AI-native observability platform that understands your architecture, learns from your patterns, and helps you make better decisions. The topology visualization is just the beginning.
Every service is different. Your monitoring should know that.
Building in public, learning in public. Follow the journey as we compress 12 months of enterprise development into 30 days with AI.
Day 17 Status: Topology visualization complete with intelligent health monitoring
Lines of Code: ~500 added
Tests Passing: 13/13
Services Monitored: 13 real services
Time Invested: <4 focused hours
AI Agents Created: 1 (code-implementation-agent)
Special thanks to Cole Medin’s YouTube channel for AI development workflow inspiration!