Day 6: When Protobuf Breaks Everything – Real Engineering in the Trenches



This content originally appeared on DEV Community and was authored by Clay Roach

Day 6: When Protobuf Breaks Everything 🔥

The Plan: Add real-time updates and bootstrap AI anomaly detection.
The Reality: “Why are all my operations named ‘protobuf-fallback-trace’?!”

Welcome to Day 6 of building an AI-native observability platform in 30 days. Today was supposed to be about sexy features. Instead, it was about the unglamorous reality of systems engineering: making protobuf work correctly.

The Problem That Changed Everything

I started the day confident. The OpenTelemetry demo was running, traces were flowing, the UI was displaying data. Time to add real-time updates, right?

Then I looked closer at the trace details:

// What I expected:
{
  service: "CartService",
  operation: "AddItemToCart",
  duration: 125
}

// What I got:
{
  service: "CartService", 
  operation: "protobuf-fallback-trace", // 😱
  duration: 50
}

Every. Single. Operation. Was named “protobuf-fallback-trace”.

The Investigation Begins

Discovery #1: Gzip Was Being Ignored

The OpenTelemetry demo sends protobuf data with gzip compression. My middleware had “clever” conditional logic:

// The broken approach
app.use('/v1*', (req, res, next) => {
  if (req.headers['content-type']?.includes('protobuf')) {
    // Special handling that SKIPPED gzip decompression 🤦
    express.raw({ type: 'application/x-protobuf' })(req, res, next)
  } else {
    express.json()(req, res, next)
  }
})

The fix was embarrassingly simple:

// The working approach
app.use('/v1*', express.raw({ 
  limit: '10mb',
  type: '*/*',
  inflate: true  // THIS enables gzip decompression for ALL content types
}))

Lesson: Sometimes “clever” code is just complicated code. Unified handling often beats conditional logic.

Discovery #2: Protobufjs vs ES Modules

Next challenge: parsing the actual protobuf data. The protobufjs library is CommonJS, but my project uses ES modules. This led to hours of:

// Attempt 1: Named imports (doesn't work)
import { load } from 'protobufjs' // ❌ "Named export 'load' not found"

// Attempt 2: What actually works
import pkg from 'protobufjs'
const { load } = pkg // ✅
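
If the default-import workaround ever breaks (CJS interop varies between Node versions and bundlers), Node's createRequire is another escape hatch. A minimal sketch, not the code from the repo:

// Alternative: pull the CommonJS module in via createRequire
import { createRequire } from 'module'
const require = createRequire(import.meta.url)
const protobuf = require('protobufjs')
const { load, Root } = protobuf // ✅ same API, no interop guessing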

Discovery #3: Path Resolution Hell

Even with protobufjs loading, the OTLP protobuf definitions have imports that need custom resolution:

// The protobuf loader that finally worked
const { Root } = pkg
this.root = new Root()
this.root.resolvePath = (origin: string, target: string) => {
  // Custom resolution for OTLP imports
  if (target.startsWith('opentelemetry/')) {
    return path.join(protoPath, target)
  }
  return path.resolve(path.dirname(origin), target)
}
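
Once the resolver is in place, using the loader is straightforward. Here is a rough sketch of the decode path; the proto file path matches the upstream OTLP layout, but the surrounding class shape is an assumption, not copied from the repo:

// Load the OTLP trace definitions and decode an incoming request body
await this.root.load(
  path.join(protoPath, 'opentelemetry/proto/collector/trace/v1/trace_service.proto')
)
const ExportTraceServiceRequest = this.root.lookupType(
  'opentelemetry.proto.collector.trace.v1.ExportTraceServiceRequest'
)

// req.body is the raw (already inflated) Buffer from express.raw
const message = ExportTraceServiceRequest.decode(req.body)
const traces = ExportTraceServiceRequest.toObject(message, { longs: String })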

The Nuclear Option: Enhanced Fallback Parsing

When the “proper” protobuf parsing kept failing, I built something unconventional – a raw protobuf parser that extracts data through pattern matching:

function parseOTLPFromRaw(buffer: Buffer): any {
  const data = buffer.toString('latin1')

  // Extract service names by pattern
  const serviceMatches = [...data.matchAll(
    /service\.name[\x00-\x20]*([a-zA-Z][a-zA-Z0-9\-_]+)/g
  )]
  const serviceName = serviceMatches[0]?.[1] ?? 'unknown-service'

  // Extract operation name candidates (simplified pattern for illustration)
  const operationMatches = [...data.matchAll(
    /([a-zA-Z][a-zA-Z0-9\-_./]+)/g
  )]
  const operationCandidates = operationMatches
    .map(match => match[1])
    .filter(op =>
      op.length > 3 &&
      !op.match(/^[0-9a-f]+$/) && // Skip hex strings (span/trace IDs)
      (op.includes('.') || op.includes('/') || op.includes('_'))
    )

  // Build spans from extracted data
  return {
    resourceSpans: [{
      resource: { 
        attributes: [
          { key: 'service.name', value: { stringValue: serviceName }}
        ]
      },
      scopeSpans: [{
        spans: operationCandidates.map(op => ({
          name: op, // Real operation names!
          // ... rest of span data
        }))
      }]
    }]
  }
}

Is this elegant? No. Does it work? Absolutely.
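
For context, the fallback only runs when structured decoding throws. The wiring in the ingest route looks roughly like this (decodeOTLPProtobuf and storeTraces are illustrative names, not the exact functions in the repo):

app.post('/v1/traces', async (req, res) => {
  let payload: any
  try {
    // Preferred path: real protobuf decoding via protobufjs
    payload = decodeOTLPProtobuf(req.body)
  } catch (err) {
    console.warn('protobuf decode failed, falling back to raw parser', err)
    payload = parseOTLPFromRaw(req.body)
  }
  await storeTraces(payload) // write resourceSpans to ClickHouse
  res.status(200).end()
})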

The Results

After 8 hours of protobuf wrestling:

Before:

  • ❌ All operations: “protobuf-fallback-trace”
  • ❌ 1 fake span per trace
  • ❌ No real telemetry data

After:

  • ✅ Real operations: oteldemo.AdService, CartService.AddItem
  • ✅ 10+ real spans per trace
  • ✅ Authentic resource attributes and timing data

Key Learnings

1. Fallback Strategies Are Not Defeat

Building a fallback parser wasn’t giving up – it was ensuring the system works even when dependencies fail. In production, working beats perfect.

2. Debug at the Lowest Level

I spent hours assuming the protobuf data was corrupt. Finally logging the raw buffer bytes revealed it was fine – the decompression was being skipped.
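
That check is worth spelling out: dump the first bytes of the incoming body and look for the gzip magic number (1f 8b). A sketch of the kind of logging that helps, assuming the handler receives the raw Buffer from express.raw:

// A gzip stream starts with the magic bytes 0x1f 0x8b
const head = (req.body as Buffer).subarray(0, 4).toString('hex')
console.log('first body bytes:', head)
if (head.startsWith('1f8b')) {
  console.warn('body is still gzip-compressed; decompression was skipped upstream')
}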

3. Integration Points Are Where Systems Break

The individual components all worked:

  • ✅ OpenTelemetry demo: sending valid data
  • ✅ Express server: receiving requests
  • ✅ ClickHouse: storing data

The failure was in the glue between them.

4. Real Data Reveals Real Problems

Mock data would never have exposed this issue. Testing with the actual OpenTelemetry demo forced me to handle real-world complexity.

The Bigger Picture

Today didn’t go according to plan, and that’s exactly what building production systems is like. The glossy demo videos don’t show the 8 hours spent debugging why protobuf.load is not a function.

But here’s what matters: the system now correctly processes thousands of real traces from a production-like demo application. Every service is visible, every operation is named correctly, and the data flowing through the pipeline is authentic.

What’s Next (Day 7)

Now that protobuf parsing actually works:

  • Implement the real-time updates (for real this time)
  • Add WebSocket support for live trace streaming
  • Bootstrap the AI anomaly detection system
  • Create service dependency visualization

Code Snippets That Saved the Day

For anyone fighting similar battles:

# Debug protobuf data in container
docker compose exec backend xxd -l 100 /tmp/trace.pb

# Test gzip decompression
curl -X POST http://localhost:4319/v1/traces \
  -H "Content-Type: application/x-protobuf" \
  -H "Content-Encoding: gzip" \
  --data-binary @trace.pb.gz

# Check what protobufjs actually exports
node -e "console.log(Object.keys(require('protobufjs')))"
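
# Create the gzipped fixture used by the curl test above
# (assumes you saved a raw protobuf body to trace.pb)
gzip -k trace.pb   # writes trace.pb.gz alongside the original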

Conclusion

Day 6 was humbling. The plan was to build flashy features. Instead, I spent the day in the trenches making basic data ingestion work correctly.

But that’s real engineering. It’s not always about the elegant algorithm or the clever architecture. Sometimes it’s about making protobuf parsing work at 2 AM because your entire platform depends on it.

The platform is stronger because of today’s battles. And tomorrow, with real data flowing correctly, we can build the features that actually matter.

Are you fighting your own protobuf battles? Share your war stories in the comments. Sometimes knowing you’re not alone in the debugging trenches makes all the difference.

Progress: Day 6 of 30 ✅ | Protobuf: Finally Working | Sanity: Questionable

GitHub Repository | Follow the Journey

