Node.js Microservices: 7 Observability Checks Before Launch



This content originally appeared on DEV Community and was authored by Budventure Technologies

You don’t “add monitoring later.” If a microservice ships without observability, your on-call pays the tax.

Below is a pre-launch checklist we run on Node.js services. It’s short, opinionated, and battle-tested.

TL;DR (pin this)

1) RED metrics per route/operation (Rate, Errors, Duration).

2) SLOs + error budget policy (with burn-rate alerts).

3) Distributed tracing (OpenTelemetry, baggage for tenant/request IDs).

4) Queue depth & consumer lag (and DLQ rate) for each message bus.

5) Synthetic checks that hit public routes and critical user flows.

6) Liveness/Readiness that model real dependencies.

7) Release/rollback sanity (alert routing, dashboards, and “what page wakes whom”).

1) RED metrics (Prometheus with prom-client)

Measure Rate (RPS), Errors (non-2xx responses, broken down by status class), and Duration (p95/p99). Export per route/operation.

// metric.js
import client from 'prom-client';

export const registry = new client.Registry();

export const httpReqDur = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request duration',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.025, 0.05, 0.1, 0.25, 0.5, 1, 2, 5]
});
export const httpReqs = new client.Counter({
  name: 'http_requests_total',
  help: 'Total requests',
  labelNames: ['method', 'route']
});
export const httpErrors = new client.Counter({
  name: 'http_errors_total',
  help: 'Non-2xx responses',
  labelNames: ['method', 'route', 'status']
});

registry.registerMetric(httpReqDur);
registry.registerMetric(httpReqs);
registry.registerMetric(httpErrors);

// server.js (Express example)
app.use((req, res, next) => {
  // Prefer req.route?.path (the matched pattern) over req.path in real apps
  // to keep label cardinality bounded on parameterized routes
  const end = httpReqDur.startTimer({ method: req.method, route: req.path });
  res.on('finish', () => {
    httpReqs.inc({ method: req.method, route: req.path });
    if (res.statusCode >= 400) {
      httpErrors.inc({ method: req.method, route: req.path, status: String(res.statusCode) });
    }
    end({ status: String(res.statusCode) });
  });
  next();
});

app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', registry.contentType);
  res.end(await registry.metrics());
});

Dashboard: one panel each for RPS, Error %, and p95/p99 Duration per route.
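
Each panel is plain PromQL over the metrics above; a sketch, assuming the metric names from metric.js:

# RPS per route
sum by (route) (rate(http_requests_total[5m]))

# Error % per route
100 * sum by (route) (rate(http_errors_total[5m]))
    / sum by (route) (rate(http_requests_total[5m]))

# p95 duration per route (swap 0.95 for 0.99 to get p99)
histogram_quantile(0.95,
  sum by (route, le) (rate(http_request_duration_seconds_bucket[5m])))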

2) SLOs + error budgets
Pick SLIs that users feel. Example API SLI: availability = 1 − (5xx + timeouts) / total.

service: checkout-api
sli:
  type: events
  good: http_requests_total{status=~"2..|3.."}
  total: http_requests_total
slo: 99.9                        # monthly objective
alerting:
  burn_rates:
    - { window: 5m, rate: 14 }   # page (fast burn)
    - { window: 1h, rate: 6 }    # page
    - { window: 6h, rate: 3 }    # ticket

You page on budget burn, not on every 500.
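
In Prometheus alerting-rule form, the fast-burn page might look like this (a sketch, assuming the 99.9% objective above and the metric names from section 1):

groups:
  - name: checkout-api-slo
    rules:
      - alert: ErrorBudgetFastBurn
        # Burn rate 14 against a 0.1% monthly budget: the whole budget is gone in ~2 days
        expr: |
          (sum(rate(http_errors_total[5m])) / sum(rate(http_requests_total[5m]))) > (14 * 0.001)
          and
          (sum(rate(http_errors_total[1h])) / sum(rate(http_requests_total[1h]))) > (14 * 0.001)
        labels:
          severity: page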

3) Distributed tracing (OpenTelemetry)
Instrument HTTP, DB, and queue operations; propagate trace id + tenant id across services.

// tracing.js
import { NodeSDK } from '@opentelemetry/sdk-node';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { PrismaInstrumentation } from '@prisma/instrumentation';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
    new PrismaInstrumentation()
  ]
});
sdk.start();

Minimum: parent/child spans, HTTP attributes (route, status, target), DB statement summaries, and message-queue spans (publish/consume).
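
The TL;DR mentioned baggage for tenant/request IDs, which the SDK above doesn't add by itself. A minimal sketch using the @opentelemetry/api baggage helpers, where the x-tenant-id and x-request-id header names are assumptions:

// baggage.js
import { context, propagation } from '@opentelemetry/api';

// Express middleware: run the rest of the request inside a context that
// carries tenant/request IDs; the default W3C baggage propagator forwards
// them on outgoing instrumented HTTP calls.
export function tenantBaggage(req, _res, next) {
  const baggage = propagation.createBaggage({
    'tenant.id': { value: req.headers['x-tenant-id'] ?? 'unknown' },
    'request.id': { value: req.headers['x-request-id'] ?? '' }
  });
  context.with(propagation.setBaggage(context.active(), baggage), next);
}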

4) Queue depth & consumer lag

For RabbitMQ/Kafka/SQS, track:

  • Queue depth (messages ready).
  • Lag (Kafka consumer group lag).
  • Age of oldest message (or time-in-queue).
  • DLQ rate.

// Example: RabbitMQ depth via the management API (%2F is the URL-encoded "/" vhost;
// real calls need basic auth, see the sketch below)
const depth = await fetch(`${RMQ}/api/queues/%2F/orders`).then(r => r.json());
metrics.queueDepth.set({ queue: 'orders' }, depth.messages_ready);

Alert when depth/lag grows while consumer CPU is idle → likely a stuck handler or a poison message.
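
The metrics.queueDepth gauge above has to be defined somewhere. A minimal polling sketch, where the queue name, the RMQ_URL/RMQ_AUTH env vars, and Node 18+ global fetch are assumptions:

// queue-metrics.js
import client from 'prom-client';
import { registry } from './metric.js';

export const queueDepth = new client.Gauge({
  name: 'queue_depth_messages_ready',
  help: 'Messages ready (not yet delivered) per queue',
  labelNames: ['queue'],
  registers: [registry]
});

const RMQ = process.env.RMQ_URL; // e.g. http://rabbitmq:15672
const auth = Buffer.from(process.env.RMQ_AUTH ?? 'guest:guest').toString('base64');

// Poll the management API every 15s and export depth per queue
setInterval(async () => {
  const q = await fetch(`${RMQ}/api/queues/%2F/orders`, {
    headers: { Authorization: `Basic ${auth}` }
  }).then(r => r.json());
  queueDepth.set({ queue: 'orders' }, q.messages_ready);
}, 15_000);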

5) Synthetic checks (outside-in)
Hit public routes from multiple regions every minute; alert when error rate or latency breaks SLO.

// k6 smoke example (smoke.js)
import http from 'k6/http';
import { check } from 'k6';

export const options = { vus: 1, iterations: 10, thresholds: { http_req_duration: ['p(95)<500'] } };

export default function () {
  const res = http.get(`${__ENV.BASE_URL}/healthz`);
  check(res, { 'status 200': r => r.status === 200 });
}

Run smoke on deploy; run full flows (login → create → pay) on schedule.
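
A scheduled full-flow script follows the same shape. A sketch where the /login, /orders, and /pay endpoints and their payloads are purely illustrative:

// flow.js (run on a schedule, not on deploy)
import http from 'k6/http';
import { check } from 'k6';

export const options = { vus: 1, iterations: 1 };
const headers = { 'Content-Type': 'application/json' };

export default function () {
  const login = http.post(`${__ENV.BASE_URL}/login`,
    JSON.stringify({ user: 'synthetic', pass: __ENV.SYNTH_PASS }), { headers });
  check(login, { 'login ok': r => r.status === 200 });

  const order = http.post(`${__ENV.BASE_URL}/orders`,
    JSON.stringify({ sku: 'test-sku', qty: 1 }), { headers });
  check(order, { 'order created': r => r.status === 201 });

  const pay = http.post(`${__ENV.BASE_URL}/pay`,
    JSON.stringify({ orderId: order.json('id') }), { headers });
  check(pay, { 'payment ok': r => r.status === 200 });
}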

6) Liveness / Readiness
/healthz (liveness): process is alive; quick checks only.
/readyz (readiness): dependencies OK (DB ping, queue connect, config loaded). Fail readiness when backpressure kicks in.

app.get('/healthz', (_req, res) => res.send('ok'));

app.get('/readyz', async (_req, res) => {
  const ok = await db.ping() && await queue.ping();
  res.status(ok ? 200 : 503).send(ok ? 'ready' : 'not-ready');
});
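
The backpressure part isn't shown above. One option is Node's built-in event-loop delay monitor; a sketch where the 200 ms p99 threshold is an assumption to tune per service:

import { monitorEventLoopDelay } from 'perf_hooks';

const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

app.get('/readyz', async (_req, res) => {
  const depsOk = await db.ping() && await queue.ping();
  // percentile() returns nanoseconds; shed traffic when p99 loop delay passes ~200 ms
  const overloaded = loopDelay.percentile(99) / 1e6 > 200;
  const ok = depsOk && !overloaded;
  res.status(ok ? 200 : 503).send(ok ? 'ready' : 'not-ready');
});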

7) Release/rollback sanity

  • Log version/commit on every request (trace attribute + metric label; a sketch follows this list).
  • Dashboards pinned for latest version.
  • Alert routes: paging only for fast budget burn, tickets for slow burn.
  • Rollback plan documented (traffic switch, canary %, who approves).
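
For the version/commit item above, a minimal sketch; APP_VERSION and GIT_SHA are assumed env-var names:

// version.js
import { trace } from '@opentelemetry/api';

const version = process.env.APP_VERSION ?? process.env.GIT_SHA ?? 'dev';

app.use((_req, _res, next) => {
  // Stamp the active span so traces can be filtered by release; add the same
  // value as a `version` label on the RED metrics if cardinality allows
  trace.getActiveSpan()?.setAttribute('service.version', version);
  next();
});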

What we keep on one dashboard

  • RED per route (RPS, Error%, p95/p99).
  • SLO objective vs. actual & budget left.
  • Trace waterfall for 3 slowest endpoints.
  • Queue depth/lag + DLQ rate.
  • Synthetic latency (per region).
  • Deploy marker overlays.

If you want a lean Node.js microservice checklist we share with teams, ping me.

