This content originally appeared on DEV Community and was authored by Budventure Technologies
You don’t “add monitoring later.” If a microservice ships without observability, your on-call pays the tax.
Below is a pre-launch checklist we run on Node.js services. It’s short, opinionated, and battle-tested.
TL;DR (pin this)
1) RED metrics per route/operation (Rate, Errors, Duration).
2) SLOs + error budget policy (with burn-rate alerts).
3) Distributed tracing (OpenTelemetry, baggage for tenant/request IDs).
4) Queue depth & consumer lag (and DLQ rate) for each message bus.
5) Synthetic checks that hit public routes and critical user flows.
6) Liveness/Readiness that model real dependencies.
7) Release/rollback sanity (alert routing, dashboards, and “what page wakes whom”).
1) RED metrics (Prometheus with prom-client)
Measure Rate (RPS), Errors (4xx/5xx, broken down by status class), Duration (p95/p99). Export per route/operation.
// metric.js
import client from 'prom-client';

export const registry = new client.Registry();

export const httpReqDur = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request duration',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.025, 0.05, 0.1, 0.25, 0.5, 1, 2, 5]
});
export const httpReqs = new client.Counter({
  name: 'http_requests_total',
  help: 'Total requests',
  labelNames: ['method', 'route']
});
export const httpErrors = new client.Counter({
  name: 'http_errors_total',
  help: 'Non-2xx responses',
  labelNames: ['method', 'route', 'status']
});

registry.registerMetric(httpReqDur);
registry.registerMetric(httpReqs);
registry.registerMetric(httpErrors);
// server.js (Express example)
app.use((req, res, next) => {
  const end = httpReqDur.startTimer({ method: req.method });
  res.on('finish', () => {
    // Prefer the matched route pattern over the raw URL path so label
    // cardinality stays bounded (e.g. /users/:id instead of /users/42).
    const route = req.route?.path ?? req.path;
    httpReqs.inc({ method: req.method, route });
    if (res.statusCode >= 400) {
      httpErrors.inc({ method: req.method, route, status: String(res.statusCode) });
    }
    end({ route, status: String(res.statusCode) });
  });
  next();
});

app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', registry.contentType);
  res.end(await registry.metrics());
});
Dashboard: one panel each for RPS, Error %, and p95/p99 Duration per route.
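Assuming the metric names above, the three panels map to PromQL roughly like this (a sketch; adjust label and metric names to your setup):

```promql
# Rate: requests per second, per route
sum by (route) (rate(http_requests_total[5m]))

# Error %: share of non-2xx responses, per route
100 * sum by (route) (rate(http_errors_total[5m]))
    / sum by (route) (rate(http_requests_total[5m]))

# Duration: p95 / p99 per route, from the histogram buckets
histogram_quantile(0.95, sum by (route, le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.99, sum by (route, le) (rate(http_request_duration_seconds_bucket[5m])))
```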
2) SLOs + error budgets
Pick SLIs that users feel. Example API SLI: availability = 1 − (5xx + timeouts) / total.
service: checkout-api
sli:
  type: events
  good: http_requests_total{status=~"2..|3.."}
  total: http_requests_total
slo: 99.9   # monthly objective
alerting:
  burn_rates:
    - { window: 5m, rate: 14 }   # page (fast burn)
    - { window: 1h, rate: 6 }    # page
    - { window: 6h, rate: 3 }    # ticket
You page on budget burn, not on every 500.
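To make the burn-rate numbers concrete: with a 99.9% SLO the error budget is 0.1% of events, and the burn rate is just the observed bad-event ratio divided by that budget. A minimal sketch (numbers are illustrative):

```javascript
// Burn rate = observed bad-event ratio / allowed bad-event ratio (the error budget).
// A burn rate of 14 means that, at the current failure rate, the whole
// period's budget would be exhausted in period / 14 (~2 days of a 30-day month).
function burnRate(badEvents, totalEvents, sloPercent) {
  const budget = 1 - sloPercent / 100;       // e.g. 99.9% -> 0.001
  const badRatio = badEvents / totalEvents;  // observed failure ratio in the window
  return badRatio / budget;
}

// 99.9% SLO, 14 failures out of 1000 requests in the window -> burn rate ≈ 14 (page).
console.log(burnRate(14, 1000, 99.9));
```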
3) Distributed tracing (OpenTelemetry)
Instrument HTTP, DB, and queue operations; propagate trace id + tenant id across services.
// tracing.js
import { NodeSDK } from '@opentelemetry/sdk-node';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { PrismaInstrumentation } from '@prisma/instrumentation';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
  instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation(), new PrismaInstrumentation()]
});
sdk.start();
Minimum: parent/child spans, HTTP attributes (route, status, target), DB statement summaries, and message queue spans (publish/consume).
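Cross-service propagation rides on the W3C `traceparent` header; OpenTelemetry injects and extracts it for you, but it helps to recognize the format in logs and proxies. A stdlib-only sketch of parsing it (illustrative, not a replacement for the SDK's propagator):

```javascript
// Parse a W3C traceparent header: version-traceid-spanid-flags.
// OpenTelemetry's HTTP instrumentation handles this automatically; this
// sketch only shows what's on the wire between your services.
function parseTraceparent(header) {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, version, traceId, spanId, flags] = m;
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const ctx = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
// ctx.traceId ties spans together across services; ctx.sampled says whether the trace is recorded.
```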
4) Queue depth & consumer lag
For RabbitMQ/Kafka/SQS, track:
- Queue depth (messages ready).
- Lag (Kafka consumer group lag).
- Age of oldest message (or time-in-queue).
- DLQ rate.
// Example: RabbitMQ depth (management API)
const depth = await fetch(`${RMQ}/api/queues/%2F/orders`).then(r => r.json());
metrics.queueDepth.set({ queue: 'orders' }, depth.messages_ready);
Alert when depth/lag grows while consumer CPU is idle → likely stuck handler or poison message.
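The "growing backlog + idle consumer" rule from above is easy to encode as an alert condition. A sketch with illustrative thresholds (tune both to your workload):

```javascript
// Heuristic for a stuck handler / poison message: the backlog is growing
// while consumers are nearly idle. Both thresholds here are assumptions.
function stuckConsumerAlert({ depthNow, depthBefore, consumerCpuPct }) {
  const growing = depthNow > depthBefore;  // backlog increasing between scrapes
  const idle = consumerCpuPct < 10;        // consumers doing almost no work
  return growing && idle;                  // page: likely stuck handler or poison message
}

stuckConsumerAlert({ depthNow: 5000, depthBefore: 1200, consumerCpuPct: 3 }); // fires
stuckConsumerAlert({ depthNow: 5000, depthBefore: 1200, consumerCpuPct: 85 }); // busy: just slow, no page
```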
5) Synthetic checks (outside-in)
Hit public routes from multiple regions every minute; alert when error rate or latency breaks SLO.
// k6 smoke example (smoke.js)
import http from 'k6/http';
import { check } from 'k6';

export const options = { vus: 1, iterations: 10, thresholds: { http_req_duration: ['p(95)<500'] } };

export default function () {
  const res = http.get(`${__ENV.BASE_URL}/healthz`);
  check(res, { 'status 200': (r) => r.status === 200 });
}
Run smoke on deploy; run full flows (login → create → pay) on schedule.
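The `p(95)<500` threshold above is just a percentile over sampled durations, and the same check applies to any synthetic results you collect yourself. A small sketch using nearest-rank percentiles (illustrative):

```javascript
// Nearest-rank percentile over latency samples (ms): sort, take the
// ceil(p% * n)-th value. This is the statistic a "p(95)<500" threshold tests.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const latencies = [120, 95, 480, 210, 640, 180, 150, 230, 90, 310];
percentile(latencies, 95) <= 500; // would this run pass the k6 threshold?
```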
6) Liveness / Readiness
/healthz (liveness): process is alive; quick checks only.
/readyz (readiness): dependencies OK (DB ping, queue connect, config loaded). Fail readiness when backpressure kicks in.
app.get('/healthz', (_req, res) => res.send('ok'));

app.get('/readyz', async (_req, res) => {
  try {
    // A rejected ping must mean "not ready", not an unhandled rejection.
    const ok = (await db.ping()) && (await queue.ping());
    res.status(ok ? 200 : 503).send(ok ? 'ready' : 'not-ready');
  } catch {
    res.status(503).send('not-ready');
  }
});
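"Fail readiness when backpressure kicks in" can be as simple as counting in-flight requests and refusing new traffic past a limit. A sketch (the limit is an assumption; tune it per service):

```javascript
// Readiness gate that also fails under backpressure: dependencies must be up
// AND the in-flight request count must be under a limit. MAX_INFLIGHT is
// illustrative; set it to what the service can actually absorb.
const MAX_INFLIGHT = 100;
let inflight = 0;

function trackRequest(fn) {
  inflight++;
  return Promise.resolve().then(fn).finally(() => { inflight--; });
}

function isReady(depsOk) {
  return depsOk && inflight < MAX_INFLIGHT;
}

// Wire-up sketch: app.get('/readyz', async (_req, res) =>
//   res.sendStatus(isReady(await db.ping() && await queue.ping()) ? 200 : 503));
```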
7) Release/rollback sanity
- Log version/commit on every request (trace attr + metric label).
- Dashboards pinned for latest version.
- Alert routes: paging only for fast budget burn, tickets for slow burn.
- Rollback plan documented (traffic switch, canary %, who approves).
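For the version/commit logging above, a tiny middleware that stamps every response with the build SHA covers the response-header side; the same constant feeds the metric label and trace attribute. A sketch (the `GIT_SHA` env var name is an assumption, set at build time):

```javascript
// Stamp every response with the running version so dashboards, logs, traces,
// and synthetic checks can all be sliced by deploy. GIT_SHA is an assumed
// env var injected by CI at build time.
const VERSION = process.env.GIT_SHA ?? 'dev';

function versionHeader(req, res, next) {
  res.setHeader('X-Service-Version', VERSION); // visible to synthetics too
  next();
}

// app.use(versionHeader);  // plus: add VERSION as a metric label / trace attribute
```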
What we keep on one dashboard
- RED per route (RPS, Error%, p95/p99).
- SLO objective vs. actual & budget left.
- Trace waterfall for 3 slowest endpoints.
- Queue depth/lag + DLQ rate.
- Synthetic latency (per region).
- Deploy marker overlays.
If you want a lean Node.js microservice checklist we share with teams, ping me.