Azure Application Insights — The No‑BS Guide for Pro Teams (APM, KQL, cost, and gotchas)



This content originally appeared on DEV Community and was authored by Cristian Sifuentes

If you ship on Azure, Application Insights (App Insights) is your default telemetry rail. It sits inside Azure Monitor, ingests client + server telemetry, and gives you APM, distributed tracing, dashboards, KQL queries, and alerts—out of the box.

This post turns the marketing bullets into a production‑grade playbook. You’ll get: what to enable, how to wire it up, KQL you’ll actually use, and the gotchas that bite teams at scale.

TL;DR for senior devs

  • What it is: Cloud APM + analytics for apps, with deep integration across Azure, GitHub, and DevOps.
  • Why it matters: Find regressions fast, correlate failures across services, and keep an eye on SLOs.
  • How to start: Workspace‑based App Insights + SDK + the 8 KQL queries and 6 alerts below. Ship dashboards; keep costs sane with sampling.

What is Application Insights?

A managed telemetry service that collects requests, dependencies, traces, exceptions, metrics, availability pings, and custom events from your apps (web, APIs, workers, mobile) and correlates them end‑to‑end. Data lands in a Log Analytics workspace and is queryable with Kusto Query Language (KQL) in near‑real time.

Core capabilities

  • APM: request timings, dependency maps, cold starts, failures, live metrics.
  • User analytics: sessions, funnels, retention (for web/mobile with the client SDK).
  • Custom telemetry: your own events/metrics for business KPIs.
  • DevOps integrations: Visual Studio / GitHub / Azure DevOps links in failures and deployments.
  • Alerts & automation: metric & log‑based alerts; trigger Logic Apps, Functions, webhooks.
  • Log Analytics integration: one workspace to join app logs with infra/network signals.

Architecture at a glance

App SDKs → Ingestion → App Insights resource → Log Analytics Workspace
                                     │
                       Dashboards / KQL / Alerts / Workbooks

Choose workspace‑based from day 1 (required for advanced KQL, cross‑resource queries, and central governance).

Quickstart: wire it up

.NET (ASP.NET Core)

// Program.cs
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.DataContracts;

var builder = WebApplication.CreateBuilder(args);

// 1) Reads the connection string from configuration (APPLICATIONINSIGHTS_CONNECTION_STRING
//    or ApplicationInsights:ConnectionString) and registers TelemetryClient as a singleton.
//    For key-less (managed identity) ingestion, see the tip and OpenTelemetry sketch below.
builder.Services.AddApplicationInsightsTelemetry();

var app = builder.Build();

app.MapGet("/health", () => "ok");

// 2) Useful custom telemetry: inject the TelemetryClient registered above
app.MapGet("/pay", (TelemetryClient ai) => {
    using var op = ai.StartOperation<RequestTelemetry>("ChargePayment");
    try {
        ai.TrackEvent("PaymentInitiated", new Dictionary<string, string> { ["amount"] = "42.00" });
        // ... do work / call dependencies
        ai.TrackMetric("checkout_ms", 187);
        op.Telemetry.Success = true;
        return Results.Ok();
    }
    catch (Exception ex) {
        ai.TrackException(ex);
        op.Telemetry.Success = false;
        throw;
    }
});

app.Run();

Tip: In Azure, prefer managed identity (Microsoft Entra ID) authentication for ingestion, for example via the Azure Monitor OpenTelemetry distro, to get a secure, key‑less setup.
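
A minimal sketch of that key-less path, assuming the Azure.Monitor.OpenTelemetry.AspNetCore and Azure.Identity packages (the connection string, usually supplied via APPLICATIONINSIGHTS_CONNECTION_STRING, still identifies the resource; only authentication changes):

// Program.cs — OpenTelemetry distro wiring with key-less (Entra ID) ingestion auth
using Azure.Identity;
using Azure.Monitor.OpenTelemetry.AspNetCore;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry().UseAzureMonitor(options =>
{
    // DefaultAzureCredential resolves to the app's managed identity when running in Azure.
    options.Credential = new DefaultAzureCredential();
});

var app = builder.Build();
app.MapGet("/health", () => "ok");
app.Run();

If you want keys rejected outright, you can also disable local authentication on the Application Insights resource so only Entra ID ingestion is accepted.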

Node.js (Express)

npm i applicationinsights --save   # server SDK; @microsoft/applicationinsights-web is the separate browser SDK
// server.js
const appInsights = require("applicationinsights");
appInsights.setup(process.env.APPLICATIONINSIGHTS_CONNECTION_STRING)
  .setAutoCollectDependencies(true)
  .setAutoCollectExceptions(true)
  .setSendLiveMetrics(true)
  .start();

const app = require("express")();
app.get("/api", (_, res) => res.json({ ok: true }));
app.listen(3000);

The 8 KQL queries I run every week

Open Application Insights → Logs and paste these.

1) Top failing operations (last 24h)

requests
| where timestamp >= ago(24h)
| summarize failures = countif(success == false), total = count() by name
| extend failureRate = todouble(failures) / total
| top 20 by failures desc

2) Slowest dependencies

dependencies
| where timestamp >= ago(24h)
| summarize avg_ms = avg(duration), p95_ms = percentile(duration, 95) by target, type
| top 20 by p95_ms desc

3) Exceptions with stack details

exceptions
| where timestamp >= ago(24h)
| project timestamp, type, outerMessage, problemId, operation_Name, cloud_RoleName, details
| order by timestamp desc

4) Cold starts (Functions / container warmups)

traces
| where timestamp >= ago(24h) and message has "ColdStart"
| summarize count() by cloud_RoleName

5) Request → Dependency correlation

requests
| where timestamp > ago(2h)
| join kind=leftouter (dependencies | project operation_Id, depDuration=duration, depName=name, depTarget=target) on operation_Id
| summarize reqCount = count(), depCount = countif(isnotempty(depName)), avgDepMs = avg(depDuration) by name
| order by avgDepMs desc

6) User geography & client

pageViews
| where timestamp > ago(7d)
| summarize views=count() by client_Browser, client_OS, client_CountryOrRegion
| top 50 by views desc

7) Custom business KPI

customMetrics
| where name == "checkout_ms" and timestamp > ago(7d)
| summarize p50=percentile(value,50), p95=percentile(value,95), max(value)

8) Trace severity mapping

traces
| where timestamp >= ago(72h)
| extend severity = case(severityLevel == 0, "Verbose",
                         severityLevel == 1, "Information",
                         severityLevel == 2, "Warning",
                         severityLevel == 3, "Error",
                         severityLevel == 4, "Critical", "Unknown")
| summarize count() by severity

Dashboards & Alerts that matter

Dashboards (Workbooks)

  • Service Overview: requests, failure rate, P95 latency, dependency health.
  • Reliability: exception rate by release (link to deployment annotations).
  • User Journey: funnel (Home → Product → Checkout), drop‑offs, client errors.
  • Live Ops: live metrics (server count, requests/sec, CPU, memory).

Alerts (start with these)

  1. P95 latency > threshold for 10 min (requests).
  2. Failure rate > 2% for 10 min (requests where success == false).
  3. Exception spike > N in 5 min (exceptions).
  4. Dependency availability < 99% (dependency failures).
  5. Custom KPI breach (e.g., checkout_ms P95 > 5000).
  6. No data for 10 min (heartbeat missing = outage; a heartbeat sketch follows below).

Route alerts to Action Groups (email, Teams, PagerDuty, webhook → runbooks).
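
For alert #6, the app needs a steady signal that can go missing. A minimal sketch of one option: a hosted service that writes a synthetic availability result every minute (the HeartbeatService name and the one-minute interval are illustrative choices, not a prescribed pattern).

// HeartbeatService.cs — emits a per-instance heartbeat into availabilityResults
using Microsoft.ApplicationInsights;

public sealed class HeartbeatService : BackgroundService
{
    private readonly TelemetryClient _ai;
    public HeartbeatService(TelemetryClient ai) => _ai = ai;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            // Lands in the availabilityResults table, so "no data for 10 min" is alertable.
            _ai.TrackAvailability("heartbeat", DateTimeOffset.UtcNow,
                duration: TimeSpan.Zero, runLocation: Environment.MachineName, success: true);
            await Task.Delay(TimeSpan.FromMinutes(1), stoppingToken);
        }
    }
}

// Registration: builder.Services.AddHostedService<HeartbeatService>();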

Feature deep‑dive (with pros/cons)

1) APM & Distributed Tracing

  • Pros: automatic correlation across services; the application map is gold for newcomers (it groups nodes by cloud_RoleName; see the sketch after this list).
  • Cons: cross‑platform correlation can break without proper headers (W3C trace context). Use OpenTelemetry for polyglot stacks.
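
If the map shows generic or duplicated node names, setting the cloud role name explicitly usually cleans it up. A minimal sketch with the classic SDK; the "checkout-api" value is illustrative:

// CloudRoleNameInitializer.cs — make application map nodes and cloud_RoleName readable
using Microsoft.ApplicationInsights.Channel;
using Microsoft.ApplicationInsights.Extensibility;

public sealed class CloudRoleNameInitializer : ITelemetryInitializer
{
    public void Initialize(ITelemetry telemetry)
    {
        // Applied to every item before it is sent; shows up as cloud_RoleName in KQL.
        telemetry.Context.Cloud.RoleName = "checkout-api";
    }
}

// Registration: builder.Services.AddSingleton<ITelemetryInitializer, CloudRoleNameInitializer>();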

2) User Analytics

  • Useful for web/mobile if you’re allowed to collect client telemetry. Respect privacy: cookie banners + sampling.

3) Custom Telemetry

  • Track your business events (signup, payment, quote). They drive the most actionable dashboards.

4) DevOps Integration

  • Link failures to commits/deployments. Surface release annotations on charts to catch regressions fast.

5) Alerts & Notifications

  • Prefer log‑based alerts for rich conditions; metric alerts for low latency.

6) Log Analytics + KQL

  • Joins across requests, dependencies, traces, exceptions, availabilityResults.
  • Store queries in Queries Hub; export to Workbooks.

Cost, sampling, and retention (don’t skip this)

  • Ingestion cost scales with event volume. Enable adaptive or fixed‑rate sampling in the SDK (or ingestion sampling at the endpoint) to control bills.
  • Workspace retention: set 30–90 days hot; archive older data to Cold/Basic Logs or Storage.
  • Cardinality: avoid high‑cardinality dimensions (e.g., per‑user IDs) on metrics—use them in logs instead (see the GetMetric sketch below).
  • Log Alerts cost: use 5–15 min frequency; avoid dozens of redundant rules.

// .NET adaptive sampling, keeping business events at 100% (typical)
builder.Services.Configure<TelemetryConfiguration>(config =>
    config.DefaultTelemetrySink.TelemetryProcessorChainBuilder
        .UseAdaptiveSampling(maxTelemetryItemsPerSecond: 5, excludedTypes: "Event") // don't sample custom events
        .Build());

builder.Services.AddApplicationInsightsTelemetry(options =>
    options.EnableAdaptiveSampling = false); // turn off the default sampler; re-added above with the exclusion
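
On the cardinality and volume points above, a small sketch: TelemetryClient.GetMetric pre-aggregates in process and sends roughly one aggregate per minute per series, so a hot KPI does not multiply ingestion the way per-call TrackMetric items can. The telemetryClient instance and the "plan" dimension here are illustrative.

// Pre-aggregated metric: one aggregate per series per interval instead of one item per call.
// Keep dimensions low-cardinality (plan, tier, region); never per-user or per-request IDs.
var checkoutMs = telemetryClient.GetMetric("checkout_ms", "plan");
checkoutMs.TrackValue(187, "pro");
checkoutMs.TrackValue(212, "free");

The aggregates still land in the customMetrics table.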

Security & governance

  • Use managed identity for ingestion (no connection strings in code).
  • Lock Application Insights + Workspace behind Private Link where possible.
  • RBAC: Monitoring Reader for most devs, Monitoring Contributor for SREs, least privilege for query-only consumers.
  • Turn on Diagnostic settings to export to Storage/Event Hub for long‑term compliance.

Common pitfalls (and fixes)

  • “No data” in Logs: wrong workspace or sampling too aggressive → verify the connection string (and which workspace it points at); check Live Metrics first.
  • Double counting: don’t mix classic + workspace‑based resources.
  • Broken correlation: ensure traceparent headers propagate through API gateways and message buses (a propagation sketch follows after this list).
  • Alert fatigue: start with 5–6 signals, tune thresholds weekly, add suppression windows for deployments.
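
For the message-bus case, a minimal sketch of carrying W3C trace context in message headers so producer and consumer share one operation_Id. The Checkout.Messaging source name is illustrative, and if you use the OpenTelemetry distro you would also register this ActivitySource so the consumer activity is exported.

// TraceContextPropagation.cs — stamp and restore W3C trace context across a bus
using System.Collections.Generic;
using System.Diagnostics;

public static class TraceContextPropagation
{
    private static readonly ActivitySource Source = new("Checkout.Messaging");

    // Publisher: copy the current traceparent/tracestate into the outgoing message headers.
    public static void Stamp(IDictionary<string, string> headers)
    {
        if (Activity.Current is { } activity)
        {
            headers["traceparent"] = activity.Id!;
            if (activity.TraceStateString is { } state) headers["tracestate"] = state;
        }
    }

    // Consumer: restore the parent context so the handler's telemetry joins the same operation.
    public static Activity? StartConsume(IDictionary<string, string> headers)
    {
        headers.TryGetValue("traceparent", out var traceparent);
        headers.TryGetValue("tracestate", out var tracestate);
        ActivityContext.TryParse(traceparent, tracestate, out var parent);
        return Source.StartActivity("ProcessMessage", ActivityKind.Consumer, parent);
    }
}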

A simple rollout plan

1) Create workspace‑based App Insights.

2) Instrument .NET/Node (and front‑end if allowed).

3) Ship Workbooks (Overview, Reliability, Business).

4) Enable the 6 alerts above.

5) Enable adaptive sampling; set 30–90 days retention.

6) Weekly hygiene: run the 8 queries, review alerts, adjust thresholds.

Final take

Application Insights is not just “more logs.” Used well, it becomes your source of truth for reliability and customer experience. Start small, wire the golden signals, automate the dull parts—and your MTTR will drop while your confidence climbs.

✍ Written by: Cristian Sifuentes

Full-stack developer & AI/JS enthusiast — passionate about React, TypeScript, and scalable architectures.

