Amazon API Gateway Observability Best Practices with Datadog

This content originally appeared on DEV Community and was authored by Indika_Wimalasuriya

Amazon API Gateway is a fully managed AWS service that allows you to create, publish, and maintain APIs at any scale. It acts as a gateway to your application’s backend services, including AWS Lambda, EKS, ECS, EC2, and more.

You can explore the full documentation here:
API Gateway Developer Guide – the complete reference for everything related to API Gateway

To make sure we’re aligned on the fundamentals, I’ve created an API Gateway Essentials summary below. It gives you a quick overview of the core capabilities this service offers.

The main objective of this blog is to walk through how to monitor and observe Amazon API Gateway using Datadog, one of the leading observability platforms, which provides full-stack visibility into AWS environments.

Before diving in, a quick refresher:
Observability is the practice of using telemetry data (logs, metrics, and traces) to understand a system’s internal state. In this case, we’ll leverage API Gateway’s logs, metrics, and traces to gain insights into what’s really happening under the hood.

API Gateway Logs
AWS provides built-in support for enabling logs. You can enable them under API Gateway → Stages, where logging options are available for both access logs and execution logs.
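If you prefer to script this instead of using the console, the following is a minimal sketch of how both log types could be enabled on a REST API stage with boto3. It assumes a boto3 `apigateway` client created by the caller, an existing CloudWatch log group, and that the account-level CloudWatch role for API Gateway is already configured. The function names and the chosen log fields are illustrative, not part of the original setup.

```python
import json

# Access log format built from API Gateway $context variables.
# The selection of fields here is an illustrative assumption.
ACCESS_LOG_FORMAT = json.dumps({
    "requestId": "$context.requestId",
    "ip": "$context.identity.sourceIp",
    "requestTime": "$context.requestTime",
    "httpMethod": "$context.httpMethod",
    "resourcePath": "$context.resourcePath",
    "status": "$context.status",
    "responseLatency": "$context.responseLatency",
    "integrationLatency": "$context.integrationLatency",
})


def build_logging_patch_ops(log_group_arn, level="INFO"):
    """Return patch operations that turn on access and execution logs."""
    return [
        # Access logs: where to write them and in which format.
        {"op": "replace", "path": "/accessLogSettings/destinationArn",
         "value": log_group_arn},
        {"op": "replace", "path": "/accessLogSettings/format",
         "value": ACCESS_LOG_FORMAT},
        # Execution logs: log level for all resources and methods.
        {"op": "replace", "path": "/*/*/logging/loglevel", "value": level},
    ]


def enable_stage_logging(client, rest_api_id, stage_name, log_group_arn):
    """Apply the patch using a boto3 'apigateway' client supplied by the caller."""
    return client.update_stage(
        restApiId=rest_api_id,
        stageName=stage_name,
        patchOperations=build_logging_patch_ops(log_group_arn),
    )
```

A caller would typically create the client with `boto3.client("apigateway")` and pass in its own API ID, stage name, and log group ARN. Emitting the access log as JSON is a design choice that tends to make the fields easier to parse once the logs reach Datadog.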

Once logging is enabled, you can configure API Gateway to send logs to Datadog.

Configuration guide: Datadog + API Gateway Integration

Why Logs Matter
Logs are essential for troubleshooting issues in API Gateway. In most cases, failures fall into one of two categories:

Backend-related issues
Unresponsive services (e.g., Lambda, EC2, EKS) or misconfigurations such as timeouts or incorrect integration responses.

AWS infrastructure-level issues (rare)
These could include internal AWS errors or regional service disruptions.

Common Causes of API Gateway Failures

Misconfigured integrations (e.g., VPC links, request/response mapping templates)
Backend timeouts
Incorrect or missing HTTP status code mappings
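
As a rough illustration of how these causes tend to surface in access logs, the sketch below maps a response status to a likely category. It is a heuristic only: a 504 commonly indicates that the integration timed out, and a 502 commonly indicates a response from the backend that API Gateway could not use. The function name, the category labels, and the optional `integration_status` argument are assumptions made for this example.

```python
def classify_failure(status, integration_status=None):
    """Map an access-log record to a likely failure category (heuristic)."""
    if status == 504:
        return "backend timeout"
    if status == 502:
        return "bad integration response"
    if status >= 500:
        # A 5xx from the backend itself points at the backend service.
        if integration_status is not None and integration_status >= 500:
            return "backend error"
        return "gateway or configuration error"
    if status >= 400:
        return "client error"
    return "ok"


# Example usage with hypothetical log records.
for record in [(200, 200), (504, None), (502, 200), (500, 500), (403, None)]:
    print(record, "->", classify_failure(*record))
```

The categories are a starting point for triage, not a diagnosis; the execution logs for the matching request ID remain the place to confirm the actual cause.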

API Gateway Metrics

AWS provides a rich set of metrics for API Gateway that align with three of the four golden signals of monitoring: traffic, errors, and latency (the fourth, saturation, is not covered by these metrics). These metrics are essential for monitoring the health, performance, and reliability of your APIs, helping you detect issues early and respond proactively.

API Gateway Metrics – Grouped Summary

Type	Metric	Description
Traffic	`aws.apigateway.count`	Total number of API requests received
	`aws.apigateway.count.p50` – `.p99`	Percentile distribution of request count
	`trace.aws.apigateway.hits`	Total hits from traces
	`trace.aws.apigateway.hits.by_http_status`	Hits grouped by HTTP status code
	`trace.aws.apigateway.stage.hits`	Hits per deployment stage
	`trace.aws.apigateway.stage.hits.by_http_status`	Stage-level hits by HTTP status
Errors	`aws.apigateway.4xxerror`	Client-side errors (e.g., invalid request, unauthorized)
	`aws.apigateway.4xxerror.p50` – `.p99`	Percentiles of 4xx error rates
	`aws.apigateway.5xxerror`	Server-side/API errors (e.g., backend failure)
	`aws.apigateway.5xxerror.p50` – `.p99`	Percentiles of 5xx error rates
Latency	`aws.apigateway.latency`	Total time from request to response (includes backend)
	`aws.apigateway.latency.p50` – `.p99`	Percentile breakdown of total latency
	`aws.apigateway.latency.minimum` / `.maximum`	Min and max observed latency values
Integration Latency	`aws.apigateway.integration_latency`	Time spent in the backend integration only
	`aws.apigateway.integration_latency.p50` – `.p99`	Percentile breakdown of backend latency
	`aws.apigateway.integration_latency.minimum` / `.maximum`	Min and max integration latency
Tracing / Duration	`trace.aws.apigateway.duration`	Trace-based total API duration
	`trace.aws.apigateway.duration.by_http_status`	Duration per status code
	`trace.aws.apigateway.stage.duration`	Duration per stage
	`trace.aws.apigateway.stage.duration.by_http_status`	Stage duration by status code
Tracing / Apdex	`trace.aws.apigateway.stage.apdex`	User satisfaction score (Apdex) per stage
Meta	`trace.aws.apigateway`	Base trace for API Gateway
	`trace.aws.apigateway.stage`	Trace identifier for specific stage
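
To show how the metrics in the table combine into service level indicators, here is a minimal sketch that derives a 5xx error rate from the request count and error count, and estimates the time added by API Gateway itself by subtracting integration latency from total latency. The function names and the sample values are hypothetical; in practice the inputs would come from the metrics listed above over the same time window.

```python
def error_rate_pct(request_count, error_count):
    """Percentage of requests that failed; 0.0 when there was no traffic."""
    if request_count == 0:
        return 0.0
    return 100.0 * error_count / request_count


def gateway_overhead_ms(latency_ms, integration_latency_ms):
    """Time spent outside the backend integration, in milliseconds."""
    return latency_ms - integration_latency_ms


# Hypothetical sample values for one time window.
requests = 2000        # aws.apigateway.count
errors_5xx = 10        # aws.apigateway.5xxerror
latency = 180.0        # aws.apigateway.latency (ms)
integration = 150.0    # aws.apigateway.integration_latency (ms)

print(error_rate_pct(requests, errors_5xx))        # 0.5
print(gateway_overhead_ms(latency, integration))   # 30.0
```

A large gap between the two latency metrics suggests the time is being spent in API Gateway rather than in the backend, while a small gap with high total latency points at the backend integration.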

API Gateway Tracing

A best practice is to enable Application Performance Monitoring (APM) tracing on your backend services, such as AWS Lambda or microservices running on ECS, EKS, or EC2. Enabling tracing automatically provides you with the API Gateway trace view, giving detailed insights into the flow and performance of your APIs.

In the example below, I have enabled tracing for an AWS Lambda backend, which allows me to view the API Gateway trace data.

The example below shows a trace starting from API Gateway, capturing the end-to-end flow through the backend Lambda function and any other integrated services.

Service Level Indicator (SLI) Dashboard for API Gateway

Finally, you need to bring everything together and create a single source of truth dashboard for API Gateway, which provides insights into traffic, errors, and latency. It should include request volume and trends to help identify potential issues promptly.

The dashboard should also highlight:

Failed traces

Traces taking more than x seconds — useful for identifying slow requests passing through API Gateway that require further investigation

Relevant logs for deeper analysis

A combination of all these elements will give you a comprehensive view of your API Gateway, enabling effective monitoring and faster troubleshooting of any potential failures or performance issues.
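
The metric portion of such a dashboard can be sketched as a plain Python dictionary that follows the general shape of a Datadog dashboard definition (a title, a layout type, and a list of timeseries widgets). Treat this as an assumption-laden outline rather than a finished definition: the `apiname` tag used for scoping, the helper names, and the widget titles are illustrative, and the exact schema should be checked against the Datadog Dashboards API documentation before use. Widgets for failed traces, slow traces, and logs are omitted here.

```python
import json


def timeseries_widget(title, query):
    """Build one timeseries widget holding a single metric query."""
    return {
        "definition": {
            "type": "timeseries",
            "title": title,
            "requests": [{"q": query}],
        }
    }


def build_dashboard(api_name):
    """Assemble an SLI dashboard covering traffic, errors, and latency."""
    # Assumption: metrics are tagged with 'apiname' by the AWS integration.
    scope = "{apiname:%s}" % api_name
    return {
        "title": "API Gateway SLI - %s" % api_name,
        "layout_type": "ordered",
        "widgets": [
            timeseries_widget("Requests", "sum:aws.apigateway.count%s.as_count()" % scope),
            timeseries_widget("4xx errors", "sum:aws.apigateway.4xxerror%s.as_count()" % scope),
            timeseries_widget("5xx errors", "sum:aws.apigateway.5xxerror%s.as_count()" % scope),
            timeseries_widget("Latency", "avg:aws.apigateway.latency%s" % scope),
            timeseries_widget("Integration latency",
                              "avg:aws.apigateway.integration_latency%s" % scope),
        ],
    }


# Print the definition as JSON for review before sending it anywhere.
print(json.dumps(build_dashboard("orders-api"), indent=2))
```

Keeping the definition in code makes it easy to review and to reuse the same layout for several APIs by changing only the name passed in.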

And that wraps up a complete guide to achieving observability for Amazon API Gateway using Datadog.
