This content originally appeared on DEV Community and was authored by Indika_Wimalasuriya
AWS API Gateway is a fully managed service from AWS that allows you to create, publish, and maintain APIs at any scale. It acts as a gateway to your application’s backend services, including AWS Lambda, EKS, ECS, EC2, and more.
You can explore the full documentation here:
API Gateway Developer Guide: the reference for all the details related to API Gateway.
To make sure we’re aligned on the fundamentals, I’ve created an API Gateway Essentials summary below. It gives you a quick overview of the core capabilities this service offers.
The main objective of this blog is to walk through how to monitor and observe AWS API Gateway using Datadog — one of the leading observability platforms that provides full-stack visibility into AWS environments.
Before diving in, a quick refresher:
Observability is the practice of using telemetry data (logs, metrics, and traces) to understand a system’s internal state. In this case, we’ll leverage API Gateway’s logs, metrics, and traces to gain insights into what’s really happening under the hood.
API Gateway Logs
AWS provides built-in support for enabling logs. You can enable them under API Gateway → Stages, where logging options are available for both access logs and execution logs.
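If you prefer to script this instead of clicking through the console, the same stage settings can be applied with boto3's `update_stage` call. The snippet below is a minimal sketch, not the only way to do it: the function names and the choice of fields in the access log format are my own assumptions, and the log group ARN you pass in must already exist.

```python
import json

# Access log format: a JSON template whose $context variables API Gateway
# fills in for every request. The selection of fields here is illustrative.
ACCESS_LOG_FORMAT = json.dumps({
    "requestId": "$context.requestId",
    "httpMethod": "$context.httpMethod",
    "resourcePath": "$context.resourcePath",
    "status": "$context.status",
    "integrationStatus": "$context.integrationStatus",
    "responseLatency": "$context.responseLatency",
    "integrationLatency": "$context.integrationLatency",
})


def build_logging_patch(log_group_arn, level="INFO"):
    """Return patch operations that turn on access logs and execution logs."""
    return [
        {"op": "replace", "path": "/accessLogSettings/destinationArn", "value": log_group_arn},
        {"op": "replace", "path": "/accessLogSettings/format", "value": ACCESS_LOG_FORMAT},
        # "/*/*" applies the execution log level to every resource and method.
        {"op": "replace", "path": "/*/*/logging/loglevel", "value": level},
    ]


def enable_stage_logging(rest_api_id, stage_name, log_group_arn):
    """Apply the patch to a REST API stage (needs AWS credentials)."""
    import boto3  # imported lazily so build_logging_patch works without AWS access

    client = boto3.client("apigateway")
    return client.update_stage(
        restApiId=rest_api_id,
        stageName=stage_name,
        patchOperations=build_logging_patch(log_group_arn),
    )
```

Keeping the patch construction separate from the AWS call makes it easy to review exactly what will change on the stage before applying it.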
Once logging is enabled, you can configure API Gateway to send logs to Datadog.
Configuration guide: Datadog + API Gateway Integration
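One common way to ship these logs is to subscribe the CloudWatch log group to a Datadog Forwarder Lambda function. The sketch below assumes the Forwarder is already deployed and that CloudWatch Logs is allowed to invoke it; the filter name and helper names are hypothetical, and the configuration guide above remains the authoritative source for the setup.

```python
def execution_log_group(rest_api_id, stage_name):
    """Name of the CloudWatch log group that holds REST API execution logs."""
    return f"API-Gateway-Execution-Logs_{rest_api_id}/{stage_name}"


def forwarder_subscription(log_group_name, forwarder_arn):
    """Build the arguments for CloudWatch Logs put_subscription_filter."""
    return {
        "logGroupName": log_group_name,
        "filterName": "datadog-forwarder",  # hypothetical filter name
        "filterPattern": "",  # an empty pattern forwards every log event
        "destinationArn": forwarder_arn,
    }


def subscribe(log_group_name, forwarder_arn):
    """Create the subscription filter (needs AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without AWS access

    logs = boto3.client("logs")
    return logs.put_subscription_filter(
        **forwarder_subscription(log_group_name, forwarder_arn)
    )
```

For example, `subscribe(execution_log_group("abc123", "prod"), forwarder_arn)` would forward the execution logs of a stage named `prod`; the access log group you configured on the stage can be subscribed the same way.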
Why Logs Matter
Logs are essential for troubleshooting issues in API Gateway. In most cases, failures fall into one of two categories:
- Backend-related issues: unresponsive services (e.g., Lambda, EC2, EKS) or misconfigurations such as timeouts or incorrect integration responses.
- AWS infrastructure-level issues (rare): these could include internal AWS errors or regional service disruptions.
Common Causes of API Gateway Failures
- Misconfigured integrations (e.g., VPC links, request/response mapping templates)
- Backend timeouts
- Incorrect or missing HTTP status code mappings
API Gateway Metrics
AWS provides a rich set of metrics for API Gateway that cover three of the four golden signals of monitoring: traffic, errors, and latency (the fourth, saturation, is not covered here). These metrics are essential for monitoring the health, performance, and reliability of your APIs, helping you detect issues early and respond proactively.
API Gateway Metrics – Grouped Summary
Type | Metric | Description |
---|---|---|
Traffic | aws.apigateway.count | Total number of API requests received |
Traffic | aws.apigateway.count.p50 – .p99 | Percentile distribution of request count |
Traffic | trace.aws.apigateway.hits | Total hits from traces |
Traffic | trace.aws.apigateway.hits.by_http_status | Hits grouped by HTTP status code |
Traffic | trace.aws.apigateway.stage.hits | Hits per deployment stage |
Traffic | trace.aws.apigateway.stage.hits.by_http_status | Stage-level hits by HTTP status |
Errors | aws.apigateway.4xxerror | Client-side errors (e.g., invalid request, unauthorized) |
Errors | aws.apigateway.4xxerror.p50 – .p99 | Percentiles of 4xx error rates |
Errors | aws.apigateway.5xxerror | Server-side/API errors (e.g., backend failure) |
Errors | aws.apigateway.5xxerror.p50 – .p99 | Percentiles of 5xx error rates |
Latency | aws.apigateway.latency | Total time from request to response (includes backend) |
Latency | aws.apigateway.latency.p50 – .p99 | Percentile breakdown of total latency |
Latency | aws.apigateway.latency.minimum / .maximum | Min and max observed latency values |
Integration Latency | aws.apigateway.integration_latency | Time spent in the backend integration only |
Integration Latency | aws.apigateway.integration_latency.p50 – .p99 | Percentile breakdown of backend latency |
Integration Latency | aws.apigateway.integration_latency.minimum / .maximum | Min and max integration latency |
Tracing / Duration | trace.aws.apigateway.duration | Trace-based total API duration |
Tracing / Duration | trace.aws.apigateway.duration.by_http_status | Duration per status code |
Tracing / Duration | trace.aws.apigateway.stage.duration | Duration per stage |
Tracing / Duration | trace.aws.apigateway.stage.duration.by_http_status | Stage duration by status code |
Tracing / Apdex | trace.aws.apigateway.stage.apdex | User satisfaction score (Apdex) per stage |
Meta | trace.aws.apigateway | Base trace for API Gateway |
Meta | trace.aws.apigateway.stage | Trace identifier for specific stage |
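A quick way to read these metrics together is to turn the raw values for a time window into error rates and a gateway overhead figure. Because total latency includes the backend, subtracting integration latency leaves the time spent in API Gateway itself. The function below is a small illustration with hypothetical names and inputs; it assumes the latency values are averages in milliseconds over the same window.

```python
def summarize(count, errors_4xx, errors_5xx, latency_ms, integration_latency_ms):
    """Derive error rates and gateway overhead from metric values for one window."""
    if count <= 0:
        # No traffic in the window: report zeros rather than dividing by zero.
        return {"error_rate_4xx": 0.0, "error_rate_5xx": 0.0, "gateway_overhead_ms": 0.0}
    return {
        "error_rate_4xx": errors_4xx / count,
        "error_rate_5xx": errors_5xx / count,
        # Total latency minus backend time is the time spent in API Gateway.
        "gateway_overhead_ms": latency_ms - integration_latency_ms,
    }


# 1000 requests, 50 client errors, 5 server errors, 240 ms total, 200 ms in the backend.
print(summarize(1000, 50, 5, 240.0, 200.0))
```

A high 5xx rate with low gateway overhead usually means the backend deserves the first look, while a large overhead points back at the gateway configuration.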
API Gateway Tracing
A best practice is to enable tracing for Application Performance Monitoring (APM) on your backend services—such as AWS Lambda or microservices running on ECS, EKS, or EC2. Enabling tracing automatically provides you with the API Gateway tracer view, giving detailed insights into the flow and performance of your APIs.
In the example below, I have enabled tracing for an AWS Lambda backend, which allows me to view the API Gateway trace data.
The example below shows a trace starting from API Gateway, capturing the end-to-end flow through the backend Lambda function and any other integrated services.
Service Level Indicator (SLI) Dashboard for API Gateway
Finally, bring everything together in one dashboard that serves as the single source of truth for API Gateway and provides insights into traffic, errors, and latency. It should include request volume and trends to help identify potential issues promptly.
The dashboard should also highlight:
- Failed traces
- Traces taking more than x seconds, which helps identify slow requests passing through API Gateway that require further investigation
- Relevant logs for deeper analysis
A combination of all these elements will give you a comprehensive view of your API Gateway, enabling effective monitoring and faster troubleshooting of any potential failures or performance issues.
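If you want to keep such a dashboard in version control, its metric widgets can be described as a JSON definition. The sketch below builds one as a Python dictionary covering only traffic, errors, and latency. Treat it as a starting point under assumptions: the `apiname` tag, the widget titles, and the helper names are mine, the trace and log widgets are left out, and the structure should be checked against the Datadog Dashboards API reference before you submit it.

```python
import json


def timeseries(title, query):
    """One timeseries widget showing a single metric query."""
    return {
        "definition": {
            "type": "timeseries",
            "title": title,
            "requests": [{"q": query}],
        }
    }


def build_dashboard(api_name):
    """Dashboard definition with traffic, error, and latency widgets for one API."""
    scope = f"{{apiname:{api_name}}}"  # assumes metrics carry an apiname tag
    return {
        "title": f"API Gateway SLI - {api_name}",
        "layout_type": "ordered",
        "widgets": [
            timeseries("Traffic", f"sum:aws.apigateway.count{scope}.as_count()"),
            timeseries("4xx errors", f"sum:aws.apigateway.4xxerror{scope}.as_count()"),
            timeseries("5xx errors", f"sum:aws.apigateway.5xxerror{scope}.as_count()"),
            timeseries("Latency", f"avg:aws.apigateway.latency{scope}"),
            timeseries("Integration latency", f"avg:aws.apigateway.integration_latency{scope}"),
        ],
    }


# "orders" is a placeholder API name.
print(json.dumps(build_dashboard("orders"), indent=2))
```

The printed JSON can then be reviewed and adapted before importing it into Datadog, where the failed-trace, slow-trace, and log widgets can be added alongside.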
And that wraps up a complete guide to achieving observability for Amazon API Gateway using Datadog.