This content originally appeared on DEV Community and was authored by Hejun Wong
When you’re new to MongoDB, diving into the monitoring dashboard can feel like trying to read a foreign language. You’re met with a sea of charts and numbers, but what do they all mean? What’s good? What’s bad? What demands immediate attention?
This is a critical skill for anyone managing or building on MongoDB, from DBAs and DevOps engineers to developers. Understanding these metrics is the key to proactive performance tuning, efficient resource provisioning, and preventing outages before they happen.
Over the years, I’ve spent a lot of time walking new team members, customers, and SREs through these charts. This article is my attempt to distill that knowledge and help you understand your MongoDB deployment’s health a little better.
Let’s demystify some of the most important metrics you should be watching.
Core Performance Indicators
These metrics give you a high-level overview of the database’s workload and responsiveness.
1. OpCounters
While OpCounters (operation counters) don’t directly signal a health problem, they provide essential context. This metric breaks down the database operations (insert, query, update, delete, etc.) happening over a specific period. By itself, it tells you what the database is doing. When correlated with other metrics like CPU or disk I/O, it helps you understand why the system is behaving a certain way.
2. Operation Execution Times
This chart shows the average time, in milliseconds, that database operations take to execute.
- Good: Low values. This indicates your database is processing requests efficiently.
- Bad: Rising values or spikes. Increasing execution times are a clear signal of performance degradation. This could be due to inefficient queries, resource contention, or network issues.
By default, MongoDB considers any operation that takes longer than 100ms to be slow. If you see a trend of rising execution times, it’s time to use the Query Profiler to investigate and identify the specific queries that are slowing things down.
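The 100ms default can be applied directly when sifting through profiler output. A minimal sketch, operating on plain dictionaries shaped like system.profile documents (which record execution time in a millis field), rather than a live connection:

```python
SLOW_MS = 100  # MongoDB's default slow-operation threshold (slowms)

def slow_ops(profile_docs, threshold_ms=SLOW_MS):
    """Return profiler entries whose execution time exceeds the threshold."""
    return [d for d in profile_docs if d.get("millis", 0) > threshold_ms]

docs = [
    {"op": "query", "ns": "shop.orders", "millis": 340},
    {"op": "update", "ns": "shop.users", "millis": 12},
]
print(slow_ops(docs))  # only the 340 ms query remains
```

The namespace and values above are illustrative; in practice the Query Profiler (or querying system.profile directly) gives you these documents.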
3. Normalized CPU (System & Process)
CPU is a fundamental resource. In most monitoring tools, you’ll see a “normalized” value, which is incredibly helpful. Normalization divides the absolute CPU usage by the number of CPU cores, giving you an easy-to-read percentage from 0-100%.
- Normalized Process CPU: Tracks the CPU usage of the mongod process itself. This is your primary indicator of the database’s CPU load.
- Normalized System CPU: Tracks the total CPU usage of all processes on the host machine.
A healthy range for Normalized Process CPU is often between 40-70%.
- Under 40%: You might be over-provisioned for your current workload.
- Over 70% (sustained): You may be under-provisioned, and the CPU could become a bottleneck. When provisioning and sizing, CPU should always be considered alongside memory, storage, and IOPS.
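The normalization itself is simple arithmetic, sketched here so the thresholds above are easy to interpret:

```python
def normalized_cpu(total_cpu_percent, num_cores):
    """Divide absolute CPU usage by core count to get a 0-100% figure."""
    return total_cpu_percent / num_cores

# e.g. 280% absolute usage on a 4-core host:
print(normalized_cpu(280, 4))  # 70.0 -- at the top of the healthy range
```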
4. Queues
The Queues metric shows the number of operations waiting for the database to process them. It’s a direct measure of demand versus capacity.
- Good: 0. When there are no queues, your database is keeping up with incoming requests.
- Bad: Any sustained number greater than zero. A large queue indicates the database cannot process operations in a timely fashion, leading to increased latency for your application.
Queues are often a symptom of other problems, such as:
- Inefficient queries that need indexes.
- Hardware bottlenecks (CPU, IOPS).
- Inefficient data models, such as many clients trying to update the very same document concurrently, causing lock contention.
Storage and Disk Metrics
5. Disk Space Percent Free
This one is straightforward: it’s the percentage of your disk that is available. Monitoring the percentage is often more intuitive than tracking absolute gigabytes free, as it saves you a few mental steps.
- Good: Consistently above 20%.
- Caution: Falling below 10%. If available space is fully depleted, your database will stop accepting writes, leading to downtime. While 10% of a very large disk is still a lot of space, you may consider setting alerts for this.
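The thresholds above translate into a trivial alerting check. A sketch of how a monitoring script might classify percent-free (thresholds are the ones discussed here, not universal values):

```python
def disk_space_status(free_bytes, total_bytes):
    """Map percent-free to the guideline thresholds: >20% ok, >10% warning."""
    pct_free = 100 * free_bytes / total_bytes
    if pct_free > 20:
        return "ok"
    if pct_free > 10:
        return "warning"
    return "critical"

# 300 GiB free on a 1000 GiB disk = 30% free:
print(disk_space_status(300 * 2**30, 1000 * 2**30))  # "ok"
```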
6. Disk IOPS (I/O Operations Per Second)
This metric reflects the read and write throughput of your disk. It’s crucial to ensure this value stays comfortably below the maximum IOPS provisioned for your hardware.
If your average IOPS is consistently hovering near the maximum, you are approaching a performance cliff. When disk IOPS are saturated, the storage subsystem cannot service read and write requests in a timely manner. This causes a cascading failure:
- The database journaling system may block, waiting to write to disk.
- The storage engine cannot flush modified data from memory to disk in a timely manner (checkpointing).
- This leads to a surge in queue length and operation latency, causing the cluster to become unresponsive or stall.
To fix this, you can either provision more IOPS (which can be costly) or increase your storage size, as IOPS often scale with disk capacity on cloud providers.
7. Disk Latency
This is the average time, in milliseconds, for read and write operations to complete on the disk. It’s a direct measure of your storage performance.
- Excellent: Consistently under 5ms.
- Acceptable: Between 5ms and 20ms.
- Bad: Sustained latency over 20ms signals a disk bottleneck. If you see latency spiking above 100ms, it’s a critical issue that needs immediate investigation with your infrastructure provider.
Memory Related Metrics
MongoDB loves memory. It uses RAM to cache your working set (frequently accessed data and indexes), which dramatically reduces the need for slow disk I/O.
8. Memory (Resident vs. Virtual)
You’ll typically see two memory metrics: virtual and resident. While virtual memory is the total address space allocated by the process, resident memory is the one to watch. It shows the actual amount of physical RAM the mongod process is using.
After your database has been running for a while and has loaded its working set into memory, the resident memory usage should stabilize into a relatively flat line. This indicates it has reached a steady state.
9. Cache Ratio (Fill & Dirty)
The WiredTiger storage engine uses an internal cache to hold data.
Cache Fill Ratio: This measures how full that cache is. In a healthy, active deployment, this value should hover around 80%. If your working set (the data you access frequently) fits in memory, this ratio will be high. If it approaches 100%, it could mean your working set is larger than your cache. Increasing the instance’s RAM could reduce disk I/O and improve performance.
Dirty Fill Ratio: This represents the percentage of the cache that contains “dirty” data—data that has been modified in memory but not yet written (flushed) to disk. This value should stay below 5%. If it consistently goes above this, MongoDB may employ application threads to help with data eviction, directly degrading your database’s performance.
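Both ratios can be derived from a serverStatus snapshot. A minimal sketch, assuming the field names found in the wiredTiger.cache section of serverStatus output (the byte values below are made up for illustration):

```python
def cache_ratios(wt_cache):
    """Compute fill and dirty ratios (%) from a wiredTiger.cache snapshot."""
    max_bytes = wt_cache["maximum bytes configured"]
    fill = 100 * wt_cache["bytes currently in the cache"] / max_bytes
    dirty = 100 * wt_cache["tracked dirty bytes in the cache"] / max_bytes
    return fill, dirty

snapshot = {
    "maximum bytes configured": 1_000_000_000,
    "bytes currently in the cache": 820_000_000,
    "tracked dirty bytes in the cache": 30_000_000,
}
fill, dirty = cache_ratios(snapshot)
print(f"fill={fill:.0f}% dirty={dirty:.0f}%")  # fill=82% dirty=3% -- healthy
```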
10. Connections
Every open connection to your database consumes resources, typically around 1MB of RAM. It’s vital to manage your connection pool effectively. Uncontrolled connections can exhaust RAM and bring a database to its knees.
For applications running in containerized environments like Kubernetes, where many pods can spin up, it’s easy to create a connection storm. I typically recommend setting these two connection string options in your driver:
- maxIdleTimeMS=60000: Closes connections that have been idle for 60 seconds.
- maxPoolSize=5: Limits the number of connections per application instance to a small number.
Your total connections can be estimated with:
Total Connections ≈ (Number of Application Pods) * maxPoolSize
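The estimate is a simple product, sketched here along with a hypothetical connection string (host name invented) showing where the two options from above would go:

```python
def estimated_connections(num_pods, max_pool_size=5):
    """Rough upper bound: each pod holds at most max_pool_size connections."""
    return num_pods * max_pool_size

# Hypothetical URI applying the recommended driver options:
uri = "mongodb+srv://cluster0.example.net/?maxIdleTimeMS=60000&maxPoolSize=5"

print(estimated_connections(40))  # 200 connections from 40 pods
```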
Developer-Focused Metrics
These metrics provide direct feedback on the efficiency of your queries and data model.
11. Query Targeting
This is a powerful metric that measures index efficiency. It’s the ratio of documents scanned to documents returned.
Query Targeting Ratio = documents examined / documents returned
- Good: A ratio of 1:1 is perfect. This means your index was so effective that for every document MongoDB had to look at, it was a document your query needed.
- Bad: A high ratio indicates your queries are scanning many irrelevant documents to find the ones they need. This points to missing or suboptimal indexes.
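The ratio is easy to compute yourself from an explain plan or profiler entry (the fields are typically reported as documents examined and documents returned). A sketch:

```python
def query_targeting_ratio(docs_examined, docs_returned):
    """Documents examined per document returned; 1.0 is ideal."""
    if docs_returned == 0:
        # Scanned documents but returned none: worst-case targeting.
        return float("inf") if docs_examined else 0.0
    return docs_examined / docs_returned

print(query_targeting_ratio(50_000, 50))  # 1000.0 -- likely a missing index
print(query_targeting_ratio(50, 50))      # 1.0 -- perfectly targeted
```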
12. Scan and Order
This metric tracks queries that perform an in-memory sort. Sorting large result sets in memory after fetching them is very expensive, consuming significant CPU and memory.
- Good: 0. This means all sorting is being done efficiently using an index’s inherent order.
- Bad: Any value greater than zero. If you see scanAndOrder operations, you should review your queries to see if a new or modified index can provide the requested sort order, eliminating the costly in-memory step.
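One way to spot the problem per-query is to look for a SORT stage in the explain plan: an in-memory sort shows up as its own stage, while an index-backed sort does not. A minimal sketch that walks a winningPlan tree (simplified to a single inputStage chain; real plans can branch):

```python
def has_in_memory_sort(winning_plan):
    """Walk an explain() winningPlan looking for a SORT stage,
    which indicates a sort not satisfied by an index."""
    stage = winning_plan
    while stage:
        if stage.get("stage") == "SORT":
            return True
        stage = stage.get("inputStage")
    return False

plan = {"stage": "SORT", "inputStage": {"stage": "COLLSCAN"}}
print(has_in_memory_sort(plan))  # True -- consider an index on the sort key
```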
Replication Metrics
For any production replica set, ensuring data is copied efficiently and reliably is paramount.
13. Replication Lag
This is the approximate time, in seconds, that a secondary member is behind the primary’s write operations.
- Good: Low and stable lag, typically under 5 seconds.
- Bad: High (over 10 seconds) or growing lag. This can lead to stale reads from secondaries.
14. Replication Oplog Window
The oplog (operations log) is the special collection that records all write operations. The “oplog window” is the duration of time that those operations are retained. A sufficiently large window allows a secondary that falls behind (e.g., due to a network issue or downtime) to catch up without needing an “initial sync” — a highly resource-intensive process of re-copying the entire dataset.
We typically recommend customers maintain an oplog window of at least 3 days (72 hours). Why? Imagine a secondary stops replicating on a Friday evening. When the DBA comes in on Monday, they have a comfortable buffer to fix the node before it falls too far behind and requires a full resync.
15. Oplog GB/Hour
So, how do you size your oplog for a 3-day window? Use the Oplog GB/Hour metric. Find the amount of oplog data generated during your busiest hour and use that as a baseline.
Required Oplog Size = (Peak Oplog GB/Hour) * 72
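The sizing formula above as a sketch, with the 72-hour window as the default:

```python
def required_oplog_gb(peak_gb_per_hour, window_hours=72):
    """Size the oplog so peak write volume fits the retention window."""
    return peak_gb_per_hour * window_hours

# A deployment generating 0.5 GB of oplog in its busiest hour:
print(required_oplog_gb(0.5))  # 36.0 GB for a 3-day window
```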
Final Thoughts
Monitoring is not a passive activity. It’s about understanding the story your database is telling you. By keeping an eye on these key metrics, you can move from a reactive to a proactive approach, ensuring your MongoDB deployment remains healthy, scalable, and performant.
What are your go-to metrics? Do you have other tips for interpreting database health? Share your thoughts and suggestions in the comments below!
(This article was inspired by and builds upon some concepts from MongoDB’s official monitoring guide.)