mongodb-atlas

Monitor Type: mongodb-atlas (Source)

Accepts Endpoints: No

Multiple Instances Allowed: Yes

Overview

MongoDB Atlas is a provider of MongoDB as an on-demand fully managed service. Atlas exposes MongoDB cluster monitoring and logging data through its monitoring and logs REST API endpoints. These Atlas monitoring API resources are grouped into measurements for MongoDB processes, host disks and MongoDB databases.

This monitor repeatedly scrapes MongoDB monitoring data from Atlas at the configured time interval. It scrapes the process and disk measurements into metric groups called mongodb and hardware. The original measurement names are included in the metric descriptions. For each measurement, a set of data points is fetched at the configured granularity and period, and the metric value is set to the latest non-empty data point in that set. The finest granularity supported by Atlas is 1 minute. The configured period must be wider than the interval at which Atlas provides values for measurements; otherwise, some of the fetched sets will contain only empty values. The default period is 20 minutes, which works across all measurements and gives a reasonable response payload size.

Below is an excerpt of the agent configuration YAML showing the minimal required fields. Note that disableHostDimensions is set to true so that the name of the host on which the agent/monitor is running is not used as the host metric dimension value. The names of the MongoDB cluster hosts from which the metrics originate are used instead.

monitors:
- type: mongodb-atlas
  projectID:  <Project ID>
  publicKey:  <Public API Key>
  privateKey: <Private API Key>
  disableHostDimensions: true

Configuration

To activate this monitor in the Smart Agent, add the following to your agent config:

monitors:  # All monitor config goes under this key
 - type: mongodb-atlas
   ...  # Additional config

For a list of monitor options that are common to all monitors, see Common Configuration.

| Config option | Required | Type | Description |
| --- | --- | --- | --- |
| projectID | yes | string | ProjectID is the Atlas project ID. |
| publicKey | yes | string | PublicKey is the Atlas public API key. |
| privateKey | yes | string | PrivateKey is the Atlas private API key. |
| timeout | no | int64 | Timeout for HTTP requests to get MongoDB process measurements from Atlas. This should be a duration string that is accepted by https://golang.org/pkg/time/#ParseDuration (default: 5s) |
| enableCache | no | bool | EnableCache enables locally cached Atlas metric measurements to be used when true. The metric measurements that were supposed to be fetched are in fact always fetched asynchronously and cached. (default: true) |
| granularity | no | string | Granularity is the duration in ISO 8601 notation that specifies the interval between measurement data points from Atlas over the configured period. The default is the shortest duration supported by Atlas, 1 minute. (default: PT1M) |
| period | no | string | Period is the duration in ISO 8601 notation that specifies how far back in the past to retrieve measurements from Atlas. (default: PT20M) |
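
As a rough illustration of the optional settings above, the excerpt below overrides each default. The specific values (10s, PT5M, PT1H) are assumptions chosen for the example, not recommendations; confirm which granularity and period values Atlas accepts for your project before relying on them.

```yaml
monitors:
- type: mongodb-atlas
  projectID:  <Project ID>
  publicKey:  <Public API Key>
  privateKey: <Private API Key>
  disableHostDimensions: true
  # Illustrative values only; the documented defaults are 5s, true, PT1M and PT20M.
  timeout: 10s        # Go duration string for HTTP requests to Atlas
  enableCache: true   # use the locally cached measurements
  granularity: PT5M   # ISO 8601 spacing between fetched data points
  period: PT1H        # ISO 8601 look-back window; must exceed the interval at which Atlas reports values
```

A wider period returns more data points per measurement, so the response payload grows with it.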

Metrics

These are the metrics available for this monitor. Metrics that are categorized as container/host (default) are in bold and italics in the list below.

Group hardware

All of the following metrics are part of the hardware metric group. All of the non-default metrics below can be turned on by adding hardware to the monitor config option extraGroups:

  • disk.partition.iops.read (counter)
    This is Atlas metric measurement DISK_PARTITION_IOPS_READ. The read throughput of I/O operations per second for the disk partition used for MongoDB.
  • disk.partition.iops.total (counter)
    This is Atlas metric measurement DISK_PARTITION_IOPS_TOTAL. The total throughput of I/O operations per second for the disk partition used for MongoDB.
  • disk.partition.iops.write (counter)
    This is Atlas metric measurement DISK_PARTITION_IOPS_WRITE. The write throughput of I/O operations per second for the disk partition used for MongoDB.
  • disk.partition.latency.read (gauge)
    This is Atlas metric measurement DISK_PARTITION_LATENCY_READ. The read latency in milliseconds of the disk partition used by MongoDB.
  • disk.partition.latency.write (gauge)
    This is Atlas metric measurement DISK_PARTITION_LATENCY_WRITE. The write latency in milliseconds of the disk partition used by MongoDB.
  • disk.partition.space.free (gauge)
    This is Atlas metric measurement DISK_PARTITION_SPACE_FREE. The total bytes of free disk space on the disk partition used by MongoDB.
  • disk.partition.space.percent_free (gauge)
    This is Atlas metric measurement DISK_PARTITION_SPACE_PERCENT_FREE. The percent of free disk space on the partition used by MongoDB.
  • disk.partition.space.percent_used (gauge)
    This is Atlas metric measurement DISK_PARTITION_SPACE_PERCENT_USED. The percent of used disk space on the partition that runs MongoDB.
  • disk.partition.space.used (gauge)
    This is Atlas metric measurement DISK_PARTITION_SPACE_USED. The total bytes of used disk space on the partition that runs MongoDB.
  • disk.partition.utilization (gauge)
    This is Atlas metric measurement DISK_PARTITION_UTILIZATION. The percentage of time during which requests are being issued to and serviced by the partition. This includes requests from any process, not just MongoDB processes.
  • process.cpu.kernel (gauge)
    This is Atlas metric measurement PROCESS_CPU_KERNEL. The percentage of time the CPU spent servicing operating system calls for this MongoDB process. For servers with more than 1 CPU core, this value can exceed 100%.
  • process.cpu.user (gauge)
    This is Atlas metric measurement PROCESS_CPU_USER. The percentage of time the CPU spent servicing this MongoDB process. For servers with more than 1 CPU core, this value can exceed 100%.
  • process.normalized.cpu.children_kernel (gauge)
    This is Atlas metric measurement PROCESS_NORMALIZED_CPU_CHILDREN_KERNEL. The percentage of time the CPU spent servicing operating system calls for this MongoDB process's children, scaled to a range of 0-100% by dividing by the number of CPU cores.
  • process.normalized.cpu.children_user (gauge)
    This is Atlas metric measurement PROCESS_NORMALIZED_CPU_CHILDREN_USER. The percentage of time the CPU spent servicing this MongoDB process's children, scaled to a range of 0-100% by dividing by the number of CPU cores.
  • process.normalized.cpu.kernel (gauge)
    This is Atlas metric measurement PROCESS_NORMALIZED_CPU_KERNEL. The percentage of time the CPU spent servicing operating system calls for this MongoDB process, scaled to a range of 0-100% by dividing by the number of CPU cores.
  • process.normalized.cpu.user (gauge)
    This is Atlas metric measurement PROCESS_NORMALIZED_CPU_USER. The percentage of time the CPU spent servicing this MongoDB process, scaled to a range of 0-100% by dividing by the number of CPU cores.
  • system.cpu.guest (gauge)
    This is Atlas metric measurement SYSTEM_CPU_GUEST. The percentage of time the CPU spent servicing guest, which is included in user. For servers with more than 1 CPU core, this value can exceed 100%.
  • system.cpu.iowait (gauge)
    This is Atlas metric measurement SYSTEM_CPU_IOWAIT. The percentage of time the CPU spent waiting for IO operations to complete. For servers with more than 1 CPU core, this value can exceed 100%.
  • system.cpu.irq (gauge)
    This is Atlas metric measurement SYSTEM_CPU_IRQ. The percentage of time the CPU spent performing hardware interrupts. For servers with more than 1 CPU core, this value can exceed 100%.
  • system.cpu.kernel (gauge)
    This is Atlas metric measurement SYSTEM_CPU_KERNEL. The percentage of time the CPU spent servicing operating system calls from all processes. For servers with more than 1 CPU core, this value can exceed 100%.
  • system.cpu.nice (gauge)
    This is Atlas metric measurement SYSTEM_CPU_NICE. The percentage of time the CPU spent occupied by all processes with a positive nice value. For servers with more than 1 CPU core, this value can exceed 100%.
  • system.cpu.softirq (gauge)
    This is Atlas metric measurement SYSTEM_CPU_SOFTIRQ. The percentage of time the CPU spent performing software interrupts. For servers with more than 1 CPU core, this value can exceed 100%.
  • system.cpu.steal (gauge)
    This is Atlas metric measurement SYSTEM_CPU_STEAL. The percentage of time the CPU had something runnable, but the hypervisor chose to run something else. For servers with more than 1 CPU core, this value can exceed 100%.
  • system.cpu.user (gauge)
    This is Atlas metric measurement SYSTEM_CPU_USER. The percentage of time the CPU spent servicing all user applications (not just MongoDB processes). For servers with more than 1 CPU core, this value can exceed 100%.
  • system.normalized.cpu.guest (gauge)
    This is Atlas metric measurement SYSTEM_NORMALIZED_CPU_GUEST. The percentage of time the CPU spent servicing guest, which is included in user. It is scaled to a range of 0-100% by dividing by the number of CPU cores.
  • system.normalized.cpu.iowait (gauge)
    This is Atlas metric measurement SYSTEM_NORMALIZED_CPU_IOWAIT. The percentage of time the CPU spent waiting for IO operations to complete. It is scaled to a range of 0-100% by dividing by the number of CPU cores.
  • system.normalized.cpu.irq (gauge)
    This is Atlas metric measurement SYSTEM_NORMALIZED_CPU_IRQ. The percentage of time the CPU spent performing hardware interrupts. It is scaled to a range of 0-100% by dividing by the number of CPU cores.
  • system.normalized.cpu.kernel (gauge)
    This is Atlas metric measurement SYSTEM_NORMALIZED_CPU_KERNEL. The percentage of time the CPU spent servicing operating system calls from all processes. It is scaled to a range of 0-100% by dividing by the number of CPU cores.
  • system.normalized.cpu.nice (gauge)
    This is Atlas metric measurement SYSTEM_NORMALIZED_CPU_NICE. The percentage of time the CPU spent occupied by all processes with a positive nice value. It is scaled to a range of 0-100% by dividing by the number of CPU cores.
  • system.normalized.cpu.softirq (gauge)
    This is Atlas metric measurement SYSTEM_NORMALIZED_CPU_SOFTIRQ. The percentage of time the CPU spent performing software interrupts. It is scaled to a range of 0-100% by dividing by the number of CPU cores.
  • system.normalized.cpu.steal (gauge)
    This is Atlas metric measurement SYSTEM_NORMALIZED_CPU_STEAL. The percentage of time the CPU had something runnable, but the hypervisor chose to run something else. It is scaled to a range of 0-100% by dividing by the number of CPU cores.
  • system.normalized.cpu.user (gauge)
    This is Atlas metric measurement SYSTEM_NORMALIZED_CPU_USER. The percentage of time the CPU spent servicing all user applications (not just MongoDB processes). It is scaled to a range of 0-100% by dividing by the number of CPU cores.
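
A minimal sketch of enabling the non-default hardware metrics listed above, assuming extraGroups takes a YAML list of group names:

```yaml
monitors:
- type: mongodb-atlas
  projectID:  <Project ID>
  publicKey:  <Public API Key>
  privateKey: <Private API Key>
  disableHostDimensions: true
  # Assumed list syntax; turns on every non-default metric in the hardware group.
  extraGroups:
  - hardware
```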

Group mongodb

All of the following metrics are part of the mongodb metric group. All of the non-default metrics below can be turned on by adding mongodb to the monitor config option extraGroups:

  • asserts.msg (counter)
    This is Atlas metric measurement ASSERT_MSG. The average rate of message asserts per second over the selected sample period. These are internal server errors that have a well defined text string. Stack traces are logged for these.
  • asserts.regular (counter)
    This is Atlas metric measurement ASSERT_REGULAR. The average rate of regular asserts raised per second over the selected sample period.
  • asserts.user (counter)
    This is Atlas metric measurement ASSERT_USER. The average rate of user asserts per second over the selected sample period. These are errors that can be generated by a user such as out of disk space or duplicate key.
  • asserts.warning (counter)
    This is Atlas metric measurement ASSERT_WARNING. The average rate of warnings per second over the selected sample period.
  • background_flush_avg (counter)
    This is Atlas metric measurement BACKGROUND_FLUSH_AVG. Amount of data flushed in the background.
  • cache.bytes.read_into (counter)
    This is Atlas metric measurement CACHE_BYTES_READ_INTO. The average rate of bytes per second read into WiredTiger's cache over the selected sample period.
  • cache.bytes.written_from (counter)
    This is Atlas metric measurement CACHE_BYTES_WRITTEN_FROM. The average rate of bytes per second written from WiredTiger's cache over the selected sample period.
  • cache.dirty_bytes (gauge)
    This is Atlas metric measurement CACHE_DIRTY_BYTES. The number of tracked dirty bytes currently in the WiredTiger cache.
  • cache.used_bytes (gauge)
    This is Atlas metric measurement CACHE_USED_BYTES. The number of bytes currently in the WiredTiger cache.
  • connections.current (gauge)
    This is Atlas metric measurement CONNECTIONS. The number of currently active connections to this server. A stack is allocated per connection; thus very many connections can result in significant RAM usage.
  • cursors.timed_out (counter)
    This is Atlas metric measurement CURSORS_TOTAL_TIMED_OUT. The average rate of cursors that have timed out per second over the selected sample period.
  • cursors.total_open (gauge)
    This is Atlas metric measurement CURSORS_TOTAL_OPEN. The number of cursors that the server is maintaining for clients. Because MongoDB exhausts unused cursors, typically this value is small or zero. However, if there is a queue, stale tailable cursors, or a large number of operations this value may rise.
  • data_size (gauge)
    This is Atlas metric measurement DB_DATA_SIZE_TOTAL. Sum total size in bytes of the document data (including the padding factor) across all databases.
  • document.metrics.deleted (counter)
    This is Atlas metric measurement DOCUMENT_METRICS_DELETED. The average rate per second of documents deleted over the selected sample period.
  • document.metrics.inserted (counter)
    This is Atlas metric measurement DOCUMENT_METRICS_INSERTED. The average rate per second of documents inserted over the selected sample period.
  • document.metrics.returned (counter)
    This is Atlas metric measurement DOCUMENT_METRICS_RETURNED. The average rate per second of documents returned by queries over the selected sample period.
  • document.metrics.updated (counter)
    This is Atlas metric measurement DOCUMENT_METRICS_UPDATED. The average rate per second of documents updated over the selected sample period.
  • extra_info.page_faults (counter)
    This is Atlas metric measurement EXTRA_INFO_PAGE_FAULTS. The average rate of page faults on this process per second over the selected sample period. In non-Windows environments this is hard page faults only.
  • global_lock.current_queue.readers (gauge)
    This is Atlas metric measurement GLOBAL_LOCK_CURRENT_QUEUE_READERS. The number of operations queued waiting for a read lock.
  • global_lock.current_queue.total (gauge)
    This is Atlas metric measurement GLOBAL_LOCK_CURRENT_QUEUE_TOTAL. The number of operations queued waiting for any lock.
  • global_lock.current_queue.writers (gauge)
    This is Atlas metric measurement GLOBAL_LOCK_CURRENT_QUEUE_WRITERS. The number of operations queued waiting for a write lock.
  • index_size (gauge)
    This is Atlas metric measurement DB_INDEX_SIZE_TOTAL. Sum total size in bytes of the index data across all databases.
  • mem.mapped (gauge)
    This is Atlas metric measurement MEMORY_MAPPED. As MMAPv1 memory maps all the data files, this number is likely similar to your total database(s) size. WiredTiger does not use memory mapped files, so this should be 0.
  • mem.resident (gauge)
    This is Atlas metric measurement MEMORY_RESIDENT. The number of megabytes resident. MMAPv1: It is typical over time, on a dedicated database server, for this number to approach the amount of physical RAM on the box. WiredTiger: In a standard deployment, resident is the amount of memory used by the WiredTiger cache plus the memory dedicated to other in-memory structures used by the mongod process. By default, mongod with WiredTiger reserves 50% of the total physical memory on the server for the cache, and at steady state WiredTiger tries to limit cache usage to 80% of that total. For example, if a server has 16GB of memory, WiredTiger will assume it can use 8GB for cache and at steady state should use about 6.5GB.
  • mem.virtual (gauge)
    This is Atlas metric measurement MEMORY_VIRTUAL. The virtual megabytes for the mongod process. MMAPv1: Generally virtual should be a little larger than mapped (or 2x with --journal), but if virtual is many gigabytes larger, it indicates that excessive memory is being used by other aspects than the memory mapping of files -- that would be bad/suboptimal. The most common case of usage of a high amount of memory for non-mapped is that there are very many connections to the database. Each connection has a thread stack and the memory for those stacks can add up to a considerable amount. WiredTiger: Generally virtual should be a little larger than mapped, but if virtual is many gigabytes larger, it indicates that excessive memory is being used by other aspects than the memory mapping of files – that would be bad/suboptimal. The most common case of usage of a high amount of memory for non-mapped is that there are very many connections to the database. Each connection has a thread stack and the memory for those stacks can add up to a considerable amount.
  • network.bytes_in (gauge)
    This is Atlas metric measurement NETWORK_BYTES_IN. The average rate of physical (after any wire compression) bytes sent to this database server per second over the selected sample period.
  • network.bytes_out (gauge)
    This is Atlas metric measurement NETWORK_BYTES_OUT. The average rate of physical (after any wire compression) bytes sent from this database server per second over the selected sample period.
  • network.num_requests (counter)
    This is Atlas metric measurement NETWORK_NUM_REQUESTS. The average rate of requests sent to this database server per second over the selected sample period.
  • op.execution.time.commands (gauge)
    This is Atlas metric measurement OP_EXECUTION_TIME_COMMANDS. The average execution time in milliseconds per command operation over the selected sample period.
  • op.execution.time.reads (gauge)
    This is Atlas metric measurement OP_EXECUTION_TIME_READS. The average execution time in milliseconds per read operation over the selected sample period.
  • op.execution.time.writes (gauge)
    This is Atlas metric measurement OP_EXECUTION_TIME_WRITES. The average execution time in milliseconds per write operation over the selected sample period.
  • opcounter.command (counter)
    This is Atlas metric measurement OPCOUNTER_CMD. The average rate of commands performed per second over the selected sample period.
  • opcounter.delete (counter)
    This is Atlas metric measurement OPCOUNTER_DELETE. The average rate of deletes performed per second over the selected sample period.
  • opcounter.getmore (counter)
    This is Atlas metric measurement OPCOUNTER_GETMORE. The average rate of getMores performed per second on any cursor over the selected sample period. On a primary, this number can be high even if the query count is low as the secondaries 'getMore' from the primary often as part of replication.
  • opcounter.insert (counter)
    This is Atlas metric measurement OPCOUNTER_INSERT. The average rate of inserts performed per second over the selected sample period.
  • opcounter.query (counter)
    This is Atlas metric measurement OPCOUNTER_QUERY. The average rate of queries performed per second over the selected sample period.
  • opcounter.repl.command (counter)
    This is Atlas metric measurement OPCOUNTER_REPL_CMD. The average rate of replicated commands applied per second over the selected sample period.
  • opcounter.repl.delete (counter)
    This is Atlas metric measurement OPCOUNTER_REPL_DELETE. The average rate of replicated deletes applied per second over the selected sample period.
  • opcounter.repl.insert (counter)
    This is Atlas metric measurement OPCOUNTER_REPL_INSERT. The average rate of replicated inserts applied per second over the selected sample period.
  • opcounter.repl.update (counter)
    This is Atlas metric measurement OPCOUNTER_REPL_UPDATE. The average rate of replicated updates applied per second over the selected sample period.
  • opcounter.update (counter)
    This is Atlas metric measurement OPCOUNTER_UPDATE. The average rate of updates performed per second over the selected sample period.
  • operations_scan_and_order (counter)
    This is Atlas metric measurement OPERATIONS_SCAN_AND_ORDER. The average rate per second over the selected sample period of queries that return sorted results that cannot perform the sort operation using an index.
  • oplog.master.lag_time_diff (gauge)
    This is Atlas metric measurement OPLOG_MASTER_LAG_TIME_DIFF. The replication headroom which is the difference between the primary's replication oplog window (i.e. latest minus oldest oplog entry time) and the secondary's replication lag. A secondary can go into RECOVERING if this value goes to zero.
  • oplog.master.time (gauge)
    This is Atlas metric measurement OPLOG_MASTER_TIME. The replication oplog window. The approximate number of hours available in the primary's replication oplog. If a secondary is behind real-time by more than this amount, it cannot catch up and will require a full resync.
  • oplog.rate (gauge)
    This is Atlas metric measurement OPLOG_RATE_GB_PER_HOUR. The average rate of gigabytes of oplog the primary generates per hour.
  • oplog.slave.lag_master_time (gauge)
    This is Atlas metric measurement OPLOG_SLAVE_LAG_MASTER_TIME. The replication lag. The approximate number of seconds the secondary is behind the primary in write application. Only accurate if the lag is larger than 1-2 seconds, as the precision of this statistic is limited.
  • query.executor.scanned (counter)
    This is Atlas metric measurement QUERY_EXECUTOR_SCANNED. The average rate per second over the selected sample period of index items scanned during queries and query-plan evaluation. This rate is driven by the same value as totalKeysExamined in the output of explain().
  • query.executor.scanned_objects (counter)
    This is Atlas metric measurement QUERY_EXECUTOR_SCANNED_OBJECTS. The average rate per second over the selected sample period of documents scanned during queries and query-plan evaluation. This rate is driven by the same value as totalDocsExamined in the output of explain().
  • query.targeting.scanned_objects_per_returned (gauge)
    This is Atlas metric measurement QUERY_TARGETING_SCANNED_OBJECTS_PER_RETURNED. The ratio of the number of documents scanned to the number of documents returned by queries, since the previous data point for the selected sample period.
  • query.targeting.scanned_per_returned (gauge)
    This is Atlas metric measurement QUERY_TARGETING_SCANNED_PER_RETURNED. The ratio of the number of index items scanned to the number of documents returned by queries, since the previous data point for the selected sample period. A value of 1.0 means all documents returned exactly match query criteria for the sample period. A value of 100 means on average for the sample period, a query scans 100 documents to find one that's returned.
  • storage_size (gauge)
    This is Atlas metric measurement DB_STORAGE_TOTAL. Sum total on-disk storage space allocated for document storage across all databases.
  • tickets.available.reads (gauge)
    This is Atlas metric measurement TICKETS_AVAILABLE_READS. The number of read tickets available to the WiredTiger storage engine. Read tickets represent the number of concurrent read operations allowed into the storage engine. When this value reaches zero new read requests may queue until a read ticket becomes available.
  • tickets.available.write (gauge)
    This is Atlas metric measurement TICKETS_AVAILABLE_WRITE. The number of write tickets available to the WiredTiger storage engine. Write tickets represent the number of concurrent write operations allowed into the storage engine. When this value reaches zero new write requests may queue until a write ticket becomes available.
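
A similar sketch for the mongodb group. It assumes extraGroups accepts more than one group name, so both groups described in this document can be enabled together:

```yaml
monitors:
- type: mongodb-atlas
  projectID:  <Project ID>
  publicKey:  <Public API Key>
  privateKey: <Private API Key>
  disableHostDimensions: true
  # Assumed list syntax; enables the non-default metrics of both groups.
  extraGroups:
  - mongodb
  - hardware
```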

Non-default metrics (version 4.7.0+)

To emit metrics that are not default, you can add those metrics in the generic monitor-level extraMetrics config option. Metrics that are derived from specific configuration options that do not appear in the above list of metrics do not need to be added to extraMetrics.
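
For example, to emit only a few non-default metrics rather than a whole group, a configuration along these lines could be used. This is a sketch that assumes extraMetrics takes a YAML list of metric names; the three names are picked from the lists above purely for illustration.

```yaml
monitors:
- type: mongodb-atlas
  projectID:  <Project ID>
  publicKey:  <Public API Key>
  privateKey: <Private API Key>
  disableHostDimensions: true
  # Assumed list syntax; metric names are taken from the groups listed above.
  extraMetrics:
  - disk.partition.space.percent_used
  - oplog.slave.lag_master_time
  - tickets.available.reads
```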

To see a list of metrics that will be emitted you can run agent-status monitors after configuring this monitor in a running agent instance.