-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a flag to adjust mount timeout #1486
Conversation
@SuperQ @pgier @discordianfish Any chance this could get a look? |
I don't think the 30s was completely arbitrary. Some of the idea was that it was a reasonable value that was a multiple of the scrape interval. I could see changing the value from 30s to 10s, but making it a flag seems unecessary. It also may cause users to unnecessarily tweak it without fully understanding the implications. |
If the idea was to be a multiple of the scrape interval, and the user is able to select their desired scrape interval, shouldn't they be able to update something like this then to coincide? Raising or lowering the scrape interval usually involves updating timeouts in other places too. I know there isn't a direct connection between scrape rate and mount timeouts, but the number of missed scrapes on a timeout is based on a function of these two values. If the concern is that there isn't enough understanding in what the flag will change, I can give it a more verbose description. But I think "because the user might change it without understanding it" isn't a great reason not to make it an option to anyone. It's on the user at that point to not modify settings they don't understand. |
I know you think it's not a great reason, and you're probably right, but with these kinds of tuneables we see far too many support issues already with the existing amount of flexibility. Every tuning dimension we add expands the support matrix a little bit more. I would prefer to dig into the history of the current default and find out why the number was chosen, rather than add another |
It basically just boiled down to #997 (comment)
There wasn't much of a discussion about it at the time, kinda just went with it. The question of a flag was dismissed at the time because I didn't see a need for it then, but i'm finding now that 30s is longer than necessary and causing multiple missed scrapes for us. |
@discordianfish Do you have any opinion here? |
I'd be fine merging this but don't feel strongly. @pgier for tie breaking? :) |
I'm ok with merging. I think @mknapphrt's reasoning makes sense, and I'd probably lower the timeout even more (5s or less) since monitoring fliesystems with such a slow response should be fairly uncommon (I hope). |
ab7d912
to
c11be78
Compare
I reduced the time a little more, I think 5s is reasonable too. What do you think @SuperQ ? |
Seems fine to me. I still think I want to limit end-user exposure to this tunable. As a compromise, what do you think about making this a |
Signed-off-by: Mark Knapp <mknapp@hudson-trading.com>
c11be78
to
c9603c6
Compare
Yeah, that sounds fair enough to me. I don't really know how many people would care about this functionality so keeping it hidden is fine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, Thanks!
Any idea when we might see another release cut? |
There are a couple of moderate breaking changes we want to get in before we do a release. The next release we want to do will be a 1.0 release candidate. |
* The netdev collector CLI argument `--collector.netdev.ignored-devices` was renamed to `--collector.netdev.device-blacklist` in order to conform with the systemd collector. #1279 * The label named `state` on `node_systemd_service_restart_total` metrics was changed to `name` to better describe the metric. #1393 * Refactoring of the mdadm collector changes several metrics - `node_md_disks_active` is removed - `node_md_disks` now has a `state` label for "fail", "spare", "active" disks. - `node_md_is_active` is replaced by `node_md_state` with a state set of "active", "inactive", "recovering", "resync". * Additional label `mountaddr` added to NFS device metrics to distinguish mounts from the same URL, but different IP addresses. #1417 * Metrics node_cpu_scaling_frequency_min_hrts and node_cpu_scaling_frequency_max_hrts of the cpufreq collector were renamed to node_cpu_scaling_frequency_min_hertz and node_cpu_scaling_frequency_max_hertz. #1510 * Collectors that are enabled, but are unable to find data to collect, now return 0 for `node_scrape_collector_success`. * [CHANGE] Add `--collector.netdev.device-whitelist`. #1279 * [CHANGE] Ignore iso9600 filesystem on Linux #1355 * [CHANGE] Refactor mdadm collector #1403 * [CHANGE] Add `mountaddr` label to NFS metrics. #1417 * [CHANGE] Don't count empty collectors as success. #1613 * [FEATURE] New flag to disable default collectors #1276 * [FEATURE] Add experimental TLS support #1277, #1687, #1695 * [FEATURE] Add collector for Power Supply Class #1280 * [FEATURE] Add new schedstat collector #1389 * [FEATURE] Add FreeBSD zfs support #1394 * [FEATURE] Add uname support for Darwin and OpenBSD #1433 * [FEATURE] Add new metric node_cpu_info #1489 * [FEATURE] Add new thermal_zone collector #1425 * [FEATURE] Add new cooling_device metrics to thermal zone collector #1445 * [FEATURE] Add swap usage on darwin #1508 * [FEATURE] Add Btrfs collector #1512 * [FEATURE] Add RAPL collector #1523 * [FEATURE] Add new softnet collector #1576 * [FEATURE] Add new udp_queues collector #1503 * [FEATURE] Add basic authentication #1673 * [ENHANCEMENT] Log pid when there is a problem reading the process stats #1341 * [ENHANCEMENT] Collect InfiniBand port state and physical state #1357 * [ENHANCEMENT] Include additional XFS runtime statistics. #1423 * [ENHANCEMENT] Report non-fatal collection errors in the exporter metric. #1439 * [ENHANCEMENT] Expose IPVS firewall mark as a label #1455 * [ENHANCEMENT] Add check for systemd version before attempting to query certain metrics. #1413 * [ENHANCEMENT] Add a flag to adjust mount timeout #1486 * [ENHANCEMENT] Add new counters for flush requests in Linux 5.5 #1548 * [ENHANCEMENT] Add metrics and tests for UDP receive and send buffer errors #1534 * [ENHANCEMENT] The sockstat collector now exposes IPv6 statistics in addition to the existing IPv4 support. #1552 * [ENHANCEMENT] Add infiniband info metric #1563 * [ENHANCEMENT] Add unix socket support for supervisord collector #1592 * [ENHANCEMENT] Implement loadavg on all BSDs without cgo #1584 * [ENHANCEMENT] Add model_name and stepping to node_cpu_info metric #1617 * [ENHANCEMENT] Add `--collector.perf.cpus` to allow setting the CPU list for perf stats. #1561 * [ENHANCEMENT] Add metrics for IO errors and retires on Darwin. #1636 * [ENHANCEMENT] Add perf tracepoint collection flag #1664 * [ENHANCEMENT] ZFS: read contents of objset file #1632 * [ENHANCEMENT] Linux CPU: Cache CPU metrics to make them monotonically increasing #1711 * [BUGFIX] Read /proc/net files with a single read syscall #1380 * [BUGFIX] Renamed label `state` to `name` on `node_systemd_service_restart_total`. #1393 * [BUGFIX] Fix netdev nil reference on Darwin #1414 * [BUGFIX] Strip path.rootfs from mountpoint labels #1421 * [BUGFIX] Fix seconds reported by schedstat #1426 * [BUGFIX] Fix empty string in path.rootfs #1464 * [BUGFIX] Fix typo in cpufreq metric names #1510 * [BUGFIX] Read /proc/stat in one syscall #1538 * [BUGFIX] Fix OpenBSD cache memory information #1542 * [BUGFIX] Refactor textfile collector to avoid looping defer #1549 * [BUGFIX] Fix network speed math #1580 * [BUGFIX] collector/systemd: use regexp to extract systemd version #1647 * [BUGFIX] Fix initialization in perf collector when using multiple CPUs #1665 * [BUGFIX] Fix accidentally empty lines in meminfo_linux #1671 Signed-off-by: Ben Kochie <superq@gmail.com>
* The netdev collector CLI argument `--collector.netdev.ignored-devices` was renamed to `--collector.netdev.device-blacklist` in order to conform with the systemd collector. prometheus#1279 * The label named `state` on `node_systemd_service_restart_total` metrics was changed to `name` to better describe the metric. prometheus#1393 * Refactoring of the mdadm collector changes several metrics - `node_md_disks_active` is removed - `node_md_disks` now has a `state` label for "fail", "spare", "active" disks. - `node_md_is_active` is replaced by `node_md_state` with a state set of "active", "inactive", "recovering", "resync". * Additional label `mountaddr` added to NFS device metrics to distinguish mounts from the same URL, but different IP addresses. prometheus#1417 * Metrics node_cpu_scaling_frequency_min_hrts and node_cpu_scaling_frequency_max_hrts of the cpufreq collector were renamed to node_cpu_scaling_frequency_min_hertz and node_cpu_scaling_frequency_max_hertz. prometheus#1510 * Collectors that are enabled, but are unable to find data to collect, now return 0 for `node_scrape_collector_success`. * [CHANGE] Add `--collector.netdev.device-whitelist`. prometheus#1279 * [CHANGE] Ignore iso9600 filesystem on Linux prometheus#1355 * [CHANGE] Refactor mdadm collector prometheus#1403 * [CHANGE] Add `mountaddr` label to NFS metrics. prometheus#1417 * [CHANGE] Don't count empty collectors as success. prometheus#1613 * [FEATURE] New flag to disable default collectors prometheus#1276 * [FEATURE] Add experimental TLS support prometheus#1277, prometheus#1687, prometheus#1695 * [FEATURE] Add collector for Power Supply Class prometheus#1280 * [FEATURE] Add new schedstat collector prometheus#1389 * [FEATURE] Add FreeBSD zfs support prometheus#1394 * [FEATURE] Add uname support for Darwin and OpenBSD prometheus#1433 * [FEATURE] Add new metric node_cpu_info prometheus#1489 * [FEATURE] Add new thermal_zone collector prometheus#1425 * [FEATURE] Add new cooling_device metrics to thermal zone collector prometheus#1445 * [FEATURE] Add swap usage on darwin prometheus#1508 * [FEATURE] Add Btrfs collector prometheus#1512 * [FEATURE] Add RAPL collector prometheus#1523 * [FEATURE] Add new softnet collector prometheus#1576 * [FEATURE] Add new udp_queues collector prometheus#1503 * [FEATURE] Add basic authentication prometheus#1673 * [ENHANCEMENT] Log pid when there is a problem reading the process stats prometheus#1341 * [ENHANCEMENT] Collect InfiniBand port state and physical state prometheus#1357 * [ENHANCEMENT] Include additional XFS runtime statistics. prometheus#1423 * [ENHANCEMENT] Report non-fatal collection errors in the exporter metric. prometheus#1439 * [ENHANCEMENT] Expose IPVS firewall mark as a label prometheus#1455 * [ENHANCEMENT] Add check for systemd version before attempting to query certain metrics. prometheus#1413 * [ENHANCEMENT] Add a flag to adjust mount timeout prometheus#1486 * [ENHANCEMENT] Add new counters for flush requests in Linux 5.5 prometheus#1548 * [ENHANCEMENT] Add metrics and tests for UDP receive and send buffer errors prometheus#1534 * [ENHANCEMENT] The sockstat collector now exposes IPv6 statistics in addition to the existing IPv4 support. prometheus#1552 * [ENHANCEMENT] Add infiniband info metric prometheus#1563 * [ENHANCEMENT] Add unix socket support for supervisord collector prometheus#1592 * [ENHANCEMENT] Implement loadavg on all BSDs without cgo prometheus#1584 * [ENHANCEMENT] Add model_name and stepping to node_cpu_info metric prometheus#1617 * [ENHANCEMENT] Add `--collector.perf.cpus` to allow setting the CPU list for perf stats. prometheus#1561 * [ENHANCEMENT] Add metrics for IO errors and retires on Darwin. prometheus#1636 * [ENHANCEMENT] Add perf tracepoint collection flag prometheus#1664 * [ENHANCEMENT] ZFS: read contents of objset file prometheus#1632 * [ENHANCEMENT] Linux CPU: Cache CPU metrics to make them monotonically increasing prometheus#1711 * [BUGFIX] Read /proc/net files with a single read syscall prometheus#1380 * [BUGFIX] Renamed label `state` to `name` on `node_systemd_service_restart_total`. prometheus#1393 * [BUGFIX] Fix netdev nil reference on Darwin prometheus#1414 * [BUGFIX] Strip path.rootfs from mountpoint labels prometheus#1421 * [BUGFIX] Fix seconds reported by schedstat prometheus#1426 * [BUGFIX] Fix empty string in path.rootfs prometheus#1464 * [BUGFIX] Fix typo in cpufreq metric names prometheus#1510 * [BUGFIX] Read /proc/stat in one syscall prometheus#1538 * [BUGFIX] Fix OpenBSD cache memory information prometheus#1542 * [BUGFIX] Refactor textfile collector to avoid looping defer prometheus#1549 * [BUGFIX] Fix network speed math prometheus#1580 * [BUGFIX] collector/systemd: use regexp to extract systemd version prometheus#1647 * [BUGFIX] Fix initialization in perf collector when using multiple CPUs prometheus#1665 * [BUGFIX] Fix accidentally empty lines in meminfo_linux prometheus#1671 Signed-off-by: Ben Kochie <superq@gmail.com>
* The netdev collector CLI argument `--collector.netdev.ignored-devices` was renamed to `--collector.netdev.device-blacklist` in order to conform with the systemd collector. prometheus#1279 * The label named `state` on `node_systemd_service_restart_total` metrics was changed to `name` to better describe the metric. prometheus#1393 * Refactoring of the mdadm collector changes several metrics - `node_md_disks_active` is removed - `node_md_disks` now has a `state` label for "fail", "spare", "active" disks. - `node_md_is_active` is replaced by `node_md_state` with a state set of "active", "inactive", "recovering", "resync". * Additional label `mountaddr` added to NFS device metrics to distinguish mounts from the same URL, but different IP addresses. prometheus#1417 * Metrics node_cpu_scaling_frequency_min_hrts and node_cpu_scaling_frequency_max_hrts of the cpufreq collector were renamed to node_cpu_scaling_frequency_min_hertz and node_cpu_scaling_frequency_max_hertz. prometheus#1510 * Collectors that are enabled, but are unable to find data to collect, now return 0 for `node_scrape_collector_success`. * [CHANGE] Add `--collector.netdev.device-whitelist`. prometheus#1279 * [CHANGE] Ignore iso9600 filesystem on Linux prometheus#1355 * [CHANGE] Refactor mdadm collector prometheus#1403 * [CHANGE] Add `mountaddr` label to NFS metrics. prometheus#1417 * [CHANGE] Don't count empty collectors as success. prometheus#1613 * [FEATURE] New flag to disable default collectors prometheus#1276 * [FEATURE] Add experimental TLS support prometheus#1277, prometheus#1687, prometheus#1695 * [FEATURE] Add collector for Power Supply Class prometheus#1280 * [FEATURE] Add new schedstat collector prometheus#1389 * [FEATURE] Add FreeBSD zfs support prometheus#1394 * [FEATURE] Add uname support for Darwin and OpenBSD prometheus#1433 * [FEATURE] Add new metric node_cpu_info prometheus#1489 * [FEATURE] Add new thermal_zone collector prometheus#1425 * [FEATURE] Add new cooling_device metrics to thermal zone collector prometheus#1445 * [FEATURE] Add swap usage on darwin prometheus#1508 * [FEATURE] Add Btrfs collector prometheus#1512 * [FEATURE] Add RAPL collector prometheus#1523 * [FEATURE] Add new softnet collector prometheus#1576 * [FEATURE] Add new udp_queues collector prometheus#1503 * [FEATURE] Add basic authentication prometheus#1673 * [ENHANCEMENT] Log pid when there is a problem reading the process stats prometheus#1341 * [ENHANCEMENT] Collect InfiniBand port state and physical state prometheus#1357 * [ENHANCEMENT] Include additional XFS runtime statistics. prometheus#1423 * [ENHANCEMENT] Report non-fatal collection errors in the exporter metric. prometheus#1439 * [ENHANCEMENT] Expose IPVS firewall mark as a label prometheus#1455 * [ENHANCEMENT] Add check for systemd version before attempting to query certain metrics. prometheus#1413 * [ENHANCEMENT] Add a flag to adjust mount timeout prometheus#1486 * [ENHANCEMENT] Add new counters for flush requests in Linux 5.5 prometheus#1548 * [ENHANCEMENT] Add metrics and tests for UDP receive and send buffer errors prometheus#1534 * [ENHANCEMENT] The sockstat collector now exposes IPv6 statistics in addition to the existing IPv4 support. prometheus#1552 * [ENHANCEMENT] Add infiniband info metric prometheus#1563 * [ENHANCEMENT] Add unix socket support for supervisord collector prometheus#1592 * [ENHANCEMENT] Implement loadavg on all BSDs without cgo prometheus#1584 * [ENHANCEMENT] Add model_name and stepping to node_cpu_info metric prometheus#1617 * [ENHANCEMENT] Add `--collector.perf.cpus` to allow setting the CPU list for perf stats. prometheus#1561 * [ENHANCEMENT] Add metrics for IO errors and retires on Darwin. prometheus#1636 * [ENHANCEMENT] Add perf tracepoint collection flag prometheus#1664 * [ENHANCEMENT] ZFS: read contents of objset file prometheus#1632 * [ENHANCEMENT] Linux CPU: Cache CPU metrics to make them monotonically increasing prometheus#1711 * [BUGFIX] Read /proc/net files with a single read syscall prometheus#1380 * [BUGFIX] Renamed label `state` to `name` on `node_systemd_service_restart_total`. prometheus#1393 * [BUGFIX] Fix netdev nil reference on Darwin prometheus#1414 * [BUGFIX] Strip path.rootfs from mountpoint labels prometheus#1421 * [BUGFIX] Fix seconds reported by schedstat prometheus#1426 * [BUGFIX] Fix empty string in path.rootfs prometheus#1464 * [BUGFIX] Fix typo in cpufreq metric names prometheus#1510 * [BUGFIX] Read /proc/stat in one syscall prometheus#1538 * [BUGFIX] Fix OpenBSD cache memory information prometheus#1542 * [BUGFIX] Refactor textfile collector to avoid looping defer prometheus#1549 * [BUGFIX] Fix network speed math prometheus#1580 * [BUGFIX] collector/systemd: use regexp to extract systemd version prometheus#1647 * [BUGFIX] Fix initialization in perf collector when using multiple CPUs prometheus#1665 * [BUGFIX] Fix accidentally empty lines in meminfo_linux prometheus#1671 Signed-off-by: Ben Kochie <superq@gmail.com>
Allow the user to specify a timeout for considering mounts stale. Previously this was set to a static 30s fairly arbitrarily. I lowered the default because I think it should be lower than the default scrape rate for prometheus so that a timeout doesn't cause a failed scrape. The reason I feel this should be adjustable is that it could be useful to change this value to better match a users scrape rate and the known load on a filesystem.
A couple scenarios that come to mind would be if an nfs mount goes stale and someone has a low scrape rate, say 5-10s, this means that there will be 3-6 failed scrapes from timing out. The user would probably rather set this timeout lower. Or the opposite, if a filesystem is known to be under heavy load at times, a user may not want to timeout quickly.
Signed-off-by: Mark Knapp mknapp@hudson-trading.com