Add available memory state for hostmetrics receiver #7417
cc @dmitryax

/cc @dashpole I think you are using this as well
I don't think either of those proposals improves the problem with the values not adding up. I'd like to add proposal 3, fixing the set of states to exactly cover the memory on the system. If there are states that double-count memory, they should be separate metrics. Here are the non-overlapping states from Munin and how they calculate them from /proc/meminfo as an inspiration:
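As a rough illustration of that third proposal's invariant, here is a Go sketch (with made-up /proc/meminfo kB values, and `used` derived as the remainder) in which the states partition MemTotal exactly; Munin's actual formula may differ in which fields it subtracts:

```go
package main

import "fmt"

// Illustrative /proc/meminfo values in kB (hypothetical numbers, not from a real host).
const (
	memTotal        = 16384000
	memFree         = 1024000
	buffers         = 512000
	cached          = 4096000
	slabReclaimable = 256000
	slabUnreclaim   = 128000
)

// nonOverlappingStates derives "used" as the remainder so that the
// states sum exactly to MemTotal, i.e. no state double-counts memory.
func nonOverlappingStates() map[string]int64 {
	used := int64(memTotal - memFree - buffers - cached - slabReclaimable - slabUnreclaim)
	return map[string]int64{
		"free":               memFree,
		"buffered":           buffers,
		"cached":             cached,
		"slab_reclaimable":   slabReclaimable,
		"slab_unreclaimable": slabUnreclaim,
		"used":               used,
	}
}

func main() {
	states := nonOverlappingStates()
	var sum int64
	for _, v := range states {
		sum += v
	}
	fmt.Println(sum == memTotal) // the states now cover physical memory exactly
}
```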
(I'm not sure if it makes sense to include swap; I'm guessing that with the current set of states we're not including swap? You could leave that out and I think the remaining states would represent total physical memory.) I think the main difference between this and proposal 2 is that here we're changing the definition of
@quentinmit I really like your proposal to solve the sum assumption, it provides much more information to the user and we prevent having overlapping states. Tested without the Swap states as they are covered in the paging scraper and the sum usage is correct. Nonetheless, I have a couple of concerns:
In my opinion, the available or non-available state is a very useful metric when defining alerts. The metric
Proposal 3: Another possible solution would be to add a new memory metric only for this purpose, as making any change to the current states would break the sum assumption. The memory system scraper will generate the following metrics:

I would appreciate any opinion on this.
Either way, the average difference is around 7-8%. By providing both values we let the user choose whichever fits best.
Is this the same as your proposed solution 1? I like the idea of adding a new metric; it allows those who currently rely on
@TylerHelmuth No, thanks for pointing that out. It would be a new proposal; let me update the comment. Proposals 1 and 2 are based on the "state" attribute used in the usage and utilization metrics. Proposal 3 won't modify the states or the current metrics; instead, it adds a new one.
Adding another metric

Potentially, we can solve both problems by introducing a concept of optional metric attribute values to the metrics builder. The attribute values can be enabled or disabled in metadata.yaml (default) and user configs. If an attribute value is disabled, datapoints with that attribute value won't be emitted. With this functionality added to the metrics builder, we will have the following sets of
What do you think? @bogdandrutu @djaglowski please share your opinion on the additional feature to the metrics builder |
I think we should aim for a design that adheres to the specification's aggregation expectations (i.e. data points sum to a meaningful total). Supporting metrics that violate these expectations seems like the wrong approach to me. There seem to be two layers to the problem here:
I don't have a strong opinion on the first question, but in regards to the second, I think we basically have to decide whether to fix it, or to deprecate and replace it. The right approach depends on stability level. In my opinion, a per-metric stability level is necessary in order to have a clear approach (e.g. alpha -> beta -> stable -> deprecated). If this metric is less than stable, then we should fix it. If it is "stable" but realized to be fundamentally flawed in its design, we should deprecate it and add one or more new metrics that amount to a more accurate model.
That's true, we should always aim to have meaningful total values. In proposing optional attribute values, I imply that the sum of the attribute values enabled by default should represent the most reasonable total value. At the same time, I can think of use cases where a user wants to disable some attribute values, or add other optional ones, and build their own custom meaningful sums out of them. In this case, moving
I think it's okay to keep
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping the code owners.

See Adding Labels via Comments if you do not have permissions to add labels yourself.
I think this issue is still valid and worth addressing (@dmitryax), as we're clearly missing

And so I'd like to reopen the discussion, as to me it seems like adding

As for fixing the flawed sum, I like this approach:
@fatsheep9146 maybe I could take this one and try implementing those changes?
@lootek It's yours!

@fatsheep9146 That's great, thanks! I'll follow up soon :)
Hi folks. I'm interested in having

What is done

Mainly, discussions on two issues:

My impression is that these issues can be solved independently. Hence, I'll describe them in detail separately.

Issue A: sum over
I propose to walk the "short-term" path. I created open-telemetry/semantic-conventions#257 to add
open-telemetry/semantic-conventions#257 is done. Next step: I'll implement
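For context, once implemented, enabling the new metric would presumably follow the hostmetrics receiver's usual metadata-driven configuration, along these lines (a sketch based on how optional hostmetrics metrics are toggled; check the memory scraper's generated documentation for the exact key):

```yaml
receivers:
  hostmetrics:
    scrapers:
      memory:
        metrics:
          # Assumed metric name from open-telemetry/semantic-conventions#257;
          # optional metrics are disabled by default and toggled per metric.
          system.linux.memory.available:
            enabled: true
```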
…ric (#27247) Implement a new metric `system.linux.memory.available`, which I defined in open-telemetry/semantic-conventions#257. **Link to tracking Issue:** #7417 **Testing:** added a check to an existing unit test. I had to refactor the test a bit: it assumed that a certain metric will be at position 0, which is not true now. **Documentation:** Added note in `metadata.yaml` --------- Co-authored-by: Curtis Robert <92119472+crobert-1@users.noreply.github.com>
I implemented all steps for "Issue B" from #7417 (comment). So, answering the original question from this bug - "Can we get MemAvailable field from current states?": now yes. I propose to close this bug as done. Feel free to open a separate thread for "Issue A" from #7417 (comment). I see that open-telemetry/semantic-conventions#531 is still open, and IMO it's not critical.
The linked semconv issue doesn't address this bug - it doesn't fix the
I would keep this issue open too; the existing memory metrics won't be aligned with the semantic conventions until the SemConv progress update: open-telemetry/semantic-conventions#1078
/unstale
The current Linux memory states provided by the hostmetricsreceiver might not be the most appropriate ones for an end user. The available memory states for Linux are: [buffered, cached, free, slab_reclaimable, slab_unreclaimable, used]

All of those values are retrieved with gopsutil, which gets them from the /proc/meminfo file.

A user might think that the free value represents the amount of memory that the system can use, but free (MemFree) is a count of how many pages are free in the buddy allocator; however, there are many ways pages can be returned to the buddy allocator in time of need. The latter figure is provided by the Linux kernel (>3.16) as MemAvailable in /proc/meminfo.

Can we get the MemAvailable field from the current states?

No, it is an independent field that is not computed from the current values. gopsutil provides this value regardless of the kernel version, as it can be estimated from MemFree, Active(file), Inactive(file), and SReclaimable, as well as the "low" watermarks from /proc/zoneinfo.
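The estimation mentioned above can be sketched in Go. This is a simplified approximation of the kernel's si_mem_available() heuristic, which gopsutil mirrors on kernels that predate MemAvailable; the real logic differs in details, and the inputs below are made-up kB values, not figures from a real host:

```go
package main

import "fmt"

func min64(a, b int64) int64 {
	if a < b {
		return a
	}
	return b
}

// estimateAvailable is a simplified sketch of the MemAvailable heuristic.
// The kernel's actual si_mem_available() in mm/page_alloc.c is more
// involved (per-zone watermarks, CMA, etc.); this captures the shape:
// free memory above the low watermark, plus the freeable share of the
// page cache and reclaimable slab.
func estimateAvailable(memFree, activeFile, inactiveFile, sReclaimable, wmarkLow int64) int64 {
	available := memFree - wmarkLow

	// Roughly half of the file-backed page cache is considered freeable,
	// bounded by the low watermark.
	pageCache := activeFile + inactiveFile
	available += pageCache - min64(pageCache/2, wmarkLow)

	// The same discount is applied to reclaimable slab memory.
	available += sReclaimable - min64(sReclaimable/2, wmarkLow)

	if available < 0 {
		return 0
	}
	return available
}

func main() {
	// Hypothetical kB inputs: MemFree, Active(file), Inactive(file), SReclaimable, low watermark.
	fmt.Println(estimateAvailable(512000, 1024000, 2048000, 256000, 65536))
}
```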
Repercussion

The user will be able to know the actual available memory of the system.

The sum assumption is already broken, and it is quite difficult to satisfy, as some of the memory state values are inclusive, for example the cached state. The assumption would be defined as:

total = buffered + cached + free + used + slab_reclaimable + slab_unreclaimable

But the used value is computed by gopsutil as: total - free - buffered - cached

Then:

total = buffered + cached + free + (total - free - buffered - cached) + slab_reclaimable + slab_unreclaimable
total = total + slab_reclaimable + slab_unreclaimable

The sum over states will always be higher than the total.
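The arithmetic above can be checked with a small Go sketch (illustrative kB values, not from a real host):

```go
package main

import "fmt"

// brokenSum reproduces the arithmetic from this issue: "used" is derived
// as total - free - buffered - cached, so summing all six states
// double-counts slab memory. It returns the sum over states and its
// excess over the real total.
func brokenSum(total, free, buffered, cached, slabRecl, slabUnrecl int64) (sum, excess int64) {
	used := total - free - buffered - cached
	sum = free + buffered + cached + used + slabRecl + slabUnrecl
	return sum, sum - total
}

func main() {
	sum, excess := brokenSum(8192000, 512000, 256000, 2048000, 128000, 64000)
	fmt.Println(sum, excess) // excess equals slab_reclaimable + slab_unreclaimable
}
```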
(The following proposals do not solve the sum assumption, they only add more memory information for the user)
Proposal 1

Add a new memory state for Linux to provide the available memory. State name: available
Proposal 2

Change the used state to: total - available

This proposal requires adding the memory.limit metric, as we would no longer be able to compute the total memory from the states. Currently, we can get the total amount of memory with the following PromQL query:

scalar(sum without (state) (system_memory_usage{state =~ "free|used|cached|buffered"}))