Add cgroup support (#1197) #1211

tanquetav · 2020-06-01T18:17:39Z

What does this PR do?

Checklist

My code follows the style guidelines of this project
I have rebased my changes on top of the latest master branch
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have updated CHANGELOG.asciidoc

Author's Checklist

Related issues

Use cases

When a Java program runs in a cgroup-limited environment (Docker with the -m option, k8s with a resource limit), cgroup is used to get memory information (system.memory.total, system.memory.actual.free and system.process.memory.rss).

Screenshots

cla-checker-service · 2020-06-01T18:17:42Z

💚 CLA has been signed

eyalkoren · 2020-06-02T06:01:14Z

@tanquetav Thanks for the PR!!
Can you please sign the CLA?

eyalkoren

@tanquetav Thanks for this extremely useful PR! ❤️

A few suggestions:

1. Let's not assume the cgroup filesystem path is the default (/sys/fs/cgroup). I found that the memory cgroup subdir path can be quite easily parsed from the contents of /proc/self/mounts: lines are space-delimited, and when the first field is memory, the second field is the memory cgroup path (tested on latest Docker images of ubuntu, opensuse/leap and centos). If we fail to find the path this way, let's then fall back to trying the default. Of course this should be done during initialization.
2. What use cases does the cgroup2 route cover? If we apply the suggestion in (1), would that cover it?
3. If you move the check whether a limit is set (and whether we want to use the cgroup info) to the constructor, you can use a final boolean field, thus avoiding the AtomicBoolean accesses. In general, please try to move any initialization logic from bindTo to construction time.
4. system.process.memory.rss.bytes is a useful metric, but if we add it, let's add it all across, meaning at least with JMX as well. Also, if we do that, we had better document it in the metrics documentation.
5. What do you think about adopting the approach suggested in google/cadvisor#2042 ("node level memory state from root cgroup is different from /proc/meminfo") of calculating real used bytes based on memory.usage_in_bytes - total_inactive_file (the latter coming from stat)?

eyalkoren · 2020-06-02T09:25:22Z

An additional thought - google/cadvisor#2042 reports that the real used values are different when obtained through these different approaches. I witnessed the same.

What do you think about calculating the real used based on cgroup even if memory.limit_in_bytes is invalid (unlimited), relying on the host limit coming from /proc/meminfo -> MemTotal to determine the total instead?
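To make the fallback idea concrete, here is a minimal sketch of how the total could be chosen: the cgroup limit when one is set, otherwise MemTotal from /proc/meminfo. The class and method names are hypothetical and not taken from the PR; the sketch only assumes the usual "MemTotal: <n> kB" line shape.

```java
import java.util.List;

class MemTotalFallbackSketch {

    /** Returns MemTotal in bytes, or -1 if no MemTotal line is found. */
    static long parseMemTotalBytes(List<String> meminfoLines) {
        for (String line : meminfoLines) {
            if (line.startsWith("MemTotal:")) {
                // Typical shape: "MemTotal:       16384256 kB"
                String[] parts = line.trim().split("\\s+");
                return Long.parseLong(parts[1]) * 1024L;
            }
        }
        return -1L;
    }

    /** Uses the cgroup limit when one is set (non-null), otherwise the host total. */
    static long effectiveTotal(Long cgroupLimitBytes, long hostTotalBytes) {
        return cgroupLimitBytes != null ? cgroupLimitBytes : hostTotalBytes;
    }
}
```

With this shape, the usage value can always come from cgroup, while only the total switches between the two sources.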

apmmachine · 2020-06-02T13:21:11Z

💚 Build Succeeded

Build stats

Build Cause: [Pull request #1211 updated]
Start Time: 2020-08-18T14:54:15.423+0000
Duration: 48 min 46 sec

Test stats 🧪

Test	Results
Failed	0
Passed	1443
Skipped	11
Total	1454

tanquetav · 2020-06-02T13:22:55Z

For items 1-3, I moved the cgroup check/verification to the constructor. The code is much cleaner. It uses /proc/self/cgroup to verify the cgroup2 folder, whether it can be read and whether it is unlimited, and then tries to fall back to cgroup1.

For items 4-5, I think it is better not to do this now, maybe in another PR, because I am using cgroup and more than one process can be allocated to the same cgroup. Your suggestion to use JMX is better.

eyalkoren

@tanquetav Thanks for the changes.

I think you misunderstood what I meant to address. I am trying to find the actual cgroup filesystem path, before trying the default path (/sys/fs/cgroup). The way I suggested of doing that is by parsing the /proc/self/mounts file. The line that starts with memory should contain the path to the memory cgroup subdirectory as the second item. Let me know if you see a case where this does not apply.

The handling of the /proc/self/cgroup file contents is not going to work where I tested it. The path retrieved from it is not a valid file or directory path in those cases.
In addition, when is the 0: subsystem assigned? The outputs I observed did not assign to it. In which cases is the cgroup1 approach valid, and in which cases the cgroup2 one?

apm-agent-core/src/main/java/co/elastic/apm/agent/metrics/builtin/SystemMetrics.java

tanquetav · 2020-06-03T12:20:10Z

I am using Fedora 32, and at boot I can select whether cgroup1 or cgroup2 is used. If I pass systemd.unified_cgroup_hierarchy=0 as a kernel option, cgroup1 is used; without this kernel boot option, cgroup2 is used.

About /proc/self/cgroup: sometimes memory does not appear in it, only the 0: entry. So I check both lines (memory and 0:).

root@ml022-476:/# cat /proc/self/cgroup 
1:name=systemd:/
0::/system.slice/docker-1487ca74e6c3206b7d5a16e8a7a3064ef5d817d14b7d1c4794b61d357d7da2a2.scope

About /sys/fs/cgroup, I think there is a misunderstanding. On cgroup1, /sys/fs/cgroup has the memory files, and reading them gives the cgroup values.

On cgroup2 it is a little different. The memory limit files are available inside a subdirectory named after the slice:

[root@ml022-476 ~]# tree /sys/fs/cgroup/system.slice/docker-1487ca74e6c3206b7d5a16e8a7a3064ef5d817d14b7d1c4794b61d357d7da2a2.scope/
/sys/fs/cgroup/system.slice/docker-1487ca74e6c3206b7d5a16e8a7a3064ef5d817d14b7d1c4794b61d357d7da2a2.scope/
├── cgroup.controllers
├── cgroup.events
├── cgroup.freeze
├── cgroup.max.depth
├── cgroup.max.descendants
├── cgroup.procs
├── cgroup.stat
├── cgroup.subtree_control
├── cgroup.threads
├── cgroup.type
├── cpu.max
├── cpu.pressure
├── cpuset.cpus
├── cpuset.cpus.effective
├── cpuset.cpus.partition
├── cpuset.mems
├── cpuset.mems.effective
├── cpu.stat
├── cpu.weight
├── cpu.weight.nice
├── hugetlb.1GB.current
├── hugetlb.1GB.events
├── hugetlb.1GB.events.local
├── hugetlb.1GB.max
├── hugetlb.2MB.current
├── hugetlb.2MB.events
├── hugetlb.2MB.events.local
├── hugetlb.2MB.max
├── io.bfq.weight
├── io.latency
├── io.max
├── io.pressure
├── io.stat
├── io.weight
├── memory.current
├── memory.events
├── memory.events.local
├── memory.high
├── memory.low
├── memory.max
├── memory.min
├── memory.oom.group
├── memory.pressure
├── memory.stat
├── memory.swap.current
├── memory.swap.events
├── memory.swap.max
├── pids.current
├── pids.events
└── pids.max

I solved the other issues.

eyalkoren · 2020-06-04T04:17:35Z

@tanquetav Thanks for your input. What I am trying to suggest is not to assume that the cgroup filesystem is located under /sys/fs/cgroup. IIUC, while it is the default and most widely used, it is not a rigid specification, and a container runtime can choose a different path for it.

In any case, we will have to delay this PR a bit, as we may want to introduce new metricsets especially for containers and keep the current system. metricsets for reporting host metrics. This is a cross APM thing (including other agents, APM server and UI), and we are also aligning it with Metricbeat, so it may take a bit.
Once it is finalized, I will inform you and then we can resume and merge this PR.

tanquetav · 2020-06-04T13:47:53Z

@eyalkoren I understand your concerns about /sys/fs/cgroup. RHEL 6 seems to mount cgroup on /cgroup instead. I added logic to figure out the mount point using /proc/self/mountinfo, with a fallback to /sys/fs/cgroup.

It searches for 2 patterns:

39 30 0:35 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:10 - cgroup cgroup rw,seclabel,memory

for cgroup1 and

30 23 0:26 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime shared:4 - cgroup2 cgroup2 rw,seclabel

for cgroup2.
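As an illustration, the two mountinfo lines above can be matched with the regular expressions that are quoted later in this discussion. This is only a sketch: the class and method names are hypothetical, and the patterns in the merged code may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MountInfoPatternSketch {

    // Patterns as quoted in this discussion; the merged code may use different ones.
    static final Pattern CGROUP1_MOUNT_POINT =
        Pattern.compile("^\\d+? \\d+? .+? .+? (.*?) .*cgroup.*memory.*");
    static final Pattern CGROUP2_MOUNT_POINT =
        Pattern.compile("^\\d+? \\d+? .+? .+? (.*?) .*cgroup2.*cgroup.*");

    /** Returns the mount point (fifth field) if the whole line matches, otherwise null. */
    static String mountPoint(Pattern pattern, String mountInfoLine) {
        Matcher matcher = pattern.matcher(mountInfoLine);
        return matcher.matches() ? matcher.group(1) : null;
    }
}
```

Applying the cgroup2 pattern first and the cgroup1 pattern second keeps the two similar expressions from shadowing each other, and a single helper serves both.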

I only tested on Fedora; can you check on SUSE, Ubuntu and others?

I started a similar item on the Python agent. I will apply your suggestions to this patch and try to do the same on the .NET agent.

eyalkoren · 2020-06-04T14:06:50Z

@tanquetav This is awesome, thanks a lot! I will try it out, but it will take a bit, so just be informed.

Thanks for looking into /proc/self/mountinfo. Did you try my suggestion to look in /proc/self/mounts? I found it easier to parse, without the need to look for specific patterns:

cpuset /sys/fs/cgroup/cpuset cgroup ro,nosuid,nodev,noexec,relatime,cpuset 0 0
cpu /sys/fs/cgroup/cpu cgroup ro,nosuid,nodev,noexec,relatime,cpu 0 0
cpuacct /sys/fs/cgroup/cpuacct cgroup ro,nosuid,nodev,noexec,relatime,cpuacct 0 0
blkio /sys/fs/cgroup/blkio cgroup ro,nosuid,nodev,noexec,relatime,blkio 0 0
memory /sys/fs/cgroup/memory cgroup ro,nosuid,nodev,noexec,relatime,memory 0 0
devices /sys/fs/cgroup/devices cgroup ro,nosuid,nodev,noexec,relatime,devices 0 0
freezer /sys/fs/cgroup/freezer cgroup ro,nosuid,nodev,noexec,relatime,freezer 0 0
net_cls /sys/fs/cgroup/net_cls cgroup ro,nosuid,nodev,noexec,relatime,net_cls 0 0

Instead, split each line using spaces, and if the first part is memory, the second would be the path.
Looks valid with fedora:32 as well. Do you think it is not a good choice?
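The split-based approach could look roughly like the sketch below. The names are hypothetical, and it assumes the first field of each line names the controller, as in the sample output above; on other systems that field may differ, which is why a caller-supplied fallback path is used.

```java
import java.util.List;

class ProcMountsSketch {

    /**
     * Returns the second field of the line whose first field is "memory",
     * or the given fallback path when no such line exists.
     */
    static String findMemoryCgroupPath(List<String> mountsLines, String fallbackPath) {
        for (String line : mountsLines) {
            String[] parts = line.split(" ");
            if (parts.length >= 2 && "memory".equals(parts[0])) {
                return parts[1];
            }
        }
        return fallbackPath;
    }
}
```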

tanquetav · 2020-06-04T14:31:45Z

Reading https://man7.org/linux/man-pages//man5/procfs.5.html, there are some concerns that mounts is missing some information, while mountinfo is more complete and respects the namespaces that are related to cgroups. It has been available since kernel 2.6. As for the parser, it will not solve the problem (finding the memory line), because on cgroup2 that line is not available, only the cgroup mount itself.

But if you think it is better to use mounts, I can change the code without any problem.

eyalkoren · 2020-06-15T03:00:34Z

@tanquetav Sorry for the delay in response.
Let's stick with the mountinfo. Thanks for all your efforts with that!!
We still need to finalize what would be the exact metricset keys we want to use, and then I will do a full review.
In the mean time, please sign the CLA, otherwise we won't be able to merge.

tanquetav · 2020-06-15T21:51:50Z

I have filled in the CLA several times, and I have filled it in again now. Can you check whether it is OK, or send me instructions on how to handle this?

Thank you.

eyalkoren · 2020-06-16T07:33:05Z

I filled the CLA several times.

Ohh, sorry about that. Did you use the same email as you use in your GitHub account?

tanquetav · 2020-06-16T10:41:00Z

yes

felixbarny · 2020-06-16T11:14:48Z

You have signed the CLA with your Gmail address. The commits are signed with your softplan address. Make sure that the latter is also added to your GitHub profile. It doesn't have to be the primary one.

eyalkoren · 2020-07-01T07:01:40Z

@tanquetav sorry for the long wait, we had to outline the new set of metrics - see elastic/apm#291 for details.

Following steps:

1. We need to adjust this PR to report the new metricsets based on the specifications in the issue linked above. Basically, you did most of the work, but some adjustments are required, like changing the metric keys, adding the inactive_file metric from the memory.stat file and sending the right value when we see the unlimited value.
2. Please sign the CLA with your softplan email address.
3. CGROUP1_MOUNT_POINT doesn't work for me but ^\\d+? \\d+? .+? .+? (.*?) .*cgroup.*memory.* does. Do you see a problem changing to that?
4. CGROUP2_MOUNT_POINT doesn't seem memory specific. Is this discovering the memory mount path? What are you using to test cgroup-v2?
5. Please extract each piece of parsing logic into a separate package-private method so we can unit test those whenever we find more patterns.

Let us know if this is too much and you want us to take over.

eyalkoren · 2020-07-05T07:50:13Z

@tanquetav would you like us to take over this one, or are you planning to continue with it?
Your contribution has already provided a lot of value, regardless of what you decide, just let us know.

tanquetav · 2020-07-07T22:13:44Z

Yes, but unfortunately I can only fix this on the weekend. I am having a busy week.

eyalkoren · 2020-07-08T07:23:23Z

No rush, take the time you need and let us know if you need assistance.
Thanks again for the great contribution!

tanquetav · 2020-07-11T15:38:57Z

Hello @eyalkoren

I worked on the PR this weekend, following your suggestions:

I changed the variable names as suggested, and the value is ignored if unlimited is set (per your last comment on apm#291, "Reporting and showing cgroup-based metrics").
It's done.
I changed it as you suggested. My fear was messing up the cgroup2 regex, since the two are very similar, but the cgroup2 test is done before the cgroup1 one. It works in my tests.
Yes, this entry has nothing memory-specific. Take a look inside the container:

root@ml022-476:/data# cat /proc/self/mountinfo |grep cgroup
1468 1467 0:27 / /sys/fs/cgroup ro,nosuid,nodev,noexec,relatime - cgroup2 cgroup rw,seclabel

What I did is use this regex to force a cgroup2 match:

^\d+? \d+? .+? .+? (.*?) .*cgroup2.*cgroup.*

Then I concatenate it with the slice:

root@ml022-476:/data# cat /proc/self/cgroup
1:name=systemd:/
0::/system.slice/docker-5af29c808916bd2f96ee1902ae140968e2f969d241612a67453c59e11ff9cc0c.scope

root@ml022-476:/data# ls -l /sys/fs/cgroup/system.slice/docker-5af29c808916bd2f96ee1902ae140968e2f969d241612a67453c59e11ff9cc0c.scope/
total 0
-r--r--r--. 1 root root 0 Jul 11 15:15 cgroup.controllers
-r--r--r--. 1 root root 0 Jul 11 15:15 cgroup.events
...
-rw-r--r--. 1 root root 0 Jul 11 15:15 memory.max
-rw-r--r--. 1 root root 0 Jul 11 15:15 memory.min
-rw-r--r--. 1 root root 0 Jul 11 15:15 memory.oom.group
-rw-r--r--. 1 root root 0 Jul 11 15:22 memory.pressure
-r--r--r--. 1 root root 0 Jul 11 15:19 memory.stat
...
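The concatenation step can be sketched as follows. The names are hypothetical; the sketch assumes the /proc/self/cgroup line has the "hierarchy-ID:controller-list:path" shape shown above and uses plain string concatenation, so it is meant for Linux-style paths only.

```java
class Cgroup2PathSketch {

    /**
     * Joins the cgroup2 mount point with the slice path taken from a
     * /proc/self/cgroup line such as "0::/system.slice/docker-<id>.scope".
     * Returns null if the line does not have three ':'-separated fields.
     */
    static String cgroup2Dir(String mountPoint, String procSelfCgroupLine) {
        // A limit of 3 keeps any ':' inside the path itself intact
        String[] parts = procSelfCgroupLine.split(":", 3);
        if (parts.length < 3) {
            return null;
        }
        String slice = parts[2];
        // The root cgroup ("/") maps to the mount point itself
        return "/".equals(slice) ? mountPoint : mountPoint + slice;
    }

    /** Path of the memory.max file inside the resolved cgroup2 directory. */
    static String memoryMaxFile(String cgroup2Dir) {
        return cgroup2Dir + "/memory.max";
    }
}
```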

Yes, I extracted them and created a parameterized test for that. Let's populate it with more samples.

eyalkoren

@tanquetav Thanks for applying the changes!!
I will make a pull request tomorrow proposing some changes, would be easier to communicate through code directly 🙂

eyalkoren · 2020-07-13T06:45:40Z

apm-agent-core/src/main/java/co/elastic/apm/agent/metrics/builtin/SystemMetrics.java

    private final OperatingSystemMXBean operatingSystemBean;

+    final private List<WildcardMatcher> inactiveMemoryRelevantLines = Arrays.asList(caseSensitiveMatcher("inactive_file *"));


No need for a list if we need a single matcher, but in any case, I suggest not using a matcher at all but an exact match - see comment where this is used

eyalkoren · 2020-07-13T08:26:31Z

apm-agent-core/src/main/java/co/elastic/apm/agent/metrics/builtin/SystemMetrics.java

+    }
+
+    public CgroupFiles verifyCgroupEnabled(File procSelfCgroup, File mountInfo) {
+        if (procSelfCgroup.canRead() && mountInfo.canRead()) {


mountInfo is not a must - we have a default way to find the memory cgroup path.

eyalkoren · 2020-07-13T12:51:14Z

apm-agent-core/src/main/java/co/elastic/apm/agent/metrics/builtin/SystemMetrics.java

+                    if (cgroupFilesTest != null) return cgroupFilesTest;
+
+                }
+            } catch (Exception e) {


Log an error

eyalkoren · 2020-07-13T12:54:01Z

apm-agent-core/src/main/java/co/elastic/apm/agent/metrics/builtin/SystemMetrics.java

+                    try(BufferedReader fileMountInfoReader = new BufferedReader(new FileReader(mountInfo))) {
+                        for (String mountLine = fileMountInfoReader.readLine(); mountLine != null && !mountLine.isEmpty(); mountLine = fileMountInfoReader.readLine()) {
+                            String foundRegex = applyCgroup2Regex(mountLine);
+                            if (foundRegex != null) {
+                                cgroupFilesTest = verifyCgroup2Available(lineCgroup, new File(foundRegex));
+                                if (cgroupFilesTest != null) return cgroupFilesTest;
+                            }
+                            foundRegex = applyCgroup1Regex(mountLine);
+                            if (foundRegex != null) {
+                                cgroupFilesTest = verifyCgroup1Available(new File(foundRegex));
+                                if (cgroupFilesTest != null) return cgroupFilesTest;
+                            }
+                        }
+                    }


Only execute this if the mountInfo file exists and is readable. Also add a catch clause for any related exception of this try block, because we still have a default to fall back to if any of this fails.

eyalkoren · 2020-07-13T13:35:56Z

apm-agent-core/src/main/java/co/elastic/apm/agent/metrics/builtin/SystemMetrics.java

+        if (procSelfCgroup.canRead() && mountInfo.canRead()) {
+            try(BufferedReader fileReader = new BufferedReader(new FileReader(procSelfCgroup))) {
+                String lineCgroup = null;
+                for (String cgroupLine = fileReader.readLine(); cgroupLine != null && !cgroupLine.isEmpty(); cgroupLine = fileReader.readLine()) {


Why is an empty line used as a stop condition for the parsing? Isn't it enough to stop when readLine() produces null?

In fact, isn't it enough to do:

String cgroupLine = fileReader.readLine();
while (cgroupLine != null) {
    ...
    cgroupLine = fileReader.readLine();
}

I used the same logic that was previously used for metricRegistry.addUnlessNan("system.memory.total", ....
Since it was previously done this way, I believe it is safe to parse these system files the same way.

eyalkoren · 2020-07-13T14:00:14Z

apm-agent-core/src/main/java/co/elastic/apm/agent/metrics/builtin/SystemMetrics.java

+    String applyCgroup1Regex(String mountLine) {
+        Matcher matcher = CGROUP1_MOUNT_POINT.matcher(mountLine);
+        if (matcher.matches()) {
+            return matcher.group(1);
+        }
+        return null;
+    }
+    String applyCgroup2Regex(String mountLine) {
+        Matcher matcher = CGROUP2_MOUNT_POINT.matcher(mountLine);
+        if (matcher.matches()) {
+            return matcher.group(1);
+        }
+        return null;
+    }


Same code - should be extracted into one method

eyalkoren · 2020-07-13T14:00:50Z

apm-agent-core/src/main/java/co/elastic/apm/agent/metrics/builtin/SystemMetrics.java

+    private CgroupFiles verifyCgroup2Available(String lineCgroup, File mountDiscovered) throws IOException {
+        final String[] cgroupSplit = StringUtils.split(lineCgroup, ':');
+        // Checking cgroup2
+        File maxMemory = new File(mountDiscovered, cgroupSplit[cgroupSplit.length - 1] + "/" + CGROUP2_MAX_MEMORY);
+        if (maxMemory.canRead()) {
+            try(BufferedReader fileReaderMem = new BufferedReader(new FileReader(maxMemory))) {
+                String memMaxLine = fileReaderMem.readLine();
+                if (!"max".equalsIgnoreCase(memMaxLine)) {
+                    return new CgroupFiles(maxMemory,
+                        new File(mountDiscovered, cgroupSplit[cgroupSplit.length - 1] + "/" + CGROUP2_USED_MEMORY),
+                        new File(mountDiscovered, cgroupSplit[cgroupSplit.length - 1] + "/" + CGROUP2_STAT_MEMORY));
+                }
+            }
+        }
+        return null;
+    }
+
+    private CgroupFiles verifyCgroup1Available(File mountDiscovered) throws IOException {
+        // Checking cgroup1
+        File maxMemory = new File(mountDiscovered, CGROUP1_MAX_MEMORY);
+        if (maxMemory.canRead()) {
+            try(BufferedReader fileReaderMem = new BufferedReader(new FileReader(maxMemory))) {
+                String memMaxLine = fileReaderMem.readLine();
+                long memMax = Long.parseLong(memMaxLine);
+                if (memMax < UNLIMITED) { // Cgroup1 use a contant to disabled limits
+                    return new CgroupFiles(maxMemory,
+                        new File(mountDiscovered, CGROUP1_USED_MEMORY),
+                        new File(mountDiscovered, CGROUP1_STAT_MEMORY));
+                }
+            }
+        }
+        return null;
    }


Very similar. Please extract common code

eyalkoren · 2020-07-13T14:04:32Z

apm-agent-core/src/main/java/co/elastic/apm/agent/metrics/builtin/SystemMetrics.java

+        try(BufferedReader fileReaderStatFile = new BufferedReader(new FileReader(cgroupFiles.getStatMemory()))) {
+            long sum = 0;
+            for (String statLine = fileReaderStatFile.readLine(); statLine != null && !statLine.isEmpty(); statLine = fileReaderStatFile.readLine()) {
+                if (WildcardMatcher.isAnyMatch(inactiveMemoryRelevantLines, statLine)) {


Let's not use a wildcard matcher here; search for the exact line instead. Split each line and look at the first part for both inactive_file and total_inactive_file, preferring total_inactive_file if it exists.
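A minimal sketch of the exact-match parsing described in this comment, assuming each memory.stat line has the "key value" shape. The class and method names are hypothetical, not the ones used in the PR.

```java
import java.util.List;

class InactiveFileSketch {

    /**
     * Returns the total_inactive_file value if present, otherwise the
     * inactive_file value, or null when neither key appears.
     */
    static Long inactiveFileBytes(List<String> statLines) {
        Long inactiveFile = null;
        for (String line : statLines) {
            String[] parts = line.split(" ");
            if (parts.length != 2) {
                continue; // not a "key value" line
            }
            if ("total_inactive_file".equals(parts[0])) {
                return Long.parseLong(parts[1]); // preferred key, stop here
            }
            if ("inactive_file".equals(parts[0])) {
                inactiveFile = Long.parseLong(parts[1]); // keep as fallback
            }
        }
        return inactiveFile;
    }
}
```

Because the key is compared with equals, a line starting with total_inactive_file can never be mistaken for inactive_file, whatever the order of the lines.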

eyalkoren · 2020-07-13T14:14:10Z

apm-agent-core/src/main/java/co/elastic/apm/agent/metrics/builtin/SystemMetrics.java

+            try(BufferedReader fileReaderMem = new BufferedReader(new FileReader(maxMemory))) {
+                String memMaxLine = fileReaderMem.readLine();
+                long memMax = Long.parseLong(memMaxLine);
+                if (memMax < UNLIMITED) { // Cgroup1 use a contant to disabled limits


No need to look at unlimited

I read about tracking cgroup items even when the cgroup is not limited.
My concern is that in normal use cases, such as running directly on a Linux host, these metrics will be collected and will track host-wide values. Memory usage, for example, will be the memory usage of the whole OS, because nothing is limited in the cgroup.

eyalkoren · 2020-07-13T14:21:40Z

apm-agent-core/src/main/java/co/elastic/apm/agent/metrics/builtin/SystemMetrics.java

+        if (maxMemory.canRead()) {
+            try(BufferedReader fileReaderMem = new BufferedReader(new FileReader(maxMemory))) {
+                String memMaxLine = fileReaderMem.readLine();
+                if (!"max".equalsIgnoreCase(memMaxLine)) {


Ohh, so in cgroup2 max represents unlimited memory for the cgroup?
This is important! We just decided to move the special treatment of unlimited max to the UI, but that relies on sending the special numeric value representing unlimited. We can't send a string. We could send the unlimited value, but it makes more sense not to send anything when we see max and to make sure the UI deals with that properly.

Yes, on cgroup2 it is a string. What about using the UNLIMITED constant from cgroup1?

Yes, that's what I poorly tried to say in my comment - we could do that, but it doesn't make a lot of sense...
Letting the UI even be aware of the UNLIMITED numeric value was a way to implement once, instead of making all APM agents aware of that. Now that I am aware of the literal UNLIMITED value in v2, we must handle in agents, as the metric value in the intake API and Elasticsearch would be numeric.
I think that making sure agents omit the value when it is max and making sure UI expects that is enough.

tanquetav · 2020-07-16T02:04:56Z

@tanquetav Please take a look at my PR.
What I really wanted to eventually do is extract all cgroup metrics code from SystemMetrics to a new CgroupMetrics class (and do the same with a new CgroupMetricsTest class) but I would then unrightfully get the credit for all the great work you did. So please, once you approve the PR does what it should, please merge to your branch and then do that.
In addition, we need to test any logic we have code for. A few examples:

verify that we do not collect system.process.cgroup.memory.mem.limit.bytes when using cgroup v2 and the limit value is max

if we allow both :memory and 0: in the cgroup file, we need to verify we get what we expect when either exists and when both exist and present in a different order within the file.

if we allow both inactive_file and total_inactive_file in the memory.stat file - we need to make sure we handle both and give priority to total_inactive_file.

Again, if you feel this requires more than you expect to put into it, let me know and I will take over.
Thanks!

OK, I will take some more time to split this into a new file. I am fixing the issues you listed before.

OK, I implemented all of these. About system.process.cgroup.memory.mem.limit.bytes: it is always collected, except on cgroup2 when the value is max.
About :memory and 0: lines: it prefers the memory line; if that is not available, the 0: line is selected.

total_inactive_file is preferred (using your implementation).

All the logic is now in a new file (CGroupMetric), and I reverted SystemMetrics to its original state.

The only thing, which I mentioned earlier, that still seems a bit strange to me is collecting cgroup data always. When running without Docker, using java -jar app.jar, cgroup data is collected, and the cgroup slice is shared with all applications on my machine. It seems a bit weird to me, but it is working as expected.

eyalkoren · 2020-07-16T05:19:46Z

One more question: I merged your suggestions, and I found that you are not subtracting inactive_file from used memory.
The suggestion in elastic/apm#291 is to subtract it.

elastic/apm#291 suggests that agents send the inactive_file value as a metric and the UI subtract it.

eyalkoren · 2020-07-16T05:31:45Z

About :memory and 0: lines: it prefers the memory line; if that is not available, the 0: line is selected.

That is why we need to test that we do what we want to be doing - select :memory lines if they coexist with 0: lines, both if they are written higher in the file or lower.
The general idea is - anything we wrote code for we should test. Not only to see that it works now, but as a way to guarantee we do not introduce regression when updating the code in the future.
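The selection rule under discussion can be sketched like this, in a form that is easy to unit test with lines in either order. The names are hypothetical, and the sketch assumes the "hierarchy-ID:controller-list:path" line shape of /proc/self/cgroup.

```java
import java.util.List;

class CgroupLineSelectionSketch {

    /**
     * Returns the line whose controller list contains "memory"; if there is
     * none, returns the first "0::" (unified hierarchy) line, else null.
     */
    static String selectCgroupLine(List<String> procSelfCgroupLines) {
        String unifiedLine = null;
        for (String line : procSelfCgroupLines) {
            String[] parts = line.split(":", 3);
            if (parts.length < 3) {
                continue; // malformed line
            }
            for (String controller : parts[1].split(",")) {
                if ("memory".equals(controller)) {
                    return line; // a memory line always wins, wherever it appears
                }
            }
            if (unifiedLine == null && "0".equals(parts[0]) && parts[1].isEmpty()) {
                unifiedLine = line; // remember as fallback, keep scanning
            }
        }
        return unifiedLine;
    }
}
```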

total_inactive_file is preferred (using your implementation)

Same - we wrote the logic, we need to test it is working.

The only thing, which I mentioned earlier, that still seems a bit strange to me is collecting cgroup data always. When running without Docker, using java -jar app.jar, cgroup data is collected, and the cgroup slice is shared with all applications on my machine. It seems a bit weird to me, but it is working as expected.

That's a very important point I think I am missing. Wouldn't cgroup filesystems with all mount info exist only if something is mounting them (like an implementation of Linux containers)? In other words, will these be collected for any Java process running on Linux even if the process is not running within a control group?

eyalkoren · 2020-07-16T11:02:12Z

@tanquetav the main disadvantage I see with sending cgroup data only when not UNLIMITED is that we will not send usage metrics as well for containers running without memory limitation. This means that memory usage for containerized apps will be reported based on meminfo, which I believe is inferior to using cgroup in such cases.
IIUC, the current implementation may send redundant metrics (not needed when not containerized), but those will always be correct, is that indeed the case? If so, we can start with sending them.
Please let me know if you are aware of somewhere this will provide the wrong metrics or if you see a better option.

Also, please list all systems you used to manually test this PR, so we have a basic list of known supported ones. Thanks!

tanquetav · 2020-07-16T11:11:33Z

I increased the test coverage to cover these cases.

About cgroup: it is a hierarchical structure. On modern Linux distributions with it enabled, the whole system is cgroup-aware. What normally happens is that all processes are bound to the root cgroup namespace and share it. Some modern Linux distributions create slices, such as a user slice and a system slice, so that user processes do not use the same slice as system daemons.

Cgroups can be limited by more than just Docker. systemd also allows putting a process in a limited environment. When you create subslices of a slice, you are subdividing that slice. It makes more sense with relative values, for example:

system slice - 400 cpushare
user slice   - 200 cpushare ----- proc1 slice - 200 cpushare
                                  proc2 slice - 100 cpushare
                                  proc3 slice - 100 cpushare

With this example, 1/3 of the CPU can be allocated to the user slice. Of this 1/3, half can be allocated to the proc1 slice (1/6 of the total) and a quarter each to the proc2 and proc3 slices (1/12 each). Each slice can contain several processes, and the user slice sums all its subslices, for example for memory stats.
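The arithmetic of this example can be checked with a small sketch. The helper name is hypothetical, and the sketch assumes shares act as relative weights among sibling slices when all of them compete for CPU.

```java
class CpuShareSketch {

    /**
     * Fraction of the whole CPU a slice can get: its share divided by the
     * sum of its own and its siblings' shares, scaled by the parent's fraction.
     */
    static double fraction(double parentFraction, int ownShares, int... siblingShares) {
        int total = ownShares;
        for (int shares : siblingShares) {
            total += shares;
        }
        return parentFraction * ownShares / total;
    }

    public static void main(String[] args) {
        double user = fraction(1.0, 200, 400);        // user slice vs. system slice
        double proc1 = fraction(user, 200, 100, 100); // proc1 vs. proc2 and proc3
        double proc2 = fraction(user, 100, 200, 100); // proc2 vs. proc1 and proc3
        System.out.println(user + " " + proc1 + " " + proc2);
    }
}
```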

eyalkoren · 2020-07-16T12:12:09Z

@tanquetav thanks for the great explanation!
So it seems that basing the limit metric on cgroup would always be the better option as it specifies the actual limit that the process is restricted to. Is that correct?
I am not sure this is the case for the usage metric though - based on the current code of this PR, what would we see in memory.usage? Would that be sum of usages of all processes in the slice?

eyalkoren · 2020-07-16T13:28:34Z

Please merge with latest master so we can run the tests. CI was upgraded to run the build with Java 11, that requires this change.

tanquetav · 2020-07-16T14:47:58Z

Yes, memory.usage will be the sum over all processes in the slice. Using the limit to enable or disable the cgroup metrics helps avoid collecting metrics from cgroups that are not split per service.

In Docker/k8s environments, memory/CPU limits are generally used. With systemd they are optional, and we cannot be sure whether the process is confined or not.

Summing all the memory of a correctly confined slice is not a bad idea, for example when a Python process forks a Celery daemon. That is unusual in Java, but it may be helpful for other technologies.
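To illustrate why the usage value covers the whole slice: the kernel exposes one aggregated number per cgroup, in `memory.usage_in_bytes` for cgroup v1 and `memory.current` for cgroup v2. The following is a minimal sketch of reading it, not the code of this PR; the class name, method name, and the idea of passing the cgroup directory as a parameter are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only: not part of the agent code.
class CgroupUsageSketch {

    static final String V1_USAGE_FILE = "memory.usage_in_bytes"; // cgroup v1
    static final String V2_USAGE_FILE = "memory.current";        // cgroup v2

    /**
     * Returns the memory usage of the cgroup rooted at cgroupDir, i.e. the
     * total for all processes in it, or null if no usage file can be read.
     */
    static Long readUsageBytes(Path cgroupDir) {
        for (String name : new String[] {V1_USAGE_FILE, V2_USAGE_FILE}) {
            Path file = cgroupDir.resolve(name);
            if (!Files.isReadable(file)) {
                continue;
            }
            try {
                String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
                return Long.parseLong(text);
            } catch (IOException | NumberFormatException e) {
                return null; // unreadable or malformed content: report nothing
            }
        }
        return null;
    }
}
```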

eyalkoren

@tanquetav Sorry for the delayed response again 🙏

I completely agree with your last comment - our "System memory usage" graph aims to show how close the "system" (not the process) is to exhausting its memory resources.

So whenever a process is running within a control group of any kind that has memory limits, the stress of the system should be reflected by the percentage of memory that is potentially available to the process.
In other words - always using cgroup-based limit and usage seems like a valid approach for this purpose.

I think we are very close to concluding this great effort; see my single comment for a small required change.

Also, please build and test manually on the environments you set up (v1 and v2) and let us know where it has been tested.

Thanks again for your amazing contribution!!

eyalkoren · 2020-07-27T11:37:56Z

apm-agent-core/src/main/java/co/elastic/apm/agent/metrics/builtin/CGroupMetrics.java

+    private CgroupFiles createCgroup1Files(File memoryMountPath) {
+        File maxMemoryFile = new File(memoryMountPath, CGroupMetrics.CGROUP1_MAX_MEMORY);
+        if (maxMemoryFile.canRead()) {
+            // No need for special treatment for the special "unlimited" value (0x7ffffffffffff000) - omitted by the UI


We revisited this, so we need to restore the unlimited check for v1 as well. As we did for v2, if the value stored in the memory.limit_in_bytes file is the special UNLIMITED value, we shouldn't report it (meaning: use null for the maxMemoryFile).
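As a rough sketch of such a check (not the code that was merged), the helper below treats the v1 value quoted above (0x7ffffffffffff000) and the cgroup v2 literal `max` as "no limit". The class and method names are hypothetical, and it returns null for the parsed value rather than nulling out the file reference as the comment asks.

```java
// Hypothetical helper, for illustration only: not part of the agent code.
class CgroupLimitSketch {

    /** Value found in the v1 memory.limit_in_bytes file when no limit is set. */
    static final long V1_UNLIMITED = 0x7ffffffffffff000L;

    /** Literal found in the v2 memory.max file when no limit is set. */
    static final String V2_UNLIMITED = "max";

    /**
     * Parses the content of a cgroup memory limit file.
     * Returns null when the cgroup is unlimited or the content is malformed,
     * so that the caller can skip reporting the limit metric.
     */
    static Long parseLimit(String fileContent) {
        String value = fileContent.trim();
        if (V2_UNLIMITED.equals(value)) {
            return null;
        }
        try {
            long limit = Long.parseLong(value);
            return limit == V1_UNLIMITED ? null : limit;
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

With this shape, a container started with `-m 500M` would yield 524288000, while an unlimited v1 or v2 cgroup would yield null and the limit metric would simply be omitted.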

eyalkoren · 2020-07-27T12:06:53Z

Ohh, and we also need to add to documentation (metrics.asciidoc) and the CHANGELOG (better merge latest master before you do)

tanquetav · 2020-07-30T21:40:22Z

> Ohh, and we also need to add to documentation (metrics.asciidoc) and the CHANGELOG (better merge latest master before you do)

Hi.

I made the requested changes: the unlimited-memory check and the documentation. Can you check whether the text is OK (you know I am not a native speaker) and whether I need to fix anything?

I always test my code in an environment before pushing. I created a new Elastic APM setup from scratch (using OpenStack and some Ansible scripts I wrote) so that it would not conflict with other data. Then I ran 4 tests on my desktop (Fedora 32), with Docker using my build of the APM agent.

Booting my Fedora in cgroup1 mode (using systemd.unified_cgroup_hierarchy=0):
- Test 1: run Docker, with everything Elastic configured and the new APM build, with the flag -m 500M.
  Then I check in Kibana, in "Discover", that the metric is collected and the three metrics are filled (and system.process.cgroup.memory.mem.limit.bytes is 500m).
- Test 2: run Docker, with everything Elastic configured and the new APM build, without the flag -m 500M.
  Then I check in Kibana, in "Discover", that the metric is collected and the two other metrics are filled (and system.process.cgroup.memory.mem.limit.bytes is not filled).

Booting my Fedora in cgroup2 mode (removing systemd.unified_cgroup_hierarchy=0):
- Test 3: run Docker, with everything Elastic configured and the new APM build, with the flag -m 800M.
  Then I check in Kibana, in "Discover", that the metric is collected and the three metrics are filled (and system.process.cgroup.memory.mem.limit.bytes is 800m).
- Test 4: run Docker, with everything Elastic configured and the new APM build, without the flag -m 800M.
  Then I check in Kibana, in "Discover", that the metric is collected and the two other metrics are filled (and system.process.cgroup.memory.mem.limit.bytes is not filled).

This work was very nice. I learned a lot about Elastic APM and I hope you enjoy this little help.

tanquetav · 2020-07-31T14:22:27Z

@eyalkoren Can you send me a private email to chat? I want to discuss other ideas not related to this issue.

eyalkoren · 2020-08-03T03:43:31Z

> @eyalkoren Can you send me a private email to chat? I want to discuss other ideas not related to this issue.

Gladly. Sent to your GH public email.

I'll run the tests and review the final commits (apply changes myself if needed). Thanks!

> This work was very nice, I learn a lot about elastic apm and I hope you enjoy this little help

I certainly did enjoy collaborating with you and learned a lot myself about these OS corners.

eyalkoren

🎉 Incredible job @tanquetav !!

subeenn · 2021-03-05T05:44:22Z

@SylvainJuge Could you please let us know in which version this fix is available?

SylvainJuge · 2021-03-05T08:33:23Z

It was merged in August 2020, thus it's included in the first release after that time, which is 1.18.0.

In the general case, using the latest version available is better (and compatible with previous server versions), thus you should use 1.21.0 (latest release as of today).

Add cgroup support (elastic#1197)

b562f0e

eyalkoren reviewed Jun 2, 2020

View reviewed changes

Fix Suggestion

0e16ff0

eyalkoren reviewed Jun 3, 2020

View reviewed changes

More fixes

e0daa47

cgroup mountpoint discover

1b58678

gregkalapos mentioned this pull request Jun 15, 2020

add ability to collect memory from cgroup elastic/apm-agent-dotnet#862

Closed

eyalkoren mentioned this pull request Jul 1, 2020

Reporting and showing cgroup-based metrics elastic/apm#291

Closed

Fix variable names

cbbcb7c

eyalkoren reviewed Jul 13, 2020

View reviewed changes

tanquetav force-pushed the cgroup_support branch from d1315e8 to 69572c0 Compare July 16, 2020 00:08

Split CGroupFile

44d0561

tanquetav force-pushed the cgroup_support branch from 69572c0 to 44d0561 Compare July 16, 2020 01:36

Increase test coverage

e343b2d

Merge branch 'master' into cgroup_support

20d0bf4

eyalkoren reviewed Jul 27, 2020

View reviewed changes

tanquetav added 2 commits July 30, 2020 17:35

Merge branch 'master' into cgroup_support

cf271e1

Cgroup unlimited memory check and documentation

7827295

Small fixes

51e7aba

eyalkoren approved these changes Aug 4, 2020

View reviewed changes

eyalkoren requested a review from SylvainJuge August 4, 2020 05:01

felixbarny added this to the 7.10 milestone Aug 18, 2020

eyalkoren mentioned this pull request Aug 18, 2020

Adding cgroup metrics collection spec elastic/apm#292

Merged

minor style changes

1a4e2db

SylvainJuge approved these changes Aug 18, 2020

View reviewed changes

SylvainJuge merged commit c66313f into elastic:master Aug 18, 2020

		private final OperatingSystemMXBean operatingSystemBean;

		final private List<WildcardMatcher> inactiveMemoryRelevantLines = Arrays.asList(caseSensitiveMatcher("inactive_file *"));
