Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cgroup support (#1197) #1211

Merged
merged 14 commits into from
Aug 18, 2020
Merged

Conversation

tanquetav
Copy link
Contributor

@tanquetav tanquetav commented Jun 1, 2020

What does this PR do?

closes #1197

Checklist

  • My code follows the style guidelines of this project
  • I have rebased my changes on top of the latest master branch
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have updated CHANGELOG.asciidoc

Author's Checklist

Related issues

Use cases

When a java program is running on a cgroup limited environment (docker with -m option, k8s with resource limit) cgroup is used to get memory information(system.memory.total, system.memory.actual.free and system.process.memory.rss)

Screenshots

@cla-checker-service
Copy link

cla-checker-service bot commented Jun 1, 2020

💚 CLA has been signed

@eyalkoren
Copy link
Contributor

eyalkoren commented Jun 2, 2020

@tanquetav Thanks for the PR!!
Can you please sign the CLA?

Copy link
Contributor

@eyalkoren eyalkoren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tanquetav Thanks for this extremely useful PR! ❤️

A few suggestions:

  1. Let's not assume the cgroup filesystem path is the default (/sys/fs/cgroup). I found that the memory cgroup subdir path can be quite easily parsed from the contents of /proc/self/mounts - lines are space-delimited, where the first part is memory, the second part would be the memory cgroup path (tested on latest Docker images of ubuntu, opensuse/leap and centos). If we fail to find the path this way, let's then fallback to try the default. Of course this should be done during initialization.
  2. What use cases does the cgroup2 route cover? If we apply the suggestion in (1), would that cover it?
  3. if you move the check whether limit is set (and whether we want to use the cgroup info) to the constructor, you can use a final boolean field, thus avoiding the AtomicBoolean accesses. In general, please try to move any initialization logic from the bindTo to construction time.
  4. system.process.memory.rss.bytes is a useful metric, but if we add it, let's add it all across, meaning at least with JMX as well. Also, if we do that, we better document in the metrics documentation.
  5. What do think on adopting the approach suggested in node level memory state from root cgroup is different from /proc/meminfo google/cadvisor#2042 of calculating real used bytes based on memory.usage_in_bytes - total_inactive_file (the latter coming from stat)?

@eyalkoren
Copy link
Contributor

An additional thought - google/cadvisor#2042 reports that the real used values are different when obtained through these different approaches. I witnessed the same.

What do you think about calculating the real used based on cgroup even if memory.limit_in_bytes is invalid (unlimited), relying on the host limit coming from /proc/meminfo -> MemTotal to determine the total instead?

@apmmachine
Copy link
Contributor

apmmachine commented Jun 2, 2020

💚 Build Succeeded

Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: [Pull request #1211 updated]

  • Start Time: 2020-08-18T14:54:15.423+0000

  • Duration: 48 min 46 sec

Test stats 🧪

Test Results
Failed 0
Passed 1443
Skipped 11
Total 1454

@tanquetav
Copy link
Contributor Author

For 1 - 3 itens, I redo the check/verify of cgroup to contructor. It is much more cleaner code. It use /proc/self/cgroup to verify the cgroup2 folder, if it can be read or if it is unlimited. Try to fallback to cgroup1 .

4-5 itens, I think is better to not use it now, maybe in other PR, because I am using cgroup and more process can be allocated to same cgroup. You suggestion to use jmx is better.

Copy link
Contributor

@eyalkoren eyalkoren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tanquetav Thanks for the changes.

I think you misunderstood what I meant to address. I am trying to find the actual cgroup filesystem path, before trying the default path (/sys/fs/cgroup). The way I suggested of doing that is by parsing the /proc/self/mounts file. The line that starts with memory should contain the path to the memory cgroup subdirectory as the second item. Let me know if you see a case where this does not apply.

The handling of the /proc/self/cgroup file contents is not going to work where I tested it. The path retrieved from it is not a valid file or directory path in those cases.
In addition, when is the 0: subsystem is being assigned? The outputs I observed did not assign to it. In which cases does cgroup1 approach is valid and in which it is cgroup2?

@tanquetav
Copy link
Contributor Author

I am using fedora 32, and on boot process, I can select if I am using cgroup1 or cgroup2. If I pass systemd.unified_cgroup_hierarchy=0 as a kernel option cgroup1 is used and if I do not put this on kernel boot option cgroup2 is used.

About /proc/self/cgroup, sometimes memory do not appear on it, just the 0: entry. Then I check both lines (memory and 0:) .

root@ml022-476:/# cat /proc/self/cgroup 
1:name=systemd:/
0::/system.slice/docker-1487ca74e6c3206b7d5a16e8a7a3064ef5d817d14b7d1c4794b61d357d7da2a2.scope

About /sys/fs/cgroup, I think that is a misundertood. On cgroup1, /sys/fs/cgroup has the memory files, and when they are read , it gives the cgroup values.

On cgroup2 is a little differente. The memory limits files are available inside a subdirectories with the slice name:

[root@ml022-476 ~]# tree /sys/fs/cgroup/system.slice/docker-1487ca74e6c3206b7d5a16e8a7a3064ef5d817d14b7d1c4794b61d357d7da2a2.scope/
/sys/fs/cgroup/system.slice/docker-1487ca74e6c3206b7d5a16e8a7a3064ef5d817d14b7d1c4794b61d357d7da2a2.scope/
├── cgroup.controllers
├── cgroup.events
├── cgroup.freeze
├── cgroup.max.depth
├── cgroup.max.descendants
├── cgroup.procs
├── cgroup.stat
├── cgroup.subtree_control
├── cgroup.threads
├── cgroup.type
├── cpu.max
├── cpu.pressure
├── cpuset.cpus
├── cpuset.cpus.effective
├── cpuset.cpus.partition
├── cpuset.mems
├── cpuset.mems.effective
├── cpu.stat
├── cpu.weight
├── cpu.weight.nice
├── hugetlb.1GB.current
├── hugetlb.1GB.events
├── hugetlb.1GB.events.local
├── hugetlb.1GB.max
├── hugetlb.2MB.current
├── hugetlb.2MB.events
├── hugetlb.2MB.events.local
├── hugetlb.2MB.max
├── io.bfq.weight
├── io.latency
├── io.max
├── io.pressure
├── io.stat
├── io.weight
├── memory.current
├── memory.events
├── memory.events.local
├── memory.high
├── memory.low
├── memory.max
├── memory.min
├── memory.oom.group
├── memory.pressure
├── memory.stat
├── memory.swap.current
├── memory.swap.events
├── memory.swap.max
├── pids.current
├── pids.events
└── pids.max

I solved the other issues.

@eyalkoren
Copy link
Contributor

@tanquetav Thanks for your input. What I am trying to suggest is not assume that the location of cgroup filesystem is under /sys/fs/cgroup. IIUC, while it is the default and most widely used, it is not a rigid specification, and a container runtime can choose different path for it.

In any case, we will have to delay this PR a bit, as we may want to introduce new metricsets especially for containers and keep the current system. metricsets for reporting host metrics. This is a cross APM thing (including other agents, APM server and UI), and we are also aligning it with Metricbeat, so it may take a bit.
Once it is finalized, I will inform you and then we can resume and merge this PR.

@tanquetav
Copy link
Contributor Author

@eyalkoren got your concerns about /sys/fs/cgroup. RHEL 6 seams to mount cgroup on /cgroup instead. I make a logic to figure out the mount point using /proc/self/mountinfo, with a fallback to /sys/fs/cgroup.

It searches for 2 patterns:

39 30 0:35 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:10 - cgroup cgroup rw,seclabel,memory

to cgroup1 and

30 23 0:26 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime shared:4 - cgroup2 cgroup2 rw,seclabel

to cgroup2.

I just tested on fedora, can you check on suse , ubuntu and others?

I stated similar item on python agent. I will apply your suggestions to this patch and try to do on dotnet agent.

@eyalkoren
Copy link
Contributor

@tanquetav This is awesome, thanks a lot! I will try out, but it will take a bit, so just be informed.

Thanks for looking into /proc/self/mountinfo. Did you try my suggestion to look in /proc/self/mounts? I found it easier to parse and without the need to look for a specific patterns:

cpuset /sys/fs/cgroup/cpuset cgroup ro,nosuid,nodev,noexec,relatime,cpuset 0 0
cpu /sys/fs/cgroup/cpu cgroup ro,nosuid,nodev,noexec,relatime,cpu 0 0
cpuacct /sys/fs/cgroup/cpuacct cgroup ro,nosuid,nodev,noexec,relatime,cpuacct 0 0
blkio /sys/fs/cgroup/blkio cgroup ro,nosuid,nodev,noexec,relatime,blkio 0 0
memory /sys/fs/cgroup/memory cgroup ro,nosuid,nodev,noexec,relatime,memory 0 0
devices /sys/fs/cgroup/devices cgroup ro,nosuid,nodev,noexec,relatime,devices 0 0
freezer /sys/fs/cgroup/freezer cgroup ro,nosuid,nodev,noexec,relatime,freezer 0 0
net_cls /sys/fs/cgroup/net_cls cgroup ro,nosuid,nodev,noexec,relatime,net_cls 0 0

Instead, split each line using spaces, and if the first part is memory, the second would be the path.
Looks valid with fedora:32 as well. Do you think it is not a good choice?

@tanquetav
Copy link
Contributor Author

@tanquetav This is awesome, thanks a lot! I will try out, but it will take a bit, so just be informed.

Thanks for looking into /proc/self/mountinfo. Did you try my suggestion to look in /proc/self/mounts? I found it easier to parse and without the need to look for a specific patterns:

cpuset /sys/fs/cgroup/cpuset cgroup ro,nosuid,nodev,noexec,relatime,cpuset 0 0
cpu /sys/fs/cgroup/cpu cgroup ro,nosuid,nodev,noexec,relatime,cpu 0 0
cpuacct /sys/fs/cgroup/cpuacct cgroup ro,nosuid,nodev,noexec,relatime,cpuacct 0 0
blkio /sys/fs/cgroup/blkio cgroup ro,nosuid,nodev,noexec,relatime,blkio 0 0
memory /sys/fs/cgroup/memory cgroup ro,nosuid,nodev,noexec,relatime,memory 0 0
devices /sys/fs/cgroup/devices cgroup ro,nosuid,nodev,noexec,relatime,devices 0 0
freezer /sys/fs/cgroup/freezer cgroup ro,nosuid,nodev,noexec,relatime,freezer 0 0
net_cls /sys/fs/cgroup/net_cls cgroup ro,nosuid,nodev,noexec,relatime,net_cls 0 0

Instead, split each line using spaces, and if the first part is memory, the second would be the path.
Looks valid with fedora:32 as well. Do you think it is not a good choice?

Reading https://man7.org/linux/man-pages//man5/procfs.5.html , it have some concerns about mounts has some missing information, and mountinfo is more complete, respecting namespaces that are related with cgroups. It was available since kernel 2.6. About the parser, it will not solve the problem (the memory) because on cgroup 2 it is not available, just the cgroup mount itself.

But if you think is better to use mounts, I can change the code without any problem.

@eyalkoren
Copy link
Contributor

@tanquetav Sorry for the delay in response.
Let's stick with the mountinfo. Thanks for all your efforts with that!!
We still need to finalize what would be the exact metricset keys we want to use, and then I will do a full review.
In the mean time, please sign the CLA, otherwise we won't be able to merge.

@tanquetav
Copy link
Contributor Author

@tanquetav Sorry for the delay in response.
Let's stick with the mountinfo. Thanks for all your efforts with that!!
We still need to finalize what would be the exact metricset keys we want to use, and then I will do a full review.
In the mean time, please sign the CLA, otherwise we won't be able to merge.

I filled the CLA several times. I fill it again now. Can you check if it is ok, or you can send me a tutorial to how to handle this?

Thank you.

@eyalkoren
Copy link
Contributor

I filled the CLA several times.

Ohh, sorry about that. Did you use the same email as you use in your GitHub account?

@tanquetav
Copy link
Contributor Author

I filled the CLA several times.

Ohh, sorry about that. Did you use the same email as you use in your GitHub account?

yes

@felixbarny
Copy link
Member

You have signed the CLA with your Gmail address. The commits are signed with your softplan address. Make sure that the latter is also added to your GitHub profile. It doesn't have to be the primary one.

@eyalkoren
Copy link
Contributor

@tanquetav sorry for the long wait, we had to outline the new set of metrics - see elastic/apm#291 for details.

Following steps:

  1. We need to adjust this PR to report the new metricsets based on the specifications in the issue lined above. Basically, you did most of the work, but there are some adjustments required like changing the metric keys, adding the inactive_file metric from the memory.stat file and sending the right value when we see the unlimited value.
  2. Please sign the CLA with your softplan email address
  3. CGROUP1_MOUNT_POINT doesn't work for me but ^\\d+? \\d+? .+? .+? (.*?) .*cgroup.*memory.* does. Do you see a problem changing to that?
  4. CGROUP2_MOUNT_POINT doesn't seem memory specific. Is this discovering the memory mount path? What are you using to test cgroup-v2?
  5. Please extract each parsing login into a separate package-private method so we can unit test those whenever we find more patterns.

Let us know if this is too much and you want us to take over.

@eyalkoren
Copy link
Contributor

@tanquetav would you like us to take over this one, or are you planning to continue with it?
You contribution has already provided a lot of value, regardless of what you decide, just let us know.

@tanquetav
Copy link
Contributor Author

tanquetav commented Jul 7, 2020 via email

@eyalkoren
Copy link
Contributor

No rush, take the time you need and let us know if you need assistance.
Thanks again for the great contribution!

@tanquetav
Copy link
Contributor Author

Hello @eyalkoren

I worked on PR this weekend, following your sugestion:

  1. I changed the names of variable as suggested, and ignore this if unlimited is set (your last comment on Reporting and showing cgroup-based metrics apm#291 )

  2. Its done

  3. I changed as you suggested. My afraid was to messup with cgroup2 regex. They are very similar. But cgroup2 test is done before cgroup1. It works ok on my tests

  4. Yes, we do not have a memory specific in this item, take a look inside container:

root@ml022-476:/data# cat /proc/self/mountinfo |grep cgroup
1468 1467 0:27 / /sys/fs/cgroup ro,nosuid,nodev,noexec,relatime - cgroup2 cgroup rw,seclabel

What I did is use this regex, to force grab cgroup2:

^\d+? \d+? .+? .+? (.*?) .*cgroup2.*cgroup.*

Then concatenate with slice:

root@ml022-476:/data# cat /proc/self/cgroup
1:name=systemd:/
0::/system.slice/docker-5af29c808916bd2f96ee1902ae140968e2f969d241612a67453c59e11ff9cc0c.scope

root@ml022-476:/data# ls -l /sys/fs/cgroup/system.slice/docker-5af29c808916bd2f96ee1902ae140968e2f969d241612a67453c59e11ff9cc0c.scope/
total 0
-r--r--r--. 1 root root 0 Jul 11 15:15 cgroup.controllers
-r--r--r--. 1 root root 0 Jul 11 15:15 cgroup.events
...
-rw-r--r--. 1 root root 0 Jul 11 15:15 memory.max
-rw-r--r--. 1 root root 0 Jul 11 15:15 memory.min
-rw-r--r--. 1 root root 0 Jul 11 15:15 memory.oom.group
-rw-r--r--. 1 root root 0 Jul 11 15:22 memory.pressure
-r--r--r--. 1 root root 0 Jul 11 15:19 memory.stat
...
  1. Yes, extracted and create a paramerized test for that. Let's populate it with more samples.

Copy link
Contributor

@eyalkoren eyalkoren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tanquetav Thanks for applying the changes!!
I will make a pull request tomorrow proposing some changes, would be easier to communicate through code directly 🙂

private final OperatingSystemMXBean operatingSystemBean;

final private List<WildcardMatcher> inactiveMemoryRelevantLines = Arrays.asList(caseSensitiveMatcher("inactive_file *"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for a list if we need a single matcher, but in any case, I suggest not using a matcher at all but an exact match - see comment where this is used

}

public CgroupFiles verifyCgroupEnabled(File procSelfCgroup, File mountInfo) {
if (procSelfCgroup.canRead() && mountInfo.canRead()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mountInfo is not a must - we have a default way to find the memory cgroup path.

if (cgroupFilesTest != null) return cgroupFilesTest;

}
} catch (Exception e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Log an error

Comment on lines 133 to 146
try(BufferedReader fileMountInfoReader = new BufferedReader(new FileReader(mountInfo))) {
for (String mountLine = fileMountInfoReader.readLine(); mountLine != null && !mountLine.isEmpty(); mountLine = fileMountInfoReader.readLine()) {
String foundRegex = applyCgroup2Regex(mountLine);
if (foundRegex != null) {
cgroupFilesTest = verifyCgroup2Available(lineCgroup, new File(foundRegex));
if (cgroupFilesTest != null) return cgroupFilesTest;
}
foundRegex = applyCgroup1Regex(mountLine);
if (foundRegex != null) {
cgroupFilesTest = verifyCgroup1Available(new File(foundRegex));
if (cgroupFilesTest != null) return cgroupFilesTest;
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only execute if the mountInfo file exists and available for us. Also add a catch clause for any related exception of this try block, because we still have a default to fall back to if any of this fails

if (procSelfCgroup.canRead() && mountInfo.canRead()) {
try(BufferedReader fileReader = new BufferedReader(new FileReader(procSelfCgroup))) {
String lineCgroup = null;
for (String cgroupLine = fileReader.readLine(); cgroupLine != null && !cgroupLine.isEmpty(); cgroupLine = fileReader.readLine()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is an empty line used as a stop condition for the parsing? Isn't it enough to stop when readLine() produces null?

In fact, isn't it enough to do:

String cgroupLine = fileReader.readLine()
while (cgroupLine != null) {
  ...
  cgroupLine = fileReader.readLine()
}

Copy link
Contributor Author

@tanquetav tanquetav Jul 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used the same logic that was used previously on metricRegistry.addUnlessNan("system.memory.total", ....
If it was previuosly done this way I believe it safe parser this system files in this way

Comment on lines 159 to 172
String applyCgroup1Regex(String mountLine) {
Matcher matcher = CGROUP1_MOUNT_POINT.matcher(mountLine);
if (matcher.matches()) {
return matcher.group(1);
}
return null;
}
String applyCgroup2Regex(String mountLine) {
Matcher matcher = CGROUP2_MOUNT_POINT.matcher(mountLine);
if (matcher.matches()) {
return matcher.group(1);
}
return null;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same code - should be extracted into one method

Comment on lines 173 to 205
private CgroupFiles verifyCgroup2Available(String lineCgroup, File mountDiscovered) throws IOException {
final String[] cgroupSplit = StringUtils.split(lineCgroup, ':');
// Checking cgroup2
File maxMemory = new File(mountDiscovered, cgroupSplit[cgroupSplit.length - 1] + "/" + CGROUP2_MAX_MEMORY);
if (maxMemory.canRead()) {
try(BufferedReader fileReaderMem = new BufferedReader(new FileReader(maxMemory))) {
String memMaxLine = fileReaderMem.readLine();
if (!"max".equalsIgnoreCase(memMaxLine)) {
return new CgroupFiles(maxMemory,
new File(mountDiscovered, cgroupSplit[cgroupSplit.length - 1] + "/" + CGROUP2_USED_MEMORY),
new File(mountDiscovered, cgroupSplit[cgroupSplit.length - 1] + "/" + CGROUP2_STAT_MEMORY));
}
}
}
return null;
}

private CgroupFiles verifyCgroup1Available(File mountDiscovered) throws IOException {
// Checking cgroup1
File maxMemory = new File(mountDiscovered, CGROUP1_MAX_MEMORY);
if (maxMemory.canRead()) {
try(BufferedReader fileReaderMem = new BufferedReader(new FileReader(maxMemory))) {
String memMaxLine = fileReaderMem.readLine();
long memMax = Long.parseLong(memMaxLine);
if (memMax < UNLIMITED) { // Cgroup1 use a contant to disabled limits
return new CgroupFiles(maxMemory,
new File(mountDiscovered, CGROUP1_USED_MEMORY),
new File(mountDiscovered, CGROUP1_STAT_MEMORY));
}
}
}
return null;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very similar. Please extract common code

try(BufferedReader fileReaderStatFile = new BufferedReader(new FileReader(cgroupFiles.getStatMemory()))) {
long sum = 0;
for (String statLine = fileReaderStatFile.readLine(); statLine != null && !statLine.isEmpty(); statLine = fileReaderStatFile.readLine()) {
if (WildcardMatcher.isAnyMatch(inactiveMemoryRelevantLines, statLine)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not use a wildcard matcher here, search for exact line instead. Split each line and look at the first part for both inactive_file and total_inactive_file, preferring total_inactive_file if exists.

try(BufferedReader fileReaderMem = new BufferedReader(new FileReader(maxMemory))) {
String memMaxLine = fileReaderMem.readLine();
long memMax = Long.parseLong(memMaxLine);
if (memMax < UNLIMITED) { // Cgroup1 use a contant to disabled limits
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to look at unlimited

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read about to track cgroup itens even if it is not limited.
My afraid is that on normal use cases, like running on a linux environment these metrics be collected, and it will track all host values, like memory usage, it will be memory usage of all OS, because nothing is limited on cgroup.

if (maxMemory.canRead()) {
try(BufferedReader fileReaderMem = new BufferedReader(new FileReader(maxMemory))) {
String memMaxLine = fileReaderMem.readLine();
if (!"max".equalsIgnoreCase(memMaxLine)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh, so in cgroup2 max represents unlimited memory for the cgroup?
This is important! We just decided to transfer the special treatment of unlimited max to the UI, but we rely on sending the special numeric representing unlimited. We can't send a string. We can send the unlimited value, but it makes more sense, we better not send anything if we see max and make sure UI properly deals with that.

Copy link
Contributor Author

@tanquetav tanquetav Jul 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, on cgroup2 it is a string. what about to use the UNLIMITED constant from cgroup1 ?

Copy link
Contributor

@eyalkoren eyalkoren Jul 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's what I poorly tried to say in my comment - we could do that, but it doesn't make a lot of sense...
Letting the UI even be aware of the UNLIMITED numeric value was a way to implement once, instead of making all APM agents aware of that. Now that I am aware of the literal UNLIMITED value in v2, we must handle in agents, as the metric value in the intake API and Elasticsearch would be numeric.
I think that making sure agents omit the value when it is max and making sure UI expects that is enough.

@tanquetav
Copy link
Contributor Author

@tanquetav Please take a look at my PR.
What I really wanted to eventually do is extract all cgroup metrics code from SystemMetrics to a new CgroupMetrics class (and do the same with a new CgroupMetricsTest class) but I would then unrightfully get the credit for all the great work you did. So please, once you approve the PR does what it should, please merge to your branch and then do that.
I addition, we need to test any logic we have code for. A few examples:

  • verify that we do not collect system.process.cgroup.memory.mem.limit.bytes when using cgroup v2 and the limit value is max
  • if we allow both :memory and 0: in the cgroup file, we need to verify we get what we expect when either exists and when both exist and present in a different order within the file.
  • if we allow both inactive_file and total_inactive_file in the memory.stat file - we need to make sure we handle both and give priority to total_inactive_file.

Again, if you feel this requires more than you expect to put into it, let me know and I will take over.
Thanks!

Ok, I will take some more time to split in a new file. I am fixing the issues that you quoted before

Ok, implemented all these. About system.process.cgroup.memory.mem.limit.bytes, it always being collected, except on cgroup2 when it is max
About :memoryand0:` it prefers memory line, if it is not available the 0: is selected

total_inactive_file is preferred (using your implementation)

All the logic is now on a new file (CGroupMetric), and I reverted SystemMetrics to the original state

The only thing I mentioned early that seems a bit strange to me is to collect cgroup data always. When running without docker, using java -jar app.jar, Cgroup data is being collected, and cgroup slice is shared with all applications on my machine. It seems a bit weird to me, but it is working as expected.

@eyalkoren
Copy link
Contributor

One more question : I merge your suggestions, and I found that you are not subtracting inactive_file from used memory.
The suggestion on elastic/apm#291 is subtract

elastic/apm#291 suggests that agents send the inactive_file value as a metric and the UI subtract it.

@eyalkoren
Copy link
Contributor

About :memoryand0:` it prefers memory line, if it is not available the 0: is selected

That is why we need to test that we do what we want to be doing - select :memory lines if they coexist with 0: lines, both if they are written higher in the file or lower.
The general idea is - anything we wrote code for we should test. Not only to see that it works now, but as a way to guarantee we do not introduce regression when updating the code in the future.

total_inactive_file is preferred (using your implementation)

Same - we wrote the logic, we need to test it is working.

The only thing I mentioned early that seems a bit strange to me is to collect cgroup data always. When running without docker, using java -jar app.jar, Cgroup data is being collected, and cgroup slice is shared with all applications on my machine. It seems a bit weird to me, but it is working as expected.

That's a very important point I think I am missing. Wouldn't cgroup filesystems with all mount info exist only if something is mounting them (like and implementation of Linux containers)? In other words, these will be collected for any Java process running on Linux even if the process is not running within a control group?

@eyalkoren
Copy link
Contributor

@tanquetav the main disadvantage I see with sending cgroup data only when not UNLIMITED is that we will not send usage metrics as well for containers running without memory limitation. This means that memory usage for containerized apps will be reported based on meminfo, which I believe is inferior to using cgroup in such cases.
IIUC, the current implementation may send redundant metrics (not needed when not containerized), but those will always be correct, is that indeed the case? If so, we can start with sending them.
Please let me know if you are aware of somewhere this will provide the wrong metrics or if you see a better option.

Also, please list all systems you used to manually test this PR, so we have a basic list of known supported ones. Thanks!

@tanquetav
Copy link
Contributor Author

About :memoryand0:` it prefers memory line, if it is not available the 0: is selected

That is why we need to test that we do what we want to be doing - select :memory lines if they coexist with 0: lines, both if they are written higher in the file or lower.
The general idea is - anything we wrote code for we should test. Not only to see that it works now, but as a way to guarantee we do not introduce regression when updating the code in the future.

total_inactive_file is preferred (using your implementation)

Same - we wrote the logic, we need to test it is working.

The only thing I mentioned early that seems a bit strange to me is to collect cgroup data always. When running without docker, using java -jar app.jar, Cgroup data is being collected, and cgroup slice is shared with all applications on my machine. It seems a bit weird to me, but it is working as expected.

That's a very important point I think I am missing. Wouldn't cgroup filesystems with all mount info exist only if something is mounting them (like and implementation of Linux containers)? In other words, these will be collected for any Java process running on Linux even if the process is not running within a control group?

I increased the test coverage to these cases.

About cgroup, it is a hierarchical structure. Modern linux distributions, with it enabled, all the system is cgroupaware. What normally happen is the process are all bind to root cgroup namespace, sharing it. Some modern linux distributions create some slices, like user slice, system slice, and the user process are not using the same slice that system daemons uses.

CGroup can be limited not only by docker. Systemd allow put the process on limited environment too. When you subslices a previous slice you are subdividing the previous slice. It make more sense on pct values like:

system slice - 400 cpushare
user slice - 200 cpshare      ----- proc1 slice - 200cpushar
                                    proc2 slice - 100cpushare
                                    proc3 slice - 100 cpushare

With this example, 1/3 of cpu can be allocated to user slice. From this 1/3 , a half can be allocateds by proc1 slice (1/6) and 1/4 to proc2 and 3 slice(1/12). Each slice can have several process, and user slice sum all subslices , like memory stats

@eyalkoren
Copy link
Contributor

@tanquetav thanks for the great explanation!
So it seems that basing the limit metric on cgroup would always be the better option as it specifies the actual limit that the process is restricted to. Is that correct?
I am not sure this is the case for the usage metric though - based on the current code of this PR, what would we see in memory.usage? Would that be sum of usages of all processes in the slice?

@eyalkoren
Copy link
Contributor

Please merge with latest master so we can run the tests. CI was upgraded to run the build with Java 11, that requires this change.

@tanquetav
Copy link
Contributor Author

@tanquetav thanks for the great explanation!
So it seems that basing the limit metric on cgroup would always be the better option as it specifies the actual limit that the process is restricted to. Is that correct?
I am not sure this is the case for the usage metric though - based on the current code of this PR, what would we see in memory.usage? Would that be sum of usages of all processes in the slice?

yes, memory.usage will be the sum all process in the slice. Using the limit to enable/disable cgroup metrics will help to avoid collect metrics on cgroup that are not split per service.

On docker/k8s environment in general limitation of memory/cpu is used. On systemd is a optional and we cannot assure if the process is confined or not.

Sum all memory of a correct confined slice is not a bad idea, like we have a python process that fork a celery daemon. It is not usual on java, but on other tecnology may be helpful.

Copy link
Contributor

@eyalkoren eyalkoren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tanquetav Sorry for the delayed response again 🙏

I completely agree with your last comment - our "System memory usage" graph aims to show how close is the "system" (not process) to exhausting its memory resources:
Screen Shot 2020-07-27 at 14 50 03

So whenever a process is running within a control group of any kind that has memory limits, the stress of the system should be reflected by the percentage of memory that is potentially available to the process.
In other words - always using cgroup-based limit and usage seems like a valid approach for this purpose.

I think we are very close to conclude this great effort, see my single comment for a small required change.

Also, please build and test manually on the environments you setup (v1 and v2) and let us know where it has been tested.

Thanks again for your amazing contribution!!

private CgroupFiles createCgroup1Files(File memoryMountPath) {
File maxMemoryFile = new File(memoryMountPath, CGroupMetrics.CGROUP1_MAX_MEMORY);
if (maxMemoryFile.canRead()) {
// No need for special treatment for the special "unlimited" value (0x7ffffffffffff000) - omitted by the UI
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We revisited this, so we need to restore the unlimited check for v1 as well so, like we did in v2, if the value stored in the memory.limit_in_bytes file is the special UNLIMITED value, we shouldn't report it (meaning - use null for the maxMemoryFile).

@eyalkoren
Copy link
Contributor

Ohh, and we also need to add to documentation (metrics.asciidoc) and the CHANGELOG (better merge latest master before you do)

@tanquetav
Copy link
Contributor Author

Ohh, and we also need to add to documentation (metrics.asciidoc) and the CHANGELOG (better merge latest master before you do)

Hi.

I made the changes requested, about unlimited memory and add documentation. Can you check if the text are ok (you know I am foreign) and if I need to fix something

I always test my code on a environment before push. I create a new elastic apm from the ground (using openstack and some ansible scripts I created) to not conflict with other data. Then I run 4 tests on my desktop (Fedora 32), with docker using my build of apm agent.

  1. booting my Fedora in cgroup1 mode (using systemd.unified_cgroup_hierarchy=0)

    • Test 1: run docker , with all elastic configured e the new apm build with the flag -m 500M
      Then I check kibana, in the "discover" if the metric is collected and the three metrics are filled (and system.process.cgroup.memory.mem.limit.bytes is 500m)

    • Test 2: run docker , with all elastic configured e the new apm build without the flag -m 500M
      Then I check kibana, in the "discover" if the metric is collected and the two other metrics are filled (and system.process.cgroup.memory.mem.limit.bytes is not filled)

  2. booting my Fedora in cgroup2 mode (removeing systemd.unified_cgroup_hierarchy=0)

    • Test 3: run docker , with all elastic configured e the new apm build with the flag -m 800M
      Then I check kibana, in the "discover" if the metric is collected and the three metrics are filled (and system.process.cgroup.memory.mem.limit.bytes is 800m)

    • Test 4: run docker , with all elastic configured e the new apm build without the flag -m 800M
      Then I check kibana, in the "discover" if the metric is collected and the two other metrics are filled (and system.process.cgroup.memory.mem.limit.bytes is not filled)

This work was very nice, I learn a lot about elastic apm and I hope you enjoy this little help.

@tanquetav
Copy link
Contributor Author

@eyalkoren Can you send me a private email to chat? I want to discuss other ideas not related to this issue.

@eyalkoren
Copy link
Contributor

@eyalkoren Can you send me a private email to chat? I want to discuss other ideas not related to this issue.

Gladly. Sent to your GH public email.

I'll run the tests and review the final commits (apply changes myself if needed). Thanks!

This work was very nice, I learn a lot about elastic apm and I hope you enjoy this little help

I certainly did enjoy collaborating with you and learned a lot myself about these OS corners.

Copy link
Contributor

@eyalkoren eyalkoren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 Incredible job @tanquetav !!

@SylvainJuge SylvainJuge merged commit c66313f into elastic:master Aug 18, 2020
@subeenn
Copy link

subeenn commented Mar 5, 2021

@SylvainJuge Could you please let us know in which version this fix is available?

@SylvainJuge
Copy link
Member

It was merged in August 2020, thus it's included in the first release after that time, which is 1.18.0.

In the general case, using the latest version available is better (and compatible with previous server versions), thus you should use 1.21.0 (latest release as of today).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

APM Agent on K8s take wrong service memory
6 participants