Add cgroup support (#1197) #1211
💚 CLA has been signed |
@tanquetav Thanks for the PR!! |
@tanquetav Thanks for this extremely useful PR! ❤️
A few suggestions:
1. Let's not assume the cgroup filesystem path is the default (`/sys/fs/cgroup`). I found that the memory cgroup subdir path can be quite easily parsed from the contents of `/proc/self/mounts` - lines are space-delimited, where the first part is `memory`, the second part would be the memory cgroup path (tested on latest Docker images of ubuntu, opensuse/leap and centos). If we fail to find the path this way, let's then fall back to trying the default. Of course this should be done during initialization.
2. What use cases does the cgroup2 route cover? If we apply the suggestion in (1), would that cover it?
3. If you move the check whether limit is set (and whether we want to use the cgroup info) to the constructor, you can use a final boolean field, thus avoiding the `AtomicBoolean` accesses. In general, please try to move any initialization logic from the `bindTo` to construction time.
4. `system.process.memory.rss.bytes` is a useful metric, but if we add it, let's add it all across, meaning at least with JMX as well. Also, if we do that, we had better document it in the metrics documentation.
5. What do you think about adopting the approach suggested in "node level memory state from root cgroup is different from /proc/meminfo" (google/cadvisor#2042) of calculating real used bytes based on `memory.usage_in_bytes - total_inactive_file` (the latter coming from `stat`)?

An additional thought - google/cadvisor#2042 reports that the real used values are different when obtained through these different approaches. I witnessed the same. What do you think about calculating the real used based on cgroup even if |
For items 1-3, I moved the cgroup check/verification to the constructor. The code is much cleaner. It uses /proc/self/cgroup to verify the cgroup2 folder, checking whether it can be read and whether it is unlimited, and then tries to fall back to cgroup1. For items 4-5, I think it is better not to do them now, maybe in another PR, because I am using cgroup and more processes can be allocated to the same cgroup. Your suggestion to use JMX is better. |
@tanquetav Thanks for the changes.
I think you misunderstood what I meant to address. I am trying to find the actual cgroup filesystem path before trying the default path (`/sys/fs/cgroup`). The way I suggested doing that is by parsing the `/proc/self/mounts` file. The line that starts with `memory` should contain the path to the memory cgroup subdirectory as the second item. Let me know if you see a case where this does not apply.
The handling of the `/proc/self/cgroup` file contents is not going to work where I tested it. The path retrieved from it is not a valid file or directory path in those cases.
In addition, when is the `0:` subsystem being assigned? The outputs I observed did not assign to it. In which cases is the `cgroup1` approach valid and in which is it the `cgroup2` one?
apm-agent-core/src/main/java/co/elastic/apm/agent/metrics/builtin/SystemMetrics.java
I am using Fedora 32, and at boot I can select whether cgroup1 or cgroup2 is used. If I pass systemd.unified_cgroup_hierarchy=0 as a kernel option, cgroup1 is used; if I omit this kernel boot option, cgroup2 is used. About /proc/self/cgroup: sometimes memory does not appear in it, just the 0: entry. So I check both lines (memory and 0:).
About /sys/fs/cgroup, I think there is a misunderstanding. On cgroup1, /sys/fs/cgroup has the memory files, and reading them gives the cgroup values. On cgroup2 it is a little different. The memory limit files are available inside subdirectories named after the slice:
I solved the other issues. |
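A minimal sketch of the "check both lines (memory and 0:)" idea described above, not the PR's actual code. It relies only on the documented `/proc/self/cgroup` line format `hierarchy-ID:controller-list:cgroup-path`; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
final class ProcSelfCgroupSketch {

    /**
     * Returns the cgroup path of the cgroup1 "memory" controller if a line lists it,
     * otherwise the path of the cgroup2 "0::" entry, otherwise null.
     */
    static String findMemoryCgroupPath(List<String> lines) {
        String unifiedPath = null;
        for (String line : lines) {
            // Each line has the form hierarchy-ID:controller-list:cgroup-path
            String[] parts = line.split(":", 3);
            if (parts.length < 3) {
                continue; // not a well-formed entry
            }
            // cgroup1: the controller list is comma separated, e.g. "cpu,cpuacct" or "memory"
            if (Arrays.asList(parts[1].split(",")).contains("memory")) {
                return parts[2];
            }
            // cgroup2: hierarchy ID 0 with an empty controller list
            if ("0".equals(parts[0]) && parts[1].isEmpty()) {
                unifiedPath = parts[2];
            }
        }
        return unifiedPath;
    }
}
```

The returned value is a path relative to the cgroup mount point, which is why it is not a valid filesystem path on its own and still has to be combined with the discovered mount point.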
@tanquetav Thanks for your input. What I am trying to suggest is not assume that the location of cgroup filesystem is under In any case, we will have to delay this PR a bit, as we may want to introduce new metricsets especially for containers and keep the current |
@eyalkoren I got your concerns about /sys/fs/cgroup. RHEL 6 seems to mount cgroup on /cgroup instead. I added logic to figure out the mount point using /proc/self/mountinfo, with a fallback to /sys/fs/cgroup. It searches for 2 patterns:
for cgroup1 and
for cgroup2. I only tested on Fedora; can you check on SUSE, Ubuntu and others? I started a similar item on the Python agent. I will apply your suggestions to this patch and try to do the same on the .NET agent. |
@tanquetav This is awesome, thanks a lot! I will try out, but it will take a bit, so just be informed. Thanks for looking into
Instead, split each line using spaces, and if the first part is |
Reading https://man7.org/linux/man-pages//man5/procfs.5.html , there are some concerns that mounts is missing some information, while mountinfo is more complete and respects the namespaces that are related to cgroups. It has been available since kernel 2.6. About the parser: it will not solve the problem (the memory entry), because on cgroup 2 that entry is not available, just the cgroup mount itself. But if you think it is better to use mounts, I can change the code without any problem. |
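To make the mountinfo discussion concrete, here is a hedged sketch of locating a cgroup mount point from one `/proc/self/mountinfo` line. It follows the field layout documented in proc(5): mount ID, parent ID, major:minor, root, mount point, mount options, optional fields, a `-` separator, filesystem type, mount source, super options. The class and method names are hypothetical, and the PR itself used regular expressions rather than field splitting.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class MountInfoSketch {

    /**
     * Returns the mount point (5th field) if the line describes a mount of the given
     * filesystem type and, when requiredSuperOption is non-null, its super options
     * contain that option. Returns null otherwise.
     */
    static String mountPoint(String line, String fsType, String requiredSuperOption) {
        String[] fields = line.split(" ");
        // Optional fields start at index 6 and end at the "-" separator.
        int separator = -1;
        for (int i = 6; i < fields.length; i++) {
            if ("-".equals(fields[i])) {
                separator = i;
                break;
            }
        }
        // After the separator we need: fs type, mount source, super options.
        if (separator < 0 || separator + 3 >= fields.length) {
            return null;
        }
        if (!fsType.equals(fields[separator + 1])) {
            return null;
        }
        if (requiredSuperOption != null
                && !Arrays.asList(fields[separator + 3].split(",")).contains(requiredSuperOption)) {
            return null;
        }
        return fields[4];
    }
}
```

With this shape, a cgroup1 memory mount would be looked up as `mountPoint(line, "cgroup", "memory")` and a cgroup2 mount as `mountPoint(line, "cgroup2", null)`, falling back to `/sys/fs/cgroup` when no line matches.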
@tanquetav Sorry for the delay in response. |
I have filled in the CLA several times, and I have just filled it in again. Can you check if it is OK, or send me a tutorial on how to handle this? Thank you. |
Ohh, sorry about that. Did you use the same email as you use in your GitHub account? |
yes |
You have signed the CLA with your Gmail address. The commits are signed with your softplan address. Make sure that the latter is also added to your GitHub profile. It doesn't have to be the primary one. |
@tanquetav sorry for the long wait, we had to outline the new set of metrics - see elastic/apm#291 for details. Following steps:
Let us know if this is too much and you want us to take over. |
@tanquetav would you like us to take over this one, or are you planning to continue with it? |
Yes, but unfortunately I can only fix this on the weekend. I am having a busy week. |
No rush, take the time you need and let us know if you need assistance. |
Hello @eyalkoren, I worked on the PR this weekend, following your suggestions:
What I did was use this regex to force a match on cgroup2:
Then I concatenate it with the slice:
|
@tanquetav Thanks for applying the changes!!
I will make a pull request tomorrow proposing some changes, would be easier to communicate through code directly 🙂
private final OperatingSystemMXBean operatingSystemBean;

final private List<WildcardMatcher> inactiveMemoryRelevantLines = Arrays.asList(caseSensitiveMatcher("inactive_file *"));
No need for a list if we need a single matcher, but in any case, I suggest not using a matcher at all but an exact match - see comment where this is used
}

public CgroupFiles verifyCgroupEnabled(File procSelfCgroup, File mountInfo) {
    if (procSelfCgroup.canRead() && mountInfo.canRead()) {
`mountInfo` is not a must - we have a default way to find the memory cgroup path.
        if (cgroupFilesTest != null) return cgroupFilesTest;

    }
} catch (Exception e) {
Log an error
try (BufferedReader fileMountInfoReader = new BufferedReader(new FileReader(mountInfo))) {
    for (String mountLine = fileMountInfoReader.readLine(); mountLine != null && !mountLine.isEmpty(); mountLine = fileMountInfoReader.readLine()) {
        String foundRegex = applyCgroup2Regex(mountLine);
        if (foundRegex != null) {
            cgroupFilesTest = verifyCgroup2Available(lineCgroup, new File(foundRegex));
            if (cgroupFilesTest != null) return cgroupFilesTest;
        }
        foundRegex = applyCgroup1Regex(mountLine);
        if (foundRegex != null) {
            cgroupFilesTest = verifyCgroup1Available(new File(foundRegex));
            if (cgroupFilesTest != null) return cgroupFilesTest;
        }
    }
}
Only execute this if the `mountInfo` file exists and is readable by us. Also add a `catch` clause for any related exception of this `try` block, because we still have a default to fall back to if any of this fails.
if (procSelfCgroup.canRead() && mountInfo.canRead()) {
    try (BufferedReader fileReader = new BufferedReader(new FileReader(procSelfCgroup))) {
        String lineCgroup = null;
        for (String cgroupLine = fileReader.readLine(); cgroupLine != null && !cgroupLine.isEmpty(); cgroupLine = fileReader.readLine()) {
Why is an empty line used as a stop condition for the parsing? Isn't it enough to stop when `readLine()` produces `null`?
In fact, isn't it enough to do:
String cgroupLine = fileReader.readLine();
while (cgroupLine != null) {
    ...
    cgroupLine = fileReader.readLine();
}
I used the same logic that was previously used for metricRegistry.addUnlessNan("system.memory.total", ....
Since it was previously done this way, I believe it is safe to parse these system files the same way.
String applyCgroup1Regex(String mountLine) {
    Matcher matcher = CGROUP1_MOUNT_POINT.matcher(mountLine);
    if (matcher.matches()) {
        return matcher.group(1);
    }
    return null;
}
String applyCgroup2Regex(String mountLine) {
    Matcher matcher = CGROUP2_MOUNT_POINT.matcher(mountLine);
    if (matcher.matches()) {
        return matcher.group(1);
    }
    return null;
}
Same code - should be extracted into one method
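One possible shape for that extraction, as a sketch: a single method that takes the pattern as a parameter. The two regular expressions below are placeholders written for this example against the mountinfo field layout; the actual `CGROUP1_MOUNT_POINT` and `CGROUP2_MOUNT_POINT` patterns from the PR are not shown in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class, for illustration only.
final class MountRegexSketch {

    // Placeholder patterns (assumptions): group 1 captures the mount point field.
    static final Pattern CGROUP1_MOUNT_POINT =
        Pattern.compile("^\\S+ \\S+ \\S+ \\S+ (\\S+) .* - cgroup \\S+ .*memory.*$");
    static final Pattern CGROUP2_MOUNT_POINT =
        Pattern.compile("^\\S+ \\S+ \\S+ \\S+ (\\S+) .* - cgroup2 .*$");

    /** Replaces the two near-identical methods: returns group 1 on a full match, else null. */
    static String applyCgroupRegex(Pattern pattern, String mountLine) {
        Matcher matcher = pattern.matcher(mountLine);
        return matcher.matches() ? matcher.group(1) : null;
    }
}
```

Call sites would then become `applyCgroupRegex(CGROUP2_MOUNT_POINT, mountLine)` and `applyCgroupRegex(CGROUP1_MOUNT_POINT, mountLine)`.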
private CgroupFiles verifyCgroup2Available(String lineCgroup, File mountDiscovered) throws IOException {
    final String[] cgroupSplit = StringUtils.split(lineCgroup, ':');
    // Checking cgroup2
    File maxMemory = new File(mountDiscovered, cgroupSplit[cgroupSplit.length - 1] + "/" + CGROUP2_MAX_MEMORY);
    if (maxMemory.canRead()) {
        try (BufferedReader fileReaderMem = new BufferedReader(new FileReader(maxMemory))) {
            String memMaxLine = fileReaderMem.readLine();
            if (!"max".equalsIgnoreCase(memMaxLine)) {
                return new CgroupFiles(maxMemory,
                    new File(mountDiscovered, cgroupSplit[cgroupSplit.length - 1] + "/" + CGROUP2_USED_MEMORY),
                    new File(mountDiscovered, cgroupSplit[cgroupSplit.length - 1] + "/" + CGROUP2_STAT_MEMORY));
            }
        }
    }
    return null;
}

private CgroupFiles verifyCgroup1Available(File mountDiscovered) throws IOException {
    // Checking cgroup1
    File maxMemory = new File(mountDiscovered, CGROUP1_MAX_MEMORY);
    if (maxMemory.canRead()) {
        try (BufferedReader fileReaderMem = new BufferedReader(new FileReader(maxMemory))) {
            String memMaxLine = fileReaderMem.readLine();
            long memMax = Long.parseLong(memMaxLine);
            if (memMax < UNLIMITED) { // Cgroup1 uses a constant to disable limits
                return new CgroupFiles(maxMemory,
                    new File(mountDiscovered, CGROUP1_USED_MEMORY),
                    new File(mountDiscovered, CGROUP1_STAT_MEMORY));
            }
        }
    }
    return null;
}
Very similar. Please extract common code
try (BufferedReader fileReaderStatFile = new BufferedReader(new FileReader(cgroupFiles.getStatMemory()))) {
    long sum = 0;
    for (String statLine = fileReaderStatFile.readLine(); statLine != null && !statLine.isEmpty(); statLine = fileReaderStatFile.readLine()) {
        if (WildcardMatcher.isAnyMatch(inactiveMemoryRelevantLines, statLine)) {
Let's not use a wildcard matcher here; search for the exact line instead. Split each line and look at the first part for both `inactive_file` and `total_inactive_file`, preferring `total_inactive_file` if it exists.
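The suggestion above could be sketched roughly as follows. This is an illustration under assumptions, not the merged implementation: it takes the already-read lines of the memory stat file, matches the key exactly, prefers `total_inactive_file`, and then applies the "usage minus inactive file" calculation discussed earlier in the thread. All names are hypothetical.

```java
import java.util.List;

// Hypothetical helper, for illustration only.
final class InactiveFileSketch {

    /** Returns the inactive file bytes, preferring total_inactive_file, or -1 if neither key is present. */
    static long inactiveFileBytes(List<String> statLines) {
        long inactive = -1;
        long totalInactive = -1;
        for (String line : statLines) {
            // Each stat line is "<key> <value>"; compare the key exactly, no wildcards.
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                continue;
            }
            if ("total_inactive_file".equals(parts[0])) {
                totalInactive = Long.parseLong(parts[1]);
            } else if ("inactive_file".equals(parts[0])) {
                inactive = Long.parseLong(parts[1]);
            }
        }
        return totalInactive >= 0 ? totalInactive : inactive;
    }

    /** Real used bytes = usage minus inactive file bytes; falls back to usage if the key is missing. */
    static long realUsedBytes(long usageBytes, List<String> statLines) {
        long inactiveFile = inactiveFileBytes(statLines);
        return inactiveFile < 0 ? usageBytes : usageBytes - inactiveFile;
    }
}
```

Exact key comparison avoids accidentally matching other keys that merely contain `inactive_file` as a substring. The sketch lets `NumberFormatException` propagate on a malformed value; a real implementation would probably want to handle that.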
try (BufferedReader fileReaderMem = new BufferedReader(new FileReader(maxMemory))) {
    String memMaxLine = fileReaderMem.readLine();
    long memMax = Long.parseLong(memMaxLine);
    if (memMax < UNLIMITED) { // Cgroup1 uses a constant to disable limits
No need to look at unlimited
I read the point about tracking cgroup items even when the cgroup is not limited.
My concern is that in normal use cases, such as running directly on a Linux host, these metrics will be collected and will track host-wide values. Memory usage, for example, will be the memory usage of the whole OS, because nothing is limited in the cgroup.
if (maxMemory.canRead()) {
    try (BufferedReader fileReaderMem = new BufferedReader(new FileReader(maxMemory))) {
        String memMaxLine = fileReaderMem.readLine();
        if (!"max".equalsIgnoreCase(memMaxLine)) {
Ohh, so in cgroup2 `max` represents unlimited memory for the cgroup?
This is important! We just decided to transfer the special treatment of the unlimited max to the UI, but we rely on sending the special numeric value representing unlimited. We can't send a string. We could send the unlimited value, but it makes more sense not to send anything if we see `max` and to make sure the UI properly deals with that.
Yes, on cgroup2 it is a string. What about using the UNLIMITED constant from cgroup1?
Yes, that's what I poorly tried to say in my comment - we could do that, but it doesn't make a lot of sense...
Letting the UI be aware of the UNLIMITED numeric value was a way to implement this once, instead of making all APM agents aware of it. Now that I am aware of the literal UNLIMITED value in v2, we must handle it in agents, as the metric value in the intake API and Elasticsearch would be numeric.
I think that making sure agents omit the value when it is `max` and making sure the UI expects that is enough.
d1315e8 to 69572c0
69572c0 to 44d0561
Ok, I implemented all of these. About system.process.cgroup.memory.mem.limit.bytes: it is always collected, except on cgroup2 when it is max. total_inactive_file is preferred (using your implementation). All the logic is now in a new file (CGroupMetric), and I reverted SystemMetrics to its original state. The only thing I mentioned earlier that still seems a bit strange to me is always collecting cgroup data. When running without Docker, using java -jar app.jar, cgroup data is being collected, and the cgroup slice is shared with all applications on my machine. It seems a bit weird to me, but it is working as expected. |
elastic/apm#291 suggests that agents send the |
That is why we need to test that we do what we want to be doing - select
Same - we wrote the logic, we need to test it is working.
That's a very important point that I think I am missing. Wouldn't cgroup filesystems with all the mount info exist only if something is mounting them (like an implementation of Linux containers)? In other words, will these be collected for any Java process running on Linux even if the process is not running within a control group? |
@tanquetav the main disadvantage I see with sending cgroup data only when not UNLIMITED is that we will not send usage metrics as well for containers running without memory limitation. This means that memory usage for containerized apps will be reported based on Also, please list all systems you used to manually test this PR, so we have a basic list of known supported ones. Thanks! |
I increased the test coverage for these cases. About cgroup: it is a hierarchical structure. On modern Linux distributions with it enabled, the whole system is cgroup-aware. What normally happens is that all processes are bound to the root cgroup namespace, sharing it. Some modern Linux distributions create some slices, like a user slice and a system slice, so that user processes do not use the same slice as system daemons. Cgroups can be limited not only by Docker. systemd allows putting a process in a limited environment too. When you create subslices of a slice, you are subdividing that slice. It makes more sense with percentage values like:
With this example, 1/3 of the CPU can be allocated to the user slice. Of this 1/3, a half can be allocated to the proc1 slice (1/6) and 1/4 each to the proc2 and proc3 slices (1/12). Each slice can have several processes, and the user slice sums all its subslices, for example for memory stats. |
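The arithmetic in that explanation is just a product of the fractions along the path from the root to a slice. A small sketch, with hypothetical names, that reproduces the 1/6 and 1/12 figures quoted above:

```java
// Hypothetical helper, for illustration only.
final class SliceShareSketch {

    /**
     * Multiplies per-level fractions, given as {numerator, denominator} pairs ordered
     * from the root downward, and returns the reduced result as {numerator, denominator}.
     */
    static long[] effectiveShare(long[][] fractionsFromRoot) {
        long numerator = 1;
        long denominator = 1;
        for (long[] fraction : fractionsFromRoot) {
            numerator *= fraction[0];
            denominator *= fraction[1];
        }
        long divisor = gcd(numerator, denominator);
        return new long[] {numerator / divisor, denominator / divisor};
    }

    private static long gcd(long a, long b) {
        return b == 0 ? a : gcd(b, a % b);
    }
}
```

For example, `effectiveShare(new long[][] {{1, 3}, {1, 2}})` gives 1/6 for proc1, and `effectiveShare(new long[][] {{1, 3}, {1, 4}})` gives 1/12 for a slice that gets a quarter of the user slice.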
@tanquetav thanks for the great explanation! |
Please merge with latest master so we can run the tests. CI was upgraded to run the build with Java 11, that requires this change. |
Yes, memory.usage will be the sum of all processes in the slice. Using the limit to enable/disable cgroup metrics will help avoid collecting metrics on cgroups that are not split per service. In Docker/k8s environments, memory/CPU limits are generally used. On systemd it is optional, and we cannot be sure whether the process is confined or not. Summing all the memory of a correctly confined slice is not a bad idea, for example when a Python process forks a Celery daemon. It is not usual in Java, but it may be helpful for other technologies. |
@tanquetav Sorry for the delayed response again 🙏
I completely agree with your last comment - our "System memory usage" graph aims to show how close is the "system" (not process) to exhausting its memory resources:
So whenever a process is running within a control group of any kind that has memory limits, the stress of the system should be reflected by the percentage of memory that is potentially available to the process.
In other words - always using cgroup-based limit and usage seems like a valid approach for this purpose.
I think we are very close to conclude this great effort, see my single comment for a small required change.
Also, please build and test manually on the environments you setup (v1 and v2) and let us know where it has been tested.
Thanks again for your amazing contribution!!
private CgroupFiles createCgroup1Files(File memoryMountPath) {
    File maxMemoryFile = new File(memoryMountPath, CGroupMetrics.CGROUP1_MAX_MEMORY);
    if (maxMemoryFile.canRead()) {
        // No need for special treatment for the special "unlimited" value (0x7ffffffffffff000) - omitted by the UI
We revisited this, so we need to restore the unlimited check for v1 as well so, like we did in v2, if the value stored in the `memory.limit_in_bytes` file is the special UNLIMITED value, we shouldn't report it (meaning - use `null` for the `maxMemoryFile`).
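A hedged sketch of the behavior being asked for, covering both versions in one place: treat the cgroup2 string `max` and the cgroup1 numeric value quoted in the code comment above (0x7ffffffffffff000) as "no limit" and report nothing in that case. The class, constant and method names are hypothetical, and the `>=` comparison mirrors the earlier `memMax < UNLIMITED` check rather than anything confirmed as final in this thread.

```java
// Hypothetical helper, for illustration only.
final class CgroupLimitSketch {

    // Value taken from the code comment in the reviewed diff; assumed to mean "unlimited" on cgroup1.
    static final long CGROUP1_UNLIMITED = 0x7ffffffffffff000L;

    /**
     * Parses the first line of a memory limit file.
     * Returns null when there is no usable limit, so the caller can skip the metric.
     */
    static Long parseLimit(String firstLine) {
        if (firstLine == null) {
            return null;
        }
        String value = firstLine.trim();
        // cgroup2 writes the literal string "max" when no limit is set
        if (value.isEmpty() || "max".equalsIgnoreCase(value)) {
            return null;
        }
        try {
            long limit = Long.parseLong(value);
            // cgroup1 writes a very large number when no limit is set
            return limit >= CGROUP1_UNLIMITED ? null : Long.valueOf(limit);
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

Returning `null` rather than a sentinel number keeps the decision in the agent, so nothing non-numeric or misleading reaches the intake API.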
Ohh, and we also need to add to documentation (metrics.asciidoc) and the CHANGELOG (better merge latest master before you do) |
Hi. I made the requested changes regarding unlimited memory and added the documentation. Can you check whether the text is OK (as you know, English is not my first language) and whether I need to fix anything? I always test my code in an environment before pushing. I created a new Elastic APM setup from scratch (using OpenStack and some Ansible scripts I wrote) so that it does not conflict with other data. Then I ran 4 tests on my desktop (Fedora 32) with Docker, using my build of the APM agent.
This work was very nice, I learn a lot about elastic apm and I hope you enjoy this little help. |
@eyalkoren Can you send me a private email to chat? I want to discuss other ideas not related to this issue. |
Gladly. Sent to your GH public email. I'll run the tests and review the final commits (apply changes myself if needed). Thanks!
I certainly did enjoy collaborating with you and learned a lot myself about these OS corners. |
🎉 Incredible job @tanquetav !!
@SylvainJuge Could you please let us know in which version this fix is available? |
It was merged in August 2020, thus it's included in the first release after that time, which is 1.18.0. In the general case, using the latest version available is better (and compatible with previous server versions), thus you should use 1.21.0 (latest release as of today). |
What does this PR do?
closes #1197
Checklist
Author's Checklist
Related issues
Use cases
When a Java program is running in a cgroup-limited environment (Docker with the -m option, k8s with a resource limit), cgroup is used to get memory information (system.memory.total, system.memory.actual.free and system.process.memory.rss).
Screenshots