[CI] AssertionError in OsProbe.readProcSelfCgroup #77833

Closed
ywangd opened this issue Sep 16, 2021 · 7 comments
Labels
:Core/Infra/Core (Core issues without another label)
Team:Core/Infra (Meta label for core/infra team)
>test-failure (Triaged test failures from CI)

Comments

@ywangd
Member

ywangd commented Sep 16, 2021

Multiple tests failed with "process was found dead while waiting for ports files". The underlying issue is shown in the test cluster log file: the node crashed on startup because it didn't read any valid content from the /proc/self/cgroup file. I wonder whether there is a subtle race condition somewhere in the test setup.

Build scan:
https://gradle-enterprise.elastic.co/s/lfrzuigm2obi4
https://gradle-enterprise.elastic.co/s/wruq4uitjwndq

Repro line:
N/A

Reproduces locally?:
Didn't try

Applicable branches:
7.x

Failure history:
N/A

Failure excerpt:

[2021-09-16T01:10:58,208][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [yamlRestTest-0] fatal error in thread [main], exiting
java.lang.AssertionError: null
    at org.elasticsearch.monitor.os.OsProbe.readProcSelfCgroup(OsProbe.java:298) ~[elasticsearch-7.16.0-SNAPSHOT.jar:7.16.0-SNAPSHOT]
    at org.elasticsearch.monitor.os.OsProbe.areCgroupStatsAvailable(OsProbe.java:579) ~[elasticsearch-7.16.0-SNAPSHOT.jar:7.16.0-SNAPSHOT]
    at org.elasticsearch.monitor.os.OsProbe.getCgroup(OsProbe.java:637) ~[elasticsearch-7.16.0-SNAPSHOT.jar:7.16.0-SNAPSHOT]
    at org.elasticsearch.monitor.os.OsProbe.getCgroup(OsProbe.java:857) ~[elasticsearch-7.16.0-SNAPSHOT.jar:7.16.0-SNAPSHOT]
    at org.elasticsearch.monitor.os.OsProbe.osStats(OsProbe.java:864) ~[elasticsearch-7.16.0-SNAPSHOT.jar:7.16.0-SNAPSHOT]
    at org.elasticsearch.monitor.os.OsService.<init>(OsService.java:39) ~[elasticsearch-7.16.0-SNAPSHOT.jar:7.16.0-SNAPSHOT]
    at org.elasticsearch.monitor.MonitorService.<init>(MonitorService.java:33) ~[elasticsearch-7.16.0-SNAPSHOT.jar:7.16.0-SNAPSHOT]
    at org.elasticsearch.node.Node.<init>(Node.java:509) ~[elasticsearch-7.16.0-SNAPSHOT.jar:7.16.0-SNAPSHOT]
    at org.elasticsearch.node.Node.<init>(Node.java:288) ~[elasticsearch-7.16.0-SNAPSHOT.jar:7.16.0-SNAPSHOT]
    at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:219) ~[elasticsearch-7.16.0-SNAPSHOT.jar:7.16.0-SNAPSHOT]
    at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:219) ~[elasticsearch-7.16.0-SNAPSHOT.jar:7.16.0-SNAPSHOT]
    at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:399) ~[elasticsearch-7.16.0-SNAPSHOT.jar:7.16.0-SNAPSHOT]
    at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:167) ~[elasticsearch-7.16.0-SNAPSHOT.jar:7.16.0-SNAPSHOT]
    at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:158) ~[elasticsearch-7.16.0-SNAPSHOT.jar:7.16.0-SNAPSHOT]
    at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:75) ~[elasticsearch-7.16.0-SNAPSHOT.jar:7.16.0-SNAPSHOT]
    at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:114) ~[elasticsearch-cli-7.16.0-SNAPSHOT.jar:7.16.0-SNAPSHOT]
    at org.elasticsearch.cli.Command.main(Command.java:79) ~[elasticsearch-cli-7.16.0-SNAPSHOT.jar:7.16.0-SNAPSHOT]
    at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:123) ~[elasticsearch-7.16.0-SNAPSHOT.jar:7.16.0-SNAPSHOT]
    at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:81) ~[elasticsearch-7.16.0-SNAPSHOT.jar:7.16.0-SNAPSHOT]
[2021-09-16T01:11:29.937733503Z] [BUILD] Stopping node
@ywangd ywangd added :Core/Infra/Core Core issues without another label >test-failure Triaged test failures from CI labels Sep 16, 2021
@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Sep 16, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@ywelsch
Contributor

ywelsch commented Sep 28, 2021

More occurrences of this: https://gradle-enterprise.elastic.co/s/rb3ky43i6j34c

@grcevski grcevski self-assigned this Sep 28, 2021
@valeriy42
Contributor

Another occurrence here: https://gradle-enterprise.elastic.co/s/jtr5tqzsg2eq2

@grcevski
Contributor

grcevski commented Oct 4, 2021

I took a look at this issue and I believe it has to do with Oracle Linux 6. After a fresh VM install I checked /proc/self/cgroup: the file existed on the file system but was empty. After a full shutdown and reboot the file was filled in properly with the basic set of entries, like :memory, :cpu, etc. I don't know if this is an intermittent issue with the kernel shipped in Oracle Linux 6.10 (4.1.12) or a temporary problem on first boot after install, but it seems the /proc/self/cgroup file can be empty on this version. I haven't been able to reproduce this issue with newer kernel versions (I tried Ubuntu 20.04 and the latest RHEL 8.4).
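For readers unfamiliar with the file, each line of /proc/self/cgroup on a cgroup v1 system has the form hierarchy-ID:controller-list:cgroup-path. The following is a minimal sketch of parsing such lines, not the code in OsProbe; the class name CgroupLineParser and its parse method are hypothetical names chosen for illustration. It shows that an empty file simply produces an empty controller map.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Elasticsearch.
public class CgroupLineParser {

    /** Maps each controller name to its cgroup path; empty input yields an empty map. */
    public static Map<String, String> parse(List<String> lines) {
        Map<String, String> controllers = new LinkedHashMap<>();
        for (String line : lines) {
            // Expected shape: "hierarchy-ID:controller-list:cgroup-path".
            String[] fields = line.split(":", 3);
            if (fields.length != 3) {
                continue; // skip malformed lines instead of failing
            }
            // The controller list may name several controllers, e.g. "cpu,cpuacct".
            for (String controller : fields[1].split(",")) {
                if (!controller.isEmpty()) {
                    controllers.put(controller, fields[2]);
                }
            }
        }
        return controllers;
    }
}
```

With this sketch, a caller can tell the empty-file case apart by checking whether the returned map is empty, rather than asserting on the raw line list.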

The 3 reported failures that we've had so far were while running Oracle Linux 6.

We've had the assert that fails for a while, but it seems we introduced a test for it the first time in #77128.

I think the fix would be to remove the assert that's causing this failure and treat an empty cgroup file the same way as the non-existent file case. The second condition of the assert, lines != null, will never be false.
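The proposed handling can be sketched roughly as follows. This is an illustration under assumptions, not the actual change to OsProbe: the class CgroupFileReader and its methods are hypothetical names. It relies on the fact that Files.readAllLines returns a list (possibly empty) or throws, so a null check on its result can never fail.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not the Elasticsearch implementation.
public class CgroupFileReader {

    /** Returns the file's lines, or an empty list when the file does not exist. */
    public static List<String> readLinesOrEmpty(Path cgroupFile) throws IOException {
        try {
            // readAllLines never returns null: it returns a list or throws.
            return Files.readAllLines(cgroupFile);
        } catch (NoSuchFileException e) {
            return Collections.emptyList();
        }
    }

    /** An empty file is treated exactly like a missing one: no cgroup stats. */
    public static boolean statsAvailable(Path cgroupFile) throws IOException {
        return !readLinesOrEmpty(cgroupFile).isEmpty();
    }
}
```

Folding both cases into an empty list means callers need a single check instead of an assertion that can crash the node at startup.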

@rjernst
Member

rjernst commented Oct 4, 2021

We've removed support for OEL-6 in 8.0 (see #51480), so I don't think we should relax the assertion there, but in 7.x I guess it is necessary.

@grcevski
Contributor

grcevski commented Oct 4, 2021

OK makes sense, I didn't realize all of these failed on 7.x only. I can make the fix for 7.x only.

grcevski added a commit that referenced this issue Oct 5, 2021
Older versions of the Linux kernel, e.g. 4.1.12 which
is found in OEL-6, can sometimes have an empty cgroup
file, causing an assertion failure in tests. This change
removes the assert and handles the empty file like a
non-existent file.

Closes #77833
@grcevski
Contributor

grcevski commented Jan 5, 2022

This was fixed in #78659. I'm not sure why auto-close didn't work, so I'll close this for now.

@grcevski grcevski closed this as completed Jan 5, 2022
6 participants