Node went offline #265
Again, while running 4 ros_comm PR jobs:
Digging into the slave, we're running out of memory: https://gist.github.com/tfoote/a0472cf30a4420d67d11 There's an older OOM kill at ~3036 seconds, likely from the previous instance.
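To spot earlier OOM kills like the one above, the system log can be scanned for oom-killer entries. A minimal sketch; the default log path and the message patterns are assumptions and vary across distros:

```python
import re

# Scan a syslog-style file for OOM-killer activity. The default path and
# the message patterns are assumptions; both vary across distros.
def find_oom_kills(path="/var/log/syslog"):
    """Return log lines recording oom-killer invocations or OOM kills."""
    hits = []
    with open(path, errors="replace") as f:
        for line in f:
            if re.search(r"invoked oom-killer|Out of memory: Kill", line):
                hits.append(line.rstrip())
    return hits

# Example usage:
#   for hit in find_oom_kills():
#       print(hit)
```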
@dirk-thomas can you link the jobs that failed so we can see what they were running at the time of failure?
Looking at these, they're all failing in the same test.
Note I've clipped only the job names. Before the slave goes offline there are several warnings of this type:
From the OOM logs we have a lot of Python processes using memory. Below is the OOM traceback with just the memory usage isolated. Note that the unit of total_vm is a count of 4 KiB pages: https://unix.stackexchange.com/questions/128642/debug-out-of-memory-with-var-log-messages

Looking at the tests I see 384 Python processes, of which a large fraction appear to be test harnesses, each using a lot of RAM. Clearly the Docker instance and the jenkins-slave are also using large chunks.
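Since total_vm is reported in 4 KiB pages rather than bytes, it's easy to misread the traceback. A minimal sketch of the conversion; the sample process names and page counts are hypothetical, not taken from the actual traceback:

```python
# Convert OOM-killer total_vm values (a count of 4 KiB pages) into MiB.
PAGE_SIZE = 4096  # bytes per page

def total_vm_to_mib(pages):
    """Convert a total_vm page count from an OOM traceback to MiB."""
    return pages * PAGE_SIZE / 2**20

# Hypothetical sample values, not taken from the actual traceback:
sample = {"python": 120000, "jenkins-slave": 500000, "docker": 250000}
for name, pages in sample.items():
    print(f"{name}: {total_vm_to_mib(pages):.0f} MiB")
```

For example, a total_vm of 262144 pages is exactly 1 GiB, so page counts in the low hundreds of thousands already represent substantial memory per process.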
Since all the builds referenced in this ticket are no longer available, I don't expect that anyone can follow up on this specific case. Since there is #271 I will close this issue.
This was a manually configured node that had been running for a while:
Excerpt from the Jenkins log around that time:
The full slave log:
The slave has now recovered and is running just fine.
The "Discovering Jenkins master" log above gives no reason for the disconnect.